Product Updates
September 8, 2025
A few weeks ago we announced the beta release of Forge Extract, which lets you pass in JSON schemas and get exactly what you need out of your PDFs. It’s been super rewarding to hear how well it’s being received by everyone!
As we iterate on the feature to get it to general release, we wanted to showcase how you can run extraction through our API directly once you finish validating your schemas within Forge Extract.
We’ll cover:

- Running Structured Extraction through the `marker` API
- Polling for and reading your results
- Tips for handling large, multi-hundred page PDFs
While `marker` lets you convert PDFs into HTML, JSON, or Markdown, Structured Extraction lets you define a schema and pull out only the fields you care about.
You can do this by setting the `page_schema` parameter in your `marker` request, which forces it to fill in your schema after PDF conversion finishes.
The easiest way to generate this correctly is to use our editor in Forge Extract, or to create a Pydantic schema and then convert it to JSON with `.model_dump_json()`. We always recommend trying your schemas in Forge Extract first, since it’s easy to debug issues before running on a larger batch.
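If you go the Pydantic route, here’s a minimal sketch of what that could look like. The model names are hypothetical, and we use Pydantic v2’s `model_json_schema()` to produce the JSON Schema as a dict before serializing it to a string:

```python
import json
from pydantic import BaseModel, Field

class Metrics(BaseModel):
    diluted_eps_2025: float = Field(description="The diluted EPS for 2025")
    diluted_eps_2024: float = Field(description="The diluted EPS for 2024")

class PageSchema(BaseModel):
    metrics: Metrics

# Serialize the generated JSON Schema into a string for the page_schema parameter
PAGE_SCHEMA = json.dumps(PageSchema.model_json_schema())
```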
Let’s say we’re using a recent 10-K filing from Meta.
We might use a schema like this to pull a few basic metrics out. Note that the `description` field is useful for adding more context around what the field is. (In the future, we’ll be adding separate field validator rules.)
```python
PAGE_SCHEMA = """{
    "type": "object",
    "properties": {
        "metrics": {
            "type": "object",
            "properties": {
                "diluted_eps_2025": {
                    "type": "number",
                    "description": "The diluted Earnings per Share (EPS) for 2025"
                },
                "diluted_eps_2024": {
                    "type": "number",
                    "description": "The diluted Earnings per Share (EPS) for 2024"
                },
                "pct_change_diluted_eps_2024_to_2025": {
                    "type": "number",
                    "description": "The percentage change in diluted Earnings per Share (EPS) from 2024 to 2025"
                }
            }
        }
    },
    "required": ["metrics"]
}"""
```
Submitting a request to `marker` consists of two things: the form data (your PDF file plus parameters like `page_schema`), and your API key in the request headers.
Let’s go ahead and submit our request.
```python
import requests

url = "https://www.datalab.to/api/v1/marker"

form_data = {
    'file': ('meta_10k.pdf', open('meta_10k.pdf', 'rb'), 'application/pdf'),
    'page_schema': (None, PAGE_SCHEMA),
    'output_format': (None, 'json'),
    'use_llm': (None, True)
}
headers = {"X-Api-Key": "YOUR_API_KEY"}

# Submit your request
response = requests.post(url, files=form_data, headers=headers)
data = response.json()
```
Your response will look something like this:
```json
{
  "success": true,
  "error": null,
  "request_id": "VRLOhcsLzfUqX3MQek1oww",
  "request_check_url": "https://www.datalab.to/api/v1/marker/VRLOhcsLzfUqX3MQek1oww",
  "versions": null
}
```
You can then poll for completion by hitting `request_check_url` every few seconds.
```python
import time

# Use request_check_url to poll for job completion
max_polls = 300
check_url = data["request_check_url"]

for i in range(max_polls):
    time.sleep(2)
    response = requests.get(check_url, headers=headers)  # Don't forget to send the auth headers
    data = response.json()
    if data["status"] == "complete":
        break
```
Note that `status` will be `"processing"` until it’s done (at which point it changes to `"complete"`).
When it’s done, your response will look something like this:
```json
{
  "status": "complete",
  "json": {
    "children": [...],
  },
  "extraction_schema_json": "{...your extraction results...}",
  ...
}
```
Two really important things to call out:

- `extraction_schema_json` is returned as a `string` instead of a `dict` in case of JSON parse issues (we sometimes see edge cases, especially with inline math equations and LaTeX, but it’s rare). You can usually load it directly into JSON and recover your whole schema. Each extracted field also comes with `[fieldname]_citations`, which includes a list of Block IDs from your converted PDF that we cited.
- When running `marker` in Extraction mode, the original converted PDF is always available within the `json` response field. You can access all blocks within the `children` tag, and they maintain their original hierarchy (if there is one). Each block includes its original `ID` and bounding boxes, so you can show citations and track data lineage easily as part of your document ingestion pipelines!

We may modify some of the response structure, especially around citations, as we go from Beta to a General Release, but try it out and let us know how it works for now.
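To make that concrete, here’s a sketch of reading results back and resolving a citation to its source block. The response below is mocked with placeholder values and a hypothetical block ID, but it follows the string-encoded `extraction_schema_json` and `[fieldname]_citations` conventions described above:

```python
import json

def find_block(node, block_id):
    """Recursively search the converted-PDF block tree for a given block ID."""
    if node.get("id") == block_id:
        return node
    for child in node.get("children") or []:
        found = find_block(child, block_id)
        if found:
            return found
    return None

# Mocked response with placeholder values, following the shape described above
response = {
    "status": "complete",
    "json": {"children": [
        {"id": "/page/0/Table/3", "block_type": "Table", "children": None},
    ]},
    "extraction_schema_json": json.dumps({
        "metrics": {
            "diluted_eps_2025": 1.23,
            "diluted_eps_2025_citations": ["/page/0/Table/3"],
        }
    }),
}

# extraction_schema_json arrives as a string -- load it back into a dict
results = json.loads(response["extraction_schema_json"])

# Resolve a citation back to the original block in the converted PDF
cited_id = results["metrics"]["diluted_eps_2025_citations"][0]
block = find_block(response["json"], cited_id)
```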
Okay! We got it working with a simple document and a straightforward schema. What if you’ve got a multi-hundred page PDF?
There are a few things we’re working on in the coming weeks and months to help with this. In the meantime, here are a few things you can do to keep things moving along.
If your extraction schema is typically constrained to a set of pages within your document and you know this upfront, use the `page_range` parameter in the API to ensure we only process the relevant pages. You’ll only be charged for those (even if your document is much longer).
When you submit your `marker` request, set `page_range` to the right values. For example, `0,2-4` will process pages 0, 2, 3, and 4. Note that this overrides `max_pages` if you set that too, and that our page ranges are 0-indexed (so 0 is the first page).
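To sanity-check a range spec before submitting, a small helper like this (not part of the API, just an illustration of the semantics) can expand it locally:

```python
def parse_page_range(spec):
    """Expand a page_range spec like "0,2-4" into a sorted list of 0-indexed pages."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            pages.update(range(int(start), int(end) + 1))
        else:
            pages.add(int(part))
    return sorted(pages)

print(parse_page_range("0,2-4"))  # [0, 2, 3, 4]
```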
Let’s say you have a massive 100-page file with lots of different sections. If you don’t know what pages they’re on, but do know the specific extraction schemas you’d use for each section, here’s one way to scale up your inference speed and improve accuracy.

1. Set `page_range` to `0-6` (or whichever range includes the entire Table of Contents), and run it with an Extraction schema that’s designed to pull out a table of contents.
2. Use the extracted table of contents to determine `page_range` values for each section.
3. Run `marker` using each `page_range` and the corresponding extraction schema for the info you know is in them.

Here’s a complete example including `marker` submission, polling, and dynamic page range extraction.
```python
import requests
import time
import json

API_URL = "https://www.datalab.to/api/v1/marker"
API_KEY = "YOUR_API_KEY"
HEADERS = {"X-Api-Key": API_KEY}

SAMPLE_TOC_EXTRACTION_SCHEMA = {
    "type": "object",
    "title": "ToCExtractionSchema",
    "description": "Schema to pull out the table of contents",
    "properties": {
        "table_of_contents": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "section_name": {
                        "type": "string",
                        "description": "the name of the section from the table of contents"
                    },
                    "page_range": {
                        "type": "string",
                        "description": "the page range or page number of the item from the table of contents"
                    }
                }
            }
        }
    },
    "required": ["table_of_contents"]
}
```
```python
def run_marker_extraction(pdf_path, schema_json, page_range=None):
    """
    Submit a marker request with a schema and an optional page range.
    Poll until complete, then return the parsed extraction results as a dict.
    """
    # Accept either a dict or an already-serialized JSON string
    if isinstance(schema_json, dict):
        schema_json = json.dumps(schema_json)

    with open(pdf_path, "rb") as f:
        files = {
            'file': ('document.pdf', f, 'application/pdf'),
            'page_schema': (None, schema_json),
            'use_llm': (None, True)
        }
        if page_range:
            files['page_range'] = (None, page_range)

        # Submit request
        response = requests.post(API_URL, files=files, headers=HEADERS)
        data = response.json()

    check_url = data["request_check_url"]

    # Poll until complete
    max_polls = 300
    for _ in range(max_polls):
        time.sleep(2)
        poll = requests.get(check_url, headers=HEADERS).json()
        if poll.get("status") == "failed":
            raise RuntimeError(f"Extraction failed: {poll.get('error')}")
        if poll.get("status") == "complete":
            return json.loads(poll["extraction_schema_json"])
    raise TimeoutError("Extraction job did not complete in time.")
```
```python
def dynamic_page_range_extraction(pdf_path, toc_schema, schemas_by_section):
    """
    1. Extract the TOC from the first few pages.
    2. Parse the TOC into section -> page_range mappings.
    3. Run marker again per section, using its schema + page range.
    4. Merge the results into a single dict.
    """
    # Step 1: Extract the TOC
    toc_result = run_marker_extraction(pdf_path, schema_json=toc_schema, page_range="0-6")

    # Step 2: Parse the TOC into a usable mapping (customize the parser as needed)
    section_page_ranges = parse_toc(toc_result)

    # Step 3: Extract per-section
    all_results = {}
    for section, page_range in section_page_ranges.items():
        schema_json = schemas_by_section.get(section)
        if schema_json:
            section_result = run_marker_extraction(pdf_path, schema_json=schema_json, page_range=page_range)
            all_results[section] = section_result
    return all_results
```
```python
def parse_toc(toc_dict):
    """
    Example TOC parser: converts the TOC dict into {section: page_range}.
    In practice you'd implement parsing logic based on your schema design.
    """
    page_map = {}
    for item in toc_dict.get("table_of_contents", []):
        section = item.get("section_name")
        page_range = item.get("page_range")
        if section and page_range:
            page_map[section] = page_range  # could normalize single pages into ranges if you need to
    return page_map
```
Sign up for Datalab and try out Forge Extract. Reach out to us if you want credits to try it out, or have any questions tailored to your needs!