Product Updates
September 8, 2025
A few weeks ago we announced the beta release of Forge Extract, which lets you pass in JSON schemas and get exactly what you need out of your PDFs. It’s been super rewarding to hear how well it’s being received by everyone!
As we iterate on the feature to get it to general release, we wanted to showcase how you can run extraction through our API directly once you finish validating your schemas within Forge Extract.
We’ll cover:

- Running Structured Extraction through the `marker` API
- Polling for and reading your results
- Tips for handling large, multi-hundred page PDFs
While `marker` lets you convert PDFs into HTML, JSON, or Markdown, Structured Extraction lets you define a schema and pull out only the fields you care about.
You can do this by setting the `page_schema` parameter in your `marker` request, which forces it to fill in your schema after PDF conversion finishes.
The easiest way to generate this correctly is to use our editor in Forge Extract, or to create a Pydantic schema and then convert it to JSON with `.model_dump_json()`. We always recommend trying your schemas in Forge Extract first, since it’s easy to debug issues before running on a larger batch.
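If you go the Pydantic route, here’s a minimal sketch of what that could look like. The model names are hypothetical, and we use Pydantic v2’s `model_json_schema()` to produce the JSON Schema as a dict before serializing it to a string:

```python
import json
from pydantic import BaseModel, Field

class Metrics(BaseModel):
    diluted_eps_2025: float = Field(description="The diluted EPS for 2025")
    diluted_eps_2024: float = Field(description="The diluted EPS for 2024")

class PageSchema(BaseModel):
    metrics: Metrics

# Serialize the generated JSON Schema into a string for the page_schema parameter
PAGE_SCHEMA = json.dumps(PageSchema.model_json_schema())
```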
Let’s say we’re using a recent 10-K filing from Meta.
We might use a schema like this to pull a few basic metrics out. Note that the `description` field is useful for adding more context around what the field is. (In the future, we’ll be adding separate field validator rules.)
```python
PAGE_SCHEMA = """{
    "type": "object",
    "properties": {
        "metrics": {
            "type": "object",
            "properties": {
                "diluted_eps_2025": {
                    "type": "number",
                    "description": "The diluted Earnings per Share (EPS) for 2025"
                },
                "diluted_eps_2024": {
                    "type": "number",
                    "description": "The diluted Earnings per Share (EPS) for 2024"
                },
                "pct_change_diluted_eps_2024_to_2025": {
                    "type": "number",
                    "description": "The percentage change in diluted Earnings per Share (EPS) from 2024 to 2025"
                }
            }
        }
    },
    "required": ["metrics"]
}"""
```
Submitting a request to `marker` consists of two things: the form data (your PDF file plus parameters like `page_schema`), and your API key in the request headers.
Let’s go ahead and submit our request.
```python
import requests

url = "https://www.datalab.to/api/v1/marker"

form_data = {
    'file': ('meta_10k.pdf', open('meta_10k.pdf', 'rb'), 'application/pdf'),
    'page_schema': (None, PAGE_SCHEMA),
    'output_format': (None, 'json'),
    'use_llm': (None, True)
}
headers = {"X-Api-Key": "YOUR_API_KEY"}

# Submit your request
response = requests.post(url, files=form_data, headers=headers)
data = response.json()
```
Your response will look something like this:
```json
{
  "success": true,
  "error": null,
  "request_id": "VRLOhcsLzfUqX3MQek1oww",
  "request_check_url": "https://www.datalab.to/api/v1/marker/VRLOhcsLzfUqX3MQek1oww",
  "versions": null
}
```
You can then poll for completion by hitting `request_check_url` every few seconds.
```python
import time

# Use request_check_url to poll for job completion
max_polls = 300
check_url = data["request_check_url"]

for i in range(max_polls):
    time.sleep(2)
    response = requests.get(check_url, headers=headers)  # Don't forget to send the auth headers
    data = response.json()
    if data["status"] == "complete":
        break
```
Note that `status` will be `"processing"` until it’s done (at which point it changes to `"complete"`).
When it’s done, your response will look something like this:
```json
{
  "status": "complete",
  "json": {
    "children": [...],
  },
  "extraction_schema_json": "{...your extraction results...}",
  ...
}
```
Two really important things to call out:

- `extraction_schema_json` is returned as a `string` instead of a `dict` in case of JSON parse issues (we sometimes see edge cases, especially with inline math equations and LaTeX, but it’s rare). You can usually load it directly into JSON and recover your whole schema. Each extracted field also comes with `[fieldname]_citations`, which includes a list of Block IDs from your converted PDF that we cited.
- When running `marker` in Extraction mode, the original converted PDF is always available within the `json` response field. You can access all blocks within the `children` tag, and they maintain their original hierarchy (if there is one). Each block includes its original `ID` and bounding boxes, so you can show citations and track data lineage easily as part of your document ingestion pipelines!

We may modify some of the response structure, especially around citations, as we go from Beta to a General Release, but try it out and let us know how it works for now.
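To make that concrete, here’s a sketch of reading results back and resolving a citation to its source block. The response below is mocked with placeholder values and a hypothetical block ID, but it follows the string-encoded `extraction_schema_json` and `[fieldname]_citations` conventions described above:

```python
import json

def find_block(node, block_id):
    """Recursively search the converted-PDF block tree for a given block ID."""
    if node.get("id") == block_id:
        return node
    for child in node.get("children") or []:
        found = find_block(child, block_id)
        if found:
            return found
    return None

# Mocked response with placeholder values, following the shape described above
response = {
    "status": "complete",
    "json": {"children": [
        {"id": "/page/0/Table/3", "block_type": "Table", "children": None},
    ]},
    "extraction_schema_json": json.dumps({
        "metrics": {
            "diluted_eps_2025": 1.23,
            "diluted_eps_2025_citations": ["/page/0/Table/3"],
        }
    }),
}

# extraction_schema_json arrives as a string -- load it back into a dict
results = json.loads(response["extraction_schema_json"])

# Resolve a citation back to the original block in the converted PDF
cited_id = results["metrics"]["diluted_eps_2025_citations"][0]
block = find_block(response["json"], cited_id)
```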
Okay! We got it working with a simple document and a straightforward schema. What if you’ve got a multi-hundred page PDF?
There are a few things we’re working on in the coming weeks and months to help with this. In the meantime, here are a few things you can do to keep things moving along.
If your extraction schema is typically constrained to a set of pages within your document and you know this upfront, use the `page_range` parameter in the API to ensure we only process the relevant pages. You’ll only be charged for those (even if your document is much longer).
When you submit your `marker` request, set `page_range` to the right values. For example, `0,2-4` will process pages 0, 2, 3, and 4. Note that this overrides `max_pages` if you set that too, and that our page ranges are 0-indexed (so 0 is the first page).
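To sanity-check a range spec before submitting, a small helper like this (not part of the API, just an illustration of the semantics) can expand it locally:

```python
def parse_page_range(spec):
    """Expand a page_range spec like "0,2-4" into a sorted list of 0-indexed pages."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            pages.update(range(int(start), int(end) + 1))
        else:
            pages.add(int(part))
    return sorted(pages)

print(parse_page_range("0,2-4"))  # [0, 2, 3, 4]
```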
Let’s say you have a massive 100-page file with lots of different sections. If you don’t know what pages they’re on, but do know the specific extraction schemas you’d use for each section, here’s one way to scale up your inference speed and improve accuracy.

1. Set `page_range` to `0-6` (or whichever range includes the entire Table of Contents), and run it with an Extraction schema that’s designed to pull out a table of contents.
2. Use the extracted table of contents to determine `page_range` values for each section.
3. Run `marker` using each `page_range` and the corresponding extraction schema for the info you know is in them.

Here’s a complete example including `marker` submission, polling, and dynamic page range extraction.
```python
import requests
import time
import json

API_URL = "https://www.datalab.to/api/v1/marker"
API_KEY = "YOUR_API_KEY"
HEADERS = {"X-Api-Key": API_KEY}

SAMPLE_TOC_EXTRACTION_SCHEMA = {
    "type": "object",
    "title": "ToCExtractionSchema",
    "description": "Schema to pull out the table of contents",
    "properties": {
        "table_of_contents": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "section_name": {
                        "type": "string",
                        "description": "the name of the section from the table of contents"
                    },
                    "page_range": {
                        "type": "string",
                        "description": "the page range or page number of the item from the table of contents"
                    }
                }
            }
        }
    },
    "required": ["table_of_contents"]
}
```
```python
def run_marker_extraction(pdf_path, schema_json, page_range=None):
    """
    Submit a marker request with a schema and an optional page range.
    Poll until complete, then return the parsed extraction results as a dict.
    """
    # Accept either a dict or an already-serialized JSON string
    if isinstance(schema_json, dict):
        schema_json = json.dumps(schema_json)

    with open(pdf_path, "rb") as f:
        files = {
            'file': ('document.pdf', f, 'application/pdf'),
            'page_schema': (None, schema_json),
            'use_llm': (None, True)
        }
        if page_range:
            files['page_range'] = (None, page_range)

        # Submit request
        response = requests.post(API_URL, files=files, headers=HEADERS)
        data = response.json()

    check_url = data["request_check_url"]

    # Poll until complete
    max_polls = 300
    for _ in range(max_polls):
        time.sleep(2)
        poll = requests.get(check_url, headers=HEADERS).json()
        if poll.get("status") == "failed":
            raise RuntimeError(f"Extraction failed: {poll.get('error')}")
        if poll.get("status") == "complete":
            return json.loads(poll["extraction_schema_json"])
    raise TimeoutError("Extraction job did not complete in time.")
```
```python
def dynamic_page_range_extraction(pdf_path, toc_schema, schemas_by_section):
    """
    1. Extract the TOC from the first few pages.
    2. Parse the TOC into section -> page_range mappings.
    3. Run marker again per section, using its schema + page range.
    4. Merge the results into a single dict.
    """
    # Step 1: Extract the TOC
    toc_result = run_marker_extraction(pdf_path, schema_json=toc_schema, page_range="0-6")

    # Step 2: Parse the TOC into a usable mapping (customize the parser as needed)
    section_page_ranges = parse_toc(toc_result)

    # Step 3: Extract per-section
    all_results = {}
    for section, page_range in section_page_ranges.items():
        schema_json = schemas_by_section.get(section)
        if schema_json:
            section_result = run_marker_extraction(pdf_path, schema_json=schema_json, page_range=page_range)
            all_results[section] = section_result
    return all_results
```
```python
def parse_toc(toc_dict):
    """
    Example TOC parser: converts the TOC dict into {section: page_range}.
    In practice you'd implement parsing logic based on your schema design.
    """
    page_map = {}
    for item in toc_dict.get("table_of_contents", []):
        section = item.get("section_name")
        page_range = item.get("page_range")
        if section and page_range:
            page_map[section] = page_range  # could normalize single pages into ranges if you need to
    return page_map
```
Sign up for Datalab and try out Forge Extract. Reach out to us if you want credits to try it out, or have any questions tailored to your needs!