Company Updates
5 mins
November 4, 2025
Extracting structured data from invoices should be straightforward in 2025. Modern LLMs can follow schemas, parse complex documents, and return perfectly formatted JSON.
But here's the problem: they can only extract what they can actually read. And on real-world invoices (rotated scans, dense tables, skewed images), even frontier models stumble.
We set out to test a hypothesis: better OCR leads to better structured extraction.
A few months ago, we announced Forge Extract, Datalab's structured extraction engine that pulls targeted information from documents using JSON schemas you define. Forge Extract is powered by our state-of-the-art OCR models, built specifically to handle the documents that break other systems.
Over the last 6 months, we shipped numerous updates to Marker and Surya, pushing the boundaries of OCR accuracy - achieving SOTA math performance that powers AI labs, and outperforming massive frontier models on third-party OCR benchmarks. We also just released Chandra - our latest and greatest model that tops the olmOCR benchmark, and has helped convert the hardest documents into clean, accessible markdown.
High quality OCR is the foundation for structured extraction, and in this blog post we’ll show exactly how better OCR leads to more accurate structured extraction.
To put this thesis to the test, we'll cover one of the most common and challenging document types - invoices.
Invoices are the perfect testbed for our experiment. They're notoriously difficult to parse - filled with dense text, complex tables with many rows and columns, handwritten notes, strikethroughs, and a myriad of other challenging artifacts. That’s why many organizations still extract payment details manually, a process that’s both time-consuming and error-prone.
We curated a challenging dataset of invoices from a diverse set of industries, and employed a clear but comprehensive schema targeting the most important aspects of an invoice. We manually annotated this set of documents with the target schema to give us an accurate source of ground truth.
```json
{
  "type": "object",
  "title": "ExtractionSchema",
  "description": "Schema for structured data extraction",
  "properties": {
    "invoice_items": {
      "type": "array",
      "description": "List of line items in the invoice",
      "items": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string",
            "description": "A concise, unique identifying name for the line item."
          },
          "rate": {
            "type": "number",
            "description": "The rate per unit of the item."
          },
          "units": {
            "type": "number",
            "description": "The number of units of the line item."
          },
          "total_cost": {
            "type": "number",
            "description": "The total cost for the line item."
          },
          "date": {
            "type": "string",
            "description": "Date for the line item. Should follow the format DD/MM/YYYY. If any of the information (like the year) is missing, replace it with XXXX."
          }
        }
      }
    },
    "payment_summary": {
      "type": "object",
      "description": "A summary of costs and other payment information. If this is not provided, all fields can be left blank. Do not infer anything.",
      "properties": {
        "gross_total": {
          "type": "number",
          "description": "Total cost for the invoice without any deductions."
        },
        "net_total": {
          "type": "number",
          "description": "Total cost for the invoice after any deductions."
        }
      }
    },
    "invoice_date": {
      "type": "string",
      "description": "Date of the invoice. Should follow the format DD/MM/YYYY. If any of the information (like the year) is missing, replace it with XXXX."
    },
    "invoice_id": {
      "type": "string",
      "description": "A short string or number serving as the invoice ID or invoice number."
    },
    "invoice_company": {
      "type": "string",
      "description": "The name of the company that generated the invoice."
    },
    "billing_details": {
      "type": "object",
      "description": "Details about the entity being billed for the invoice.",
      "properties": {
        "billing_company_or_person": {
          "type": "string",
          "description": "Name of the person or company being billed."
        },
        "billing_address": {
          "type": "string",
          "description": "Address of the person or company being billed."
        }
      }
    },
    "remittance_address": {
      "type": "string",
      "description": "Address for remittance/payment."
    }
  },
  "required": [
    "invoice_items",
    "payment_summary",
    "invoice_date",
    "invoice_id",
    "invoice_company",
    "billing_details",
    "remittance_address"
  ]
}
```

Using Gemini 2.5 Flash and GPT-5-mini, we benchmarked the accuracy of structured extraction under two settings:
1. Image-only: the model extracts directly from rendered images of the document.
2. With Datalab OCR: the model is given Chandra's Markdown output of the document as grounding text for the extraction.
We experimented with a variety of image DPIs and resolutions for each setting, then used the best-performing configuration to report the final scores.
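To make the setup concrete, here's a minimal sketch of the kind of harness involved. It assumes the OpenAI Python SDK and pdf2image, uses placeholder file paths and page counts, and assumes Chandra's Markdown output has already been generated for each document - it illustrates the two settings, and is not our exact benchmarking code.

```python
import base64
import io
import json

from openai import OpenAI
from pdf2image import convert_from_path

client = OpenAI()
EXTRACTION_SCHEMA = json.load(open("invoice_schema.json"))  # the schema shown above


def render_page(pdf_path: str, page: int, dpi: int) -> str:
    """Render one PDF page to a base64-encoded PNG at the given DPI."""
    image = convert_from_path(pdf_path, dpi=dpi, first_page=page, last_page=page)[0]
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode()


def extract(pdf_path: str, num_pages: int, dpi: int, markdown: str | None = None) -> dict:
    """Run schema-guided extraction, optionally grounded on OCR Markdown."""
    content = [{"type": "text", "text": "Extract the invoice fields defined by the schema."}]
    for page in range(1, num_pages + 1):
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{render_page(pdf_path, page, dpi)}"},
        })
    if markdown is not None:
        # Setting 2: include Chandra's Markdown output as grounding text.
        content.append({"type": "text", "text": f"OCR of the document:\n\n{markdown}"})

    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": content}],
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "invoice_extraction", "schema": EXTRACTION_SCHEMA},
        },
    )
    return json.loads(response.choices[0].message.content)


# Setting 1: images only.
image_only = extract("invoice.pdf", num_pages=2, dpi=200)

# Setting 2: images plus Chandra's Markdown as grounding text.
chandra_md = open("invoice.chandra.md").read()
grounded = extract("invoice.pdf", num_pages=2, dpi=200, markdown=chandra_md)
```

In the grounded setting the OCR Markdown simply rides along in the prompt; everything else - schema, model, and rendering - stays identical.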

Results: Providing LLMs with Chandra's Markdown outputs significantly boosts accuracy.
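One natural way to turn an extraction into a score is field-level agreement with the manual annotations. The sketch below shows what such a comparison could look like; it's illustrative, not our exact metric implementation.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts/lists into {"path": leaf_value} pairs."""
    if isinstance(value, dict):
        items = {}
        for key, child in value.items():
            items.update(flatten(child, f"{prefix}{key}."))
        return items
    if isinstance(value, list):
        items = {}
        for i, child in enumerate(value):
            items.update(flatten(child, f"{prefix}{i}."))
        return items
    return {prefix.rstrip("."): value}


def field_accuracy(predicted: dict, ground_truth: dict) -> float:
    """Fraction of ground-truth leaf fields the prediction matches exactly."""
    truth = flatten(ground_truth)
    pred = flatten(predicted)
    if not truth:
        return 1.0
    matches = sum(1 for path, value in truth.items() if pred.get(path) == value)
    return matches / len(truth)
```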
Below are a few examples where Gemini gets it wrong without Datalab’s markdown to help ground the extraction.
Note that the screenshots below are truncated for clarity. Most documents span multiple pages, and extractions are long, often spanning >100 lines.
Invoices often hit processing systems after being manually scanned, and can arrive skewed or even rotated by 90 degrees. On this rotated page, Gemini incorrectly extracts the cost for the last two line items, while Datalab’s extraction is correct.

Gemini completely misses row number 8 from this dense table! This also happens in another instance, while Datalab extracts all 10 rows perfectly. Note that the row indices in the extracted JSON are 0-indexed.

This document contains a scanned, slightly skewed table. Gemini completely misses a row - the fifth and final instance of “Channel 11 News at 6pm M-F”. Additionally, the subsequent “Wheel of Fortune” line item has the wrong extracted cost (possibly mixed up with the following row). Once again, Datalab extracts all 22 line items from the document perfectly.

The last page of this document also contains the gross and net totals, which Gemini fails to extract correctly.

Across our testing, we found that Gemini 2.5 Flash (and other LLMs) performed structured extraction well, but suffered from common failure modes. As demonstrated above, without grounding text, they missed rows from tables, made minor errors in extracted digits, and in some cases completely missed small text.
These errors can quickly add up and lead to costly issues when the data is ingested by downstream pipelines.
Using Datalab’s OCR as input gives the models highly accurate grounding to extract from, without missing or misreading any details.
Forge Extract is now available via API and the Playground. If you don't have an account yet, sign up here and you'll automatically receive $5 in credits.
Define your own schema, upload your documents, and see how accurate structured extraction can get when it starts with world-class OCR.
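If you'd rather start from code than the Playground, a request might look roughly like the sketch below. The endpoint URL, header, and field names are placeholders rather than the real API surface - check the Forge Extract API docs for the actual route and parameters.

```python
import json

import requests

API_KEY = "YOUR_DATALAB_API_KEY"
# Placeholder URL and field names - see the Forge Extract API docs for the real ones.
FORGE_EXTRACT_URL = "https://<datalab-api-base>/forge-extract"

schema = json.load(open("invoice_schema.json"))  # the schema defined above

with open("invoice.pdf", "rb") as f:
    response = requests.post(
        FORGE_EXTRACT_URL,
        headers={"X-Api-Key": API_KEY},
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={"schema": json.dumps(schema)},
    )

response.raise_for_status()
print(json.dumps(response.json(), indent=2))
```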