Grounded Intelligence: How High-Fidelity OCR Drives Accurate Structured Extraction

November 4, 2025

Extracting structured data from invoices should be straightforward in 2025. Modern LLMs can follow schemas, parse complex documents, and return perfectly formatted JSON.

But here's the problem: they can only extract what they can actually read. And on real-world invoices (rotated scans, dense tables, skewed images), even frontier models stumble.

We set out to test a hypothesis: better OCR leads to better structured extraction.

The Foundation: World-class OCR

A few months ago, we announced Forge Extract, Datalab's structured extraction engine that pulls targeted information from documents using JSON schemas you define. Forge Extract is powered by our state-of-the-art OCR models, built specifically to handle the documents that break other systems.

Over the last six months, we shipped numerous updates to Marker and Surya, pushing the boundaries of OCR accuracy - achieving SOTA math performance that powers AI labs, and outperforming massive frontier models on third-party OCR benchmarks. We also just released Chandra - our latest and greatest model, which tops the olmOCR benchmark and has helped convert the hardest documents into clean, accessible markdown.

High-quality OCR is the foundation for structured extraction, and in this blog post we’ll show exactly how better OCR leads to more accurate structured extraction.

To put this thesis to the test, we'll cover one of the most common and challenging document types - invoices.

The Perfect Test Case: Invoices

Invoices are the perfect testbed for our experiment. They're notoriously difficult to parse - filled with dense text, complex tables with many rows and columns, handwritten notes, strikethroughs, and a myriad of other challenging artifacts. That’s why many organizations still extract payment details manually, a process that’s both time-consuming and error-prone.

We curated a challenging dataset of invoices from a diverse set of industries, and employed a clear but comprehensive schema targeting the most important aspects of an invoice. We manually annotated this set of documents with the target schema to give us an accurate source of ground truth.

{
  "type": "object",
  "title": "ExtractionSchema",
  "description": "Schema for structured data extraction",
  "properties": {
    "invoice_items": {
      "type": "array",
      "description": "List of line items in the invoice",
      "items": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string",
            "description": "A concise unique identifying name for the line item."
          },
          "rate": {
            "type": "number",
            "description": "The rate per unit of the item."
          },
          "units": {
            "type": "number",
            "description": "The number of units of the line item"
          },
          "total_cost": {
            "type": "number",
            "description": "The total cost for the line item."
          },
          "date": {
            "type": "string",
            "description": "Date for the line item. Should follow the format DD/MM/YYYY. If any of the information (like the year) is missing, it can be replaced with XXXX"
          }
        }
      }
    },
    "payment_summary": {
      "type": "object",
      "description": "A summary of costs and other payment information. If this is not provided, all fields can be left blank. Do not infer anything.",
      "properties": {
        "gross_total": {
          "type": "number",
          "description": "Total cost for the invoice without any deductions."
        },
        "net_total": {
          "type": "number",
          "description": "Total cost for the invoice after any deductions."
        }
      }
    },
    "invoice_date": {
      "type": "string",
      "description": "Date of the invoice. Should follow the format DD/MM/YYYY. If any of the information (like the year) is missing, it can be replaced with XXXX"
    },
    "invoice_id": {
      "type": "string",
      "description": "A short string or number serving as the invoice ID or invoice number"
    },
    "invoice_company": {
      "type": "string",
      "description": "The name of the company that generated the invoice"
    },
    "billing_details": {
      "type": "object",
      "description": "Details about the entity being billed for the invoice",
      "properties": {
        "billing_company_or_person": {
          "type": "string",
          "description": "Name of the person or company being billed"
        },
        "billing_address": {
          "type": "string",
          "description": "Address of the person or company being billed"
        }
      }
    },
    "remittance_address": {
      "type": "string",
      "description": "Address for remittance/payment."
    }
  },
  "required": [
    "invoice_items",
    "payment_summary",
    "invoice_date",
    "invoice_id",
    "invoice_company",
    "billing_details",
    "remittance_address"
  ]
}
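
To make the schema concrete, here's an illustrative, made-up extraction that conforms to it, validated with the Python jsonschema library. The invoice values and the schema file path are placeholders, not real data from our set:

# Illustrative only: a made-up extraction validated against the schema above.
# "invoice_schema.json" is a placeholder path holding the schema shown earlier.
import json
from jsonschema import validate

with open("invoice_schema.json") as f:
    invoice_schema = json.load(f)

sample_extraction = {
    "invoice_items": [
        {"name": "Channel 11 News at 6pm M-F", "rate": 250.0, "units": 5,
         "total_cost": 1250.0, "date": "14/03/XXXX"},  # year missing, so XXXX per the schema
    ],
    "payment_summary": {"gross_total": 1250.0, "net_total": 1187.5},
    "invoice_date": "15/03/2024",
    "invoice_id": "INV-0042",
    "invoice_company": "Example Media Co.",
    "billing_details": {
        "billing_company_or_person": "Acme Corp",
        "billing_address": "123 Main St, Springfield",
    },
    "remittance_address": "PO Box 100, Springfield",
}

# Raises jsonschema.exceptions.ValidationError if the extraction doesn't match the schema.
validate(instance=sample_extraction, schema=invoice_schema)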

The Experiment: Images vs. OCR + Images

Using Gemini 2.5 Flash and GPT-5-mini, we benchmarked the accuracy of structured extraction under two settings:

  1. Image only: Providing the LLM with only the image of each document page.
  2. Image + Chandra Markdown: Providing the LLM with both the image and the Chandra-extracted markdown of each document page.

We experimented with a variety of image DPIs and resolutions for each setting, then used the best-performing configuration to report the final scores.
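
For a rough idea of how the two settings differ, here's a minimal sketch of a single-page request using the OpenAI Python client. It's illustrative only - our actual harness, prompts, and parameters differ, and the page image, Chandra markdown, and schema string are assumed to be prepared beforehand:

# Minimal sketch of the two extraction settings (illustrative, not our exact harness).
# Assumes `page_png_b64` holds a base64-encoded page image, `chandra_markdown` holds
# Chandra's OCR output for the same page, and `schema_json` is the schema shown above.
from openai import OpenAI

client = OpenAI()

def extract_page(page_png_b64: str, schema_json: str, chandra_markdown: str | None = None) -> str:
    """Ask the model for JSON matching the schema, with or without OCR grounding text."""
    prompt = f"Extract the data from this invoice page as JSON matching this schema:\n{schema_json}"
    if chandra_markdown is not None:
        # Setting 2: image + Chandra markdown as grounding text.
        prompt += f"\n\nOCR of the page (markdown):\n{chandra_markdown}"
    # Setting 1 sends only the schema prompt and the page image.
    response = client.chat.completions.create(
        model="gpt-5-mini",  # model name taken from the post; the exact API identifier may differ
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{page_png_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content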

Results: Providing LLMs with Chandra's Markdown outputs significantly boosts accuracy.
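
We won't go into the exact scoring here, but as a purely illustrative sketch, a simple field-level exact-match check against the manual annotations might look like the following (a hypothetical helper with made-up values, not our evaluation code):

# Illustrative only: score top-level scalar fields by exact match against the annotation.
def field_accuracy(predicted: dict, annotated: dict, fields: list[str]) -> float:
    """Fraction of the given fields whose predicted value exactly matches the annotation."""
    correct = sum(1 for f in fields if predicted.get(f) == annotated.get(f))
    return correct / len(fields)

# Made-up example: the model got the invoice ID wrong, so 3 of 4 fields match.
annotated = {"invoice_date": "15/03/2024", "invoice_id": "INV-0042",
             "invoice_company": "Example Media Co.", "remittance_address": "PO Box 100"}
predicted = {"invoice_date": "15/03/2024", "invoice_id": "INV-0D42",
             "invoice_company": "Example Media Co.", "remittance_address": "PO Box 100"}
print(field_accuracy(predicted, annotated,
                     ["invoice_date", "invoice_id", "invoice_company", "remittance_address"]))
# -> 0.75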

Failure Modes Without OCR Grounding

Below are a few examples where Gemini gets it wrong without Datalab’s markdown to help ground the extraction.

Note that the screenshots below are truncated for clarity. Most documents span multiple pages, and extractions are long, often spanning >100 lines.

Rotated Invoices

Invoices often hit processing systems after being manually scanned, and can arrive skewed or even rotated by 90 degrees. On this rotated page, Gemini incorrectly extracts the cost for the last two line items, while Datalab’s extraction is correct.

Dense Tables

Gemini completely misses row number 8 from this dense table! The same thing happens in another instance as well, while Datalab extracts all 10 rows perfectly. Note that the row indices in the extracted JSON are 0-indexed.

Scanned Tables

This document has a slightly skewed, scanned table. Gemini completely misses a row from it - the fifth and final instance of “Channel 11 News at 6pm M-F”. Additionally, the subsequent “Wheel of Fortune” line item has the wrong extracted cost (possibly mixed up with the following row). Once again, Datalab extracts all 22 line items from the document perfectly.

Small Text

The last page of this document also contains the gross and net totals, which Gemini fails to extract correctly.

Why Grounding Matters

Across our testing, we found that Gemini 2.5 Flash (and other LLMs) were highly capable at structured extraction, but suffered from common failure modes. As demonstrated above, without grounding text, they missed rows from tables, made minor errors in extracted digits, and in some cases completely missed small text.

These errors can quickly add up and lead to costly issues when the data is ingested by downstream pipelines.

Using Datalab’s OCR output as an additional input gives the models highly accurate grounding text to extract from, so details aren’t missed or extracted incorrectly.

Try it yourself!

Forge Extract is now available via API and the Playground. If you don't have an account yet, sign up here and you'll automatically receive $5 in credits.

Define your own schema, upload your documents, and see how accurate structured extraction can get when it starts with world-class OCR.
