Launch Week Day 3: Layout Model Updates

September 25, 2025

At Datalab, we’ve designed and trained custom models that form the backbone of our document processing platform. We’ve been hard at work improving these models, and over the past few weeks, we shipped major upgrades to our OCR and Layout models.

Layout Matters More Than You Think

Document variations are endless: multi-column pages, newspaper-style layouts, tables, scribbled footnotes, text that wraps around images, and more. Layout models are trained to divide the page into blocks and determine the reading order of those blocks. Get the reading order wrong, and the parsed output is unreadable. Miss a block, and you risk losing critical information, like a footnote in a legal document or an important number in a financial table.
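To make the reading-order problem concrete, here is a minimal sketch in Python. It is not how Datalab's layout model works: the `Block` class, both ordering functions, and the sample coordinates are hypothetical and made up for illustration. It only shows why a naive top-to-bottom sort scrambles a two-column page, while even a toy column-aware rule recovers the intended order.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A hypothetical layout block: text plus a bounding box in page units."""
    text: str
    x0: float  # left edge
    y0: float  # top edge (y grows downward)
    x1: float  # right edge
    y1: float  # bottom edge


def naive_order(blocks):
    """Sort purely top-to-bottom, then left-to-right."""
    return sorted(blocks, key=lambda b: (b.y0, b.x0))


def column_aware_order(blocks, page_width):
    """Toy heuristic: read the left half of the page fully, then the right half."""
    mid = page_width / 2

    def key(b):
        center_x = (b.x0 + b.x1) / 2
        column = 0 if center_x < mid else 1
        return (column, b.y0)

    return sorted(blocks, key=key)


# A made-up two-column page: column A on the left, column B on the right.
page = [
    Block("A1", 50, 100, 290, 200),
    Block("B1", 310, 100, 550, 200),
    Block("A2", 50, 220, 290, 320),
    Block("B2", 310, 220, 550, 320),
]

naive = [b.text for b in naive_order(page)]
aware = [b.text for b in column_aware_order(page, page_width=600)]
print(naive)  # ['A1', 'B1', 'A2', 'B2'] -> columns interleaved
print(aware)  # ['A1', 'A2', 'B1', 'B2'] -> each column read in full
```

A fixed split down the middle breaks as soon as a page has three columns, a full-width table, or text wrapped around an image, which is the kind of case where a trained layout model earns its keep.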

Our layout models were specially trained on edge cases so that you get:

Well-formatted documents in their natural reading order in Datalab Parse
Positional information attached to every piece of extracted text, enabling fine-grained citations in Datalab Extract
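As a rough illustration of why positional information enables citations, the sketch below attaches a page number and bounding box to each piece of extracted text and looks up where a value came from. The `Span` class, the `cite` helper, and the sample data are all hypothetical; this is not the Datalab Extract API or its output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A hypothetical piece of extracted text with its position on the page."""
    text: str
    page: int
    bbox: tuple  # (x0, y0, x1, y1) in page units


def cite(spans, value):
    """Return (page, bbox) for every span whose text contains the value."""
    return [(s.page, s.bbox) for s in spans if value in s.text]


# Made-up sample spans, for illustration only.
spans = [
    Span("Total revenue: $4.2M", 3, (72.0, 410.0, 310.0, 424.0)),
    Span("See note 7 for details.", 3, (72.0, 700.0, 260.0, 712.0)),
]

citations = cite(spans, "$4.2M")
print(citations)  # [(3, (72.0, 410.0, 310.0, 424.0))]
```

With a location stored next to every span, an extracted field can point back to the exact region of the page it was read from.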

See our model in action on some traditionally challenging layout formats:

We’re constantly looking to push the boundaries of our models, so send the craziest document layouts you’ve seen to [email protected]. We’ll send credits if you shock us!

Third-Party Benchmarks Back It Up

The layout recognition improvements aren’t just qualitative; they translate into measurable performance gains. On olmOCR-bench, we’ve achieved our highest score yet: 76.5. This puts us ahead of services like Mathpix, Azure, Textract, and Google Document AI, open models like Qwen2.5-VL, and even frontier models like GPT-4o and Gemini Flash. And did we mention our model is only 700M parameters?

ICYMI: State-of-the-Art on arXiv math OCR

Our new OCR model is more accurate, and getting better every week! This is one of the many advantages of building and training our own models. Earlier this month, we shared that our models reached state-of-the-art math OCR performance. Read more about why this is extremely important for training frontier reasoning models here.
