Product Updates
September 25, 2025
At Datalab, we’ve designed and trained custom models that form the backbone of our document processing platform. We’ve been hard at work improving these models, and over the past few weeks, we shipped major upgrades to our OCR and Layout models.
Document variations are endless: multi-column pages, newspaper-style layouts, tables, scribbled footnotes, text that wraps around images, and more. Layout models are trained to divide the page into blocks and determine the reading order of those blocks. Get the reading order wrong, and the parsed output is unreadable. Miss a block, and you risk losing critical information, like a footnote in a legal document or an important number in a financial table.
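To make the reading-order problem concrete, here is a toy sketch in Python. It is not how our layout model works: the `Block` class, both functions, and the two-column midline rule are hypothetical, chosen only to show how a naive top-to-bottom sort scrambles a two-column page.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A hypothetical layout block: a label plus a bounding box in page units."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def naive_order(blocks):
    """Sort top-to-bottom, then left-to-right, ignoring columns entirely."""
    return sorted(blocks, key=lambda b: (b.y0, b.x0))


def column_aware_order(blocks, page_width):
    """Read the left column fully, then the right column.

    Assumption: exactly two columns, split at the page midline.
    Real pages break this rule constantly, which is why a trained
    model is used instead of a fixed heuristic.
    """
    mid = page_width / 2

    def key(b):
        center_x = (b.x0 + b.x1) / 2
        column = 0 if center_x < mid else 1
        return (column, b.y0)

    return sorted(blocks, key=key)


# A made-up two-column page with two blocks per column.
page = [
    Block("left-1", 50, 100, 290, 300),
    Block("right-1", 310, 100, 550, 300),
    Block("left-2", 50, 320, 290, 500),
    Block("right-2", 310, 320, 550, 500),
]

naive = [b.label for b in naive_order(page)]
aware = [b.label for b in column_aware_order(page, page_width=600)]

print(naive)  # columns interleaved: left-1, right-1, left-2, right-2
print(aware)  # columns kept together: left-1, left-2, right-1, right-2
```

The naive sort jumps between columns mid-paragraph, which is exactly the kind of output that reads as nonsense. The midline rule fixes this one page but fails as soon as a heading spans both columns or text wraps around an image.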
Our layout models were specially trained on edge cases to handle:
See our model in action on some traditionally challenging layout formats:
We’re constantly looking to push the boundaries of our models, so send the craziest document layouts you’ve seen to [email protected]. We’ll send credits if you shock us!
The layout recognition improvements aren’t just qualitative; they translate into measurable performance gains. On olmOCR-bench, we achieved our highest score yet: 76.5. This puts us ahead of services like Mathpix, Azure, Textract, and Google Document AI, open models like Qwen2.5VL, and even frontier models like GPT4o and Gemini Flash. And did we mention our model is only 700M parameters?
Our new OCR model is more accurate, and getting better every week! This is one of the many advantages of building and training our own models. Earlier this month, we shared that our models reached state-of-the-art math OCR performance. Read more about why this is extremely important for training frontier reasoning models here.