Product Updates
3 min read
September 19, 2025
(Adapted from Vik's X post on September 9, 2025)
High-quality mathematical text is one of the most important ingredients for building reasoning-capable language models. Papers like DeepSeekMath and NVIDIA's Nemotron demonstrate that continued pre-training on math-rich corpora significantly boosts reasoning accuracy. Hugging Face's SmolLM work shows that filtering out low-quality math tokens (keeping only the most precise) leads to better performance, even when the total token count drops.
Unfortunately, the best math data is often buried in decades-old research papers and textbooks, trapped behind complex layouts and finicky PDFs.
The path from scanned PDF to clean LaTeX isn’t straightforward. A typical math paper might contain dense formulas, inline symbols, proofs split across pages, or references embedded in diagrams. OCR systems can stumble on surprisingly small details: for example, mistaking an F for the Greek letter τ may look like a minor edit-distance error, but it completely changes the meaning of an equation.
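The edit-distance point can be made concrete: a single-character confusion like τ → F is a trivial error under string metrics, yet it denotes an entirely different function. A minimal sketch using the classic dynamic-programming Levenshtein distance (the example strings are illustrative, not drawn from any benchmark):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,              # delete ca
                cur[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb)  # substitute ca -> cb
            ))
        prev = cur
    return prev[-1]

ground_truth = "τ(x) = 2x"   # what the page actually says
ocr_output   = "F(x) = 2x"   # a one-character misread

# Distance 1 out of 9 characters: near-perfect by string metrics,
# but the equation now defines a different function.
print(levenshtein(ground_truth, ocr_output))  # → 1
```

This is why character-level accuracy numbers understate the problem for math: a metric that treats all substitutions equally cannot distinguish a harmless typo from a meaning-destroying one.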
Things get tougher once you leave English-language publications. Notation varies across regions; typesetting styles differ; some journals use older fonts or unconventional macros.
Layout diversity adds another layer of difficulty: textbooks interleave formulas with illustrations, tables, and references to figures that must stay linked.
All of this means that math OCR needs to be close to 100% accurate, handle different languages, and work across layouts.
Many practitioners rely on closed-source services like MathPix for math OCR. These tools perform well but come with limitations: they’re expensive, rate-limited, and not easily customizable for on-premise use.
The good news is that open-source models have caught up. At Datalab, we've been extending our open-source projects, Marker and Surya, to handle complex mathematical content. Marker now achieves state-of-the-art results on the external OLMOCR benchmark and has outperformed MathPix in internal evaluations conducted by a tier-one AI research lab.
On documents where GPT-5 misread τ as F, Marker produced flawless output.
Even on challenging multilingual examples, it kept errors minimal — and we’re actively improving its handling of Chinese and other scripts.
You can try them here:
We’re continuing to push the limits of math OCR with:
If you’re interested in running high-throughput math OCR on-prem or need custom adaptations for your pipeline, feel free to reach out to us!