Product Updates

3 min read

Cracking Math OCR: How We’re Unlocking High-Quality Data for Reasoning Models

September 19, 2025

(Adapted from Vik's X post on September 9, 2025)

High-quality mathematical text is one of the most important ingredients for building reasoning-capable language models. Papers like DeepSeekMath and NVIDIA’s Nemotron demonstrate that continued pre-training on math-rich corpora significantly boosts reasoning accuracy. Hugging Face’s SmolLM work shows that filtering out low-quality math tokens (keeping only the most precise) leads to better performance, even when the total token count drops.
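To make the filtering idea concrete, here is a minimal sketch of quality-based corpus filtering in the spirit of the SmolLM-style approach described above. The scoring function is a stand-in heuristic (density of math symbols), not the actual classifier those teams used, and the threshold is an arbitrary illustrative value.

```python
# Sketch: keep only documents that clear a math-quality threshold.
# quality_score is a crude stand-in heuristic, not a trained classifier.
import re

MATH_CHARS = re.compile(r"[=+\-*/^_{}\\∑∫√≤≥]")

def quality_score(doc: str) -> float:
    """Crude proxy for math quality: fraction of characters that look like math notation."""
    if not doc:
        return 0.0
    return len(MATH_CHARS.findall(doc)) / len(doc)

def filter_corpus(docs: list[str], threshold: float = 0.05) -> list[str]:
    """Keep only documents whose math-density score clears the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]

corpus = [
    "The weather was pleasant and the committee adjourned early.",
    "Let f(x) = x^2 + 3x - 2. Then f'(x) = 2x + 3 and f'(1) = 5.",
]
kept = filter_corpus(corpus)  # drops the math-free document, keeps the derivation
```

In a real pipeline the score would come from a trained quality classifier, but the shape of the filter is the same: fewer, better tokens.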

Unfortunately, the best math data is often buried in decades-old research papers and textbooks, trapped behind complex layouts and finicky PDFs.

Why Math OCR Is So Hard

The path from scanned PDF to clean LaTeX isn’t straightforward. A typical math paper might contain dense formulas, inline symbols, proofs split across pages, or references embedded in diagrams. OCR systems can stumble on surprisingly small details: for example, mistaking the Greek letter τ for an F may look like a minor edit-distance error, but it completely changes the meaning of an equation.
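The τ-versus-F confusion can be made concrete: by string-level metrics the damage is nearly invisible, yet the output denotes something else entirely. A quick sketch (the equation is an invented example, not one from the benchmark documents):

```python
# A single-character OCR substitution (τ read as F) is a one-edit string
# change, but it turns a time constant into an unrelated symbol.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

eq_true = "x(t) = e^{-t/τ}"  # exponential decay with time constant τ
eq_ocr  = "x(t) = e^{-t/F}"  # τ misread as F

dist = levenshtein(eq_true, eq_ocr)  # edit distance of just 1
```

An edit distance of 1 looks like a near-perfect transcription to a character-level metric, which is why math OCR has to be judged on meaning, not just string similarity.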

Things get tougher once you leave English-language publications. Notation varies across regions; typesetting styles differ; some journals use older fonts or unconventional macros.

GPT-5 gives up on OCRing this, but not before making a lot of mistakes first.

Layout diversity adds another layer of difficulty: textbooks interleave formulas with illustrations, tables, and references to figures that must stay linked.

All of this means that math OCR needs to be close to 100% accurate, handle different languages, and work across layouts.

Beyond Closed Tools: Open-Source Progress

Many practitioners rely on closed-source services like MathPix for math OCR. These tools perform well but come with limitations: they’re expensive, rate-limited, and not easily customizable for on-premise use.

The good news is that open-source models have caught up. At Datalab, we’ve been extending our open-source projects, Marker and Surya, to handle complex mathematical content. Marker now achieves state-of-the-art results on the external olmOCR benchmark and has outperformed MathPix in internal evaluations by a tier-one AI research lab.

Data from a tier-one AI research lab

On documents where GPT-5 misread τ as F, Marker produced flawless output.

Even on challenging multilingual examples, it kept errors minimal — and we’re actively improving its handling of Chinese and other scripts.

You can try Marker and Surya for yourself.

What’s Next

We’re continuing to push the limits of math OCR with:

  • Smarter layout analysis and text-quality heuristics
  • Significant speedups through multi-token prediction
  • Better multilingual support
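As a back-of-the-envelope illustration of why multi-token prediction helps (the numbers below are illustrative, not Marker’s actual figures): if the decoder emits k tokens per forward pass instead of one, the number of autoregressive passes per page drops by roughly a factor of k.

```python
# Rough arithmetic for multi-token prediction: emitting k tokens per
# forward pass cuts the autoregressive pass count by about k×.
import math

def decoder_passes(num_tokens: int, tokens_per_pass: int = 1) -> int:
    """Forward passes needed to emit num_tokens autoregressively."""
    return math.ceil(num_tokens / tokens_per_pass)

page_tokens = 1200  # illustrative token count for one dense math page
baseline = decoder_passes(page_tokens, 1)  # one token per pass
mtp      = decoder_passes(page_tokens, 4)  # four tokens per pass
speedup  = baseline / mtp                  # ≈4× fewer decoder passes
```

Real-world gains depend on acceptance rates and per-pass overhead, so the realized speedup is lower than this idealized bound, but the direction of the win is the same.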

If you’re interested in running high-throughput math OCR on-premise or need custom adaptations for your pipeline, feel free to reach out to us.