Company Updates
November 13, 2025
The olmOCR benchmark is our favorite external benchmark. The team at AI2 put in a lot of work to make the benchmark fair and reliable. In our experience, it’s the only current OCR benchmark that matches our model vibe checks.
However, as we’ve been training new versions of Chandra, we noticed something odd. Despite the models passing our harder internal edge cases, and generally feeling better in vibe checks, olmOCR scores were flat or slightly lower.
This led us to investigate which tests were failing, and why.
Most OCR benchmarks, including the commonly used olmOCR and OmniDocBench, rely to some extent on string matching - either exact or edit distance.
String matching can be very brittle, especially around equations.
For example, let’s say this is the ground truth:

s_{r,s}=\begin{cases} 2 \text{ if } r-s=2,4 \text{ mod 4},\\ 1 \text{ if } r-s=1,3 \text{ mod 4}. \end{cases}

Then this will be marked incorrect, even though the meaning is identical, because the 4 is in a slightly different place:

s_{r,s}=\begin{cases} 2 & \text{if } r-s=2,4 \text{ mod } 4, \\ 1 & \text{if } r-s=1,3 \text{ mod } 4. \end{cases}

(This is test 2503.07355_pg11_math_002 from the olmOCR benchmark.)
These both render identically:

We noticed that most of the math failures in the olmOCR benchmark weren’t due to the model being wrong - they were due to these small formatting differences.
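To make that concrete, here’s a minimal sketch of the effect using Python’s difflib as a stand-in for the benchmark’s matcher - the actual olmOCR harness does its own normalization and scoring, so treat this purely as an illustration:

```python
import difflib

# Two LaTeX strings that render identically but differ at the character level
# (the ground truth and prediction from the example above).
ground_truth = r"s_{r,s}=\begin{cases} 2 \text{ if } r-s=2,4 \text{ mod 4},\\ 1 \text{ if } r-s=1,3 \text{ mod 4}. \end{cases}"
prediction = r"s_{r,s}=\begin{cases} 2 & \text{if } r-s=2,4 \text{ mod } 4, \\ 1 & \text{if } r-s=1,3 \text{ mod } 4. \end{cases}"

# Character-level similarity comes out below 1.0, so any exact or near-exact
# string check penalizes the prediction despite the identical rendering.
similarity = difflib.SequenceMatcher(None, ground_truth, prediction).ratio()
print(f"similarity: {similarity:.3f}")
```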
This also happens outside of math. This is test 22_689514:

The benchmark expects the string “Will expect you for Lunch at Belmont Wednesday any hour you prefer.”
However, Chandra OCR spits out:
Will expect you for lunch at Belmont
Wednesday any hour you prefer

This has a newline, and lunch is not capitalized. Due to edit distance, this fails, but all the main text and semantic meaning is preserved. This is a test that should pass.
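For cases like this one, even a very light normalization pass recognizes the match. A quick sketch - lowercasing and collapsing whitespace, which is not the benchmark’s actual logic, just an illustration of how little separates the two strings:

```python
import re

def normalize(text: str) -> str:
    # Collapse all whitespace (including newlines) and ignore case.
    return re.sub(r"\s+", " ", text).strip().lower()

expected = "Will expect you for Lunch at Belmont Wednesday any hour you prefer."
predicted = "Will expect you for lunch at Belmont\nWednesday any hour you prefer"

# After normalization, the prediction matches the expected string
# apart from the trailing period.
print(normalize(expected).startswith(normalize(predicted)))  # True
```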
olmOCR has opinionated tests, and those opinions are usually broadly useful. However, they are sometimes highly specific to a given context.
The PDF in test 4f4c20ab94dda3045bedc9bbcd3f4f41d3ff237a_page_1_header_02 includes footnotes:

However, the test penalizes you for including the footnote “1 Prop. 2016/17:1, utg.omr. 12, bet. 2016/17:SfU3, rskr. 2016/17:121.” in the output - in my opinion, this footnote is important to the meaning.
It is very, very hard to label thousands of images perfectly to get a ground truth. Every benchmark set has mistakes in it, and the olmOCR bench is no exception.
Test 26_727870 is a letter that looks like this:

If you look at the top left, you can see where it says:
FRED. BROCKHURST, VICE-PRES'T
E. G. PERKINS, SECRETARY

Note that they are on separate lines. However, the test checks for the presence of the string “BROCKHURST, VICE-PRES'T E.G.”, which fails, because Chandra OCR correctly adds in a newline.
This case is slightly ambiguous because you could argue the newline is subjective. Either way, it should pass with the newline.
A less ambiguous case is 10_810581.
Here, the text says “Republican party” - note the capitalization:

However, the test looks for the string “We want no fusion or amalgamation with the Republican Party, the”, and will fail if you have the correct capitalization from the document. In my opinion, the capitalization doesn’t matter too much either way, and both variants should pass.
We ran a quick experiment to quantify how many of these failing test cases are due to exact string match brittleness. We gave Gemini the Chandra model output, the unit test from olmOCR, and the failure reason, then asked Gemini to check if the test should have passed. We specifically did not feed in the image, to avoid the LLM making decisions about the correct ground truth.
This enables us to pass a test when the output is right but exact matching fails. It doesn’t correct for overly opinionated tests or incorrect ground truth - those are difficult to correct for in an unbiased way.
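Roughly, the check looks like the sketch below. This assumes the google-genai Python SDK and a recent Gemini model; the function, prompt wording, and model name are illustrative rather than the exact harness we ran.

```python
from google import genai

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

def should_have_passed(model_output: str, test_definition: str, failure_reason: str) -> bool:
    """Ask Gemini whether a failed string-match test should actually have passed.

    The page image is deliberately not included, so the judge can only reason
    about formatting differences - it cannot re-decide the ground truth.
    """
    prompt = (
        "An OCR output failed a string-matching unit test.\n"
        "Decide whether the failure is only due to formatting differences "
        "(whitespace, capitalization, equivalent LaTeX, etc.) and the test "
        "should therefore have passed.\n\n"
        f"OCR output:\n{model_output}\n\n"
        f"Unit test:\n{test_definition}\n\n"
        f"Failure reason:\n{failure_reason}\n\n"
        "Answer with exactly PASS or FAIL."
    )
    response = client.models.generate_content(model="gemini-2.5-pro", contents=prompt)
    return response.text.strip().upper().startswith("PASS")
```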
When we ran the olmOCR benchmark with Gemini post-correction, we got these results:

Here is a table with the same data:
| Category | Initial | Corrected |
| --- | --- | --- |
| arxiv_math.jsonl | 84.3% (2466/2927) | 95.2% (2789/2927) |
| baseline | 99.9% (1393/1394) | 99.9% (1393/1394) |
| headers_footers.jsonl | 91.7% (697/760) | 92.8% (706/760) |
| long_tiny_text.jsonl | 92.3% (408/442) | 97.9% (433/442) |
| multi_column.jsonl | 80.8% (714/884) | 90.1% (797/884) |
| old_scans.jsonl | 50.6% (266/526) | 73.0% (384/526) |
| old_scans_math.jsonl | 76.4% (350/458) | 93.0% (426/458) |
| table_tests.jsonl | 88.1% (900/1022) | 94.8% (969/1022) |
| Overall | 85.5% (7194/8413) | 93.9% (7897/8413) |

Gemini post-correction improves the overall score by 8.4 percentage points (+703 tests).

We estimate that 3-4% of the tests have incorrect ground truth, or are opinionated in a way that isn’t universally useful.
This means that the maximum possible score on this benchmark is 96-97%. Given how close we are to this limit, it is no longer a strong signal for us if a model improves on this benchmark.
I think we can draw two conclusions from this experiment.

The first is that Chandra’s real score on this benchmark is somewhere between 93.9% and 95%, depending on how many of the remaining failures come from broken or opinionated tests. I don’t want to minimize the amazing work that went into this benchmark, and others like it. But I do think this experiment indicates that we’re close to the limit of what traditional edit distance based benchmarks can tell us.
You see something similar with frontier LLMs - labs still publish benchmarks, but people mostly eval on vibes or their own internal metrics.
Second, we need wider domain coverage in our benchmarks. olmOCR is mostly English text from arXiv or the Internet Archive, which does not cover a lot of important OCR corner cases. I’d love to see a benchmark where a state of the art model scores 70%, without the issues introduced by edit distance.
I think we need a fundamentally new approach to benchmarking OCR models. We’re working on something now - let us know if you’re interested in collaborating! (feel free to email [email protected])