
Launch Week - Day 3: Introducing Agni: Solving Multi-Page Section Hierarchy in OCR

December 3, 2025

TL;DR: Section hierarchy trips up almost every OCR system. We fixed this by training a new model called Agni.

The Problem: Inconsistent Section Hierarchy

OCR traditionally works at the page level: the contents of page 1 have no bearing on how the contents of page 10 are decoded. Datalab’s parsing has always been page-level and layout-aware: we detect and segment visual layout elements while decoding the contents.

Section headers, however, introduce an additional challenge. They form a hierarchical structure across an entire multi-page document. A heading might be a title, a chapter start, a subsection, or a sub-subsection. In long documents, this hierarchy must remain consistent. Without multi-page context, page-level OCR can easily misclassify headings: a subsection on page 3 may be labeled an <h1>, even if an <h1> already appeared on page 1 for a section or title.
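
To make the failure mode concrete, here is a toy sketch of why per-page decisions drift. It assumes font size is the only cue and that a larger font means a higher-level heading. The `Heading` type, the sample data, and both functions are hypothetical illustrations, not how Datalab or Agni actually assigns levels.

```python
from dataclasses import dataclass


@dataclass
class Heading:
    page: int
    text: str
    font_size: float  # hypothetical visual cue, for illustration only


def page_level_tags(headings):
    """Rank headings using only the font sizes seen on their own page."""
    by_page = {}
    for h in headings:
        by_page.setdefault(h.page, []).append(h)
    tags = []
    for page in sorted(by_page):
        # Largest font on this page becomes level 1, next largest level 2, ...
        sizes = sorted({h.font_size for h in by_page[page]}, reverse=True)
        for h in by_page[page]:
            tags.append((h.text, sizes.index(h.font_size) + 1))
    return tags


def document_level_tags(headings):
    """Rank headings using the font sizes seen across the whole document."""
    sizes = sorted({h.font_size for h in headings}, reverse=True)
    return [(h.text, sizes.index(h.font_size) + 1) for h in headings]


headings = [
    Heading(page=1, text="Paper Title", font_size=24.0),
    Heading(page=1, text="1 Introduction", font_size=16.0),
    Heading(page=3, text="2.1 Datasets", font_size=12.0),
]

print(page_level_tags(headings))      # "2.1 Datasets" is ranked level 1 on its own page
print(document_level_tags(headings))  # "2.1 Datasets" is ranked level 3 with full context
```

With only page 3 in view, the subsection is the biggest heading around and gets promoted to level 1. With the whole document in view, it lands at level 3.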

Correctly identifying and normalizing section headers dramatically improves readability for both humans and LLMs. A stable hierarchy also enables highly accurate, auto-generated tables of contents. This is critical for RAG and other downstream pipelines.
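
As a rough illustration of why a stable hierarchy makes a table of contents easy to generate, the sketch below nests a flat list of headings into a tree. It assumes the headings have already been extracted as `(level, text)` pairs in reading order. The function name and the input format are assumptions made for this example, not part of Datalab's API.

```python
def build_toc_tree(headings):
    """Nest (level, text) pairs into a tree, using a stack of open ancestors."""
    root = {"level": 0, "text": None, "children": []}
    stack = [root]
    for level, text in headings:
        node = {"level": level, "text": text, "children": []}
        # Close every open heading that is at the same depth or deeper.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


def render_toc(nodes, depth=0):
    """Render the tree as indented plain-text lines."""
    lines = []
    for node in nodes:
        lines.append("  " * depth + node["text"])
        lines.extend(render_toc(node["children"], depth + 1))
    return lines


toc = build_toc_tree([
    (1, "Paper Title"),
    (2, "1 Introduction"),
    (3, "1.1 Background"),
    (2, "2 Methods"),
])
print("\n".join(render_toc(toc)))
```

If a subsection were mislabeled as level 1 partway through the document, it would become a new top-level entry and every heading after it would nest under the wrong parent.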

Our Solution: Multi-Page Semantic Understanding

Determining section hierarchy is a semantic, document-level problem. Documents routinely span hundreds of pages, and assigning correct header levels requires information across all pages.

To solve this, we trained a compact, efficient model - codenamed “Agni” (yes, the naming pattern continues). Agni processes extremely long documents and assigns consistent section header levels to every detected section header block.

This model has been live in our API for the past few days, and now runs by default across all parsing modes. Even on 100+ page documents, Agni adds less than 100 ms of latency per document.

Examples

Research Paper

In this example, Datalab correctly assigns <h1> to the title, <h2> to each major section, and <h3> to subsections - and maintains this consistent hierarchy across all pages.

Mathematics Textbook

For a much longer document, Datalab assigns <h1> to each book section, <h2> to its chapters, <h3> to subsections, and <h4> to sub-subsections. This textbook also contains examples with varying numbering that appear between sections and sub-subsections. The parsed output assigns <h5> to all of these examples, and goes further, assigning <h6> to sub-parts of examples. This hierarchy is maintained consistently across more than 400 pages.
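
One way to spot-check a result like this on your own documents is to tally the heading tags in the parsed output. The sketch below assumes the output is an HTML string containing `<h1>` through `<h6>` tags. The sample string is invented for illustration, and the regex is a quick check rather than a full HTML parser.

```python
import re
from collections import Counter

# Matches <h1>...</h1> through <h6>...</h6>, with optional attributes.
HEADING_RE = re.compile(r"<h([1-6])\b[^>]*>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)


def heading_levels(html):
    """Return (level, text) pairs for every heading, in document order."""
    return [(int(m.group(1)), m.group(2).strip()) for m in HEADING_RE.finditer(html)]


def level_counts(html):
    """Count how many headings were assigned to each level."""
    return Counter(level for level, _ in heading_levels(html))


# Invented sample output, for illustration only.
sample = (
    "<h1>Part I</h1><p>...</p>"
    "<h2>Chapter 1</h2><h5>Example 1.1</h5><h6>Part (a)</h6>"
    "<h2>Chapter 2</h2>"
)
print(heading_levels(sample))
print(sorted(level_counts(sample).items()))
```

A sudden extra <h1> deep inside a long document, or examples split across two different levels, is a quick signal that the hierarchy has drifted.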

What’s Next

We’re actively improving this feature and rolling out expanded support for documents exceeding 1,000 pages. If you have interesting or challenging cases, we’d love to see them - reach out at [email protected].
