Section Hierarchy Across Long Docs

This example shows how Datalab assigns a consistent section hierarchy across long documents. The document shown is a mathematics textbook of more than 400 pages, organized into sections, chapters, and subsections.

Traditional OCR systems struggle with this because most of them parse documents page by page, treating each page independently and without context from earlier or later pages. Section hierarchy is not a page-level problem: it depends on how headings relate to each other across the full document.

Section headers form a hierarchy across the entire document. A heading may represent a title, a chapter, or a subsection. Without multi-page context, OCR systems often mislabel these headers: a subsection later in the document may be tagged as a top-level section, even though higher-level sections appeared earlier.
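The failure mode can be illustrated with a toy sketch. It assumes a deliberately simple heuristic (rank headers by font size, largest first) and made-up header data. It is not how Datalab, Agni, or any particular OCR system assigns levels. It only shows that the same rule gives different answers depending on whether it sees one page or the whole document.

```python
from collections import defaultdict

# Hypothetical detected headers: (page, text, font_size_pt).
headers = [
    (1, "Part I: Foundations", 24.0),
    (1, "Chapter 1: Sets", 18.0),
    (2, "1.1 Notation", 14.0),
    (3, "1.2 Operations", 14.0),
    (3, "Example 1.2.1", 12.0),
]

def rank_levels(items):
    """Map each distinct font size to a level; the largest size becomes level 1."""
    sizes = sorted({size for _, _, size in items}, reverse=True)
    level_of = {size: i + 1 for i, size in enumerate(sizes)}
    return [(page, text, level_of[size]) for page, text, size in items]

def page_local_levels(items):
    """Rank headers separately on each page, with no cross-page context."""
    by_page = defaultdict(list)
    for item in items:
        by_page[item[0]].append(item)
    result = []
    for page in sorted(by_page):
        result.extend(rank_levels(by_page[page]))
    return result

def document_levels(items):
    """Rank headers once, using every header in the document."""
    return rank_levels(items)

for (_, text, local), (_, _, doc) in zip(page_local_levels(headers),
                                         document_levels(headers)):
    print(f"{text:<22} page-local h{local}   document-level h{doc}")
```

With page-local ranking, "1.1 Notation" is the largest header on its page and is therefore tagged as a top-level section, which is the mislabeling described above. Ranking across the full document places it below the part and chapter headers.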

Datalab treats section hierarchy as a document-level problem. To solve it, we trained a compact model called Agni. Agni processes long documents and assigns consistent header levels to every detected section header.

In this textbook, Datalab assigns h1 to book sections, h2 to chapters, h3 to subsections, and deeper levels to examples and sub-parts. This hierarchy is maintained consistently across more than 400 pages.
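One way a downstream consumer might sanity-check such a hierarchy is to look for headers that jump more than one level deeper than the header before them (for example, an h3 directly after an h1). The sketch below assumes headers arrive as a flat list of hypothetical `(level, title)` tuples. That input shape and the `find_level_skips` helper are illustrative assumptions, not part of Datalab's output format or API, and "no skipped levels" is only one possible reading of consistency.

```python
def find_level_skips(headers):
    """Return headers that are more than one level deeper than the previous header."""
    skips = []
    previous_level = 0  # treat the document root as level 0
    for level, title in headers:
        if level > previous_level + 1:
            skips.append((level, title))
        previous_level = level
    return skips

# Hypothetical header sequence following the h1/h2/h3/deeper scheme.
consistent = [
    (1, "Part I"),
    (2, "Chapter 1"),
    (3, "1.1 Notation"),
    (4, "Example 1.1.1"),
    (2, "Chapter 2"),
    (3, "2.1 Functions"),
]

# Hypothetical sequence in which a subsection directly follows a part header.
skipped = [(1, "Part I"), (3, "1.1 Notation")]

print(find_level_skips(consistent))
print(find_level_skips(skipped))
```

Moving back up by several levels at once (h4 to h2 above) is not flagged, since closing several nested sections together is normal in a long document.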

Datalab produces a stable section hierarchy across multi-page documents. This improves readability, enables accurate tables of contents, and produces structured output that works reliably in downstream pipelines.
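As a rough illustration of why consistent levels matter downstream, the sketch below builds a nested table of contents from a flat list of headers using a stack. The `(level, title)` input shape and the function names are assumptions made for this example. They do not describe Datalab's actual output schema.

```python
def build_toc(headers):
    """Nest a flat list of (level, title) headers into a tree of dicts."""
    root = {"title": None, "level": 0, "children": []}
    stack = [root]  # path from the root to the most recent header
    for level, title in headers:
        node = {"title": title, "level": level, "children": []}
        # Pop until the top of the stack is a strictly shallower header.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]

def render_toc(nodes, depth=0):
    """Return the tree as a list of indented lines, one per header."""
    lines = []
    for node in nodes:
        lines.append("  " * depth + node["title"])
        lines.extend(render_toc(node["children"], depth + 1))
    return lines

# Hypothetical headers with document-level header levels already assigned.
headers = [
    (1, "Part I"),
    (2, "Chapter 1"),
    (3, "1.1 Notation"),
    (2, "Chapter 2"),
    (1, "Part II"),
]

toc = build_toc(headers)
print("\n".join(render_toc(toc)))
```

If a late subsection were mislabeled as level 1, it would appear at the top of this tree next to the parts, which is why a table of contents built this way is only as accurate as the header levels it receives.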