SEC filings are a rich source of financial information about companies, but they can be difficult to parse at scale. Pulling the right information out accurately lets teams back their decisions with evidence from those documents.
In this case, we took an SEC filing from Novo Nordisk and used Datalab’s auto-segmentation feature to automatically group the 46-page file into 32 segments, each inferred from the table of contents.
Unlike the document segmentation strategy, which detects document-level boundaries between pages, the table of contents strategy infers sections from the table of contents and then finds the page boundaries of each one.
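To make the idea concrete, here is a minimal sketch of how page ranges could be derived from table of contents entries. It is not Datalab’s implementation: `TocEntry`, `toc_to_page_ranges`, and the sample titles and page numbers are hypothetical, and it assumes each entry only tells us the page on which a section starts.

```python
from dataclasses import dataclass


@dataclass
class TocEntry:
    title: str
    start_page: int  # 1-indexed page where the section begins (assumed known)


def toc_to_page_ranges(toc, total_pages):
    """Return (title, first_page, last_page) for each TOC entry.

    A section is assumed to run up to and including the page where the
    next section starts, because that page may be shared by both.
    """
    entries = sorted(toc, key=lambda e: e.start_page)  # stable for ties
    ranges = []
    for i, entry in enumerate(entries):
        if i + 1 < len(entries):
            last_page = entries[i + 1].start_page
        else:
            last_page = total_pages  # final section runs to the end
        ranges.append((entry.title, entry.start_page, last_page))
    return ranges


# Hypothetical entries, not taken from the actual filing.
toc = [
    TocEntry("Risk Factors", 1),
    TocEntry("Legal Proceedings", 3),
    TocEntry("Signatures", 3),
]
ranges = toc_to_page_ranges(toc, total_pages=5)
```

Note that page 3 appears in all three ranges here, which is exactly why page-level ranges alone are too coarse when sections share a page.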
This is trickier because a section can change partway through a page, and a single page can contain several sections. Detecting these mid-page shifts accurately, and more importantly, segmenting content at the actual boundary rather than at the page level, is crucial for RAG pipelines that index content from each section separately.
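The difference between page-level and boundary-level segmentation can be sketched in a few lines. This is an illustration under simplifying assumptions, not how Datalab detects boundaries: `split_by_headings` is a hypothetical helper, section headings are assumed to appear verbatim and in order in the extracted page text, and the table of contents page itself is assumed to be excluded from `pages`.

```python
def split_by_headings(pages, headings):
    """Cut the document at heading offsets instead of at page breaks."""
    full_text = "\n".join(pages)

    # Locate each heading, searching forward so order is preserved.
    starts = []
    cursor = 0
    for heading in headings:
        idx = full_text.find(heading, cursor)
        if idx == -1:
            continue  # heading not found verbatim; skip it in this sketch
        starts.append((heading, idx))
        cursor = idx + len(heading)

    # Each segment runs from its heading to the next heading found.
    segments = {}
    for i, (heading, start) in enumerate(starts):
        end = starts[i + 1][1] if i + 1 < len(starts) else len(full_text)
        segments[heading] = full_text[start:end].strip()
    return segments


# Hypothetical two-page document; the first page holds two sections.
pages = [
    "Risk Factors\nCurrency risk is material.\nLegal Proceedings\nNone pending.",
    "Still none.\nSignatures\nSigned.",
]
segments = split_by_headings(
    pages, ["Risk Factors", "Legal Proceedings", "Signatures"]
)
```

Here "Legal Proceedings" starts in the middle of the first page and ends in the middle of the second, so a RAG index built from these segments would not mix its text with the neighboring sections. Real filings need fuzzier matching than `str.find`, since headings are often reformatted or wrapped during extraction.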
Datalab’s platform and API make evaluating and orchestrating these workflows easy for teams so they can test, see, iterate, and scale their document ingestion pipelines.