Product Updates
September 24, 2025
We’re excited to announce a new feature for Datalab users: Document Segmentation. You can now automatically detect the page boundaries between multiple documents that have been combined into a single PDF file.
It comes in two modes:
Anyone who has applied for rentals in NYC knows that feeling of combining multiple documents - reference letters, photo IDs, financial statements, retirement accounts, and more - into one massive 80-page packet.
We also see this happen in serious archival situations, when backlogs of physical content get digitized on scanners. At that scale, it’s tough to have someone manually oversee each scan job and put separators between different documents, so pages get scanned together and different documents all end up in one file.
We’ve come to refer to this as digitally stapling PDFs (or really, removing those staples). Like many important problems, this is a downstream consequence of decisions made in the name of convenience.
That convenience creates a number of issues later on. What if you want to extract key terms and clauses from rental application packets, but don’t want to run your entire pipeline on all 80+ pages? A document that long can blow out your entire context window if you’re just using commercial LLMs. Worse, combining multiple documents might also force you to use long, deeply nested extraction schemas, which increase the complexity, duration, and error rate of your extraction pipelines.
We launched Segmentation to detect page boundaries between document types in these large files, so you can split them back into their original documents and then index or process your content more efficiently.
Consider a simple example from local city council meetings in NYC. Content is provided in one PDF file, but within, you’ll find a combination of different testimonies, supporting material, and appearance cards.
Auto Segmentation does a splendid job identifying the page range for each type of document in the file. In Forge, you can see which pages we identified for each segment and use any one of them as a starting point for running extraction. More importantly, you can also inspect the raw JSON response to understand how our API returns your results.
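To give a feel for working with a response like this, here is a minimal sketch that reads a list of segments and checks that they cover every page exactly once. The JSON shape and field names (`page_count`, `segments`, `type`, `start_page`, `end_page`) are placeholders made up for illustration, not the actual Datalab response schema, so check the raw JSON in Forge for the real structure.

```python
import json

# Hypothetical response shape: these field names are placeholders for
# illustration, not the documented Datalab schema.
raw = """
{
  "page_count": 6,
  "segments": [
    {"type": "testimony", "start_page": 1, "end_page": 3},
    {"type": "supporting_material", "start_page": 4, "end_page": 5},
    {"type": "appearance_card", "start_page": 6, "end_page": 6}
  ]
}
"""


def check_segments(response):
    """Return (type, start, end) tuples, 1-indexed and inclusive.

    Raises ValueError if segments overlap, leave gaps, or miss pages.
    """
    segments = sorted(response["segments"], key=lambda s: s["start_page"])
    expected_start = 1
    result = []
    for seg in segments:
        start, end = seg["start_page"], seg["end_page"]
        # Each segment must begin right after the previous one ends.
        if start != expected_start or end < start:
            raise ValueError(f"unexpected page range {start}-{end}")
        result.append((seg["type"], start, end))
        expected_start = end + 1
    if expected_start != response["page_count"] + 1:
        raise ValueError("segments do not cover every page")
    return result


ranges = check_segments(json.loads(raw))
for doc_type, start, end in ranges:
    print(f"{doc_type}: pages {start}-{end}")
```

A check like this is cheap to run before any downstream step, and it catches off-by-one mistakes (for example, mixing 0-indexed and 1-indexed pages) early.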
Once you have these page boundaries, it’s much simpler to separate your PDF into sub-documents and then apply extraction or some other processing task on each segment.
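As a rough sketch of that splitting step, the snippet below writes one PDF per segment using the open-source `pypdf` library. It assumes you have already turned the segmentation result into `(label, start_page, end_page)` tuples with 1-indexed, inclusive page numbers; that tuple format, the `split_pdf` and `to_page_indices` helpers, and the file names are our own illustration rather than part of the Datalab API.

```python
from pathlib import Path


def to_page_indices(start_page, end_page, page_count):
    """Convert a 1-indexed inclusive page range to 0-indexed page indices."""
    if not 1 <= start_page <= end_page <= page_count:
        raise ValueError(
            f"invalid range {start_page}-{end_page} for {page_count} pages"
        )
    return range(start_page - 1, end_page)


def split_pdf(pdf_path, segments, out_dir):
    """Write one PDF per (label, start_page, end_page) segment.

    Returns the list of paths written, in segment order.
    """
    # Imported here so to_page_indices stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for index, (label, start, end) in enumerate(segments, start=1):
        writer = PdfWriter()
        for page_index in to_page_indices(start, end, len(reader.pages)):
            writer.add_page(reader.pages[page_index])
        target = out_dir / f"{index:02d}_{label}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written


# Example call (hypothetical file name and segments):
# split_pdf(
#     "council_meeting.pdf",
#     [("testimony", 1, 3), ("appearance_card", 4, 4)],
#     "segments_out",
# )
```

Each output file can then go through its own extraction schema, which keeps schemas small and avoids sending the full packet through your pipeline.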
Segmentation is live in Forge and in our API. We’ve included a full code sample here.