Healthcare Policy Segmentation

We often run into what we call “digitally stapled PDFs”: files that began as separate documents but were combined into one for convenience (to create a packet, to support a batch delivery process, or to simplify email attachments).

When teams re-process these files later, they have to split them back into their original documents so that each one can be routed to the appropriate extraction or indexing pipeline.

Datalab’s Segmentation feature automatically detects the boundaries between documents in your file and returns the page boundaries for each group. With those boundaries in hand, you can apply post-processing dynamically, for example by using a different extraction schema for each segment.
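To make the per-segment routing concrete, here is a minimal sketch in Python. It does not call Datalab's API. The Segment shape, the document-type labels, and the schemas are all hypothetical stand-ins, and it assumes each segment arrives with some label you can route on in addition to its page range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One detected document inside a stapled PDF (hypothetical shape)."""
    label: str        # assumed document-type label, e.g. "policy"
    start_page: int   # first page of the segment, 0-based, inclusive
    end_page: int     # last page of the segment, inclusive


# Hypothetical, deliberately flat schemas: one small schema per document type.
SCHEMAS = {
    "policy": {"policy_number": "string", "effective_date": "string"},
    "claim_form": {"claim_id": "string", "billed_amount": "number"},
}
# Used when a segment's label has no dedicated schema.
FALLBACK_SCHEMA = {"title": "string"}


def plan_extraction(segments):
    """Pair each segment's page range with the schema chosen for its label."""
    plan = []
    for seg in segments:
        schema = SCHEMAS.get(seg.label, FALLBACK_SCHEMA)
        plan.append({"pages": (seg.start_page, seg.end_page), "schema": schema})
    return plan


# Example input, as if produced by a segmentation step (values are made up).
segments = [
    Segment("policy", 0, 11),
    Segment("claim_form", 12, 14),
    Segment("cover_letter", 15, 15),
]
plan = plan_extraction(segments)
```

Each entry in the resulting plan is an independent unit of work: a page range plus the one schema that applies to it. An unrecognized segment type falls back to a generic schema rather than failing the whole file.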

A strategy like this can substantially improve extraction speed, because each segment can be processed in parallel. More importantly, a series of simpler, flatter extraction schemas yields more consistent and accurate results than a single large schema applied to the whole file.
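The parallelism point can be sketched with Python's standard library. The extract_segment function below is a placeholder that only counts pages. In a real pipeline it would send one segment and its schema to whatever extraction service you use, and the job tuples are an assumed format.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_segment(job):
    """Stand-in for a real extraction request; returns a placeholder result."""
    first_page, last_page, schema_name = job
    return {"schema": schema_name, "page_count": last_page - first_page + 1}


def extract_all(jobs, max_workers=4):
    """Run one extraction per segment concurrently."""
    # pool.map preserves input order, so results line up with jobs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_segment, jobs))


# Hypothetical jobs: (first page, last page, schema name), pages inclusive.
jobs = [(0, 11, "policy"), (12, 14, "claim_form")]
results = extract_all(jobs)
```

Threads suit this case because each job would spend most of its time waiting on a network response. Since the segments are independent, the total wall-clock time approaches that of the slowest segment instead of the sum of all of them.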

Datalab’s platform and API make it easy for teams to evaluate and orchestrate these workflows, so they can test, inspect, iterate on, and scale their document ingestion pipelines.