Product Updates
August 19, 2025
We’re excited to announce that Structured Extraction now supports citations so you can verify where information we extract comes from within your source document. This is supported natively within our visual editor, Forge, and within our API, so you can create audit trails around extracted information in your document processing pipelines.
Often, parsing PDFs and converting them to formats like Markdown or JSON isn’t enough, and many RAG implementations are an inefficient means to an end. What you actually need is specific information you care about in a format that’s easy to read, index, and act on.
We’ve even seen some of our users take Markdown output from marker to build their own structured extraction pipelines using large language models. While this works, it requires a bit of upkeep to ensure you can:
Certainly not for the faint of heart!
This is exactly why we launched Structured Extraction: to streamline this process so all you have to worry about is defining what you need, and let us handle accurate, precise extraction at scale.
Now, with citations, you’ll be able to verify and create audit trails around where information came from, making it easy to evaluate schema accuracy, create defensibility around decisions that leverage AI-native workflows, and properly trace where downstream information came from.
Let’s walk through a few examples to make this more concrete.
If you’re building AI-powered experiences (think chatbots, search experiences based on RAG, or agentic workflows) on top of knowledge systems from PDFs (manuals, SOPs, documentation, and so on), you know firsthand that hallucinations kill trust among your users, both internal and external.
Accuracy and trust are the main axes you need to operate on. By leveraging citations, you can ensure all responses are properly grounded in your source documents (however large and complex) and minimize error propagation in your document processing workflows.
Maybe your regulatory teams are making decisions based on legislation, and your department is exploring using AI to streamline some of those workflows.
As we’ve all come to appreciate, AI cannot be held accountable for decisions, so it’s crucial to give your team defensibility for decisions they make using AI-native workflows.
By leveraging citations from our extraction pipeline, you always have something real to point to when synthesizing information from large sets of long documents, whether in the moment when making a decision or years down the road when someone wants to understand why a decision was made.
Financial documents are a goldmine of valuable information. Maybe you’re looking at 10-K filings to do extraction for downstream comparative analyses, or doing extraction on franchise disclosure documents (FDDs) to make franchising or investment decisions. Regardless of the specific use case, we know that accuracy at scale is paramount.
In the above example, we show extraction from tables from a 10-K filing. Once you verify your schema and accuracy, you can run it on multiple documents to do downstream comparative analyses for companies of interest.
Here’s another example of structured extraction from a Franchise Disclosure Document. In this case we pull information out of a complex, multi-page table and cite back to the original cell it came from.
If you’re part of a data science team at an investment firm building models and systems for automated investment decisions, having source document citations makes it much easier for your team to test hypotheses, run experiments, and iterate as you see what’s working (or what’s not) and why.
Running accurate, auditable inference at scale is challenging but necessary, especially when decisions need to go through processes such as medical-legal review.
Structured Extraction with Citations creates a clear path to streamlining these workflows without worrying about the risk of hallucinations or other types of errors slipping through the system.
Any citation we show within Forge can also be recovered from our Extraction API responses by looking at the citation field and finding the block IDs we cite. More concretely, with our earlier example, each extracted field in the Extraction API response has a corresponding _citation field that references block IDs from the document after our PDF conversion runs with marker.
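As a minimal sketch of reading those citations in code, the snippet below pairs each extracted value with the block IDs cited for it. The response shape is an assumption made for illustration: a flat JSON object in which every extracted field `name` has a sibling key `name_citation` holding a list of block IDs. The field names, the block ID strings, and the `split_citations` helper are all hypothetical, so compare them against a real API response before relying on them.

```python
from typing import Any

# Assumed suffix for citation keys; adjust it to match the real response.
CITATION_SUFFIX = "_citation"


def split_citations(result: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Pair each extracted field with the block IDs cited for it."""
    paired: dict[str, dict[str, Any]] = {}
    for key, value in result.items():
        if key.endswith(CITATION_SUFFIX):
            continue  # citation keys are attached to their field below
        cited = result.get(key + CITATION_SUFFIX) or []
        paired[key] = {"value": value, "block_ids": list(cited)}
    return paired


# Hypothetical response fragment for illustration, not real API output.
example = {
    "total_revenue": "1,234",
    "total_revenue_citation": ["/page/3/Table/2"],
    "fiscal_year": "2024",
    "fiscal_year_citation": ["/page/0/Text/5"],
}

fields = split_citations(example)
for name, info in fields.items():
    print(name, info["value"], info["block_ids"])
```

Keeping the suffix in a single constant means only one line changes if the actual key naming differs from what is assumed here.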
So you have full programmatic access to citations of blocks directly within your source document after extraction runs. And the best part is that our system can work entirely on-prem, so you never have to worry about sensitive IP leaving your network.
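To make the audit-trail idea concrete, here is a rough sketch that resolves cited block IDs back to the blocks they point at and flags any field with nothing to point to. It assumes you already have a mapping from field name to cited block IDs, plus a list of blocks from the converted document where each block carries `id`, `page`, and `text` keys. Those key names, the sample data, and the `build_audit_trail` helper are hypothetical and may not match the actual output format.

```python
from typing import Any


def build_audit_trail(
    citations: dict[str, list[str]],
    blocks: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Resolve each field's cited block IDs to the block's page and text."""
    by_id = {block["id"]: block for block in blocks}
    trail = []
    for field, block_ids in citations.items():
        # Skip IDs that are not present in the converted document.
        sources = [by_id[b] for b in block_ids if b in by_id]
        trail.append({
            "field": field,
            "sources": [
                {"id": s["id"], "page": s["page"], "text": s["text"]}
                for s in sources
            ],
            "grounded": bool(sources),  # False: nothing to point back to
        })
    return trail


# Hypothetical inputs for illustration only.
citations = {"franchise_fee": ["/page/7/Table/1"], "royalty_rate": []}
blocks = [
    {"id": "/page/7/Table/1", "page": 7, "text": "Initial franchise fee: $45,000"},
]

trail = build_audit_trail(citations, blocks)
for record in trail:
    print(record["field"], record["grounded"], record["sources"])
```

Records with `grounded` set to `False` are natural candidates for manual review before the extracted value is used downstream.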
In case you can’t tell, this is a feature we’re super excited about. On a personal level, it’s just awesome seeing where information’s coming from. Bigger picture, we have high conviction that this will help companies embracing AI do so effectively, since it offers the best of both worlds: AI-augmented workflows without compromising trust and accountability.
If you want to learn more about how Structured Extraction can help you, we’d love to walk you through it: drop us a note at [email protected]