Product Updates
August 7, 2025
Today, we’re happy to announce the public beta of Forge Extract, our structured extraction engine that turns unstructured PDFs into machine-usable JSON, mapped to your own domain-specific schema.
Whether you’re working with clinical trials, financial disclosures, old product catalogs, scanned forms, or more, Forge Extract will deliver clean, accurate, structured data that’s easy to re-use and build on. It’s powered by Surya and Marker, so you can be confident in its accuracy across a range of PDF types.
The task is this: given an unstructured bit of data (god forbid, a PDF), transform it into a format that’s reusable and scalable, by humans and machines. Not just HTML or Markdown, but structured JSON containing only the information you need. Say this is your schema:
{
  "type": "object",
  "title": "ClinicalTrialExtractionSchema",
  "description": "Schema of inclusion criteria on clinical trials",
  "properties": {
    "inclusion_criteria": {
      "type": "object",
      "description": "Inclusion criteria for eligibility in this trial",
      "properties": {
        "age_range": {
          "type": "string",
          "description": "age range (min to max)"
        },
        "gender": {
          "type": "string",
          "description": "what gender of patients are eligible"
        },
        "urine_protein": {
          "type": "string",
          "description": "urine protein levels for eligibility, if any"
        },
        "platelet_levels": {
          "type": "string",
          "description": "platelet levels for eligibility"
        },
        "serum_albumin": {
          "type": "string",
          "description": "serum albumin levels for eligibility"
        }
      }
    }
  }
}
Now imagine getting a verifiable output containing exactly what you need.
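To make "exactly what you need" a bit more concrete, here is a minimal Python sketch of checking that a result has the shape the schema asks for. Everything in it is an assumption for illustration: the field values are made-up placeholders rather than real Forge Extract output, the schema is a trimmed copy of the one above, and `find_problems` is a hypothetical helper that handles only the `object` and `string` types used here. It is not a full JSON Schema validator and not part of any Forge Extract API.

```python
import json

# Trimmed copy of the schema above (descriptions dropped, four fields kept).
SCHEMA = {
    "type": "object",
    "properties": {
        "inclusion_criteria": {
            "type": "object",
            "properties": {
                "age_range": {"type": "string"},
                "gender": {"type": "string"},
                "platelet_levels": {"type": "string"},
                "serum_albumin": {"type": "string"},
            },
        }
    },
}

# Made-up placeholder values, NOT real Forge Extract output.
EXAMPLE_OUTPUT = json.loads("""
{
  "inclusion_criteria": {
    "age_range": "18 to 65",
    "gender": "all",
    "platelet_levels": "placeholder threshold",
    "serum_albumin": "placeholder threshold"
  }
}
""")

# Only the two JSON Schema types that appear in the schema above.
JSON_TYPES = {"object": dict, "string": str}


def find_problems(value, schema, path="$"):
    """Return a list of mismatches; an empty list means the value fits."""
    if not isinstance(value, JSON_TYPES[schema["type"]]):
        return [f"{path}: expected {schema['type']}"]
    problems = []
    if schema["type"] == "object":
        properties = schema.get("properties", {})
        # Flag anything the schema did not ask for.
        for key in value:
            if key not in properties:
                problems.append(f"{path}.{key}: not in schema")
        # Recurse into the fields that are present (none are required here).
        for key, subschema in properties.items():
            if key in value:
                problems.extend(
                    find_problems(value[key], subschema, f"{path}.{key}")
                )
    return problems


print(find_problems(EXAMPLE_OUTPUT, SCHEMA))  # prints []
```

A check like this only confirms the shape of the result. Verifying that each value matches what the source PDF says is a separate step.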
This is something we used to do manually — pulling information out and storing it in a spreadsheet or database. What if we could do it on all PDFs, of any type, accurately, at scale?
Pulling that off is the holy grail of information extraction tasks. The kind of stuff that unlocks a future of collective intelligence by harnessing the knowledge we’ve so creatively locked away within PDFs over the last few decades.
In 1945, Vannevar Bush wrote a famous essay in The Atlantic called “As We May Think,” in which he envisioned a future where humans had a collective memory machine, aptly named the memex (short for memory expansion).
He recognized, 80 years ago, the problem of increasing volumes of information and wondered how we’d convert that into scalable systems for knowledge. In a way, he predicted much of modern information society (the internet, Wikipedia, you name it).
And though we’ve made significant progress in finding ways to turn individual insights into collective intelligence, we’re faced with the paradox of having more information than ever, yet still struggling to access and organize it, because so much remains trapped inside unstructured, messy PDFs (the nemesis of machines).
Consider for example:
Whether it’s within a team, a company, or a larger group, fixing our data blind spots and meaningfully organizing the knowledge trapped in documents is essential if we want to build on what we already know instead of repeating the same mistakes.
Structured extraction isn’t just a technical problem to solve, but a step towards better institutional memory, progress, and collective problem solving.
Forge Extract is our attempt at tackling this grand problem across many disciplines. We’ve been trying it internally on different types of documents, from financial statements to vintage clothing catalogs to scientific papers.
We’re just getting started, and there’s already a ton of exciting work coming up, including:
For now, we’d love for you to try it out and share what’s working, and what’s not. The public beta is now open and available for all subscribers.
Log in (or sign up) and give it a spin. If you need some credits, help getting started, or want to share your use cases with us, reach out at [email protected]
P.S. You can use it with our API too; for more information, check out our documentation here.