Introducing Forge Extract: Turn PDFs into Structured JSON

August 7, 2025

Announcing the Forge Extract Public Beta

Today, we’re happy to announce the public beta of Forge Extract, our structured extraction engine that turns unstructured PDFs into machine-usable JSON mapped to your own domain-specific schema.

Whether you’re working with clinical trials, financial disclosures, old product catalogs, scanned forms, or more, Forge Extract will deliver clean, accurate, structured data that’s easy to reuse and build on. It’s powered by Surya and Marker, so you can be confident in its accuracy across a range of PDF types.

The task is simple to state: given an unstructured bit of data (god forbid, a PDF), transform it into a format that’s reusable and scalable, for humans and machines. Not just HTML or Markdown, but structured JSON containing only the information you need. For example, here is a schema for pulling inclusion criteria out of a clinical trial:

{
    "type": "object",
    "title": "ClinicalTrialExtractionSchema",
    "description": "Schema of inclusion criteria on clinical trials",
    "properties": {
        "inclusion_criteria": {
            "type": "object",
            "description": "Inclusion criteria for eligibility in this trial",
            "properties": {
                "age_range": {
                    "type": "string",
                    "description": "age range (min to max)"
                },
                "gender": {
                    "type": "string",
                    "description": "what gender of patients are eligible"
                },
                "urine_protein": {
                    "type": "string",
                    "description": "urine protein levels for eligibility, if any"
                },
                "platelet_levels": {
                    "type": "string",
                    "description": "platelet levels for eligibility"
                },
                "serum_albumin": {
                    "type": "string",
                    "description": "serum albumin levels for eligibility"
                }
            }
        }
    }
}

Now imagine getting a verifiable output containing exactly what you need.
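
To make that concrete, here is a minimal Python sketch of what such an output could look like and how it might be checked against a schema. It is an illustration under stated assumptions, not Forge Extract’s actual response format: the sample values are made-up placeholders, `find_problems` is a hypothetical helper, and the schema is a trimmed copy with three of the fields from the example above.

```python
import json

# Trimmed copy of the example schema (three fields, descriptions omitted).
SCHEMA = {
    "type": "object",
    "properties": {
        "inclusion_criteria": {
            "type": "object",
            "properties": {
                "age_range": {"type": "string"},
                "gender": {"type": "string"},
                "platelet_levels": {"type": "string"},
            },
        }
    },
}


def find_problems(value, schema, path="$"):
    """Return a list of mismatches between a value and a schema.

    Hypothetical helper: it only understands the two JSON Schema types
    used in this example ("object" and "string"). It also flags keys that
    are missing from the schema, which is stricter than standard JSON
    Schema (extra keys are allowed there by default).
    """
    problems = []
    expected = schema.get("type")
    if expected == "object":
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        properties = schema.get("properties", {})
        for key, child in value.items():
            if key not in properties:
                problems.append(f"{path}.{key}: not in schema")
            else:
                problems.extend(
                    find_problems(child, properties[key], f"{path}.{key}")
                )
    elif expected == "string":
        if not isinstance(value, str):
            problems.append(f"{path}: expected string")
    return problems


# Made-up placeholder values, NOT taken from a real trial or from a real
# Forge Extract response.
sample_output = json.loads("""
{
  "inclusion_criteria": {
    "age_range": "18 to 65",
    "gender": "all",
    "platelet_levels": "placeholder text"
  }
}
""")

problems = find_problems(sample_output, SCHEMA)
print(problems or "output matches the schema")
```

In practice, a full validator such as the `jsonschema` package would cover the rest of the JSON Schema vocabulary; the hand-rolled check here just keeps the sketch dependency-free.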

This is something we used to do manually — pulling information out and storing it in a spreadsheet or database. What if we could do it on all PDFs, of any type, accurately, at scale?

Pulling that off is the holy grail of information extraction. It’s the kind of thing that unlocks a future of collective intelligence by harnessing the knowledge we’ve so creatively locked away within PDFs over the last few decades.

Collective Intelligence

In 1945, Vannevar Bush wrote a famous essay in The Atlantic called “As We May Think,” in which he envisioned a future where humans had a collective memory machine, aptly named the memex (short for memory expansion).

He recognized, 80 years ago, the problem of increasing volumes of information and wondered how we’d convert that into scalable systems for knowledge. In a way, he predicted much of modern information society (the internet, Wikipedia, you name it).

And though we’ve made significant progress in finding ways to turn individual insights into collective intelligence, we’re faced with the paradox of having more information than ever, yet still struggling to access and organize it, because so much remains trapped inside unstructured, messy PDFs (the nemesis of machines).

Consider, for example:

  • Clinical trials with eligibility criteria buried in dense text, with slightly different formats and structure in each trial
  • Internal reports, product specs, or audit logs that no one can easily search or structure
  • Health records containing useful information about adverse effects of drugs tucked away in handwritten notes that get digitized

Whether it’s within a team, company, or a larger group, fixing our data blind spots and meaningfully organizing the knowledge trapped in documents is essential if we want to build on what we already know instead of repeating the same mistakes.

Structured extraction isn’t just a technical problem to solve, but a step towards better institutional memory, progress, and collective problem solving.

Try Forge Extract Today

Forge Extract is our attempt at tackling this grand problem across many disciplines. We’ve been trying it internally on different types of documents, including financial statements, vintage clothing catalogs, scientific papers, and more.

We’re just getting started, and there’s already a ton of exciting work coming up, including:

  • Auditability: so you can trace where information came from. Crucial for compliance, trust, or regulated reporting.
  • Performance: better latency and accuracy on large documents and complex schemas.
  • Dedicated model: purpose-built just for structured extraction ;)
  • Benchmarks: in our initial internal tests, it’s doing better than Gemini. Stay tuned for more here.

For now, we’d love for you to try it out and share what’s working, and what’s not. The public beta is now open and available for all subscribers.

Log in (or sign up) and give it a spin. If you need some credits, help getting started, or want to share your use cases with us, reach out at [email protected]

P.S. You can use it with our API too; for more information, check out our documentation here.