Parse PDFs *Just the Way You Want*

August 14, 2025

We're announcing two new features in beta today to help you parse your PDFs just the way you want:

Marker Prompt API: tailor parse outputs with prompts
Forge Parse: a playground to evaluate and iterate quickly on your prompts

Our document intelligence models are state-of-the-art, but they can't read your mind or capture your preferences! And, despite our best efforts, we might never tame the vast wilderness of PDF parsing edge cases (which, incidentally, has driven our founder to near-madness).

We want to give you the ability to get your parse output just right, in any context, and we want it to be easy. That's where Forge Parse and Marker Prompt API come in. They work hand-in-hand:

Use Forge Parse to build confidence in your prompts. You can evaluate a single prompt over multiple documents at the same time and iterate until you know it works.
When you're done, Marker Prompt API will steer output or address issues across tens to thousands of pages, and you won't have to write or operate any post-processing code yourself.

Some common use cases include:

Merging tables that are split across pages.
Removing unwanted artifacts, like line numbers in page gutters or unwanted newlines.
Handling OCR errors, especially with non-English languages or poorly scanned docs.

These features are in public beta and are available to all paying customers. Give them a spin and send us your document parsing preferences and edge cases! We’re keen to address all of them.

Example 1: Removing Unwanted Artifacts

A lot of research paper preprints and conference abstracts have line numbers in the left gutter of their pages.

They're part of the document and often get parsed into blocks. We have post-processors that try to strip them, but their heuristics do not capture every case.
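To make the limits of heuristics concrete, here is a minimal sketch of the kind of rule a post-processor might apply. It is not Marker's actual post-processor: the function name, the regex, and the "numbers must be consecutive" rule are all assumptions made up for illustration.

```python
import re

# Hypothetical rule: a line starts with a 1-4 digit number followed by
# whitespace, and those numbers increase by exactly one from line to line.
LEADING_NUMBER = re.compile(r"^\s*(\d{1,4})\s+(.*)$")


def strip_gutter_numbers(lines):
    """Drop leading numbers only when they form a consecutive run."""
    if len(lines) < 2:
        return list(lines)  # too little evidence to call it a gutter number
    matches = [LEADING_NUMBER.match(line) for line in lines]
    if not all(matches):
        return list(lines)  # some line has no leading number: leave untouched
    numbers = [int(m.group(1)) for m in matches]
    if any(b - a != 1 for a, b in zip(numbers, numbers[1:])):
        return list(lines)  # not a consecutive run: leave untouched
    return [m.group(2) for m in matches]


# Every line numbered: the heuristic works.
clean = strip_gutter_numbers(["12 We study parsing", "13 of scanned PDFs."])

# Only every fifth line numbered: the "consecutive" check misses it.
missed = strip_gutter_numbers(["5 We study parsing", "10 of scanned PDFs."])
```

The second call returns its input unchanged, which is exactly the sort of gap a rule like this leaves behind, and the sort of case a prompt can describe in plain language.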

Now you can just prompt them away.

Forge Parse shows you exactly how specific blocks were changed.

Example 2: Merging Tables Across Pages

In addition to rewriting blocks the way you want, we also handle prompts that merge artifacts across page boundaries.

A common request we get is to merge tables across pages. You can use the Marker Prompt API to do that.

Here's an example using Berkshire Hathaway's 2024 annual report. The prompt also asks Marker to insert currency signs in every cell.

This prompt handles two changes at once: formatting and merging tables.
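For a sense of what those two changes amount to, here is a rough sketch of doing them by hand on already-parsed rows. This is not how Marker applies the prompt; the function names are hypothetical, and the sample rows are placeholders rather than figures from the report.

```python
def merge_split_table(first_page_rows, next_page_rows):
    """Join two fragments of one table, dropping a repeated header row."""
    if not first_page_rows:
        return [list(row) for row in next_page_rows]
    header = first_page_rows[0]
    continuation = next_page_rows
    if next_page_rows and next_page_rows[0] == header:
        continuation = next_page_rows[1:]  # header repeated on the next page
    return [list(row) for row in first_page_rows + continuation]


def add_currency_sign(rows, sign="$"):
    """Prefix numeric cells in body rows with a currency sign."""
    def fmt(cell):
        digits = cell.replace(",", "").replace(".", "", 1)
        if digits.isdigit() and not cell.startswith(sign):
            return sign + cell
        return cell

    if not rows:
        return []
    # The first row is treated as the header and left as-is.
    return [list(rows[0])] + [[fmt(cell) for cell in row] for row in rows[1:]]


# Placeholder rows standing in for a table split across two pages.
page_1 = [["Item", "Amount"], ["A", "1,200"]]
page_2 = [["Item", "Amount"], ["B", "350.5"]]

merged = add_currency_sign(merge_split_table(page_1, page_2))
```

Even this toy version has to guess at repeated headers and at what counts as a numeric cell, which is why describing the intent in a prompt is often less work than maintaining code like this.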

Example 3: Improving Table Parses

NYC's annual expense budget is a PDF over 800 pages long. It is littered with agency budget summary pages that look like they came out of a line printer:

There are tables and text on these pages, but how you might decide to parse this summary into blocks is pretty subjective.

My aim is to parse out a clean financial summary table. I don’t care about anything else.

Marker's output isn't ideal in this instance: it looks at this page, identifies one big table, and doesn’t compose rows/columns in a way that makes sense. Rows are merged when they shouldn’t be, the prose summaries don’t span multiple columns, etc.

These images are cropped for this post, but you can view output like this side-by-side easily in Forge Parse.

The escape hatch in this case is the use_llm flag, which uses Marker's open-source post-processors to fix common errors.

use_llm's output is much better (and you can use it via Datalab's API):

Marker output when use_llm is true is much improved in this instance.
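If you are scripting this rather than clicking through Forge Parse, the options for a parse request might be assembled along these lines. Treat it as a sketch under assumptions: use_llm is the flag named in this post, but the "prompt" field name and the build_parse_options helper are hypothetical, so check the API docs for the real endpoint and field names before sending anything.

```python
import json


def build_parse_options(use_llm=False, prompt=None):
    """Assemble request options as a plain dict (field names partly assumed)."""
    options = {"use_llm": use_llm}  # flag name taken from this post
    if prompt:
        # "prompt" is a hypothetical field name, not a documented one.
        options["prompt"] = prompt.strip()
    return options


options = build_parse_options(
    use_llm=True,
    prompt="Keep only the financial summary table; drop the prose.",
)
print(json.dumps(options, indent=2))
```

Keeping the options in one small function makes it easy to reuse a prompt you have already validated in Forge Parse across many documents.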

But what I really want is a clean table with just the financials and none of the prose summarizing agency responsibilities, etc. You can use the Marker Prompt API to do that:

Forge Parse will show you when blocks on a page change with a CHANGED tag on the top right.

Ah, much better!

Tips & Further Reading

Forge Parse is the best way to get started with the Marker Prompt API.

If your prompt isn’t working as well as you’d like, read these Marker Prompt API tips.

And, if you have questions, please holler at us: [email protected]
