Extracting Hyperlinks from PDFs

December 19, 2025

Many PDFs contain clickable hyperlinks: table of contents entries that jump to sections, cross-references between pages, and external URLs that point to websites. When you convert these PDFs to Markdown or HTML, those links are typically lost.

We’ve added a new feature that preserves hyperlinks during OCR, so your converted documents maintain their navigation structure.

What It Does

The link extraction feature:

  • Extracts embedded hyperlinks from digital PDFs (not scanned documents)
  • Preserves external URLs (links to websites)
  • Converts internal page jumps to block-level anchors that work in HTML output

For example, take a table of contents entry that links to a block on page 5 of the PDF, where pages are 1-indexed. It becomes an <a href="#block-4-1"> link in your HTML output. The anchor uses a 0-indexed page number, so page 5 becomes 4, and the link points to the actual content block.
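The page-number shift in that example can be sketched in a few lines of Python. This is only an illustration: it assumes, based solely on the example above, that anchor ids follow the pattern block-<page>-<block> with a 0-indexed page. The helper name `block_anchor` and the meaning of the second number are assumptions, not part of any documented API.

```python
def block_anchor(pdf_page_number: int, block_index: int) -> str:
    """Build an anchor id from a 1-indexed PDF page number (hypothetical helper)."""
    if pdf_page_number < 1:
        raise ValueError("PDF page numbers start at 1")
    # Assumption: the first number in the id is the 0-indexed page,
    # and the second identifies a block on that page.
    return f"block-{pdf_page_number - 1}-{block_index}"


print(block_anchor(5, 1))  # block-4-1
```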

Why It’s Useful

Table of Contents Navigation: Corporate documents, technical manuals, and reports often have clickable tables of contents. With link extraction, readers can click through your converted HTML just like the original PDF.

Cross-References: Legal documents and contracts frequently reference other sections (“see Section 4.2”). These cross-references stay clickable.

Citation Links: Academic papers and reports with hyperlinked citations maintain their references to external sources.

Form Instructions: Government forms often link to instruction pages or external resources, and these links are preserved.

How to Use It

In the Playground

  1. Upload your PDF
  2. Expand “Additional Config” in the settings panel
  3. Check “Extract Links”
  4. Click “Parse Document”

Via API

Add "extras": "extract_links" to the form data of your API request:

import requests

# Send the PDF with link extraction enabled.
with open("document.pdf", "rb") as pdf_file:
    response = requests.post(
        "https://www.datalab.to/api/v1/marker",
        headers={"X-Api-Key": "YOUR_API_KEY"},
        files={"file": pdf_file},
        data={
            "output_format": "html",
            "extras": "extract_links",
        },
    )

response.raise_for_status()  # fail early on HTTP errors
result = response.json()
# result["html"] contains clickable links

The response includes:

  • html: Your converted document with <a href="..."> tags for links and <a id="..."> anchors for link targets.
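Once you have the HTML, you may want to list the links it contains. The sketch below uses only Python's standard-library `html.parser` and treats any href that starts with "#" as an internal link. The class name `LinkCollector`, the helper `collect_links`, and the sample string are illustrative assumptions, not part of the API.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect hrefs and anchor ids from <a> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.internal = []  # hrefs that start with "#"
        self.external = []  # every other href
        self.anchors = []   # ids of <a id="..."> link targets

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href:
            bucket = self.internal if href.startswith("#") else self.external
            bucket.append(href)
        if attr.get("id"):
            self.anchors.append(attr["id"])


def collect_links(html: str) -> LinkCollector:
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser


# Made-up sample in the shape described above, not real API output.
sample = (
    '<a id="block-3-4"></a><p>See <a href="#block-3-4">SHE</a> or '
    '<a href="https://example.com/docs">our documentation</a>.</p>'
)
links = collect_links(sample)
print(links.internal)  # ['#block-3-4']
print(links.external)  # ['https://example.com/docs']
print(links.anchors)   # ['block-3-4']
```

In practice you would pass result["html"] from the request above instead of the sample string.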

Example Output

Here is what extracted links look like in the HTML output.

Internal links (page jumps):

<p>
  For specifications related to electrical, automation and
  <a href="#block-3-5">SHE</a>
  (Safety, Health & Environmental), the machine shall endorse...
</p>

Anchor targets (where internal links point):

<a id="block-3-4"></a>
<table>
  <thead>
    <tr>
      <th>Key</th>
      <th>Definition</th>
    </tr>
  </thead>
  ...
</table>

External links (URLs):

<p>
  For more information, visit
  <a href="https://example.com/docs">our documentation</a>.
</p>
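If you post-process the output, a quick sanity check is to confirm that every internal link has a matching anchor target. The following is a minimal sketch using the standard-library `html.parser`. It assumes a target is any element with a matching id attribute, and the name `dangling_internal_links` is hypothetical. The sample string combines the two excerpts above, which use different ids (block-3-5 and block-3-4) because they come from separate parts of a document.

```python
from html.parser import HTMLParser


class _AnchorAudit(HTMLParser):
    """Record every id attribute and every fragment-only href."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.fragment_hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)
            elif tag == "a" and name == "href" and value and value.startswith("#"):
                self.fragment_hrefs.append(value[1:])  # drop the leading "#"


def dangling_internal_links(html: str) -> list[str]:
    """Return fragment targets that no id in the document matches."""
    audit = _AnchorAudit()
    audit.feed(html)
    audit.close()
    return [frag for frag in audit.fragment_hrefs if frag not in audit.ids]


page = '<p><a href="#block-3-5">SHE</a></p><a id="block-3-4"></a><table></table>'
print(dangling_internal_links(page))  # ['block-3-5']
```

An empty list means every internal link in the checked HTML resolves to a target within it.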

Try Extracting Links from PDFs

If you have documents with complex linking structures and want help optimizing extraction, reach out at [email protected].
