pdfmux

The only PDF extractor that checks its own work. #2 overall on opendataloader-bench — ahead of docling, marker, mineru, and every other open-source extractor. Zero AI, zero API calls, zero GPU.

PDF ──→ pdfmux ──→ Markdown / JSON
         │
         ├─ fast extract every page
         ├─ audit each page (5 quality checks)
         ├─ re-extract bad pages with surgical OCR
         ├─ detect headings via font-size analysis
         ├─ merge → clean → confidence score
         └─ extract tables, key-values, normalize dates/amounts

Most PDF extractors run once and hope for the best. pdfmux extracts, audits every page, and re-extracts the ones that came out wrong — automatically. No GPU, no API keys, no cloud dependency.

Benchmark

Tested on opendataloader-bench — 200 real-world PDFs across financial reports, legal filings, academic papers, and scanned documents.

| Engine | Overall | Reading Order (NID) | Tables (TEDS) | Headings (MHS) | Requires |
|--------|---------|---------------------|---------------|----------------|----------|
| opendataloader hybrid | 0.909 | 0.935 | 0.928 | 0.828 | AI API calls |
| pdfmux | 0.900 | 0.918 | 0.887 | 0.844 | CPU only, zero cost |
| docling | 0.877 | 0.900 | 0.887 | 0.802 | ~500MB ML models |
| marker | 0.861 | 0.890 | 0.808 | 0.796 | GPU recommended |
| opendataloader local | 0.844 | 0.913 | 0.494 | 0.761 | CPU only |
| mineru | 0.831 | 0.857 | 0.873 | 0.743 | GPU + ~2GB models |

#2 overall and #1 among free tools, at 99% of the paid leader's score and zero cost per page. pdfmux beats docling (+2.3 points), marker (+3.9 points), and every other open-source extractor, and has the best heading detection of any engine, paid or free.

Quick Start

pip install pdfmux

pdfmux invoice.pdf
# ✓ invoice.pdf → invoice.md (2 pages, 95% confidence, via pymupdf4llm)

No config, no flags, no API keys needed.

Install

# core — handles digital PDFs instantly (the vast majority)
pip install pdfmux

# add OpenDataLoader for best-in-class accuracy (Java 11+ required)
pip install "pdfmux[opendataloader]"

# add OCR for scanned/image-heavy pages (~200MB, CPU-only)
pip install "pdfmux[ocr]"

# add table extraction (Docling — 97.9% table accuracy)
pip install "pdfmux[tables]"

# add LLM fallback for hardest cases (Gemini Flash)
pip install "pdfmux[llm]"

# everything
pip install "pdfmux[all]"

Requires Python 3.11+.

How It Works

Multi-pass extraction

Every PDF goes through a multi-pass pipeline. This is what makes pdfmux different.

Pass 1 — Fast extract + audit
  For each page:
    ├─ Extract text with PyMuPDF (instant)
    ├─ Count characters + images
    └─ Classify: "good" / "bad" / "empty"

  All pages good? → done. Zero overhead.

Pass 2 — Selective OCR (only bad pages)
  For each bad/empty page:
    ├─ Try RapidOCR  (~200MB, CPU, Apache 2.0)
    ├─ Try Surya OCR  (fallback, heavier)
    └─ Try Gemini LLM (fallback, API)

  Smart comparison:
    ├─ "bad" page (some text): only use OCR if it got MORE text
    └─ "empty" page (no text): accept any OCR result >10 chars

Pass 3 — Heading detection
  ├─ Analyze font sizes per page via PyMuPDF spans
  ├─ Identify heading levels from size clusters
  └─ Inject markdown heading markers (# / ## / ###)

Pass 4 — Merge + score
  ├─ Combine good pages + OCR'd pages in order
  ├─ Clean text (broken words, control chars, spacing)
  └─ Confidence score (honest — reflects actual quality)

The fast path is free. Digital PDFs pass through in ~0.01s/page with zero OCR overhead. The audit step adds negligible cost. You only pay for OCR on pages that actually need it.
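The Pass 2 acceptance rule can be sketched in a few lines. The function name and exact thresholds below are illustrative assumptions, not pdfmux internals:

```python
def accept_ocr(page_quality: str, original_text: str, ocr_text: str) -> bool:
    """Decide whether an OCR result should replace the fast extraction.

    Mirrors the "smart comparison" step above; thresholds are assumed
    for illustration and may differ from pdfmux's actual constants.
    """
    if page_quality == "empty":
        # No original text to lose: accept any non-trivial OCR output.
        return len(ocr_text.strip()) > 10
    if page_quality == "bad":
        # Some text already exists: only switch if OCR recovered more.
        return len(ocr_text.strip()) > len(original_text.strip())
    # "good" pages never reach Pass 2.
    return False
```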

Detection

pdfmux opens each PDF with PyMuPDF and classifies it:

Per page:
  ├─ Has >50 chars of text?             → digital
  ├─ Has images but no/little text?     → graphical (image-heavy)
  └─ No text, no images?                → empty

Document level:
  ├─ ≥80% digital pages                 → digital PDF
  ├─ ≥80% scanned pages                 → scanned PDF
  ├─ Image-heavy pages detected         → graphical PDF
  └─ Mix of types                       → mixed PDF

Table detection:
  ├─ Ruled line patterns (≥3 horiz + ≥2 vert lines)
  └─ Tab-separated or aligned text patterns
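The heuristics above reduce to a few comparisons. A minimal sketch, with threshold values taken from the diagram but function names that are assumptions rather than pdfmux internals:

```python
def classify_page(char_count: int, image_count: int) -> str:
    """Per-page classification following the rules above."""
    if char_count > 50:
        return "digital"
    if image_count > 0:
        return "graphical"
    return "empty"


def looks_like_ruled_table(h_lines: int, v_lines: int) -> bool:
    """Ruled-line table check: at least 3 horizontal and 2 vertical lines."""
    return h_lines >= 3 and v_lines >= 2
```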

Routing

classify(pdf)
  │
  ├─ quality=fast     → PyMuPDF only (instant, free)
  ├─ quality=high     → Gemini Flash → OCR → PyMuPDF
  │
  └─ quality=standard (default):
       ├─ has tables (not graphical) → Docling → PyMuPDF fallback
       └─ everything else            → multi-pass pipeline

If an optional extractor isn't installed, pdfmux silently falls back to the next best option. No errors, no config.
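The silent fallback amounts to walking a preference list and taking the first installed engine. A hypothetical sketch, where the registry and names are illustrative, not pdfmux's actual registry API:

```python
# Pretend only the base install is present.
INSTALLED = {"pymupdf"}

# Preference order per route, best engine first.
ROUTES = {
    "tables": ["docling", "pymupdf"],
    "high": ["gemini", "rapidocr", "pymupdf"],
    "default": ["pymupdf"],
}


def pick_extractor(route: str) -> str:
    """Return the best installed extractor for a route; never errors
    as long as the base extractor is present."""
    for name in ROUTES.get(route, ROUTES["default"]):
        if name in INSTALLED:
            return name
    raise RuntimeError("no extractor available")
```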

Extractors

| Tier | Extractor | What it handles | Speed | Size | Install |
|------|-----------|-----------------|-------|------|---------|
| Fast | PyMuPDF / pymupdf4llm | Digital PDFs with clean text | 0.01s/page | Base | Base |
| OCR | RapidOCR (PaddleOCR v4) | Scanned / image-heavy pages | 0.5-2s/page | ~200MB | pdfmux[ocr] |
| Tables | Docling | Table-heavy documents | 0.3-3s/page | ~500MB | pdfmux[tables] |
| OCR Heavy | Surya OCR | Scanned PDFs (legacy, GPU) | 1-5s/page | ~5GB | pdfmux[ocr-heavy] |
| LLM | Gemini 2.5 Flash | Complex layouts, handwriting | 2-5s/page | API | pdfmux[llm] |

Confidence scoring

Every result includes an honest confidence score:

  • 95-100% — clean digital text, fully extractable
  • 80-95% — good extraction, minor OCR noise on some pages
  • 50-80% — partial extraction, some pages couldn't be recovered
  • <50% — significant content missing, warnings included

When confidence is below 80%, pdfmux tells you exactly what went wrong and how to fix it (e.g., "Install pdfmux[ocr] for better results on 6 image-heavy pages").
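In your own code, the bands above can drive a simple triage step. A sketch with band boundaries from the list above and illustrative label strings:

```python
def triage(confidence: float) -> str:
    """Map a pdfmux confidence score to a coarse quality band."""
    if confidence >= 0.95:
        return "clean"
    if confidence >= 0.80:
        return "good"
    if confidence >= 0.50:
        return "partial"
    return "poor"
```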

Python API

import pdfmux

# Simple text extraction → Markdown string
text = pdfmux.extract_text("report.pdf")
print(text[:200])

# Structured extraction → dict with locked schema
data = pdfmux.extract_json("report.pdf")
print(f"{data['page_count']} pages, {data['confidence']:.0%}")
print(f"OCR pages: {data['ocr_pages']}")

# LLM-ready chunks → list of dicts with token estimates
chunks = pdfmux.load_llm_context("report.pdf")
for c in chunks:
    print(f"{c['title']}: {c['tokens']} tokens (pages {c['page_start']}-{c['page_end']})")

All three functions accept quality="fast", "standard" (default), or "high".
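The token counts on chunks are estimates. A common heuristic is roughly four characters per token for English prose; whether pdfmux uses exactly this rule is an assumption here:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.
    Illustrative only; pdfmux may use a different estimator."""
    return max(1, len(text) // 4)
```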

Types & Errors

Every object in the pipeline is typed and immutable. All types and errors are exported from the top-level package.

from pdfmux import (
    # Enums
    Quality,              # FAST, STANDARD, HIGH
    OutputFormat,         # MARKDOWN, JSON, CSV, LLM
    PageQuality,          # GOOD, BAD, EMPTY

    # Data objects (frozen dataclasses)
    PageResult,           # Single page: text, page_num, confidence, quality, extractor
    DocumentResult,       # Full document: pages, source, confidence, extractor_used
    Chunk,                # Section-aware chunk: title, text, page_start, page_end, tokens

    # Errors
    PdfmuxError,          # Base — catch this to handle all pdfmux errors
    FileError,            # File not found, unreadable, not a PDF
    ExtractionError,      # Extraction failed
    ExtractorNotAvailable,# Requested extractor not installed
    FormatError,          # Invalid output format
    AuditError,           # Audit could not complete
)

Catch broad or narrow:

try:
    text = pdfmux.extract_text("report.pdf")
except pdfmux.ExtractorNotAvailable as e:
    print(f"Missing dependency: {e}")
except pdfmux.PdfmuxError as e:
    print(f"pdfmux error: {e}")

Stream pages with bounded memory:

from pdfmux.extractors import get_extractor

ext = get_extractor("fast")
for page in ext.extract("large-500-pages.pdf"):  # Iterator[PageResult]
    process(page.text)  # bounded memory, even on 500-page PDFs

CLI Usage

Convert a single file

pdfmux invoice.pdf
# ✓ invoice.pdf → invoice.md (2 pages, 95% confidence, via pymupdf4llm)

With OCR installed (image-heavy PDFs)

pdfmux pitch-deck.pdf
# ✓ pitch-deck.pdf → pitch-deck.md (12 pages, 85% confidence, 6 pages OCR'd, via pymupdf4llm + rapidocr)

Output location

pdfmux report.pdf -o ./converted/report.md

Batch convert

pdfmux ./docs/ -o ./output/
# Converting 12 PDFs from ./docs/...
#   ✓ invoice.pdf → invoice.md (95%)
#   ✓ contract.pdf → contract.md (92%)
#   ✓ scan.pdf → scan.md (87%)
# Done: 12 converted, 0 failed

Output formats

# markdown (default)
pdfmux report.pdf

# json — structured output with metadata
pdfmux report.pdf -f json

# llm — section-aware chunks with token estimates
pdfmux report.pdf -f llm

# csv — extracts tables only
pdfmux data.pdf -f csv

Quality presets

# fast — PyMuPDF only, no ML, no audit (instant, free)
pdfmux report.pdf -q fast

# standard — multi-pass pipeline (default)
pdfmux report.pdf -q standard

# high — use LLM for everything (slow, costs ~$0.01/doc)
pdfmux report.pdf -q high

Diagnostics

# check what's installed
pdfmux doctor
# ┌──────────────┬─────────────┬─────────┬──────────────────────────────┐
# │ Extractor    │ Status      │ Version │ Install                      │
# ├──────────────┼─────────────┼─────────┼──────────────────────────────┤
# │ PyMuPDF      │ ✓ installed │ 1.25.3  │                              │
# │ RapidOCR     │ ✓ installed │ 3.0.6   │                              │
# │ Docling      │ ✗ missing   │ —       │ pip install pdfmux[tables]   │
# └──────────────┴─────────────┴─────────┴──────────────────────────────┘

# benchmark all extractors on a file
pdfmux bench report.pdf
# ┌──────────────┬────────┬────────────┬─────────────┬──────────────────────┐
# │ Extractor    │   Time │ Confidence │      Output │ Status               │
# ├──────────────┼────────┼────────────┼─────────────┼──────────────────────┤
# │ PyMuPDF      │  0.02s │        95% │ 3,241 chars │ ✓                    │
# │ Multi-pass   │  0.03s │        95% │ 3,241 chars │ ✓ all pages good     │
# │ RapidOCR     │  4.20s │        88% │ 2,891 chars │ ✓                    │
# └──────────────┴────────┴────────────┴─────────────┴──────────────────────┘

Analyze a PDF

pdfmux analyze report.pdf
# report.pdf — 12 pages
#
# ┌──────┬────────────┬────────────────────────┬────────┐
# │ Page │ Type       │ Quality                │  Chars │
# ├──────┼────────────┼────────────────────────┼────────┤
# │    1 │ digital    │ good → fast extraction │  1,204 │
# │    2 │ graphical  │ bad → needs OCR        │     42 │
# │    3 │ digital    │ good → fast extraction │  2,108 │
# └──────┴────────────┴────────────────────────┴────────┘
#
#   Confidence: 91%
#   OCR pages:  2
#   Extractor:  pymupdf4llm + rapidocr (1 page re-extracted)

Other options

# show confidence score in output
pdfmux report.pdf --confidence

# print to stdout instead of file
pdfmux report.pdf --stdout

All CLI options

| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| --output | -o | Same dir, .md ext | Output file or directory |
| --format | -f | markdown | Output format: markdown, json, csv, llm |
| --quality | -q | standard | Quality: fast, standard, high |
| --schema | -s | none | JSON schema file or preset for structured extraction |
| --confidence | | false | Include confidence score in output |
| --stdout | | false | Print to stdout instead of writing file |

Output Formats

Markdown (default)

Clean markdown optimized for LLM consumption:

# Quarterly Report

Revenue for Q3 increased by 15% year-over-year...

## Financial Summary

| Metric | Q3 2025 | Q3 2024 |
|--------|---------|---------|
| Revenue | $12.3M | $10.7M |
| Profit | $3.1M | $2.4M |

JSON

Structured output with metadata:

{
  "source": "report.pdf",
  "converter": "pdfmux",
  "extractor": "pymupdf4llm + rapidocr (3 pages re-extracted)",
  "page_count": 12,
  "confidence": 0.91,
  "ocr_pages": [2, 5, 8],
  "warnings": [],
  "content": "# Quarterly Report\n\nRevenue for Q3...",
  "pages": [
    { "page": 1, "content": "# Quarterly Report..." },
    { "page": 2, "content": "## Financial Summary..." }
  ]
}

LLM (chunked JSON)

Section-aware chunks with token estimates, designed for RAG pipelines:

{
  "document": "report.pdf",
  "chunks": [
    {
      "title": "Quarterly Report",
      "text": "Revenue for Q3 increased by 15%...",
      "page_start": 1,
      "page_end": 2,
      "tokens": 312,
      "confidence": 0.95
    },
    {
      "title": "Financial Summary",
      "text": "| Metric | Q3 2025 | Q3 2024 |...",
      "page_start": 3,
      "page_end": 3,
      "tokens": 156,
      "confidence": 0.95
    }
  ]
}

CSV

Extracts tables from the document:

Metric,Q3 2025,Q3 2024
Revenue,$12.3M,$10.7M
Profit,$3.1M,$2.4M

Raises an error if no tables are found.

Structured Extraction

New in v1.1.0, improved in v1.2.0. Extract structured data from invoices, bank statements, and forms — no LLM, no cloud, no cost.

pdfmux auto-detects key-value pairs (colon-separated, whitespace-aligned, dot-leader patterns), extracts tables as typed JSON, and normalizes dates, amounts, and rates into clean values.

CLI

# JSON output with auto-detected structure
pdfmux statement.pdf -f json

# Schema-guided extraction — map to your own fields
pdfmux invoice.pdf --schema invoice.schema.json

# Use a built-in preset
pdfmux statement.pdf --schema bank-statement

When --schema is provided, the format auto-switches to JSON. Fields are matched using fuzzy string similarity — no exact key names required.

Python API

import pdfmux

data = pdfmux.extract_json("statement.pdf")
# data["pages"][0]["key_values"]  → extracted label: value pairs
# data["pages"][0]["tables"]      → headers + rows as structured JSON

What gets extracted

Key-value pairs — detected from Label: Value, Label Value (whitespace-aligned), and Label.......Value (dot-leader) patterns:

{"key": "Statement Date", "value": "2026-02-28", "page_num": 0}

Tables — headers and rows as typed arrays:

{
  "headers": ["Date", "Description", "Amount"],
  "rows": [["2026-02-01", "Payment received", "1,234.50"]]
}

Normalized values — dates become ISO 8601, amounts get parsed with currency and direction, rates get period detection:

{
  "amount": 1234.50,
  "direction": "debit",
  "currency": "AED"
}
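A minimal sketch of the amount side of this normalization, assuming a simple `AED 1,234.50 DR`-style input. The regex and the DR/CR convention are illustrative assumptions; pdfmux's real normalizer covers many more currencies and layouts:

```python
import re


def normalize_amount(raw: str) -> dict:
    """Parse an amount like 'AED 1,234.50 DR' into value/currency/direction."""
    m = re.search(r"(?P<cur>[A-Z]{3})?\s*(?P<num>\d[\d,]*(?:\.\d+)?)", raw)
    if not m:
        raise ValueError(f"unrecognized amount: {raw!r}")
    value = float(m.group("num").replace(",", ""))
    # Trailing "DR" marks a debit in many bank statements (an assumption here).
    direction = "debit" if raw.strip().upper().endswith("DR") else "credit"
    return {
        "amount": value,
        "direction": direction,
        "currency": m.group("cur") or None,
    }
```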

Schema-guided mapping

Pass a JSON Schema and pdfmux maps extracted data to your fields using fuzzy matching + type coercion. Array fields map from tables, scalar fields map from key-value pairs. No LLM required.

{
  "properties": {
    "invoice_date": {"type": "string", "format": "date"},
    "total_amount": {"type": "number"},
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {"type": "string"},
          "amount": {"type": "number"}
        }
      }
    }
  }
}

MCP Server

pdfmux includes a built-in MCP (Model Context Protocol) server so AI agents can read PDFs natively. Agents receive confidence scores, warnings, and structured extraction data (key-value pairs, tables, normalized values) alongside the text.

pdfmux serve

Claude Desktop / Cursor

Add to your config:

{
  "mcpServers": {
    "pdfmux": {
      "command": "pdfmux",
      "args": ["serve"]
    }
  }
}

Claude Code

claude mcp add pdfmux -- pdfmux serve

Tools

The server exposes three tools:

{
  "name": "convert_pdf",
  "description": "Convert a PDF to AI-readable Markdown",
  "parameters": {
    "file_path": "string — absolute path to the PDF",
    "format": "string — markdown (default)",
    "quality": "string — fast | standard | high (default: standard)"
  }
}

When confidence is below 80% or there are warnings, the response includes extraction metadata (confidence score, extractor used, OCR page numbers, actionable warnings).

Examples

See the examples/ directory for runnable scripts.

Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| GEMINI_API_KEY | Only for pdfmux[llm] | Google Gemini API key for LLM extraction |
| GOOGLE_API_KEY | Optional alternative | Accepted in place of GEMINI_API_KEY |

No environment variables are needed for the base install or the tables/ocr extras.

Why Not Just Use X?

| Tool | Good at | Limitation |
|------|---------|------------|
| Marker | GPU ML extraction | Overkill for digital PDFs, needs GPU |
| Docling | Tables (97.9% accuracy) | Slow on non-table documents |
| pymupdf4llm | Fast digital text | Can't handle scanned or image-heavy layouts |
| MinerU | Full ML pipeline | Heavy, complex setup |
| MarkItDown | Wide format support | Not optimized for any specific PDF type |
| pdfmux | Self-healing extraction | Audits every page, re-extracts bad ones |

pdfmux doesn't compete with these tools — it orchestrates them. The key insight: no single extractor wins on everything. pdfmux routes each page to the right one, verifies the result, and re-extracts if needed.

Project Structure

src/pdfmux/
├── __init__.py         # Public API: extract_text, extract_json, load_llm_context + type/error re-exports
├── py.typed            # PEP 561 marker — mypy/pyright recognize pdfmux as typed
├── types.py            # Frozen dataclasses + enums: Quality, OutputFormat, PageResult, DocumentResult, Chunk
├── errors.py           # Exception hierarchy: PdfmuxError → FileError, ExtractionError, ExtractorNotAvailable, FormatError, AuditError
├── pipeline.py         # Multi-pass routing + merge + process_batch() + security limits
├── detect.py           # PDF type classification + layout detection
├── audit.py            # 5-check per-page confidence scoring + quality classification
├── regions.py          # Region OCR — surgical image extraction for bad pages
├── parallel.py         # Parallel OCR dispatch with thread pool
├── chunking.py         # Section-aware splitting + token estimation
├── kv_extract.py       # Key-value pair extraction (colon, whitespace, dot-leader)
├── normalize.py        # Date/amount/rate normalization (pure Python)
├── schema.py           # Schema-guided extraction (fuzzy matching, type coercion)
├── headings.py         # Heading detection via font-size analysis
├── table_fallback.py   # Fallback table extraction when Docling is unavailable
├── postprocess.py      # Text cleanup
├── mcp_server.py       # MCP server (stdio JSON-RPC) with path restrictions
├── cli.py              # Typer CLI (convert, analyze, serve, doctor, bench, version)
├── extractors/
│   ├── __init__.py     # Extractor protocol + @register decorator + priority-ordered registry
│   ├── fast.py         # PyMuPDF — handles 90% of PDFs (priority 10)
│   ├── rapid_ocr.py    # RapidOCR — lightweight OCR (~200MB, priority 20)
│   ├── tables.py       # Docling — table-heavy docs (priority 40)
│   ├── ocr.py          # Surya — legacy heavy OCR (priority 30)
│   └── llm.py          # Gemini Flash — hardest cases (priority 50)
├── integrations/
│   ├── langchain.py    # PDFMuxLoader for LangChain
│   └── llamaindex.py   # PDFMuxReader for LlamaIndex
└── formatters/
    ├── markdown.py     # Markdown output
    ├── json_fmt.py     # JSON + LLM chunked output
    └── csv_fmt.py      # CSV output (tables only)

Development

git clone https://github.com/NameetP/pdfmux.git
cd pdfmux
python3.12 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# run tests (151 tests)
pytest

# lint
ruff check src/ tests/
ruff format src/ tests/

License

MIT
