pdfvision

Give AI agents human-like PDF vision

🔍 pdfvision gives AI agents human-like PDF vision — text, layout, and rendered page images in one pass, delivered as a CLI / library built for agents.

Mission: make every PDF reliably readable by AI agents. Surface text, layout, and page images together, and expose extraction gaps instead of hiding them.

💡 Why pdfvision

Hand an agent a PDF and it usually either can't read it at all, or swallows the whole file and blows past its context window. Neither is how anyone actually reads a PDF. A person goes page by page, looks at the figures and images, and zooms in when a detail won't resolve.

pdfvision gives agents those same eyes. It is a lightweight CLI, packed with recognition aids, built so an agent can experiment with how it looks at a PDF (one page at a time, as text, as a rendered image, or zoomed into a region) instead of being handed a single pre-baked answer it can't second-guess.

See whether text extraction actually worked

Every page reports charCount, imageCount, and textCoverage, so an agent can tell "this slide is an image, not text" and re-run with --render or --ocr instead of trusting an empty string.
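As a rough illustration of how an agent might act on these three numbers, here is a minimal sketch. Only the field names charCount, imageCount, and textCoverage come from pdfvision; the PageDensity interface, the chooseNextStep helper, the 0..1 reading of textCoverage, and both thresholds are assumptions made for this example.

```typescript
// Hypothetical shape: only the three field names come from the README.
interface PageDensity {
  charCount: number;    // characters found in the native text layer
  imageCount: number;   // raster images placed on the page
  textCoverage: number; // assumed here to be a 0..1 fraction of the page covered by text
}

type NextStep = "trust-text" | "ocr" | "render";

// Thresholds are illustrative guesses, not values used by pdfvision.
function chooseNextStep(p: PageDensity, minChars = 20, minCoverage = 0.02): NextStep {
  const textLooksEmpty = p.charCount < minChars || p.textCoverage < minCoverage;
  if (!textLooksEmpty) return "trust-text";
  // No usable text layer but rasters are present: likely a scan or an image slide.
  if (p.imageCount > 0) return "ocr";
  // No text and no rasters (for example vector artwork): look at the rendered page.
  return "render";
}
```

An agent would map "ocr" to a re-run with --ocr and "render" to a re-run with --render; the exact cut-offs are worth tuning per corpus.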

Look at the page, not just the text

--render hands PNG paths straight to a vision model and --ocr attaches per-page OCR alongside the native text, so an agent can read a page visually when the text layer falls short.

Preserve layout and visual structure

--layout, --image-boxes, and --geometry expose reading order, raster positions, and per-item geometry as raw signals — the agent picks which lens fits and tries another when one falls short, rather than trusting one baked answer.

Spot anomalies a human would notice

With --layout, each page carries pages[].warnings — overlapping text, body running off the page, collisions with running headers/footers — the "this looks off" cues a text-only extractor silently drops.

Keep raw evidence available

Normalization is on by default, but whenever it changes a string the pre-normalized text is kept in rawText (JSON/XML output). The xml format mirrors json as tags, which some LLMs locate more reliably. The original signal is never thrown away.

Make repeated agent reads cheap

A cache-first design (~30 ms on the second read) and first-class --remote URLs keep the trial-and-error above practical across a whole session.

The design principle: the agent decides; pdfvision delivers raw signals. There are no auto-detect heuristics that decide for the agent and hide what the PDF actually contained.

🚀 Quick Start

# Try without installing
npx pdfvision document.pdf

# Render page images for a multimodal LLM
npx pdfvision document.pdf --render

# Pull from a URL
npx pdfvision --remote https://raw.githubusercontent.com/mozilla/pdf.js-sample-files/master/tracemonkey.pdf -f json

# Or install globally
npm install -g pdfvision
pdfvision document.pdf

🤖 Agent Skill

pdfvision ships a bundled agent skill at skills/pdfvision/ (a SKILL.md plus a small references/ set) so a Claude Code, Codex, or Cursor session knows when to reach for the CLI and how to pick flags. Install it with npx skills:

# Project install (default) — drops the skill into <cwd>/.claude/skills/pdfvision/
npx skills add yamadashy/pdfvision

# Global install — drops it into ~/.claude/skills/pdfvision/ instead
npx skills add yamadashy/pdfvision -g

The skill covers the daily extraction flow and silent-failure detection based on the density Overview table. It points at references/structured-output.md (full DocumentResult schema for programmatic consumers) and references/ocr.md (multi-language OCR, traineddata, troubleshooting) only when those specific cases apply.

📖 Usage

pdfvision <file.pdf> [options]
pdfvision --remote <url> [options]
pdfvision --clear-cache

Options:
  -p, --pages <range>     Page range (e.g. "1-5", "3", "1,3,5")
  -f, --format <type>     Output format: markdown (default), json, xml, toon
  -r, --render            Render pages as PNG images
      --render-output <dir>
                          Directory for rendered PNGs (requires --render)
      --render-scale <n>  Rasterisation multiplier (default 2; bounds (0, 4]). Requires --render or --ocr.
      --geometry          Emit per-text-item bbox + font size in pages[].spans (json/xml/toon)
      --layout            Reconstruct lines + blocks (with role / repeated flags) in pages[].layout;
                          also emit pages[].warnings (text_overlap / near_bottom_edge /
                          body_near_repeated_chrome / off_page)
      --image-boxes       Emit per-image bbox in pages[].imageBoxes
      --ocr               Run tesseract.js OCR; attach pages[].ocr (text/confidence/lang)
      --ocr-lang <lang>   Tesseract lang(s), plus-separated (e.g. eng+jpn). Default: eng
      --remote <url>      Download an http(s) PDF into the cache, then extract
      --no-cache          Skip the on-disk cache
      --no-normalize      Disable Unicode NFKC normalization (default: on; pre-normalization text
                          is preserved in JSON/XML `rawText` only when normalization changed
                          the string — pass this if you need raw codepoints in markdown too)
      --clear-cache       Wipe every cached extraction, render, and remote download, then exit
  -v, --version           Show version
  -h, --help              Show this help
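
For an agent that shells out to the CLI, the constraints in the option list can be checked before spawning a process. The sketch below is one possible way to do that; ReadOptions and buildArgs are hypothetical names invented for this example, and only the flag names and the (0, 4] bound are taken from the help text above.

```typescript
// Hypothetical helper; only the flag names come from the usage text.
interface ReadOptions {
  pages?: string; // e.g. "1-5", "3", "1,3,5"
  format?: "markdown" | "json" | "xml" | "toon";
  render?: boolean;
  ocr?: boolean;
  layout?: boolean;
  renderScale?: number; // must lie in (0, 4]
}

function buildArgs(file: string, o: ReadOptions = {}): string[] {
  const args: string[] = [file];
  if (o.pages) args.push("--pages", o.pages);
  if (o.format) args.push("--format", o.format);
  if (o.render) args.push("--render");
  if (o.ocr) args.push("--ocr");
  if (o.layout) args.push("--layout");
  if (o.renderScale !== undefined) {
    // Mirror the documented bounds and dependency of --render-scale.
    if (!(o.renderScale > 0 && o.renderScale <= 4)) {
      throw new RangeError("render scale must be in (0, 4]");
    }
    if (!o.render && !o.ocr) {
      throw new Error("--render-scale requires --render or --ocr");
    }
    args.push("--render-scale", String(o.renderScale));
  }
  return args;
}
```

The resulting array can be handed to Node's child_process.execFile("pdfvision", args, ...) without any shell quoting.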

Output formats

markdown (default) — per-page sections, density Overview table, image links inline. For LLM context windows.
json — full DocumentResult schema. For programmatic consumers.
xml — same data as JSON but tag-shaped. For LLMs that locate <page> / <text> tags more reliably than nested object keys.
toon — Token-Oriented Object Notation: a lossless, schema-aware encoding of the same DocumentResult schema, tuned for LLM token budgets. Uniform object arrays (overview, spans, imageBoxes, layout lines) collapse into a CSV-like tabular form that declares field names once instead of repeating them per row, cutting ~40% of tokens versus the pretty-printed JSON on geometry / layout-heavy output (where spans can outnumber the body text 5–10×). On plain text-body extraction the win is smaller since free text doesn't compress. Round-trips back to JSON, so programmatic consumers lose nothing.
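
To make the tabular idea concrete, the sketch below collapses a uniform array of objects into a header that names the fields once, followed by one comma-separated row per object. It is a simplified illustration only: toTabular is a hypothetical helper, it does no quoting or escaping, and it is neither a conforming TOON encoder nor pdfvision's implementation. The x and y field names are assumptions.

```typescript
type Row = Record<string, string | number | boolean>;

// Simplified illustration of the tabular form: field names are declared once,
// then each object becomes one comma-separated row.
function toTabular(name: string, rows: Row[]): string {
  const first = rows[0];
  if (first === undefined) return `${name}[0]:`;
  const fields = Object.keys(first);
  const header = `${name}[${rows.length}]{${fields.join(",")}}:`;
  const lines = rows.map((r) => "  " + fields.map((f) => String(r[f])).join(","));
  return [header, ...lines].join("\n");
}

const spans: Row[] = [
  { x: 72, y: 90, fontSize: 12 },
  { x: 72, y: 106, fontSize: 12 },
];
const tabular = toTabular("spans", spans);
// tabular is:
// spans[2]{x,y,fontSize}:
//   72,90,12
//   72,106,12
```

The saving grows with the number of rows, because the pretty-printed JSON repeats every key on every object while the tabular form pays for the keys once.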

Examples

# Specific pages as JSON
pdfvision document.pdf -p 1-3 -f json

# Render PNGs into ./images for a multimodal LLM
pdfvision document.pdf -r --render-output ./images

# Layout + image bboxes — agent reconstructs reading order itself,
# and pages[].warnings flags overlapping text, body running into the
# bottom edge, body colliding with running headers/footers, etc.
pdfvision document.pdf --layout --image-boxes -f json

# Per-text-item geometry (bbox + fontSize per glyph run)
pdfvision document.pdf -f json --geometry

# Same geometry as token-efficient TOON (spans become tabular rows)
pdfvision document.pdf -f toon --geometry

# OCR a scanned PDF (multi-language)
pdfvision scan.pdf --ocr --ocr-lang eng+jpn -f json

Coordinates use a top-down origin (0,0 at the top-left, y grows downward) in PDF user-space points so callers can overlay spans / image bboxes directly on the rendered PNG. Multiply by image.width / page.width to map onto pixels.
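
The mapping described above can be sketched as follows. Box and toPixels are hypothetical names for this example, and the pageWidthPt / imageWidthPx parameters stand in for page.width and image.width; the sample numbers assume a 612 pt wide page rendered at twice its point size.

```typescript
// Hypothetical box shape in PDF points, with a top-left origin as described above.
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Scale a box from PDF user-space points to pixels on the rendered PNG.
// No y-flip is needed because both coordinate systems grow downward.
function toPixels(box: Box, pageWidthPt: number, imageWidthPx: number): Box {
  const scale = imageWidthPx / pageWidthPt;
  return {
    x: box.x * scale,
    y: box.y * scale,
    width: box.width * scale,
    height: box.height * scale,
  };
}

// Assumed example: a 612 pt wide page rendered 1224 px wide gives a factor of 2.
const px = toPixels({ x: 72, y: 100, width: 200, height: 14 }, 612, 1224);
// px is { x: 144, y: 200, width: 400, height: 28 }
```

Deriving the factor from the two widths, rather than hard-coding the render scale, keeps the overlay correct if the scale is changed between runs.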

📚 Library API

import { processDocument } from 'pdfvision';

const result = await processDocument('./document.pdf', { pages: '1-3', render: true });

console.log(result.totalPages);          // number
console.log(result.metadata.title);      // string | null
for (const page of result.pages) {
  console.log(page.page, page.text);     // typed access, no JSON.parse
  if (page.image) console.log(page.image); // PNG path on disk when render: true
}

processFile() returns the same string output the CLI prints (markdown / json / xml / toon).

Exports: processDocument, processFile, parsePageRange, plus full type definitions for DocumentResult / PageResult / PageOverview / PageQuality / DocumentMetadata / ProcessDocumentOptions / ProcessOptions / OutputFormat / TextSpan / LayoutBlock / LayoutLine / PageLayout / ImageBox / PageOcr / PageWarning.

💾 Caching

Results land under <os-tmp>/pdfvision/<sha256-prefix>/, keyed by file content. The cache uses POSIX 0700 / 0600 permissions and includes symlink/TOCTOU defences. Override the location with PDFVISION_CACHE_DIR=/path, or wipe everything with pdfvision --clear-cache.
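
Keying by content means a renamed or moved copy of the same PDF hits the same cache entry. The sketch below shows one way such a key could be derived; it is not pdfvision's code. cacheDirFor is a hypothetical name, the 16-character prefix length is a guess, and treating PDFVISION_CACHE_DIR as a replacement for the whole <os-tmp>/pdfvision root is also an assumption.

```typescript
import { createHash } from "node:crypto";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Assumption: a 16-hex-character prefix; the README does not state the real length.
function cacheDirFor(content: Uint8Array, prefixLen = 16): string {
  // Hash the bytes, not the path, so identical files share one entry.
  const digest = createHash("sha256").update(content).digest("hex");
  // Assumption: the environment variable replaces the default root directory.
  const root = process.env.PDFVISION_CACHE_DIR ?? join(tmpdir(), "pdfvision");
  return join(root, digest.slice(0, prefixLen));
}
```

Two calls with identical bytes return the same directory regardless of the original file name, which is what makes a second read of the same document cheap.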

🛠️ Requirements

Node.js >= 22.13.0
@napi-rs/canvas (installed automatically; ships prebuilt binaries for common platforms)
tesseract.js is installed as an optional dependency and only loaded when --ocr is requested. Skip it with npm install --omit=optional if you don't need OCR.
