academic-tools-mcp

An MCP server that gives LLM agents lean, focused tools for working with academic papers. Built on FastMCP.

Look up paper metadata, authors, abstracts, citations, and BibTeX entries. Download and read full paper PDFs section-by-section. Explore reference and citation graphs. Cross-reference with Wikipedia.

Data Sources

Provider	What it provides	Auth required
OpenAlex	Paper metadata, authors, abstracts, topics, citations, BibTeX	Optional API key (free)
arXiv	Preprint metadata, authors, abstracts, BibTeX, PDF download	None
bioRxiv/medRxiv	Preprint metadata, authors, abstracts, BibTeX, PDF download	None
ACL Anthology	PDF download for ACL venue papers (ACL, EMNLP, NAACL, etc.)	None
Crossref	Reference lists, title search / DOI discovery	Optional email (for polite pool)
OpenCitations	Reference and citation links with cross-referenced IDs	None
Wikipedia	Article search, summaries, page existence checks	Optional email (for User-Agent)

All API responses are cached locally, so multiple tool calls for the same paper cost one API hit. Concurrent calls for the same paper are coalesced into a single fetch (single-flight). Transient failures (5xx, 429, timeouts) get one transparent retry. Definitive 404s are negative-cached for 24 hours so retry-happy agents don't burn rate budget on guaranteed misses.
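
To make the coalescing behaviour concrete, here is a minimal single-flight sketch in plain asyncio. It is an illustration only, not the server's actual implementation: the `SingleFlight` class and its `do` method are hypothetical names, and the real code may handle errors and cancellation differently.

```python
import asyncio
from typing import Awaitable, Callable


class SingleFlight:
    """Coalesce concurrent calls that share a key into one underlying fetch.

    Hypothetical helper for illustration; names are not taken from the project.
    """

    def __init__(self) -> None:
        self._in_flight: dict[str, asyncio.Future] = {}

    async def do(self, key: str, fetch: Callable[[], Awaitable[dict]]) -> dict:
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key starts the fetch and registers it.
            task = asyncio.ensure_future(fetch())
            self._in_flight[key] = task
            # Forget the key once the fetch settles, so later calls start fresh.
            task.add_done_callback(lambda _: self._in_flight.pop(key, None))
        # shield() keeps one cancelled waiter from cancelling the shared fetch.
        return await asyncio.shield(task)
```

Five concurrent `do("10.1234/example", fetch)` calls would run `fetch` once and hand the same result to every caller. A persistent cache would normally sit in front of this, so single-flight only matters while the first fetch is still in progress.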

Setup

Requires Python 3.11+ and uv.

git clone https://github.com/hunter-heidenreich/academic-tools-mcp.git
cd academic-tools-mcp
uv sync
cp .env.example .env   # then edit .env with your values

Configuration

All configuration is via environment variables in .env. Nothing is required to get started, but some variables unlock higher rate limits.

Variable	Required	Description
`OPENALEX_API_KEY`	No	Free API key from openalex.org
`OPENALEX_MAILTO`	No	Your email — gets you into the polite pool (faster)
`CROSSREF_MAILTO`	No	Your email — gets you into the Crossref polite pool (10 req/sec vs 5)
`WIKIPEDIA_MAILTO`	No	Your email — required by Wikimedia policy for the User-Agent header
`PDF_CONVERTER`	No	PDF-to-markdown backend: `mineru` (default), `marker`, or a custom command (see PDF Pipeline)
`PDF_CONVERTER_VENV`	No	Path to a virtualenv to activate before running the converter (e.g. `~/.venvs/mineru`)
`PDF_CONVERT_TIMEOUT`	No	Hard timeout for a single PDF→markdown conversion in seconds (default `1800` = 30 min). Set to `none` / `off` / `disabled` to disable.
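
As an illustration of the `PDF_CONVERT_TIMEOUT` semantics in the table above, the following sketch parses the variable into a value suitable for `subprocess.run(timeout=...)`. The function name is hypothetical, and how the server treats malformed values is not documented here; this sketch simply lets `float()` raise `ValueError`.

```python
import os
from typing import Mapping, Optional

DEFAULT_TIMEOUT_SECONDS = 1800.0  # 30 minutes, the documented default
DISABLED_VALUES = {"none", "off", "disabled"}


def read_convert_timeout(env: Mapping[str, str] = os.environ) -> Optional[float]:
    """Return the conversion timeout in seconds, or None when disabled."""
    raw = env.get("PDF_CONVERT_TIMEOUT", "").strip().lower()
    if not raw:
        return DEFAULT_TIMEOUT_SECONDS
    if raw in DISABLED_VALUES:
        return None  # subprocess.run(timeout=None) waits indefinitely
    return float(raw)  # assumption: invalid values surface as ValueError
```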

Usage

With Claude Code

Add to your MCP config (~/.claude/claude_code_config.json):

{
  "mcpServers": {
    "academic-tools": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/academic-tools-mcp", "python", "-m", "academic_tools_mcp.server"]
    }
  }
}

Standalone

uv run python -m academic_tools_mcp.server

FastMCP CLI

uv run fastmcp run src/academic_tools_mcp/server.py:mcp

Tools

Papers (unified, auto-routed)

Tool	Description
`get_paper_metadata`	Title, dates, venue / categories, identifiers — shape varies by `_source`. Optional `follow_published=True` auto-chains a bioRxiv preprint to its journal version on OpenAlex when one exists.
`get_papers_metadata`	Bulk metadata for many identifiers at once. OpenAlex DOIs collapse into one batched HTTP call per 50; arXiv / bioRxiv fan out concurrently. Designed for reference-graph enrichment after `get_paper_references`. Cap 100 per call.
`get_paper_authors`	Author list with source-appropriate detail (affiliations, corresponding author, OpenAlex IDs)
`get_paper_abstract`	Plain text abstract
`get_paper_bibtex`	Ready-to-paste BibTeX entry

Pass an arXiv ID (2301.00001, hep-th/9901001) or any DOI — including bioRxiv/medRxiv (10.1101/...), ACL Anthology (10.18653/v1/...), or generic publisher DOIs. Each response carries a _source field ("arxiv" / "biorxiv" / "openalex") so you know which provider answered and which fields to expect. arXiv IDs always route to arXiv; bioRxiv DOIs route to bioRxiv; everything else (including ACL) routes to OpenAlex.
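
The routing rule above can be sketched as a small dispatch on identifier shape. This is a simplified illustration, not the server's code: `route_identifier` and the regular expressions are assumptions, and the real router may also normalise inputs such as `https://doi.org/` URLs, which this sketch does not.

```python
import re

# New-style arXiv IDs (2301.00001, optional version suffix) and
# old-style ones (hep-th/9901001, math.GT/0309136).
ARXIV_NEW = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
ARXIV_OLD = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")


def route_identifier(identifier: str) -> str:
    """Return the provider name that would answer for this identifier."""
    ident = identifier.strip()
    if ARXIV_NEW.match(ident) or ARXIV_OLD.match(ident):
        return "arxiv"
    # Assumption: the 10.1101/ prefix alone is treated as bioRxiv/medRxiv.
    if ident.lower().startswith("10.1101/"):
        return "biorxiv"
    # Everything else, including ACL Anthology DOIs, goes to OpenAlex.
    return "openalex"
```

With these rules, `route_identifier("hep-th/9901001")` yields `"arxiv"` and `route_identifier("10.18653/v1/2020.acl-main.1")` yields `"openalex"`, matching the `_source` values described above.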

Tool	Description
`search_arxiv`	Search arXiv with field prefixes (`ti:`, `au:`, `abs:`, `cat:`) and boolean operators

Authors

Tool	Description
`get_author`	Name, ORCID, institutions (current + historical with years), h-index, i10-index, works/citation counts, top topics

Accepts OpenAlex author IDs (from get_paper_authors) or ORCIDs.

PDF pipeline (unified)

Tool	Description
`download_pdf`	Download and cache the PDF — auto-detects arXiv, ACL Anthology, bioRxiv/medRxiv. Streams chunks to disk (peak memory = 64 KiB) and aborts mid-stream if the response would exceed `MAX_PDF_BYTES` (default 200 MB). Re-downloading with `force_refresh=True` cascades: the cached markdown + section index are dropped automatically so the next `convert_paper` picks up the new bytes.
`convert_paper`	Convert PDF to markdown, parse into sections (slow: tens of minutes; `PDF_CONVERT_TIMEOUT` caps it at 30 min by default). The server runs at most one conversion at a time across all callers — a second concurrent caller gets `{busy: True, retryable: True, in_progress: {...}}` immediately rather than queueing
`get_paper_sections`	Section index with titles, sub-heading previews, token counts
`get_paper_section`	Markdown of a section (by index or title substring); truncated by default (16000 chars)
`find_in_paper`	Substring (or whole-word) search inside one converted paper. Returns each hit's section + char offset + ~120-char snippet. Char offsets align with `get_paper_section`'s stripped text so you can chain straight to the surrounding context.

All five tools accept any identifier (arXiv ID, DOI, or freeform label) and auto-route to the correct provider's cache namespace. For papers not hosted on arXiv/ACL/bioRxiv, fetch the PDF yourself and hand it to import_paper; see Manual import below.
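
To show the kind of result `find_in_paper` describes (section, character offset, short snippet), here is a hedged sketch of an in-section search. The function name, the dictionary keys, and the case-insensitive matching are all assumptions for illustration; the real tool's output shape may differ.

```python
import re


def find_in_sections(
    sections: list[str],
    query: str,
    whole_word: bool = False,
    snippet_chars: int = 120,
) -> list[dict]:
    """Return one hit per match: section index, char offset, and a snippet."""
    if not query:
        return []
    pattern = re.escape(query)
    if whole_word:
        pattern = rf"\b{pattern}\b"
    regex = re.compile(pattern, re.IGNORECASE)  # assumption: case-insensitive

    hits = []
    for index, text in enumerate(sections):
        for match in regex.finditer(text):
            # Centre the snippet on the match, clipped to the section bounds.
            pad = max(0, (snippet_chars - len(match.group())) // 2)
            lo = max(0, match.start() - pad)
            hi = min(len(text), match.end() + pad)
            hits.append(
                {"section": index, "offset": match.start(), "snippet": text[lo:hi]}
            )
    return hits
```

Because each offset indexes into the same section text that would be returned for reading, a caller can slice around `offset` to pull in more context than the snippet provides.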

References and citations (DOI required)

Tool	Description
`get_paper_references_count`	Survey outgoing-reference coverage across both Crossref and OpenCitations in one call — returns per-source counts so you can pick which to page through
`get_paper_references`	Paginated outgoing references. Default `source="auto"` surveys both Crossref and OpenCitations in parallel and pages from whichever has more; pass `source="crossref"` for structured metadata or `source="opencitations"` for broader DOI coverage to skip the survey
`get_paper_citations_count`	Number of incoming citations (OpenCitations)
`get_paper_citations`	Paginated incoming citations with DOIs, dates, self-citation flags, and cross-referenced IDs (OpenCitations)
`search_crossref_by_title`	DOI discovery by bibliographic query (also works for bioRxiv papers); each hit warms the works cache so a follow-up `get_paper_metadata(doi)` is free

For citations, follow the count-then-page pattern: call get_paper_citations_count first to see the total, then page through with page and page_size. For references the source="auto" default does the survey for you on the first call. Paginated responses include _source (on references) and has_more so agents know which shape to expect and when to stop. This prevents token blowouts on papers with long bibliographies or many citations.
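
A client-side view of the count-then-page pattern might look like the sketch below. It uses an in-memory stand-in for the citation tool so it runs without a server. The `results` key, 1-based page numbers, and the helper names are assumptions; only `page`, `page_size`, and `has_more` come from the description above.

```python
from typing import Callable, Iterator


def iter_paged(
    fetch_page: Callable[[int, int], dict],
    page_size: int = 25,
    max_items: int = 200,
) -> Iterator[str]:
    """Yield items page by page until has_more is false or max_items is hit."""
    page, yielded = 1, 0
    while yielded < max_items:
        response = fetch_page(page, page_size)
        for item in response["results"]:
            if yielded >= max_items:
                return
            yield item
            yielded += 1
        # Stop on the last page, or defensively on an empty page.
        if not response["results"] or not response.get("has_more"):
            return
        page += 1


def make_fake_citations(total: int) -> Callable[[int, int], dict]:
    """In-memory stand-in for a paginated citation tool (hypothetical shape)."""
    dois = [f"10.0000/example.{i}" for i in range(total)]

    def fetch_page(page: int, page_size: int) -> dict:
        start = (page - 1) * page_size
        chunk = dois[start:start + page_size]
        return {"results": chunk, "has_more": start + page_size < total}

    return fetch_page
```

An agent would typically check the count first and pick `max_items` accordingly, so a paper with thousands of citations never lands in context all at once.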

Source trade-off for references: Crossref returns structured reference metadata (author, title, year, journal, DOI) when publishers deposit it; quality varies. OpenCitations aggregates from Crossref, PubMed, DataCite, OpenAIRE, and JaLC — it may have entries Crossref lacks, but returns DOI-to-DOI links only (no bibliographic metadata).

Manual import

Tool	Description
`import_paper`	Import a local `.pdf` (e.g. from Zotero or a file you downloaded) or pre-converted `.md`/`.markdown` with a user-supplied identifier. File type is detected by extension.

For PDFs outside arXiv/bioRxiv/ACL, fetch the file yourself (browser, curl, publisher portal, institutional proxy) and then call import_paper — the server deliberately does not download arbitrary URLs.

After importing a PDF, use the unified pipeline tools (convert_paper → get_paper_sections → get_paper_section) with the same identifier. Markdown imports skip the conversion step and go straight to get_paper_sections / get_paper_section.

Provider-aware routing: if the identifier is an arXiv ID, bioRxiv DOI, or ACL DOI, the file is stored in that provider's cache namespace automatically. A subsequent download_pdf("2301.00001") will find an already-imported PDF — no duplicates.

Wikipedia

Tool	Description
`search_wikipedia`	Search for articles matching a query
`get_wikipedia_summary`	Title, description, extract, URL, and page type (`standard` / `disambiguation`); errors if the page doesn't exist

PDF Pipeline

The PDF-to-markdown pipeline converts downloaded PDFs into section-level markdown that agents can read piece by piece, avoiding token blowouts from dumping entire papers into context.

The pipeline is converter-agnostic. Set PDF_CONVERTER in .env to choose your backend:

# Named backends
PDF_CONVERTER=mineru          # default — https://github.com/opendatalab/MinerU
PDF_CONVERTER=marker          # https://github.com/datalab-to/marker

# Custom command template — use {input} and {output_dir} placeholders
PDF_CONVERTER=my-tool --in "{input}" --out "{output_dir}"

If your converter lives in a virtualenv, set PDF_CONVERTER_VENV:

PDF_CONVERTER_VENV=~/.venvs/mineru

The converter must accept a PDF input path and an output directory, and produce one or more .md files in that directory. The pipeline finds the markdown file automatically.
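
The custom command template can be pictured as follows. This is a sketch under stated assumptions, not the pipeline's actual code: the function names are hypothetical, and picking the largest `.md` file is just one plausible way to "find the markdown file automatically".

```python
import shlex
from pathlib import Path
from typing import Optional


def build_converter_command(template: str, pdf_path: Path, output_dir: Path) -> list[str]:
    """Expand {input} and {output_dir} into an argv list for subprocess.run."""
    # Split first, then substitute, so paths with spaces stay one argument.
    return [
        part.replace("{input}", str(pdf_path)).replace("{output_dir}", str(output_dir))
        for part in shlex.split(template)
    ]


def find_markdown(output_dir: Path) -> Optional[Path]:
    """Pick a markdown file from the converter output, or None if absent."""
    # Assumption: when several .md files exist, the largest is the paper body.
    candidates = sorted(
        output_dir.rglob("*.md"), key=lambda p: p.stat().st_size, reverse=True
    )
    return candidates[0] if candidates else None
```

The resulting list can be passed to `subprocess.run(command, timeout=...)` without `shell=True`, which avoids quoting problems with unusual file names.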

Note: PDF converters are external tools with their own licenses. MinerU is AGPL-3.0; Marker is GPL. This project invokes them as CLI subprocesses and does not link or import their code. The PDF pipeline is entirely optional — all metadata, BibTeX, and citation tools work without it.

Installing MinerU (example setup)

python -m venv ~/.venvs/mineru
source ~/.venvs/mineru/bin/activate
pip install mineru

Then in .env:

PDF_CONVERTER=mineru
PDF_CONVERTER_VENV=~/.venvs/mineru

Caching

API responses and downloaded files are cached under .cache/:

.cache/
  openalex/works/          # OpenAlex work objects (JSON)
  openalex/authors/        # OpenAlex author objects (JSON)
  arxiv/papers/            # arXiv paper entries (JSON)
  arxiv/pdfs/              # Downloaded PDFs
  arxiv/markdown/          # Converted markdown
  arxiv/sections/          # Section indices (JSON)
  biorxiv/papers/          # bioRxiv paper entries (JSON)
  biorxiv/pdfs/            # Downloaded PDFs
  biorxiv/markdown/        # Converted markdown
  biorxiv/sections/        # Section indices (JSON)
  acl_anthology/pdfs/      # Downloaded PDFs
  acl_anthology/markdown/  # Converted markdown
  acl_anthology/sections/  # Section indices (JSON)
  crossref/works/          # Crossref work objects (JSON)
  opencitations/references/ # OpenCitations reference lists (JSON)
  opencitations/citations/ # OpenCitations citation lists (JSON)
  wikipedia/summaries/     # Wikipedia page summaries (JSON)
  manual/pdfs/             # Manually imported PDFs
  manual/markdown/         # Converted markdown
  manual/sections/         # Section indices (JSON)

Cache keys are SHA-256 hashes of canonical identifiers. Writes are atomic (temp file + os.replace) so a crash mid-write can't leave a corrupt entry; corrupt entries from earlier versions self-heal on read. Positive entries have no expiration — delete .cache/ to start fresh. Negative entries (definitive 404s) live in a sibling _neg/ subdirectory under each entity with a 24-hour TTL, so retry-happy agents don't repeatedly hit the network for known-bad identifiers but newly-registered DOIs still surface within a day.
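
The caching scheme described above (hashed keys, atomic writes, a `_neg/` sibling with a 24-hour TTL) can be sketched as below. Function names and the strip-and-lowercase canonicalisation are assumptions for illustration; the project's `cache.py` may canonicalise identifiers and track expiry differently.

```python
import hashlib
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Optional

NEGATIVE_TTL_SECONDS = 24 * 60 * 60


def cache_path(root: Path, entity: str, identifier: str, negative: bool = False) -> Path:
    """Map an identifier to its cache file; negatives live under _neg/."""
    # Assumption: canonical form is the stripped, lowercased identifier.
    key = hashlib.sha256(identifier.strip().lower().encode("utf-8")).hexdigest()
    directory = root / entity / "_neg" if negative else root / entity
    return directory / f"{key}.json"


def write_atomic(path: Path, payload: dict) -> None:
    """Write JSON via a temp file plus os.replace, so readers never see a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.replace(tmp_name, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def is_negative_cached(path: Path, now: Optional[float] = None) -> bool:
    """True if a negative entry exists and is younger than the TTL."""
    try:
        age = (time.time() if now is None else now) - path.stat().st_mtime
    except FileNotFoundError:
        return False
    return age < NEGATIVE_TTL_SECONDS
```

Creating the temp file in the destination directory matters: `os.replace` is only atomic within a single filesystem.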

Development

uv sync                          # Install dependencies
uv run pytest -v                 # Run all tests (485 tests)
uv run pytest tests/test_bibtex.py -v   # Run one test file
uv run pytest -k "test_particle" -v     # Run tests matching a pattern

Architecture

server.py (21 MCP tools; FastMCP lifespan closes pooled clients on shutdown)
  │
  ├── API clients          openalex.py, arxiv.py, biorxiv.py,
  │                        crossref.py, opencitations.py, wikipedia.py,
  │                        acl_anthology.py
  │
  ├── PDF + content        manual.py         (local-file import)
  │                        papers.py         (PDF → markdown → sections;
  │                                           global single-conversion lock;
  │                                           in-paper find_in_markdown)
  │                        cache_search.py   (BM25 over cached markdown)
  │                        bibtex.py         (BibTeX generation)
  │                        _pdf_download.py  (streaming download helper)
  │
  └── Shared infrastructure (every API client routes through these)
        _http.py           one-shot retry, structured errors, backpressure
        _clients.py        per-provider pooled httpx.AsyncClient
        _singleflight.py   concurrent same-key callers coalesce to one fetch
        cache.py           atomic file cache with negative-cache (24h TTL)

Key design decisions:

Lean responses. Tools return only what's needed — not the full API response. An agent calling get_paper_authors doesn't get flooded with unrelated metadata.
One tool per job, auto-routed. The four core paper tools (get_paper_metadata, get_paper_authors, get_paper_abstract, get_paper_bibtex) dispatch on identifier shape rather than forcing the agent to pick between arXiv/bioRxiv/OpenAlex families. Provider-native fields are preserved and tagged with _source.
Batch where it matters. get_papers_metadata collapses N parallel singletons into one HTTP call per 50 OpenAlex DOIs (/works?filter=doi:...|...) plus concurrent fan-out for arXiv / bioRxiv — designed for reference-graph enrichment.
One API hit per entity. All tools for a given DOI share one cached response. Concurrent same-key callers are coalesced by single-flight to one fetch.
Per-provider concurrency. Each provider has its own concurrency cap (arxiv=1 single-connection rule, openalex=4, crossref=3, etc.) — multiple GETs run in flight up to the cap while a brief gap-lock enforces inter-start spacing. Reference-graph traversals are dramatically faster than the previous serialise-everything model.
Persistent connections, transparent retries. Each provider holds one pooled httpx.AsyncClient so TCP+TLS handshakes are reused. Transient failures (5xx, 429, timeouts, network errors) get one in-process retry that honours Retry-After (capped) before surfacing to the agent.
Burst caps with structured backpressure. Each provider refuses to stack more than 5 concurrent callers behind its rate-limit gap. The 6th gets {error, retryable: True, backpressure: True} immediately so the agent learns to slow down rather than waiting silently.
Negative caching for definitive 404s. Known-bad identifiers are cached for 24h so retries don't burn rate budget; transient errors are NOT cached.
Streaming PDF downloads with size guard. PDFs stream chunked to a temp file with atomic rename — peak memory = 64 KiB, not 2× the PDF — and abort mid-stream if MAX_PDF_BYTES (default 200 MB) is exceeded. Force-refresh cascades: re-downloading drops the cached markdown + sections so the next conversion picks up the new bytes.
Single-conversion lock for PDFs. At most one PDF→markdown subprocess runs at a time across the whole server; concurrent callers get a busy error with what's running and how long it's been going.
Count-then-page for large data. Citation and reference tools expose a _count tool so agents can check sizes before fetching. get_paper_references(source="auto") does the survey for you.
Provider-aware routing. Manual imports auto-detect identifier types and store in the correct provider's cache, preventing duplicates.
Subprocess isolation for PDF converters. The PDF pipeline shells out to external tools rather than importing them, keeping the dependency tree light and avoiding license entanglement.
Pre-computed aggregates. List responses include counts (author_count, topic_count, total_sections, etc.) so agents don't need follow-up calls to check sizes.
Structured error hints. Error responses include a suggestion field with recovery guidance (e.g. which search tool to try).
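
The streaming size guard from the list above can be illustrated without any network code. In this sketch the chunks come from a plain iterable; in a real client they would come from a streamed HTTP response. `stream_to_file` and `PdfTooLarge` are hypothetical names, and treating 200 MB as 200 × 1024 × 1024 bytes is an assumption.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable

MAX_PDF_BYTES = 200 * 1024 * 1024  # assumption: "200 MB" read as MiB


class PdfTooLarge(Exception):
    """Raised when a download exceeds the configured size cap."""


def stream_to_file(
    chunks: Iterable[bytes], destination: Path, max_bytes: int = MAX_PDF_BYTES
) -> int:
    """Write chunks to a temp file, abort past max_bytes, then rename into place."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=destination.parent, suffix=".part")
    written = 0
    try:
        with os.fdopen(fd, "wb") as handle:
            for chunk in chunks:
                written += len(chunk)
                if written > max_bytes:
                    # Abort mid-stream; nothing is left at the destination.
                    raise PdfTooLarge(f"exceeded {max_bytes} bytes")
                handle.write(chunk)
        os.replace(tmp_name, destination)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return written
```

Only one chunk is held in memory at a time, which is what keeps peak memory near the chunk size rather than the size of the PDF.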

Versioning

This project uses calendar versioning: each release is named for the day it was cut, in the form YYYY.MM.DD, and tagged vYYYY.MM.DD in git (a rare same-day re-release takes a .postN suffix). See CHANGELOG.md for the release history.

License

MIT — see LICENSE.