academic-tools-mcp
MCP server giving LLM agents lean, identifier-routed tools to look up, read, and cross-reference academic papers across 7 providers (OpenAlex, arXiv, bioRxiv, ACL Anthology, Crossref, OpenCitations, Wikipedia).
An MCP server that gives LLM agents lean, focused tools for working with academic papers. Built on FastMCP.
Look up paper metadata, authors, abstracts, citations, and BibTeX entries. Download and read full paper PDFs section-by-section. Explore reference and citation graphs. Cross-reference with Wikipedia.
Data Sources
| Provider | What it provides | Auth required |
|---|---|---|
| OpenAlex | Paper metadata, authors, abstracts, topics, citations, BibTeX | Optional API key (free) |
| arXiv | Preprint metadata, authors, abstracts, BibTeX, PDF download | None |
| bioRxiv/medRxiv | Preprint metadata, authors, abstracts, BibTeX, PDF download | None |
| ACL Anthology | PDF download for ACL venue papers (ACL, EMNLP, NAACL, etc.) | None |
| Crossref | Reference lists, title search / DOI discovery | Optional email (for polite pool) |
| OpenCitations | Reference and citation links with cross-referenced IDs | None |
| Wikipedia | Article search, summaries, page existence checks | Optional email (for User-Agent) |
All API responses are cached locally. Multiple tool calls for the same paper = one API hit. Concurrent calls for the same paper are coalesced into a single fetch (request single-flight), transient failures (5xx, 429, timeouts) get one transparent retry, and definitive 404s are negative-cached for 24 hours so retry-happy agents don't burn rate budget on guaranteed misses.
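To illustrate the request single-flight idea, here is a minimal asyncio sketch. It is not the project's _singleflight.py: the SingleFlight class, the fetch_paper stub, and the placeholder DOI are hypothetical names chosen for the example.

```python
import asyncio
from typing import Any, Awaitable, Callable


class SingleFlight:
    """Coalesce concurrent calls that share a key into one in-flight fetch."""

    def __init__(self) -> None:
        self._in_flight: dict[str, asyncio.Future] = {}

    async def do(self, key: str, fetch: Callable[[], Awaitable[Any]]) -> Any:
        existing = self._in_flight.get(key)
        if existing is not None:
            # Another caller is already fetching this key: share its result.
            # shield() keeps one follower's cancellation from killing the fetch.
            return await asyncio.shield(existing)

        future = asyncio.get_running_loop().create_future()
        self._in_flight[key] = future
        try:
            result = await fetch()
        except Exception as exc:
            future.set_exception(exc)
            future.exception()  # mark as retrieved even if nobody was waiting
            raise
        else:
            future.set_result(result)
            return result
        finally:
            if not future.done():
                future.cancel()  # the leading caller was cancelled
            del self._in_flight[key]


async def demo() -> tuple[list[str], int]:
    calls = 0

    async def fetch_paper() -> str:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # stand-in for the network round trip
        return "metadata"

    flight = SingleFlight()
    # "10.0000/example" is a placeholder DOI, not a real paper.
    results = await asyncio.gather(
        *(flight.do("10.0000/example", fetch_paper) for _ in range(5))
    )
    return list(results), calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Five concurrent callers for the same key share one fetch. A call that arrives after the first one finishes would fetch again unless a cache sits in front of it, which is why coalescing and caching are described together above.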
Setup
Requires Python 3.11+ and uv.
git clone https://github.com/hunter-heidenreich/academic-tools-mcp.git
cd academic-tools-mcp
uv sync
cp .env.example .env # then edit .env with your values
Configuration
All configuration is via environment variables in .env. Nothing is required to get started, but some variables unlock higher rate limits.
| Variable | Required | Description |
|---|---|---|
| OPENALEX_API_KEY | No | Free API key from openalex.org |
| OPENALEX_MAILTO | No | Your email; gets you into the polite pool (faster) |
| CROSSREF_MAILTO | No | Your email; gets you into the Crossref polite pool (10 req/sec vs 5) |
| WIKIPEDIA_MAILTO | No | Your email; required by Wikimedia policy for the User-Agent header |
| PDF_CONVERTER | No | PDF-to-markdown backend: mineru (default), marker, or a custom command (see PDF Pipeline) |
| PDF_CONVERTER_VENV | No | Path to a virtualenv to activate before running the converter (e.g. ~/.venvs/mineru) |
| PDF_CONVERT_TIMEOUT | No | Hard timeout for a single PDF→markdown conversion in seconds (default 1800 = 30 min). Set to none / off / disabled to disable. |
Usage
With Claude Code
Add to your MCP config (~/.claude/claude_code_config.json):
{
  "mcpServers": {
    "academic-tools": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/academic-tools-mcp", "python", "-m", "academic_tools_mcp.server"]
    }
  }
}
Standalone
uv run python -m academic_tools_mcp.server
FastMCP CLI
uv run fastmcp run src/academic_tools_mcp/server.py:mcp
Tools
Papers (unified, auto-routed)
| Tool | Description |
|---|---|
| get_paper_metadata | Title, dates, venue / categories, identifiers; shape varies by _source. Optional follow_published=True auto-chains a bioRxiv preprint to its journal version on OpenAlex when one exists. |
| get_papers_metadata | Bulk metadata for many identifiers at once. OpenAlex DOIs collapse into one batched HTTP call per 50; arXiv / bioRxiv fan out concurrently. Designed for reference-graph enrichment after get_paper_references. Cap 100 per call. |
| get_paper_authors | Author list with source-appropriate detail (affiliations, corresponding author, OpenAlex IDs) |
| get_paper_abstract | Plain text abstract |
| get_paper_bibtex | Ready-to-paste BibTeX entry |
Pass an arXiv ID (2301.00001, hep-th/9901001) or any DOI — including bioRxiv/medRxiv (10.1101/...), ACL Anthology (10.18653/v1/...), or generic publisher DOIs. Each response carries a _source field ("arxiv" / "biorxiv" / "openalex") so you know which provider answered and which fields to expect. arXiv IDs always route to arXiv; bioRxiv DOIs route to bioRxiv; everything else (including ACL) routes to OpenAlex.
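The routing rule can be sketched as a function of identifier shape. This is a minimal sketch under stated assumptions: route_identifier and the two regular expressions are hypothetical, and the real server may recognise more forms (URLs, version suffixes, DOI prefixes such as "doi:") than shown here.

```python
import re

# New-style arXiv IDs such as 2301.00001 (optionally with a version suffix).
ARXIV_NEW = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
# Old-style arXiv IDs such as hep-th/9901001 or math.GT/0309136.
ARXIV_OLD = re.compile(r"^[a-z\-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")


def route_identifier(identifier: str) -> str:
    """Return the provider name that would answer for this identifier."""
    ident = identifier.strip()
    if ARXIV_NEW.match(ident) or ARXIV_OLD.match(ident):
        return "arxiv"
    # The 10.1101/ prefix is taken from the text above as the bioRxiv/medRxiv marker.
    if ident.lower().startswith("10.1101/"):
        return "biorxiv"
    # Everything else, including ACL Anthology DOIs, falls through to OpenAlex.
    return "openalex"


if __name__ == "__main__":
    # The DOIs below are illustrative placeholders, not real papers.
    for example in ("2301.00001", "hep-th/9901001",
                    "10.1101/2020.01.01.000000", "10.18653/v1/example"):
        print(example, "->", route_identifier(example))
```

The return value corresponds to the _source field an agent would see in the response.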
| Tool | Description |
|---|---|
| search_arxiv | Search arXiv with field prefixes (ti:, au:, abs:, cat:) and boolean operators |
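For illustration, a query using those field prefixes can be assembled like this. build_arxiv_query is a hypothetical helper, not part of this server, and it only joins terms with AND; how search_arxiv names its query argument is not shown here.

```python
ALLOWED_PREFIXES = {"ti", "au", "abs", "cat"}


def build_arxiv_query(**fields: str) -> str:
    """AND together field-prefixed terms, quoting multi-word phrases."""
    parts = []
    for prefix, term in fields.items():
        if prefix not in ALLOWED_PREFIXES:
            raise ValueError(f"unsupported field prefix: {prefix}")
        if " " in term:
            term = f'"{term}"'  # keep a phrase together
        parts.append(f"{prefix}:{term}")
    return " AND ".join(parts)


if __name__ == "__main__":
    print(build_arxiv_query(ti="graph neural networks", cat="cs.LG"))
```

The resulting string, for example ti:"graph neural networks" AND cat:cs.LG, is the kind of query the tool description refers to.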
Authors
| Tool | Description |
|---|---|
| get_author | Name, ORCID, institutions (current + historical with years), h-index, i10-index, works/citation counts, top topics |
Accepts OpenAlex author IDs (from get_paper_authors) or ORCIDs.
PDF pipeline (unified)
| Tool | Description |
|---|---|
| download_pdf | Download and cache the PDF; auto-detects arXiv, ACL Anthology, bioRxiv/medRxiv. Streams chunks to disk (peak memory = 64 KiB) and aborts mid-stream if the response would exceed MAX_PDF_BYTES (default 200 MB). Re-downloading with force_refresh=True cascades: the cached markdown + section index are dropped automatically so the next convert_paper picks up the new bytes. |
| convert_paper | Convert PDF to markdown, parse into sections (slow: tens of minutes; PDF_CONVERT_TIMEOUT caps it at 30 min by default). The server runs at most one conversion at a time across all callers; a second concurrent caller gets {busy: True, retryable: True, in_progress: {...}} immediately rather than queueing |
| get_paper_sections | Section index with titles, sub-heading previews, token counts |
| get_paper_section | Markdown of a section (by index or title substring); truncated by default (16000 chars) |
| find_in_paper | Substring (or whole-word) search inside one converted paper. Returns each hit's section + char offset + ~120-char snippet. Char offsets align with get_paper_section's stripped text so you can chain straight to the surrounding context. |
All of these tools accept any identifier (arXiv ID, DOI, or freeform label) and auto-route to the correct provider's cache namespace. For papers not hosted on arXiv/ACL/bioRxiv, fetch the PDF yourself and hand it to import_paper; see Manual import below.
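The "one conversion at a time, no queueing" behaviour of convert_paper can be sketched with an asyncio lock. This is an illustration only: ConversionGate, slow_convert, and the keys inside in_progress (identifier, elapsed_seconds) are hypothetical; the source only says the busy response reports what is running and for how long.

```python
import asyncio
import time
from typing import Awaitable, Callable


class ConversionGate:
    """Allow one conversion at a time; everyone else gets a busy response."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._current: dict | None = None

    async def run(self, identifier: str,
                  convert: Callable[[str], Awaitable[str]]) -> dict:
        if self._lock.locked() and self._current is not None:
            # Do not queue: report what is running and for how long.
            elapsed = time.monotonic() - self._current["started"]
            return {
                "busy": True,
                "retryable": True,
                "in_progress": {  # hypothetical field names
                    "identifier": self._current["identifier"],
                    "elapsed_seconds": round(elapsed, 2),
                },
            }
        async with self._lock:
            self._current = {"identifier": identifier, "started": time.monotonic()}
            try:
                return {"busy": False, "result": await convert(identifier)}
            finally:
                self._current = None


async def demo() -> list[dict]:
    gate = ConversionGate()

    async def slow_convert(identifier: str) -> str:
        await asyncio.sleep(0.05)  # stand-in for a long converter subprocess
        return f"markdown for {identifier}"

    first = asyncio.create_task(gate.run("2301.00001", slow_convert))
    await asyncio.sleep(0.01)  # give the first conversion time to start
    second = await gate.run("hep-th/9901001", slow_convert)
    return [await first, second]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

An agent that receives busy with retryable set can work on something else and call convert_paper again later, rather than holding a connection open for tens of minutes.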
References and citations (DOI required)
| Tool | Description |
|---|---|
| get_paper_references_count | Survey outgoing-reference coverage across both Crossref and OpenCitations in one call; returns per-source counts so you can pick which to page through |
| get_paper_references | Paginated outgoing references. Default source="auto" surveys both Crossref and OpenCitations in parallel and pages from whichever has more; pass source="crossref" for structured metadata or source="opencitations" for broader DOI coverage to skip the survey |
| get_paper_citations_count | Number of incoming citations (OpenCitations) |
| get_paper_citations | Paginated incoming citations with DOIs, dates, self-citation flags, and cross-referenced IDs (OpenCitations) |
| search_crossref_by_title | DOI discovery by bibliographic query (also works for bioRxiv papers); each hit warms the works cache so a follow-up get_paper_metadata(doi) is free |
For citations, follow the count-then-page pattern: call get_paper_citations_count first to see the total, then page through with page and page_size. For references the source="auto" default does the survey for you on the first call. Paginated responses include _source (on references) and has_more so agents know which shape to expect and when to stop. This prevents token blowouts on papers with long bibliographies or many citations.
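The count-then-page pattern might look like the following on the agent side. This is a sketch against stand-in functions, not the real tools: iter_citations, fake_count, and fake_page are hypothetical, the "items" key and 1-indexed pages are assumptions, and only page, page_size, and has_more come from the description above.

```python
from typing import Callable, Iterator

# Stand-in data so the sketch runs offline; the DOIs are placeholders.
FAKE_CITATIONS = [{"doi": f"10.0000/fake.{i}"} for i in range(120)]


def fake_count(doi: str) -> int:
    return len(FAKE_CITATIONS)


def fake_page(doi: str, page: int, page_size: int) -> dict:
    start = (page - 1) * page_size
    return {
        "items": FAKE_CITATIONS[start:start + page_size],
        "has_more": start + page_size < len(FAKE_CITATIONS),
    }


def iter_citations(doi: str,
                   get_count: Callable[[str], int],
                   get_page: Callable[..., dict],
                   page_size: int = 50,
                   max_items: int = 200) -> Iterator[dict]:
    """Check the total first, then page until has_more is false or the budget is spent."""
    budget = min(get_count(doi), max_items)
    yielded, page = 0, 1
    while yielded < budget:
        response = get_page(doi, page=page, page_size=page_size)
        if not response["items"]:
            break  # defensive: an empty page ends the walk
        for item in response["items"]:
            if yielded >= budget:
                return
            yield item
            yielded += 1
        if not response["has_more"]:
            break
        page += 1


if __name__ == "__main__":
    print(len(list(iter_citations("10.0000/example", fake_count, fake_page))))
```

The max_items budget is the part that protects the context window: the agent decides how much it can afford after seeing the count, instead of discovering the size halfway through.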
Source trade-off for references: Crossref returns structured reference metadata (author, title, year, journal, DOI) when publishers deposit it; quality varies. OpenCitations aggregates from Crossref, PubMed, DataCite, OpenAIRE, and JaLC — it may have entries Crossref lacks, but returns DOI-to-DOI links only (no bibliographic metadata).
Manual import
| Tool | Description |
|---|---|
| import_paper | Import a local .pdf (e.g. from Zotero or a file you downloaded) or pre-converted .md/.markdown with a user-supplied identifier. File type is detected by extension. |
For PDFs outside arXiv/bioRxiv/ACL, fetch the file yourself (browser, curl, publisher portal, institutional proxy) and then call import_paper — the server deliberately does not download arbitrary URLs.
After importing a PDF, use the unified pipeline tools (convert_paper → get_paper_sections → get_paper_section) with the same identifier. Markdown imports skip the conversion step and go straight to get_paper_sections / get_paper_section.
Provider-aware routing: if the identifier is an arXiv ID, bioRxiv DOI, or ACL DOI, the file is stored in that provider's cache namespace automatically. A subsequent download_pdf("2301.00001") will find an already-imported PDF — no duplicates.
Wikipedia
| Tool | Description |
|---|---|
| search_wikipedia | Search for articles matching a query |
| get_wikipedia_summary | Title, description, extract, URL, and page type (standard / disambiguation); errors if the page doesn't exist |
PDF Pipeline
The PDF-to-markdown pipeline converts downloaded PDFs into section-level markdown that agents can read piece by piece, avoiding token blowouts from dumping entire papers into context.
The pipeline is converter-agnostic. Set PDF_CONVERTER in .env to choose your backend:
# Named backends
PDF_CONVERTER=mineru # default — https://github.com/opendatalab/MinerU
PDF_CONVERTER=marker # https://github.com/datalab-to/marker
# Custom command template — use {input} and {output_dir} placeholders
PDF_CONVERTER=my-tool --in "{input}" --out "{output_dir}"
If your converter lives in a virtualenv, set PDF_CONVERTER_VENV:
PDF_CONVERTER_VENV=~/.venvs/mineru
The converter must accept a PDF input path and an output directory, and produce one or more .md files in that directory. The pipeline finds the markdown file automatically.
Note: PDF converters are external tools with their own licenses. MinerU is AGPL-3.0; Marker is GPL. This project invokes them as CLI subprocesses and does not link or import their code. The PDF pipeline is entirely optional — all metadata, BibTeX, and citation tools work without it.
Installing MinerU (example setup)
python -m venv ~/.venvs/mineru
source ~/.venvs/mineru/bin/activate
pip install mineru
Then in .env:
PDF_CONVERTER=mineru
PDF_CONVERTER_VENV=~/.venvs/mineru
Caching
API responses and downloaded files are cached under .cache/:
.cache/
  openalex/works/            # OpenAlex work objects (JSON)
  openalex/authors/          # OpenAlex author objects (JSON)
  arxiv/papers/              # arXiv paper entries (JSON)
  arxiv/pdfs/                # Downloaded PDFs
  arxiv/markdown/            # Converted markdown
  arxiv/sections/            # Section indices (JSON)
  biorxiv/papers/            # bioRxiv paper entries (JSON)
  biorxiv/pdfs/              # Downloaded PDFs
  biorxiv/markdown/          # Converted markdown
  biorxiv/sections/          # Section indices (JSON)
  acl_anthology/pdfs/        # Downloaded PDFs
  acl_anthology/markdown/    # Converted markdown
  acl_anthology/sections/    # Section indices (JSON)
  crossref/works/            # Crossref work objects (JSON)
  opencitations/references/  # OpenCitations reference lists (JSON)
  opencitations/citations/   # OpenCitations citation lists (JSON)
  wikipedia/summaries/       # Wikipedia page summaries (JSON)
  manual/pdfs/               # Manually imported PDFs
  manual/markdown/           # Converted markdown
  manual/sections/           # Section indices (JSON)
Cache keys are SHA-256 hashes of canonical identifiers. Writes are atomic (temp file + os.replace) so a crash mid-write can't leave a corrupt entry; corrupt entries from earlier versions self-heal on read. Positive entries have no expiration — delete .cache/ to start fresh. Negative entries (definitive 404s) live in a sibling _neg/ subdirectory under each entity with a 24-hour TTL, so retry-happy agents don't repeatedly hit the network for known-bad identifiers but newly-registered DOIs still surface within a day.
Development
uv sync # Install dependencies
uv run pytest -v # Run all tests (485 tests)
uv run pytest tests/test_bibtex.py -v # Run one test file
uv run pytest -k "test_particle" -v # Run tests matching a pattern
Architecture
server.py  (21 MCP tools; FastMCP lifespan closes pooled clients on shutdown)
│
├── API clients        openalex.py, arxiv.py, biorxiv.py,
│                      crossref.py, opencitations.py, wikipedia.py,
│                      acl_anthology.py
│
├── PDF + content      manual.py         (local-file import)
│                      papers.py         (PDF → markdown → sections;
│                                         global single-conversion lock;
│                                         in-paper find_in_markdown)
│                      cache_search.py   (BM25 over cached markdown)
│                      bibtex.py         (BibTeX generation)
│                      _pdf_download.py  (streaming download helper)
│
└── Shared infrastructure  (every API client routes through these)
                       _http.py          one-shot retry, structured errors, backpressure
                       _clients.py       per-provider pooled httpx.AsyncClient
                       _singleflight.py  concurrent same-key callers coalesce to one fetch
                       cache.py          atomic file cache with negative-cache (24h TTL)
Key design decisions:
- Lean responses. Tools return only what's needed, not the full API response. An agent calling get_paper_authors doesn't get flooded with unrelated metadata.
- One tool per job, auto-routed. The four core paper tools (get_paper_metadata, get_paper_authors, get_paper_abstract, get_paper_bibtex) dispatch on identifier shape rather than forcing the agent to pick between arXiv/bioRxiv/OpenAlex families. Provider-native fields are preserved and tagged with _source.
- Batch where it matters. get_papers_metadata collapses N parallel singletons into one HTTP call per 50 OpenAlex DOIs (/works?filter=doi:...|...) plus concurrent fan-out for arXiv / bioRxiv; designed for reference-graph enrichment.
- One API hit per entity. All tools for a given DOI share one cached response. Concurrent same-key callers are coalesced by single-flight to one fetch.
- Per-provider concurrency. Each provider has its own concurrency cap (arxiv=1 single-connection rule, openalex=4, crossref=3, etc.); multiple GETs run in flight up to the cap while a brief gap-lock enforces inter-start spacing. Reference-graph traversals are dramatically faster than the previous serialise-everything model.
- Persistent connections, transparent retries. Each provider holds one pooled httpx.AsyncClient so TCP+TLS handshakes are reused. Transient failures (5xx, 429, timeouts, network errors) get one in-process retry that honours Retry-After (capped) before surfacing to the agent.
- Burst caps with structured backpressure. Each provider refuses to stack more than 5 concurrent callers behind its rate-limit gap. The 6th gets {error, retryable: True, backpressure: True} immediately so the agent learns to slow down rather than waiting silently.
- Negative caching for definitive 404s. Known-bad identifiers are cached for 24h so retries don't burn rate budget; transient errors are NOT cached.
- Streaming PDF downloads with size guard. PDFs stream chunked to a temp file with atomic rename (peak memory = 64 KiB, not 2× the PDF) and abort mid-stream if MAX_PDF_BYTES (default 200 MB) is exceeded. Force-refresh cascades: re-downloading drops the cached markdown + sections so the next conversion picks up the new bytes.
- Single-conversion lock for PDFs. At most one PDF→markdown subprocess runs at a time across the whole server; concurrent callers get a busy error with what's running and how long it's been going.
- Count-then-page for large data. Citation and reference tools expose a _count tool so agents can check sizes before fetching. get_paper_references(source="auto") does the survey for you.
- Provider-aware routing. Manual imports auto-detect identifier types and store in the correct provider's cache, preventing duplicates.
- Subprocess isolation for PDF converters. The PDF pipeline shells out to external tools rather than importing them, keeping the dependency tree light and avoiding license entanglement.
- Pre-computed aggregates. List responses include counts (author_count, topic_count, total_sections, etc.) so agents don't need follow-up calls to check sizes.
- Structured error hints. Error responses include a suggestion field with recovery guidance (e.g. which search tool to try).
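Of the decisions above, the burst cap with structured backpressure is perhaps the least obvious, so here is a minimal asyncio sketch of the idea. It is not the project's _http.py: BurstGate, the "provider busy" message, and the choice to count only queued callers (not the ones already running) are assumptions, and the rate-limit gap itself is left out.

```python
import asyncio
from typing import Any, Awaitable, Callable


class BurstGate:
    """Cap in-flight requests and refuse callers once the queue is full."""

    def __init__(self, concurrency: int, max_waiting: int = 5) -> None:
        self._slots = asyncio.Semaphore(concurrency)
        self._max_waiting = max_waiting
        self._waiting = 0

    async def call(self, fetch: Callable[[], Awaitable[Any]]) -> Any:
        if self._slots.locked() and self._waiting >= self._max_waiting:
            # Fail fast with a structured hint instead of queueing silently.
            return {"error": "provider busy", "retryable": True, "backpressure": True}
        self._waiting += 1
        try:
            await self._slots.acquire()
        finally:
            self._waiting -= 1
        try:
            return await fetch()
        finally:
            self._slots.release()


async def demo() -> list[Any]:
    gate = BurstGate(concurrency=1)  # e.g. a single-connection provider

    async def fetch() -> str:
        await asyncio.sleep(0.01)  # stand-in for an HTTP GET
        return "ok"

    # One caller runs, five queue behind it, the remaining two are refused.
    return list(await asyncio.gather(*(gate.call(fetch) for _ in range(8))))


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Returning a dictionary rather than raising keeps the signal inside the tool response, where an agent can read retryable and backpressure and decide to slow down.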
Versioning
This project uses calendar versioning: each release is named for the day it
was cut — YYYY.MM.DD, tagged vYYYY.MM.DD in git (a rare same-day re-release
takes a .postN suffix). See CHANGELOG.md for the release
history.
License
MIT — see LICENSE.