scholar-megasearch

Massive multi-source academic literature search for Claude Code.
One skill that fans out subagents across 20+ scholarly databases, merges everything into a single deduplicated corpus, and acquires the original PDFs.

Korean README (한국어) · SKILL.md · Source catalog · Orchestration


One sentence: ask Claude Code for a topic, and get back a single ranked,
deduplicated, provenance-tracked corpus of papers — with the PDFs already on disk.

🔭 20+ databases in one pass — arXiv, Semantic Scholar, Crossref, OpenAlex, PubMed/PMC, bioRxiv/medRxiv, DOAJ, CORE, BASE, OpenAIRE, Zenodo, Unpaywall, HAL, DBLP, IACR, SSRN, CiteSeerX, Europe PMC, plus web/GitHub.
🧵 Subagent fan-out — one searcher per source bucket, running in parallel, so breadth doesn't cost you serial wall-clock.
🧹 Dedup with provenance — merged by DOI → arXiv-id → normalized title; every paper records which databases surfaced it.
📊 Corroboration ranking — papers found by more independent databases rank higher, not just the ones with good SEO.
📄 Original PDFs — top-K acquired automatically via open-access routes, with a manifest of what landed and what needs a paywall fallback.
🧭 Domain-aware routing — physics, life sciences, CS, crypto, economics, or math each pull the right subset of databases.

Why It Exists

Searching one database at a time is how good papers get missed. arXiv won't show
you the published version's citation count; Semantic Scholar won't surface the
bioRxiv preprint; Google Scholar won't tell you which of its hits also appear in
three other indexes. So you end up running the same query in five tabs, copy-pasting
into a doc, hand-deduplicating, and then hunting each PDF down separately.

scholar-megasearch collapses that into one request. It treats "search the
literature" as a fan-out problem: decompose the topic, send each source bucket to
its own subagent, and reconcile everything afterward. The output isn't a chat reply —
it's a corpus on disk where every entry is deduplicated, ranked by how many
independent databases corroborate it, and backed by a downloaded PDF wherever a free
route exists.

How It Works

scholar-megasearch pipeline: topic → decompose into facets → fan out one subagent per source bucket → merge_corpus.py (dedup + rank) → fetch_pdfs.py → synthesize

Orchestration runs as a deterministic Workflow when available, and falls back to
direct Agent fan-out otherwise. A domain → bucket routing table picks the right
4–7 buckets per topic.
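The fan-out step can be pictured with a small Python analogy. This is only a sketch: in the real skill each searcher is a Claude Code subagent, not a thread, and `search_bucket` is a hypothetical stand-in that returns fake hits so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def search_bucket(bucket: str, facets: list[str]) -> list[dict]:
    """Hypothetical stand-in for one searcher: one fake hit per facet."""
    return [{"title": f"{facet} ({bucket})", "sources": [bucket]} for facet in facets]


def fan_out(facets: list[str], buckets: list[str]) -> dict[str, list[dict]]:
    """Run one searcher per bucket in parallel; collect the hits keyed by bucket."""
    with ThreadPoolExecutor(max_workers=max(1, len(buckets))) as pool:
        futures = {b: pool.submit(search_bucket, b, facets) for b in buckets}
        return {b: f.result() for b, f in futures.items()}


raw = fan_out(["spin-orbit torque", "ferrimagnet switching"], ["A", "B", "C"])
```

Each value in `raw` plays the role of one raw/<bucket>.json file, which the merge step then reconciles.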

Dedup & ranking

Records from different databases are merged when they share any of: a normalized
DOI, an arXiv id (version-stripped), or a normalized title. The merged record keeps
the richest value per field — the longest abstract, the most complete author list,
the maximum citation count — and accumulates the set of sources that found it.
Ranking is then (number of sources, citation count, year), descending. Pass
--min-sources 2 to keep only papers corroborated by two or more databases — a
high-precision shortlist that filters out single-index noise.
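The merge rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in merge_corpus.py: the field names (`doi`, `arxiv_id`, `title`, `abstract`, `authors`, `citations`, `year`, `sources`) and the exact normalization rules are hypothetical, and "most complete author list" is approximated by the longest list.

```python
import re


def norm_doi(doi):
    """Lowercase a DOI and strip a resolver prefix (assumed normalization)."""
    if not doi:
        return None
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi.strip().lower())


def norm_arxiv(arxiv_id):
    """Strip the version suffix, e.g. 2101.00001v3 -> 2101.00001."""
    return re.sub(r"v\d+$", "", arxiv_id.strip().lower()) if arxiv_id else None


def norm_title(title):
    """Lowercase and collapse punctuation/whitespace runs (assumed normalization)."""
    return (re.sub(r"[\W_]+", " ", title.lower()).strip() or None) if title else None


def record_keys(rec):
    """All match keys a record exposes; sharing any one key means 'same paper'."""
    pairs = [("doi", norm_doi(rec.get("doi"))),
             ("arxiv", norm_arxiv(rec.get("arxiv_id"))),
             ("title", norm_title(rec.get("title")))]
    return [p for p in pairs if p[1]]


def combine(into, rec):
    """Keep the richest value per field and accumulate the set of sources."""
    for field in ("doi", "arxiv_id", "title", "year"):
        into[field] = into.get(field) or rec.get(field)
    if len(rec.get("abstract") or "") > len(into.get("abstract") or ""):
        into["abstract"] = rec["abstract"]
    if len(rec.get("authors") or []) > len(into.get("authors") or []):
        into["authors"] = rec["authors"]
    into["citations"] = max(into.get("citations") or 0, rec.get("citations") or 0)
    into["sources"] = sorted(set(into.get("sources") or []) | set(rec.get("sources") or []))


def merge_corpus(records, min_sources=1):
    clusters, index = [], {}  # index maps a match key to a cluster position
    for rec in records:
        keys = record_keys(rec)
        hits = sorted({index[k] for k in keys if k in index})
        if hits:
            target = hits[0]
            for other in hits[1:]:  # this record bridges earlier clusters: fold them
                combine(clusters[target], clusters[other])
                clusters[other] = None
            if len(hits) > 1:
                absorbed = set(hits[1:])
                for k, pos in index.items():
                    if pos in absorbed:
                        index[k] = target
        else:
            target = len(clusters)
            clusters.append({})
        combine(clusters[target], rec)
        for k in keys:
            index[k] = target
    kept = [c for c in clusters if c and len(c["sources"]) >= min_sources]
    # Rank by (number of sources, citation count, year), descending.
    return sorted(kept, key=lambda c: (len(c["sources"]), c["citations"], c.get("year") or 0),
                  reverse=True)
```

Calling `merge_corpus(records, min_sources=2)` mirrors the `--min-sources 2` shortlist: only entries found by at least two sources survive.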

A typical run

A mid-depth sweep (≈ L3 Deep) on a focused topic looks roughly like this (illustrative):

topic: "spin–orbit torque switching in ferrimagnets"
  facets:  6 subqueries        buckets: A B C E G (5 searchers)
  raw hits: ~310 across buckets
  unique:   ~150 after dedup   (≈60 corroborated by ≥2 databases)
  PDFs:     22 / 25 acquired   (3 flagged needs_mcp — paywalled)
  output:   ./literature_search/spin-orbit-torque-ferri_2026-05-29/

Install

git clone https://github.com/TaewoooPark/scholar-megasearch.git
cd scholar-megasearch
bash setup/install.sh you@example.com      # your email: used for Unpaywall OA + arXiv politeness

The script installs the skills into ~/.claude/skills/, builds
~/.claude/skill_venv and ~/.claude/paper_search_mcp_venv, installs the local MCP
servers, and writes setup/mcp.servers.resolved.json. The local servers are
paper-search-mcp (from git main, because the PyPI build lacks Crossref/OpenAlex)
and arxiv-mcp-server (via uvx).

Semantic Scholar (Bucket B) is the remote Ai2 Asta MCP: there is nothing to
install, and it works without a key, subject to rate limits. For higher rate
limits, request a free key and add a header to the asta entry:
"headers": { "x-api-key": "YOUR_ASTA_KEY" }. Paste the literal key; a ${ENV}
placeholder is sent verbatim and rejected with HTTP 403. Asta use is subject to
Ai2's terms (see Attribution).

Finally, merge the mcpServers entries from setup/mcp.servers.resolved.json into
~/.claude.json and restart Claude Code.
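The final merge into ~/.claude.json can be done by hand or scripted. The sketch below assumes both files keep their servers under a top-level `mcpServers` object; `merge_mcp_servers` and `merge_files` are hypothetical helpers, not part of the installer. Back up ~/.claude.json before trying anything like this.

```python
import json
from pathlib import Path


def merge_mcp_servers(config: dict, resolved: dict) -> dict:
    """Return a copy of config with resolved['mcpServers'] added or overwritten."""
    merged = dict(config)
    servers = dict(config.get("mcpServers", {}))
    servers.update(resolved.get("mcpServers", {}))
    merged["mcpServers"] = servers
    return merged


def merge_files(config_path: Path, resolved_path: Path) -> None:
    """Read both JSON files, merge the server entries, and rewrite the config."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    resolved = json.loads(resolved_path.read_text())
    merged = merge_mcp_servers(config, resolved)
    config_path.write_text(json.dumps(merged, indent=2) + "\n")


# Example call, run from the repository root:
# merge_files(Path.home() / ".claude.json", Path("setup/mcp.servers.resolved.json"))
```

Entries with the same name are overwritten by the resolved file; every other setting in the config is left untouched.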

Requirements


Requirement	Version / purpose
Python	3.11+
uv	for uvx arxiv-mcp-server
git	pip install of paper-search-mcp (git main) at install time
Claude Code	the skill is triggered from within a session

Usage

Inside Claude Code, trigger the skill in natural language:

search every database for spin–orbit torque switching and grab the PDFs
MoE 관련 최근 1년 논문 방대하게 검색해줘, PDF까지    # Korean: "run a massive search for MoE papers from the last year, PDFs included"

Or invoke it as a slash command, optionally pinning the depth level (see
Depth levels): prepend depth=N, LN, or a bare 1–5, or use a phrase such as
quick or 전수조사 (Korean for "exhaustive survey"). Everything after the command
and the optional depth prefix is the topic:

/scholar-megasearch depth=4 spin–orbit torque switching in ferrimagnets
/scholar-megasearch L5 altermagnetism candidate materials    # L5 = grab every source's PDFs
/scholar-megasearch quick first look at skyrmion racetrack memory
/scholar-megasearch 전수조사 위상 절연체 표면 상태 측정         # 전수조사 ("exhaustive survey") → L5; topic: measuring topological-insulator surface states
/scholar-megasearch MoE routing papers from the last year       # no level → defaults to L2

Or run the scripts directly:

# merge per-source result files into one ranked corpus
python3 ~/.claude/skills/scholar-megasearch/scripts/merge_corpus.py \
  ./literature_search/<topic>_<date>/raw \
  -o corpus.json --md corpus.md --min-sources 2

# acquire original PDFs for the top 25 ranked papers
python3 ~/.claude/skills/scholar-megasearch/scripts/fetch_pdfs.py \
  corpus.json -o ./pdfs --email [email protected] --top 25

Depth levels

One knob scales breadth (facets × buckets × hits per query) and recursion
(extra waves) together. Pick a level per run. An explicit depth=N / LN / bare
1–5 wins; otherwise the level is inferred from phrasing (quick/빠르게 ("quickly") → L1 …
every source/전수조사 ("exhaustive survey") → L5); otherwise it defaults to L2.

Level	Facets	Buckets	Hits/query	Waves	PDFs	Output
L1 · Quick	3	4	15	wave 1 only	top 10	corpus
L2 · Standard (default)	5	5	25	wave 1 only	top 30	corpus
L3 · Deep	6	6	30	+ citation snowball	top 50	corpus
L4 · Exhaustive	8	7 (all)	40	+ snowball + completeness-critic pass	top 100	corpus + ≥2 shortlist
L5 · Total (전수조사)	8	7 (all)	40	+ snowball + critic loop-until-dry	all sources	corpus + ≥2 shortlist

Each wave is a fan-out followed by a merge into the same corpus. The citation
snowball (L3+) seeds the top DOIs/arXiv ids back through citation graphs. The
completeness-critic (L4+) names missing subtopics/authors, which become the next
wave's facets; at L5 this loops until dry, that is, until the critic stops
surfacing new gaps. L4/L5 also emit a --min-sources 2 shortlist. Higher levels
spawn more subagents and cost more tokens, and L5 is bounded only by the token
budget. PDF acquisition scales with the level too: fetch_pdfs.py --top is
10 / 30 / 50 / 100 for L1–L4, and L5 fetches all (the whole corpus).
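The precedence rule for picking a level (explicit prefix, then phrasing, then the L2 default) can be sketched as below. This is a hypothetical illustration, not the skill's parser: `resolve_depth` is an invented name, the phrase table holds only the four mappings the text names (the elided middle ones are left out), and phrase detection is naive substring matching that leaves the topic unchanged.

```python
import re

# Only the phrase mappings named in the text; the elided ones are not guessed.
PHRASES = {"quick": 1, "빠르게": 1, "every source": 5, "전수조사": 5}
DEFAULT_LEVEL = 2

# fetch_pdfs.py --top per level; None stands for "the whole corpus" at L5.
TOP_K = {1: 10, 2: 30, 3: 50, 4: 100, 5: None}


def resolve_depth(request: str) -> tuple[int, str]:
    """Return (level, topic) for the text that follows the slash command."""
    text = request.strip()
    # 1. An explicit prefix wins: depth=N, LN, or a bare 1-5.
    m = re.match(r"(?:depth=|[Ll])?([1-5])(?:\s+|$)", text)
    if m:
        return int(m.group(1)), text[m.end():]
    # 2. Otherwise infer the level from phrasing (naive substring match).
    lowered = text.lower()
    for phrase, level in PHRASES.items():
        if phrase in lowered:
            return level, text
    # 3. Otherwise fall back to the default.
    return DEFAULT_LEVEL, text


level, topic = resolve_depth("depth=4 spin–orbit torque switching in ferrimagnets")
```

A production parser would want word-boundary checks so that a topic such as "quicksort" does not trigger L1.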

Outputs

Everything lands under ./literature_search/<topic>_<date>/ in the working directory:

literature_search/<topic>_<date>/
├── raw/<bucket>.json     # per-source hits (one file per subagent)
├── corpus.json           # deduplicated, ranked, provenance-tracked corpus
├── corpus.md             # human-readable digest
├── pdfs/                  # acquired original PDFs + manifest.json
└── summary.md            # synthesized review

PDFs are named NN_<slug>.pdf by their corpus.json rank, and summary.md numbers each
paper [#NN] with the same rank — so a summary entry maps directly to its pdfs/NN_*.pdf
file and its manifest.json row.

Source Buckets

Bucket	Databases
A · Preprints	arXiv (search · semantic · citation graph)
B · Citations	Semantic Scholar via Ai2 Asta (official MCP) + paper-search-mcp
C · DOI / published	Crossref, OpenAlex
D · Life sciences	PubMed, PMC, bioRxiv, medRxiv, Europe PMC
E · Open access	DOAJ, CORE, BASE, OpenAIRE, Zenodo, Unpaywall, HAL
F · Domain	DBLP (CS), IACR (crypto), SSRN (econ/law), CiteSeerX
G · Web	DuckDuckGo, GitHub, crawl4ai / firecrawl

Domain → bucket routing

Topic domain	Always	Plus
Physics / materials / cond-mat	A · B · C	E · G
CS / ML / systems	A · B · F (DBLP)	C · G (GitHub)
Biology / medicine / neuro	D · B · C	E
Cryptography / security	A · F (IACR) · B	G (GitHub)
Economics / social science / law	F (SSRN) · B · C	G
Math	A · B · C	F
Interdisciplinary / unknown	A · B · C · D	E · G
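The routing table maps naturally onto a lookup structure. The sketch below encodes the rows above; the short domain keys and the `pick_buckets` helper are hypothetical, and how the "Plus" buckets interact with a level's bucket budget is an assumption here (they are appended in table order until the budget is reached).

```python
# (always, plus) bucket lists per domain, copied from the routing table.
ROUTING = {
    "physics": (["A", "B", "C"], ["E", "G"]),
    "cs": (["A", "B", "F"], ["C", "G"]),
    "bio": (["D", "B", "C"], ["E"]),
    "crypto": (["A", "F", "B"], ["G"]),
    "econ": (["F", "B", "C"], ["G"]),
    "math": (["A", "B", "C"], ["F"]),
    "unknown": (["A", "B", "C", "D"], ["E", "G"]),
}


def pick_buckets(domain: str, max_buckets: int) -> list[str]:
    """Always-buckets first, then plus-buckets while the budget allows."""
    always, plus = ROUTING.get(domain, ROUTING["unknown"])
    return (always + plus)[:max(max_buckets, len(always))]


buckets = pick_buckets("physics", 5)
```

Unrecognized domains fall through to the interdisciplinary row, and the always-buckets are kept even when the budget is smaller than their count.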

Full per-bucket tool lists are in
skills/scholar-megasearch/references/sources.md;
the orchestration templates (Workflow + Agent fan-out) and the record schema are in
references/orchestration.md.

Repository Layout

scholar-megasearch/
├── README.md · README.ko.md · LICENSE
├── setup/
│   ├── install.sh            # skills + venvs + MCP servers + resolved config
│   ├── requirements.txt      # pinned search/acquisition deps
│   └── mcp.servers.json      # MCP registration template for ~/.claude.json
└── skills/
    ├── scholar-megasearch/   # the skill
    │   ├── SKILL.md
    │   ├── references/{sources.md, orchestration.md}
    │   └── scripts/{merge_corpus.py, fetch_pdfs.py, search_local.py}
    └── arxiv-search/          # supporting venv-search skill

This repository contains only original MIT-licensed work (the two skills and the
setup scripts). The third-party MCP servers are not vendored — setup/install.sh
fetches the local ones (arxiv-mcp-server, paper-search-mcp) from upstream at install
time, and Semantic Scholar is the remote Ai2 Asta service. See Attribution.

Notes & Limitations

PDF acquisition is open-access-first. fetch_pdfs.py only uses free/legal routes (a known OA pdf_url, arXiv, Unpaywall) and verifies every file is a real %PDF-. Closed-access papers are flagged needs_mcp in the manifest; fetching those is left to the session's MCP download tools.
arXiv rate-limits heavy fan-out (HTTP 429). Searchers stagger and lean on Semantic Scholar / OpenAlex when arXiv pushes back.
paper-search-mcp must be the git-main build: the PyPI release omits Crossref and OpenAlex. The installer handles this.
The claude.ai Scholar Gateway is best-effort: it may be absent in headless/cron runs, so it is never a bucket's only source.
Honest synthesis. summary.md reports what was actually searched and which sources failed; nothing is invented to fill a gap.
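The "real %PDF-" check in the first note guards against saving an HTML login or error page under a .pdf name. A minimal sketch of such a check, assuming it only inspects the leading magic bytes; `is_real_pdf` is a hypothetical name and the shipped script may validate more than this.

```python
from pathlib import Path


def is_real_pdf(path: Path) -> bool:
    """True when the file starts with the PDF magic bytes %PDF-."""
    with open(path, "rb") as fh:
        return fh.read(5) == b"%PDF-"
```

A download that fails this check is better treated as not acquired than kept on disk under a misleading extension.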

Attribution

The MCP servers are third-party — installed from their upstream sources by
setup/install.sh, or (for Asta) used as a remote service. None of their code is
redistributed here:

Ai2 Asta Scientific Corpus Tool: the official Semantic Scholar MCP by the Allen Institute for AI, used as a remote service under Ai2's Asta License Agreement and Terms of Use (no code vendored).
paper-search-mcp: openags/paper-search-mcp (pip install from git main)
arxiv-mcp-server: launched on demand via uvx

Semantic Scholar data returned through Asta is licensed ODC-BY and governed by the
Semantic Scholar API license: when
you publish results built on it, attribute Semantic Scholar (link back to
semanticscholar.org) and do not redistribute, sell, or sublicense the raw data.
Individual papers/abstracts may carry their own licenses (e.g. CC BY-NC).

Original work in this repository (the scholar-megasearch and arxiv-search skills
and the setup scripts) is released under the MIT License — this covers our
code only, not the third-party services or the data they return.