ref-downloader

Stop losing an afternoon to chasing dozens of reference PDFs by hand.
One DOI in, every reference PDF out — using your existing institutional access.

中文完整文档 / Full Chinese version

Status: beta (v0.2.0). Windows + Microsoft Edge verified path. macOS / Linux / Chromium untested. Expect rough edges around supplementary downloads and publisher-site changes. PR-worthy issues welcome.

Heads up — not a paywall bypass. ref-downloader uses your institutional access. If your university or organization subscribes to a journal, those refs work. If they don't, those refs become manual_pending for you to follow up on by hand.

Demo (30-second console preview)

$ python run_ref_downloader.py 10.1021/jacs.5c05017

=== Ref Downloader Wrapper ===
DOI:         10.1021/jacs.5c05017
PROJECT:     jacs.5c05017
Config:      config.example.toml + config.local.toml

>>> extract_refs.py
  Title: Designing Natural Cell-Inspired Heme-Spurred Membrane...
  References found: 38

>>> validate_refs.py
  Total: 38  Verified: 38  Failed: 0  No DOI: 0

>>> download_refs.py
  [ 1] downloaded (842 KB)        Lee2016_NatEnergy.pdf
  [ 2] downloaded (1.2 MB)        Wang2018_AdvMater.pdf
  [ 3] manual_pending (auth_redirect)
  [ 4] downloaded (655 KB)        Chen2019_JACS.pdf
  [ 5] failed (challenge_timeout)
  [ 6] ignored (ignored_institution_access)
  ... 31 more refs processed ...
  [38] downloaded (956 KB)        Park2024_JElectrochemSoc.pdf

========== Download report ==========
Total references:  38
Main PDFs:         33 downloaded · 3 manual_pending · 1 failed · 1 ignored
SI files:          12 captured
PDFs land in:      ./jacs.5c05017_refs/jacs.5c05017/
=====================================

What you get
Why not Zotero, scihub, or generic scrapers?
Quick start
Requirements
Install
Usage examples
Configuration
Architecture
Supported publishers
Known limitations
Contributing
Security
License

What you get

Paywalled refs work without setup. Drives your real Microsoft Edge profile, so any institutional login already in your browser carries through. No API keys, no proxies, no reverse engineering.
One DOI in, every reference PDF out. Crossref-driven extraction + 17+ publisher-specific download paths (Wiley PDFDirect, Elsevier viewer, AIP loading-page wait — see per-publisher reliability tier), not generic scraping.
You always know which refs failed and why. download_report.csv gives every ref a status + reason (manual_pending (auth_redirect), failed (challenge_timeout), ignored); events.jsonl keeps the per-ref event trace.
Pick up where you left off after a VPN drop, browser crash, or Ctrl+C. State persists per project; rerunning skips already-downloaded refs and retries only the failures.

Why not Zotero, scihub, or generic scrapers?

vs. Zotero's Find Available PDF — walks one paper at a time and silently gives up at SSO redirects. ref-downloader walks the whole reference list at once and treats SSO as a configurable step instead of a dead end.
vs. scihub-style tools — don't carry your institutional license, so paywalled refs you legitimately have access to just fail. ref-downloader uses your authenticated browser session, so subscriptions you already pay for actually count.
vs. generic web scrapers — don't know Wiley needs PDFDirect, Elsevier needs a viewer click, or AIP serves a Chinese loading page first. ref-downloader has 17+ publisher-specific paths plus hot-session retry for Elsevier.

Quick start

The skill is self-contained under skills/ref-downloader/. Pick the install path for your agent framework:

git clone https://github.com/ltczding-gif/ref-downloader.git

# Pick ONE install destination for your agent framework:
#   Claude Code:        cp -r ref-downloader/skills/ref-downloader ~/.claude/skills/
#   Codex CLI:          cp -r ref-downloader/skills/ref-downloader ~/.codex/skills/
#   Copilot CLI / VSC:  cp -r ref-downloader/skills/ref-downloader .github/skills/
#   Project-local:      cp -r ref-downloader/skills/ref-downloader .agents/skills/

cd ~/.claude/skills/ref-downloader     # or wherever you copied it
pip install playwright pymupdf
playwright install msedge
cp config.example.toml config.local.toml      # then set [crossref].mailto

# In your agent: just describe the task; the skill triggers via its description.
# Direct CLI for testing: python scripts/run_ref_downloader.py 10.1021/jacs.5c05017

What you'll see: 30–80 refs discovered for a typical chemistry/physics paper, then a mix of downloaded (refs your institution covers), manual_pending (SSO bounce or paywall), and occasional failed (publisher quirk). Run on a DOI from a journal your institution actually subscribes to for the highest hit rate. Details below.

Requirements

OS: Windows 10/11 (verified). macOS / Linux untested — PRs welcome.
Browser: Microsoft Edge (Stable channel). The script claims your persistent Edge profile, so close all Edge windows before running.
Python: 3.11 or newer (uses stdlib tomllib).
Optional: A Zotero installation (auto-detects DOI from a PDF's filename via Zotero's SQLite database — much faster than text extraction).
Optional: PyMuPDF (pip install pymupdf) for DOI extraction from PDF text when Zotero lookup is unavailable.

Install

As an agent skill (recommended)

Pick the install path for your agent framework:

Framework	Install command
Claude Code	`cp -r skills/ref-downloader ~/.claude/skills/`
Claude Agent SDK	same (auto-discovers `~/.claude/skills/`)
Codex CLI	`cp -r skills/ref-downloader ~/.codex/skills/`
Copilot CLI / VS Code agent	`cp -r skills/ref-downloader .github/skills/`
Any framework (project-local)	`cp -r skills/ref-downloader .agents/skills/`

Then install Python prereqs INSIDE the copied skill folder (the skill protocol doesn't manage Python deps):

cd ~/.claude/skills/ref-downloader            # or wherever you copied it
pip install playwright pymupdf                # or use the source's requirements.txt
playwright install msedge

cp config.example.toml config.local.toml
# Edit config.local.toml — at minimum set [crossref].mailto.
# Windows: notepad config.local.toml
# macOS / Linux: $EDITOR config.local.toml   (or vim / nano / code / ...)

As a Python tool (for developers)

If you want to hack on the code, the skill folder is a runnable Python project:

git clone https://github.com/ltczding-gif/ref-downloader.git
cd ref-downloader

pip install -r requirements.txt -r requirements-dev.txt
playwright install msedge

cp skills/ref-downloader/config.example.toml skills/ref-downloader/config.local.toml
# Edit config.local.toml — at minimum set [crossref].mailto.

# Run the offline test suite
python -m pytest tests/ -v

# Run the tool directly
python skills/ref-downloader/scripts/run_ref_downloader.py 10.1021/jacs.5c05017

Usage examples

(After install — paths assume the skill is at <SKILL_DIR>, e.g. ~/.claude/skills/ref-downloader/. In source, <SKILL_DIR> = skills/ref-downloader/.)

Input: a DOI

python <SKILL_DIR>/scripts/run_ref_downloader.py 10.1021/jacs.5c05017

Default output: <cwd>/jacs.5c05017_refs/jacs.5c05017/

Input: a local PDF (with DOI in metadata or in PDF text)

python <SKILL_DIR>/scripts/run_ref_downloader.py "C:\path\to\your_paper.pdf"

Default output: <pdf_dir>/your_paper_refs/<doi-derived-name>/

Custom output directory

python <SKILL_DIR>/scripts/run_ref_downloader.py 10.1021/jacs.5c05017 --output-dir refs/

Non-interactive (CI / batch)

python <SKILL_DIR>/scripts/run_ref_downloader.py 10.1021/jacs.5c05017 --yes --auto

Alternate config file

python <SKILL_DIR>/scripts/run_ref_downloader.py 10.1021/jacs.5c05017 --config ./alt.toml

Configuration

All configuration lives in config.local.toml (gitignored). Copy config.example.toml to bootstrap.

Section	Key	Purpose
`[crossref]`	`mailto`	Your email — entry into Crossref polite pool
`[zotero]`	`db_path`	Optional path to `zotero.sqlite` for DOI lookup from PDF filename
`[browser]`	`edge_profile_dir`	Edge profile directory; empty = OS default
`[browser]`	`disable_extensions`	Set `true` to launch with `--disable-extensions`
`[institution]`	`auth_hosts`	Hostnames that mean "you got bounced to SSO" (e.g. `["sso.your-uni.edu"]`)
`[institution]`	`auth_url_fragments`	URL substrings indicating SSO (e.g. `["oauth", "saml"]`)
`[institution]`	`auth_page_titles`	`<title>` text for SSO pages (catches HTML served as PDF)
`[institution]`	`auth_loading_titles`	Loading-page titles (also reused for AIP/AVS publisher loading detection)
`[institution]`	`ignored_access_dois`	DOIs you know are paywalled at your institution; skipped without retry

Environment variables override file values:

Variable	Maps to
`REF_DOWNLOADER_MAILTO`	`crossref.mailto`
`REF_DOWNLOADER_ZOTERO_DB`	`zotero.db_path`
`REF_DOWNLOADER_EDGE_PROFILE`	`browser.edge_profile_dir`
`REF_DOWNLOADER_DISABLE_EXTENSIONS`	`browser.disable_extensions` (`1`/`true` to enable)
`REF_DOWNLOADER_CONFIG`	Path to alternate TOML file

See skills/ref-downloader/config.example.toml for full documentation.

Architecture

Three-stage pipeline + a wrapper:

skills/ref-downloader/
├── SKILL.md                            agent runbook (slim entry)
├── references/agent-runbook.md         extended manual flow + DOI fallback
├── config.example.toml                 config schema (copy to config.local.toml)
└── scripts/
    ├── run_ref_downloader.py           entry — config + DOI resolution + sequencing
    │     └─> extract_refs.py    (1) Crossref API: fetch parent's reference list
    │     └─> validate_refs.py   (2) Crossref API: per-ref metadata + publisher classify
    │     └─> download_refs.py   (3) Playwright/Edge: download main PDF + SI per publisher
    └── _config.py                      TOML + env-var loader

You can also run the three scripts manually for debugging or partial restarts. See the agent runbook in skills/ref-downloader/references/agent-runbook.md for the manual flow.

Agent users can install or inspect the packaged skill at skills/ref-downloader/SKILL.md. The repository root remains the human-facing Python project; the skill bundle is kept separate so Codex does not treat README, changelog, tests, and source files as always-associated skill context.

Supported publishers

ACS, Nature, Science, Elsevier, Wiley, RSC, Springer, PNAS, ECS, IOP, AIP, AVS, IEEE, OSA, KPS, Beilstein, APS, Annual Reviews, Taylor & Francis. Maturity varies — see docs/SUPPORTED_PUBLISHERS.md for the per-publisher tier table and known issues.

Known limitations

Windows + Microsoft Edge only: that's the verified path. macOS / Linux / Chromium support has not been tested. If you try, please open an issue with results.
Headed mode required: empirically, headless=True yields empty results for Wiley / ACS supplementary downloads. The default is headed.
Edge must be fully closed before running: Playwright needs exclusive access to the persistent profile. Check Task Manager for any background msedge.exe processes.
SSO redirects are detected, not solved: when the script bounces to your institution's SSO, the ref becomes manual_pending so you can sign in interactively. Configure [institution] to teach it which redirects to recognize.
SI download is the most fragile path: main PDFs are reliable; SI lookup varies by publisher and is the area most likely to need a tweak when a publisher updates their site.
Paywalled content needs institutional access: this is not a bypass tool.
Crossref dependency: papers with no reference list deposited at Crossref can't be processed automatically.

Contributing

See CONTRIBUTING.md for guidance on:

Adding a new publisher (DOI prefix → strategy)
Adding institutional SSO patterns
Reporting download failures with useful logs

Security

This tool launches your real Edge profile, with all your cookies and saved sessions. Read SECURITY.md before running it against a profile you also use for daily browsing.

License

MIT — see LICENSE.

ref-downloader

ref-downloader

Demo (30-second console preview)

Contents

What you get

Why not Zotero, scihub, or generic scrapers?

Quick start

Requirements

Install

As an agent skill (recommended)

As a Python tool (for developers)

Usage examples

Input: a DOI

Input: a local PDF (with DOI in metadata or in PDF text)

Custom output directory

Non-interactive (CI / batch)

Alternate config file

Configuration

Architecture

Supported publishers

Known limitations

Contributing

Security

License

Yorumlar (0)