ref-downloader
Health: Passed
- License — License: MIT
- Description — Repository has a description
- Active repo — Last push 0 days ago
- Community trust — 86 GitHub stars
Code: Passed
- Code scan — Scanned 9 files during light audit, no dangerous patterns found
Permissions: Passed
- Permissions — No dangerous permissions requested
Batch-download reference PDFs from a DOI or paper PDF using Crossref and your institutional Edge session.
Stop losing an afternoon to chasing dozens of reference PDFs by hand.
One DOI in, every reference PDF out — using your existing institutional access.
Status: beta (v0.2.0). Windows + Microsoft Edge verified path. macOS / Linux / Chromium untested. Expect rough edges around supplementary downloads and publisher-site changes. PR-worthy issues welcome.
Heads up: this is not a paywall bypass. ref-downloader uses your institutional access. If your university or organization subscribes to a journal, those refs work. If it doesn't, those refs become manual_pending for you to follow up on by hand.
Demo (30-second console preview)
$ python run_ref_downloader.py 10.1021/jacs.5c05017
=== Ref Downloader Wrapper ===
DOI: 10.1021/jacs.5c05017
PROJECT: jacs.5c05017
Config: config.example.toml + config.local.toml
>>> extract_refs.py
Title: Designing Natural Cell-Inspired Heme-Spurred Membrane...
References found: 38
>>> validate_refs.py
Total: 38 Verified: 38 Failed: 0 No DOI: 0
>>> download_refs.py
[ 1] downloaded (842 KB) Lee2016_NatEnergy.pdf
[ 2] downloaded (1.2 MB) Wang2018_AdvMater.pdf
[ 3] manual_pending (auth_redirect)
[ 4] downloaded (655 KB) Chen2019_JACS.pdf
[ 5] failed (challenge_timeout)
[ 6] ignored (ignored_institution_access)
... 31 more refs processed ...
[38] downloaded (956 KB) Park2024_JElectrochemSoc.pdf
========== Download report ==========
Total references: 38
Main PDFs: 33 downloaded · 3 manual_pending · 1 failed · 1 ignored
SI files: 12 captured
PDFs land in: ./jacs.5c05017_refs/jacs.5c05017/
=====================================
Contents
- What you get
- Why not Zotero, scihub, or generic scrapers?
- Quick start
- Requirements
- Install
- Usage examples
- Configuration
- Architecture
- Supported publishers
- Known limitations
- Contributing
- Security
- License
What you get
- Paywalled refs work without setup. Drives your real Microsoft Edge profile, so any institutional login already in your browser carries through. No API keys, no proxies, no reverse engineering.
- One DOI in, every reference PDF out. Crossref-driven extraction + 17+ publisher-specific download paths (Wiley PDFDirect, Elsevier viewer, AIP loading-page wait — see per-publisher reliability tier), not generic scraping.
- You always know which refs failed and why. download_report.csv gives every ref a status + reason (manual_pending (auth_redirect), failed (challenge_timeout), ignored); events.jsonl keeps the per-ref event trace.
- Pick up where you left off after a VPN drop, browser crash, or Ctrl+C. State persists per project; rerunning skips already-downloaded refs and retries only the failures.
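The status-plus-reason report described above lends itself to quick post-processing. The following is a minimal sketch, not the tool's own code. It assumes the CSV has columns named ref, status, and reason; these column names are hypothetical, so check the header of your own download_report.csv first.

```python
import csv
import io
from collections import Counter

# Assumption: only "failed" refs are worth an automatic retry;
# manual_pending refs need a human sign-in first.
RETRYABLE = {"failed"}


def summarize(report_file):
    """Return (status counts, refs worth retrying) from a CSV report.

    Assumes hypothetical columns named "ref", "status" and "reason".
    """
    counts = Counter()
    retry = []
    for row in csv.DictReader(report_file):
        counts[row["status"]] += 1
        if row["status"] in RETRYABLE:
            retry.append((row["ref"], row["reason"]))
    return counts, retry


# Inline sample standing in for a real download_report.csv.
sample = io.StringIO(
    "ref,status,reason\n"
    "1,downloaded,\n"
    "3,manual_pending,auth_redirect\n"
    "5,failed,challenge_timeout\n"
    "6,ignored,ignored_institution_access\n"
)
counts, retry = summarize(sample)
print(dict(counts))
print(retry)
```

With a real report, replace the StringIO sample with open("download_report.csv", newline="", encoding="utf-8").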
Why not Zotero, scihub, or generic scrapers?
- vs. Zotero's Find Available PDF — walks one paper at a time and silently gives up at SSO redirects. ref-downloader walks the whole reference list at once and treats SSO as a configurable step instead of a dead end.
- vs. scihub-style tools — don't carry your institutional license, so paywalled refs you legitimately have access to just fail. ref-downloader uses your authenticated browser session, so subscriptions you already pay for actually count.
- vs. generic web scrapers — don't know Wiley needs PDFDirect, Elsevier needs a viewer click, or AIP serves a Chinese loading page first. ref-downloader has 17+ publisher-specific paths plus hot-session retry for Elsevier.
Quick start
The skill is self-contained under skills/ref-downloader/. Pick the install path for your agent framework:
git clone https://github.com/ltczding-gif/ref-downloader.git
# Pick ONE install destination for your agent framework:
# Claude Code: cp -r ref-downloader/skills/ref-downloader ~/.claude/skills/
# Codex CLI: cp -r ref-downloader/skills/ref-downloader ~/.codex/skills/
# Copilot CLI / VSC: cp -r ref-downloader/skills/ref-downloader .github/skills/
# Project-local: cp -r ref-downloader/skills/ref-downloader .agents/skills/
cd ~/.claude/skills/ref-downloader # or wherever you copied it
pip install playwright pymupdf
playwright install msedge
cp config.example.toml config.local.toml # then set [crossref].mailto
# In your agent: just describe the task; the skill triggers via its description.
# Direct CLI for testing: python scripts/run_ref_downloader.py 10.1021/jacs.5c05017
What you'll see: 30–80 refs discovered for a typical chemistry/physics paper, then a mix of downloaded (refs your institution covers), manual_pending (SSO bounce or paywall), and occasional failed (publisher quirk). Run on a DOI from a journal your institution actually subscribes to for the highest hit rate. Details below.
Requirements
- OS: Windows 10/11 (verified). macOS / Linux untested — PRs welcome.
- Browser: Microsoft Edge (Stable channel). The script claims your persistent Edge profile, so close all Edge windows before running.
- Python: 3.11 or newer (uses stdlib tomllib).
- Optional: A Zotero installation (used to auto-detect the DOI from a PDF's filename via Zotero's SQLite database, which is much faster than text extraction).
- Optional: PyMuPDF (pip install pymupdf) for DOI extraction from PDF text when Zotero lookup is unavailable.
Install
As an agent skill (recommended)
Pick the install path for your agent framework:
| Framework | Install command |
|---|---|
| Claude Code | cp -r skills/ref-downloader ~/.claude/skills/ |
| Claude Agent SDK | same (auto-discovers ~/.claude/skills/) |
| Codex CLI | cp -r skills/ref-downloader ~/.codex/skills/ |
| Copilot CLI / VS Code agent | cp -r skills/ref-downloader .github/skills/ |
| Any framework (project-local) | cp -r skills/ref-downloader .agents/skills/ |
Then install Python prereqs INSIDE the copied skill folder (the skill protocol doesn't manage Python deps):
cd ~/.claude/skills/ref-downloader # or wherever you copied it
pip install playwright pymupdf # or use the source's requirements.txt
playwright install msedge
cp config.example.toml config.local.toml
# Edit config.local.toml — at minimum set [crossref].mailto.
# Windows: notepad config.local.toml
# macOS / Linux: $EDITOR config.local.toml (or vim / nano / code / ...)
As a Python tool (for developers)
If you want to hack on the code, the skill folder is a runnable Python project:
git clone https://github.com/ltczding-gif/ref-downloader.git
cd ref-downloader
pip install -r requirements.txt -r requirements-dev.txt
playwright install msedge
cp skills/ref-downloader/config.example.toml skills/ref-downloader/config.local.toml
# Edit config.local.toml — at minimum set [crossref].mailto.
# Run the offline test suite
python -m pytest tests/ -v
# Run the tool directly
python skills/ref-downloader/scripts/run_ref_downloader.py 10.1021/jacs.5c05017
Usage examples
(After install — paths assume the skill is at <SKILL_DIR>, e.g. ~/.claude/skills/ref-downloader/. In source, <SKILL_DIR> = skills/ref-downloader/.)
Input: a DOI
python <SKILL_DIR>/scripts/run_ref_downloader.py 10.1021/jacs.5c05017
Default output: <cwd>/jacs.5c05017_refs/jacs.5c05017/
Input: a local PDF (with DOI in metadata or in PDF text)
python <SKILL_DIR>/scripts/run_ref_downloader.py "C:\path\to\your_paper.pdf"
Default output: <pdf_dir>/your_paper_refs/<doi-derived-name>/
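For the PDF input path, the DOI has to be recovered from the paper's text when metadata lookup fails. A minimal sketch of that step follows, assuming the page text has already been extracted (for example with PyMuPDF). The pattern and the find_doi helper are illustrative assumptions, not the tool's actual extraction logic.

```python
import re

# Common DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_doi(text):
    """Return the first DOI-looking string in text, or None."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;)")


page_text = "J. Am. Chem. Soc. 2025. https://doi.org/10.1021/jacs.5c05017."
print(find_doi(page_text))  # 10.1021/jacs.5c05017
```

A pattern like this can pick up a cited reference's DOI instead of the paper's own, which is one reason a filename-based Zotero lookup can be the safer first choice.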
Custom output directory
python <SKILL_DIR>/scripts/run_ref_downloader.py 10.1021/jacs.5c05017 --output-dir refs/
Non-interactive (CI / batch)
python <SKILL_DIR>/scripts/run_ref_downloader.py 10.1021/jacs.5c05017 --yes --auto
Alternate config file
python <SKILL_DIR>/scripts/run_ref_downloader.py 10.1021/jacs.5c05017 --config ./alt.toml
Configuration
All configuration lives in config.local.toml (gitignored). Copy config.example.toml to bootstrap.
| Section | Key | Purpose |
|---|---|---|
| [crossref] | mailto | Your email; opts you into the Crossref polite pool |
| [zotero] | db_path | Optional path to zotero.sqlite for DOI lookup from PDF filename |
| [browser] | edge_profile_dir | Edge profile directory; empty = OS default |
| [browser] | disable_extensions | Set true to launch with --disable-extensions |
| [institution] | auth_hosts | Hostnames that mean "you got bounced to SSO" (e.g. ["sso.your-uni.edu"]) |
| [institution] | auth_url_fragments | URL substrings indicating SSO (e.g. ["oauth", "saml"]) |
| [institution] | auth_page_titles | <title> text for SSO pages (catches HTML served as PDF) |
| [institution] | auth_loading_titles | Loading-page titles (also reused for AIP/AVS publisher loading detection) |
| [institution] | ignored_access_dois | DOIs you know are paywalled at your institution; skipped without retry |
Environment variables override file values:
| Variable | Maps to |
|---|---|
| REF_DOWNLOADER_MAILTO | crossref.mailto |
| REF_DOWNLOADER_ZOTERO_DB | zotero.db_path |
| REF_DOWNLOADER_EDGE_PROFILE | browser.edge_profile_dir |
| REF_DOWNLOADER_DISABLE_EXTENSIONS | browser.disable_extensions (1/true to enable) |
| REF_DOWNLOADER_CONFIG | Path to alternate TOML file |
See skills/ref-downloader/config.example.toml for full documentation.
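The override rule (environment variables win over file values) can be sketched as follows. This is an illustration of the documented behavior, not the contents of _config.py. The apply_env_overrides helper is hypothetical, and REF_DOWNLOADER_CONFIG is left out because it selects a file rather than a key.

```python
import os

# Environment variable -> (section, key), as listed in the table above.
ENV_MAP = {
    "REF_DOWNLOADER_MAILTO": ("crossref", "mailto"),
    "REF_DOWNLOADER_ZOTERO_DB": ("zotero", "db_path"),
    "REF_DOWNLOADER_EDGE_PROFILE": ("browser", "edge_profile_dir"),
    "REF_DOWNLOADER_DISABLE_EXTENSIONS": ("browser", "disable_extensions"),
}


def apply_env_overrides(config, environ=os.environ):
    """Return a copy of config with any set environment variables applied."""
    merged = {section: dict(values) for section, values in config.items()}
    for name, (section, key) in ENV_MAP.items():
        if name not in environ:
            continue
        value = environ[name]
        if key == "disable_extensions":
            # The table documents "1/true to enable".
            value = value.strip().lower() in {"1", "true"}
        merged.setdefault(section, {})[key] = value
    return merged


file_config = {"crossref": {"mailto": "file@example.org"}}
fake_env = {
    "REF_DOWNLOADER_MAILTO": "env@example.org",
    "REF_DOWNLOADER_DISABLE_EXTENSIONS": "1",
}
result = apply_env_overrides(file_config, fake_env)
print(result)
```

Passing the environment as a parameter keeps the sketch testable without touching the real process environment.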
Architecture
Three-stage pipeline + a wrapper:
skills/ref-downloader/
├── SKILL.md agent runbook (slim entry)
├── references/agent-runbook.md extended manual flow + DOI fallback
├── config.example.toml config schema (copy to config.local.toml)
└── scripts/
├── run_ref_downloader.py entry — config + DOI resolution + sequencing
│ └─> extract_refs.py (1) Crossref API: fetch parent's reference list
│ └─> validate_refs.py (2) Crossref API: per-ref metadata + publisher classify
│ └─> download_refs.py (3) Playwright/Edge: download main PDF + SI per publisher
└── _config.py TOML + env-var loader
You can also run the three scripts manually for debugging or partial restarts. See the agent runbook in skills/ref-downloader/references/agent-runbook.md for the manual flow.
Agent users can install or inspect the packaged skill at skills/ref-downloader/SKILL.md. The repository root remains the human-facing Python project; the skill bundle is kept separate so Codex does not treat README, changelog, tests, and source files as always-associated skill context.
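Stage (1) of the pipeline relies on Crossref's public REST API, which returns a work's deposited reference list under message.reference. The sketch below builds the request URL with a mailto parameter (the polite-pool convention) and splits a response into references with and without a DOI. The function names and the sample payload are made up for illustration, and the actual extract_refs.py may work differently.

```python
import json
from urllib.parse import quote, urlencode


def works_url(doi, mailto):
    """Build a Crossref REST API URL for one work, with a polite-pool mailto."""
    return (
        "https://api.crossref.org/works/"
        + quote(doi, safe="/")
        + "?"
        + urlencode({"mailto": mailto})
    )


def reference_dois(payload):
    """Return (reference DOIs, count of references lacking a DOI)."""
    references = payload.get("message", {}).get("reference", [])
    dois = [ref["DOI"] for ref in references if "DOI" in ref]
    return dois, len(references) - len(dois)


# Trimmed, made-up payload in the shape Crossref returns for /works/{doi}.
sample = json.loads("""
{"message": {"reference": [
  {"key": "ref1", "DOI": "10.1000/example.1"},
  {"key": "ref2", "unstructured": "A book chapter without a DOI"}
]}}
""")

print(works_url("10.1021/jacs.5c05017", "you@your-uni.edu"))
print(reference_dois(sample))
```

The count of DOI-less references corresponds to the "No DOI" figure in the validation summary shown in the demo; such entries cannot be resolved to a publisher page automatically.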
Supported publishers
ACS, Nature, Science, Elsevier, Wiley, RSC, Springer, PNAS, ECS, IOP, AIP, AVS, IEEE, OSA, KPS, Beilstein, APS, Annual Reviews, Taylor & Francis. Maturity varies — see docs/SUPPORTED_PUBLISHERS.md for the per-publisher tier table and known issues.
Known limitations
- Windows + Microsoft Edge only: that's the verified path. macOS / Linux / Chromium support has not been tested. If you try, please open an issue with results.
- Headed mode required: empirically, headless=True yields empty results for Wiley / ACS supplementary downloads. The default is headed.
- Edge must be fully closed before running: Playwright needs exclusive access to the persistent profile. Check Task Manager for any background msedge.exe processes.
- SSO redirects are detected, not solved: when the script bounces to your institution's SSO, the ref becomes manual_pending so you can sign in interactively. Configure [institution] to teach it which redirects to recognize.
- SI download is the most fragile path: main PDFs are reliable; SI lookup varies by publisher and is the area most likely to need a tweak when a publisher updates their site.
- Paywalled content needs institutional access: this is not a bypass tool.
- Crossref dependency: papers with no reference list deposited at Crossref can't be processed automatically.
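The "detected, not solved" SSO behavior can be pictured as a small classifier over the [institution] settings. The sketch below is an assumption about how such a check could look. The looks_like_sso helper and its returned reason strings are hypothetical and do not reflect the tool's real status codes.

```python
from urllib.parse import urlparse


def looks_like_sso(url, title, auth_hosts, auth_url_fragments, auth_page_titles):
    """Return a short reason string if the page looks like an SSO bounce, else None."""
    host = (urlparse(url).hostname or "").lower()
    if host in {h.lower() for h in auth_hosts}:
        return "auth_host"
    lowered_url = url.lower()
    if any(fragment.lower() in lowered_url for fragment in auth_url_fragments):
        return "auth_url_fragment"
    lowered_title = title.lower()
    if any(t.lower() in lowered_title for t in auth_page_titles):
        return "auth_page_title"
    return None


# Placeholder rules mirroring the keys in the [institution] section.
rules = {
    "auth_hosts": ["sso.your-uni.edu"],
    "auth_url_fragments": ["oauth", "saml"],
    "auth_page_titles": ["Sign in"],
}
print(looks_like_sso("https://sso.your-uni.edu/login", "Welcome", **rules))
print(looks_like_sso("https://idp.example.org/saml/redirect", "", **rules))
print(looks_like_sso("https://pubs.acs.org/doi/pdf/10.1021/jacs.5c05017", "Article", **rules))
```

Substring matching on URL fragments is deliberately loose, so short fragments can produce false positives; prefer specific hostnames in auth_hosts where you can.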
Contributing
See CONTRIBUTING.md for guidance on:
- Adding a new publisher (DOI prefix → strategy)
- Adding institutional SSO patterns
- Reporting download failures with useful logs
Security
This tool launches your real Edge profile, with all your cookies and saved sessions. Read SECURITY.md before running it against a profile you also use for daily browsing.
License
MIT — see LICENSE.