ref-downloader

agent
Security Audit
Passed
Health: Passed
  • License — License: MIT
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Community trust — 86 GitHub stars
Code: Passed
  • Code scan — Scanned 9 files during light audit, no dangerous patterns found
Permissions: Passed
  • Permissions — No dangerous permissions requested

SUMMARY

Batch-download reference PDFs from a DOI or paper PDF using Crossref and your institutional Edge session.

README.md

ref-downloader

Stop losing an afternoon to chasing dozens of reference PDFs by hand.
One DOI in, every reference PDF out — using your existing institutional access.

Version: 0.2.0
Status: beta
License: MIT
Python 3.11+
Verified on Windows + Edge

中文完整文档 / Full Chinese version

Status: beta (v0.2.0). Windows + Microsoft Edge verified path. macOS / Linux / Chromium untested. Expect rough edges around supplementary downloads and publisher-site changes. PR-worthy issues welcome.

Heads up — not a paywall bypass. ref-downloader uses your institutional access. If your university or organization subscribes to a journal, those refs work. If they don't, those refs become manual_pending for you to follow up on by hand.

Demo (30-second console preview)

$ python run_ref_downloader.py 10.1021/jacs.5c05017

=== Ref Downloader Wrapper ===
DOI:         10.1021/jacs.5c05017
PROJECT:     jacs.5c05017
Config:      config.example.toml + config.local.toml

>>> extract_refs.py
  Title: Designing Natural Cell-Inspired Heme-Spurred Membrane...
  References found: 38

>>> validate_refs.py
  Total: 38  Verified: 38  Failed: 0  No DOI: 0

>>> download_refs.py
  [ 1] downloaded (842 KB)        Lee2016_NatEnergy.pdf
  [ 2] downloaded (1.2 MB)        Wang2018_AdvMater.pdf
  [ 3] manual_pending (auth_redirect)
  [ 4] downloaded (655 KB)        Chen2019_JACS.pdf
  [ 5] failed (challenge_timeout)
  [ 6] ignored (ignored_institution_access)
  ... 31 more refs processed ...
  [38] downloaded (956 KB)        Park2024_JElectrochemSoc.pdf

========== Download report ==========
Total references:  38
Main PDFs:         33 downloaded · 3 manual_pending · 1 failed · 1 ignored
SI files:          12 captured
PDFs land in:      ./jacs.5c05017_refs/jacs.5c05017/
=====================================

What you get

  • Paywalled refs work without setup. Drives your real Microsoft Edge profile, so any institutional login already in your browser carries through. No API keys, no proxies, no reverse engineering.
  • One DOI in, every reference PDF out. Crossref-driven extraction + 17+ publisher-specific download paths (Wiley PDFDirect, Elsevier viewer, AIP loading-page wait — see per-publisher reliability tier), not generic scraping.
  • You always know which refs failed and why. download_report.csv gives every ref a status + reason (manual_pending (auth_redirect), failed (challenge_timeout), ignored); events.jsonl keeps the per-ref event trace.
  • Pick up where you left off after a VPN drop, browser crash, or Ctrl+C. State persists per project; rerunning skips already-downloaded refs and retries only the failures.
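
As a rough illustration of how the report can be triaged after a run, the sketch below tallies refs by status and lists the ones that still need attention. The column names (`index`, `status`, `reason`, `filename`) and the sample rows are assumptions made for this example; check the header of your own download_report.csv before relying on them.

```python
import csv
import io
from collections import Counter


def summarize_report(csv_text: str) -> tuple[Counter, list[dict]]:
    """Tally refs by status and collect the ones that need a follow-up."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    counts = Counter(row["status"] for row in rows)
    # manual_pending and failed are the statuses a human has to look at.
    follow_up = [row for row in rows if row["status"] in {"manual_pending", "failed"}]
    return counts, follow_up


# Made-up sample rows; the column names are assumed, not the tool's real schema.
SAMPLE = """index,status,reason,filename
1,downloaded,,Lee2016_NatEnergy.pdf
3,manual_pending,auth_redirect,
5,failed,challenge_timeout,
6,ignored,ignored_institution_access,
"""

counts, follow_up = summarize_report(SAMPLE)
for status, n in counts.items():
    print(f"{status}: {n}")
for row in follow_up:
    print(f"ref {row['index']}: {row['status']} ({row['reason']})")
```

To use it on a real run, read the file's text (for example with `Path("download_report.csv").read_text(encoding="utf-8")`) and pass it to `summarize_report`.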

Why not Zotero, Sci-Hub, or generic scrapers?

  • vs. Zotero's Find Available PDF: it walks one paper at a time and silently gives up at SSO redirects. ref-downloader walks the whole reference list at once and treats SSO as a configurable step instead of a dead end.
  • vs. Sci-Hub-style tools: they don't carry your institutional license, so paywalled refs you legitimately have access to just fail. ref-downloader uses your authenticated browser session, so subscriptions you already pay for actually count.
  • vs. generic web scrapers: they don't know that Wiley needs PDFDirect, Elsevier needs a viewer click, or AIP serves a Chinese loading page first. ref-downloader has 17+ publisher-specific paths plus hot-session retry for Elsevier.

Quick start

The skill is self-contained under skills/ref-downloader/. Pick the install path for your agent framework:

git clone https://github.com/ltczding-gif/ref-downloader.git

# Pick ONE install destination for your agent framework:
#   Claude Code:        cp -r ref-downloader/skills/ref-downloader ~/.claude/skills/
#   Codex CLI:          cp -r ref-downloader/skills/ref-downloader ~/.codex/skills/
#   Copilot CLI / VSC:  cp -r ref-downloader/skills/ref-downloader .github/skills/
#   Project-local:      cp -r ref-downloader/skills/ref-downloader .agents/skills/

cd ~/.claude/skills/ref-downloader     # or wherever you copied it
pip install playwright pymupdf
playwright install msedge
cp config.example.toml config.local.toml      # then set [crossref].mailto

# In your agent: just describe the task; the skill triggers via its description.
# Direct CLI for testing: python scripts/run_ref_downloader.py 10.1021/jacs.5c05017

What you'll see: 30–80 refs discovered for a typical chemistry/physics paper, then a mix of downloaded (refs your institution covers), manual_pending (SSO bounce or paywall), and occasional failed (publisher quirk). Run on a DOI from a journal your institution actually subscribes to for the highest hit rate. Details below.

Requirements

  • OS: Windows 10/11 (verified). macOS / Linux untested — PRs welcome.
  • Browser: Microsoft Edge (Stable channel). The script claims your persistent Edge profile, so close all Edge windows before running.
  • Python: 3.11 or newer (uses stdlib tomllib).
  • Optional: A Zotero installation (auto-detects DOI from a PDF's filename via Zotero's SQLite database — much faster than text extraction).
  • Optional: PyMuPDF (pip install pymupdf) for DOI extraction from PDF text when Zotero lookup is unavailable.

Install

As an agent skill (recommended)

Pick the install path for your agent framework:

| Framework | Install command |
| --- | --- |
| Claude Code | cp -r skills/ref-downloader ~/.claude/skills/ |
| Claude Agent SDK | same (auto-discovers ~/.claude/skills/) |
| Codex CLI | cp -r skills/ref-downloader ~/.codex/skills/ |
| Copilot CLI / VS Code agent | cp -r skills/ref-downloader .github/skills/ |
| Any framework (project-local) | cp -r skills/ref-downloader .agents/skills/ |

Then install Python prereqs INSIDE the copied skill folder (the skill protocol doesn't manage Python deps):

cd ~/.claude/skills/ref-downloader            # or wherever you copied it
pip install playwright pymupdf                # or use the source's requirements.txt
playwright install msedge

cp config.example.toml config.local.toml
# Edit config.local.toml — at minimum set [crossref].mailto.
# Windows: notepad config.local.toml
# macOS / Linux: $EDITOR config.local.toml   (or vim / nano / code / ...)

As a Python tool (for developers)

If you want to hack on the code, the skill folder is a runnable Python project:

git clone https://github.com/ltczding-gif/ref-downloader.git
cd ref-downloader

pip install -r requirements.txt -r requirements-dev.txt
playwright install msedge

cp skills/ref-downloader/config.example.toml skills/ref-downloader/config.local.toml
# Edit config.local.toml — at minimum set [crossref].mailto.

# Run the offline test suite
python -m pytest tests/ -v

# Run the tool directly
python skills/ref-downloader/scripts/run_ref_downloader.py 10.1021/jacs.5c05017

Usage examples

(After install — paths assume the skill is at <SKILL_DIR>, e.g. ~/.claude/skills/ref-downloader/. In source, <SKILL_DIR> = skills/ref-downloader/.)

Input: a DOI

python <SKILL_DIR>/scripts/run_ref_downloader.py 10.1021/jacs.5c05017

Default output: <cwd>/jacs.5c05017_refs/jacs.5c05017/
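
The default path above appears to follow from the DOI: the demo shows the project name `jacs.5c05017` taken from the part of the DOI after the first slash. A minimal sketch of that mapping, assuming no further sanitizing of the suffix (the function name is hypothetical, and the real tool may treat unusual characters differently):

```python
from pathlib import Path


def default_output_dir(doi: str, base: Path) -> Path:
    """Build <base>/<project>_refs/<project> from a DOI (illustrative only)."""
    prefix, sep, project = doi.partition("/")
    if not sep or not project:
        raise ValueError(f"not a DOI of the form prefix/suffix: {doi!r}")
    return base / f"{project}_refs" / project


print(default_output_dir("10.1021/jacs.5c05017", Path(".")).as_posix())
# prints "jacs.5c05017_refs/jacs.5c05017"
```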

Input: a local PDF (with DOI in metadata or in PDF text)

python <SKILL_DIR>/scripts/run_ref_downloader.py "C:\path\to\your_paper.pdf"

Default output: <pdf_dir>/your_paper_refs/<doi-derived-name>/

Custom output directory

python <SKILL_DIR>/scripts/run_ref_downloader.py 10.1021/jacs.5c05017 --output-dir refs/

Non-interactive (CI / batch)

python <SKILL_DIR>/scripts/run_ref_downloader.py 10.1021/jacs.5c05017 --yes --auto

Alternate config file

python <SKILL_DIR>/scripts/run_ref_downloader.py 10.1021/jacs.5c05017 --config ./alt.toml

Configuration

All configuration lives in config.local.toml (gitignored). Copy config.example.toml to bootstrap.

| Section | Key | Purpose |
| --- | --- | --- |
| [crossref] | mailto | Your email; gives you entry into Crossref's polite pool |
| [zotero] | db_path | Optional path to zotero.sqlite for DOI lookup from PDF filename |
| [browser] | edge_profile_dir | Edge profile directory; empty = OS default |
| [browser] | disable_extensions | Set true to launch with --disable-extensions |
| [institution] | auth_hosts | Hostnames that mean "you got bounced to SSO" (e.g. ["sso.your-uni.edu"]) |
| [institution] | auth_url_fragments | URL substrings indicating SSO (e.g. ["oauth", "saml"]) |
| [institution] | auth_page_titles | <title> text for SSO pages (catches HTML served as PDF) |
| [institution] | auth_loading_titles | Loading-page titles (also reused for AIP/AVS publisher loading detection) |
| [institution] | ignored_access_dois | DOIs you know are paywalled at your institution; skipped without retry |

Environment variables override file values:

| Variable | Maps to |
| --- | --- |
| REF_DOWNLOADER_MAILTO | crossref.mailto |
| REF_DOWNLOADER_ZOTERO_DB | zotero.db_path |
| REF_DOWNLOADER_EDGE_PROFILE | browser.edge_profile_dir |
| REF_DOWNLOADER_DISABLE_EXTENSIONS | browser.disable_extensions (1/true to enable) |
| REF_DOWNLOADER_CONFIG | Path to alternate TOML file |

See skills/ref-downloader/config.example.toml for full documentation.

Architecture

Three-stage pipeline + a wrapper:

skills/ref-downloader/
├── SKILL.md                            agent runbook (slim entry)
├── references/agent-runbook.md         extended manual flow + DOI fallback
├── config.example.toml                 config schema (copy to config.local.toml)
└── scripts/
    ├── run_ref_downloader.py           entry — config + DOI resolution + sequencing
    │     └─> extract_refs.py    (1) Crossref API: fetch parent's reference list
    │     └─> validate_refs.py   (2) Crossref API: per-ref metadata + publisher classify
    │     └─> download_refs.py   (3) Playwright/Edge: download main PDF + SI per publisher
    └── _config.py                      TOML + env-var loader

You can also run the three scripts manually for debugging or partial restarts. See the agent runbook in skills/ref-downloader/references/agent-runbook.md for the manual flow.
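
For orientation, stage (1) can be pictured as a request to the Crossref REST API's `works` endpoint followed by a pass over the `reference` array in the response. The sketch below only builds the URL and parses an already-fetched payload. The helper names and the sample payload (with placeholder DOIs) are made up for this example, and the real extract_refs.py may work differently.

```python
from urllib.parse import quote, urlencode


def crossref_works_url(doi: str, mailto: str) -> str:
    """URL for a work's metadata; `mailto` opts in to Crossref's polite pool."""
    path = quote(doi, safe="/")
    query = urlencode({"mailto": mailto})
    return f"https://api.crossref.org/works/{path}?{query}"


def split_references(payload: dict) -> tuple[list[str], int]:
    """Return (DOIs of references that have one, count of references without)."""
    refs = payload.get("message", {}).get("reference", [])
    dois = [ref["DOI"] for ref in refs if ref.get("DOI")]
    return dois, len(refs) - len(dois)


# Hand-written payload in the shape Crossref returns; the DOIs are placeholders.
SAMPLE_PAYLOAD = {
    "status": "ok",
    "message": {
        "reference": [
            {"key": "ref1", "DOI": "10.1000/example.1"},
            {"key": "ref2", "unstructured": "A book chapter with no DOI"},
            {"key": "ref3", "DOI": "10.1000/example.3"},
        ]
    },
}

print(crossref_works_url("10.1021/jacs.5c05017", "you@example.edu"))
dois, missing = split_references(SAMPLE_PAYLOAD)
print(f"References found: {len(dois) + missing}  No DOI: {missing}")
```

A paper with no reference list deposited at Crossref simply has no `reference` key in the response, which is why `split_references` falls back to an empty list.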

Agent users can install or inspect the packaged skill at skills/ref-downloader/SKILL.md. The repository root remains the human-facing Python project; the skill bundle is kept separate so Codex does not treat README, changelog, tests, and source files as always-associated skill context.

Supported publishers

ACS, Nature, Science, Elsevier, Wiley, RSC, Springer, PNAS, ECS, IOP, AIP, AVS, IEEE, OSA, KPS, Beilstein, APS, Annual Reviews, Taylor & Francis. Maturity varies — see docs/SUPPORTED_PUBLISHERS.md for the per-publisher tier table and known issues.

Known limitations

  • Windows + Microsoft Edge only: that's the verified path. macOS / Linux / Chromium support has not been tested. If you try, please open an issue with results.
  • Headed mode required: empirically, headless=True yields empty results for Wiley / ACS supplementary downloads. The default is headed.
  • Edge must be fully closed before running: Playwright needs exclusive access to the persistent profile. Check Task Manager for any background msedge.exe processes.
  • SSO redirects are detected, not solved: when the script bounces to your institution's SSO, the ref becomes manual_pending so you can sign in interactively. Configure [institution] to teach it which redirects to recognize.
  • SI download is the most fragile path: main PDFs are reliable; SI lookup varies by publisher and is the area most likely to need a tweak when a publisher updates their site.
  • Paywalled content needs institutional access: this is not a bypass tool.
  • Crossref dependency: papers with no reference list deposited at Crossref can't be processed automatically.
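
To illustrate how the [institution] settings could drive SSO detection, the sketch below checks a landing URL and page title against the three pattern lists. The function name and matching rules (exact host match, case-insensitive substring match for fragments and titles) are assumptions for this example, not the tool's actual logic.

```python
from urllib.parse import urlparse


def looks_like_sso(
    url: str,
    title: str,
    auth_hosts: list[str],
    auth_url_fragments: list[str],
    auth_page_titles: list[str],
) -> bool:
    """Guess whether a page is an institutional login rather than a PDF."""
    host = (urlparse(url).hostname or "").lower()
    if host in {h.lower() for h in auth_hosts}:
        return True
    lowered_url = url.lower()
    if any(fragment.lower() in lowered_url for fragment in auth_url_fragments):
        return True
    lowered_title = title.lower()
    return any(pattern.lower() in lowered_title for pattern in auth_page_titles)


# Example patterns; the host is the placeholder used in the config table.
hosts = ["sso.your-uni.edu"]
fragments = ["oauth", "saml"]
titles = ["Sign in"]

print(looks_like_sso("https://sso.your-uni.edu/login", "", hosts, fragments, titles))
print(looks_like_sso("https://pubs.acs.org/doi/pdf/10.1021/jacs.5c05017", "", hosts, fragments, titles))
```

A ref whose landing page matches would be the kind reported as manual_pending (auth_redirect) in the demo output.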

Contributing

See CONTRIBUTING.md for guidance on:

  • Adding a new publisher (DOI prefix → strategy)
  • Adding institutional SSO patterns
  • Reporting download failures with useful logs

Security

This tool launches your real Edge profile, with all your cookies and saved sessions. Read SECURITY.md before running it against a profile you also use for daily browsing.

License

MIT — see LICENSE.
