zotero-canvas-batch

Turn every PDF in a Zotero collection into an Obsidian Canvas
knowledge graph in one parallel run. Each canvas is a structured
bird's-eye view of the paper (Background → Method → Experiment →
Conclusion, with edges between them), and every node carries a
zotero://open-pdf/... deep-link, so a click jumps straight to the
relevant PDF page in Zotero.
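
To make the output format concrete, here is a minimal sketch of a two-node canvas with one edge and a deep-link. The node and edge fields follow the JSON Canvas format that Obsidian reads. The helper names, the node layout, the edge label, and the `library/items/<key>?page=<n>` link shape for a personal library are assumptions for illustration, not taken from batch.py.

```python
import json
from urllib.parse import quote


def open_pdf_link(attachment_key: str, page: int) -> str:
    """Build a Zotero deep-link to one page of a PDF attachment.

    Assumes a personal library; group libraries use a different path prefix.
    """
    return f"zotero://open-pdf/library/items/{quote(attachment_key)}?page={page}"


def text_node(node_id: str, text: str, x: int, y: int) -> dict:
    """One JSON Canvas text node; width and height are arbitrary here."""
    return {"id": node_id, "type": "text", "text": text,
            "x": x, "y": y, "width": 400, "height": 200}


def demo_canvas(attachment_key: str) -> dict:
    """Hypothetical two-node canvas: a claim with a page link, plus one edge."""
    link = open_pdf_link(attachment_key, 3)
    nodes = [
        text_node("background", f"Background claim [12-14] ([p. 3]({link}))", 0, 0),
        text_node("method", "Method summary", 500, 0),
    ]
    edges = [{"id": "e1", "fromNode": "background", "toNode": "method",
              "label": "motivates"}]
    return {"nodes": nodes, "edges": edges}


# Serialised form; writing this string to a .canvas file is all Obsidian needs.
canvas_json = json.dumps(demo_canvas("ABCD1234"), indent=2)
```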

The heavy lifting is delegated to the Gemini CLI;
this repo is the glue: it queries Zotero's SQLite, extracts PDF text
line-by-line with pdfplumber, builds an index→page map, prompts
Gemini, and writes the resulting JSON as .canvas files.
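
The index→page map is the piece that lets a cited line range be turned back into a PDF page. A minimal sketch, assuming the text of each page has already been extracted (for example with pdfplumber's `page.extract_text()`); the function name, the 1-based line numbering, and the skipping of blank lines are assumptions rather than the behaviour of batch.py.

```python
def build_line_index(pages: list[str], max_lines: int = 1200):
    """Flatten per-page text into numbered lines and remember each line's page.

    Returns (lines, index_to_page), where index_to_page maps a 1-based line
    index to a 1-based page number. Stops once max_lines lines are collected.
    """
    lines: list[str] = []
    index_to_page: dict[int, int] = {}
    for page_number, page_text in enumerate(pages, start=1):
        for raw in page_text.splitlines():
            line = raw.strip()
            if not line:
                continue  # blank lines carry no citable content
            if len(lines) >= max_lines:
                return lines, index_to_page  # truncate very long papers
            lines.append(line)
            index_to_page[len(lines)] = page_number
    return lines, index_to_page


def page_for_citation(index_to_page: dict[int, int], first_line: int) -> int:
    """Resolve a citation such as [12-14] to the page of its first line."""
    return index_to_page[first_line]
```

With this map, a citation like [12-14] resolves to the page of line 12, which is the page number a deep-link would need.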

→ Project homepage / tutorial

What it gives you

74 papers in a Zotero collection
            │
            ▼ batch.py (parallel, ~5 workers)
74 .canvas files in your Obsidian vault
   • metadata node: title / author / year / venue / DOI / citekey
   • semantically grouped nodes with inline citations [12-14]
   • zotero://open-pdf links on every claim
   • directional edges between related nodes

Drop the canvases into an Obsidian folder and you can navigate a
literature corpus the same way you'd navigate a codebase — by
following links, not by re-reading.

Designed to be called by an agent

This tool is built to be invoked by an LLM agent (e.g. Claude Code) as
a subprocess, not just typed into a shell. All inputs come from CLI
flags or env vars, output paths are predictable, and progress is
exposed via a self-refreshing progress.html. See SKILL.md
for the agent-facing interface contract.

Install

# 1. Python deps
pip install pdfplumber

# 2. Gemini CLI (one-time)
#    https://github.com/google-gemini/gemini-cli
#    then run `gemini` once to authenticate.

# 3. Clone this repo
git clone https://github.com/george-wyy/zotero-canvas-batch.git
cd zotero-canvas-batch

Tested on macOS with Zotero 7, Python 3.10+.

Configure

Either pass flags directly, or copy .env.example to .env and fill
in the paths that match your Zotero install:

cp .env.example .env
# edit .env
set -a; source .env; set +a   # bash/zsh
# fish: the line above does not work; export each variable with `set -x NAME value`

Defaults assume the stock Zotero layout (~/Zotero/zotero.sqlite,
~/Zotero/storage). If you moved your library, set ZOTERO_DB and
ZOTERO_STORAGE.
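
The fallback described above can be sketched as follows. The two environment variable names and the default paths come from this README; the function name and the injectable `env` mapping are assumptions for illustration.

```python
import os
from pathlib import Path


def resolve_zotero_paths(env=None) -> tuple[Path, Path]:
    """Return (database path, storage directory), preferring env overrides."""
    env = os.environ if env is None else env
    zotero_home = Path.home() / "Zotero"  # stock Zotero layout
    db = Path(env.get("ZOTERO_DB") or zotero_home / "zotero.sqlite").expanduser()
    storage = Path(env.get("ZOTERO_STORAGE") or zotero_home / "storage").expanduser()
    return db, storage
```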

Run

# Dry-run: just list what would be processed
python batch.py -c "MyCollection" -n

# Actually generate, 5 workers
python batch.py -c "MyCollection" -w 5

# Override output directory and model
python batch.py -c "MyCollection" \
  --output-dir ~/ObsidianVault/Literature/MyCollection \
  --model gemini-2.5-pro

# Recurse into sub-collections of a parent collection
python batch.py -c "MyCollection" -r

# Re-process just two papers
python batch.py -c "MyCollection" --keys ABCD1234 EFGH5678
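
# The -w flag sets how many papers are processed at once. The Python sketch
# below shows one common way to structure such a worker pool. Whether batch.py
# uses threads or processes is not stated in this README, so the thread pool,
# the function names, and the result format are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def run_pool(keys: Iterable[str], worker: Callable[[str], str],
             workers: int = 5) -> dict[str, tuple[str, str]]:
    """Run worker(key) for every key with at most `workers` in flight.

    One failing paper is recorded and does not stop the rest of the batch.
    """
    results: dict[str, tuple[str, str]] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(worker, key): key for key in keys}
        for future in as_completed(futures):
            key = futures[future]
            try:
                results[key] = ("ok", future.result())
            except Exception as exc:  # keep going; report the failure per paper
                results[key] = ("failed", str(exc))
    return results


def fake_worker(key: str) -> str:
    """Stand-in for the real per-paper work (extract text, prompt, write file)."""
    return f"{key}.canvas"
```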

While batch.py runs, it writes a live progress.html next to the
canvases — open it in your browser for a per-paper progress table.
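
One simple way to build a self-refreshing status page is a `<meta http-equiv="refresh">` tag plus an atomic file replace, as sketched below. This is an illustration of the idea only; the actual markup, columns, and refresh mechanism of progress.html are not described in this README, and every name here is hypothetical.

```python
import html
from pathlib import Path


def render_progress(rows: dict[str, str], refresh_seconds: int = 5) -> str:
    """Render a per-paper status table that the browser reloads on a timer."""
    body = "\n".join(
        f"<tr><td>{html.escape(key)}</td><td>{html.escape(status)}</td></tr>"
        for key, status in rows.items()
    )
    return (
        "<!doctype html>\n<html><head>"
        f'<meta http-equiv="refresh" content="{refresh_seconds}">'
        "<title>Progress</title></head><body><table>\n"
        "<tr><th>Paper</th><th>Status</th></tr>\n"
        f"{body}\n</table></body></html>\n"
    )


def write_progress(output_dir: str, rows: dict[str, str]) -> Path:
    """Write progress.html via a temp file so a reload never sees a half-written page."""
    path = Path(output_dir) / "progress.html"
    tmp = path.with_name("progress.html.tmp")
    tmp.write_text(render_progress(rows), encoding="utf-8")
    tmp.replace(path)
    return path
```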

Output filenames look like:

{citekey} - {Author} - {Year} - {Short Title}-canvas-{model}.canvas

If citekey isn't supplied via --citekey-map, the script falls back
to {author}{year}.

`rename_patch.py`

If you generated canvases with an older version or by hand, this
script renames them into the canonical filename above and patches the
metadata node with the Citation Key field.

python rename_patch.py ./canvas-out --citekey-map citekey_map.json

Optional: citekey map

If you also keep a BetterBibTeX export, you can pass in a JSON map of
attachment-key → BibTeX citekey so that canvas filenames use the
human-readable key:

{
  "TEPVVD2P": "schonwerth2025exploring",
  "8K3ZMXQS": "lin2024hcicalibration"
}

Pass with --citekey-map path.json or set ZCB_CITEKEY_MAP.

Downstream pipelines

See examples/rw-pipeline/ for a worked example that uses the
generated canvases as a knowledge base to draft a literature-review
chapter (classify → screen → extract evidence → write paragraphs).
That pipeline is project-specific and meant as a template you adapt.

Caveats

Zotero holds a write lock on its SQLite database, so the script first
copies the file to a temporary location with shutil.copy2 and reads
the copy; the original is never touched.

PDF extraction is line-based with pdfplumber; very long papers
are truncated at MAX_LINES = 1200.

Cost is whatever your Gemini CLI usage costs; on gemini-2.5-flash a
74-paper run costs very little.
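
The copy-then-read approach to the database can be sketched as follows. `shutil.copy2` and the temporary copy come from this README; the function name, opening the copy in SQLite's read-only URI mode, and leaving cleanup to the caller are assumptions for illustration.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def open_snapshot(db_path: str) -> tuple[sqlite3.Connection, Path]:
    """Copy the database to a temp directory and open the copy read-only.

    Returns the connection and the temp directory, which the caller should
    remove with shutil.rmtree once the connection is closed.
    """
    tmp_dir = Path(tempfile.mkdtemp(prefix="zotero-snapshot-"))
    copy_path = tmp_dir / Path(db_path).name
    shutil.copy2(db_path, copy_path)  # the original file is only read
    # mode=ro makes accidental writes fail even on the copy.
    conn = sqlite3.connect(copy_path.as_uri() + "?mode=ro", uri=True)
    return conn, tmp_dir
```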

License

MIT — see LICENSE.