Paper Format Agent

An open-source DOCX formatter for academic papers that proves it never touched your text.

Paper Format Agent reformats a thesis or paper (fonts, indents, alignment, spacing, headings, captions) to match a target format guide. It ships with a verifiable content fingerprint, so you can confirm that your actual academic writing came out byte-identical to how it went in. Everything runs locally on your machine. It is also packaged as an installable agent skill (SKILL.md + agents/openai.yaml), so tools like Claude Code or Codex CLI can invoke it directly instead of a human clicking through a GUI.

Proof, not a promise

Real fields from an actual run's format_report.json:

{
  "content_fingerprint_before": "793e6533fd670418141d11fdcf014be19750408129ecff8b1b78a2641a3786db",
  "content_fingerprint_after":  "793e6533fd670418141d11fdcf014be19750408129ecff8b1b78a2641a3786db",
  "content_changed": false,
  "content_guard_enforced": true
}

The before and after fingerprints match, and an independent paragraph-by-paragraph .text diff over the whole document confirms that every word survived. What did change in that same file was formatting only: body text went from unset font, indent, and alignment to SimSun (宋体) 12pt with a 2-character first-line indent and justified alignment; the abstract title became SimSun 18pt, centered; keywords became SimSun 12pt, left-aligned. The same run also reported the real problems it found, char_below_min (the document is under the guide's minimum length) and blank_page_risk, rather than silently claiming a perfect score.
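The check described above can be sketched in a few lines. This is a minimal illustration, not the project's actual algorithm: the function names `content_fingerprint` and `changed_paragraphs` are hypothetical, and the choice of SHA-256 over length-prefixed UTF-8 paragraph text is an assumption (the report only shows 64-hex-character fingerprints). In a real run the paragraph lists would come from the DOCX, for example from each paragraph's `.text` in python-docx.

```python
import hashlib


def content_fingerprint(paragraphs):
    """Hash paragraph texts in order (hypothetical scheme, assumed SHA-256).

    Each paragraph is length-prefixed so that moving text across a
    paragraph boundary also changes the fingerprint.
    """
    digest = hashlib.sha256()
    for text in paragraphs:
        data = text.encode("utf-8")
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def changed_paragraphs(before, after):
    """Return the indices at which the two paragraph lists differ."""
    longest = max(len(before), len(after))
    return [
        i
        for i in range(longest)
        if (before[i] if i < len(before) else None)
        != (after[i] if i < len(after) else None)
    ]


# Formatting changes leave the text lists equal, so both checks pass.
before = ["摘要", "This is the body text.", "Keywords: formatting"]
after = list(before)
assert content_fingerprint(before) == content_fingerprint(after)
assert changed_paragraphs(before, after) == []
```

The fingerprint gives a single value to compare, while the paragraph diff pinpoints where a mismatch occurred, which is why running both is useful.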

Why This Exists

Every closed-source formatting service (论文无忧, WPS 论文排版, 大以论文, AIPoliDoc, and similar) asks you to trust that your content survives the reformatting pass; none of them lets you verify it.

The content guard is the smallest honest promise: change the formatting, but not a single character of the text. If that can't be confirmed, the run aborts with an error (content guard failed) instead of shipping a silently altered document. It's fail-closed and enforced by default.
Open-source and auditable: read the code, or just diff the fingerprint yourself.
Formatting-only automation across margins, fonts, line spacing, headings, captions, tables, and references, plus required-section checks (abstracts, keywords, table of contents) and running headers / centered page-number footers.
Reports are usable by students, supervisors, reviewers, and CI.
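The fail-closed behavior described in the list above could look roughly like the following. This is a sketch under stated assumptions, not the repository's code: `ContentGuardError`, `run_with_content_guard`, and the dict-based paragraph records are hypothetical, and only the four report field names and the "content guard failed" message come from this README.

```python
import hashlib


class ContentGuardError(RuntimeError):
    """Hypothetical error type for a failed content guard."""


def _fingerprint(texts):
    # Assumed scheme: SHA-256 over length-prefixed UTF-8 paragraph text.
    digest = hashlib.sha256()
    for text in texts:
        data = text.encode("utf-8")
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def run_with_content_guard(paragraphs, format_step):
    """Run format_step and abort unless the text is unchanged."""
    # Fingerprint first, in case format_step mutates its input in place.
    before = _fingerprint([p["text"] for p in paragraphs])
    result = format_step(paragraphs)
    after = _fingerprint([p["text"] for p in result])
    if before != after:
        # Fail closed: no output is returned for an altered document.
        raise ContentGuardError("content guard failed")
    report = {
        "content_fingerprint_before": before,
        "content_fingerprint_after": after,
        "content_changed": False,
        "content_guard_enforced": True,
    }
    return result, report


def justify(paragraphs):
    """A formatting-only step: it touches style, never text."""
    return [{**p, "style": {"align": "justify"}} for p in paragraphs]


formatted, report = run_with_content_guard(
    [{"text": "Body text. ", "style": {}}], justify
)
assert report["content_changed"] is False
```

Raising before anything is returned means a caller cannot accidentally save an altered document and inspect the report later.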

Status

This project is a practical open-source MVP. It is suitable for demos, internal pilots, agent workflows, and synthetic benchmark development. Before relying on it for high-stakes submissions, expand the regression corpus, template coverage, and object-level scoring for tables, figures, equations, footnotes, headers, and footers.

Agent Skill

This repository includes a top-level SKILL.md and agents/openai.yaml, so agent users can treat the repo as an installable skill.

The skill teaches an agent how to:

inspect input files safely
run the formatter in content-preserving mode
review format_report.json
validate changes before returning results
add new template rules with tests

Quick Start

pip install -r requirements.txt

python -m paper_format_agent.cli \
  --format-file "format_guide.docx" \
  --paper-file "paper.docx" \
  --out-dir "./output" \
  --engine auto \
  --strict-required-sections

Optional GUI:

python run_gui.py

Batch processing:

python -m paper_format_agent.cli \
  --format-file "format_guide.docx" \
  --paper-dir "./papers" \
  --out-dir "./batch_output" \
  --engine python \
  --strict-required-sections

Batch mode writes one output folder per paper plus batch_summary.json, including pass rate, score averages, content-change count, and per-paper report locations.
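To make the aggregation concrete, here is a small sketch of how such a summary might be computed from per-paper reports. The schema of batch_summary.json is not documented in this README, so every summary key below, along with the per-paper `passed` and `score` fields, is a hypothetical placeholder. Only `content_changed` and the format_report.json file name are taken from the source.

```python
from statistics import mean


def summarize_batch(reports):
    """Aggregate per-paper reports, keyed by output folder name.

    All key names except "content_changed" are illustrative assumptions.
    """
    total = len(reports)
    passed = sum(1 for r in reports.values() if r.get("passed"))
    return {
        "paper_count": total,
        "pass_rate": passed / total if total else 0.0,
        "score_average": mean(r["score"] for r in reports.values()) if total else 0.0,
        "content_change_count": sum(
            1 for r in reports.values() if r.get("content_changed")
        ),
        "report_locations": {
            name: f"{name}/format_report.json" for name in reports
        },
    }


summary = summarize_batch(
    {
        "paper_a": {"passed": True, "score": 90, "content_changed": False},
        "paper_b": {"passed": False, "score": 70, "content_changed": False},
    }
)
assert summary["content_change_count"] == 0
```

For a content-preserving batch, the content-change count is the number to watch: anything other than zero means at least one paper needs manual review.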

Template Packs And Synthetic Examples

The repository includes privacy-safe template packs and synthetic examples so users can try the workflow without uploading real papers:

templates/ contains JSON presets for Chinese thesis, journal article, and IEEE-style conference formatting.
examples/ contains a synthetic format guide and sample reports for demos, issues, and PRs.
docs/TEMPLATE_PACKS.md explains the template contract and contribution checklist.

Template files are intentionally plain JSON. They are easy to review, easy to customize locally, and safe to extend through small PRs.

Outputs

File	Purpose
`formatted_paper_v3.docx`	repaired DOCX document
`format_rules.json`	extracted formatting rules
`format_report.json`	machine-readable score and checks
`format_report.html`	human-readable report
`modify_log.json`	formatting operation log
`engine_report.json`	Word COM / LibreOffice / Python post-process result
`marker_dump.json`	optional paragraph classification dump

Safety Model

By default, the pipeline enforces a content guard. Reports include:

content_changed
content_guard_enforced
content_fingerprint_before
content_fingerprint_after
diagnostics with severity, evidence, and suggested fixes for failed checks

For normal academic formatting, content_changed should be false.
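A reviewer or CI job can check these fields without trusting the tool's exit code. The sketch below assumes only the four field names listed above; the helper name `check_report` and its messages are hypothetical. The sample JSON reuses the values shown earlier in this README.

```python
import json

REQUIRED_FIELDS = (
    "content_changed",
    "content_guard_enforced",
    "content_fingerprint_before",
    "content_fingerprint_after",
)


def check_report(report):
    """Return a list of problems; an empty list means the guard held."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in report]
    if problems:
        return problems
    if report["content_guard_enforced"] is not True:
        problems.append("content guard was not enforced")
    if report["content_changed"] is not False:
        problems.append("content_changed is not false")
    if report["content_fingerprint_before"] != report["content_fingerprint_after"]:
        problems.append("fingerprints differ")
    return problems


sample = json.loads(
    """
    {
      "content_fingerprint_before": "793e6533fd670418141d11fdcf014be19750408129ecff8b1b78a2641a3786db",
      "content_fingerprint_after": "793e6533fd670418141d11fdcf014be19750408129ecff8b1b78a2641a3786db",
      "content_changed": false,
      "content_guard_enforced": true
    }
    """
)
assert check_report(sample) == []
```

In practice the dictionary would be loaded from the format_report.json in the output folder, and a non-empty result would fail the job.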

Validation

python tools/validate_skill.py
python -m unittest discover -s tests -p "test_*.py"
python tools/compile_check.py
python tools/release_audit.py

Before publishing from a local workspace, also run:

python tools/release_audit.py --include-local

This optional check includes untracked and ignored local artifacts, such as generated outputs, scratch files, caches, and private document formats.

Good First PRs

We want many small, reviewable PRs. Good contribution areas:

Add a synthetic test for a school, journal, or conference formatting rule.
Add a new synthetic template pack in templates/.
Improve a narrowly scoped rule extractor.
Add scoring coverage for tables, figures, references, equations, headers, or footers.
Improve report wording or diagnostics.
Add local-first integrations such as MCP, GitHub Actions, or batch processing.
Improve this repo's SKILL.md workflow for agent users.

New contributors can start from the task-ready board in docs/CONTRIBUTOR_TASKS.md. Each task lists the user pain point, the expected PR shape, and suggested labels.

See CONTRIBUTING.md, ROADMAP.md, and AGENTS.md.

Architecture

format guide + paper.docx
  -> rule extraction
  -> paragraph type tagging
  -> style application
  -> numbering cleanup
  -> optional engine post-process
  -> scoring and reports
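The flow above can be pictured as an ordered list of stages that pass a shared state along. The sketch below is a toy illustration only: every function name and the state layout are hypothetical, the stage bodies are stand-ins, and the sample rule reuses the body-text settings mentioned earlier (SimSun, 12pt, justified). Its one real point is that stages attach types and styles to paragraphs while leaving the text field alone.

```python
def extract_rules(state):
    # Stand-in: a real extractor would read these from the format guide.
    state["rules"] = {"body": {"font": "SimSun", "size_pt": 12, "align": "justify"}}
    return state


def tag_paragraphs(state):
    # Stand-in: tag everything as body text.
    for paragraph in state["paragraphs"]:
        paragraph["type"] = "body"
    return state


def apply_styles(state):
    # Copy the rule for each paragraph type; the text is never touched.
    for paragraph in state["paragraphs"]:
        paragraph["style"] = dict(state["rules"].get(paragraph["type"], {}))
    return state


def passthrough(state):
    # Placeholder for stages not modeled in this sketch.
    return state


PIPELINE = [
    ("rule extraction", extract_rules),
    ("paragraph type tagging", tag_paragraphs),
    ("style application", apply_styles),
    ("numbering cleanup", passthrough),
    ("optional engine post-process", passthrough),
    ("scoring and reports", passthrough),
]


def run_pipeline(paragraph_texts):
    """Run every stage in order and record which ones ran."""
    state = {"paragraphs": [{"text": t} for t in paragraph_texts], "log": []}
    for name, stage in PIPELINE:
        state = stage(state)
        state["log"].append(name)
    return state


state = run_pipeline(["First paragraph.", "Second paragraph."])
assert [p["text"] for p in state["paragraphs"]] == ["First paragraph.", "Second paragraph."]
```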

Privacy

Do not commit real papers, private school templates, reviewer comments, API keys, or generated documents. Use synthetic fixtures or anonymized snippets in tests.

License

MIT. See LICENSE.