workflow-tracker

Automatic experiment & workflow tracker for ML/AI projects — no manual logging needed. Detects experiments in real time, stores structured records, generates progress reports on demand.

Chinese documentation (README_zh.md)

Why?

Common pains in ML experimentation:

Ran dozens of experiments, forgot the parameters and conclusions two weeks later
Notes scattered across chat logs, terminal output, and memory — impossible to aggregate
Writing weekly reports means digging through all experiment records from scratch
Paper CHANGELOGs and experiment logs have inconsistent formats

workflow-tracker addresses all of these by detecting experiment activity and recording it silently, so tracking takes no extra effort.

Install

npx skills add workflow-tracker -g

Or from a local clone:

git clone https://github.com/MarkD1Zzz/workflow-tracker.git
npx skills add ./workflow-tracker -g

Requires Node.js ≥ 18 and Claude Code or a compatible agent.

Features

Auto-Detect & Silent Logging

Triggers automatically on these signals, without interrupting your workflow:

Running training/evaluation scripts
Metric changes (accuracy, F1, loss, etc.)
Parameter changes ("change lr to 0.001")
Verbal experiment conclusions ("tried X, result was Y")
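
As a rough illustration of the four signal types above, the sketch below classifies a chat message with a few regular expressions. The patterns, keyword lists, and the `detect_signals` name are assumptions made for this example, not the skill's actual matching rules.

```python
import re

# Hypothetical patterns, chosen only to illustrate the four signal types.
SIGNAL_PATTERNS = {
    "script_run": re.compile(r"\b(?:python|node)\s+\S*(?:train|eval)\S*", re.I),
    "metric_change": re.compile(
        r"\b(?:accuracy|f1|loss)\b.*?\d+(?:\.\d+)?%?\s*(?:→|->)\s*\d+(?:\.\d+)?%?",
        re.I,
    ),
    "param_change": re.compile(r"\b(?:change|set)\s+\w+\s+to\s+[\w.\-]+", re.I),
    "verbal_conclusion": re.compile(r"\btried\b.+\bresult\b", re.I),
}


def detect_signals(message: str) -> list[str]:
    """Return the names of all signal types whose pattern matches the message."""
    return [name for name, pattern in SIGNAL_PATTERNS.items() if pattern.search(message)]
```

A message that matches no pattern yields an empty list, which a caller could treat as "nothing to log".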

Dual-Mode Output

Project Type	Detection Signal	Output Files
Engineering	`data/train/`, `train.py`, `pipeline`	`workflow.json` + `workflow.md`
Paper	`tex/`, `manuscript`, `figures/`	`CHANGELOG.md` + `experiment_log.md`
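
One way to picture the detection in the table above is a simple existence check on the listed paths. This is a minimal sketch: treating every signal as a path relative to the project root, checking paper signals first, and falling back to `"unknown"` are all assumptions, as are the function and constant names.

```python
from pathlib import Path

# Signals taken from the table above; how they are matched is an assumption.
PAPER_SIGNALS = ("tex", "manuscript", "figures")
ENGINEERING_SIGNALS = ("data/train", "train.py", "pipeline")

OUTPUT_FILES = {
    "paper": ("CHANGELOG.md", "experiment_log.md"),
    "engineering": ("workflow.json", "workflow.md"),
}


def detect_project_type(root: str) -> str:
    """Guess the project type from paths that exist under root.

    Paper signals win when both kinds are present (assumed precedence).
    """
    base = Path(root)
    if any((base / signal).exists() for signal in PAPER_SIGNALS):
        return "paper"
    if any((base / signal).exists() for signal in ENGINEERING_SIGNALS):
        return "engineering"
    return "unknown"  # hypothetical fallback; the README does not define one
```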

Three-Level Structure

Phase → Task → Experiment

Each experiment auto-extracts: Hypothesis / Method / Parameters / Results (with delta) / Conclusion / Tags
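
The three levels map naturally onto nested records. The dataclasses below are an illustrative model only: the field names follow the list above and the workflow.json sample in this README, while the defaults and class names are assumptions.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Experiment:
    title: str
    hypothesis: str = ""
    method: str = ""
    params: dict = field(default_factory=dict)
    results: dict = field(default_factory=dict)  # e.g. baseline, new, delta
    conclusion: str = ""
    tags: list = field(default_factory=list)


@dataclass
class Task:
    name: str
    status: str = "in_progress"  # assumed default
    experiments: list = field(default_factory=list)


@dataclass
class Phase:
    name: str
    status: str = "in_progress"  # assumed default
    tasks: list = field(default_factory=list)


# Build one Phase -> Task -> Experiment chain and convert it to plain dicts.
exp = Experiment(title="SVM(linear) replaces MLP", params={"kernel": "linear", "C": 1})
phase = Phase(name="Phase 1", tasks=[Task(name="Task 1.1", experiments=[exp])])
record = asdict(phase)
```

`asdict` recurses through the nested lists, so `record` can be passed straight to `json.dumps`.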

Report Generation

Say "generate report" to produce:

Paper project: Update CHANGELOG.md + experiment_log.md
Engineering project: Generate .docx.json + .pptx.json intermediate files (render them to final documents later with any tool)

Examples

Scenario 1: Engineering — Classifier Swap

You: Swapped Stage 2 MLP for SVM(linear, C=1). Accuracy: 93.75% → 94.79%. SVM is deterministic.

Claude: Recorded. SVM(linear) → SUCCESS, delta +1.04pp.
       → workflow.json + workflow.md updated

Scenario 2: Paper — Ablation Study

You: Finished attention module ablation. SE 94.2%, CBAM 94.8%, FAA 96.1%.

Claude: Paper project detected (F:/paper/).
       → CHANGELOG.md appended with timeline entry
       → experiment_log.md appended with detailed record

Scenario 3: Report Generation

You: Generate a progress report for the last two weeks.

Claude: Generated report_20260614.docx.json + report_20260614.pptx.json
        Run node render.js or python render.py to produce final files.

Output Formats

workflow.json (Engineering)

{
  "project": "Welding Defect Classification",
  "updated": "2026-06-14T14:30",
  "phases": [{
    "name": "Phase 1: Accuracy Optimization",
    "status": "in_progress",
    "tasks": [{
      "name": "Task 1.1: Classifier Replacement",
      "status": "completed",
      "experiments": [{
        "date": "2026-06-14",
        "title": "SVM(linear) replaces MLP",
        "method": "SVC(kernel='linear', C=1, class_weight='balanced')",
        "params": {"kernel": "linear", "C": 1},
        "results": {"baseline": 93.75, "new": 94.79, "delta": 1.04},
        "conclusion": "SUCCESS",
        "tags": ["classifier", "svm", "breakthrough"]
      }]
    }]
  }]
}
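
To show how a record like the one above could be consumed, here is a small sketch that walks the phases, tasks, and experiments and prints one summary line per experiment. The helper names `iter_experiments` and `summarize` are hypothetical; only the JSON keys come from the sample.

```python
import json


def iter_experiments(workflow: dict):
    """Yield (phase name, task name, experiment dict) for every experiment."""
    for phase in workflow.get("phases", []):
        for task in phase.get("tasks", []):
            for exp in task.get("experiments", []):
                yield phase["name"], task["name"], exp


def summarize(path: str) -> list[str]:
    """Load a workflow.json file and return one summary line per experiment."""
    with open(path, encoding="utf-8") as fh:
        workflow = json.load(fh)
    lines = []
    for _phase, task, exp in iter_experiments(workflow):
        delta = exp.get("results", {}).get("delta")
        delta_text = f"{delta:+.2f}" if delta is not None else "n/a"
        lines.append(
            f"{exp['date']} | {task} | {exp['title']} | {exp['conclusion']} | delta {delta_text}"
        )
    return lines
```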

CHANGELOG.md (Paper)

## 2026-06-14 — Attention Module Ablation

### Background
Comparing SE / CBAM / FAA attention modules on NEU-DET.

### Results
| Module | Accuracy | Delta vs SE |
|--------|----------|-------------|
| SE     | 94.2%    | baseline    |
| CBAM   | 94.8%    | +0.6pp      |
| FAA    | 96.1%    | +1.9pp      |

### Conclusion
FAA significantly outperforms SE and CBAM. Ablation validates the attention redundancy hypothesis.
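
The Results table in the entry above can be produced from raw accuracies with a few lines of formatting. The following is a sketch under the assumption that deltas are reported in percentage points against a named baseline; `ablation_table` is a hypothetical helper, not part of the skill.

```python
def ablation_table(results: dict[str, float], baseline: str) -> str:
    """Render a Markdown table with each variant's delta versus the baseline.

    Deltas are expressed in percentage points (pp).
    """
    rows = [
        f"| Module | Accuracy | Delta vs {baseline} |",
        "|--------|----------|-------------|",
    ]
    base = results[baseline]
    for name, acc in results.items():
        delta = "baseline" if name == baseline else f"{acc - base:+.1f}pp"
        rows.append(f"| {name} | {acc:.1f}% | {delta} |")
    return "\n".join(rows)


table = ablation_table({"SE": 94.2, "CBAM": 94.8, "FAA": 96.1}, baseline="SE")
```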

How It Works

Project Type Detection: Scans directory structure (tex/→paper, data/train/→engineering)
Signal Detection: Matches experiment keywords + numeric change patterns in conversation
Batch Writing: Accumulates experiments and writes them once per round, avoiding excessive I/O
Delta Auto-Calculation: Computes the difference (new minus old) whenever both an old and a new value appear, e.g. 93.75% → 94.79% gives +1.04pp
Tag Auto-Classification: Assigns tags like architecture, hyperparameter-tuning, classifier, data-augmentation based on method type
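
Two of the steps above, delta calculation and batch writing, can be sketched as follows. The regular expression, the `extract_delta` and `BatchWriter` names, and the flat JSON list used as storage are assumptions for illustration; the real skill writes the nested workflow.json structure shown earlier.

```python
import json
import re
from pathlib import Path

# Matches "93.75% → 94.79%" or "93.75 -> 94.79" (assumed notation).
CHANGE = re.compile(r"(\d+(?:\.\d+)?)\s*%?\s*(?:→|->)\s*(\d+(?:\.\d+)?)\s*%?")


def extract_delta(text: str):
    """Return baseline, new value, and delta (new - baseline), or None if no pair is found."""
    match = CHANGE.search(text)
    if match is None:
        return None
    old, new = float(match.group(1)), float(match.group(2))
    return {"baseline": old, "new": new, "delta": round(new - old, 2)}


class BatchWriter:
    """Buffer experiment records in memory and write them in a single pass."""

    def __init__(self, path: str):
        self.path = Path(path)
        self.pending = []

    def add(self, record: dict) -> None:
        self.pending.append(record)

    def flush(self) -> int:
        """Append all pending records to the file with one read and one write."""
        if not self.pending:
            return 0
        existing = []
        if self.path.exists():
            existing = json.loads(self.path.read_text(encoding="utf-8"))
        existing.extend(self.pending)
        self.path.write_text(json.dumps(existing, indent=2), encoding="utf-8")
        written, self.pending = len(self.pending), []
        return written
```

Calling `flush()` once at the end of a round keeps file access to a single read and a single write, however many experiments were recorded in between.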

Use Cases

Deep learning model training & tuning
Academic paper ablation study management
GAN/VAE/Diffusion model iteration
Computer vision classification/detection/segmentation
Any ML workflow that needs "what was tried → what happened → what it means" tracking

Sub-Skills

manuscript-check — Paper Manuscript Integrity Checker

Six-step closed-loop verification for academic paper manuscripts. Activated when you question data provenance, architecture naming, ablation authenticity, or narrative consistency.

Step	Action
1. Source Verification	Trace back to original paper/code as ground truth
2. Impact Analysis	Grep all occurrences across manuscript, estimate blast radius
3. Batch Edit	Sync tex body, tables, figure scripts in one pass
4. Residue Check	Verify old terms reach zero hits post-edit
5. Consistency Audit	Detect contradictions between sections (numeric, terminology, evidence)
6. Memory Persist	Update project memory files with final state
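
Steps 2 and 4 both come down to counting occurrences of a term across the manuscript. A minimal sketch of that check is shown below; the scanned file suffixes, the case-sensitive substring match, and the function names are assumptions, and the sub-skill itself may rely on grep instead.

```python
from pathlib import Path


def count_hits(root: str, terms: list[str], suffixes=(".tex", ".py")) -> dict[str, int]:
    """Count case-sensitive occurrences of each term in matching files under root."""
    hits = {term: 0 for term in terms}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            for term in terms:
                hits[term] += text.count(term)
    return hits


def residue_free(root: str, old_terms: list[str]) -> bool:
    """Return True when no old term survives anywhere in the scanned files."""
    return all(count == 0 for count in count_hits(root, old_terms).values())
```

Running `count_hits` before the edit gives the blast radius; running `residue_free` afterwards confirms that the old terms have reached zero hits.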

Trigger signals: "verify X", "was this experiment actually run?", "X is my work not a citation", "X never existed", "sync figures"

Scoped to: the F:/论文/ paper project. The sub-skill hardcodes architecture facts for the RFS/EAAI context.

Repo Structure

workflow-tracker/
├── SKILL.md               # Main skill file (Claude Code entry point)
├── SKILL_EN.md            # English skill definition
├── README.md              # This file (English)
├── README_zh.md           # Chinese documentation
├── LICENSE                # MIT
├── evals.json             # 6 test cases, 25 assertions
├── .gitignore
└── manuscript-check/      # Sub-skill: paper manuscript integrity checker
    └── SKILL.md           # Six-step verification workflow

Development

Running Tests

cd workspace/iteration-2
python grade_all.py

Benchmark (v2)

Metric	Value
Avg Response Time	131s
Avg Tokens	27k
Pass Rate (6 evals)	100%
Paper Mode	✓
JSON Intermediate Format	✓

License

MIT. See the LICENSE file.

Credits

Built on the Claude Code Skills framework. Inspired by real-world experiment management needs in welding defect classification, ConvNeXt-FAA paper research, and the spot_welding_gan training project.

Changelog

v1.1.0 (2026-06-16)

New: manuscript-check sub-skill — six-step paper manuscript integrity verification
- Source-to-manuscript cross-referencing with grep impact analysis
- Multi-section batch editing (tex + tables + figure scripts)
- Post-edit residue checking + narrative consistency audit
- Automatic memory file persistence
Improved: Bootstrap CHANGELOG.md + experiment_log.md on first paper project load

v1.0.0 (2026-06-14)

Initial release
Auto-detect & silent logging for ML experiments
Dual-mode output: engineering (workflow.json + workflow.md) / paper (CHANGELOG.md + experiment_log.md)
Three-level structure: Phase → Task → Experiment
Report generation: .docx.json + .pptx.json intermediate format