video-recap-skills

skill
Security Audit
Passed
Health: Passed
  • License — License: MIT
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Community trust — 94 GitHub stars
Code: Passed
  • Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions: Passed
  • Permissions — No dangerous permissions requested


SUMMARY

AI narration skill for Claude Code: input a video, output a recap video with a Chinese voiceover.

README.md

video-recap-skills


A Claude Code plugin that turns a video into a Chinese-narration recap — story research, ASR + VLM scene understanding, agent-written narration, TTS voiceover, subtitles, and dynamic audio mixing — built as a bundle of small, independent skills. Runs on ffmpeg + one Xiaomi MiMo API key.

License: MIT
Claude Code Plugin
Powered by MiMo
Python
Cross-platform

Demo

https://github.com/user-attachments/assets/92698ec6-0d23-4f9f-8825-c3684ef57aff

What is it?

video-recap-skills helps an agent create short-form narrated recap videos from existing video files.
It is a bundle of five independent skills + a thin orchestrator. Each stage is a self-contained
skill (its own code, no shared modules); they communicate only through JSON/MP4 artifacts in a
shared work_dir. The agent writes the narration; the tooling does the deterministic media work.

Easy to run: the entire pipeline — speech-to-text, vision understanding, and text-to-speech —
runs on a single Xiaomi MiMo API key plus ffmpeg. No GPU,
no local model downloads, no extra services. Works on macOS, Linux, and Windows.

flowchart TB
    input([Input video]) --> understand
    context[[Story research / context]] -.-> script

    subgraph understand[video-understanding]
        direction LR
        scene[Scene cuts] --- asr[ASR dialogue] --- vlm[VLM frame facts] --- brief[Brief + index]
    end

    subgraph script[video-script · the agent writes narration.json]
        direction LR
        write[Write] --- review[Review gate] --- validate[Validate timing]
    end

    cut[video-cut · optional, cut mode]
    subgraph produce[produce]
        direction LR
        voice[video-voiceover · MiMo TTS] --- assemble[video-assemble · mux + duck + subtitles]
    end
    output([Recap video])

    understand --> script
    script --> cut --> produce
    script --> produce
    produce --> output

    classDef s fill:#eef6ff,stroke:#4f86c6,color:#1f2937;
    classDef w fill:#f3ecff,stroke:#7c3aed,color:#1f2937;
    classDef p fill:#ecfdf3,stroke:#16a34a,color:#1f2937;
    class input,context,understand s;
    class script,cut w;
    class produce,output p;

Architecture — the skill bundle

video-recap is the user-facing orchestrator; it chains the stage skills (each invoked as a
subprocess) and pauses for the agent to write the narration. The four pure-tool stages are hidden
(user-invocable: false); video-recap and video-script are the ones you invoke.

| Skill | Does | In → Out (the work_dir contract) |
| --- | --- | --- |
| video-understanding | scene detect · frame extract · ASR (mimo-v2.5-asr) · VLM (mimo-v2.5) · fuse timeline · build brief (+ optional --consolidate index) | video → scenes / asr_result / vlm_analysis / silence_periods / timeline_fusion / agent_narration_brief.md |
| video-script | writing rules (SKILL.md) + review (LLM-as-judge) + lint/validate | brief + index → narration.json |
| video-cut | clip plan → cut source + remap narration (cut mode) | clip_plan.json + video → edited_source.mp4 + narration_mapped.json |
| video-voiceover | synthesize narration audio (MiMo TTS, mimo-v2.5-tts) | narration.json → tts_segments/ + tts_meta.json |
| video-assemble | mux · duck original audio · render subtitles | video + tts_meta → recap_<name>.mp4 + subtitles.srt/.ass |
| video-recap | orchestrator + --doctor | video → recap_<name>.mp4 |

Each skill ships its own lib.py (merged config + utils) — there is no shared code file; the
JSON artifacts are the only interface. See each skill's SKILL.md for its full options.
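
As a rough illustration of an artifact-only interface, the sketch below checks which stages could run from the files already present in work_dir. It is not the plugin's code: STAGE_INPUTS, missing_inputs, and runnable_stages are hypothetical names, and the mapping covers only a subset of the contract, using file names that appear in this README.

```python
from pathlib import Path

# Hypothetical subset of the work_dir contract (file names taken from this README).
STAGE_INPUTS = {
    "video-script": ["agent_narration_brief.md"],
    "video-voiceover": ["narration.json"],
    "video-assemble": ["tts_meta.json"],
}


def missing_inputs(stage: str, work_dir: Path) -> list[str]:
    """Return the input artifacts a stage still needs, in contract order."""
    return [name for name in STAGE_INPUTS[stage] if not (work_dir / name).exists()]


def runnable_stages(work_dir: Path) -> list[str]:
    """Return the stages whose listed inputs are all present in work_dir."""
    return [stage for stage in STAGE_INPUTS if not missing_inputs(stage, work_dir)]
```

Because the check only looks at files, a stage can be rerun or replaced without touching the others, which is the point of keeping the artifacts as the sole interface.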

Why use it?

  • One key, runs anywhere: ASR, VLM, and TTS all go through Xiaomi MiMo's OpenAI-compatible API. The only local dependency is ffmpeg; no GPU or model files. macOS / Linux / Windows.
  • Story research before writing: pull plot, characters, relationships, and world context into the brief so the recap is not visual guesswork.
  • ASR + VLM understanding: mimo-v2.5-asr dialogue transcripts combined with scene cuts, mimo-v2.5 VLM descriptions, and frame-level facts.
  • Optional consolidation / index build-up: --consolidate rolls per-scene VLM output into a global character/relationship/plot index; --consolidate-asr cleans the transcript (timing preserved).
  • Quality review gate: review.py grades the draft (hallucination, hook, throughline, density…) as a logged, advisory pass; validate.py stays the deterministic hard gate.
  • Original audio stays alive: voiceover is mixed with ducking instead of replacing dialogue and ambience.
  • Script-first reruns: edit narration.json, then rerun voiceover/assembly without redoing video analysis.
  • Cut-style recaps: --edit-mode cut selects source ranges in clip_plan.json to turn long videos into shorter narrated edits.
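
To make the ducking idea concrete, here is a minimal sketch that builds an ffmpeg command using the sidechaincompress and amix filters. It is one common way to duck audio with ffmpeg, not necessarily what video-assemble does; build_duck_command and the threshold, ratio, attack, and release values are illustrative assumptions.

```python
def build_duck_command(video: str, voiceover: str, output: str) -> list[str]:
    """Build an ffmpeg argv that lowers the video's audio while the voiceover plays."""
    filter_graph = (
        # Split the voiceover: one copy drives the compressor, one is mixed in.
        "[1:a]asplit=2[sc][vo];"
        # Compress (duck) the original audio whenever the sidechain is loud.
        "[0:a][sc]sidechaincompress=threshold=0.05:ratio=8:attack=20:release=300[ducked];"
        # Mix the ducked original with the voiceover, keeping the original's length.
        "[ducked][vo]amix=inputs=2:duration=first[mix]"
    )
    return [
        "ffmpeg", "-y",
        "-i", video,
        "-i", voiceover,
        "-filter_complex", filter_graph,
        "-map", "0:v", "-map", "[mix]",
        "-c:v", "copy", "-c:a", "aac",
        output,
    ]
```

The returned list can be passed to subprocess.run. Levels would likely need tuning per source, since amix rescales its inputs by default.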

Installation

1. Install the plugin

Ask Claude Code:

Install this plugin: https://github.com/worldwonderer/video-recap-skills

2. Install ffmpeg

# macOS
brew install ffmpeg
# Debian/Ubuntu
sudo apt install ffmpeg
# Windows (choose one)
choco install ffmpeg   # or: scoop install ffmpeg   |   winget install ffmpeg

Python 3.10+ is the only other requirement — the scripts use the standard library plus ffmpeg
on PATH (no pip install needed for the pipeline itself).

3. Set your MiMo API key

One key powers ASR + VLM + TTS. Keep it in an environment variable, never in the repo.

export MIMO_API_KEY=your-mimo-key
  • Pay-as-you-go sk-* keys default to https://api.xiaomimimo.com/v1.
  • Token-Plan tp-* keys auto-route to the Token-Plan cluster (default cn):
export MIMO_TOKEN_PLAN_CLUSTER=cn   # cn | sgp | ams
# or pin the base URL explicitly: export MIMO_API_URL=https://token-plan-cn.xiaomimimo.com/v1

Zero-config otherwise — every overridable env var (models, ASR window, voice, loudness, subtitles…)
is documented in skills/video-recap/references/config-playbook.md.
Advanced: split routes per capability with MIMO_VIDEO_API_KEY / MIMO_TTS_API_KEY / MIMO_ASR_API_KEY
(and the matching *_API_URL forms); each falls back to MIMO_API_KEY / MIMO_API_URL.
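
The fallback rules above can be sketched as a small resolver. This is an illustration under stated assumptions, not the plugin's lib.py: resolve_route is a hypothetical name, and the Token-Plan host pattern for the sgp and ams clusters is extrapolated from the cn URL shown above.

```python
import os

DEFAULT_URL = "https://api.xiaomimimo.com/v1"


def resolve_route(capability: str, env=None) -> tuple[str, str]:
    """Return (api_key, base_url) for a capability: "VIDEO", "TTS", or "ASR"."""
    env = os.environ if env is None else env
    # Per-capability settings win; otherwise fall back to the shared ones.
    key = env.get(f"MIMO_{capability}_API_KEY") or env.get("MIMO_API_KEY")
    if not key:
        raise RuntimeError("MIMO_API_KEY is not set")
    url = env.get(f"MIMO_{capability}_API_URL") or env.get("MIMO_API_URL")
    if not url:
        if key.startswith("tp-"):
            cluster = env.get("MIMO_TOKEN_PLAN_CLUSTER", "cn")
            # Assumption: sgp/ams use the same host pattern as the documented cn URL.
            url = f"https://token-plan-{cluster}.xiaomimimo.com/v1"
        else:
            url = DEFAULT_URL
    return key, url
```

Passing a plain dict as env keeps the function easy to test without touching the real environment.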

Quick start

After installing, tell Claude Code:

Create a recap video for /path/to/video.mp4 using video-recap.
Context: <show / movie / character background>.

The orchestrator runs the understanding stage, then pauses with an agent_narration_brief.md.
The agent writes narration.json (per the video-script skill), then you rerun the same command
to resume — validate → (cut) → voiceover → assemble.

To drive it manually:

# 1. Analyze → pause with the brief
python3 skills/video-recap/scripts/recap.py /path/to/video.mp4 --work-dir work_dir \
  --context "show name, characters, or story background" \
  --consolidate                                 # optional: build the global understanding index

# 2. Read work_dir/agent_narration_brief.md, write work_dir/narration.json
#    (optional quality pass): python3 skills/video-script/scripts/review.py --work-dir work_dir

# 3. Rerun the SAME command to produce the recap
python3 skills/video-recap/scripts/recap.py /path/to/video.mp4 --work-dir work_dir

Cut mode (long video → short narrated edit; target duration is a planning goal):

python3 skills/video-recap/scripts/recap.py /path/to/video.mp4 --work-dir work_dir \
  --edit-mode cut --target-duration 10m

Write both work_dir/clip_plan.json and work_dir/narration.json in original source time; the
orchestrator builds edited_source.mp4, maps narration to narration_mapped.json, then resumes.

Burn subtitles into the final video (re-encodes; needs an ffmpeg with the subtitles/libass filter):

python3 skills/video-recap/scripts/recap.py /path/to/video.mp4 --work-dir work_dir --burn-subtitles

No dialogue / no key for ASR? Pass --skip-asr to run the pipeline without a transcript.

Doctor (ffmpeg filters, MiMo key, ASR/VLM/TTS config):

python3 skills/video-recap/scripts/recap.py --doctor
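
For a sense of what a filter check involves, the sketch below looks for a named filter in the output of `ffmpeg -filters`. It is a hypothetical stand-in for that kind of check, not the implementation behind --doctor; has_filter and ffmpeg_has_filter are illustrative names.

```python
import shutil
import subprocess


def has_filter(filters_listing: str, name: str) -> bool:
    """Check an `ffmpeg -filters` listing for a filter by exact name."""
    for line in filters_listing.splitlines():
        fields = line.split()
        # Filter rows look like: "<flags> <name> <io> <description...>"
        if len(fields) >= 3 and fields[1] == name:
            return True
    return False


def ffmpeg_has_filter(name: str) -> bool:
    """Return True if ffmpeg is on PATH and lists the given filter."""
    exe = shutil.which("ffmpeg")
    if exe is None:
        return False
    result = subprocess.run(
        [exe, "-hide_banner", "-filters"],
        capture_output=True, text=True, check=False,
    )
    return has_filter(result.stdout, name)
```

Calling ffmpeg_has_filter("subtitles") before using --burn-subtitles would show whether the local ffmpeg build can render subtitles.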

Output

  • recap_<video>.mp4 — final recap video · subtitles.srt (+ subtitles.ass with --burn-subtitles)
  • work_dir/agent_narration_brief.md — timing + scene brief for the agent
  • work_dir/narration.json — the recap narration script · work_dir/narration_lint.json — timing diagnostics
  • work_dir/narration_review.md — optional review findings (advisory)
  • work_dir/vlm_analysis.json, asr_result.json, silence_periods.json, timeline_fusion.json — understanding artifacts
  • work_dir/understanding_index.json / asr_clean.json — optional --consolidate outputs
  • work_dir/clip_plan.json, edited_source.mp4, narration_mapped.json — cut-mode artifacts
  • work_dir/mimo_video_overview.json — optional MiMo scene-chunk understanding (--mimo-video-overview)
  • work_dir/tts_segments/, tts_meta.json — TTS audio + placement
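
For readers who have not worked with subtitles.srt before, the following sketch shows how timed narration lines become SubRip cues. The (start, end, text) tuples are a hypothetical shape chosen for the example and do not reflect the real narration.json or tts_meta.json layout.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Render (start, end, text) segments as numbered SRT cues."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)


print(srt_timestamp(3661.5))  # 01:01:01,500
```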

Development

Each skill ships its own lib.py, so tests run one process per skill (a plain pytest tests/
would collide on the lib module name):

bash scripts/test.sh                 # all skills (or: bash scripts/test.sh script)
# Windows (no bash): run each group, e.g. python -m pytest tests/script
ruff check skills tests              # lint
python3 skills/video-recap/scripts/recap.py --doctor   # runtime check

Tests live in tests/<skill>/. CI runs the same checks (.github/workflows/skill-validate.yml).
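
On systems without bash, the one-process-per-skill idea can be approximated in Python. This is a sketch of the approach, not a port of scripts/test.sh: pytest_commands and run_all are hypothetical names, and it assumes every non-hidden directory under tests/ is a test group.

```python
import subprocess
import sys
from pathlib import Path


def pytest_commands(tests_root: Path) -> list[list[str]]:
    """Build one pytest invocation per tests/<skill>/ directory, in sorted order."""
    return [
        [sys.executable, "-m", "pytest", str(group)]
        for group in sorted(tests_root.iterdir())
        if group.is_dir() and not group.name.startswith((".", "_"))
    ]


def run_all(tests_root: Path = Path("tests")) -> int:
    """Run each group in its own process; stop at and return the first failure code."""
    for cmd in pytest_commands(tests_root):
        code = subprocess.run(cmd, check=False).returncode
        if code != 0:
            return code
    return 0
```

Each group gets a fresh interpreter, so the identically named lib modules never share one import cache.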

License

MIT — see LICENSE.
