ppt-audio-to-video
v0.1.0
Convert narration audio plus a slide deck into a narrated video. Use when the user has an audio-only `mp4/m4a/mp3/wav` and a `ppt/pptx/pdf` deck, and needs slide images, transcript extraction, slide timing planning, or final `mp4` rendering with `whisper-cpp` and `ffmpeg`.
PPT Audio To Video
Use this skill when the source video has narration audio but no usable slide visuals, and the final deliverable should be a slide-based lecture video.
Resolve bundled scripts relative to this skill directory. If the runtime has already opened this `SKILL.md`, prefer paths like `scripts/extract_slide_outline.py` and `scripts/render_from_timing_csv.py` instead of machine-specific absolute paths.
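A minimal sketch of resolving a bundled script relative to the skill directory and building the outline command from it. The `skill_dir` argument and the `outline_command` helper are hypothetical names used for illustration; the `--pptx` and `--out` flags are the ones shown in the workflow in this document.

```python
from pathlib import Path


def outline_command(skill_dir, pptx, out_csv):
    """Build the argv for the bundled outline extractor.

    skill_dir is assumed to be the directory that contains SKILL.md.
    """
    # Join relative to the skill directory instead of hard-coding an absolute path.
    script = Path(skill_dir) / "scripts" / "extract_slide_outline.py"
    return ["python3", str(script), "--pptx", str(pptx), "--out", str(out_csv)]


if __name__ == "__main__":
    print(outline_command(".", "deck.pptx", "work/slide_outline.csv"))
```

Passing the result to `subprocess.run` as a list avoids shell quoting problems with deck names that contain spaces.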
Core workflow
- Inventory inputs.
  - Confirm which of these exist: audio-only `mp4/m4a/mp3/wav`, `ppt/pptx`, `pdf`, and any pre-rendered slide images.
  - Prefer an existing `pdf` or image directory for rendering. Treat `pptx` as the source of slide text and as a fallback for export.
- Prepare tools.
  - Required for deterministic steps: `ffmpeg`, `ffprobe`, `pdftoppm`.
  - Required for transcription: `whisper-cli` from `whisper-cpp` plus a multilingual model such as `ggml-small.bin`.
  - If only `pptx` exists and no `pdf`/images exist, prefer `Keynote` or `PowerPoint` export on macOS. Use `soffice` only as a fallback because profile or rendering issues are common.
- Produce slide images.
  - If `pdf` exists, render it to images:
    ```bash
    pdftoppm -png -r 200 "$PDF" "$OUTDIR/slide"
    ```
  - If only `pptx` exists, export to `pdf` or slide images with `Keynote` or `PowerPoint`, then continue from `pdf`.
  - Keep slide filenames ordered and stable, such as `slide-01.png`, `slide-02.png`, ...
- Extract slide text.
  - Run:
    ```bash
    python3 scripts/extract_slide_outline.py --pptx "$PPTX" --out "$WORKDIR/slide_outline.csv"
    ```
  - Use the output to identify slide titles, distinctive keywords, and section changes.
- Extract clean audio for ASR.
  - For audio-only `mp4`, extract mono `wav`:
    ```bash
    ffmpeg -y -i "$AUDIO_MP4" -ar 16000 -ac 1 -c:a pcm_s16le "$WORKDIR/audio.wav"
    ```
  - If the source is already `wav/mp3/m4a`, convert it to the same mono `wav` form if needed.
- Transcribe with `whisper-cli`.
  - Example:
    ```bash
    whisper-cli -ng -m "$MODEL" -f "$WORKDIR/audio.wav" -l zh -ocsv -osrt -of "$WORKDIR/transcript"
    ```
  - Prefer `transcript.csv` for downstream parsing. `transcript.srt` is useful for manual review.
  - The `-ng` flag in the example forces CPU mode. Keep it if GPU allocation fails on macOS.
- Build `slide_timings.csv`.
  - Do not average slide durations unless the user explicitly asks for it.
  - Read the transcript and slide outline together, then create a monotonic timing plan by topic changes, section boundaries, and unique keywords.
  - Use this schema:
    ```csv
    slide,start_sec,end_sec,duration_sec,reason
    1,0.000,15.000,15.000,opening title and agenda
    2,15.000,100.000,85.000,architecture overview starts here
    ```
  - Keep slide numbers sequential and ensure `duration_sec = end_sec - start_sec`.
  - Validate that the last `end_sec` matches the audio duration or is within a small tolerance.
- Render the final video.
  - Run:
    ```bash
    python3 scripts/render_from_timing_csv.py --images "$SLIDE_IMAGES_DIR" --timings "$WORKDIR/slide_timings.csv" --audio "$WORKDIR/audio.wav" --output "$OUT_VIDEO"
    ```
  - The script generates an `ffconcat` file, validates timing continuity, and calls `ffmpeg` to encode the final `mp4`.
- Verify and iterate.
  - Check output duration with `ffprobe`.
  - If a slide cuts too early or too late, edit only the affected rows in `slide_timings.csv` and rerun the render script.
  - Keep the transcript, outline, and timing CSV as reproducible working files.
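The timing rules in the workflow (sequential slides, `duration_sec = end_sec - start_sec`, contiguous rows, last `end_sec` close to the audio duration) can be checked before rendering. The sketch below is not the bundled script's implementation. The function names, the 0.5 second default tolerance, and the 1 ms rounding slack are assumptions, and `probe_duration_sec` assumes `ffprobe` is on `PATH`.

```python
import csv
import io
import subprocess


def probe_duration_sec(path):
    """Ask ffprobe for the container duration in seconds."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


def check_timings(csv_text, audio_duration_sec, tolerance_sec=0.5):
    """Return a list of problems; an empty list means the plan looks consistent."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return ["timing CSV has no rows"]
    problems = []
    prev_slide = None
    prev_end = None
    for row in rows:
        slide = int(row["slide"])
        start = float(row["start_sec"])
        end = float(row["end_sec"])
        duration = float(row["duration_sec"])
        if prev_slide is not None and slide != prev_slide + 1:
            problems.append(f"slide {slide}: number does not follow {prev_slide}")
        if end <= start:
            problems.append(f"slide {slide}: end_sec is not after start_sec")
        # Allow 1 ms of slack for values rounded to three decimals (assumed).
        if abs(duration - (end - start)) > 1e-3:
            problems.append(f"slide {slide}: duration_sec != end_sec - start_sec")
        if prev_end is not None and abs(start - prev_end) > 1e-3:
            problems.append(f"slide {slide}: starts at {start}, previous ended at {prev_end}")
        prev_slide, prev_end = slide, end
    if abs(prev_end - audio_duration_sec) > tolerance_sec:
        problems.append(f"last end_sec {prev_end} differs from audio duration {audio_duration_sec}")
    return problems
```

A typical call would be `check_timings(open(timings_path).read(), probe_duration_sec(audio_path))`, fixing only the rows named in the returned messages.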
Heuristics for timing alignment
- Hold section-divider slides briefly, usually for 5-20 seconds.
- Use the first segment that clearly switches topic as the next slide start.
- Prefer exact topic transitions over title-word matching. ASR often distorts proper nouns and product names.
- Let the model infer timings, but keep the render step deterministic through `slide_timings.csv`.
- When confidence is low, produce a first-cut video and tell the user which slide boundaries likely need review.
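One way to turn the divider-slide heuristic into a review list is sketched below. It assumes the caller already knows which slide numbers are section dividers (for example from the slide outline); the function name and the input shape are hypothetical, and the 5-20 second window is the range given above.

```python
def divider_slides_to_review(rows, divider_slides, min_hold=5.0, max_hold=20.0):
    """Flag divider slides whose hold time falls outside the usual window.

    rows: iterable of (slide, start_sec, end_sec) tuples from the timing plan.
    divider_slides: set of slide numbers known to be section dividers (assumed input).
    """
    flagged = []
    for slide, start_sec, end_sec in rows:
        hold = end_sec - start_sec
        if slide in divider_slides and not (min_hold <= hold <= max_hold):
            flagged.append((slide, hold))
    return flagged


if __name__ == "__main__":
    plan = [(1, 0.0, 15.0), (2, 15.0, 100.0), (3, 100.0, 103.0)]
    for slide, hold in divider_slides_to_review(plan, {1, 3}):
        print(f"review slide {slide}: held for {hold:.1f}s")
```

The flagged slides are candidates to mention to the user alongside the first-cut video, not automatic corrections.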
Common commands
Install dependencies on macOS if missing:
```bash
brew install ffmpeg poppler whisper-cpp
```
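Before installing anything, it can help to check which of the required tools are already on `PATH`. This is a small sketch using the tool names listed in the workflow; the `missing_tools` helper is a hypothetical name.

```python
import shutil

# Executables named in the workflow above.
REQUIRED_TOOLS = ["ffmpeg", "ffprobe", "pdftoppm", "whisper-cli"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing tools:", ", ".join(missing) if missing else "none")
```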
Typical multilingual model download:
```bash
mkdir -p .models
curl -L 'https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.bin' -o .models/ggml-small.bin
```
Bundled scripts
- `scripts/extract_slide_outline.py`: Extract slide text from `pptx` into CSV or JSON for timing analysis.
- `scripts/render_from_timing_csv.py`: Validate a timing CSV, generate an `ffconcat`, and render the final video with `ffmpeg`.
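For orientation, an `ffconcat` file for a slideshow is a plain-text list of `file` and `duration` directives. The sketch below shows one way such a file could be generated from a timing plan. It is not the bundled script's code: the `build_ffconcat` name is hypothetical, repeating the last image is a common workaround for the concat demuxer ignoring the final `duration` and is assumed here, and paths containing single quotes are not escaped.

```python
def build_ffconcat(entries):
    """Build ffconcat text from (image_path, duration_sec) pairs in display order."""
    lines = ["ffconcat version 1.0"]
    for image_path, duration_sec in entries:
        lines.append(f"file '{image_path}'")
        lines.append(f"duration {duration_sec:.3f}")
    if entries:
        # Repeat the last image so its duration is honored (assumed workaround).
        lines.append(f"file '{entries[-1][0]}'")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_ffconcat([("slide-01.png", 15.0), ("slide-02.png", 85.0)]), end="")
```

Keeping this step a pure function of the timing rows is what makes the render repeatable: the same `slide_timings.csv` always yields the same concat list.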