Smart Audio Analyzer

The only audio skill with persistent voice profiles. Beyond transcription — it knows WHO is speaking, detects the scene, and generates structured notes.

唯一带声纹档案的录音分析 skill。转写只是第一步——它还能认出谁在说话，自动判断场景，按模板出纪要。

What Makes This Different

Feature	This Skill	Others
Transcription	✅ AssemblyAI (default) + Whisper + Gemini	✅ Usually one engine
Speaker ID by voiceprint	✅ Persistent profiles across sessions	❌ None
Scene auto-detection	✅ 5 built-in scenes + extensible	❌ One-size-fits-all
Structured output	✅ Scene-specific templates	⚠️ Generic summary
Multi-language	✅ Chinese + English	Varies

Quick Start

# 1. Install
cd skills/audio-analyzer/scripts && npm install

# 2. Configure (pick ONE — AssemblyAI recommended)
cp .env.example .env
# Edit .env: set ASSEMBLYAI_API_KEY

# 3. Run
node analyze.js /path/to/recording.m4a

Zero-config alternative: If no API key is set, it will attempt local Whisper or Gemini fallback.

安装

# 1. 放到 workspace/skills/ 下
cp -r audio-analyzer /path/to/.openclaw/workspace/skills/

# 2. 安装依赖
cd skills/audio-analyzer/scripts && npm install

# 3. 配置 ASR 引擎（选一个即可，推荐 AssemblyAI）
cp .env.example .env
# 编辑 .env，填入 ASSEMBLYAI_API_KEY

# 4. 多 agent 环境：每个 agent 的 workspace 都需要一份

Bootstrap 片段

将以下内容添加到你的 agent bootstrap.md：

## 音频文件处理
当收到音频文件（.m4a/.mp3/.wav/.ogg/.flac）时，**必须**按以下流程处理：
1. 运行 `cd <workspace>/skills/audio-analyzer/scripts && node analyze.js <音频文件绝对路径>` 进行转写+说话人分离
2. 读取转写结果，根据内容自动判断场景（或按用户指定）
3. 读取 skills/audio-analyzer/references/scenes/<场景>.md 加载模板
4. 读取 skills/audio-analyzer/references/voice-profiles.md 对照音色档案
5. 按模板生成结构化纪要
6. 与用户确认说话人身份，更新音色档案

**不要**尝试用 summarize、pdf、image 等工具处理音频文件。

Core Pipeline

Audio File → Transcribe + Speaker Separation → Voice Profile Matching
→ Scene Detection → Load Template → Generate Notes → Update Profiles

Step 1: Transcribe

cd scripts && node analyze.js <文件路径>

ASR Engine Priority: 1. AssemblyAI (default, best quality) — needs ASSEMBLYAI_API_KEY 2. Gemini — needs GEMINI_API_KEY or OpenRouter key 3. Whisper (local) — needs whisper installed locally

Output: - <filename>_transcript.txt — timestamped dialogue with speaker labels - <filename>_raw.json — raw JSON with speaker metadata

Step 2: Speaker Identification

Cross-references references/voice-profiles.md:

Read all known voice profiles (speech patterns, content patterns)
Analyze each speaker against profiles
Match rules:
High confidence → auto-label with name
Partial match → label as "possibly XXX" with evidence
No match → label as "Unknown Speaker"
Ask user to confirm
Update profiles after confirmation

Step 3: Scene Detection

Auto-detects based on transcript content:

Scene	Typical Keywords	Template
🚣 Rowing Training	stroke rate, pace, catch, drive	`scenes/rowing.md`
💼 Work Meeting	project, deadline, requirements, bug	`scenes/meeting.md`
🎤 Interview	user pain points, use case, feedback	`scenes/interview.md`
🎓 Talk/Lecture	welcome, today's topic, Q&A	`scenes/talk.md`
📝 General	(fallback)	`scenes/general.md`

Override manually: node analyze.js file.m4a meeting

Step 4-5: Generate Structured Notes

Loads scene-specific template → generates structured output with key points, action items, and insights.

Step 6: Update Voice Profiles

After user confirms speaker identities, updates references/voice-profiles.md: - New person → add entry (role, speech patterns, content patterns) - Known person → refine description - Shared across all scenes and future recordings

Extending Scenes

Add a new .md file in references/scenes/:

references/scenes/
├── rowing.md      # 🚣 Rowing Training
├── meeting.md     # 💼 Work Meeting
├── interview.md   # 🎤 Interview
├── talk.md        # 🎓 Talk/Lecture
└── general.md     # 📝 General (fallback)

Requirements

Node.js 18+
At least ONE of: AssemblyAI key, Gemini key, or local Whisper
cd scripts && npm install

Error Handling

Situation	Response
API quota exceeded	"Transcription service unavailable, check API quota"
File > 100MB	Warn user: estimated 5-10 min processing
Empty transcript	"No speech detected in audio"
Network error	"Connection error, please retry"
No ASR engine available	List setup instructions for each engine

Advanced: Voiceprint Extraction (Optional)

The skill includes an optional voiceprint.py tool for embedding-based speaker identification using ONNX neural models. This is separate from the text-based voice profile matching in the core pipeline.

What it does

Extracts speaker audio segments using ffmpeg
Computes 256-dim speaker embeddings via WeSpeaker ONNX model
Stores embeddings locally in references/voice-db.json
Matches new speakers against stored embeddings (cosine similarity)

Setup (optional — core skill works without this)

# 1. Install Python dependencies
pip install numpy librosa onnxruntime

# 2. Install ffmpeg
apt install ffmpeg  # or: brew install ffmpeg

# 3. Download WeSpeaker model
mkdir -p ~/.openclaw/models/wespeaker
# Download cnceleb_resnet34_LM.onnx from:
# https://github.com/wenet-e2e/wespeaker/releases
# Set: export WESPEAKER_MODEL=~/.openclaw/models/wespeaker/cnceleb_resnet34_LM.onnx

Usage

# Extract voiceprints from a transcribed recording
python3 voiceprint.py extract recording.m4a recording_raw.json

# Enroll a known speaker
python3 voiceprint.py enroll "JoJo" jojo_sample.m4a

# Identify speaker in new audio
python3 voiceprint.py identify unknown.m4a

Privacy Notice

All voice embeddings are stored locally in references/voice-db.json
Voice embeddings are never sent externally
Audio files ARE uploaded to cloud ASR (AssemblyAI/Gemini) for transcription. For fully offline operation, use local Whisper
Speaker identity updates require explicit user confirmation
To delete all voiceprint data: rm references/voice-db.json

Voice Profiles (Text-Based)

See references/voice-profiles.md. Shared across all scenes — same person is recognized regardless of context. This is the lightweight alternative that works without the ONNX model.