Voice Messages (STT + TTS) for OpenClaw 🎙️

Complete voice message setup using faster-whisper for transcription and Edge TTS for voice replies.

What we configure

✅ STT (Speech-to-Text) — transcribe voice messages via faster-whisper
✅ TTS (Text-to-Speech) — voice replies via Edge TTS
🎯 Result: voice → text → reply with voice

Installation

1. Create virtual environment (venv)

On Ubuntu, create an isolated venv:

python3 -m venv ~/.openclaw/workspace/voice-messages

2. Install faster-whisper

Install the package into the venv:

~/.openclaw/workspace/voice-messages/bin/pip install faster-whisper

What gets installed:

- faster-whisper: Python library for transcription
- Dependencies: ctranslate2, onnxruntime, huggingface-hub, av, numpy, and others
- Size: ~250 MB
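To confirm that the packages landed in the venv and not in the system Python, a quick import check can be run with the venv's interpreter (~/.openclaw/workspace/voice-messages/bin/python). This is a minimal sketch; the helper name `is_installed` is illustrative and not part of faster-whisper.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can locate the top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Module names as they are imported, not as they are named on PyPI.
    for name in ("faster_whisper", "ctranslate2"):
        print(f"{name}: {'ok' if is_installed(name) else 'MISSING'}")
```

If either line prints MISSING, the script was probably run with a different interpreter than the one in the venv.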

Transcription Script

Path and content

File: ~/.openclaw/workspace/voice-messages/transcribe.py

#!/usr/bin/env python3
import argparse
from faster_whisper import WhisperModel


def transcribe(audio_path: str, model_name: str = "small", lang: str = "en", device: str = "cpu") -> str:
    model = WhisperModel(
        model_name,
        device=device,
        compute_type="int8" if device == "cpu" else "float16",
    )
    segments, _ = model.transcribe(audio_path, language=lang, vad_filter=True)
    text = " ".join(seg.text.strip() for seg in segments if seg.text and seg.text.strip()).strip()
    return text


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--audio", required=True)
    p.add_argument("--model", default="small")
    p.add_argument("--lang", default="en")
    p.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    args = p.parse_args()

    text = transcribe(args.audio, args.model, args.lang, args.device)
    print(text if text else "")


if __name__ == "__main__":
    main()

What the script does:

1. Accepts an audio file path (--audio)
2. Loads a Whisper model (--model): small by default
3. Sets the language (--lang): en for English
4. Transcribes with a VAD (Voice Activity Detection) filter
5. Outputs clean text to stdout

Make file executable:

chmod +x ~/.openclaw/workspace/voice-messages/transcribe.py
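Before wiring the script into OpenClaw, it can help to run it by hand with the same arguments the gateway will pass. The sketch below assumes the venv path used in this guide; `build_command` and `run_transcription` are hypothetical helper names, and the audio path in the usage note is a placeholder.

```python
import subprocess
from pathlib import Path

# Venv location used throughout this guide.
VENV = Path.home() / ".openclaw" / "workspace" / "voice-messages"


def build_command(audio: str, lang: str = "en", model: str = "small") -> list[str]:
    """Assemble the argv for transcribe.py, mirroring the flags it defines."""
    return [
        str(VENV / "bin" / "python"),
        str(VENV / "transcribe.py"),
        "--audio", audio,
        "--lang", lang,
        "--model", model,
    ]


def run_transcription(audio: str, timeout: int = 120) -> str:
    """Run transcribe.py and return the transcript it prints to stdout."""
    result = subprocess.run(
        build_command(audio),
        capture_output=True,
        text=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired if exceeded
        check=True,       # raises CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()
```

Calling `run_transcription("/path/to/voice.ogg")` should return the transcript as a string. The first call may take longer because the model is downloaded on first use.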

OpenClaw Configuration

1. Configure STT (`tools.media.audio`)

Add to ~/.openclaw/openclaw.json:

{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  }
}

Parameters:

Parameter	Value	Description
`enabled`	`true`	Enable audio transcription
`maxBytes`	`20971520`	Max file size (20 MB)
`type`	`"cli"`	Model type: CLI command
`command`	Python path	Path to python in venv
`args`	argument array	Arguments for script
`{{MediaPath}}`	placeholder	Replaced with audio file path
`timeoutSeconds`	`120`	Transcription timeout (2 minutes)
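Two of these values are easy to misread: `maxBytes` is a plain byte count, and `{{MediaPath}}` is substituted before the command runs. The sketch below illustrates both. How OpenClaw performs the substitution internally is not shown in this guide, so `render_args` is a hypothetical stand-in for illustration only.

```python
MAX_BYTES = 20 * 1024 * 1024  # 20 MB expressed in bytes, i.e. 20971520


def render_args(args: list[str], media_path: str) -> list[str]:
    """Replace the {{MediaPath}} placeholder in every argument."""
    return [arg.replace("{{MediaPath}}", media_path) for arg in args]


# Example with a placeholder path; the real path is supplied by the gateway.
args = ["transcribe.py", "--audio", "{{MediaPath}}", "--lang", "en"]
print(MAX_BYTES)
print(render_args(args, "/tmp/voice.ogg"))
```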

2. Configure TTS (`messages.tts`)

Add to ~/.openclaw/openclaw.json:

{
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    }
  }
}

Parameters:

Parameter	Value	Description
`auto`	`"inbound"`	Key mode! — reply with voice only on incoming voice messages
`provider`	`"edge"`	TTS provider (free, no API key)
`voice`	`"en-US-JennyNeural"`	Voice (see available below)
`lang`	`"en-US"`	Locale (en-US for US English)
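The effect of `auto: "inbound"` can be summarized as a small decision function. This is only a model of the behavior described in this guide, not OpenClaw's actual code: the function name is hypothetical, the 10-character threshold comes from the troubleshooting notes further down, and other `auto` values are deliberately left out.

```python
MIN_TTS_CHARS = 10  # threshold taken from the troubleshooting notes in this guide


def should_reply_with_voice(auto: str, inbound_is_voice: bool, reply_text: str) -> bool:
    """Model the 'inbound' mode: voice replies only to incoming voice messages."""
    if auto != "inbound":
        raise ValueError(f"mode {auto!r} is not modeled in this sketch")
    if len(reply_text.strip()) < MIN_TTS_CHARS:
        return False  # very short replies are sent as text
    return inbound_is_voice
```

For example, a text message answered with a long reply still gets a text reply, because the incoming message was not voice.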

3. Full configuration example

{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  },
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    },
    "ackReactionScope": "group-mentions"
  }
}
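If openclaw.json already contains other settings, the STT and TTS fragments need to be merged into it, not pasted over it. A recursive merge is one way to do that. The sketch below is an illustration with a hypothetical `deep_merge` helper and assumes the file is strict JSON.

```python
import json


def deep_merge(base: dict, extra: dict) -> dict:
    """Recursively merge `extra` into a copy of `base`; values in `extra` win."""
    merged = dict(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# An existing setting that must survive, plus the TTS fragment from this guide.
existing = {"messages": {"ackReactionScope": "group-mentions"}}
tts = {"messages": {"tts": {"auto": "inbound", "provider": "edge"}}}

config = deep_merge(existing, tts)
print(json.dumps(config, indent=2))
```

Both `ackReactionScope` and the new `tts` block end up under `messages`, and the input dictionaries are left unchanged.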

Apply Changes

Restart Gateway

# Method 1: via openclaw CLI
openclaw gateway restart

# Method 2: via systemd
systemctl --user restart openclaw-gateway

# Check status
systemctl --user status openclaw-gateway
# Should show: active (running)

Testing

Test STT (transcription)

Action: Send a voice message to your Telegram bot

Expected result:

[Audio] User text: [Telegram ...] <media:audio> Transcript: <transcribed text>

Example response:

[Audio] User text: [Telegram kd (@someuser) id:12345678 +5s ...] <media:audio> Transcript: Hello. How are you?

Test TTS (voice replies)

Action: After a successful transcription, the bot should send a voice reply

Expected result:

- Voice file arrives in Telegram
- Voice note (round bubble)

Expected behavior:

- Incoming voice → bot replies with voice
- Text messages → bot replies with text (this is normal!)

Available Edge TTS Voices

Female voices

Voice	ID	Notes
Jenny	`en-US-JennyNeural`	← current
Ana	`en-US-AnaNeural`	Softer

Male voices

Voice	ID	Notes
Roger	`en-US-RogerNeural`	More bass

How to change voice:

# Write to a temp file first; replace the config only if jq succeeded
jq '.messages.tts.edge.voice = "en-US-MichelleNeural"' ~/.openclaw/openclaw.json \
  > ~/.openclaw/openclaw.json.tmp \
  && mv ~/.openclaw/openclaw.json.tmp ~/.openclaw/openclaw.json
systemctl --user restart openclaw-gateway
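The same change can be made from Python when jq is not installed. This is a minimal sketch under two assumptions: the config is strict JSON (no comments), and `set_voice` is a hypothetical helper name. It writes to a temporary file in the same directory and then swaps it in, so a failed write cannot leave a truncated config behind.

```python
import json
import os
import tempfile
from pathlib import Path


def set_voice(config_path: Path, voice: str) -> None:
    """Set messages.tts.edge.voice and replace the file atomically."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Create the nested objects if they are missing.
    edge = config.setdefault("messages", {}).setdefault("tts", {}).setdefault("edge", {})
    edge["voice"] = voice
    # Write to a temp file next to the config, then swap it in.
    fd, tmp_name = tempfile.mkstemp(dir=config_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(config, tmp, indent=2)
        tmp.write("\n")
    os.replace(tmp_name, config_path)
```

A call such as `set_voice(Path.home() / ".openclaw" / "openclaw.json", "en-US-MichelleNeural")` would apply the change. The gateway still needs a restart afterwards.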

Additional Edge TTS Parameters

Adjusting speed, pitch, volume

{
  "messages": {
    "tts": {
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US",
        "rate": "+10%",      // Speed: -50% to +100%
        "pitch": "-5%",     // Pitch: -50% to +50%
        "volume": "+5%"     // Volume: -100% to +100%
      }
    }
  }
}
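A typo in one of these strings is easy to make, so a small check before editing the config can save a restart cycle. The sketch below validates signed percent strings against the ranges listed in the comments above. `check_percent` is a hypothetical helper, and the ranges are taken from this guide, not from the Edge TTS service itself.

```python
import re

# (min, max) in percent, as listed in this guide.
RANGES = {"rate": (-50, 100), "pitch": (-50, 50), "volume": (-100, 100)}
_PERCENT = re.compile(r"^([+-]\d+)%$")


def check_percent(name: str, value: str) -> int:
    """Parse a signed percent string such as '+10%' and check it against RANGES."""
    match = _PERCENT.match(value)
    if match is None:
        raise ValueError(f"{name}: expected a signed percent like '+10%', got {value!r}")
    number = int(match.group(1))
    low, high = RANGES[name]
    if not low <= number <= high:
        raise ValueError(f"{name}: {number}% is outside {low}%..{high}%")
    return number


for key, val in {"rate": "+10%", "pitch": "-5%", "volume": "+5%"}.items():
    print(key, check_percent(key, val))
```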

Troubleshooting

Problem: Voice not transcribed

Logs show:

[ERROR] Transcription failed

Possible causes:

1. File too large (> 20 MB). Solution: increase maxBytes in the config, for example "maxBytes": 52428800 (50 MB).
2. Timeout: transcription took longer than 2 minutes. Solution: increase timeoutSeconds, for example "timeoutSeconds": 180 (3 minutes).
3. Model not downloaded yet (first run). Solution: wait while it downloads (1-2 minutes). Models are cached in ~/.cache/huggingface/.
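The first cause can be ruled out quickly by comparing the audio file's size with the configured limit. A minimal sketch, assuming the 20 MB `maxBytes` value from this guide; `size_problem` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Optional

MAX_BYTES = 20971520  # maxBytes from the config in this guide (20 MB)


def size_problem(audio_path: str, max_bytes: int = MAX_BYTES) -> Optional[str]:
    """Return a description of the size problem, or None if the file fits."""
    path = Path(audio_path)
    if not path.is_file():
        return f"{audio_path}: file not found"
    size = path.stat().st_size
    if size > max_bytes:
        return f"{audio_path}: {size} bytes exceeds the {max_bytes}-byte limit"
    return None
```

A return value of None means the size is not the issue, which leaves the timeout and the first-run model download as the remaining suspects.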

Problem: No voice reply

Possible causes:

1. Reply too short (< 10 characters). TTS skips very short replies; this is expected behavior.
2. auto is set to "inbound" but the incoming message was text. In inbound mode, TTS replies with voice only to voice messages. Text messages get text replies, which is correct.
3. Edge TTS unavailable. Check the endpoint:

   curl -s "https://speech.platform.bing.com/consumer/api/v1/tts" | head -c 100

   If this returns an error, the service is temporarily unavailable.

Performance

Transcription time (Raspberry Pi 4/ARM)

Whisper Model	Est. time	Quality
`tiny`	~5-10 sec	Low
`base`	~10-20 sec	Medium
`small`	~20-40 sec	High ← current
`medium`	~40-80 sec	Very high
`large`	~80-160 sec	Maximum

Recommendation: On a Raspberry Pi, use small or base; medium and large will be very slow.
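These estimates interact with `timeoutSeconds`: a model whose worst-case time exceeds the timeout may be cut off mid-transcription. The sketch below copies the estimates from the table above and lists the models that fit a given timeout. The numbers are the guide's rough estimates, not measurements, and `models_within_timeout` is a hypothetical helper.

```python
# (low, high) estimates in seconds, copied from the table above.
ESTIMATES = {
    "tiny": (5, 10),
    "base": (10, 20),
    "small": (20, 40),
    "medium": (40, 80),
    "large": (80, 160),
}


def models_within_timeout(timeout_seconds: int) -> list[str]:
    """Return the models whose worst-case estimate fits inside the timeout."""
    return [name for name, (_, high) in ESTIMATES.items() if high <= timeout_seconds]


print(models_within_timeout(120))  # with the default timeoutSeconds of 120
```

With the default 120-second timeout, `large` falls outside the limit by these estimates, so switching to it would also call for a higher `timeoutSeconds`.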

Where Whisper models are stored

~/.cache/huggingface/

Models download automatically on first run.
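To see how much disk space the downloaded models take, the cache directory can be summed up from Python. A minimal sketch with a hypothetical `dir_size_bytes` helper; it skips symlinks on the assumption that the Hugging Face cache links snapshot files to shared blobs, which would otherwise be counted twice.

```python
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (0 if root does not exist)."""
    if not root.is_dir():
        return 0
    return sum(
        p.stat().st_size
        for p in root.rglob("*")
        if p.is_file() and not p.is_symlink()  # skip links to avoid double counting
    )


if __name__ == "__main__":
    cache = Path.home() / ".cache" / "huggingface"
    print(f"{cache}: {dir_size_bytes(cache) / 1e6:.1f} MB")
```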

Done! 🎉

After completing these steps:

✅ faster-whisper installed in venv
✅ transcribe.py script created
✅ OpenClaw configured (STT + TTS)
✅ Gateway restarted
✅ Voice messages working

Now your Telegram bot:

- 🎙️ Accepts voice → transcribes via faster-whisper
- 🎤 Replies with voice → generates via Edge TTS
- 💬 Accepts text → replies with text (as usual)

Useful links:

- OpenClaw docs: https://docs.openclaw.ai
- TTS docs: https://docs.openclaw.ai/tts
- Audio docs: https://docs.openclaw.ai/nodes/audio
- Install skills: npx clawhub search voice

Created: 2026-03-01 for OpenClaw 2026.2.26
