SkillHub

salute-speech

v1.0.1

Transcribe audio files using Sber Salute Speech async API. Russian-first STT with support for ru-RU, en-US, kk-KZ, ky-KG, uz-UZ.

Sourced from ClawHub, Authored by chorus12

Installation

Please help me install the skill `salute-speech` from SkillHub official store. npx skills add chorus12/salute-speech

Audio Transcription with Sber Salute Speech

Transcribe audio/video files to text with timestamps via Salute Speech async REST API.

Requirements

  • API Key: Environment variable SALUTE_AUTH_DATA must be set (Base64-encoded client_id:client_secret or raw authorization key from https://developers.sber.ru/studio/).
  • SSL note: The script disables SSL verification by default (verify_ssl=False) because Sber's certificate chain is non-standard. This is expected.

Supported formats & encodings

Audio encoding Content-Type Typical extensions
MP3 audio/mpeg .mp3
PCM_S16LE audio/wav .wav
OPUS audio/ogg .ogg, .opus
FLAC audio/flac .flac
ALAW audio/alaw .alaw
MULAW audio/mulaw .mulaw

Supported languages

ru-RU, en-US, kk-KZ (Kazakh), ky-KG (Kyrgyz), uz-UZ (Uzbek).

Workflow

  1. Identify input files — from user request.
  2. Read API key from host environment.
  3. Run transcription — execute salute_transcribe.py with uv and appropriate arguments.
  4. Deliver results — present to user human-readable transcript with timestamps to the user and give a direct link to files.

Usage

uv run --with requests {baseDir}/salute_transcribe.py 
  --file /path/to/audio.mp3 
  --output_dir ~/.openclaw/workspace/transcriptions 
  --lang ru-RU

Arguments

Argument Required Default Description
--file Yes Path to audio/video file
--output_dir No ~/.openclaw/workspace/transcribations Output directory for results
--lang No ru-RU Language code: ru-RU, en-US, kk-KZ, ky-KG, uz-UZ
--audio-encoding No MP3 Codec: MP3, PCM_S16LE, OPUS, FLAC, ALAW, MULAW
--model No general Recognition model: general or callcenter
--hyp-count No 1 Number of alternative hypotheses: 1 or 2
--max-wait-time No 300 Max seconds to wait for async result
--print No off Also print transcription to stdout

Content-Type mapping

When the file extension doesn't match audio/mpeg, adjust content_type in the script or add logic. Current default is audio/mpeg (MP3). For .wav files use audio/wav, etc.

Output files

For input file meetingABC.mp3 the script produces:

File Description
meetingABC_recognition_orig.json Raw API response (full JSON with all hypotheses, timing, confidence)
meetingABC_pretty.txt Formatted human-readable transcript with timestamps

Output text format

[00:01 - 00:20]:
Ну, даже если сосредоточиться на идее узкой щели.

[00:20 - 00:45]:
Следующий фрагмент текста здесь.

Notes

  • Token is valid for ~30 minutes; the script fetches a new one each run.
  • Large files (>1 hour) may need --max-wait-time increased beyond 300s.
  • The callcenter model is optimized for telephony audio (8kHz, mono).
  • Profanity filter is disabled by default (enable_profanity_filter=False).
  • The script uses normalized text by default (numbers as digits, abbreviations expanded). Raw text is also available in the JSON output.