Audio Transcription with Sber Salute Speech

Transcribe audio/video files to text with timestamps via the Salute Speech async REST API.

Requirements

API Key: Environment variable SALUTE_AUTH_DATA must be set (Base64-encoded client_id:client_secret or raw authorization key from https://developers.sber.ru/studio/).
SSL note: The script disables SSL verification by default (verify_ssl=False) because Sber's certificate chain is non-standard. This is expected.
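
As a rough illustration of how the key is used, the sketch below builds the access-token request from SALUTE_AUTH_DATA. The endpoint URL, the `SALUTE_SPEECH_PERS` scope, and the helper name `build_token_request` are assumptions for illustration, not taken from salute_transcribe.py; check them against the Salute Speech documentation.

```python
import os
import uuid

# Assumed OAuth endpoint and scope; verify against the official docs.
OAUTH_URL = "https://ngw.devices.sberbank.ru:9443/api/v2/oauth"
DEFAULT_SCOPE = "SALUTE_SPEECH_PERS"


def build_token_request(auth_data=None, scope=DEFAULT_SCOPE):
    """Return (url, headers, form_data) for an access-token request."""
    auth_data = auth_data or os.environ.get("SALUTE_AUTH_DATA")
    if not auth_data:
        raise RuntimeError("SALUTE_AUTH_DATA is not set")
    headers = {
        "Authorization": f"Basic {auth_data}",
        "RqUID": str(uuid.uuid4()),  # unique id for this request
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return OAUTH_URL, headers, {"scope": scope}
```

The returned triple could then be sent with something like `requests.post(url, headers=headers, data=data, verify=False)`, where `verify=False` mirrors the SSL note above.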

Supported formats & encodings

Audio encoding	Content-Type	Typical extensions
`MP3`	`audio/mpeg`	`.mp3`
`PCM_S16LE`	`audio/wav`	`.wav`
`OPUS`	`audio/ogg`	`.ogg`, `.opus`
`FLAC`	`audio/flac`	`.flac`
`ALAW`	`audio/alaw`	`.alaw`
`MULAW`	`audio/mulaw`	`.mulaw`

Supported languages

ru-RU, en-US, kk-KZ (Kazakh), ky-KG (Kyrgyz), uz-UZ (Uzbek).

Workflow

Identify input files — from user request.
Read the API key (SALUTE_AUTH_DATA) from the host environment.
Run transcription — execute salute_transcribe.py with uv and appropriate arguments.
Deliver results — present a human-readable transcript with timestamps to the user and give a direct link to the output files.

Usage

uv run --with requests {baseDir}/salute_transcribe.py \
  --file /path/to/audio.mp3 \
  --output_dir ~/.openclaw/workspace/transcriptions \
  --lang ru-RU

Arguments

Argument	Required	Default	Description
`--file`	Yes	—	Path to audio/video file
`--output_dir`	No	`~/.openclaw/workspace/transcribations`	Output directory for results
`--lang`	No	`ru-RU`	Language code: `ru-RU`, `en-US`, `kk-KZ`, `ky-KG`, `uz-UZ`
`--audio-encoding`	No	`MP3`	Codec: `MP3`, `PCM_S16LE`, `OPUS`, `FLAC`, `ALAW`, `MULAW`
`--model`	No	`general`	Recognition model: `general` or `callcenter`
`--hyp-count`	No	`1`	Number of alternative hypotheses: `1` or `2`
`--max-wait-time`	No	`300`	Max seconds to wait for async result
`--print`	No	off	Also print transcription to stdout
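
For orientation, the table above could map onto an `argparse` parser roughly as follows. This is a sketch under the assumption that the script uses `argparse`; the function name `build_parser` and the `print_result` destination are hypothetical, and the real script may differ.

```python
import argparse

LANGS = ["ru-RU", "en-US", "kk-KZ", "ky-KG", "uz-UZ"]
ENCODINGS = ["MP3", "PCM_S16LE", "OPUS", "FLAC", "ALAW", "MULAW"]


def build_parser():
    """Build a parser mirroring the documented arguments and defaults."""
    p = argparse.ArgumentParser(description="Transcribe audio with Salute Speech")
    p.add_argument("--file", required=True, help="Path to audio/video file")
    p.add_argument("--output_dir", default="~/.openclaw/workspace/transcribations")
    p.add_argument("--lang", default="ru-RU", choices=LANGS)
    p.add_argument("--audio-encoding", default="MP3", choices=ENCODINGS)
    p.add_argument("--model", default="general", choices=["general", "callcenter"])
    p.add_argument("--hyp-count", type=int, default=1, choices=[1, 2])
    p.add_argument("--max-wait-time", type=int, default=300)
    # Stored as print_result to avoid shadowing the built-in name.
    p.add_argument("--print", dest="print_result", action="store_true")
    return p
```

With this sketch, `build_parser().parse_args(["--file", "audio.mp3"])` yields the defaults listed in the table.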

Content-Type mapping

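The script currently sends Content-Type audio/mpeg (MP3) by default. When the input file is not an MP3, adjust content_type in the script or add logic that derives it from the file extension, using the table under "Supported formats & encodings". For example, .wav files need audio/wav.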
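
One possible way to add that logic is a lookup keyed on the file extension, built from the formats table earlier in this document. The helper name `guess_encoding` is hypothetical, and falling back to MP3 for unknown extensions is an assumption that simply mirrors the current default.

```python
from pathlib import Path

# Extension -> (audio encoding, Content-Type), taken from the formats table.
ENCODING_BY_EXT = {
    ".mp3": ("MP3", "audio/mpeg"),
    ".wav": ("PCM_S16LE", "audio/wav"),
    ".ogg": ("OPUS", "audio/ogg"),
    ".opus": ("OPUS", "audio/ogg"),
    ".flac": ("FLAC", "audio/flac"),
    ".alaw": ("ALAW", "audio/alaw"),
    ".mulaw": ("MULAW", "audio/mulaw"),
}


def guess_encoding(path, default=("MP3", "audio/mpeg")):
    """Return (audio_encoding, content_type) for a file path."""
    return ENCODING_BY_EXT.get(Path(path).suffix.lower(), default)
```

Video containers and other unlisted extensions fall through to the default here, so they would still need an explicit `--audio-encoding` or a conversion step.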

Output files

For input file meetingABC.mp3 the script produces:

File	Description
`meetingABC_recognition_orig.json`	Raw API response (full JSON with all hypotheses, timing, confidence)
`meetingABC_pretty.txt`	Formatted human-readable transcript with timestamps
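
The naming scheme above can be reproduced with `pathlib`, which is handy when locating results after a run. This is a minimal sketch; `output_paths` is a hypothetical helper, not a function from the script.

```python
from pathlib import Path


def output_paths(input_file, output_dir):
    """Return (raw_json_path, pretty_txt_path) for a given input file."""
    stem = Path(input_file).stem  # "meetingABC" for "meetingABC.mp3"
    out = Path(output_dir).expanduser()
    return (
        out / f"{stem}_recognition_orig.json",
        out / f"{stem}_pretty.txt",
    )
```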

Output text format

[00:01 - 00:20]:
Ну, даже если сосредоточиться на идее узкой щели.

[00:20 - 00:45]:
Следующий фрагмент текста здесь.
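
A block in this format could be produced along these lines. The sketch assumes each segment carries start and end times in seconds plus its text; the actual field names and time units in the API response may differ, and the function names are hypothetical.

```python
def fmt_time(seconds):
    """Format a number of seconds as MM:SS, truncating fractions."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_segment(start, end, text):
    """Render one transcript block in the '[MM:SS - MM:SS]:' layout."""
    return f"[{fmt_time(start)} - {fmt_time(end)}]:\n{text}\n"
```

Note that in this sketch minutes keep counting past 59 for long recordings; the source format shown above does not say how the script handles hours.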

Notes

The access token is valid for ~30 minutes; the script fetches a new one on each run.
Large files (>1 hour) may need --max-wait-time increased beyond 300s.
The callcenter model is optimized for telephony audio (8kHz, mono).
Profanity filter is disabled by default (enable_profanity_filter=False).
The script uses normalized text by default (numbers as digits, abbreviations expanded). Raw text is also available in the JSON output.
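
To illustrate what --max-wait-time bounds, here is a generic polling loop with a deadline. It is a sketch, not the script's code: `get_status` stands in for whatever call fetches the task status, and the status names `DONE`, `ERROR`, and `CANCELED` are assumptions about the async API.

```python
import time


def wait_for_result(get_status, max_wait_time=300, poll_interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until the task finishes or the deadline passes."""
    deadline = clock() + max_wait_time
    while True:
        status = get_status()
        if status == "DONE":
            return status
        if status in ("ERROR", "CANCELED"):
            raise RuntimeError(f"task ended with status {status}")
        if clock() >= deadline:
            raise TimeoutError(f"no result after {max_wait_time}s")
        sleep(poll_interval)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays; for long recordings the caller would pass a larger `max_wait_time`.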

salute-speech

Installation