IMA TTS (Text-to-Speech)

Overview

Call the IMA Open API to create text-to-speech audio. The flow matches the other IMA creation skills: query products → create task → poll until done. The task type is text_to_speech. This skill targets seed-tts-2.0 only; seed-tts-1.1 is not supported. The script defaults to seed-tts-2.0 when no model is specified.

⚙️ How This Skill Works

This skill uses a bundled Python script scripts/ima_tts_create.py to call the IMA Open API:

Sends text (prompt) to https://api.imastudio.com
Uses --user-id only locally for preference storage
Returns an audio URL when synthesis is complete
Reflection mechanism: on create failure, retries up to 3 times with parameter adjustments

What gets sent to IMA: prompt (text to speak), model selection, and parameters (e.g. voice_id, speed). What is not sent: the API key never appears in the request body (it travels in the Authorization header), and user_id stays local.

Agent Execution

Use the bundled script:

# List available TTS models (optional; default is seed-tts-2.0)
python3 {baseDir}/scripts/ima_tts_create.py --api-key $IMA_API_KEY --list-models

# Generate speech (default model: seed-tts-2.0; omit --model-id to use default)
python3 {baseDir}/scripts/ima_tts_create.py \
  --api-key $IMA_API_KEY \
  --model-id seed-tts-2.0 \
  --prompt "Text to be spoken here." \
  --user-id {user_id} \
  --output-json

The script outputs JSON; parse it for the url field and pass that URL to the user via the UX protocol below.

Environment

Base URL: https://api.imastudio.com

Header	Required	Value
`Authorization`	✅	`Bearer ima_your_api_key_here`
`x-app-source`	✅	`ima_skills`
`x_app_language`	recommended	`en` / `zh`
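
As a minimal sketch, the headers in the table could be assembled as below. The helper name build_headers is hypothetical and not part of the bundled script. It assumes the key is supplied through the IMA_API_KEY environment variable, and it adds the Content-Type header used in the Python example later in this document.

```python
import os


def build_headers(api_key=None, language="en"):
    """Build IMA Open API request headers (hypothetical helper)."""
    key = api_key or os.environ.get("IMA_API_KEY")
    if not key:
        raise ValueError("Set IMA_API_KEY or pass api_key explicitly")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
        "x-app-source": "ima_skills",
        "x_app_language": language,  # recommended header: "en" or "zh"
    }


# Example with a dummy key; a real key should come from the environment.
headers = build_headers("ima_your_api_key_here", language="zh")
print(sorted(headers))
```

Reading the key from the environment keeps it out of source files and prompt text.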

⚠️ MANDATORY: Always Query Product List First

You MUST call /open/v1/product/list with category=text_to_speech before creating any task. attribute_id is required; if it is 0 or missing, the API returns "Invalid product attribute" and the task fails.

GET /open/v1/product/list?app=ima&platform=web&category=text_to_speech

Then traverse the V2 tree: type=2 = model groups, type=3 = versions (leaves). Only type=3 nodes have credit_rules and form_config. Use a leaf’s model_id, id (= model_version), and credit_rules[0].attribute_id / points for create.
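
The traversal can be sketched as follows. This is an illustration under assumptions, not the bundled script's code: it assumes child nodes are nested under a "children" key and that the display name sits under "name", neither of which is stated above. The function names and the sample tree (including the version id "v-demo") are hypothetical; the attribute_id and points values reuse the example figures quoted elsewhere in this document.

```python
def iter_leaves(nodes):
    """Yield type=3 (version) nodes from a product-list tree, depth first."""
    for node in nodes or []:
        if str(node.get("type")) == "3":
            yield node
        # Assumption: child nodes are nested under a "children" key.
        yield from iter_leaves(node.get("children"))


def pick_model(tree, model_id):
    """Return the create-task fields for the first leaf matching model_id."""
    for leaf in iter_leaves(tree):
        rules = leaf.get("credit_rules") or []
        if leaf.get("model_id") == model_id and rules:
            return {
                "model_id": leaf["model_id"],
                "model_name": leaf.get("name"),  # assumption: name key
                "model_version": leaf["id"],
                "attribute_id": rules[0]["attribute_id"],
                "credit": rules[0]["points"],
            }
    raise LookupError(f"No type=3 node with credit_rules for {model_id!r}")


# Hypothetical sample tree for illustration only.
sample_tree = [{
    "type": "2",
    "name": "Seed TTS",
    "children": [{
        "type": "3",
        "model_id": "seed-tts-2.0",
        "id": "v-demo",
        "name": "Seed TTS 2.0",
        "credit_rules": [{"attribute_id": 4419, "points": 2}],
    }],
}]
print(pick_model(sample_tree, "seed-tts-2.0"))
```

Comparing str(node.get("type")) tolerates the type being returned as either a number or a string.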

Core Flow

1. GET /open/v1/product/list?app=ima&platform=web&category=text_to_speech
   → Get attribute_id, credit, model_version, form_config

2. POST /open/v1/tasks/create
   → task_type: "text_to_speech", parameters[].parameters.prompt = text to speak

3. POST /open/v1/tasks/detail  { "task_id": "..." }
   → Poll every 2–5s until medias[].resource_status == 1 and status != "failed"
   → Read medias[].url (and optional duration_str, format)

Task Detail API — Actual Response Shape

Poll POST /open/v1/tasks/detail until completion. Response uses the same structure as other IMA audio tasks:

Field	Type	Meaning
`resource_status`	int or null	0 = processing, 1 = available, 2 = failed, 3 = deleted; treat null as 0
`status`	string	"pending" / "processing" / "success" / "failed"
`url`	string	Audio URL when resource_status=1 (mp3/wav)
`duration_str`	string	Optional, e.g. "30s"
`format`	string	Optional, e.g. "mp3", "wav"

Completed success example:

{
  "id": "task_xxx",
  "medias": [{
    "resource_status": 1,
    "status": "success",
    "url": "https://cdn.../output.mp3",
    "duration_str": "12s",
    "format": "mp3"
  }]
}

Rules:

Treat resource_status: null as 0 (processing).
Success only when all medias have resource_status == 1 and status != "failed".
On resource_status == 2 or status == "failed", stop and handle error (e.g. use error_msg / remark).
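
The three rules above can be sketched as a single classification function. The name medias_state is hypothetical. Note that resource_status 3 (deleted) is not covered by the rules, so in this sketch it falls through to "processing"; a caller would need its own timeout.

```python
def medias_state(medias):
    """Return "success", "failed", or "processing" for a task's medias list."""
    if not medias:
        return "processing"  # nothing reported yet
    all_ready = True
    for media in medias:
        rs = media.get("resource_status")
        rs = 0 if rs is None else rs  # null counts as 0 (processing)
        status = (media.get("status") or "").lower()
        if rs == 2 or status == "failed":
            return "failed"
        if rs != 1:
            all_ready = False
    return "success" if all_ready else "processing"


done = [{"resource_status": 1, "status": "success", "url": "https://cdn.../output.mp3"}]
print(medias_state(done))
```

Checking status even when resource_status is 1 guards against a media entry that is marked available but failed.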

API 2: Create Task

POST /open/v1/tasks/create

text_to_speech — no image input. src_img_url: [], input_images: [].

{
  "task_type": "text_to_speech",
  "enable_multi_model": false,
  "src_img_url": [],
  "parameters": [{
    "attribute_id":  "<from credit_rules>",
    "model_id":      "<model_id>",
    "model_name":    "<model_name>",
    "model_version": "<version_id>",
    "app":           "ima",
    "platform":      "web",
    "category":      "text_to_speech",
    "credit":        "<points>",
    "parameters": {
      "prompt":       "Text to be spoken.",
      "n":            1,
      "input_images": [],
      "cast":         {"points": "<points>", "attribute_id": "<attribute_id>"}
    }
  }]
}

prompt must be inside parameters[].parameters, not at top level. Extra fields (e.g. voice_id, speed) come from product form_config; include only those present in the product’s credit_rules/form_config.

Response: data.id = task_id for polling.
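
A minimal sketch of assembling this body in Python follows. The helper build_create_body and the shape of its model argument are hypothetical; the sample values are placeholders, since real ones come from the product list at runtime.

```python
import json


def build_create_body(model, prompt, extra=None):
    """Assemble a text_to_speech create body from product-list values."""
    inner = dict(extra or {})  # optional fields such as speaker or audio_params
    inner.update({
        "prompt": prompt,  # must live here, not at the top level
        "n": 1,
        "input_images": [],
        "cast": {"points": model["credit"], "attribute_id": model["attribute_id"]},
    })
    return {
        "task_type": "text_to_speech",
        "enable_multi_model": False,
        "src_img_url": [],
        "parameters": [{
            "attribute_id": model["attribute_id"],
            "model_id": model["model_id"],
            "model_name": model["model_name"],
            "model_version": model["model_version"],
            "app": "ima",
            "platform": "web",
            "category": "text_to_speech",
            "credit": model["credit"],
            "parameters": inner,
        }],
    }


# Placeholder values for illustration only.
model = {"attribute_id": 4419, "credit": 2, "model_id": "seed-tts-2.0",
         "model_name": "Seed TTS 2.0", "model_version": "v-demo"}
body = build_create_body(model, "Text to be spoken.",
                         {"speaker": "zh_female_vv_uranus_bigtts"})
print(json.dumps(body, indent=2))
```

The core fields are written after the extras so that an extra key can never overwrite prompt or cast.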

Supported Task Type & Models

category	Capability	Input
`text_to_speech`	Text → Speech	prompt (text to speak)

Models: This skill supports seed-tts-2.0 only (seed-tts-1.1 is not supported). The script defaults to --model-id seed-tts-2.0 when none is provided. For current attribute_id and credit, the script reads from the product list at runtime.

seed-tts-2.0 — Verified request parameters

The following parameters[].parameters shape has been verified to work for seed-tts-2.0 (attribute_id/credit come from product list and may differ by app/platform):

Parameter	Type	Required	Description
`prompt`	string	✅	Text to speak (the text to synthesize).
`n`	int	✅	Usually 1.
`model`	string	✅	Sub-model: `seed-tts-2.0-expressive` (default) or `seed-tts-2.0-standard`.
`speaker`	string	optional	Speaker ID, e.g. `zh_male_sophie_uranus_bigtts` (a native voice_type from timbre list 1257544). Note: use the native format (`*_uranus_bigtts`); the `BV*_streaming` format is not supported.
`audio_params`	object	optional	`emotion`, `speech_rate` (speed, range [-50, 100]), `loudness_rate` (volume, range [-50, 100]), and others; see the request Body in document 1598757.
`additions`	object	optional	e.g. `{"explicit_language": "crosslingual", "context_texts": []}`.
`cast`	object	✅	`{"points": <credit>, "attribute_id": <attribute_id>}` from product list.

Script example with extra params:

python3 ima_tts_create.py --api-key $IMA_API_KEY --model-id seed-tts-2.0 \
  --prompt "阳光青年音色测试，你好世界。" \
  --extra-params '{"model":"seed-tts-2.0-expressive","speaker":"zh_male_sophie_uranus_bigtts","audio_params":{"emotion":"neutral"},"additions":{"explicit_language":"crosslingual","context_texts":[]}}' \
  --output-json

Note: The script gets attribute_id and credit from the product list (e.g. app=ima&platform=web → often 2 pts / attribute_id 4419 for seed-tts-2.0). If you have a different app/platform (e.g. webAgent), the product list may return different credit_rules (e.g. 5 pts / attribute_id 8987); the script uses whatever the product list returns for the chosen model.

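Speaker / timbre list (seed-tts-2.0 is compatible with Volcengine timbres): For the full set of timbre IDs and their scenario categories, see volcengine_tts_timbre_list.json in the project. The file comes from the Volcengine Doubao speech synthesis timbre list and uses the native voice_type format (e.g. zh_male_sophie_uranus_bigtts 魅力苏菲 (Charming Sophie), zh_female_vv_uranus_bigtts Vivi). ⚠️ Note: The IMA API supports only the native format (the *_uranus_bigtts series); BV*_streaming Doubao timbre IDs are not supported.

Mapping to the Volcengine 2.0 documentation: The parameters above match the request Body of the HTTP Chunked/SSE unidirectional streaming V3 API: req_params.text → prompt, req_params.speaker → speaker (required), req_params.model → model (expressive/standard), req_params.audio_params (emotion, speech_rate, loudness_rate, etc.), req_params.additions (e.g. explicit_language). For the 2.0 capabilities (voice instructions, quoting prior context, voice tags, etc.), see the Doubao Speech Synthesis 2.0 capability overview.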

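🎤 What to ask when the user says "make me a narration / voice-over" (帮我制作旁白/配音)

When the user expresses an intent such as "帮我制作旁白" (make me a narration), "做一段配音" (do a voice-over), or "把这段文字读出来" (read this text aloud), collect the key information before calling the script, so that the call neither lacks parameters nor falls back blindly on defaults.

Must ask

Question	Parameter	Notes
Content to read / narration script	`prompt`	Text to synthesize; required. If the user gives only a topic, ask for the exact script, or draft one yourself and have the user confirm it.

Recommended (let the user choose)

Question	Parameter	Option source and examples
Timbre / speaker	`speaker`	Recommend by scenario from `volcengine_tts_timbre_list.json` in the project (or timbre list 1257544): general use (魅力苏菲 `zh_male_sophie_uranus_bigtts`, Vivi `zh_female_vv_uranus_bigtts`, 云舟 `zh_male_m191_uranus_bigtts`), video dubbing (大壹 `zh_male_dayi_uranus_bigtts`, 猴哥 `zh_male_sunwukong_uranus_bigtts`), role play (知性灿灿 `zh_female_cancan_uranus_bigtts`, 撒娇学妹 `zh_female_sajiaoxuemei_uranus_bigtts`). Either list 3–5 candidates briefly for the user to pick from, or narrow the range first by asking "Male or female voice? Closer to commentary, audiobook, or assistant?" ⚠️ Use the native format (`*_uranus_bigtts`).

Optional (ask as needed)

Question	Parameter	Notes and values
Emotion	`audio_params.emotion`	Supported by some timbres, e.g. neutral, sad, angry; see the multi-emotion timbres section of the timbre list.
Speed	`audio_params.speech_rate`	Range [-50, 100]; 0 is normal and 100 is roughly 2× speed. Pass it with `--extra-params '{"audio_params":{"speech_rate":20}}'`.
Volume	`audio_params.loudness_rate`	Range [-50, 100]; 0 is normal (not supported by mix timbres).
Model style	`model`	`seed-tts-2.0-expressive` (default, more expressive) or `seed-tts-2.0-standard` (more stable).

Script mapping: --prompt is required; --speaker and --emotion are supported directly; speed, volume, model, and similar settings are passed as JSON through --extra-params (see the Script example above).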
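
As a rough sketch, the optional answers could be collected into the JSON string for --extra-params like this. The helper build_extra_params is hypothetical; the range check and the two sub-model names come from the tables above, and whether the API itself rejects out-of-range values is not confirmed here.

```python
import json

SUB_MODELS = ("seed-tts-2.0-expressive", "seed-tts-2.0-standard")


def build_extra_params(model=None, speech_rate=None, loudness_rate=None):
    """Collect optional settings into the JSON string for --extra-params."""
    extra, audio = {}, {}
    if model is not None:
        if model not in SUB_MODELS:
            raise ValueError(f"model must be one of {SUB_MODELS}")
        extra["model"] = model
    for name, value in (("speech_rate", speech_rate),
                        ("loudness_rate", loudness_rate)):
        if value is None:
            continue  # leave unset so the default applies
        if not -50 <= value <= 100:
            raise ValueError(f"{name} must be within [-50, 100]")
        audio[name] = value
    if audio:
        extra["audio_params"] = audio
    return json.dumps(extra, ensure_ascii=False)


print(build_extra_params(model="seed-tts-2.0-standard", speech_rate=20))
```

Validating locally gives the user a clear message before any credits are spent on a request that may fail.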

📥 User Input Parsing (Parameter Recognition)

Map user intent to parameters using product form_config (e.g. voice, speed):

User intent / phrasing	Parameter (if in form_config)	Notes
旁白 / 配音 / 朗读 / 把这段读出来 (narration / voice-over / read aloud / read this out)	prompt + speaker (recommended to ask)	Confirm the content and timbre before calling; see the 🎤 section above on narration and voice-over requests.
女声 / 女声朗读 / female voice	voice_id / voice_type / speaker	Use value from form_config or e.g. speaker ID
男声 / 男声朗读 / male voice	voice_id / voice_type / speaker	Use value from form_config or e.g. speaker ID
发音人 / 音色 / speaker	speaker	seed-tts-2.0: e.g. zh_male_sophie_uranus_bigtts; see volcengine_tts_timbre_list.json (native format)
情感 / 情绪 / emotion	audio_params.emotion	e.g. "neutral", "sad"; supported by some timbres
语速快/慢 / speed up/slow	audio_params.speech_rate	Range [-50, 100]; 0 is normal
音调 / pitch	pitch	If supported
大声/小声 / volume	audio_params.loudness_rate	Range [-50, 100]
风格 (style) expressive/standard	model	seed-tts-2.0: seed-tts-2.0-expressive / seed-tts-2.0-standard

If the user does not specify, use form_config defaults. Do not send parameters not present in the product’s credit_rules/attributes or form_config (reflection will strip them on retry).
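
One way to apply this rule before sending is sketched below. The function split_supported is hypothetical, and the set of allowed field names is passed in by the caller because the exact shape of form_config is not documented here.

```python
CORE_FIELDS = {"prompt", "n", "input_images", "cast"}


def split_supported(params, allowed_fields):
    """Return (kept, dropped): params the product accepts and the rest."""
    allowed = CORE_FIELDS | set(allowed_fields)
    kept = {k: v for k, v in params.items() if k in allowed}
    dropped = sorted(k for k in params if k not in allowed)
    return kept, dropped


# Illustrative call: "pitch" is not in the allowed set, so it is dropped.
kept, dropped = split_supported(
    {"prompt": "Hello", "n": 1, "speaker": "zh_female_vv_uranus_bigtts", "pitch": 3},
    allowed_fields={"speaker", "model", "audio_params"},
)
print(kept)
print(dropped)
```

Returning the dropped keys lets the agent tell the user which request could not be honored.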

🧠 User Preference Memory

Storage: ~/.openclaw/memory/ima_prefs.json

{
  "user_{user_id}": {
    "text_to_speech": {
      "model_id": "...",
      "model_name": "...",
      "credit": 2,
      "last_used": "..."
    }
  }
}

Before generation: Load prefs; if user_{user_id}.text_to_speech exists, use that model and optionally mention it.
After success: Save used model to user_{user_id}.text_to_speech.
On explicit change: e.g. “换成XXX” (switch to XXX) / “以后都用XXX” (always use XXX from now on) → switch and save.
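
A minimal sketch of reading and writing this file follows. The function names are hypothetical, and the ISO 8601 timestamp for last_used is an assumption, since the format is not specified above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

PREFS_PATH = Path.home() / ".openclaw" / "memory" / "ima_prefs.json"


def _read(path):
    """Load the prefs file; a missing or corrupt file yields an empty dict."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def load_pref(user_id, path=PREFS_PATH):
    """Return the saved text_to_speech preference for user_id, or None."""
    return _read(path).get(f"user_{user_id}", {}).get("text_to_speech")


def save_pref(user_id, model_id, model_name, credit, path=PREFS_PATH):
    """Store the model that was just used, keeping other entries intact."""
    data = _read(path)
    data.setdefault(f"user_{user_id}", {})["text_to_speech"] = {
        "model_id": model_id,
        "model_name": model_name,
        "credit": credit,
        # Assumption: ISO 8601 UTC timestamp.
        "last_used": datetime.now(timezone.utc).isoformat(),
    }
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
```

Call load_pref(user_id) before generation and save_pref(...) after a successful task; both accept an alternative path, which is convenient for testing.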

💬 User Experience Protocol (IM / Feishu / Discord)

TTS usually completes in a few seconds to tens of seconds. Do not leave users without feedback.

Step 0 — Initial Acknowledgment (Normal Reply)

First reply with a short acknowledgment (normal reply, not message tool), e.g.:

好的，正在帮你把这段文字转成语音。
OK, converting this text to speech.

Step 1 — Pre-Generation Notification (message tool)

Push once:

🔊 开始语音合成，请稍候…
• 模型：[Model Name]
• 预计耗时：[X ~ Y 秒]
• 消耗积分：[N pts]

Step 2 — Progress

Poll every 2–5s. Every 10–15s send a progress update, e.g.:

⏳ 语音合成中… [P]%
已等待 [elapsed]s，预计最长 [max]s

Cap progress at 95% until API returns success.
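
The cap can be sketched as a small function. Estimating progress linearly from elapsed time is an assumption made for this illustration; the document does not say how [P] is computed, and the name progress_percent is hypothetical.

```python
def progress_percent(elapsed_s, max_s, done=False):
    """Estimate progress from elapsed time, capped at 95% until success."""
    if done:
        return 100  # only report 100% once the API returns success
    if max_s <= 0:
        return 0
    return max(0, min(95, int(100 * elapsed_s / max_s)))


for elapsed in (10, 30, 60):
    print(elapsed, progress_percent(elapsed, 40))
```

Holding the value at 95% avoids showing a finished bar while the task is still running past its estimate.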

Step 3 — Success (message tool)

When resource_status == 1 and status != "failed", send the audio and caption:

media = medias[0].url
caption example:

✅ 语音合成成功！
• 模型：[Model Name]
• 耗时：[actual]s
• 消耗积分：[N pts]
🔗 原始链接：[url]

Use the URL from the API (do not use local file paths).

Step 4 — Failure (message tool)

On failure or API/network error, send a short, user-friendly message and suggestions:

❌ 语音合成失败
• 原因：[自然语言原因]
• 建议：换个模型重试或检查文本长度/内容

需要我帮你用其他模型重试吗？

Error translation (do not expose raw API/technical errors):

Technical	✅ Say (CN)	✅ Say (EN)
401 Unauthorized	密钥无效或未授权，请至 imaclaw.ai 生成新密钥	API key invalid; generate at imaclaw.ai
4008 Insufficient points	积分不足，请至 imaclaw.ai 购买积分	Insufficient points; buy at imaclaw.ai
Invalid product attribute	参数配置异常，请稍后重试	Configuration error, try again later
Error 6006 / 6010	积分或参数不匹配，请换模型或重试	Points/params mismatch, try another model
resource_status == 2 / status failed	合成失败，建议换模型或缩短文本	Synthesis failed, try another model or shorter text
timeout	合成超时，请稍后重试	Timed out, try again later
Network error	网络不稳定，请检查后重试	Network unstable, check and retry

Links: API key: https://www.imaclaw.ai/imaclaw/apikey; Credits: https://www.imaclaw.ai/imaclaw/subscription
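
The error translation table can be sketched as a lookup. The messages are copied from the table; the regular expressions used to recognize each raw error are assumptions about how the error text looks, and the name friendly_error is hypothetical.

```python
import re

# (pattern, Chinese message, English message); patterns are assumptions.
ERROR_RULES = [
    (r"\b401\b|unauthorized",
     "密钥无效或未授权，请至 imaclaw.ai 生成新密钥",
     "API key invalid; generate at imaclaw.ai"),
    (r"\b4008\b|insufficient points",
     "积分不足，请至 imaclaw.ai 购买积分",
     "Insufficient points; buy at imaclaw.ai"),
    (r"invalid product attribute",
     "参数配置异常，请稍后重试",
     "Configuration error, try again later"),
    (r"\b(6006|6010)\b",
     "积分或参数不匹配，请换模型或重试",
     "Points/params mismatch, try another model"),
    (r"timed? ?out",
     "合成超时，请稍后重试",
     "Timed out, try again later"),
    (r"network|connection",
     "网络不稳定，请检查后重试",
     "Network unstable, check and retry"),
]
FALLBACK = ("合成失败，建议换模型或缩短文本",
            "Synthesis failed, try another model or shorter text")


def friendly_error(raw, lang="en"):
    """Map a raw technical error string to a user-facing message."""
    for pattern, cn, en in ERROR_RULES:
        if re.search(pattern, raw, re.IGNORECASE):
            return cn if lang == "zh" else en
    cn, en = FALLBACK
    return cn if lang == "zh" else en


print(friendly_error("HTTP 401 Unauthorized"))
```

Unrecognized errors fall back to the generic synthesis-failed message, so raw technical text is never shown to the user.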

Step 5 — Done

After Steps 0–4, no further reply is needed. Do not send duplicate confirmations.

Common Mistakes

Mistake	Fix
prompt at top level	Put prompt inside `parameters[].parameters`
Wrong or missing attribute_id	Always call product list first; use credit_rules[0]
Single poll	Poll until all medias have resource_status == 1
Ignoring status when resource_status=1	Check status != "failed"
Sending params not in form_config/credit_rules	Use only params from product list; script reflection will strip others on retry

Security & Local Data

Network: This skill uses only https://api.imastudio.com (no image upload domain for TTS).
Local files: ~/.openclaw/memory/ima_prefs.json (preferences), ~/.openclaw/logs/ima_skills/ (logs, e.g. 7-day retention). No prompts or API keys stored.
API key: Set via environment (e.g. IMA_API_KEY) or agent config; never hardcode.

Python Example (Minimal)

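import os
import time

import requests

BASE = "https://api.imastudio.com"
HEADERS = {
    # Read the key from the environment; never hardcode it.
    "Authorization": f"Bearer {os.environ['IMA_API_KEY']}",
    "Content-Type": "application/json",
    "x-app-source": "ima_skills",
}
MODEL_ID = "seed-tts-2.0"


def find_leaf(nodes, model_id):
    """Return the first type=3 node for model_id that has credit_rules."""
    for node in nodes or []:
        if (str(node.get("type")) == "3" and node.get("model_id") == model_id
                and node.get("credit_rules")):
            return node
        # Assumption: child nodes are nested under a "children" key.
        found = find_leaf(node.get("children"), model_id)
        if found:
            return found
    return None


# 1. Product list (assumption: "data" is a list of root nodes)
r = requests.get(
    f"{BASE}/open/v1/product/list",
    headers=HEADERS,
    params={"app": "ima", "platform": "web", "category": "text_to_speech"},
    timeout=30,
)
r.raise_for_status()
leaf = find_leaf(r.json()["data"], MODEL_ID)
if leaf is None:
    raise RuntimeError(f"{MODEL_ID} not found in product list")
rule = leaf["credit_rules"][0]
attribute_id, credit = rule["attribute_id"], rule["points"]
model_id, model_version = leaf["model_id"], leaf["id"]
model_name = leaf.get("name") or model_id  # assumption: display name key is "name"

# 2. Create task
body = {
    "task_type": "text_to_speech",
    "enable_multi_model": False,
    "src_img_url": [],
    "parameters": [{
        "attribute_id": attribute_id,
        "model_id": model_id,
        "model_name": model_name,
        "model_version": model_version,
        "app": "ima", "platform": "web",
        "category": "text_to_speech",
        "credit": credit,
        "parameters": {
            "prompt": "Hello, world.",
            "n": 1,
            "input_images": [],
            "cast": {"points": credit, "attribute_id": attribute_id},
        },
    }],
}
r = requests.post(f"{BASE}/open/v1/tasks/create", headers=HEADERS, json=body, timeout=30)
r.raise_for_status()
task_id = r.json()["data"]["id"]

# 3. Poll every 3 s; the 120 s limit is an arbitrary safety cap
deadline = time.monotonic() + 120
while time.monotonic() < deadline:
    r = requests.post(f"{BASE}/open/v1/tasks/detail", headers=HEADERS,
                      json={"task_id": task_id}, timeout=30)
    r.raise_for_status()
    medias = r.json()["data"].get("medias") or []
    for m in medias:
        rs = m.get("resource_status") or 0  # null counts as 0 (processing)
        if rs == 2 or (m.get("status") or "").lower() == "failed":
            raise RuntimeError(m.get("error_msg") or "failed")
    # Done only when every media is available and carries a URL.
    if medias and all(m.get("resource_status") == 1
                      and (m.get("url") or m.get("watermark_url")) for m in medias):
        print(medias[0].get("url") or medias[0].get("watermark_url"))
        break
    time.sleep(3)
else:
    raise TimeoutError("TTS task did not finish within 120 s")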

Quick Reference

Item	Value
Task type	`text_to_speech`
Product list	`GET /open/v1/product/list?category=text_to_speech`
Create	`POST /open/v1/tasks/create` (prompt inside parameters[].parameters)
Poll	`POST /open/v1/tasks/detail` every 2–5s
Done when	All medias: resource_status=1, status≠"failed", url present
Script	`scripts/ima_tts_create.py` (--list-models, --model-id, --prompt, --output-json)

ima-tts-ai

Installation