swarm-self-heal
v0.1.1
Swarm reliability watchdog for OpenClaw: validates gateway/channel and every lane, performs bounded recovery, and emits auditable receipts.
Installation
Please help me install the skill `swarm-self-heal` from SkillHub official store.
npx skills add tkuehnl/swarm-self-heal
When to use this skill
Use this skill when the user wants to:
- Diagnose why a multi-agent swarm feels "stuck" or partially offline
- Check gateway + channel + lane liveness in one run
- Perform bounded auto-recovery (restart + retry only)
- Capture auditable receipts for incident timelines
- Keep a primary watchdog lane plus a backup lane in place
Commands
# Install/refresh watchdog scripts + cron wiring
bash skills/swarm-self-heal/scripts/setup.sh
# Run an immediate canary check
bash skills/swarm-self-heal/scripts/check.sh
# Run watchdog directly (uses deployed workspace path)
bash ~/.openclaw/workspace-studio/scripts/anvil_watchdog.sh
# Optional: increase lane ping timeout for slower providers
PING_TIMEOUT_SECONDS=180 bash ~/.openclaw/workspace-studio/scripts/anvil_watchdog.sh
What it checks
- Gateway health via `openclaw health`
- Channel readiness via `openclaw channels status --json --probe`
- Passive lane recency via `openclaw status --json` (latest OpenClaw-compatible)
- Active lane probe only when stale for `main`, `builder-1`, `builder-2`, `reviewer`, `designer`
- Bounded recovery with a single restart pass + targeted re-probe of infra failures
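The "active probe only when stale" rule can be illustrated with a small sketch. This is not the shipped watchdog logic: the `STALE_AFTER_SECONDS` threshold, the function names, and the `<lane> <last_seen_epoch>` input format are all assumptions made for illustration. How recency is actually extracted from `openclaw status --json` is not shown here.

```shell
# Hypothetical staleness window; the real watchdog's threshold is not documented here.
STALE_AFTER_SECONDS="${STALE_AFTER_SECONDS:-900}"

# Succeeds (exit 0) when the lane's last activity is older than the window.
is_stale() {
  local last_seen_epoch="$1" now_epoch="$2"
  [ $((now_epoch - last_seen_epoch)) -gt "$STALE_AFTER_SECONDS" ]
}

# Reads "<lane> <last_seen_epoch>" lines on stdin (an assumed format) and
# prints only the lanes that would need an active probe.
lanes_to_probe() {
  local now_epoch="$1" lane last_seen
  while read -r lane last_seen; do
    [ -n "$lane" ] || continue
    if is_stale "$last_seen" "$now_epoch"; then
      printf '%s\n' "$lane"
    fi
  done
}
```

Gating the active probe on passive recency keeps a healthy run cheap: lanes that were recently active are never pinged, so only the stale ones incur the ping timeout.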
Output contract
The watchdog output includes:
- timestamp
- targets
- ok_agents
- failed_agents
- actions
- VERDICT
- RECEIPT
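A quick way to check a captured run against this contract is to confirm that every field name appears in the output. The sketch below is only an illustration: it assumes the field names appear literally as whole words in the watchdog output and makes no assumption about the rest of the format. The `missing_fields` helper is hypothetical and is not part of the skill.

```shell
# Field names taken from the output contract above.
REQUIRED_FIELDS="timestamp targets ok_agents failed_agents actions VERDICT RECEIPT"

# Reads watchdog output on stdin, prints each required field name that is
# absent, and returns non-zero if any field is missing.
missing_fields() {
  local text field status=0
  text="$(cat)"
  for field in $REQUIRED_FIELDS; do
    # -w: whole-word match, -F: fixed string (no regex interpretation)
    if ! printf '%s\n' "$text" | grep -qwF -- "$field"; then
      printf '%s\n' "$field"
      status=1
    fi
  done
  return "$status"
}

# Example use (not executed here):
#   bash ~/.openclaw/workspace-studio/scripts/anvil_watchdog.sh | missing_fields
```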
Safety model
- Bounded recovery only (single restart pass per run)
- No destructive state wipes
- No blind reinstall behavior
- Recovery actions are explicit in output
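The bounded-recovery idea (one restart pass, then re-probe only what failed) can be sketched as follows. This is a minimal illustration under stated assumptions, not the skill's implementation: `probe` and `restart_once` are hypothetical hooks standing in for the real health checks and restart command, and the printed messages are made up rather than the actual output format.

```shell
# Hypothetical hooks: replace with real probe and restart commands.
probe() { return 0; }
restart_once() { :; }

# Usage: bounded_recover target...
# Probes every target, performs at most one restart pass, then re-probes
# only the targets that failed the first time. Returns 0 when nothing is
# left failing, 1 otherwise.
bounded_recover() {
  local target failed="" remaining=""
  for target in "$@"; do
    probe "$target" || failed="$failed $target"
  done
  if [ -z "$failed" ]; then
    echo "no action needed"
    return 0
  fi
  # Single restart pass per run: never loops, never escalates.
  restart_once
  echo "action: restart (after failures:$failed)"
  for target in $failed; do
    probe "$target" || remaining="$remaining $target"
  done
  if [ -n "$remaining" ]; then
    echo "still failing:$remaining"
    return 1
  fi
  echo "recovered:$failed"
}
```

Because the restart happens at most once per invocation, a persistently broken lane produces a failing run instead of a restart loop, which leaves the next step to the operator.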
Notes
- Cron wiring sets both primary and backup watchdog lanes to `xhigh` thinking.
- Telegram target is auto-derived from config when available, with a safe fallback.
- Healthy runs can be summarized as a single line to reduce operator noise.