SkillHub

kubernetes

v1.0.3

Complete Platform Agent Swarm — A coordinated multi-agent system for Kubernetes and OpenShift platform operations. Includes Orchestrator (Jarvis), Cluster Ops (Atlas), GitOps (Flow), Security (Shield), Observability (Pulse), Artifacts (Cache), and Developer Experience (Desk).

Sourced from ClawHub, Authored by Chin K

Installation

Please help me install the skill `kubernetes` from SkillHub official store. npx skills add kcns008/kubernetes

Cluster Agent Swarm — Complete Platform Operations

⚠️ Runtime Requirements & Credentials Required Before Use

This skill package provides powerful Kubernetes/OpenShift cluster management capabilities but requires specific credentials and runtime configurations before it can function. Installers must configure these prerequisites:

Cluster Access

Requirement Description Environment Variable
Kubeconfig Valid kubeconfig file with cluster access KUBECONFIG or ~/.kube/config
kubectl Kubernetes CLI installed and configured Must be in PATH
oc OpenShift CLI (for OpenShift clusters) Must be in PATH

Cloud Provider Credentials

Platform Required Credentials Notes
AWS/EKS/ROSA AWS credentials with EKS/ROSA access AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, optionally AWS_SESSION_TOKEN
Azure/ARO Azure authentication az login or AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID
GCP/GKE GCP authentication gcloud auth application-default login or GOOGLE_APPLICATION_CREDENTIALS

External Service Integrations

Service Required Description
ArgoCD Recommended Server URL + auth token for GitOps sync/management
Prometheus Recommended URL for metric queries
Loki/Elasticsearch Optional URL for log queries
HashiCorp Vault Optional Token or Kubernetes auth for secrets
Container Registry Optional Auth for image scanning/promotion

Session Setup

Before using the agents, you MUST set up a session context:

# Set up session context for your environment
bash skills/orchestrator/scripts/setup-session.sh <environment> [context-name]

# Environments: dev, qa, staging, prod
# Note: prod requires human approval for all modifications

Security Considerations

  • Agents operate with least privilege by default
  • All credential access is logged
  • Production modifications require human approval
  • Secrets are never logged or stored in code

⚠️ Security & Risk Considerations - Read Before Installing

Source Verification

  • This skill installs code from a third-party GitHub repository
  • Always verify the source: Confirm https://github.com/kcns008/cluster-agent-swarm-skills is the intended repository
  • Prefer installing from a pinned commit hash or verified release tag
  • Review commit history and authors before installing

Install Mechanism Warning

  • This is a scripted skill (not instruction-only) - it will write executable code to disk
  • Installing via npx skills add pulls and executes code from GitHub
  • Review scripts before running - some scripts perform destructive operations (deletions, cleanup, promotion)
  • Consider using offline/cached toolchains where possible

Persistence & Blast Radius

  • Agents maintain persistent state across sessions via:
  • WORKING.md - session progress tracking
  • LOGS.md - action audit trail
  • MEMORY.md - long-term learnings
  • Agents are configured to commit changes to these files as part of normal operation
  • This persistence increases blast radius if misused - limit repository write access if concerned

Human Approval Enforcement

  • The skill documentation claims human approval required for production changes
  • This is a procedural control, NOT a technical enforcement
  • Your platform MUST enforce an approval gate before allowing production operations
  • Do not rely on agent self-restriction for production safety

Principle of Least Privilege - Required

  • DO NOT provide owner/root-level cloud credentials
  • Create dedicated, minimal-permission service accounts for:
  • Kubernetes namespace-level access (not cluster-admin)
  • AWS IAM roles with limited EKS permissions
  • Azure service principals with limited subscription access
  • GCP service accounts with limited project permissions
  • Never provide production credentials until you have audited the code in non-production

Sandbox Before Production

  1. Run this skill in an isolated/non-production environment first
  2. Manually step through scripts to understand their behavior
  3. Pay special attention to:
  4. *-cleanup.sh scripts - may delete resources
  5. *-promote.sh scripts - may promote artifacts
  6. *-delete.sh scripts - explicitly destructive
  7. Verify no unexpected network calls to external endpoints

Supply Chain Tools

  • Scripts may download binaries (syft, cosign, trivy, etc.)
  • Only allow downloads from trusted release sources (official GitHub releases, package managers)
  • Consider curating offline toolchains if your environment requires it

This is the complete cluster-agent-swarm skill package. When you add this skill, you get access to ALL 7 specialized agents working together as a coordinated swarm.

Installation Options

npx skills add https://github.com/kcns008/cluster-agent-swarm-skills

This installs all 7 agents as a single combined skill with access to all capabilities.

Install Individual Skills

Each agent can also be installed separately using GitHub tree path or --skill flag:

# Using GitHub tree path (recommended)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/tree/main/skills/orchestrator

# Using --skill flag (if supported by your skills tool)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills --skill orchestrator

# Available individual skills:
# - orchestrator  (Jarvis - task routing)
# - cluster-ops   (Atlas - cluster operations)
# - gitops        (Flow - ArgoCD, Helm, Kustomize)
# - security      (Shield - RBAC, policies)
# - observability (Pulse - metrics, alerts)
# - artifacts     (Cache - registries, SBOM)
# - developer-experience (Desk - namespaces, onboarding)
# - qmd           (Local hybrid search for markdown notes/docs)

npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/skills/gitops

Security - Shield (RBAC, policies, CVEs)

npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/skills/security

Observability - Pulse (metrics, alerts, incidents)

npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/skills/observability

Artifacts - Cache (registries, SBOM, promotions)

npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/skills/artifacts

Developer Experience - Desk (namespaces, onboarding)

npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/skills/developer-experience

QMD - Local Markdown Search (notes, docs, knowledge bases)

npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/skills/qmd

---

## The Swarm — Agent Roster

| Agent | Code Name | Session Key | Domain |
|-------|-----------|-------------|--------|
| Orchestrator | Jarvis | `agent:platform:orchestrator` | Task routing, coordination, standups |
| Cluster Ops | Atlas | `agent:platform:cluster-ops` | Cluster lifecycle, nodes, upgrades |
| GitOps | Flow | `agent:platform:gitops` | ArgoCD, Helm, Kustomize, deploys |
| Security | Shield | `agent:platform:security` | RBAC, policies, secrets, scanning |
| Observability | Pulse | `agent:platform:observability` | Metrics, logs, alerts, incidents |
| Artifacts | Cache | `agent:platform:artifacts` | Registries, SBOM, promotion, CVEs |
| Developer Experience | Desk | `agent:platform:developer-experience` | Namespaces, onboarding, support |

---

## Agent Capabilities Summary

### What Agents CAN Do
- Read cluster state (`kubectl get`, `kubectl describe`, `oc get`)
- Deploy via GitOps (`argocd app sync`, Flux reconciliation)
- Create documentation and reports
- Investigate and triage incidents
- Provision standard resources (namespaces, quotas, RBAC)
- Run health checks and audits
- Scan images and generate SBOMs
- Query metrics and logs
- Execute pre-approved runbooks

### What Agents CANNOT Do (Human-in-the-Loop Required)
- Delete production resources (`kubectl delete` in prod)
- Modify cluster-wide policies (NetworkPolicy, OPA, Kyverno cluster policies)
- Make direct changes to secrets without rotation workflow
- Modify network routes or service mesh configuration
- Scale beyond defined resource limits
- Perform irreversible cluster upgrades
- Approve production deployments (can prepare, human approves)
- Change RBAC at cluster-admin level

---

## Communication Patterns

### @Mentions
Agents communicate via @mentions in shared task comments:

@Shield Please review the RBAC for payment-service v3.2 before I sync. @Pulse Is the CPU spike related to the deployment or external traffic? @Atlas The staging cluster needs 2 more worker nodes.

### Thread Subscriptions
- Commenting on a task → auto-subscribe
- Being @mentioned → auto-subscribe
- Being assigned → auto-subscribe
- Once subscribed → receive ALL future comments on heartbeat

### Escalation Path
1. Agent detects issue
2. Agent attempts resolution within guardrails
3. If blocked → @mention another agent or escalate to human
4. P1 incidents → all relevant agents auto-notified

---

## Heartbeat Schedule

Agents wake on staggered 5-minute intervals:

/5 * * * * Atlas (Cluster Ops - needs fast response for incidents) /5 * * * * Pulse (Observability - needs fast response for alerts) /5 * * * * Shield (Security - fast response for CVEs and threats) /10 * * * * Flow (GitOps - deployments can wait a few minutes) /10 * * * * Cache (Artifacts - promotions are scheduled) /15 * * * * Desk (DevEx - developer requests aren't usually urgent) */15 * * * * Orchestrator (Coordination - overview and standups)

---

## Key Principles

- **Roles over genericism** — Each agent has a defined SOUL with exactly who they are
- **Files over mental notes** — Only files persist between sessions
- **Staggered schedules** — Don't wake all agents at once
- **Shared context** — One source of truth for tasks and communication
- **Heartbeat, not always-on** — Balance responsiveness with cost
- **Human-in-the-loop** — Critical actions require approval
- **Guardrails over freedom** — Define what agents can and cannot do
- **Audit everything** — Every action logged to activity feed
- **Reliability first** — System stability always wins over new features
- **Security by default** — Deny access, approve by exception

---

## Detailed Agent Capabilities

### Orchestrator (Jarvis)
- Task routing: determining which agent should handle which request
- Workflow orchestration: coordinating multi-agent operations
- Daily standups: compiling swarm-wide status reports
- Priority management: determining urgency and sequencing of work
- Cross-agent communication: facilitating collaboration
- Accountability: tracking what was promised vs what was delivered

### Cluster Ops (Atlas)
- OpenShift/Kubernetes cluster operations (upgrades, scaling, patching)
- Node pool management and autoscaling
- Resource quota management and capacity planning
- Network troubleshooting (OVN-Kubernetes, Cilium, Calico)
- Storage class management and PVC/CSI issues
- etcd backup, restore, and health monitoring
- Multi-platform expertise (OCP, EKS, AKS, GKE, ROSA, ARO)

### GitOps (Flow)
- ArgoCD application management (sync, rollback, sync waves, hooks)
- Helm chart development, debugging, and templating
- Kustomize overlays and patch generation
- ApplicationSet templates for multi-cluster deployments
- Deployment strategy management (canary, blue-green, rolling)
- Git repository management and branching strategies
- Drift detection and remediation
- Secrets management integration (Vault, Sealed Secrets, External Secrets)

### Security (Shield)
- RBAC audit and management
- NetworkPolicy review and enforcement
- Security policy validation (OPA, Kyverno)
- Vulnerability scanning (image scanning, CVE triage)
- Secret rotation workflows
- Security incident investigation
- Compliance reporting

### Observability (Pulse)
- Prometheus/Grafana metric queries
- Log aggregation and search (Loki, Elasticsearch)
- Alert triage and investigation
- SLO tracking and error budget monitoring
- Incident response coordination
- Dashboards and visualization
- Telemetry pipeline troubleshooting

### Artifacts (Cache)
- Container registry management
- Image scanning and CVE analysis
- SBOM generation and tracking
- Artifact promotion workflows
- Version management
- Registry caching and proxying

### Developer Experience (Desk)
- Namespace provisioning
- Resource quota and limit range management
- Developer onboarding
- Template generation
- Developer support and troubleshooting
- Documentation generation

---

## File Structure

cluster-agent-swarm-skills/ ├── SKILL.md # This file - combined swarm ├── AGENTS.md # Swarm configuration and protocols ├── skills/ │ ├── orchestrator/ # Jarvis - task routing │ │ └── SKILL.md │ ├── cluster-ops/ # Atlas - cluster operations │ │ └── SKILL.md │ ├── gitops/ # Flow - GitOps │ │ └── SKILL.md │ ├── security/ # Shield - security │ │ └── SKILL.md │ ├── observability/ # Pulse - monitoring │ │ └── SKILL.md │ ├── artifacts/ # Cache - artifacts │ │ └── SKILL.md │ └── developer-experience/ # Desk - DevEx │ └── SKILL.md │ └── qmd/ # Local markdown search │ └── SKILL.md ├── scripts/ # Shared scripts └── references/ # Shared documentation ```


Reference Documentation

For detailed capabilities of each agent, refer to individual SKILL.md files: - skills/orchestrator/SKILL.md - Full Orchestrator documentation - skills/cluster-ops/SKILL.md - Full Cluster Ops documentation - skills/gitops/SKILL.md - Full GitOps documentation - skills/security/SKILL.md - Full Security documentation - skills/observability/SKILL.md - Full Observability documentation - skills/artifacts/SKILL.md - Full Artifacts documentation - skills/developer-experience/SKILL.md - Full Developer Experience documentation - skills/qmd/SKILL.md - QMD local markdown search documentation