kubernetes
v1.0.3Complete Platform Agent Swarm — A coordinated multi-agent system for Kubernetes and OpenShift platform operations. Includes Orchestrator (Jarvis), Cluster Ops (Atlas), GitOps (Flow), Security (Shield), Observability (Pulse), Artifacts (Cache), and Developer Experience (Desk).
Installation
Cluster Agent Swarm — Complete Platform Operations
⚠️ Runtime Requirements & Credentials Required Before Use
This skill package provides powerful Kubernetes/OpenShift cluster management capabilities but requires specific credentials and runtime configurations before it can function. Installers must configure these prerequisites:
Cluster Access
| Requirement | Description | Environment Variable |
|---|---|---|
| Kubeconfig | Valid kubeconfig file with cluster access | KUBECONFIG or ~/.kube/config |
| kubectl | Kubernetes CLI installed and configured | Must be in PATH |
| oc | OpenShift CLI (for OpenShift clusters) | Must be in PATH |
Cloud Provider Credentials
| Platform | Required Credentials | Notes |
|---|---|---|
| AWS/EKS/ROSA | AWS credentials with EKS/ROSA access | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, optionally AWS_SESSION_TOKEN |
| Azure/ARO | Azure authentication | az login or AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID |
| GCP/GKE | GCP authentication | gcloud auth application-default login or GOOGLE_APPLICATION_CREDENTIALS |
External Service Integrations
| Service | Required | Description |
|---|---|---|
| ArgoCD | Recommended | Server URL + auth token for GitOps sync/management |
| Prometheus | Recommended | URL for metric queries |
| Loki/Elasticsearch | Optional | URL for log queries |
| HashiCorp Vault | Optional | Token or Kubernetes auth for secrets |
| Container Registry | Optional | Auth for image scanning/promotion |
Session Setup
Before using the agents, you MUST set up a session context:
# Set up session context for your environment
bash skills/orchestrator/scripts/setup-session.sh <environment> [context-name]
# Environments: dev, qa, staging, prod
# Note: prod requires human approval for all modifications
Security Considerations
- Agents operate with least privilege by default
- All credential access is logged
- Production modifications require human approval
- Secrets are never logged or stored in code
⚠️ Security & Risk Considerations - Read Before Installing
Source Verification
- This skill installs code from a third-party GitHub repository
- Always verify the source: Confirm
https://github.com/kcns008/cluster-agent-swarm-skillsis the intended repository - Prefer installing from a pinned commit hash or verified release tag
- Review commit history and authors before installing
Install Mechanism Warning
- This is a scripted skill (not instruction-only) - it will write executable code to disk
- Installing via
npx skills addpulls and executes code from GitHub - Review scripts before running - some scripts perform destructive operations (deletions, cleanup, promotion)
- Consider using offline/cached toolchains where possible
Persistence & Blast Radius
- Agents maintain persistent state across sessions via:
WORKING.md- session progress trackingLOGS.md- action audit trailMEMORY.md- long-term learnings- Agents are configured to commit changes to these files as part of normal operation
- This persistence increases blast radius if misused - limit repository write access if concerned
Human Approval Enforcement
- The skill documentation claims human approval required for production changes
- This is a procedural control, NOT a technical enforcement
- Your platform MUST enforce an approval gate before allowing production operations
- Do not rely on agent self-restriction for production safety
Principle of Least Privilege - Required
- DO NOT provide owner/root-level cloud credentials
- Create dedicated, minimal-permission service accounts for:
- Kubernetes namespace-level access (not cluster-admin)
- AWS IAM roles with limited EKS permissions
- Azure service principals with limited subscription access
- GCP service accounts with limited project permissions
- Never provide production credentials until you have audited the code in non-production
Sandbox Before Production
- Run this skill in an isolated/non-production environment first
- Manually step through scripts to understand their behavior
- Pay special attention to:
*-cleanup.shscripts - may delete resources*-promote.shscripts - may promote artifacts*-delete.shscripts - explicitly destructive- Verify no unexpected network calls to external endpoints
Supply Chain Tools
- Scripts may download binaries (syft, cosign, trivy, etc.)
- Only allow downloads from trusted release sources (official GitHub releases, package managers)
- Consider curating offline toolchains if your environment requires it
This is the complete cluster-agent-swarm skill package. When you add this skill, you get access to ALL 7 specialized agents working together as a coordinated swarm.
Installation Options
Install All Skills (Recommended)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills
This installs all 7 agents as a single combined skill with access to all capabilities.
Install Individual Skills
Each agent can also be installed separately using GitHub tree path or --skill flag:
# Using GitHub tree path (recommended)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/tree/main/skills/orchestrator
# Using --skill flag (if supported by your skills tool)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills --skill orchestrator
# Available individual skills:
# - orchestrator (Jarvis - task routing)
# - cluster-ops (Atlas - cluster operations)
# - gitops (Flow - ArgoCD, Helm, Kustomize)
# - security (Shield - RBAC, policies)
# - observability (Pulse - metrics, alerts)
# - artifacts (Cache - registries, SBOM)
# - developer-experience (Desk - namespaces, onboarding)
# - qmd (Local hybrid search for markdown notes/docs)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/skills/gitops
Security - Shield (RBAC, policies, CVEs)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/skills/security
Observability - Pulse (metrics, alerts, incidents)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/skills/observability
Artifacts - Cache (registries, SBOM, promotions)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/skills/artifacts
Developer Experience - Desk (namespaces, onboarding)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/skills/developer-experience
QMD - Local Markdown Search (notes, docs, knowledge bases)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/skills/qmd
---
## The Swarm — Agent Roster
| Agent | Code Name | Session Key | Domain |
|-------|-----------|-------------|--------|
| Orchestrator | Jarvis | `agent:platform:orchestrator` | Task routing, coordination, standups |
| Cluster Ops | Atlas | `agent:platform:cluster-ops` | Cluster lifecycle, nodes, upgrades |
| GitOps | Flow | `agent:platform:gitops` | ArgoCD, Helm, Kustomize, deploys |
| Security | Shield | `agent:platform:security` | RBAC, policies, secrets, scanning |
| Observability | Pulse | `agent:platform:observability` | Metrics, logs, alerts, incidents |
| Artifacts | Cache | `agent:platform:artifacts` | Registries, SBOM, promotion, CVEs |
| Developer Experience | Desk | `agent:platform:developer-experience` | Namespaces, onboarding, support |
---
## Agent Capabilities Summary
### What Agents CAN Do
- Read cluster state (`kubectl get`, `kubectl describe`, `oc get`)
- Deploy via GitOps (`argocd app sync`, Flux reconciliation)
- Create documentation and reports
- Investigate and triage incidents
- Provision standard resources (namespaces, quotas, RBAC)
- Run health checks and audits
- Scan images and generate SBOMs
- Query metrics and logs
- Execute pre-approved runbooks
### What Agents CANNOT Do (Human-in-the-Loop Required)
- Delete production resources (`kubectl delete` in prod)
- Modify cluster-wide policies (NetworkPolicy, OPA, Kyverno cluster policies)
- Make direct changes to secrets without rotation workflow
- Modify network routes or service mesh configuration
- Scale beyond defined resource limits
- Perform irreversible cluster upgrades
- Approve production deployments (can prepare, human approves)
- Change RBAC at cluster-admin level
---
## Communication Patterns
### @Mentions
Agents communicate via @mentions in shared task comments:
@Shield Please review the RBAC for payment-service v3.2 before I sync. @Pulse Is the CPU spike related to the deployment or external traffic? @Atlas The staging cluster needs 2 more worker nodes.
### Thread Subscriptions
- Commenting on a task → auto-subscribe
- Being @mentioned → auto-subscribe
- Being assigned → auto-subscribe
- Once subscribed → receive ALL future comments on heartbeat
### Escalation Path
1. Agent detects issue
2. Agent attempts resolution within guardrails
3. If blocked → @mention another agent or escalate to human
4. P1 incidents → all relevant agents auto-notified
---
## Heartbeat Schedule
Agents wake on staggered 5-minute intervals:
/5 * * * * Atlas (Cluster Ops - needs fast response for incidents) /5 * * * * Pulse (Observability - needs fast response for alerts) /5 * * * * Shield (Security - fast response for CVEs and threats) /10 * * * * Flow (GitOps - deployments can wait a few minutes) /10 * * * * Cache (Artifacts - promotions are scheduled) /15 * * * * Desk (DevEx - developer requests aren't usually urgent) */15 * * * * Orchestrator (Coordination - overview and standups)
---
## Key Principles
- **Roles over genericism** — Each agent has a defined SOUL with exactly who they are
- **Files over mental notes** — Only files persist between sessions
- **Staggered schedules** — Don't wake all agents at once
- **Shared context** — One source of truth for tasks and communication
- **Heartbeat, not always-on** — Balance responsiveness with cost
- **Human-in-the-loop** — Critical actions require approval
- **Guardrails over freedom** — Define what agents can and cannot do
- **Audit everything** — Every action logged to activity feed
- **Reliability first** — System stability always wins over new features
- **Security by default** — Deny access, approve by exception
---
## Detailed Agent Capabilities
### Orchestrator (Jarvis)
- Task routing: determining which agent should handle which request
- Workflow orchestration: coordinating multi-agent operations
- Daily standups: compiling swarm-wide status reports
- Priority management: determining urgency and sequencing of work
- Cross-agent communication: facilitating collaboration
- Accountability: tracking what was promised vs what was delivered
### Cluster Ops (Atlas)
- OpenShift/Kubernetes cluster operations (upgrades, scaling, patching)
- Node pool management and autoscaling
- Resource quota management and capacity planning
- Network troubleshooting (OVN-Kubernetes, Cilium, Calico)
- Storage class management and PVC/CSI issues
- etcd backup, restore, and health monitoring
- Multi-platform expertise (OCP, EKS, AKS, GKE, ROSA, ARO)
### GitOps (Flow)
- ArgoCD application management (sync, rollback, sync waves, hooks)
- Helm chart development, debugging, and templating
- Kustomize overlays and patch generation
- ApplicationSet templates for multi-cluster deployments
- Deployment strategy management (canary, blue-green, rolling)
- Git repository management and branching strategies
- Drift detection and remediation
- Secrets management integration (Vault, Sealed Secrets, External Secrets)
### Security (Shield)
- RBAC audit and management
- NetworkPolicy review and enforcement
- Security policy validation (OPA, Kyverno)
- Vulnerability scanning (image scanning, CVE triage)
- Secret rotation workflows
- Security incident investigation
- Compliance reporting
### Observability (Pulse)
- Prometheus/Grafana metric queries
- Log aggregation and search (Loki, Elasticsearch)
- Alert triage and investigation
- SLO tracking and error budget monitoring
- Incident response coordination
- Dashboards and visualization
- Telemetry pipeline troubleshooting
### Artifacts (Cache)
- Container registry management
- Image scanning and CVE analysis
- SBOM generation and tracking
- Artifact promotion workflows
- Version management
- Registry caching and proxying
### Developer Experience (Desk)
- Namespace provisioning
- Resource quota and limit range management
- Developer onboarding
- Template generation
- Developer support and troubleshooting
- Documentation generation
---
## File Structure
cluster-agent-swarm-skills/ ├── SKILL.md # This file - combined swarm ├── AGENTS.md # Swarm configuration and protocols ├── skills/ │ ├── orchestrator/ # Jarvis - task routing │ │ └── SKILL.md │ ├── cluster-ops/ # Atlas - cluster operations │ │ └── SKILL.md │ ├── gitops/ # Flow - GitOps │ │ └── SKILL.md │ ├── security/ # Shield - security │ │ └── SKILL.md │ ├── observability/ # Pulse - monitoring │ │ └── SKILL.md │ ├── artifacts/ # Cache - artifacts │ │ └── SKILL.md │ └── developer-experience/ # Desk - DevEx │ └── SKILL.md │ └── qmd/ # Local markdown search │ └── SKILL.md ├── scripts/ # Shared scripts └── references/ # Shared documentation ```
Reference Documentation
For detailed capabilities of each agent, refer to individual SKILL.md files:
- skills/orchestrator/SKILL.md - Full Orchestrator documentation
- skills/cluster-ops/SKILL.md - Full Cluster Ops documentation
- skills/gitops/SKILL.md - Full GitOps documentation
- skills/security/SKILL.md - Full Security documentation
- skills/observability/SKILL.md - Full Observability documentation
- skills/artifacts/SKILL.md - Full Artifacts documentation
- skills/developer-experience/SKILL.md - Full Developer Experience documentation
- skills/qmd/SKILL.md - QMD local markdown search documentation