KubeGraf
AI SRE platform for Kubernetes that detects incidents, performs automated root cause analysis, and applies HMAC-signed SafeFix patches in under 5 minutes.
Updated May 2026
About KubeGraf
KubeGraf is an AI Site Reliability Engineering (SRE) platform built for Kubernetes teams that need to detect, diagnose, and resolve cluster incidents fast. Developed by Orkastor, it combines an in-cluster Go agent with a SaaS control plane and a six-agent AI pipeline to take incidents from detection to safe remediation in a median of 4 minutes 21 seconds. The tool is available as both a SaaS product and a self-hosted/air-gapped enterprise deployment, with v1.0.0 released on March 24, 2026.
What It Is
KubeGraf sits above your existing observability stack (Prometheus, Grafana, OpenTelemetry) and adds an autonomous reasoning and remediation layer. Rather than only surfacing dashboards and alerts, it reasons over telemetry, proposes a cryptographically signed remediation with supporting evidence, and applies it only after a human approves. The core product is licensed under Apache 2.0, and the source code is publicly available on GitHub.
How the Six-Agent Pipeline Works
When an incident occurs, KubeGraf's AI pipeline activates six specialized agents in sequence:
- Topology — maps the service graph and blast radius
- RootCause — forms and ranks hypotheses across logs, traces, and deploys
- LogReasoner — extracts the panic line or OOM event from container logs
- TraceWalker — follows the failure across distributed traces
- CodeAware — identifies the PR or commit that introduced the regression
- Remediation — drafts a YAML patch with a dry-run preview and HMAC-SHA256 signature
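The sequential hand-off between the six agents can be pictured with a minimal sketch. It is illustrative only: the `Incident` class, the `STAGES` table, and `run_pipeline` are hypothetical names, and the stage bodies are placeholders rather than KubeGraf's actual logic.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    """Shared context passed from stage to stage (hypothetical shape)."""
    summary: str
    findings: list[str] = field(default_factory=list)


# Stage names and tasks follow the list above; the order is the order they run in.
STAGES = [
    ("Topology", "map the service graph and blast radius"),
    ("RootCause", "form and rank hypotheses"),
    ("LogReasoner", "extract the panic line or OOM event"),
    ("TraceWalker", "follow the failure across traces"),
    ("CodeAware", "identify the offending PR or commit"),
    ("Remediation", "draft a signed YAML patch with a dry-run preview"),
]


def run_pipeline(summary: str) -> Incident:
    """Run every stage in order, each appending its finding to the incident."""
    incident = Incident(summary)
    for name, task in STAGES:
        # Placeholder: a real agent would query telemetry or an LLM here.
        incident.findings.append(f"{name}: {task}")
    return incident


if __name__ == "__main__":
    for finding in run_pipeline("checkout pods crash-looping").findings:
        print(finding)
```

Because every stage appends to the same incident record, later agents can build on what earlier ones found.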
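The vendor states that the pipeline delivers a root cause with 3–7 evidence citations in a median of 4–7 seconds. This covers analysis only, while the 4 minute 21 second figure cited earlier spans detection through remediation. Every SafeFix™ patch is applied to a 10% canary first and auto-reverted within 30 seconds if metrics degrade.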
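As a rough illustration of the two safeguards described here, the sketch below signs a patch with HMAC-SHA256 using Python's standard `hmac` module, then makes a promote-or-revert decision for a canary. The function names, the error-rate metric, and the `tolerance` threshold are assumptions made for this example. The source does not say which metrics KubeGraf compares or how the signature is encoded.

```python
import hashlib
import hmac


def sign_patch(patch_yaml: str, key: bytes) -> str:
    """Return a hex-encoded HMAC-SHA256 signature over the patch text."""
    return hmac.new(key, patch_yaml.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_patch(patch_yaml: str, signature: str, key: bytes) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_patch(patch_yaml, key)
    return hmac.compare_digest(expected, signature)


def canary_decision(baseline_error_rate: float,
                    canary_error_rate: float,
                    tolerance: float = 0.01) -> str:
    """Return 'revert' if the canary is worse than baseline by more than
    `tolerance` (an assumed threshold), otherwise 'promote'."""
    if canary_error_rate > baseline_error_rate + tolerance:
        return "revert"
    return "promote"


if __name__ == "__main__":
    key = b"demo-key-not-for-production"  # in practice the key would live in a KMS
    patch = "spec:\n  replicas: 3\n"
    sig = sign_patch(patch, key)
    print(verify_patch(patch, sig, key))           # untouched patch verifies
    print(verify_patch(patch + "x", sig, key))     # tampered patch does not
    print(canary_decision(0.02, 0.10))
```

Verifying the signature immediately before applying the patch means any change made after approval is rejected, since the recomputed digest will no longer match.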
Security and Data Architecture
KubeGraf's agent runs inside the cluster and makes only outbound calls — no inbound ports are required. The agent pushes allowlisted snapshots (never raw secrets or config-map values) to the control plane. HMAC signing keys stay in the customer's KMS (AWS, GCP, Azure, or HashiCorp Vault). The enterprise tier ships the full control plane as a Helm chart for on-premises or air-gapped deployments, with support for bring-your-own LLM endpoints (Anthropic Claude, OpenAI, AWS Bedrock, Azure OpenAI, or any OpenAI-compatible private endpoint).
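The allowlisting idea can be sketched as a filter that copies only approved fields into the outbound snapshot, so anything not explicitly listed never leaves the cluster. The field names in `ALLOWED_FIELDS` and the `build_snapshot` helper are hypothetical. The actual allowlist KubeGraf uses is not documented in this listing.

```python
from typing import Any

# Hypothetical allowlist: only these top-level fields may leave the cluster.
ALLOWED_FIELDS = {"kind", "name", "namespace", "replicas", "image", "restart_count"}


def build_snapshot(resource: dict[str, Any]) -> dict[str, Any]:
    """Keep allowlisted fields and drop everything else by default."""
    return {k: v for k, v in resource.items() if k in ALLOWED_FIELDS}


if __name__ == "__main__":
    raw = {
        "kind": "Deployment",
        "name": "checkout",
        "replicas": 3,
        "env": {"DB_PASSWORD": "example-secret"},  # never allowlisted, so never sent
    }
    print(build_snapshot(raw))
```

An allowlist fails closed: a newly introduced field is withheld until someone adds it, whereas a denylist would leak it until someone notices.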
Deployment and Setup Path
Installation is a single Helm chart deployed into any Kubernetes cluster, with no privileged access and no hard dependency on Prometheus, Istio, or OpenTelemetry. The agent registers with the control plane within approximately 60 seconds. KubeGraf also ships a CLI binary (available via Homebrew on macOS, curl on Linux, and Scoop on Windows) that launches either a web dashboard at localhost:3000 or a full-featured terminal UI suitable for SSH sessions. Authentication works with standard Kubernetes auth methods: client certificates, bearer tokens, OIDC, exec-based credential plugins, and the managed-cluster authentication used by GKE, EKS, and AKS.
Integrations and Platform Support
KubeGraf connects natively to the tools the vendor describes as the standard Kubernetes observability stack:
- Cloud providers: Amazon EKS, Azure AKS, Google GKE
- Package management: Helm
- Metrics: Prometheus
- Tracing: OpenTelemetry
- GitOps: ArgoCD, Flux
- Alerting: Slack, PagerDuty, Opsgenie, email, webhooks
- Version control: GitHub
Update: v1.0.0 Launch
The GitHub repository shows v1.0.0 was published on March 24, 2026, following a pre-launch announcement in the README that listed the release date as March 23, 2026. The repository was created in November 2025 and last pushed in March 2026. The Apache 2.0 license covers the core agent and CLI codebase. The SaaS control plane at app.kubegraf.io is a separate commercial offering layered on top of the open-source agent.
Pricing
Pro Trial
14-day free trial of the Pro plan. No credit card required.
- Full Pro plan access
- 3 clusters
- Up to 1,500 pods
- 20 deep investigations per month
- 200 normal investigations per month
Pro
AI-powered Kubernetes reliability for production teams. Self-serve signup, 14-day free trial included.
- 3 clusters
- Up to 1,500 pods
- Up to 10 team members
- 20 deep investigations per month
- 200 normal investigations per month
- 90-day data retention
- Agentic AI root cause analysis
- AI investigation chat
- Auto-remediation (SafeFix™)
- GitOps — ArgoCD & Flux sync
- SLO & burn-rate alerts
- Slack, email & webhook alerts
- PagerDuty & Opsgenie integration
- kubectl terminal & live exec
- Multi-cluster dashboards
- API access & webhooks
- Priority email support (next business day)
- Self-serve onboarding
Enterprise
Built for regulated & mission-critical Kubernetes at scale. Procurement-ready, dedicated success team, deployment flexibility.
- Unlimited clusters
- 10,000+ pods, no cap
- Unlimited team members
- Unlimited deep investigations
- Unlimited normal investigations
- Custom data retention (up to 7 years)
- SSO / SAML
- SCIM directory sync
- On-prem / air-gapped deployment
- Dedicated database & infrastructure
- Custom data residency (EU / US / APAC)
- 7-year WORM (write once, read many) audit log export
- Premium technical support (24/7, <1hr P1)
- Dedicated Customer Success Manager
- Included onboarding & training
- 99.9% uptime SLA & P1 response SLA
- Marketplace billing available
Capabilities
Key Features
- Automated incident detection (200ms anomaly detection)
- Six-agent AI root cause analysis pipeline
- SafeFix™ HMAC-SHA256 signed remediation patches
- Canary-first rollout with auto-revert on metric degradation
- One-click approval from Slack, dashboard, or CLI
- Terminal UI for SSH and low-bandwidth environments
- Web dashboard for visual monitoring
- Multi-cluster management and correlation
- GitOps integration (ArgoCD, Flux)
- kubectl terminal and live exec
- SLO and burn-rate alerting
- AI investigation chat
- Knowledge graph across logs, traces, and code
- Air-gapped and on-premises enterprise deployment
- Bring-your-own LLM endpoint support
- SSO/SAML and SCIM 2.0 (Enterprise)
- Audit logs with tamper-evident chain
- Security posture assessment
