KubeGraf
AI SRE platform for Kubernetes that detects incidents, performs automated root cause analysis, and applies HMAC-signed safe fixes in under 5 minutes.
At a Glance
Forever free plan for a single cluster with basic monitoring. No credit card required.
14-day free trial on all paid Cloud plans. No credit card required.
Updated May 2026
About KubeGraf
KubeGraf is an AI SRE platform for Kubernetes built by Orkastor, designed to detect, diagnose, and fix cluster incidents autonomously. It combines an in-cluster Go agent with a SaaS control plane and a six-agent AI pipeline to take incidents from detection to remediation in a median of 4 minutes and 21 seconds. The platform is available as both a cloud-hosted SaaS and a self-hosted binary, with an enterprise tier supporting air-gapped deployments.
What It Is
KubeGraf sits above your existing observability stack (Prometheus, Grafana, OpenTelemetry) and acts as an autonomous SRE layer. Rather than surfacing more dashboards and alerts, it reasons over telemetry, proposes evidence-backed remediations, and applies them only after human approval. The core product is an in-cluster agent (22 MB, no privileged access required, no inbound ports) that reads cluster state in read-only mode by default, normalizes it into a knowledge graph, and streams allowlisted snapshots to the control plane. Raw Secret and ConfigMap values are never transmitted.
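The allowlisting idea described above can be illustrated with a minimal sketch. This is not KubeGraf's actual agent code: the `ALLOWLIST` table, the field names, and the `snapshot` function are all hypothetical, and the real agent is written in Go. The sketch only shows the general pattern of copying named fields and dropping everything else, so that values never leave the cluster by default.

```python
from typing import Any, Dict, Optional, Set

# Hypothetical allowlist: resource kind -> fields that may leave the cluster.
# Kinds that are not listed (for example Secret) are dropped entirely.
ALLOWLIST: Dict[str, Set[str]] = {
    "Pod": {"kind", "name", "namespace", "phase", "restarts"},
    "ConfigMap": {"kind", "name", "namespace", "keys"},  # key names only, no values
}


def snapshot(resource: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return an allowlisted copy of a resource, or None if its kind is not allowed."""
    allowed = ALLOWLIST.get(resource.get("kind", ""))
    if allowed is None:
        return None
    # Copy only the allowlisted fields; anything else (such as "data") is omitted.
    return {field: value for field, value in resource.items() if field in allowed}


# Illustrative inputs (made-up values).
config_map = {
    "kind": "ConfigMap",
    "name": "app-config",
    "namespace": "prod",
    "keys": ["DB_HOST"],
    "data": {"DB_HOST": "10.0.0.5"},
}
secret = {"kind": "Secret", "name": "db-creds", "namespace": "prod", "data": {"password": "hunter2"}}
```

Using an allowlist rather than a denylist means a newly added field is withheld until someone explicitly permits it, which matches the "never transmitted" guarantee more naturally.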
The Six-Agent AI Pipeline
The AI engine — branded OrkasAI — runs six specialized agents in sequence when an incident is detected:
- Topology — maps the service graph and blast radius
- RootCause — forms and ranks hypotheses
- LogReasoner — extracts panic lines and error signatures from container logs
- TraceWalker — follows distributed traces across services
- CodeAware — identifies the PR or deploy that introduced the regression
- Remediation — drafts a YAML patch with evidence citations
According to the vendor, the pipeline delivers a root cause analysis with 94% confidence and 3 to 7 evidence citations, with a median turnaround of 4 to 7 seconds.
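As a rough illustration of what "six agents in sequence" means, the sketch below chains six placeholder stages, each of which reads the shared incident context and adds its own findings. The stage names come from the list above; everything else (the `Context` shape, the function bodies, and the sample values) is invented for illustration and says nothing about how OrkasAI is actually implemented.

```python
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]
Stage = Callable[[Context], Context]


# Placeholder stages: each returns a new context with one extra (made-up) finding.
def topology(ctx: Context) -> Context:
    return {**ctx, "blast_radius": ["checkout", "payments"]}


def root_cause(ctx: Context) -> Context:
    return {**ctx, "hypotheses": ["bad deploy", "node pressure"]}


def log_reasoner(ctx: Context) -> Context:
    return {**ctx, "error_signatures": ["panic: nil pointer dereference"]}


def trace_walker(ctx: Context) -> Context:
    return {**ctx, "slow_spans": ["checkout -> payments"]}


def code_aware(ctx: Context) -> Context:
    return {**ctx, "suspect_change": "most recent deploy of checkout"}


def remediation(ctx: Context) -> Context:
    # The final stage cites the evidence gathered by the earlier stages.
    evidence = [ctx["error_signatures"][0], ctx["suspect_change"]]
    return {**ctx, "patch": "spec:\n  replicas: 3\n", "evidence": evidence}


PIPELINE: List[Tuple[str, Stage]] = [
    ("Topology", topology),
    ("RootCause", root_cause),
    ("LogReasoner", log_reasoner),
    ("TraceWalker", trace_walker),
    ("CodeAware", code_aware),
    ("Remediation", remediation),
]


def run_pipeline(incident: str) -> Context:
    """Run every stage in order, recording which stages have executed."""
    ctx: Context = {"incident": incident, "stages_run": []}
    for name, stage in PIPELINE:
        ctx = stage(ctx)
        ctx["stages_run"] = ctx["stages_run"] + [name]
    return ctx


result = run_pipeline("CrashLoopBackOff in checkout")
```

The ordering matters in this sketch because `remediation` depends on keys written by `log_reasoner` and `code_aware`; a real system might run independent stages in parallel instead.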
SafeFix™ and the Approval Model
SafeFix™ is KubeGraf's remediation mechanism. Every patch is HMAC-SHA256 signed end-to-end, dry-run validated against policy and RBAC, and applied to a 10% canary first. If metrics degrade after the canary rollout, the fix auto-reverts within 30 seconds. Approval can be triggered from Slack, the web dashboard, or the CLI, and the engineer sees the exact YAML diff and the evidence chain before approving. The platform positions this as "the right amount of human in the loop" compared to fully autonomous auto-remediation tools.
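For readers unfamiliar with HMAC signing, here is a minimal sketch of signing a patch and verifying it before it is applied, using only Python's standard `hmac` and `hashlib` modules. The function names, the demo key, and the sample patch are hypothetical, and KubeGraf's actual wire format and key handling are not documented here. The `canary_size` helper further assumes that "10% canary" means 10% of replicas, rounded down with a minimum of one, which the source does not specify.

```python
import hashlib
import hmac


def sign_patch(patch_yaml: str, key: bytes) -> str:
    """Return the hex-encoded HMAC-SHA256 of the patch under the shared key."""
    return hmac.new(key, patch_yaml.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_patch(patch_yaml: str, signature: str, key: bytes) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_patch(patch_yaml, key)
    return hmac.compare_digest(expected, signature)


def canary_size(total_replicas: int, percent: int = 10) -> int:
    """Replicas in the canary slice (assumption: floor of percent, at least one)."""
    if total_replicas <= 0:
        return 0
    # Integer arithmetic avoids floating-point rounding surprises.
    return max(1, total_replicas * percent // 100)


# Illustrative values only; a real key would come from a secret store or KMS.
demo_key = b"demo-shared-secret"
demo_patch = "spec:\n  replicas: 3\n"
demo_signature = sign_patch(demo_patch, demo_key)
```

Verification fails if either the patch body or the key differs from what was signed, so a patch altered between approval and application would be rejected. `hmac.compare_digest` is used instead of `==` to avoid leaking information through comparison timing.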
Deployment Model and Security
KubeGraf ships in two modes, with a separate Enterprise tier. The Cloud product deploys an agent into the cluster via Helm; the agent reports to app.kubegraf.io, which provides a hosted dashboard, incident management, SLOs, and multi-cluster federation. The Local product is a self-hosted binary where all data stays on the operator's infrastructure. The Enterprise tier ships the full control plane as a Helm chart for on-prem or air-gapped clusters, with tokens and HMAC secrets stored in the customer's KMS. The LLM provider is configurable: Anthropic Claude and OpenAI ship by default, AWS Bedrock and Azure OpenAI are also available, and enterprise customers can point at a private model endpoint.
Integrations and Platform Support
KubeGraf connects natively to Amazon EKS, Azure AKS, Google GKE, Helm, Prometheus, GitHub, OpenTelemetry, Slack, and ArgoCD. GitOps sync via ArgoCD and Flux is available on the Pro tier and above. PagerDuty and Opsgenie integrations are available on the Business tier and above. The terminal UI and web dashboard provide two interaction surfaces, and the CLI supports one-click approval workflows.
Current Status
KubeGraf is actively developed by Orkastor and has been available for production use as of late 2025. The pricing page draws a comparison with Komodor's 2024 removal of its free tier, and the comparison table references publicly available product information dated December 2025. Documentation is versioned under docs-next, signaling active iteration. The platform offers a 14-day free trial on all paid Cloud plans with no credit card required.
Pricing
Free
Forever free plan for a single cluster with basic monitoring. No credit card required.
- 1 cluster · 5 nodes · 3 members
- Cluster health & pod metrics
- 7-day metric retention
- kubectl terminal
- Community support
14-Day Free Trial
14-day free trial on all paid Cloud plans. No credit card required.
- Full access to selected paid plan features
- No credit card required
- Cancel anytime
Starter
Incident detection, Slack alerts, and 30-day retention for small teams.
- 3 clusters · 30 nodes · 10 members
- Everything in Free
- Incident detection & timeline
- Slack, email & webhook alerts
- 30-day metric retention
- Email support
Starter (Annual)
Incident detection, Slack alerts, and 30-day retention — billed annually.
- 3 clusters · 30 nodes · 10 members
- Everything in Free
- Incident detection & timeline
- Slack, email & webhook alerts
- 30-day metric retention
- Email support
Pro
AI root cause analysis, SLOs, auto-remediation, and GitOps for growing teams.
- 10 clusters · unlimited nodes · 25 members
- Everything in Starter
- AI root cause analysis
- Auto-remediation (SafeFix™)
- SLO monitoring & burn-rate alerts
- GitOps — ArgoCD / Flux sync
- 90-day metric retention
- Priority support
Pro (Annual)
AI root cause analysis, SLOs, auto-remediation, and GitOps — billed annually.
- 10 clusters · unlimited nodes · 25 members
- Everything in Starter
- AI root cause analysis
- Auto-remediation (SafeFix™)
- SLO monitoring & burn-rate alerts
- GitOps — ArgoCD / Flux sync
- 90-day metric retention
- Priority support
Business
Multi-cluster federation, SCIM, PagerDuty, and unlimited scale.
- Unlimited clusters · members · workspaces
- Everything in Pro
- Multi-cluster federation dashboard
- SCIM / directory sync
- PagerDuty & Opsgenie integration
- 1-year metric retention
- Audit logs & compliance reports
- Dedicated Slack channel support
Business (Annual)
Multi-cluster federation, SCIM, PagerDuty, and unlimited scale — billed annually.
- Unlimited clusters · members · workspaces
- Everything in Pro
- Multi-cluster federation dashboard
- SCIM / directory sync
- PagerDuty & Opsgenie integration
- 1-year metric retention
- Audit logs & compliance reports
- Dedicated Slack channel support
Enterprise
Dedicated infra, SSO/SAML, custom SLA, on-prem and air-gapped deployment. Custom pricing.
- Everything in Business
- Dedicated database & infra
- Single Sign-On (SSO / SAML)
- Custom data retention & residency
- 99.9% uptime SLA
- On-prem / air-gapped deployment
- Dedicated success manager
- Private LLM / BYOK
- 1-hour P1 SLA + TAM
Capabilities
Key Features
- Automated root cause analysis with 6-agent AI pipeline
- SafeFix™ HMAC-SHA256 signed patch generation
- Canary-first rollout with auto-revert on metric degradation
- 200ms anomaly detection scanning 100 signals per second
- One-click approval from Slack, dashboard, or CLI
- Knowledge graph across logs, traces, deploys, and code
- SLO monitoring and burn-rate alerts
- Multi-cluster federation dashboard
- Air-gapped and on-prem deployment support
- Terminal UI and web dashboard
- kubectl terminal access
- Automated post-mortem generation
- GitOps integration with ArgoCD and Flux
- SCIM/directory sync
- Audit logs and compliance reports
- Configurable LLM provider (Claude, OpenAI, Bedrock, Azure OpenAI)
- Local-first mode with zero data exfiltration
- RBAC-aware dry-run validation before fix application
