NVIDIA Dynamo
An open-source, datacenter-scale distributed inference serving framework that orchestrates SGLang, TensorRT-LLM, and vLLM across multi-GPU clusters with KV-aware routing, disaggregated serving, and automatic scaling.
At a Glance
Fully open-source under Apache 2.0 license, free to use, modify, and distribute.
Engagement
Available On
Alternatives
Listed Jul 2026
About NVIDIA Dynamo
NVIDIA Dynamo is an open-source inference orchestration framework built by NVIDIA for datacenter-scale LLM serving. It sits above individual inference engines — SGLang, TensorRT-LLM, and vLLM — and turns a cluster of GPUs into a coordinated, high-throughput inference system. The project is licensed under Apache 2.0, written primarily in Rust for performance with Python for extensibility, and is actively developed at github.com/ai-dynamo/dynamo.
What It Is
Dynamo is the orchestration layer above inference engines, not a replacement for them. Where a single inference engine optimizes one GPU or node, Dynamo coordinates many nodes together. It handles disaggregated prefill/decode, intelligent KV-aware request routing, multi-tier KV cache management, SLA-driven autoscaling, and fast cold-start weight streaming. The result is a system that can serve LLM, reasoning, multimodal, and video generation workloads at datacenter scale with an OpenAI-compatible API.
Core Architecture and Capabilities
Dynamo's architecture is built around several composable components:
- Disaggregated Prefill/Decode: Separates prefill and decode into independently scalable GPU pools, letting each phase run on hardware tuned for its workload.
- KV-Aware Router: Routes requests based on worker load and KV cache overlap to eliminate redundant prefill computation.
- KV Block Manager (KVBM): Offloads KV cache across GPU → CPU → SSD → remote storage (S3/Azure blob), extending effective context length beyond GPU memory.
- ModelExpress: Streams model weights GPU-to-GPU via NIXL/NVLink for fast cold-start on new replicas.
- Planner: An SLA-driven autoscaler that profiles workloads and right-sizes GPU pools to meet latency targets at minimum TCO.
- Grove: A Kubernetes operator for topology-aware gang scheduling across racks, hosts, and NUMA nodes.
- AIConfigurator: Simulates thousands of deployment configurations to find the optimal serving topology without burning GPU-hours.
- Fault Tolerance: Canary health checks and in-flight request migration so worker failures don't surface to users.
Deployment Model and Setup Paths
Dynamo supports three primary deployment paths:
- Container (fastest): Pull a prebuilt container from NGC (
nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.1,tensorrtllm-runtime:1.2.1, orvllm-runtime:1.2.1) and start a frontend and worker in minutes. - PyPI install: Install via
uv pip install "ai-dynamo[sglang]"or the vLLM variant for local development without containers. - Kubernetes (recommended for production): Install the Dynamo Platform operator and deploy with a single YAML manifest using the
DynamoGraphDeploymentRequestCRD. Supports AWS EKS, Google GKE, and Azure AKS with cloud-specific guides.
For Kubernetes, Dynamo exposes two request routing topologies: a Dynamo-native frontend path (client → Frontend → Router → workers) and a Gateway API path using the Kubernetes Gateway API Inference Extension (GAIE) for platforms that standardize on Gateway API.
Backend Support and Integrations
Dynamo is backend-agnostic. All three supported backends — SGLang, TensorRT-LLM, and vLLM — support disaggregated serving, KV-aware routing, the SLA-based Planner, multimodal workloads, and tool calling. KVBM support is available for TensorRT-LLM and vLLM, with SGLang support in progress. KV cache integrations include HiCache, LMCache, and FlexKV. The framework also integrates with LangChain and the NVIDIA NeMo Agent Toolkit for agentic workloads.
Update: Dynamo v1.2.1
The latest release is v1.2.1, published June 13, 2026. Version 1.0 introduced zero-config Kubernetes deployment via the DynamoGraphDeploymentRequest (DGDR) CRD, agentic inference features (per-request priority hints, session metadata, SGLang subagent KV isolation), multimodal encode/prefill/decode disaggregation with embedding cache, native FastVideo and SGLang Diffusion support for video generation, and storage-tier KV offload with S3/Azure blob. The 1.2.x series adds a Tool Calling Probe Snapshot, the Fastokens Tokenizer, and continued Kubernetes platform improvements including topology-aware KV transfer and shadow engine failover. The GitHub repository reports over 70 community contributors and an active biweekly office hours program.
Community Discussions
Be the first to start a conversation about NVIDIA Dynamo
Share your experience with NVIDIA Dynamo, ask questions, or help others learn from your insights.
Pricing
Open Source
Fully open-source under Apache 2.0 license, free to use, modify, and distribute.
- Disaggregated prefill/decode serving
- KV-aware routing
- SLA-driven autoscaling Planner
- Kubernetes-native deployment
- SGLang, TensorRT-LLM, and vLLM backends
Capabilities
Key Features
- Disaggregated prefill/decode serving
- KV-aware request routing
- KV Block Manager (KVBM) with multi-tier offloading
- SLA-driven autoscaling Planner
- ModelExpress fast weight streaming
- OpenAI-compatible API frontend
- Kubernetes-native deployment with CRD operator
- Gateway API Inference Extension (GAIE) support
- Multimodal encode/prefill/decode disaggregation
- Video generation support (FastVideo, SGLang Diffusion)
- LoRA adapter support
- Tool calling and reasoning parsing
- Fault tolerance with in-flight request migration
- Inference simulation with DynoSim
- Topology-aware gang scheduling (Grove)
- AIConfigurator deployment optimizer
- Prometheus + Grafana observability
- Distributed tracing and health checks
- Multi-node Kubernetes deployments
- Autoscaling with rolling updates
