# NVIDIA Dynamo

> An open-source, datacenter-scale distributed inference serving framework that orchestrates SGLang, TensorRT-LLM, and vLLM across multi-GPU clusters with KV-aware routing, disaggregated serving, and automatic scaling.

NVIDIA Dynamo is an open-source inference orchestration framework built by NVIDIA for datacenter-scale LLM serving. It sits above individual inference engines — SGLang, TensorRT-LLM, and vLLM — and turns a cluster of GPUs into a coordinated, high-throughput inference system. The project is licensed under Apache 2.0, written primarily in Rust for performance with Python for extensibility, and is actively developed at github.com/ai-dynamo/dynamo.

## What It Is

Dynamo is the orchestration layer above inference engines, not a replacement for them. Where a single inference engine optimizes one GPU or node, Dynamo coordinates many nodes together. It handles disaggregated prefill/decode, intelligent KV-aware request routing, multi-tier KV cache management, SLA-driven autoscaling, and fast cold-start weight streaming. The result is a system that can serve LLM, reasoning, multimodal, and video generation workloads at datacenter scale with an OpenAI-compatible API.

## Core Architecture and Capabilities

Dynamo's architecture is built around several composable components:

- **Disaggregated Prefill/Decode:** Separates prefill and decode into independently scalable GPU pools, letting each phase run on hardware tuned for its workload.
- **KV-Aware Router:** Routes requests based on worker load and KV cache overlap to eliminate redundant prefill computation.
- **KV Block Manager (KVBM):** Offloads KV cache across GPU → CPU → SSD → remote storage (S3/Azure blob), extending effective context length beyond GPU memory.
- **ModelExpress:** Streams model weights GPU-to-GPU via NIXL/NVLink for fast cold-start on new replicas.
- **Planner:** An SLA-driven autoscaler that profiles workloads and right-sizes GPU pools to meet latency targets at minimum TCO.
- **Grove:** A Kubernetes operator for topology-aware gang scheduling across racks, hosts, and NUMA nodes.
- **AIConfigurator:** Simulates thousands of deployment configurations to find the optimal serving topology without burning GPU-hours.
- **Fault Tolerance:** Canary health checks and in-flight request migration so worker failures don't surface to users.

## Deployment Model and Setup Paths

Dynamo supports three primary deployment paths:

- **Container (fastest):** Pull a prebuilt container from NGC (`nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.1`, `tensorrtllm-runtime:1.2.1`, or `vllm-runtime:1.2.1`) and start a frontend and worker in minutes.
- **PyPI install:** Install via `uv pip install "ai-dynamo[sglang]"` or the vLLM variant for local development without containers.
- **Kubernetes (recommended for production):** Install the Dynamo Platform operator and deploy with a single YAML manifest using the `DynamoGraphDeploymentRequest` CRD. Supports AWS EKS, Google GKE, and Azure AKS with cloud-specific guides.

For Kubernetes, Dynamo exposes two request routing topologies: a Dynamo-native frontend path (`client → Frontend → Router → workers`) and a Gateway API path using the Kubernetes Gateway API Inference Extension (GAIE) for platforms that standardize on Gateway API.

## Backend Support and Integrations

Dynamo is backend-agnostic. All three supported backends — SGLang, TensorRT-LLM, and vLLM — support disaggregated serving, KV-aware routing, the SLA-based Planner, multimodal workloads, and tool calling. KVBM support is available for TensorRT-LLM and vLLM, with SGLang support in progress. KV cache integrations include HiCache, LMCache, and FlexKV. The framework also integrates with LangChain and the NVIDIA NeMo Agent Toolkit for agentic workloads.

## Update: Dynamo v1.2.1

The latest release is v1.2.1, published June 13, 2026. Version 1.0 introduced zero-config Kubernetes deployment via the `DynamoGraphDeploymentRequest` (DGDR) CRD, agentic inference features (per-request priority hints, session metadata, SGLang subagent KV isolation), multimodal encode/prefill/decode disaggregation with embedding cache, native FastVideo and SGLang Diffusion support for video generation, and storage-tier KV offload with S3/Azure blob. The 1.2.x series adds a Tool Calling Probe Snapshot, the Fastokens Tokenizer, and continued Kubernetes platform improvements including topology-aware KV transfer and shadow engine failover. The GitHub repository reports over 70 community contributors and an active biweekly office hours program.

## Features
- Disaggregated prefill/decode serving
- KV-aware request routing
- KV Block Manager (KVBM) with multi-tier offloading
- SLA-driven autoscaling Planner
- ModelExpress fast weight streaming
- OpenAI-compatible API frontend
- Kubernetes-native deployment with CRD operator
- Gateway API Inference Extension (GAIE) support
- Multimodal encode/prefill/decode disaggregation
- Video generation support (FastVideo, SGLang Diffusion)
- LoRA adapter support
- Tool calling and reasoning parsing
- Fault tolerance with in-flight request migration
- Inference simulation with DynoSim
- Topology-aware gang scheduling (Grove)
- AIConfigurator deployment optimizer
- Prometheus + Grafana observability
- Distributed tracing and health checks
- Multi-node Kubernetes deployments
- Autoscaling with rolling updates

## Integrations
SGLang, TensorRT-LLM, vLLM, Kubernetes, AWS EKS, Google GKE, Azure AKS, Amazon ECS, LangChain, NVIDIA NeMo Agent Toolkit, HiCache, LMCache, FlexKV, Prometheus, Grafana, etcd, NATS JetStream, Hugging Face, NVIDIA NGC, Docker

## Platforms
LINUX, API, VSC_EXTENSION, CLI

## Pricing
Open Source

## Version
v1.2.1

## Links
- Website: https://docs.nvidia.com/dynamo/latest
- Documentation: https://docs.nvidia.com/dynamo/latest
- Repository: https://github.com/ai-dynamo/dynamo
- EveryDev.ai: https://www.everydev.ai/tools/nvidia-dynamo