NVIDIA Dynamo

Name: NVIDIA Dynamo
Availability: OnlineOnly
Author: NVIDIA

An open-source, datacenter-scale distributed inference serving framework that orchestrates SGLang, TensorRT-LLM, and vLLM across multi-GPU clusters with KV-aware routing, disaggregated serving, and automatic scaling.

Visit Website

At a Glance

Pricing

Open Source

Fully open-source under Apache 2.0 license, free to use, modify, and distribute.

Engagement

Available On

Linux

API

VS Code

CLI

NVIDIASanta Clara, CAEst. 1993$55.6B raised

Listed Jul 2026

About NVIDIA Dynamo

NVIDIA Dynamo is an open-source inference orchestration framework built by NVIDIA for datacenter-scale LLM serving. It sits above individual inference engines — SGLang, TensorRT-LLM, and vLLM — and turns a cluster of GPUs into a coordinated, high-throughput inference system. The project is licensed under Apache 2.0, written primarily in Rust for performance with Python for extensibility, and is actively developed at github.com/ai-dynamo/dynamo.

What It Is

Dynamo is the orchestration layer above inference engines, not a replacement for them. Where a single inference engine optimizes one GPU or node, Dynamo coordinates many nodes together. It handles disaggregated prefill/decode, intelligent KV-aware request routing, multi-tier KV cache management, SLA-driven autoscaling, and fast cold-start weight streaming. The result is a system that can serve LLM, reasoning, multimodal, and video generation workloads at datacenter scale with an OpenAI-compatible API.

Core Architecture and Capabilities

Dynamo's architecture is built around several composable components:

Disaggregated Prefill/Decode: Separates prefill and decode into independently scalable GPU pools, letting each phase run on hardware tuned for its workload.
KV-Aware Router: Routes requests based on worker load and KV cache overlap to eliminate redundant prefill computation.
KV Block Manager (KVBM): Offloads KV cache across GPU → CPU → SSD → remote storage (S3/Azure blob), extending effective context length beyond GPU memory.
ModelExpress: Streams model weights GPU-to-GPU via NIXL/NVLink for fast cold-start on new replicas.
Planner: An SLA-driven autoscaler that profiles workloads and right-sizes GPU pools to meet latency targets at minimum TCO.
Grove: A Kubernetes operator for topology-aware gang scheduling across racks, hosts, and NUMA nodes.
AIConfigurator: Simulates thousands of deployment configurations to find the optimal serving topology without burning GPU-hours.
Fault Tolerance: Canary health checks and in-flight request migration so worker failures don't surface to users.

Deployment Model and Setup Paths

Dynamo supports three primary deployment paths:

Container (fastest): Pull a prebuilt container from NGC (nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.1, tensorrtllm-runtime:1.2.1, or vllm-runtime:1.2.1) and start a frontend and worker in minutes.
PyPI install: Install via uv pip install "ai-dynamo[sglang]" or the vLLM variant for local development without containers.
Kubernetes (recommended for production): Install the Dynamo Platform operator and deploy with a single YAML manifest using the DynamoGraphDeploymentRequest CRD. Supports AWS EKS, Google GKE, and Azure AKS with cloud-specific guides.

For Kubernetes, Dynamo exposes two request routing topologies: a Dynamo-native frontend path (client → Frontend → Router → workers) and a Gateway API path using the Kubernetes Gateway API Inference Extension (GAIE) for platforms that standardize on Gateway API.

Backend Support and Integrations

Dynamo is backend-agnostic. All three supported backends — SGLang, TensorRT-LLM, and vLLM — support disaggregated serving, KV-aware routing, the SLA-based Planner, multimodal workloads, and tool calling. KVBM support is available for TensorRT-LLM and vLLM, with SGLang support in progress. KV cache integrations include HiCache, LMCache, and FlexKV. The framework also integrates with LangChain and the NVIDIA NeMo Agent Toolkit for agentic workloads.

Update: Dynamo v1.2.1

The latest release is v1.2.1, published June 13, 2026. Version 1.0 introduced zero-config Kubernetes deployment via the DynamoGraphDeploymentRequest (DGDR) CRD, agentic inference features (per-request priority hints, session metadata, SGLang subagent KV isolation), multimodal encode/prefill/decode disaggregation with embedding cache, native FastVideo and SGLang Diffusion support for video generation, and storage-tier KV offload with S3/Azure blob. The 1.2.x series adds a Tool Calling Probe Snapshot, the Fastokens Tokenizer, and continued Kubernetes platform improvements including topology-aware KV transfer and shadow engine failover. The GitHub repository reports over 70 community contributors and an active biweekly office hours program.

Community Discussions

Be the first to start a conversation about NVIDIA Dynamo

Share your experience with NVIDIA Dynamo, ask questions, or help others learn from your insights.

Pricing

OPEN SOURCE

Open Source

Fully open-source under Apache 2.0 license, free to use, modify, and distribute.

Disaggregated prefill/decode serving
KV-aware routing
SLA-driven autoscaling Planner
Kubernetes-native deployment
SGLang, TensorRT-LLM, and vLLM backends

Capabilities

Key Features

Disaggregated prefill/decode serving
KV-aware request routing
KV Block Manager (KVBM) with multi-tier offloading
SLA-driven autoscaling Planner
ModelExpress fast weight streaming
OpenAI-compatible API frontend
Kubernetes-native deployment with CRD operator
Gateway API Inference Extension (GAIE) support
Multimodal encode/prefill/decode disaggregation
Video generation support (FastVideo, SGLang Diffusion)
LoRA adapter support
Tool calling and reasoning parsing
Fault tolerance with in-flight request migration
Inference simulation with DynoSim
Topology-aware gang scheduling (Grove)
AIConfigurator deployment optimizer
Prometheus + Grafana observability
Distributed tracing and health checks
Multi-node Kubernetes deployments
Autoscaling with rolling updates

Integrations

SGLang

TensorRT-LLM

vLLM

Kubernetes

AWS EKS

Google GKE

Azure AKS

Amazon ECS

LangChain

NVIDIA NeMo Agent Toolkit

HiCache

LMCache

FlexKV

Prometheus

Grafana

etcd

NATS JetStream

Hugging Face

NVIDIA NGC

Docker

API Available

View Docs

Back to all tools Suggest an edit