# tiny-vllm

> A hands-on course and full source code for building a high-performance LLM inference engine in C++ and CUDA, implementing features like KV cache, PagedAttention, and continuous batching.

tiny-vllm is an open-source educational project by Jędrzej Maczan that teaches you how to build a high-performance LLM inference engine from scratch using C++ and CUDA. It serves as both a complete, working inference server for Llama 3.2 1B Instruct and a structured course that walks through every implementation decision, from floating-point number formats to PagedAttention CUDA kernels.

## What It Is

tiny-vllm describes itself as a "younger and smaller sibling of vLLM": a learning tool that pairs the full source code of a production-capable inference server with a detailed written course explaining how each component works. The project targets engineers and students who want to understand LLM inference at the systems level rather than only use a high-level Python API. It is licensed under the Apache License 2.0 and hosted on GitHub, where it has accumulated over 576 stars and 25 forks since its creation in February 2026.

## What the Engine Implements

The inference engine covers the full stack of a modern LLM serving system:

- **Model loading**: Reads weights from Safetensors format (Llama 3.2 1B Instruct)
- **Full forward pass**: Prefill and decode phases with all CUDA kernels
- **Attention**: Grouped-query attention (GQA), causal masking, scaled dot-product attention
- **Normalization and activations**: RMSNorm with parallel reduction, RoPE positional encoding, SiLU activation
- **Batching**: Both static batching and continuous batching
- **Memory optimization**: KV cache, buffer reuse, PagedAttention with paged KV cache
- **Performance**: FlashAttention-like online softmax, cuBLAS matrix multiplication via `cublasGemmEx`
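
To make the online softmax item above concrete, here is a minimal CPU-side sketch in C++ of the single-pass idea behind FlashAttention-like attention. It is not taken from the tiny-vllm source: the function name `online_attention_row`, its parameters, and the flat row-major layout of `values` are assumptions made for illustration, and the engine itself performs this work in CUDA kernels. The sketch handles one query row and assumes all scores are finite.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: computes softmax(scores) * V for a single query row in
// one pass, without materializing the softmax weights.
// `values` is assumed to be a flat row-major [scores.size(), head_dim] array.
std::vector<float> online_attention_row(const std::vector<float>& scores,
                                        const std::vector<float>& values,
                                        std::size_t head_dim) {
    assert(!scores.empty());
    assert(values.size() == scores.size() * head_dim);

    float running_max = -std::numeric_limits<float>::infinity();
    float running_sum = 0.0f;                // sum of exp(score - running_max)
    std::vector<float> acc(head_dim, 0.0f);  // unnormalized weighted sum of value rows

    for (std::size_t i = 0; i < scores.size(); ++i) {
        const float new_max = std::fmax(running_max, scores[i]);
        // Factor that re-expresses everything accumulated so far relative to new_max.
        const float scale = std::exp(running_max - new_max);
        const float weight = std::exp(scores[i] - new_max);

        running_sum = running_sum * scale + weight;
        for (std::size_t d = 0; d < head_dim; ++d) {
            acc[d] = acc[d] * scale + weight * values[i * head_dim + d];
        }
        running_max = new_max;
    }

    // Normalize once at the end instead of once per score.
    for (float& x : acc) x /= running_sum;
    return acc;
}
```

Because every exponent is taken relative to the running maximum, large scores such as 1000 and 1001 do not overflow, and the scores only need to be visited once. Calling `online_attention_row({0.0f, 0.0f}, {1.0f, 3.0f}, 1)` averages the two value rows equally.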

## Course Structure and Learning Path

The repository doubles as a written course with chapters covering every major concept needed to build the engine. Topics include how floating-point and bfloat16 numbers work, GPU versus CPU memory management with CUDA, writing a first CUDA kernel for embedding gather, parallel reduction for RMSNorm, the column-major to row-major transposition trick for cuBLAS, and the distinction between the prefill and decode phases. The author explicitly frames the course as a just-in-time (JIT) learning resource: readers are encouraged to code alongside the text, make mistakes, and use LLMs as a personalized tutor to fill gaps.
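
The transposition trick mentioned above can be sketched on the CPU. cuBLAS expects column-major matrices, while C++ code usually stores them row-major. A row-major matrix read as column-major is its transpose, so computing `C = A * B` in row-major is the same as asking a column-major routine for `C^T = B^T * A^T`, which amounts to swapping the operands and dimensions. In the sketch below, `gemm_col_major` is a hypothetical stand-in for a column-major GEMM, not the cuBLAS API, and `matmul_row_major` is an illustrative name that does not come from the project.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a column-major GEMM (the storage convention cuBLAS uses):
// C (m x n) = A (m x k) * B (k x n), with leading dimensions lda, ldb, ldc.
void gemm_col_major(std::size_t m, std::size_t n, std::size_t k,
                    const float* a, std::size_t lda,
                    const float* b, std::size_t ldb,
                    float* c, std::size_t ldc) {
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            float sum = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                sum += a[i + p * lda] * b[p + j * ldb];
            }
            c[i + j * ldc] = sum;
        }
    }
}

// Row-major C (m x n) = A (m x k) * B (k x n), computed with the column-major
// routine by swapping the operands: the same memory holds C^T = B^T * A^T.
std::vector<float> matmul_row_major(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k);
    assert(b.size() == k * n);
    std::vector<float> c(m * n);
    // B goes first with leading dimension n; A goes second with leading dimension k.
    gemm_col_major(n, m, k, b.data(), n, a.data(), k, c.data(), n);
    return c;
}
```

No data is copied or physically transposed: only the argument order and the dimensions change, which is why the trick costs nothing at runtime.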

## System Requirements and Setup

The project targets NVIDIA GPU hardware and requires:

- Linux (developed and tested on Linux 6.19.8 x86_64)
- CUDA Toolkit (13.1)
- C++17 with GCC (15.2.1)
- NVIDIA GPU (developed on an RTX 5090; other CUDA-capable GPUs are expected to work with minor path adjustments)
- `nlohmann/json` 3.12.0, the only external dependency (a single header file, included in the repository)
- Llama 3.2 1B Instruct weights in Safetensors format from Hugging Face

Building and running are handled by a single `./test.sh` script. The author notes that the paths for CUDA, GCC, and NVCC may need adjusting depending on the host machine.

## Audience and Use Cases

tiny-vllm is explicitly positioned for two audiences: individual learners on a self-directed AI/ML path, and lecturers who want a teaching resource for university courses on GPU programming or LLM systems. The course references and recommends complementary resources including Andrej Karpathy's nanoGPT and llm.c repositories, George Hotz's tinygrad, and the GPU MODE Discord community. The author also notes that a future course on ML compilers or alternative attention mechanisms may follow if there is sufficient interest.

## Current Status

The repository was created in February 2026 and last updated in May 2026. Core features including PagedAttention, continuous batching, and online softmax are marked as implemented. Several sections of the written course (notably the Attention, GQA, Paged Attention, and Online Softmax chapters) are marked as TODO: the code is present, but the prose explanations are still in progress. The project is actively maintained and had one open issue at the time of data collection.

## Features
- Full LLM forward pass (prefill + decode)
- Load Llama 3.2 1B Instruct from Safetensors
- All computation with CUDA kernels
- KV cache
- Static batching
- Continuous batching
- Online softmax / FlashAttention-like attention
- PagedAttention with paged KV cache
- Grouped-query attention (GQA)
- RMSNorm with parallel reduction
- RoPE positional encoding
- SiLU activation function
- cuBLAS matrix multiplication via cublasGemmEx
- Causal masking
- Buffer reuse for memory optimization
- Written course with step-by-step implementation guide
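
As a rough illustration of the paged KV cache listed above, the following C++ sketch shows the bookkeeping idea: the cache is split into fixed-size blocks, each sequence keeps a block table that maps logical positions to physical blocks, and blocks return to a shared pool when a sequence finishes. The names `Sequence` and `BlockAllocator` and all of their members are hypothetical and do not come from the tiny-vllm source. The sketch tracks slot indices only and stores no key or value tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-sequence state for a paged KV cache.
struct Sequence {
    std::vector<std::size_t> block_table;  // logical block index -> physical block id
    std::size_t length = 0;                // number of tokens stored so far
};

// Hypothetical allocator that hands out fixed-size physical blocks from a shared pool.
class BlockAllocator {
public:
    BlockAllocator(std::size_t num_blocks, std::size_t block_size)
        : block_size_(block_size) {
        assert(block_size > 0);
        // Push ids in reverse so that block 0 is handed out first.
        for (std::size_t b = num_blocks; b > 0; --b) free_blocks_.push_back(b - 1);
    }

    // Reserves room for one more token; returns false if the pool is exhausted.
    bool append_token(Sequence& seq) {
        if (seq.length == seq.block_table.size() * block_size_) {
            if (free_blocks_.empty()) return false;
            seq.block_table.push_back(free_blocks_.back());
            free_blocks_.pop_back();
        }
        ++seq.length;
        return true;
    }

    // Translates a logical token position into a physical slot index.
    std::size_t slot(const Sequence& seq, std::size_t pos) const {
        assert(pos < seq.length);
        return seq.block_table[pos / block_size_] * block_size_ + pos % block_size_;
    }

    // Returns all blocks of a finished sequence to the pool.
    void release(Sequence& seq) {
        for (std::size_t b : seq.block_table) free_blocks_.push_back(b);
        seq.block_table.clear();
        seq.length = 0;
    }

    std::size_t free_count() const { return free_blocks_.size(); }

private:
    std::size_t block_size_;
    std::vector<std::size_t> free_blocks_;
};
```

Because a sequence only claims a new block when its last one is full, memory is reserved in small increments rather than for the maximum sequence length up front, and releasing a finished sequence makes its blocks immediately available to others. This is the property that lets paging combine well with continuous batching.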

## Integrations
CUDA Toolkit, cuBLAS, Safetensors, Hugging Face (model weights), nlohmann/json, GCC, CMake

## Platforms
LINUX, API, VSC_EXTENSION, CLI

## Pricing
Open Source

## Version
main

## Links
- Website: https://github.com/jmaczan/tiny-vllm
- Documentation: https://github.com/jmaczan/tiny-vllm
- Repository: https://github.com/jmaczan/tiny-vllm
- EveryDev.ai: https://www.everydev.ai/tools/tiny-vllm
