tiny-vllm

Name: tiny-vllm
Availability: OnlineOnly
Author: Jędrzej Maczan

A hands-on course and full source code for building a high-performance LLM inference engine in C++ and CUDA, implementing features like KV cache, PagedAttention, and continuous batching.

At a Glance

Pricing

Open Source

Fully free and open-source under Apache License 2.0. Full source code and course content available on GitHub.

Available On

Linux

API

VS Code

CLI

Jędrzej Maczan, Wrocław, Poland (Est. 2024)

Listed May 2026

About tiny-vllm

tiny-vllm is an open-source educational project by Jędrzej Maczan that teaches you how to build a high-performance LLM inference engine from scratch using C++ and CUDA. It serves as both a complete, working inference server for Llama 3.2 1B Instruct and a structured course that walks through every implementation decision, from floating-point number formats to PagedAttention CUDA kernels.

What It Is

tiny-vllm describes itself as a "younger and smaller sibling of vLLM": a learning tool that pairs the full source code of a production-capable inference server with a detailed written course explaining how each component works. The project targets engineers and students who want to understand LLM inference at the systems level, not just use a high-level Python API. It is licensed under Apache License 2.0 and hosted on GitHub, where it has accumulated 576 stars and 25 forks since its creation in February 2026.

What the Engine Implements

The inference engine covers the full stack of a modern LLM serving system:

Model loading: Reads weights from Safetensors format (Llama 3.2 1B Instruct)
Full forward pass: Prefill and decode phases with all CUDA kernels
Attention: Grouped-query attention (GQA), causal masking, scaled dot-product attention
Normalization and activations: RMSNorm with parallel reduction, RoPE positional encoding, SiLU activation
Batching: Both static batching and continuous batching
Memory optimization: KV cache, buffer reuse, PagedAttention with paged KV cache
Performance: FlashAttention-like online softmax, cuBLAS matrix multiplication via cublasGemmEx
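
As an illustration of the online softmax idea named in the list above, here is a minimal CPU sketch in C++. It is not taken from the tiny-vllm source: the function name online_softmax and its interface are hypothetical, and the real engine does this work inside a CUDA attention kernel. The sketch assumes a non-empty input with finite scores. It shows the core trick only: the running maximum and the running normalizer are updated together in one sweep, with the old sum rescaled whenever the maximum grows.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// CPU reference for online softmax (hypothetical helper, not tiny-vllm code).
// Assumption: `scores` is non-empty and every score is finite.
std::vector<float> online_softmax(const std::vector<float>& scores) {
    assert(!scores.empty());

    float running_max = -std::numeric_limits<float>::infinity();
    float running_sum = 0.0f;

    // Sweep 1: update the maximum and the normalizer together.
    for (float s : scores) {
        const float new_max = std::max(running_max, s);
        // Rescale the old sum to the new maximum, then add the new term.
        running_sum = running_sum * std::exp(running_max - new_max)
                    + std::exp(s - new_max);
        running_max = new_max;
    }

    // Sweep 2: normalize with the final statistics.
    std::vector<float> probs(scores.size());
    for (std::size_t i = 0; i < scores.size(); ++i) {
        probs[i] = std::exp(scores[i] - running_max) / running_sum;
    }
    return probs;
}
```

Because every exponent is taken relative to the current maximum, large scores such as 1000 and 1001 do not overflow. In a FlashAttention-style kernel the same rescaling is typically applied to the partial output accumulator as well, so the scores never need to be stored in full.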

Course Structure and Learning Path

The repository doubles as a written course with chapters covering every major concept needed to build the engine. Topics include how floating-point and bfloat16 numbers work, GPU vs CPU memory management with CUDA, writing your first CUDA kernel for embedding gather, parallel reduction for RMSNorm, the column-major to row-major transposition trick for cuBLAS, and the distinction between prefill and decode phases. The author explicitly frames the course as a JIT (just-in-time) learning resource — readers are encouraged to code alongside the text, make mistakes, and use LLMs as a personalized tutor to fill gaps.
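
To make the bfloat16 topic concrete, the following C++ sketch converts between float32 and bfloat16 on the CPU. It is an illustration under stated assumptions, not the course's code: the function names float_to_bf16 and bf16_to_float are hypothetical, NaN inputs are deliberately not handled, and the rounding mode chosen here is round-to-nearest-even. The key fact it shows is that bfloat16 keeps the sign bit, all 8 exponent bits, and the top 7 mantissa bits of a float32, so the conversion amounts to keeping the upper 16 bits.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// float32 -> bfloat16 bit pattern (hypothetical helper, not tiny-vllm code).
// Assumption: NaN inputs are out of scope for this sketch.
std::uint16_t float_to_bf16(float value) {
    assert(!std::isnan(value));
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));  // well-defined type punning
    // Round to nearest even: add 0x7FFF plus the lowest bit that survives.
    const std::uint32_t rounding_bias = 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>((bits + rounding_bias) >> 16);
}

// bfloat16 bit pattern -> float32: the dropped mantissa bits become zero.
float bf16_to_float(std::uint16_t value) {
    const std::uint32_t bits = static_cast<std::uint32_t>(value) << 16;
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}
```

For example, 1.0f maps to the bit pattern 0x3F80 and back to exactly 1.0f, while 3.14159f maps to 0x4049, which decodes to 3.140625: the range of float32 is preserved, but only about two to three decimal digits of precision remain.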

System Requirements and Setup

The project targets NVIDIA GPU hardware and requires:

Linux (developed and tested on Linux 6.19.8 x86_64)
CUDA Toolkit (13.1)
C++17 with GCC (15.2.1)
NVIDIA GPU (developed on RTX 5090; any CUDA-capable GPU with minor path adjustments)
nlohmann/json 3.12.0 (the only external dependency; single header file included)
Llama 3.2 1B Instruct weights in Safetensors format from Hugging Face

Building and running are handled by a single ./test.sh script. The author notes that the paths to CUDA, GCC, and NVCC may need adjusting depending on the host machine.

Audience and Use Cases

tiny-vllm is explicitly positioned for two audiences: individual learners on a self-directed AI/ML path, and lecturers who want a teaching resource for university courses on GPU programming or LLM systems. The course references and recommends complementary resources including Andrej Karpathy's nanoGPT and llm.c repositories, George Hotz's tinygrad, and the GPU MODE Discord community. The author also notes that a future course on ML compilers or alternative attention mechanisms may follow if there is sufficient interest.

Current Status

The repository was created in February 2026 and last updated in May 2026. Core features including PagedAttention, continuous batching, and online softmax are marked as implemented. Several sections of the written course (notably the Attention, GQA, Paged Attention, and Online Softmax chapters) are marked as TODO: the code is present, but the prose explanations are still in progress. The project is actively maintained, with one open issue at the time of data collection.

Pricing

Open Source

Full inference engine source code in C++ and CUDA
Complete written course with step-by-step implementation guide
Apache License 2.0 — free to use, modify, and distribute
Community contributions via GitHub pull requests

Capabilities

Key Features

Full LLM forward pass (prefill + decode)
Load Llama 3.2 1B Instruct from Safetensors
All computation with CUDA kernels
KV cache
Static batching
Continuous batching
Online softmax / FlashAttention-like attention
PagedAttention with paged KV cache
Grouped-query attention (GQA)
RMSNorm with parallel reduction
RoPE positional encoding
SiLU activation function
cuBLAS matrix multiplication via cublasGemmEx
Causal masking
Buffer reuse for memory optimization
Written course with step-by-step implementation guide
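
For readers unfamiliar with the paged KV cache in the feature list above, the following C++ sketch shows the address translation at its core. All names here (BlockTable, block_size, blocks, slot_for) are hypothetical and are not taken from the tiny-vllm source. The assumption is the usual PagedAttention layout: the KV cache is a shared pool of fixed-size blocks, and each sequence holds a table that maps its logical block index to a physical block id.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-sequence block table (hypothetical layout, not tiny-vllm code).
struct BlockTable {
    std::size_t block_size;           // tokens stored per physical block
    std::vector<std::size_t> blocks;  // logical block index -> physical block id

    // Translate a token position in the sequence to a slot in the shared pool.
    std::size_t slot_for(std::size_t token_pos) const {
        const std::size_t logical_block = token_pos / block_size;
        const std::size_t offset = token_pos % block_size;
        assert(logical_block < blocks.size());  // position must be allocated
        return blocks[logical_block] * block_size + offset;
    }
};
```

With a block size of 4 and a table of physical blocks {7, 2, 5}, token 0 lands in slot 28 and token 5 in slot 9. The blocks of one sequence need not be contiguous, which is what lets a continuous-batching scheduler allocate and free cache memory one block at a time.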

Integrations

CUDA Toolkit

cuBLAS

Safetensors

Hugging Face (model weights)

nlohmann/json

GCC

CMake

API Available
