EveryDev.ai

    tiny-vllm

    Local Inference
    Featured

    A hands-on course and full source code for building a high-performance LLM inference engine in C++ and CUDA, implementing features like KV cache, PagedAttention, and continuous batching.

    At a Glance

    Pricing
    Open Source

    Fully free and open-source under Apache License 2.0. Full source code and course content available on GitHub.

    Available On

    Linux
    API
    VS Code
    CLI

    Resources

    Website, Docs, GitHub, llms.txt

    Topics

    Local Inference, AI Courses, AI Infrastructure

    Alternatives

    Modular, RamaLama, ModelHub
    Developer
    Jędrzej Maczan, Wrocław, Poland, Est. 2024

    Listed May 2026

    About tiny-vllm

    tiny-vllm is an open-source educational project by Jędrzej Maczan that teaches you how to build a high-performance LLM inference engine from scratch using C++ and CUDA. It serves as both a complete, working inference server for Llama 3.2 1B Instruct and a structured course that walks through every implementation decision, from floating-point number formats to PagedAttention CUDA kernels.

    What It Is

    tiny-vllm describes itself as a "younger and smaller sibling of vLLM": a learning tool that pairs the full source code of a production-capable inference server with a detailed written course explaining how each component works. The project targets engineers and students who want to understand LLM inference at the systems level, not just use a high-level Python API. It is licensed under Apache License 2.0 and hosted on GitHub, where it has accumulated 576 stars and 25 forks since its creation in February 2026.

    What the Engine Implements

    The inference engine covers the full stack of a modern LLM serving system:

    • Model loading: Reads weights from Safetensors format (Llama 3.2 1B Instruct)
    • Full forward pass: Prefill and decode phases with all CUDA kernels
    • Attention: Grouped-query attention (GQA), causal masking, scaled dot-product attention
    • Normalization and activations: RMSNorm with parallel reduction, RoPE positional encoding, SiLU activation
    • Batching: Both static batching and continuous batching
    • Memory optimization: KV cache, buffer reuse, PagedAttention with paged KV cache
    • Performance: FlashAttention-like online softmax, cuBLAS matrix multiplication via cublasGemmEx
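
    To make the "online softmax" item above more concrete, here is a minimal CPU sketch in plain C++. It is not the project's CUDA kernel; the function name online_softmax is hypothetical, and the sketch assumes all scores are finite. It shows the core idea: a single pass keeps a running maximum and a running sum of exponentials, rescaling the sum whenever a larger maximum appears.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical CPU reference for single-pass ("online") softmax.
// Assumes every score is finite.
std::vector<float> online_softmax(const std::vector<float>& scores) {
    assert(!scores.empty());
    float running_max = -std::numeric_limits<float>::infinity();
    float running_sum = 0.0f;

    // One pass: update the maximum and rescale the accumulated sum so that
    // it is always expressed relative to the current maximum.
    for (float s : scores) {
        const float new_max = std::max(running_max, s);
        running_sum = running_sum * std::exp(running_max - new_max)
                    + std::exp(s - new_max);
        running_max = new_max;
    }

    // Normalize using the final maximum and sum.
    std::vector<float> probs(scores.size());
    for (std::size_t i = 0; i < scores.size(); ++i) {
        probs[i] = std::exp(scores[i] - running_max) / running_sum;
    }
    return probs;
}
```

    Because the maximum is subtracted before every exponential, large scores such as 1000.0f do not overflow. A FlashAttention-style kernel would typically apply the same rescaling to a running weighted sum of value vectors as well, which this sketch leaves out.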

    Course Structure and Learning Path

    The repository doubles as a written course with chapters covering every major concept needed to build the engine. Topics include how floating-point and bfloat16 numbers work, GPU vs CPU memory management with CUDA, writing your first CUDA kernel for embedding gather, parallel reduction for RMSNorm, the column-major to row-major transposition trick for cuBLAS, and the distinction between prefill and decode phases. The author explicitly frames the course as a JIT (just-in-time) learning resource: readers are encouraged to code alongside the text, make mistakes, and use LLMs as a personalized tutor to fill gaps.
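
    The column-major to row-major trick mentioned above can be sketched without a GPU. cuBLAS routines expect column-major matrices, while C++ code usually stores them row-major. A row-major M x K buffer is byte-for-byte the same as a column-major K x M matrix (its transpose), so computing C^T = B^T * A^T in column-major terms yields row-major C = A * B with no explicit transposition. In the sketch below, gemm_col_major is a hypothetical stand-in for a column-major GEMM routine, not a cuBLAS call, and whether tiny-vllm arranges its arguments exactly this way is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a column-major GEMM without transposes:
// C (m x n) = A (m x k) * B (k x n), with leading dimensions lda, ldb, ldc.
void gemm_col_major(int m, int n, int k,
                    const float* A, int lda,
                    const float* B, int ldb,
                    float* C, int ldc) {
    for (int col = 0; col < n; ++col) {
        for (int row = 0; row < m; ++row) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[row + p * lda] * B[p + col * ldb];
            }
            C[row + col * ldc] = acc;
        }
    }
}

// Row-major C (M x N) = A (M x K) * B (K x N), computed by handing the
// operands to the column-major routine in swapped order.
std::vector<float> matmul_row_major(int M, int N, int K,
                                    const std::vector<float>& A,
                                    const std::vector<float>& B) {
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(B.size() == static_cast<std::size_t>(K) * N);
    std::vector<float> C(static_cast<std::size_t>(M) * N);
    // Seen as column-major: B is N x K (ld N), A is K x M (ld K),
    // and the result is N x M (ld N), which is row-major M x N.
    gemm_col_major(N, M, K, B.data(), N, A.data(), K, C.data(), N);
    return C;
}
```

    The only change from a "natural" call is the swapped operand order and swapped M and N, so the trick costs nothing at run time.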

    System Requirements and Setup

    The project targets NVIDIA GPU hardware and requires:

    • Linux (developed and tested on Linux 6.19.8 x86_64)
    • CUDA Toolkit (13.1)
    • C++17 with GCC (15.2.1)
    • NVIDIA GPU (developed on RTX 5090; any CUDA-capable GPU with minor path adjustments)
    • nlohmann/json 3.12.0 (the only external dependency; single header file included)
    • Llama 3.2 1B Instruct weights in Safetensors format from Hugging Face

    Building and running are handled by a single ./test.sh script. The author notes that the paths to CUDA, GCC, and NVCC may need adjusting depending on the host machine.

    Audience and Use Cases

    tiny-vllm is explicitly positioned for two audiences: individual learners on a self-directed AI/ML path, and lecturers who want a teaching resource for university courses on GPU programming or LLM systems. The course references and recommends complementary resources including Andrej Karpathy's nanoGPT and llm.c repositories, George Hotz's tinygrad, and the GPU MODE Discord community. The author also notes that a future course on ML compilers or alternative attention mechanisms may follow if there is sufficient interest.

    Current Status

    The repository was created in February 2026 and last updated in May 2026. Core features including PagedAttention, continuous batching, and online softmax are marked as implemented. Several sections of the written course (notably the Attention, GQA, Paged Attention, and Online Softmax chapters) are marked as TODO: the code is present, but the prose explanations are still in progress. The project is actively maintained and had 1 open issue at the time of data collection.

    Pricing

    Open Source

    • Full inference engine source code in C++ and CUDA
    • Complete written course with step-by-step implementation guide
    • Apache License 2.0 — free to use, modify, and distribute
    • Community contributions via GitHub pull requests

    Capabilities

    Key Features

    • Full LLM forward pass (prefill + decode)
    • Load Llama 3.2 1B Instruct from Safetensors
    • All computation with CUDA kernels
    • KV cache
    • Static batching
    • Continuous batching
    • Online softmax / FlashAttention-like attention
    • PagedAttention with paged KV cache
    • Grouped-query attention (GQA)
    • RMSNorm with parallel reduction
    • RoPE positional encoding
    • SiLU activation function
    • cuBLAS matrix multiplication via cublasGemmEx
    • Causal masking
    • Buffer reuse for memory optimization
    • Written course with step-by-step implementation guide
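
    As an illustration of the "PagedAttention with paged KV cache" feature listed above, the following C++ sketch shows only the bookkeeping side: a per-sequence block table that maps a logical token position to a physical (block, offset) slot in a shared pool. All names here (BlockAllocator, SequenceBlockTable, PhysicalSlot) are hypothetical and are not taken from the tiny-vllm source; the actual key/value storage and attention kernel are out of scope.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Where one token's key/value entry lives in the shared pool.
struct PhysicalSlot {
    int block;   // physical block index
    int offset;  // position inside that block
};

// Hypothetical free-list allocator over a fixed pool of KV cache blocks.
class BlockAllocator {
  public:
    explicit BlockAllocator(int num_blocks) {
        // Push in reverse so that block 0 is handed out first.
        for (int b = num_blocks - 1; b >= 0; --b) free_.push_back(b);
    }
    int allocate() {
        assert(!free_.empty());
        const int b = free_.back();
        free_.pop_back();
        return b;
    }
    void release(int b) { free_.push_back(b); }
    std::size_t free_count() const { return free_.size(); }

  private:
    std::vector<int> free_;
};

// Per-sequence block table: logical position -> physical slot.
class SequenceBlockTable {
  public:
    SequenceBlockTable(BlockAllocator& alloc, int block_size)
        : alloc_(alloc), block_size_(block_size) {}

    // Reserve a slot for the next token; grab a new block only when the
    // current last block is full (or the sequence is empty).
    PhysicalSlot append_token() {
        if (num_tokens_ % block_size_ == 0) blocks_.push_back(alloc_.allocate());
        const int pos = num_tokens_++;
        return locate(pos);
    }

    PhysicalSlot locate(int pos) const {
        assert(pos >= 0 && pos < num_tokens_);
        return {blocks_[pos / block_size_], pos % block_size_};
    }

    // Return every block to the pool, e.g. when the sequence finishes.
    void release_all() {
        for (int b : blocks_) alloc_.release(b);
        blocks_.clear();
        num_tokens_ = 0;
    }

  private:
    BlockAllocator& alloc_;
    int block_size_;
    int num_tokens_ = 0;
    std::vector<int> blocks_;
};
```

    Because blocks are allocated on demand, two sequences that grow in an interleaved fashion end up with non-contiguous physical blocks, and a finished sequence returns its blocks to the pool for reuse. That is the property that lets a paged cache work alongside continuous batching.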

    Integrations

    CUDA Toolkit
    cuBLAS
    Safetensors
    Hugging Face (model weights)
    nlohmann/json
    GCC
    CMake
    API Available

    Developer

    Jędrzej Maczan

    Jędrzej Maczan builds open-source educational tools for GPU programming and LLM systems engineering. He created tiny-vllm as both a working C++/CUDA inference engine and a structured course for learning LLM internals from scratch. His projects emphasize deep, hands-on understanding of systems-level AI, and he actively encourages community contributions and pull requests.

    Founded 2024
    Wrocław, Poland
    1 employee

    Used by

    PyTorch (Open source contributions)
    DNV (Professional role)
    Cohere Labs (Collaborator)

    Similar Tools

    Modular

    AI infrastructure platform with MAX framework, Mojo language, and Mammoth for GPU-portable GenAI serving across NVIDIA and AMD hardware.

    RamaLama

    An open-source CLI tool that simplifies running and serving AI models locally using OCI containers, with automatic GPU detection and multi-registry support.

    ModelHub

    A macOS menu-bar app that lets you discover, download, and manage local LLMs from Hugging Face, with support for Ollama, MLX, LM Studio, llama.cpp, and vLLM.

    Related Topics

    Local Inference

    Tools and platforms for running AI inference locally without cloud dependence.

    111 tools

    AI Courses

    Structured courses, workshops, and comprehensive training programs for AI, machine learning, and development.

    62 tools

    AI Infrastructure

    Infrastructure designed for deploying and running AI models.

    252 tools