llama.cpp
LLM inference in C/C++, enabling efficient local execution of large language models across a wide range of hardware platforms.
About llama.cpp
llama.cpp is a high-performance C/C++ library for running large language model (LLM) inference locally on a wide variety of hardware. Originally developed to enable running Meta's LLaMA models on consumer hardware, it has evolved into a comprehensive framework supporting numerous model architectures and quantization formats. The project prioritizes efficiency, portability, and minimal dependencies, making it ideal for developers who want to deploy LLMs without relying on cloud services.
- Pure C/C++ Implementation provides a lightweight, dependency-free codebase that compiles easily across platforms without requiring heavy frameworks like PyTorch or TensorFlow.
- Extensive Quantization Support enables running large models on limited hardware through various quantization methods (4-bit, 5-bit, 8-bit), dramatically reducing memory requirements while maintaining reasonable quality (a rough memory estimate follows this list).
- Multi-Platform Hardware Acceleration supports CUDA, Metal, OpenCL, Vulkan, and CPU-optimized SIMD instructions, allowing optimal performance on NVIDIA GPUs, Apple Silicon, AMD GPUs, and modern CPUs.
- Model Format Compatibility works with the GGUF format and supports conversion from various model formats, enabling use of models from Hugging Face and other sources.
- Server Mode includes a built-in HTTP server with OpenAI-compatible API endpoints, making it easy to integrate into existing applications and workflows (a client sketch follows the getting-started note below).
- Active Community Development benefits from rapid iteration and contributions from a large open-source community, with frequent updates adding support for new models and optimizations.
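To make the quantization claim concrete, here is a rough back-of-envelope comparison of weight memory for a 7-billion-parameter model at 16-bit versus 4-bit precision. The numbers are illustrative only: they ignore per-block quantization scales, the KV cache, and runtime buffers, and actual usage varies by model and quantization variant.

```python
# Rough, illustrative weight-memory estimate (ignores quantization block
# overhead, the KV cache, and activation buffers).
params = 7e9                      # 7B-parameter model
fp16_gib = params * 2 / 2**30     # 2 bytes per weight   -> ~13.0 GiB
q4_gib   = params * 0.5 / 2**30   # ~0.5 bytes per weight -> ~3.3 GiB
print(f"fp16: {fp16_gib:.1f} GiB, 4-bit: {q4_gib:.1f} GiB")
```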
To get started, clone the repository, build using CMake with your preferred backend (CPU, CUDA, Metal, etc.), download a GGUF-format model, and run inference using the provided command-line tools or server. The project includes comprehensive documentation covering build options, model conversion, and API usage.
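Once the bundled server is running locally, its OpenAI-compatible endpoints can be queried with any OpenAI client library. The sketch below is illustrative only: it assumes the server was started with a GGUF model and is listening on localhost port 8080 (the project's documented default), and it uses the Python `openai` package; the model name and API key are placeholders.

```python
# Illustrative sketch: querying a locally running llama.cpp server through its
# OpenAI-compatible API. Assumes the server listens on localhost:8080.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # the local llama.cpp server, not api.openai.com
    api_key="not-needed-locally",         # placeholder; the local server does not require a real key
)

response = client.chat.completions.create(
    model="local-model",  # placeholder; the server serves whatever GGUF file it was started with
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```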
Pricing
Open Source
Free and open source under MIT license
- Full source code access
- All features included
- Community support
- MIT License
Capabilities
Key Features
- Pure C/C++ implementation with no dependencies
- 4-bit, 5-bit, and 8-bit quantization support
- CUDA GPU acceleration for NVIDIA GPUs
- Metal acceleration for Apple Silicon
- Vulkan and OpenCL support
- CPU SIMD optimizations (AVX, AVX2, AVX512)
- GGUF model format support
- Built-in HTTP server with OpenAI-compatible API
- Model conversion tools
- Batch processing support
- KV cache quantization
- Speculative decoding
- Grammar-based sampling to constrain output format (see the sketch following this list)
- Multi-modal model support
- Cross-platform compatibility
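As a pointer to how grammar-based sampling is typically used, the sketch below sends a GBNF grammar alongside a prompt so the reply is constrained to "yes" or "no". The endpoint path, JSON field names, and default port follow the project's server documentation, but treat them as assumptions to verify against the version you build.

```python
# Illustrative sketch: constraining generation with a GBNF grammar via the
# llama.cpp server's /completion endpoint (assumed running on localhost:8080).
import requests

grammar = 'root ::= "yes" | "no"'  # GBNF: the model may only emit "yes" or "no"

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Is llama.cpp written in C/C++? Answer yes or no: ",
        "grammar": grammar,
        "n_predict": 4,  # small cap; the grammar already bounds the output
    },
)
print(resp.json()["content"])
```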
