# MiniCPM

> MiniCPM is a series of ultra-efficient open-source large language models designed for end-side devices, featuring sparse attention, hybrid reasoning, and 3x+ generation speedup.

MiniCPM is an open-source family of highly efficient large language models (LLMs) developed by OpenBMB (THUNLP and Modelbest Inc.), designed specifically for deployment on end-side and edge devices. The series achieves state-of-the-art performance at its scale through systematic innovations in model architecture, training data, training algorithms, and inference systems. The latest models (MiniCPM4, MiniCPM4.1, and MiniCPM-SALA) deliver 3–7x generation speedups over similar-sized models on edge hardware while supporting context lengths of up to 1 million tokens.

**Key Features:**

- **Efficient Model Architecture** — *MiniCPM4 and MiniCPM4.1 use InfLLM-V2 trainable sparse attention, in which each token computes relevance against fewer than 5% of tokens when processing 128K-token contexts, drastically reducing computational overhead.*
- **MiniCPM-SALA Hybrid Attention** — *The first large-scale hybrid model to integrate 25% sparse attention (InfLLM-V2) with 75% linear attention (Lightning Attention), enabling 1M-token inference on consumer GPUs such as the NVIDIA RTX 5090.*
- **Hybrid Reasoning Mode** — *MiniCPM4.1 supports both deep-reasoning and non-reasoning modes, toggled via `enable_thinking` in the chat template or the inline `/think` and `/no_think` tokens.*
- **BitCPM4 Ternary Quantization** — *Compresses model parameters to 1.58-bit width via quantization-aware training (QAT), achieving performance comparable to full-precision models at a fraction of the size.*
- **Multiple Inference Backends** — *Supports HuggingFace Transformers, vLLM, SGLang, CPM.cu (recommended for maximum speed), llama.cpp, and Ollama for flexible deployment.*
- **Speculative Decoding (EAGLE3)** — *Achieves up to 3x decoding speedup in reasoning mode using the EAGLE3 draft model with vLLM and SGLang.*
- **MiniCPM4-MCP Tool Use** — *Fine-tuned variant supporting tool calling across 16 MCP servers spanning the office, lifestyle, communication, and work-management categories.*
- **MiniCPM4-Survey Agent** — *Specialized model for trustworthy long-form survey generation, built on a Plan-Retrieve-Write multi-agent framework with RL training.*
- **Long Context Support** — *MiniCPM4.1 natively supports 64K tokens, extensible to 128K+ via YaRN; MiniCPM-SALA scales to 1M+ tokens via HyPE positional encoding.*
- **Apache 2.0 License** — *All models and code are released under the Apache License 2.0, free to use, modify, and distribute.*

## Features

- Trainable sparse attention (InfLLM-V2)
- Hybrid sparse + linear attention (MiniCPM-SALA)
- 1M-token context on consumer GPUs
- Hybrid reasoning mode (deep reasoning / non-reasoning)
- BitCPM4 ternary quantization (1.58-bit)
- EAGLE3 speculative decoding (3x speedup)
- MCP tool use across 16 servers
- Survey generation agent (MiniCPM4-Survey)
- HuggingFace, vLLM, SGLang, CPM.cu, llama.cpp, Ollama support
- YaRN long context extension
- GPTQ, AWQ, GGUF, MLX quantized variants
- Apache 2.0 open-source license

## Integrations

HuggingFace Transformers, vLLM, SGLang, CPM.cu, llama.cpp, Ollama, ModelScope, OpenVINO, Intel Core Ultra (AIPC), MCP (Model Context Protocol), NVIDIA CUDA

## Platforms

API, CLI, DEVELOPER_SDK

## Pricing

Open Source

## Version

2.4.2

## Links

- Website: https://github.com/OpenBMB/MiniCPM
- Documentation: https://modelbest.feishu.cn/wiki/D2tFw8Pcsi5CIzkaHNacLK64npg
- Repository: https://github.com/OpenBMB/MiniCPM
- EveryDev.ai: https://www.everydev.ai/tools/minicpm
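The YaRN-based long-context extension mentioned above is typically enabled through the model's `config.json`. A minimal sketch of what such an override might look like, assuming the `rope_scaling` conventions used by recent HuggingFace Transformers releases (the exact field names and factor should be checked against the MiniCPM4.1 model card):

```json
{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 2.0,
    "original_max_position_embeddings": 65536
  }
}
```

Here a `factor` of 2.0 would stretch the native 64K window toward 128K; larger factors trade some short-context fidelity for longer reach.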