MiniCPM
MiniCPM is a series of ultra-efficient open-source large language models designed for end-side devices, featuring sparse attention, hybrid reasoning, and 3x+ generation speedup.
At a Glance
Fully free and open-source under Apache License 2.0. All models and code are free to use, modify, and distribute.
Listed Apr 2026
About MiniCPM
MiniCPM is an open-source family of highly efficient large language models (LLMs) developed by OpenBMB (THUNLP and Modelbest Inc.), designed explicitly for deployment on end-side and edge devices. The series achieves state-of-the-art performance at its scale through systematic innovations in model architecture, training data, training algorithms, and inference systems. The latest models—MiniCPM4, MiniCPM4.1, and MiniCPM-SALA—deliver over 3–7x generation speedup compared to similar-sized models on edge hardware, while supporting context lengths up to 1 million tokens.
Key Features:
- Efficient Model Architecture — MiniCPM4 and MiniCPM4.1 use InfLLM-V2 trainable sparse attention, where each token computes relevance with less than 5% of tokens in 128K long-text processing, drastically reducing computational overhead.
- MiniCPM-SALA Hybrid Attention — The first large-scale hybrid model integrating 25% sparse attention (InfLLM-V2) and 75% linear attention (Lightning Attention), enabling 1M-token inference on consumer GPUs like the NVIDIA RTX 5090.
- Hybrid Reasoning Mode — MiniCPM4.1 supports both deep reasoning and non-reasoning modes, toggled via enable_thinking in the chat template or the inline /think and /no_think tokens.
- BitCPM4 Ternary Quantization — Compresses model parameters to 1.58-bit width via quantization-aware training (QAT), achieving performance comparable to full-precision models at a fraction of the size.
- Multiple Inference Backends — Supports HuggingFace Transformers, vLLM, SGLang, CPM.cu (recommended for maximum speed), llama.cpp, and Ollama for flexible deployment.
- Speculative Decoding (EAGLE3) — Achieves up to 3x decoding speedup in reasoning mode using the EAGLE3 draft model with vLLM and SGLang.
- MiniCPM4-MCP Tool Use — Fine-tuned variant supporting tool calling across 16 MCP servers spanning office, lifestyle, communication, and work management categories.
- MiniCPM4-Survey Agent — Specialized model for trustworthy long-form survey generation using a Plan-Retrieve-Write multi-agent framework with RL training.
- Long Context Support — MiniCPM4.1 natively supports 64K tokens with YaRN-based extension to 128K+; MiniCPM-SALA scales to 1M+ tokens via HyPE positional encoding.
- Apache 2.0 License — All models and code are released under the Apache License 2.0, free to use, modify, and distribute.
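The block-selection idea behind InfLLM-V2-style sparse attention (each token computing relevance with only a small fraction of the context) can be sketched in a few lines of NumPy. Everything here, including the block-mean scoring heuristic, the keep ratio, and the function name, is an illustrative simplification, not OpenBMB's trainable kernel:

```python
import numpy as np

def block_sparse_attention(q, k, v, block_size=4, keep_ratio=0.25):
    """Toy block-sparse attention: each query scores coarse key-block
    summaries, keeps only the top-scoring blocks, and runs softmax
    attention over just those keys. Illustrative only."""
    n, d = k.shape
    n_blocks = n // block_size
    # Summarize each key block by its mean vector.
    block_means = k[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    # Score blocks per query and keep only the top fraction.
    n_keep = max(1, int(n_blocks * keep_ratio))
    block_scores = q @ block_means.T                      # (m, n_blocks)
    keep = np.argsort(-block_scores, axis=1)[:, :n_keep]  # (m, n_keep)

    out = np.zeros((q.shape[0], d))
    for i, blocks in enumerate(keep):
        # Gather the token indices of the selected blocks only.
        idx = np.concatenate([np.arange(b * block_size, (b + 1) * block_size) for b in blocks])
        scores = q[i] @ k[idx].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[i] = w @ v[idx]
    return out

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(2, 8)), rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
y = block_sparse_attention(q, k, v)
print(y.shape)  # each query attended to 8 of 32 keys (25%)
```

With keep_ratio set to 0.05 this mirrors the "less than 5% of tokens" figure quoted above; the real system learns which blocks to keep during training rather than using fixed mean-pooled summaries.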
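The BitCPM4 bullet describes ternary weights. A minimal post-hoc sketch of absmean ternary quantization (the BitNet b1.58 scheme, used here purely for illustration; BitCPM4 instead learns the quantization during QAT) looks like:

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Quantize a weight matrix to {-1, 0, +1} times a single scale,
    using the absmean scheme. Rounding a trained matrix after the fact,
    as done here, loses more accuracy than quantization-aware training."""
    scale = np.abs(w).mean() + eps           # per-matrix absmean scale
    q = np.clip(np.round(w / scale), -1, 1)  # ternary codes
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
q, scale = ternary_quantize(w)
w_hat = q * scale  # dequantized weights used at inference time
```

Each entry of q needs only log2(3) ≈ 1.58 bits of storage, which is where the "1.58-bit" figure comes from.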
Pricing
Open Source
- All MiniCPM model weights (MiniCPM4, MiniCPM4.1, MiniCPM-SALA, BitCPM4, etc.)
- Apache License 2.0
- HuggingFace and ModelScope model downloads
- Full source code access
- Community support via Discord and WeChat
Capabilities
Key Features
- Trainable sparse attention (InfLLM-V2)
- Hybrid sparse + linear attention (MiniCPM-SALA)
- 1M-token context on consumer GPUs
- Hybrid reasoning mode (deep reasoning / non-reasoning)
- BitCPM4 ternary quantization (1.58-bit)
- EAGLE3 speculative decoding (3x speedup)
- MCP tool use across 16 servers
- Survey generation agent (MiniCPM4-Survey)
- HuggingFace, vLLM, SGLang, CPM.cu, llama.cpp, Ollama support
- YaRN long context extension
- GPTQ, AWQ, GGUF, MLX quantized variants
- Apache 2.0 open-source license
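For the YaRN long-context extension listed above, deployment typically only requires a rope_scaling entry in the model's configuration. The field names below follow the Hugging Face Transformers convention; the exact scaling factor and original context length for MiniCPM4.1 should be taken from the official model card rather than this sketch:

```json
{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 2.0,
    "original_max_position_embeddings": 65536
  }
}
```

A factor of 2.0 over a 64K native window corresponds to the 128K extended context mentioned above.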