# MiniCPM

> MiniCPM is a series of ultra-efficient open-source large language models designed for end-side devices, featuring sparse attention, hybrid reasoning, and 3x+ generation speedup.

MiniCPM is an open-source family of highly efficient large language models (LLMs) developed by OpenBMB (THUNLP and Modelbest Inc.), designed specifically for deployment on end-side and edge devices. The series achieves state-of-the-art performance at its scale through systematic innovations in model architecture, training data, training algorithms, and inference systems. The latest models (MiniCPM4, MiniCPM4.1, and MiniCPM-SALA) deliver 3–7x generation speedups over similar-sized models on edge hardware while supporting context lengths of up to 1 million tokens.

**Key Features:**

- **Efficient Model Architecture** — *MiniCPM4 and MiniCPM4.1 use InfLLM-V2 trainable sparse attention, in which each token computes relevance against fewer than 5% of tokens when processing 128K-token contexts, drastically reducing computational overhead.*
- **MiniCPM-SALA Hybrid Attention** — *The first large-scale hybrid model to integrate 25% sparse attention (InfLLM-V2) with 75% linear attention (Lightning Attention), enabling 1M-token inference on consumer GPUs such as the NVIDIA RTX 5090.*
- **Hybrid Reasoning Mode** — *MiniCPM4.1 supports both deep-reasoning and non-reasoning modes, toggled via `enable_thinking` in the chat template or the inline `/think` and `/no_think` tokens.*
- **BitCPM4 Ternary Quantization** — *Compresses model parameters to 1.58-bit width via quantization-aware training (QAT), achieving performance comparable to full-precision models at a fraction of the size.*
- **Multiple Inference Backends** — *Supports HuggingFace Transformers, vLLM, SGLang, CPM.cu (recommended for maximum speed), llama.cpp, and Ollama for flexible deployment.*
- **Speculative Decoding (EAGLE3)** — *Achieves up to 3x decoding speedup in reasoning mode using the EAGLE3 draft model with vLLM and SGLang.*
- **MiniCPM4-MCP Tool Use** — *Fine-tuned variant supporting tool calling across 16 MCP servers spanning the office, lifestyle, communication, and work-management categories.*
- **MiniCPM4-Survey Agent** — *Specialized model for trustworthy long-form survey generation, built on a Plan-Retrieve-Write multi-agent framework with RL training.*
- **Long Context Support** — *MiniCPM4.1 natively supports 64K tokens, extensible to 128K+ via YaRN; MiniCPM-SALA scales to 1M+ tokens via HyPE positional encoding.*
- **Apache 2.0 License** — *All models and code are released under the Apache License 2.0, free to use, modify, and distribute.*

## Features

- Trainable sparse attention (InfLLM-V2)
- Hybrid sparse + linear attention (MiniCPM-SALA)
- 1M-token context on consumer GPUs
- Hybrid reasoning mode (deep reasoning / non-reasoning)
- BitCPM4 ternary quantization (1.58-bit)
- EAGLE3 speculative decoding (3x speedup)
- MCP tool use across 16 servers
- Survey generation agent (MiniCPM4-Survey)
- HuggingFace, vLLM, SGLang, CPM.cu, llama.cpp, Ollama support
- YaRN long context extension
- GPTQ, AWQ, GGUF, MLX quantized variants
- Apache 2.0 open-source license

## Integrations

HuggingFace Transformers, vLLM, SGLang, CPM.cu, llama.cpp, Ollama, ModelScope, OpenVINO, Intel Core Ultra (AIPC), MCP (Model Context Protocol), NVIDIA CUDA

## Platforms

API, CLI, DEVELOPER_SDK

## Pricing

Open Source

## Version

2.4.2

## Links

- Website: https://github.com/OpenBMB/MiniCPM
- Documentation: https://modelbest.feishu.cn/wiki/D2tFw8Pcsi5CIzkaHNacLK64npg
- Repository: https://github.com/OpenBMB/MiniCPM
- EveryDev.ai: https://www.everydev.ai/tools/minicpm
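The YaRN-based long-context extension mentioned above is typically enabled through the model's `config.json`. A minimal sketch of what such an override might look like, assuming the `rope_scaling` conventions used by recent HuggingFace Transformers releases (the exact field names and factor should be checked against the MiniCPM4.1 model card):

```json
{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 2.0,
    "original_max_position_embeddings": 65536
  }
}
```

Here a `factor` of 2.0 would stretch the native 64K window toward 128K; larger factors trade some short-context fidelity for longer reach.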