MiniCPM
MiniCPM is a series of ultra-efficient open-source large language models designed for end-side devices, featuring sparse attention, hybrid reasoning, and 3x+ generation speedup.
At a Glance
Fully free and open-source under Apache License 2.0. All models and code are free to use, modify, and distribute.
Listed Apr 2026
About MiniCPM
MiniCPM is an open-source family of highly efficient large language models (LLMs) developed by OpenBMB (THUNLP and Modelbest Inc.), designed explicitly for deployment on end-side and edge devices. The series achieves state-of-the-art performance at its scale through systematic innovations in model architecture, training data, training algorithms, and inference systems. The latest models—MiniCPM4, MiniCPM4.1, and MiniCPM-SALA—deliver over 3–7x generation speedup compared to similar-sized models on edge hardware, while supporting context lengths up to 1 million tokens.
Key Features:
- Efficient Model Architecture — MiniCPM4 and MiniCPM4.1 use InfLLM-V2 trainable sparse attention, where each token computes relevance with less than 5% of tokens in 128K long-text processing, drastically reducing computational overhead.
- MiniCPM-SALA Hybrid Attention — The first large-scale hybrid model integrating 25% sparse attention (InfLLM-V2) and 75% linear attention (Lightning Attention), enabling 1M-token inference on consumer GPUs like the NVIDIA RTX 5090.
- Hybrid Reasoning Mode — MiniCPM4.1 supports both deep reasoning and non-reasoning modes, toggled via enable_thinking in the chat template or the inline /think and /no_think tokens.
- BitCPM4 Ternary Quantization — Compresses model parameters to 1.58-bit width via quantization-aware training (QAT), achieving performance comparable to full-precision models at a fraction of the size.
- Multiple Inference Backends — Supports HuggingFace Transformers, vLLM, SGLang, CPM.cu (recommended for maximum speed), llama.cpp, and Ollama for flexible deployment.
- Speculative Decoding (EAGLE3) — Achieves up to 3x decoding speedup in reasoning mode using the EAGLE3 draft model with vLLM and SGLang.
- MiniCPM4-MCP Tool Use — Fine-tuned variant supporting tool calling across 16 MCP servers spanning office, lifestyle, communication, and work management categories.
- MiniCPM4-Survey Agent — Specialized model for trustworthy long-form survey generation using a Plan-Retrieve-Write multi-agent framework with RL training.
- Long Context Support — MiniCPM4.1 natively supports 64K tokens with YaRN-based extension to 128K+; MiniCPM-SALA scales to 1M+ tokens via HyPE positional encoding.
- Apache 2.0 License — All models and code are released under the Apache License 2.0, free to use, modify, and distribute.
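The block-selection idea behind InfLLM-V2-style sparse attention (each token computing relevance with only a small fraction of the context) can be sketched in a few lines of NumPy. Everything here, including the block-mean scoring heuristic, the keep ratio, and the function name, is an illustrative simplification, not OpenBMB's trainable kernel:

```python
import numpy as np

def block_sparse_attention(q, k, v, block_size=4, keep_ratio=0.25):
    """Toy block-sparse attention: each query scores coarse key-block
    summaries, keeps only the top-scoring blocks, and runs softmax
    attention over just those keys. Illustrative only."""
    n, d = k.shape
    n_blocks = n // block_size
    # Summarize each key block by its mean vector.
    block_means = k[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    # Score blocks per query and keep only the top fraction.
    n_keep = max(1, int(n_blocks * keep_ratio))
    block_scores = q @ block_means.T                      # (m, n_blocks)
    keep = np.argsort(-block_scores, axis=1)[:, :n_keep]  # (m, n_keep)

    out = np.zeros((q.shape[0], d))
    for i, blocks in enumerate(keep):
        # Gather the token indices of the selected blocks only.
        idx = np.concatenate([np.arange(b * block_size, (b + 1) * block_size) for b in blocks])
        scores = q[i] @ k[idx].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[i] = w @ v[idx]
    return out

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(2, 8)), rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
y = block_sparse_attention(q, k, v)
print(y.shape)  # each query attended to 8 of 32 keys (25%)
```

With keep_ratio set to 0.05 this mirrors the "less than 5% of tokens" figure quoted above; the real system learns which blocks to keep during training rather than using fixed mean-pooled summaries.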
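The BitCPM4 bullet describes ternary weights. A minimal post-hoc sketch of absmean ternary quantization (the BitNet b1.58 scheme, used here purely for illustration; BitCPM4 instead learns the quantization during QAT) looks like:

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Quantize a weight matrix to {-1, 0, +1} times a single scale,
    using the absmean scheme. Rounding a trained matrix after the fact,
    as done here, loses more accuracy than quantization-aware training."""
    scale = np.abs(w).mean() + eps           # per-matrix absmean scale
    q = np.clip(np.round(w / scale), -1, 1)  # ternary codes
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
q, scale = ternary_quantize(w)
w_hat = q * scale  # dequantized weights used at inference time
```

Each entry of q needs only log2(3) ≈ 1.58 bits of storage, which is where the "1.58-bit" figure comes from.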
Pricing
Open Source
- All MiniCPM model weights (MiniCPM4, MiniCPM4.1, MiniCPM-SALA, BitCPM4, etc.)
- Apache License 2.0
- HuggingFace and ModelScope model downloads
- Full source code access
- Community support via Discord and WeChat
Capabilities
Key Features
- Trainable sparse attention (InfLLM-V2)
- Hybrid sparse + linear attention (MiniCPM-SALA)
- 1M-token context on consumer GPUs
- Hybrid reasoning mode (deep reasoning / non-reasoning)
- BitCPM4 ternary quantization (1.58-bit)
- EAGLE3 speculative decoding (3x speedup)
- MCP tool use across 16 servers
- Survey generation agent (MiniCPM4-Survey)
- HuggingFace, vLLM, SGLang, CPM.cu, llama.cpp, Ollama support
- YaRN long context extension
- GPTQ, AWQ, GGUF, MLX quantized variants
- Apache 2.0 open-source license
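For the YaRN long-context extension listed above, deployment typically only requires a rope_scaling entry in the model's configuration. The field names below follow the Hugging Face Transformers convention; the exact scaling factor and original context length for MiniCPM4.1 should be taken from the official model card rather than this sketch:

```json
{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 2.0,
    "original_max_position_embeddings": 65536
  }
}
```

A factor of 2.0 over a 64K native window corresponds to the 128K extended context mentioned above.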