LLMLingua
An open-source prompt compression library that reduces LLM prompt lengths by up to 20x using a compact language model to remove non-essential tokens with minimal performance loss.
At a Glance
Fully free and open-source under the MIT License. Free to use, modify, and distribute.
Listed May 2026
About LLMLingua
LLMLingua is an open-source Python library developed by Microsoft Research that compresses prompts for large language models (LLMs) by up to 20x, reducing inference costs and latency with minimal performance degradation. It uses a compact, well-trained language model (e.g., GPT2-small, LLaMA-7B) to identify and remove non-essential tokens from prompts. The library includes three main methods — LLMLingua, LongLLMLingua, and LLMLingua-2 — each targeting different compression scenarios, plus SecurityLingua for jailbreak defense.
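The core idea described above, using a small model to score each token's importance and keeping only the highest-scoring fraction, can be illustrated with a toy sketch. The real library scores tokens with a small causal language model's perplexity; here a hypothetical rarity-based scorer (`compress` is an illustrative function, not part of the llmlingua API) stands in for it:

```python
from collections import Counter

def compress(tokens, rate=0.5):
    # Toy importance scorer: rarer tokens are treated as more informative,
    # so frequent tokens are dropped first. LLMLingua itself uses a small
    # LM's per-token perplexity instead of raw frequency.
    counts = Counter(tokens)
    keep_n = max(1, int(len(tokens) * rate))
    # Rank token positions by rarity (lowest count = highest importance);
    # the sort is stable, so ties keep their original order.
    ranked = sorted(range(len(tokens)), key=lambda i: counts[tokens[i]])
    keep = set(ranked[:keep_n])
    # Emit surviving tokens in their original order.
    return [t for i, t in enumerate(tokens) if i in keep]

prompt = "the quick brown fox jumps over the lazy dog the the".split()
print(compress(prompt, rate=0.5))  # keeps the 5 rarest tokens
```

With `rate=0.5`, the repeated filler token "the" is pruned while the content-bearing words survive, which is the intuition behind perplexity-based compression: predictable tokens can be removed with little information loss.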
- LLMLingua compresses prompts using a small language model to identify and drop low-importance tokens; install via `pip install llmlingua` and use the `PromptCompressor` class to compress any prompt.
- LongLLMLingua addresses the "lost in the middle" problem in long-context LLMs, improving RAG performance by up to 21.4% while using only 1/4 of the tokens; enable it with the `rank_method="longllmlingua"` parameter.
- LLMLingua-2 is a task-agnostic compression method trained via data distillation from GPT-4, running 3x to 6x faster than the original LLMLingua; enable it with `use_llmlingua2=True`.
- SecurityLingua is a safety guardrail that uses security-aware prompt compression to detect jailbreak attacks with 100x lower token cost than state-of-the-art guardrail approaches.
- Structured Prompt Compression allows fine-grained control over which sections to compress using `<llmlingua></llmlingua>` tags with optional `rate` and `compress` parameters.
- Cost Savings come from reducing both prompt and generation lengths, with reported savings on GPT-4 API usage.
- KV-Cache Compression accelerates the inference process by compressing the key-value cache.
- Framework Integrations include LangChain, LlamaIndex, and Microsoft Prompt Flow, making it easy to drop into existing RAG pipelines.
- No LLM Retraining Required — the compression is applied at inference time without modifying the target LLM.
- Quantized Model Support allows running with models like TheBloke/Llama-2-7b-Chat-GPTQ using under 8GB of GPU memory.
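The structured compression tags mentioned above can be illustrated with a minimal, hypothetical parser. This is not the library's implementation; it is a sketch that splits a prompt into segments carrying per-segment compression rates, assuming a `<llmlingua, rate=...>` tag form and handling only the `rate` parameter:

```python
import re

# Toy pattern for <llmlingua>...</llmlingua> and <llmlingua, rate=0.2>...
# </llmlingua> spans, loosely modeled on the tag syntax described above.
TAG = re.compile(r"<llmlingua(?:,\s*rate=([0-9.]+))?>(.*?)</llmlingua>", re.S)

def split_segments(prompt, default_rate=0.5):
    # Return (text, rate) pairs; untagged text gets the global default rate.
    segments, pos = [], 0
    for m in TAG.finditer(prompt):
        if m.start() > pos:  # plain text before the tag
            segments.append((prompt[pos:m.start()], default_rate))
        rate = float(m.group(1)) if m.group(1) else default_rate
        segments.append((m.group(2), rate))
        pos = m.end()
    if pos < len(prompt):  # trailing plain text
        segments.append((prompt[pos:], default_rate))
    return segments

prompt = "Keep this. <llmlingua, rate=0.2>Compress this hard.</llmlingua> Tail."
print(split_segments(prompt))
```

Each segment could then be compressed at its own rate, which is the point of the feature: instructions and questions stay near-verbatim while bulky context is compressed aggressively.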
Pricing
Open Source (MIT)
- LLMLingua prompt compression
- LongLLMLingua long-context compression
- LLMLingua-2 task-agnostic compression
- SecurityLingua jailbreak defense
- Structured prompt compression
Capabilities
Key Features
- Up to 20x prompt compression
- LLMLingua, LongLLMLingua, and LLMLingua-2 methods
- Task-agnostic compression via data distillation
- Structured prompt compression with custom tags
- KV-Cache compression
- SecurityLingua jailbreak defense
- No LLM retraining required
- Quantized model support
- RAG performance improvement
- Cost savings on LLM API usage
