LLMLingua
An open-source prompt compression library that reduces LLM prompt lengths by up to 20x using a compact language model to remove non-essential tokens with minimal performance loss.
At a Glance
Fully free and open-source under the MIT License. Free to use, modify, and distribute.
Listed May 2026
About LLMLingua
LLMLingua is an open-source Python library developed by Microsoft Research that compresses prompts for large language models (LLMs) by up to 20x, reducing inference costs and latency with minimal performance degradation. It uses a compact, well-trained language model (e.g., GPT2-small, LLaMA-7B) to identify and remove non-essential tokens from prompts. The library includes three main methods — LLMLingua, LongLLMLingua, and LLMLingua-2 — each targeting different compression scenarios, plus SecurityLingua for jailbreak defense.
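The core idea described above, using a small model to score each token's importance and keeping only the highest-scoring fraction, can be illustrated with a toy sketch. The real library scores tokens with a small causal language model's perplexity; here a hypothetical rarity-based scorer (`compress` is an illustrative function, not part of the llmlingua API) stands in for it:

```python
from collections import Counter

def compress(tokens, rate=0.5):
    # Toy importance scorer: rarer tokens are treated as more informative,
    # so frequent tokens are dropped first. LLMLingua itself uses a small
    # LM's per-token perplexity instead of raw frequency.
    counts = Counter(tokens)
    keep_n = max(1, int(len(tokens) * rate))
    # Rank token positions by rarity (lowest count = highest importance);
    # the sort is stable, so ties keep their original order.
    ranked = sorted(range(len(tokens)), key=lambda i: counts[tokens[i]])
    keep = set(ranked[:keep_n])
    # Emit surviving tokens in their original order.
    return [t for i, t in enumerate(tokens) if i in keep]

prompt = "the quick brown fox jumps over the lazy dog the the".split()
print(compress(prompt, rate=0.5))  # keeps the 5 rarest tokens
```

With `rate=0.5`, the repeated filler token "the" is pruned while the content-bearing words survive, which is the intuition behind perplexity-based compression: predictable tokens can be removed with little information loss.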
- LLMLingua compresses prompts using a small language model to identify and drop low-importance tokens; install via `pip install llmlingua` and use the `PromptCompressor` class to compress any prompt.
- LongLLMLingua addresses the "lost in the middle" problem in long-context LLMs, improving RAG performance by up to 21.4% while using only 1/4 of the tokens; enable it with the `rank_method="longllmlingua"` parameter.
- LLMLingua-2 is a task-agnostic compression method trained via data distillation from GPT-4, running 3x to 6x faster than the original LLMLingua; enable it with `use_llmlingua2=True`.
- SecurityLingua is a safety guardrail that uses security-aware prompt compression to detect jailbreak attacks with 100x lower token cost than state-of-the-art guardrail approaches.
- Structured Prompt Compression allows fine-grained control over which sections to compress using `<llmlingua></llmlingua>` tags with optional `rate` and `compress` parameters.
- Cost Savings come from reducing both prompt and generation lengths, with reported savings on GPT-4 API usage.
- KV-Cache Compression accelerates the inference process by compressing the key-value cache.
- Framework Integrations include LangChain, LlamaIndex, and Microsoft Prompt Flow, making it easy to drop into existing RAG pipelines.
- No LLM Retraining Required — the compression is applied at inference time without modifying the target LLM.
- Quantized Model Support allows running with models like TheBloke/Llama-2-7b-Chat-GPTQ using under 8GB of GPU memory.
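The structured compression tags mentioned above can be illustrated with a minimal, hypothetical parser. This is not the library's implementation; it is a sketch that splits a prompt into segments carrying per-segment compression rates, assuming a `<llmlingua, rate=...>` tag form and handling only the `rate` parameter:

```python
import re

# Toy pattern for <llmlingua>...</llmlingua> and <llmlingua, rate=0.2>...
# </llmlingua> spans, loosely modeled on the tag syntax described above.
TAG = re.compile(r"<llmlingua(?:,\s*rate=([0-9.]+))?>(.*?)</llmlingua>", re.S)

def split_segments(prompt, default_rate=0.5):
    # Return (text, rate) pairs; untagged text gets the global default rate.
    segments, pos = [], 0
    for m in TAG.finditer(prompt):
        if m.start() > pos:  # plain text before the tag
            segments.append((prompt[pos:m.start()], default_rate))
        rate = float(m.group(1)) if m.group(1) else default_rate
        segments.append((m.group(2), rate))
        pos = m.end()
    if pos < len(prompt):  # trailing plain text
        segments.append((prompt[pos:], default_rate))
    return segments

prompt = "Keep this. <llmlingua, rate=0.2>Compress this hard.</llmlingua> Tail."
print(split_segments(prompt))
```

Each segment could then be compressed at its own rate, which is the point of the feature: instructions and questions stay near-verbatim while bulky context is compressed aggressively.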
Pricing
Open Source (MIT)
- LLMLingua prompt compression
- LongLLMLingua long-context compression
- LLMLingua-2 task-agnostic compression
- SecurityLingua jailbreak defense
- Structured prompt compression
Capabilities
Key Features
- Up to 20x prompt compression
- LLMLingua, LongLLMLingua, and LLMLingua-2 methods
- Task-agnostic compression via data distillation
- Structured prompt compression with custom tags
- KV-Cache compression
- SecurityLingua jailbreak defense
- No LLM retraining required
- Quantized model support
- RAG performance improvement
- Cost savings on LLM API usage
