# LLMLingua

> An open-source prompt compression library that reduces LLM prompt lengths by up to 20x using a compact language model to remove non-essential tokens with minimal performance loss.

LLMLingua is an open-source Python library developed by Microsoft Research that compresses prompts for large language models (LLMs) by up to 20x, reducing inference costs and latency with minimal performance degradation. It uses a compact, well-trained language model (e.g., GPT2-small, LLaMA-7B) to identify and remove non-essential tokens from prompts. The library includes three main methods — LLMLingua, LongLLMLingua, and LLMLingua-2 — each targeting different compression scenarios, plus SecurityLingua for jailbreak defense.

- **LLMLingua** compresses prompts using a small language model to identify and drop low-importance tokens; install via `pip install llmlingua` and use the `PromptCompressor` class to compress any prompt.
- **LongLLMLingua** addresses the "lost in the middle" problem in long-context LLMs, improving RAG performance by up to 21.4% using only 1/4 of the tokens; use the `rank_method="longllmlingua"` parameter.
- **LLMLingua-2** is a task-agnostic compression method trained via data distillation from GPT-4, offering 3x–6x faster performance than LLMLingua; enable it with `use_llmlingua2=True`.
- **SecurityLingua** is a safety guardrail that uses security-aware prompt compression to detect jailbreak attacks with 100x fewer token costs than state-of-the-art guardrail approaches.
- **Structured Prompt Compression** allows fine-grained control over which sections to compress using `<llmlingua></llmlingua>` tags with optional `rate` and `compress` parameters.
- **Cost Savings** are achieved by reducing both prompt and generation lengths, with reported savings on GPT-4 API usage.
- **KV-Cache Compression** accelerates the inference process by compressing the key-value cache.
- **Framework Integrations** include LangChain, LlamaIndex, and Microsoft Prompt Flow, making it easy to drop into existing RAG pipelines.
- **No LLM Retraining Required** — the compression is applied at inference time without modifying the target LLM.
- **Quantized Model Support** allows running with models like TheBloke/Llama-2-7b-Chat-GPTQ using under 8GB of GPU memory.

## Features
- Up to 20x prompt compression
- LLMLingua, LongLLMLingua, and LLMLingua-2 methods
- Task-agnostic compression via data distillation
- Structured prompt compression with custom tags
- KV-Cache compression
- SecurityLingua jailbreak defense
- No LLM retraining required
- Quantized model support
- RAG performance improvement
- Cost savings on LLM API usage

## Integrations
Frameworks: LangChain, LlamaIndex, Microsoft Prompt Flow. Models: GPT-4, GPT-2, LLaMA, phi-2 (via Hugging Face).

## Platforms
API, Developer SDK, CLI

## Pricing
Open Source

## Version
v0.2.2

## Links
- Website: https://llmlingua.com/
- Documentation: https://github.com/microsoft/LLMLingua/blob/main/DOCUMENT.md
- Repository: https://github.com/microsoft/LLMLingua
- EveryDev.ai: https://www.everydev.ai/tools/llmlingua
