LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
Abstract
LLMLingua compresses prompts for large language models using a budget controller, a token-level iterative compression algorithm, and instruction tuning, maintaining performance while achieving high compression ratios.
Large language models (LLMs) have been applied in various applications due to their astonishing capabilities. With advancements in technologies such as chain-of-thought (CoT) prompting and in-context learning (ICL), the prompts fed to LLMs are becoming increasingly lengthy, even exceeding tens of thousands of tokens. To accelerate model inference and reduce cost, this paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity under high compression ratios, a token-level iterative compression algorithm to better model the interdependence between compressed contents, and an instruction-tuning-based method for distribution alignment between language models. We conduct experiments and analysis over four datasets from different scenarios, i.e., GSM8K, BBH, ShareGPT, and Arxiv-March23, showing that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss. Our code is available at https://aka.ms/LLMLingua.
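To make the token-level idea concrete, here is a minimal, hypothetical sketch, not the authors' implementation: the actual method compresses iteratively over segments under a budget controller and aligns the small model's distribution via instruction tuning. The sketch scores each token's self-information under a small causal language model (gpt2 here, an arbitrary stand-in) and keeps only the most informative tokens; the compress function and keep_ratio parameter are invented for illustration.

```python
# Hypothetical sketch of perplexity-based token pruning, the intuition behind
# LLMLingua's token-level step. Simplified: no budget controller, no iterative
# segment-wise conditioning, no distribution alignment.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def compress(prompt: str, keep_ratio: float = 0.5) -> str:
    """Drop the least informative tokens, keeping the rest in order."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Self-information of each token given its prefix (shift logits by one).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_info = -log_probs.gather(1, ids[0, 1:, None]).squeeze(1)
    # Keep the first token plus the k most informative later tokens.
    k = max(1, int(keep_ratio * token_info.numel()))
    keep = torch.topk(token_info, k).indices.sort().values + 1
    kept_ids = torch.cat([ids[0, :1], ids[0][keep]])
    return tokenizer.decode(kept_ids)

print(compress("Question: Natalia sold clips to 48 of her friends in April, "
               "and then she sold half as many clips in May. How many clips "
               "did Natalia sell altogether?", keep_ratio=0.4))
```

Tokens whose self-information is high (i.e., poorly predicted from context) tend to carry content the LLM cannot reconstruct on its own, which is why the paper's approach drops low-information tokens first; the full method released at https://aka.ms/LLMLingua refines this with the budget controller and iterative compression described above.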
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference (2026)
- BEAVER: A Training-Free Hierarchical Prompt Compression Method via Structure-Aware Page Selection (2026)
- DiffuMask: Diffusion Language Model for Token-level Prompt Pruning (2026)
- MemoSight: Unifying Context Compression and Multi Token Prediction for Reasoning Acceleration (2026)
- PoC: Performance-oriented Context Compression for Large Language Models via Performance Prediction (2026)
- Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling (2026)
- ZoomR: Memory Efficient Reasoning through Multi-Granularity Key Value Retrieval (2026)