LLMLingua: Efficient Token Removal for Large Language Models

LLMLingua uses a compact, well-trained language model (e.g., GPT-2 small, LLaMA-7B) to identify and remove non-essential tokens from prompts. This enables efficient inference with large language models (LLMs), achieving up to 20x compression with minimal performance loss.
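The core idea can be sketched as perplexity-based token pruning: a small LM scores each token's self-information, and low-information tokens are dropped while the rest keep their original order. The sketch below is illustrative only, assuming precomputed per-token scores; `compress_prompt` is a hypothetical helper, not the actual LLMLingua API, and in practice the scores would come from a small causal LM such as GPT-2 rather than being supplied by hand.

```python
import math

def compress_prompt(tokens, scores, keep_ratio=0.5):
    """Keep the highest-information tokens, preserving prompt order.

    `scores` stands in for per-token self-information (negative
    log-probability) that a small LM would assign; tokens the model
    finds predictable score low and are treated as removable.
    """
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    # Pick the k most informative token indices, then restore prompt order.
    keep = sorted(sorted(range(len(tokens)), key=lambda i: -scores[i])[:k])
    return [tokens[i] for i in keep]

tokens = ["Please", "kindly", "summarize", "the", "attached", "report"]
scores = [1.2, 0.3, 4.1, 0.5, 2.8, 3.9]  # hypothetical self-information
print(compress_prompt(tokens, scores, keep_ratio=0.5))
# → ['summarize', 'attached', 'report']
```

Filler words like "kindly" and "the" carry little information under the scoring model and are pruned first, which is why aggressive ratios still preserve task-relevant content.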

Microsoft
March 3, 2024