StreamingLLM: Efficient Framework for Infinite Sequence Length Generalization

We introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite-length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM enables Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding-window-with-recomputation baseline by up to a 22.2x speedup.
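The core idea is a KV-cache eviction policy: keep the keys and values of a few initial tokens (the attention sinks) together with a rolling window of the most recent tokens, and evict everything in between. Below is a minimal, hypothetical sketch of that policy for a Hugging Face-style per-layer KV cache; the function name and the defaults (4 sink tokens, a 2044-token recent window) are illustrative only, and the actual implementation also re-assigns position ids relative to positions within the cache, which this sketch omits.

```python
import torch

def evict_kv_cache(past_key_values, n_sink=4, window=2044):
    """Illustrative attention-sink eviction policy (not the official API).

    Keeps the first `n_sink` tokens plus the most recent `window` tokens
    and drops the middle of the cache. `past_key_values` is assumed to be
    a tuple of per-layer (key, value) tensors, each shaped
    [batch, heads, seq_len, head_dim].
    """
    seq_len = past_key_values[0][0].size(2)
    if seq_len <= n_sink + window:
        return past_key_values  # cache still fits; nothing to evict
    kept = []
    for k, v in past_key_values:
        # Concatenate sink tokens with the recent window along the
        # sequence dimension, discarding everything in between.
        k = torch.cat([k[:, :, :n_sink], k[:, :, -window:]], dim=2)
        v = torch.cat([v[:, :, :n_sink], v[:, :, -window:]], dim=2)
        kept.append((k, v))
    return tuple(kept)
```

Because the evicted middle tokens are never recomputed, the cache stays at a fixed size no matter how long the stream runs, which is where the speedup over recomputing a sliding window comes from.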

Guangxuan Xiao
February 3, 2024
Efficient Streaming Language Models with Attention Sinks - ICLR 2024