StreamingLLM: Efficient Framework for Infinite Sequence Length Generalization

We introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite-length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM enables Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding-window-with-recomputation baseline by up to a 22.2x speedup.
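The core idea is a KV-cache eviction policy: keep the keys and values of a few initial tokens (the attention sinks) together with a rolling window of the most recent tokens, and evict everything in between. Below is a minimal, hypothetical sketch of that policy for a Hugging Face-style per-layer KV cache; the function name and the defaults (4 sink tokens, a 2044-token recent window) are illustrative only, and the actual implementation also re-assigns position ids relative to positions within the cache, which this sketch omits.

```python
import torch

def evict_kv_cache(past_key_values, n_sink=4, window=2044):
    """Illustrative attention-sink eviction policy (not the official API).

    Keeps the first `n_sink` tokens plus the most recent `window` tokens
    and drops the middle of the cache. `past_key_values` is assumed to be
    a tuple of per-layer (key, value) tensors, each shaped
    [batch, heads, seq_len, head_dim].
    """
    seq_len = past_key_values[0][0].size(2)
    if seq_len <= n_sink + window:
        return past_key_values  # cache still fits; nothing to evict
    kept = []
    for k, v in past_key_values:
        # Concatenate sink tokens with the recent window along the
        # sequence dimension, discarding everything in between.
        k = torch.cat([k[:, :, :n_sink], k[:, :, -window:]], dim=2)
        v = torch.cat([v[:, :, :n_sink], v[:, :, -window:]], dim=2)
        kept.append((k, v))
    return tuple(kept)
```

Because the evicted middle tokens are never recomputed, the cache stays at a fixed size no matter how long the stream runs, which is where the speedup over recomputing a sliding window comes from.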

Guangxuan Xiao
February 3, 2024
Efficient Streaming Language Models with Attention Sinks - ICLR 2024