Microsoft Research

One Billion Tokens with the Revolutionary LONGNET Architecture

LONGNET enables scaling Transformers to over one billion tokens through a multi-scale dilated attention module, but the impact of such massive contexts remains unproven.

Introduction

Transformers have become ubiquitous across natural language processing, demonstrating major advantages on tasks such as translation, text generation, and question answering. However, the standard Transformer's dot-product self-attention mechanism scales quadratically with sequence length…
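
To make the idea of a multi-scale dilated attention module concrete, the sketch below shows one simplified interpretation: the sequence is split into segments, only every r-th position within a segment participates in a dense attention computation, and several (segment length, dilation) branches are mixed. The function names, branch settings, and the simple averaging of branches are illustrative assumptions, not the paper's actual implementation (LONGNET additionally shifts the selected positions across heads and weights branches by their attention denominators).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dilated_attention(Q, K, V, segment_len, dilation):
    """One (segment_len, dilation) branch of dilated attention.

    Q, K, V: (seq_len, d) arrays. Within each segment of length
    segment_len, only every `dilation`-th position takes part in a
    dense attention computation, so the per-segment cost drops from
    O(w^2) to O((w / r)^2). Positions skipped by this branch are left
    as zeros and are covered by other branches (or, in the paper, by
    shifting the selection across heads).
    """
    seq_len, d = Q.shape
    out = np.zeros_like(V)
    for start in range(0, seq_len, segment_len):
        idx = np.arange(start, min(start + segment_len, seq_len))[::dilation]
        q, k, v = Q[idx], K[idx], V[idx]
        scores = softmax(q @ k.T / np.sqrt(d))
        out[idx] = scores @ v
    return out

def multi_scale_dilated_attention(Q, K, V, branches):
    """Mix several (segment_len, dilation) branches so that short-range
    detail and long-range coverage are both captured. The plain average
    used here is an illustrative stand-in for the paper's weighting
    scheme."""
    outs = [dilated_attention(Q, K, V, w, r) for w, r in branches]
    return np.mean(outs, axis=0)

# Toy usage: 1,024 tokens, 64-dim queries/keys/values, three scales.
rng = np.random.default_rng(0)
Q = rng.normal(size=(1024, 64))
K = rng.normal(size=(1024, 64))
V = rng.normal(size=(1024, 64))
out = multi_scale_dilated_attention(Q, K, V, branches=[(128, 1), (256, 2), (512, 4)])
print(out.shape)  # (1024, 64)
```

Because each branch only ever forms attention matrices of size (w / r) x (w / r), the overall cost grows roughly linearly with sequence length rather than quadratically, which is what makes billion-token contexts tractable in principle.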