LONGNET Architecture enables scaling Transformers to over 1 billion tokens through a multi-scale dilated attention module, but the impact of such massive contexts remains unproven.
Transformers have become ubiquitous across natural language processing, demonstrating major advantages on tasks like translation, text generation, and question answering. However, the standard Transformer’s dot-product self-attention mechanism scales quadratically with sequence length. This quadratic complexity severely restricts the maximum length of sequences that even the largest Transformer models can process, topping out at a few hundred thousand tokens.
The newly proposed LONGNET architecture is presented as the first to scale Transformer sequence lengths beyond 1 billion tokens. This increase of several orders of magnitude could let Transformers model much longer-range dependencies and draw on far more context.
Limitations of Standard Transformers
Transformers have achieved state-of-the-art results across an array of NLP tasks. But all Transformer models rely on a self-attention mechanism that compares each token to every other token in the sequence. This dot-product self-attention has a computational and memory cost that grows quadratically with the number of tokens.
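To make the quadratic growth concrete, here is a small back-of-the-envelope sketch. It counts the pairwise scores in a single dense attention map (one head, one layer) and the memory they would take if stored at 2 bytes each. The function names and the 2-byte assumption are illustrative and not taken from the paper. Fused kernels avoid storing the full map, but the number of scores they must compute still grows the same way.

```python
def attention_score_entries(n_tokens: int) -> int:
    """Pairwise query-key scores in one dense attention map."""
    return n_tokens * n_tokens


def attention_score_bytes(n_tokens: int, bytes_per_score: int = 2) -> int:
    """Bytes needed to hold that map (2 bytes per score is an assumption)."""
    return attention_score_entries(n_tokens) * bytes_per_score


for n in (8_000, 64_000, 512_000):
    entries = attention_score_entries(n)
    gib = attention_score_bytes(n) / 2**30
    print(f"{n:>9,} tokens: {entries:.2e} scores, about {gib:,.1f} GiB")
```

Doubling the sequence length quadruples both numbers, which is why simply adding hardware does not get dense attention to very long sequences.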
As a result, Transformer models today can only process sequences of a few hundred thousand tokens even when heavily optimized. For comparison, traditional RNNs and other sequence models are not as limited in sequence length but sacrifice the powerful capabilities of self-attention.
This sequence length restriction prevents Transformers from fully modeling long documents or leveraging broader context. Removing this bottleneck could unlock even greater performance.
Introducing the LONGNET Architecture
To overcome the quadratic limitation of dot-product self-attention, researchers Jiayu Ding, Shuming Ma, Li Dong et al. propose a new Transformer architecture called LONGNET. The key innovation in LONGNET is introducing a multi-scale “dilated attention” module to replace standard self-attention.
Dilated attention allocates computation according to the distance between tokens: nearby tokens are attended to densely, distant ones increasingly sparsely. It splits the sequence into segments and, within each segment, keeps only every r-th token, where r is the dilation rate. Several pairs of segment length and dilation rate are combined, with both growing geometrically, so attention becomes exponentially sparser as distance increases.
This approximates global attention while reducing the computational cost from the O(N^2) of standard attention to O(N), linear in the sequence length N. By shifting which tokens are skipped in each segment, for example across attention heads, LONGNET ensures it retains access to all tokens in the sequence.
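The segment-and-skip pattern described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: the function names, the `offset` parameter (standing in for the shifting of skipped tokens), and the particular scales are assumptions chosen for readability.

```python
def dilated_indices(n_tokens, segment_len, dilation, offset=0):
    """For each segment, list the token positions kept by the dilation."""
    segments = []
    for start in range(0, n_tokens, segment_len):
        end = min(start + segment_len, n_tokens)
        first = start + (offset % dilation)
        segments.append(list(range(first, end, dilation)))
    return segments


def dilated_cost(n_tokens, scales):
    """Count query-key pairs scored over all (segment_len, dilation) scales."""
    total = 0
    for segment_len, dilation in scales:
        for kept in dilated_indices(n_tokens, segment_len, dilation):
            total += len(kept) ** 2  # dense attention among the kept tokens
    return total


# Hypothetical scales: segment length and dilation both grow geometrically.
scales = [(4, 1), (8, 2), (16, 4)]

print(dilated_indices(16, segment_len=8, dilation=2))
print(dilated_indices(16, segment_len=8, dilation=2, offset=1))
for n in (16, 32, 64):
    print(n, "tokens:", dilated_cost(n, scales), "pairs vs dense", n * n)
```

With a fixed set of scales, doubling the number of tokens doubles the pair count instead of quadrupling it, which is the linear behavior the text describes. Changing `offset` moves the kept positions, so different offsets together cover every token in a segment.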
Leveraging Efficient Kernels
A major benefit of the LONGNET architecture is that its dilated attention can be directly substituted into any standard Transformer model. Because dilated attention can be expressed as ordinary dense attention over the tokens selected within each segment, with a gather step before and a scatter step after, LONGNET can immediately take advantage of highly optimized kernels for dot-product attention, such as FlashAttention.
This lets LONGNET models benefit from the speed-ups and memory savings of these fused, exact attention kernels during training, without requiring a custom sparse kernel. The kernel sees only dense attention on a shorter input; the dilation pattern is handled outside it.
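The idea of wrapping a dense kernel with a gather and a scatter can be illustrated as follows. This is a sketch under simplifying assumptions: `dense_attention` is a plain-Python stand-in for an optimized kernel, causal masking and multiple heads are omitted, and only a single (segment length, dilation) pair is shown. All names are hypothetical.

```python
import math


def dense_attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V on lists of vectors (stand-in for a fast kernel)."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / norm
                    for c in range(len(v[0]))])
    return out


def dilated_segment_attention(q, k, v, segment_len, dilation):
    """Gather the kept rows per segment, run dense attention, scatter back."""
    n = len(q)
    out = [None] * n  # positions skipped at this scale stay None
    for start in range(0, n, segment_len):
        idx = list(range(start, min(start + segment_len, n), dilation))
        q_s = [q[i] for i in idx]  # gather
        k_s = [k[i] for i in idx]
        v_s = [v[i] for i in idx]
        o_s = dense_attention(q_s, k_s, v_s)  # unmodified dense kernel
        for i, o in zip(idx, o_s):  # scatter
            out[i] = o
    return out
```

The dense routine never needs to know about the dilation, so any exact attention kernel could be swapped in for `dense_attention`. In a full model the positions left as `None` here would be filled by other scales or offsets.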
Scaling up Training
Because its cost is linear and its attention is organized into segments, LONGNET can split a single sequence across multiple devices during training and reach unprecedented sequence lengths.
LONGNET partitions the sequence into chunks, and each device processes its own portion of the tokens locally. For long-range dependencies that span chunks, an efficient collective communication step shares information between devices.
This distributed training scheme allows LONGNET to scale to sequences of 1 billion tokens while keeping per-device runtime and memory roughly constant as devices are added. Training on such long sequences would be completely infeasible for standard Transformer architectures.
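The chunk-and-share scheme can be simulated on a single machine. This is a toy sketch and not the authors' distributed code: devices are modeled as list entries, `all_gather` imitates a collective operation, and thinning each device's local keys by the dilation rate before sharing is an assumption about how the communication stays cheap. All names are hypothetical.

```python
import math


def partition(tokens, n_devices):
    """Split the sequence into contiguous chunks, one per simulated device."""
    chunk_len = math.ceil(len(tokens) / n_devices)
    return [tokens[i:i + chunk_len] for i in range(0, len(tokens), chunk_len)]


def all_gather(per_device):
    """Simulated collective: every device receives all pieces, concatenated."""
    merged = [item for piece in per_device for item in piece]
    return [list(merged) for _ in per_device]


def share_sparsified_keys(chunks, dilation):
    """Each device keeps every dilation-th local key, then all devices exchange."""
    local = [chunk[::dilation] for chunk in chunks]
    return all_gather(local)


chunks = partition(list(range(16)), n_devices=4)
gathered = share_sparsified_keys(chunks, dilation=4)
print(gathered[0])  # prints [0, 4, 8, 12]
```

Each device sends only its thinned keys, so the amount of data exchanged scales with the sequence length divided by the dilation rate and not with the full sequence.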
The researchers validate LONGNET on language modeling tasks, reporting lower (better) perplexity than Sparse Transformers at sequence lengths up to 32,000 tokens. LONGNET models keep improving as the training context grows within that range. The 1-billion-token figure comes from scaling the sequence length in runtime measurements, not from a model trained and evaluated at that length.
Moreover, experiments show that LONGNET benefits from larger context windows during inference. This suggests that longer contexts provide useful signals that improve predictions, though whether the gains carry over to billion-token contexts has yet to be demonstrated.
The LONGNET architecture represents a breakthrough in overcoming the quadratic self-attention bottleneck that has severely limited Transformer sequence lengths. It is not by itself a path to AGI, but scaling up to 1 billion tokens could let Transformers draw on more context and longer-range dependencies.
Unlocking such rich billion-token representations could open doors for new applications across domains including scientific research, medicine, education, and more. More broadly, LONGNET highlights the importance of continued research into new and improved Transformer architectures.