A technique used in transformer models that limits the attention span of each token to a fixed-size window around it, reducing the cost of self-attention from quadratic to roughly linear in sequence length and making the model more efficient on long inputs.
For example, in a machine translation task, SWA can restrict attention to tokens within a fixed distance of the word currently being translated, rather than attending over the entire input sequence.
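As a minimal sketch of the idea (not an optimized implementation), the window can be expressed as a band mask over the attention scores. The function name `sliding_window_attention`, the use of PyTorch, and the single-head layout are illustrative assumptions, not taken from any particular model:

```python
import torch

def sliding_window_attention(q, k, v, window_size):
    # q, k, v: (batch, seq_len, d_model); single attention head for simplicity
    d_model = q.size(-1)
    seq_len = q.size(1)

    # Standard scaled dot-product scores
    scores = q @ k.transpose(-2, -1) / (d_model ** 0.5)

    # Band mask: token i may only attend to tokens j with |i - j| <= window_size
    idx = torch.arange(seq_len, device=q.device)
    band = (idx[None, :] - idx[:, None]).abs() <= window_size
    scores = scores.masked_fill(~band, float("-inf"))

    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# Illustrative usage: each of the 16 tokens attends only to its 4 nearest
# neighbors on either side.
q = k = v = torch.randn(1, 16, 32)
out = sliding_window_attention(q, k, v, window_size=4)
```

Note that this sketch still materializes the full score matrix for clarity; the efficiency gain in practice comes from computing only the scores inside the band, so that the cost scales with sequence length times window size rather than sequence length squared.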