Looking back, Andrej Karpathy reflects on the most beautiful or surprising idea in deep learning or AI, singling out the Transformer architecture. Neural network architectures have come and gone, each tailored to a particular sensory modality such as vision, audio, or text, but the Transformer has recently emerged as a general-purpose architecture that can process video, images, speech, and text with a single design. The paper ‘Attention Is All You Need’ introduced it in 2017, though its profound impact was not fully anticipated at the time.

Karpathy describes the Transformer as a general-purpose, differentiable computer: expressive in the forward pass, optimizable via backpropagation, and efficient because it maps onto a highly parallel compute graph. Its success, he explains, lies in its ability to perform message passing, where nodes store vectors, communicate with one another, and update each other. The architecture combines residual connections, layer normalizations, and softmax attention in a way that is both powerful and easy to optimize. The residual connections let gradients flow smoothly during backpropagation, which encourages the network to first learn short algorithms and then gradually extend them over the course of training.

Despite its stability since 2017, the Transformer has seen only minor modifications, such as reshuffling the layer normalizations into a pre-norm formulation. While the architecture has proven remarkably resilient, Karpathy believes even better architectures could still emerge. The current trend in AI is to scale up datasets and evaluations while keeping the Transformer architecture unchanged, a reflection of its robustness and versatility.
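To make the ingredients above concrete, here is a minimal sketch of a single pre-norm Transformer block, assuming PyTorch; the class name, dimensions, and MLP sizing are illustrative choices, not details from the conversation. It shows the pieces Karpathy names: softmax attention acting as message passing between token nodes, residual connections around each sub-layer, and layer normalization applied before (rather than after) each sub-layer.

```python
# Minimal pre-norm Transformer block (illustrative sketch, assuming PyTorch).
import torch
import torch.nn as nn

class PreNormTransformerBlock(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)   # pre-norm before attention
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)   # pre-norm before the MLP
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "Message passing": each token (node) stores a vector, attends to the
        # other tokens via softmax attention, and adds the aggregated message
        # back onto its residual stream.
        h = self.norm1(x)
        msg, _ = self.attn(h, h, h)          # softmax attention over tokens
        x = x + msg                          # residual connection keeps gradients flowing
        x = x + self.mlp(self.norm2(x))      # per-node update, again with a residual
        return x

# Usage: a batch of 2 sequences, 10 tokens each, 64-dimensional vector per token.
tokens = torch.randn(2, 10, 64)
block = PreNormTransformerBlock()
out = block(tokens)
print(out.shape)  # torch.Size([2, 10, 64])
```

Because each sub-layer only adds a correction onto the unchanged residual stream, gradients can pass straight through the additions, which is one way to read Karpathy's point about the network learning short algorithms first and lengthening them as training proceeds.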