Attention Is All You Need

Figure 1: The Transformer model architecture.

Figure 2: Different transformer models and their number of parameters.

The Core Technology

This paper introduces the Transformer, a novel machine learning architecture based solely on attention mechanisms, eliminating the need for recurrent or convolutional neural networks.

Imagine you’re playing with blocks, each showing a different animal. You want to build a tower with land and water animals side by side, but you can only hold a few blocks at a time.

  • You could handle one block at a time, deciding its position in the tower before picking up the next. This resembles a Recurrent Neural Network (RNN). RNNs read one piece of data at a time and carry a memory of what came before into each decision.

  • Alternatively, you might evaluate several blocks together, like the Convolutional Neural Network (CNN). CNNs assess information chunks to understand the bigger picture.

However, the most efficient strategy is to lay out all the blocks and look at them at once, which is what the Transformer does. Using “attention”, it takes in all the information simultaneously and works out the importance of each piece and its relationship to the others, regardless of position.

RNNs process sequential data, which suits them to language translation or speech recognition, while CNNs process grid-like data hierarchically, making them ideal for image recognition or object detection.

Despite their dominance when the paper was written, RNNs and CNNs are complex structures, processing data sequentially or hierarchically. The authors suggest replacing them with the simpler, more efficient Transformer model.

Attention Mechanisms

To understand the Transformer, we first need to understand attention mechanisms. Imagine you’re reading a book and come across a sentence that says, “John threw the ball to his dog, and it was very happy.” You instinctively know that “it” refers to the dog, not the ball. That’s because your brain pays more attention to the word “dog” when trying to understand the meaning of “it”.

This is similar to how attention mechanisms work in machine learning. They help the model focus on the important parts of the data when making predictions. In the context of language translation, for example, the model might pay more attention to the subject and verb of a sentence, and less attention to other parts of the sentence.
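
The idea can be sketched in a few lines of Python. This is a minimal illustration of scaled dot-product attention, the form of attention the paper uses, written with the standard library only. The two-dimensional word vectors are hand-made for the example and are not taken from the paper, and the learned projections that a real Transformer applies to produce queries, keys, and values are left out.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy vectors: "it" is deliberately placed closer to "dog".
tokens = ["ball", "dog", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.7]]

outputs, weights = attention(vectors, vectors, vectors)
for token, weight in zip(tokens, weights[2]):
    print(f"'it' attends to '{token}' with weight {weight:.2f}")
```

With these made-up vectors, the row of weights for “it” puts the most weight on “dog” and the least on “ball”, which mirrors the reading example above.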

The Transformer Architecture

The Transformer architecture is designed to take full advantage of these attention mechanisms. It’s a simpler model than recurrent or convolutional neural networks, but it’s also more effective.

The authors show that the Transformer can be trained faster and performs better on tasks like translating English to German or English to French. For example, the researchers trained a Transformer model on eight GPUs for 3.5 days and it achieved a new single-model record BLEU score of 41.8 on the WMT 2014 English-to-French translation task. This is a significant achievement, as it demonstrates that a simpler model can outperform more complex ones.

Examples of the Transformer in Action

Think of the Transformer model as a highly skilled translator at a United Nations meeting. The translator listens to a speaker talking about changes in American laws, where the phrase “making…more difficult” is particularly important for understanding the speaker’s point. The model does something similar: when it processes a sentence about American laws, it focuses on the phrase “making…more difficult”. By paying attention to this key phrase, it can better understand the sentence and translate it accurately, just as the translator at the UN meeting focuses on the speaker’s key points.

The Transformer does not rely on a single point of focus, though. It uses several attention heads, which work like a team of researchers studying the same sentence, each with a different specialty. Each attention head focuses on a different aspect of the input data. One might focus on the structure of a sentence, another on the meaning of specific words, and so on.

By working together, the attention heads (like the team of researchers) can capture a wide range of information from the sentence, leading to a more comprehensive understanding and better translation.

So, in essence, the Transformer model is like a highly skilled team, with each member (attention head) focusing on what they do best to collectively understand and translate sentences accurately.
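
As a rough sketch of how several heads divide the work, the code below splits each token vector into equal slices, runs attention on each slice independently, and joins the results. It is a simplification under stated assumptions: the real model also applies learned projection matrices before and after this step, which are omitted here, and the input numbers are invented. The paper's base model uses 8 heads over 512-dimensional vectors; this toy uses 2 heads over 4 dimensions.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention; returns one output vector per query."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def multi_head_attention(x, num_heads):
    """Self-attention with the feature dimension split across heads.

    Simplification: no learned projections, so each head just sees
    its own slice of the input vectors.
    """
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    head_outputs = []
    for h in range(num_heads):
        # Each head looks at a different slice of every token vector.
        piece = [row[h * d_head:(h + 1) * d_head] for row in x]
        head_outputs.append(attention(piece, piece, piece))

    # Concatenate the heads' results token by token.
    combined = []
    for i in range(len(x)):
        row = []
        for head in head_outputs:
            row.extend(head[i])
        combined.append(row)
    return combined

# Three hypothetical tokens, each a 4-dimensional vector.
x = [[1.0, 0.0, 0.5, 0.5],
     [0.0, 1.0, 0.2, 0.8],
     [0.3, 0.7, 0.9, 0.1]]
result = multi_head_attention(x, num_heads=2)
print(len(result), "tokens,", len(result[0]), "features each")
```

The output has the same shape as the input, so the block can be stacked, which is how the full architecture repeats it layer after layer.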

Training the Model

The researchers used a method called byte-pair encoding to prepare the data for training. This method breaks words down into smaller, frequently occurring subword units, so a fixed-size vocabulary can still represent rare words and the data becomes easier for the model to process.
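
To give a feel for how byte-pair encoding builds its subword units, here is a small sketch of the merge-learning loop: start from single characters and repeatedly fuse the most frequent adjacent pair. The word counts are invented for illustration, the function names are hypothetical, and this is not the tooling the authors used.

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single fused symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a word-frequency table."""
    vocab = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab

# Hypothetical toy corpus: word -> number of occurrences.
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_counts, num_merges=2)
print(merges)
print(list(vocab))
```

After two merges on this toy corpus, the common ending "est" has become a single unit, so "newest" and "widest" share a piece the model can reuse.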

The model was trained on a machine with eight NVIDIA P100 GPUs. The authors used the Adam optimizer, a popular method for training neural networks, and they varied the learning rate during training: it rises steadily during an initial warmup period and then gradually decreases, which improves the model’s performance.
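
The learning rate schedule described in the paper can be written as a one-line formula: the rate grows linearly for the first warmup steps, then decays in proportion to the inverse square root of the step number. The sketch below follows that formula with the paper's reported values of 512 for the model dimension and 4000 warmup steps. The function name is made up for this example.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (steps start at 1)."""
    step = max(step, 1)  # guard against step 0, which would divide by zero
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Print the rate at a few points: rising, at the peak, and decaying.
for step in (100, 1000, 4000, 20000, 100000):
    print(f"step {step:>6}: lr = {transformer_lr(step):.6f}")
```

The rate peaks exactly at the end of warmup. Starting small keeps the early updates from being too large while the model's weights are still close to random.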

They also used a technique called dropout to prevent the model from overfitting the training data. Overfitting is a common problem in machine learning where the model learns the training data too well and performs poorly on new data. Dropout helps to prevent this by randomly ignoring some neurons during training, which forces the model to learn more robust patterns in the data.
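
Here is a minimal sketch of dropout on a plain list of numbers. It uses the common "inverted" formulation, where surviving values are scaled up during training so nothing needs to change at prediction time. That formulation is an assumption of this example, since the paper does not spell out the implementation. The default rate of 0.1 matches the value the paper reports for its base model.

```python
import random

def dropout(activations, p_drop=0.1, training=True, rng=random):
    """Randomly zero out values during training (inverted dropout)."""
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    if not training or p_drop == 0.0:
        return list(activations)  # dropout is switched off at prediction time
    keep = 1.0 - p_drop
    # Keep each value with probability `keep`, scaling survivors by 1/keep
    # so the expected value of each output matches its input.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the example is repeatable
layer_output = [1.0] * 10
print(dropout(layer_output, p_drop=0.1, rng=rng))
print(dropout(layer_output, training=False))
```

Because a different random subset is silenced on every training step, the model cannot lean too heavily on any single neuron.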


The Transformer represents a significant step forward in machine learning. It’s a simpler and more efficient model that outperformed more complex models on the paper’s machine translation benchmarks. The authors also suggest that attention could make models more interpretable: by inspecting which words each attention head focuses on, it becomes easier to understand why the model makes the predictions it does.

This research paper is a great example of how simplifying a model can lead to better performance. It also highlights the importance of attention mechanisms in machine learning, and how they can be used to create more effective and interpretable models.
