About Attention
Transformer networks (the architecture behind ChatGPT) are built from multiple layers, and each layer contains several attention heads. Each head computes its own attention matrix by combining "queries" and "keys", the fundamental elements that help the network decide how much focus to give to different parts of the input.
You can think of each query as a question that a token asks, such as "Are there adjectives in front of me?" Meanwhile, each key serves as a potential answer, carrying the token's characteristics. When the model compares queries with keys, it determines the strength of their match and, therefore, how much influence one token should have on another.
For example, consider the phrase "fluffy blue monster." The token "monster" might generate a query like, "Is the word in front of me an adjective?" In this case, the tokens "fluffy" and "blue", which are adjectives, provide keys that answer this question strongly, while "monster" itself, being a noun, offers a weaker response. This interplay of questions (queries) and answers (keys) is what creates the attention matrix for each head.
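To make the query/key picture concrete, here is a minimal sketch of how a single head turns them into an attention matrix. The token embeddings and projection matrices below are made-up toy values (a trained model learns them), but the mechanics, dot products between queries and keys followed by a softmax, are standard scaled dot-product attention.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Toy 4-dimensional embeddings for the three tokens; the numbers are
# invented for illustration, not taken from a trained model.
tokens = ["fluffy", "blue", "monster"]
embeddings = np.array([
    [0.9, 0.1, 0.3, 0.0],   # "fluffy"
    [0.8, 0.2, 0.1, 0.1],   # "blue"
    [0.1, 0.9, 0.0, 0.7],   # "monster"
])

d_k = 4
rng = np.random.default_rng(0)
W_q = rng.normal(size=(4, d_k))   # learned in a real model, random here
W_k = rng.normal(size=(4, d_k))

Q = embeddings @ W_q              # each token asks its "question"
K = embeddings @ W_k              # each token offers its "answer"

# Scaled dot product: how well does each query match each key?
scores = Q @ K.T / np.sqrt(d_k)
attention = softmax(scores)       # each row is one token's attention distribution

for tok, row in zip(tokens, attention):
    print(tok, np.round(row, 2))
```

Each row of the resulting matrix is one token's query scored against every key, so the row for "monster" tells you how strongly it attends to "fluffy", "blue", and itself.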
Each attention head focuses on different relationships and patterns within the text, allowing the network to capture a rich and nuanced understanding of the language. Despite the critical role that these attention mechanisms play, it's interesting to note that only about one third of all the weights in a large language model are actually in the attention blocks. So while the famous slogan "attention is all you need" highlights the importance of these connections, in terms of sheer weight, it's only one third of what you really need!
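Continuing the toy sketch above (and reusing its softmax, embeddings, rng, and d_k), the "several heads per layer" idea is just the same computation repeated with a separate pair of projections per head. Real implementations apply one large projection and split it across heads, but the outcome is the same: one attention matrix per head, which is exactly what the visualization below displays.

```python
n_heads = 4
head_attentions = []
for h in range(n_heads):
    # In a real model each head has its own learned projections; random toy values here.
    W_q_h = rng.normal(size=(4, d_k))
    W_k_h = rng.normal(size=(4, d_k))
    scores_h = (embeddings @ W_q_h) @ (embeddings @ W_k_h).T / np.sqrt(d_k)
    head_attentions.append(softmax(scores_h))

# head_attentions now holds one 3x3 attention matrix per head
```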
Made with <3 by Ferdi & Samu. Credits for the model view below go to BertViz.
A Deep Gaze into Attention Heads
Type in a token sequence (under 50 characters) and hit process. After some loading time, you will be able to see the attention patterns of individual so-called "heads" in the LLM. Each head focuses on different aspects of the input text, and by visualizing these patterns, you can gain insight into how the model processes and understands language.
Here is an example view of a head, with tokens on each side. If you see a connection between two tokens, the head is paying attention to the relationship between them. This way you can spot attention heads that "pay attention" to the previous token, the first token, or other patterns. Click on an attention head to select that head within its layer. Afterwards, you can hover over tokens to see the attention weights of the selected head for that token.
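If you would like to reproduce these views in your own notebook, here is a rough sketch of how attention weights can be pulled out of a Hugging Face model and handed to BertViz. The model name is only an example and is not necessarily the one running behind this page.

```python
from transformers import AutoTokenizer, AutoModel
from bertviz import model_view

model_name = "bert-base-uncased"   # example model, not necessarily the one used here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

inputs = tokenizer("fluffy blue monster", return_tensors="pt")
outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shaped (batch, heads, tokens, tokens)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
model_view(outputs.attentions, tokens)   # renders the interactive grid of heads in a notebook
```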
Click on a head that looks interesting to gaze deeper into it in the next section:
Hover Visualization
By hovering over a token, you can see which other tokens are important for it. The larger a token appears, the more attention the hovered token pays to it. The token receiving the most attention is colored red.