Introduction to LLMs: Final Part
The Mechanics of Query, Key, and Value Vectors
In the architecture of a Transformer model, the transition from raw text to contextual understanding relies on a sophisticated mathematical transformation. While it is common to wonder if the Query ($Q$), Key ($K$), and Value ($V$) vectors used in self-attention are simply the word embeddings themselves, they actually represent a specialized evolution of those embeddings. They are derived from the base embedding to perform distinct functional roles.
The Base Embedding
The process begins with the Base Embedding. When a model processes a word, such as “apple,” it retrieves a high-dimensional vector from a learned lookup table (the embedding matrix). This vector serves as a general-purpose representation, encoding the broad semantic properties of the word, for instance its association with fruit, the color red, or a specific technology company.
However, a static embedding is often too rigid for complex linguistic tasks. To determine how a word relates to other words in a specific sentence, the model must transform this general representation into functional components. The base embedding captures the “what” of a word, but not the “how” of its interaction with its neighbors.
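To make the lookup step concrete, here is a minimal Python sketch of an embedding table. The vocabulary, dimensionality, and values are invented for illustration and are far smaller than anything a real model uses.

```python
import numpy as np

# Toy vocabulary and embedding table (values are illustrative, not from a real model).
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "apple": 5}
d_model = 8                                   # real models use hundreds or thousands of dimensions
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

# Looking up a word is just indexing a row of the table: the same static vector
# is returned for "apple" no matter which sentence it appears in.
apple_embedding = embedding_table[vocab["apple"]]
print(apple_embedding.shape)                  # (8,)
```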
The Transformation
To move beyond a general representation, the Transformer uses three distinct, learned weight matrices: $W^Q$, $W^K$, and $W^V$. By multiplying the current token representation (the base embedding in the first layer, or the previous layer’s output in deeper layers) by these matrices, the model generates three unique vectors. Each vector represents a different projection of the same word into specialized mathematical spaces:
- The Query ($Q$) Vector: This vector represents the specific information the current word is seeking from other words in the sequence. If the word is “apple” in the context of a recipe, the Query vector focuses on identifying relevant descriptors, such as “sliced” or “baked.”
- The Key ($K$) Vector: This acts as a descriptor of the word’s own characteristics. It identifies the word’s grammatical and semantic identity so that other words can determine if it is relevant to their own Queries. Essentially, it tells the rest of the sentence, “I am a noun, I am the subject, and I represent a fruit.”
- The Value ($V$) Vector: While $Q$ and $K$ are used to calculate attention scores (determining how much focus to place on a word), the Value vector contains the actual information. This is the content that gets passed forward to the next layer of the network once the relevance, determined by the interaction of $Q$ and $K$, is established.
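The sketch below shows this projection for a single token in plain NumPy. The dimensions are made up, and the three matrices are random stand-ins for weights that a real model would learn through training.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_k = 8, 8                       # illustrative sizes, not real model dimensions

# Three learned projection matrices (random placeholders here; learned via backpropagation in practice).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

x = rng.normal(size=(d_model,))           # base embedding (or previous layer's output) for one token

# The same input vector is projected three ways, giving it three functional roles.
q = x @ W_Q   # what this token is looking for
k = x @ W_K   # how this token describes itself to others
v = x @ W_V   # the content this token offers if attended to
```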
A Concrete Example: Processing the Pronoun “it”
Consider the sentence: “The cat sat on the mat because it was soft.”
When the model processes the word “it”:
- The Query ($Q_{\text{it}}$): “it” broadcasts a question: “Who or what am I connected to? Which noun should I focus on?”
- The Keys: All words in the sequence, including “cat,” “mat,” and “it” itself, provide their Keys:
- “cat” Key: “I am an animate noun, a subject”
- “mat” Key: “I am an inanimate noun, recently mentioned as the object”
- “it” Key: “I am a pronoun”
- Attention Scores: The model computes dot products between $Q_{\text{it}}$ and all Keys. The dot products are then divided by $\sqrt{d_k}$ and normalized with softmax, which keeps gradients stable and produces weights that sum to 1. The word “mat” likely receives a higher attention score than “cat” because pronouns typically refer to the most recently mentioned appropriate noun.
- Value Aggregation: Each word’s current representation (its embedding or hidden state) is projected into Value space using the learned matrix $W^V$, i.e. $V_{\text{word}} = W^V \cdot \text{embedding}_{\text{word}}$.
These Value vectors are then combined as a weighted sum using the attention weights:
$$\text{Output}_{\text{it}} = 0.70 \cdot V_{\text{mat}} + 0.20 \cdot V_{\text{it}} + 0.10 \cdot V_{\text{cat}} + \ldots$$

Or, writing it out explicitly:

$$\text{Output}_{\text{it}} = 0.70 \cdot (W^V \cdot \text{embedding}_{\text{mat}}) + 0.20 \cdot (W^V \cdot \text{embedding}_{\text{it}}) + 0.10 \cdot (W^V \cdot \text{embedding}_{\text{cat}})$$

The $W^V$ matrix is a learned transformation that projects base embeddings into a specialized space optimized for information aggregation and transfer. This is crucial: without $W^V$, the model would simply use raw embeddings as Values, losing flexibility. With $W^V$, each word can dynamically transform its semantic content depending on how it’s attended to across different contexts.
The attention weights (0.70, 0.20, 0.10) reflect relevance. The transformed $V_{\text{mat}}$ carries the strongest signal: the Value representation of “mat” (containing information about its context, role, and meaning) is mixed with smaller contributions from other words. This weighted combination is what flows forward to the next layer, allowing the model to capture that “it” refers to “mat” with high confidence.
This mechanism is fundamental: $Q$ and $K$ determine which information matters; $V$ is the payload that gets routed based on those relevance decisions.
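The whole pipeline for this sentence can be sketched compactly in NumPy. The weights below are random rather than trained, so the resulting attention pattern will not actually favor “mat”; the point is the mechanics of scores, scaling, softmax, and the weighted sum of Values.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: scores -> softmax weights -> weighted sum of Values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # relevance of every token to every query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax: each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(2)
tokens = ["The", "cat", "sat", "on", "the", "mat", "because", "it", "was", "soft"]
d_model, d_k = 16, 16
X = rng.normal(size=(len(tokens), d_model))                    # toy token representations

W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
output, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)

# Row for "it": one attention weight per token in the sentence, summing to 1.
it_index = tokens.index("it")
print(dict(zip(tokens, weights[it_index].round(2))))
```

In a trained model, the row of weights for “it” would concentrate on the tokens it actually refers to, as in the 0.70/0.20/0.10 example above.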
The Architecture of Multi-Head Attention
While the self-attention mechanism is a cornerstone of modern natural language processing, a single attention mechanism, often referred to as a “head,” has a fundamental limitation: it can only focus on one specific relationship or semantic feature at a time. In the complexity of human language, words often carry multiple, simultaneous layers of meaning and grammatical function. Multi-head attention was designed to address this by allowing a model to process information across several different representation subspaces in parallel.
The Limitation of Single-Head Attention
In a standard attention layer, a word is projected into a single Query, Key, and Value space. This means the model must represent all relevant relationships using a single attention pattern, which can make it harder to disentangle different types of information. For example, in the sentence, “The bank of the river was steep, but the bank approved my loan,” a single-head model might struggle to represent the word “bank” accurately. If it focuses on the grammatical structure of the sentence, it might miss the fact that “bank” refers to two entirely different concepts (a geographical feature and a financial institution).
How Multi-Head Attention Works
Multi-head attention solves this by running multiple self-attention mechanisms simultaneously and independently. Rather than using one broad set of parameters, the model utilizes several “heads” to analyze the input sequence from different perspectives.
Multiple Projections and Independent Weights
Each attention head is defined by its own unique set of learned weight matrices. If a model has $h$ heads, there are $h$ sets of matrices: $(W^Q_1, W^K_1, W^V_1), (W^Q_2, W^K_2, W^V_2), \dots, (W^Q_h, W^K_h, W^V_h)$.
The total size of the embedding (the number of dimensions used to represent a single word) is called $d_{model}$. Rather than operating at this full model dimension, the workload is distributed across heads by setting $d_k = \frac{d_{model}}{h}$. This keeps the total computation comparable to single-head attention while allowing each head to specialize in different patterns.
When the base embeddings enter the attention layer, they are multiplied by each set of weights. This creates a unique set of Query ($Q$), Key ($K$), and Value ($V$) vectors for every head. Because these weights are initialized randomly and updated through backpropagation, each head naturally specializes in identifying different patterns.
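As a rough illustration, the snippet below (plain NumPy, with the hypothetical sizes $d_{model} = 768$ and $h = 12$) projects the same input through every head’s own weight matrices, producing per-head $Q$, $K$, and $V$ tensors of width $d_k = 64$. The weights are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, h = 768, 12
d_k = d_model // h                        # 64 dimensions per head

seq_len = 10
X = rng.normal(size=(seq_len, d_model))   # token representations entering the layer

# One independent set of projection weights per head.
W_Q = rng.normal(size=(h, d_model, d_k))
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_k))

# einsum projects the same input through every head's weights in one step.
Q = np.einsum('sd,hdk->hsk', X, W_Q)      # shape: (heads, seq_len, d_k)
K = np.einsum('sd,hdk->hsk', X, W_K)
V = np.einsum('sd,hdk->hsk', X, W_V)
print(Q.shape)                            # (12, 10, 64)
```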
Parallel Processing of Relationships
By processing the same input through these different pathways at the same time, the model can capture a variety of linguistic nuances. Research indicates that attention heads exhibit distinct functional tendencies, though their specific roles are neither rigid nor mutually exclusive:
- Syntactic Heads: One head might focus on the relationship between subjects and verbs to maintain grammatical coherence.
- Semantic Heads: Another head might analyze context clues to resolve lexical ambiguity (e.g., distinguishing the two meanings of “bank”).
- Relational Heads: A third head might track long-distance dependencies, such as identifying which noun a pronoun refers to earlier in a paragraph.
Recombination and Synthesis
Once each head has performed its individual calculations, the model must merge these different “opinions” back into a single representation. The Transformer takes the outputs from all individual heads and concatenates them (links them together) into one long vector.
This concatenated vector is then passed through a final linear layer (using a weight matrix $W^O$). This step ensures that the information from all heads is mathematically integrated and transformed back into the original dimensionality required for the next layer of the network.
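Putting the pieces together, here is a minimal sketch of a full multi-head pass: each head attends independently, the head outputs are concatenated, and the learned $W^O$ projects the result back to the model dimension. All weights are random placeholders and the sizes are only illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(4)
seq_len, d_model, h = 10, 768, 12
d_k = d_model // h

X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))           # final output ("mixer") matrix

head_outputs = []
for i in range(h):                                  # each head attends independently
    Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    head_outputs.append(weights @ V)                # (seq_len, d_k) per head

concatenated = np.concatenate(head_outputs, axis=-1)    # (seq_len, h * d_k) = (10, 768)
output = concatenated @ W_O                             # projected back to (seq_len, d_model)
print(output.shape)                                     # (10, 768)
```

Note that $W^O$ is what lets the model weigh and blend the heads’ separate findings, rather than simply handing the next layer twelve disconnected chunks of information.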
The Significance of Multi-Head Attention
The use of multi-head rather than single-head attention is one of the key architectural choices that gives Large Language Models (LLMs) their strong ability to capture rich contextual structure. By allowing the network to simultaneously weigh different types of relevance, it avoids the “bottleneck” of trying to condense all contextual information into a single attention score.
Furthermore, this architecture is highly efficient. Because each head operates independently, the calculations can be performed in parallel on modern hardware like GPUs. This parallelization is a primary factor that has allowed Transformer-based models to scale to billions of parameters while maintaining the ability to process vast amounts of data.
The Emergence of Specialization in Multi-Head Attention
In the study of Transformer architectures, a common question arises regarding the specialization of attention heads: who decides which head focuses on grammar, which on semantics, and which on long-range dependencies? Contrary to what one might expect from traditional software engineering, these roles are not assigned by software engineers. Instead, the specialization of Query ($Q$), Key ($K$), and Value ($V$) matrices is an emergent property of the training process, driven by the fundamental mechanics of machine learning.
Random Initialization: The Starting Point of Divergence
The path toward specialization begins before the model even processes its first word. When a Transformer is created, the weight matrices for every attention head ($W^Q$, $W^K$, and $W^V$) are populated with random numerical values.
Because each head starts with a different set of random numbers, they each produce a unique mathematical projection of the input data. On the very first training pass, Head 1 will generate a different set of attention scores than Head 2, simply because their starting parameters were not identical. This initial randomness ensures that no two heads begin their “career” looking at the data in the exact same way.
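A small sketch of this effect: two heads with identical structure but different random starting weights produce different attention patterns on the same input, before any training has happened. Everything here is untrained and illustrative.

```python
import numpy as np

def attention_weights(X, W_Q, W_K):
    """Attention pattern produced by one head's Query/Key projections."""
    scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(W_K.shape[1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(5)
seq_len, d_model, d_k = 6, 32, 8
X = rng.normal(size=(seq_len, d_model))          # the same input for both heads

# Two heads, identical architecture, different random starting weights.
head1 = (rng.normal(size=(d_model, d_k)), rng.normal(size=(d_model, d_k)))
head2 = (rng.normal(size=(d_model, d_k)), rng.normal(size=(d_model, d_k)))

w1 = attention_weights(X, *head1)
w2 = attention_weights(X, *head2)
print(np.allclose(w1, w2))   # prints False: different starting points, different attention patterns
```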
Backpropagation and the Optimization Cycle
Specialization is driven by the training loop: the model calculates prediction loss and uses backpropagation to derive gradients for every weight. An optimizer then adjusts these weights to minimize error. Because each head begins with unique initial values, they receive distinct gradient updates, evolving to capture different features.
The Mathematical Division of Labor
Over many training iterations on billions of tokens, a natural division of labor tends to emerge as heads diverge based on their unique initial states and gradient updates. However, this specialization is a self-organized tendency rather than a guarantee; research shows significant redundancy, as many heads can be pruned with minimal performance loss. Ultimately, head specialization is an emergent strategy for minimizing loss, not a result of human design.
If Head 1’s initial random state makes it slightly more effective at identifying subject-verb agreement, the training process will continue to refine its weights in that direction, reinforcing its role as a “syntactic expert.” Meanwhile, if Head 2 is slightly more successful at resolving lexical ambiguity, backpropagation will steer it toward becoming a “semantic expert.” This specialization is not a result of human instruction, but a result of the network finding the most effective configuration to solve the objective function.
Integration through the Output Matrix
Once the individual heads have performed their specialized calculations, the model must synthesize these disparate signals. The outputs from every head are concatenated, essentially joined side by side, to form a single, comprehensive vector.
This large vector is then multiplied by a final learned weight matrix, often denoted as $W^O$ (the Output matrix). This matrix serves as a learned “mixer.” It takes the specific insights found by each head, such as grammatical structure, contextual meaning, and pronoun references, and integrates them back into a single, enriched representation. This finalized vector is then passed on to the subsequent layers of the network, carrying a multi-faceted understanding of the input text.
Mechanistic Interpretability: Deciphering the “Heads” of AI
While Large Language Models demonstrate remarkable capabilities, the internal logic driving their decisions is not immediately transparent. Because these models consist of billions of numerical parameters rather than human-readable code, researchers have turned to a specialized field known as mechanistic interpretability to reverse-engineer their inner workings. By treating the model as a biological system to be dissected, researchers are beginning to identify the specific functions of certain attention circuits, although many components remain inherently complex and exhibit multi-functional behavior.
Understanding through Mechanistic Interpretability
Mechanistic interpretability functions similarly to experimental neuroscience. Since developers cannot simply read the weight matrices to understand the model’s logic, they perform experiments to observe how internal changes affect external behavior. A primary technique used is ablation and causal intervention, where researchers programmatically deactivate or alter specific heads or circuits. By measuring how the model’s accuracy or linguistic coherence degrades when a specific component is deactivated, they can isolate that component’s functional responsibility.
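The sketch below illustrates the bare idea of ablation on a toy, randomly initialized multi-head layer: one head at a time is zeroed out and the change in the layer’s output is measured. A real interpretability study would compare task loss or accuracy on a trained model, and would use more careful causal interventions, but the experimental shape is the same.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_layer(X, W_Q, W_K, W_V, W_O, ablate_head=None):
    """Toy multi-head layer; optionally zero out one head's output (ablation)."""
    h, d_k = W_Q.shape[0], W_Q.shape[2]
    outputs = []
    for i in range(h):
        weights = softmax((X @ W_Q[i]) @ (X @ W_K[i]).T / np.sqrt(d_k))
        out = weights @ (X @ W_V[i])
        outputs.append(np.zeros_like(out) if i == ablate_head else out)
    return np.concatenate(outputs, axis=-1) @ W_O

rng = np.random.default_rng(6)
seq_len, d_model, h = 8, 64, 4
d_k = d_model // h
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))

baseline = multi_head_layer(X, W_Q, W_K, W_V, W_O)
for i in range(h):
    ablated = multi_head_layer(X, W_Q, W_K, W_V, W_O, ablate_head=i)
    # In a real study the comparison would be task loss or accuracy, not raw output distance.
    print(f"head {i}: output change = {np.linalg.norm(baseline - ablated):.2f}")
```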
Identified Functional Circuits
Through these rigorous experiments, researchers have identified specific, repeatable circuits within the Transformer architecture that perform well-defined tasks in certain models:
- Induction Heads: Perhaps the most significant discovery in the field, induction heads are pairs of attention heads that work in tandem to recognize and repeat patterns. For example, if the sequence “A B” appeared earlier in a document, and the model later encounters “A,” the induction head identifies the pattern and increases the probability that “B” will follow (a toy sketch of this copying rule appears after this list). This mechanism is a key driver of in-context learning, allowing an AI to adapt to new tasks and instructions provided within a single prompt without further training.
- Indirect Object Identification (IOI) Heads: Researchers have mapped specific circuits in smaller models, such as GPT-2 Small, dedicated to tracking linguistic relationships. In a sentence such as “Alice and Bob went to the park, and Alice gave the keys to [Bob]”, IOI heads are responsible for identifying that “Bob” is the correct recipient based on the preceding context. These heads specialize in resolving the “who” and “to whom” of complex sentences.
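The copying rule that induction heads approximate can be written out directly as a toy function: scan backwards for an earlier occurrence of the current token and predict whatever followed it. This is only an algorithmic analogue of the behavior, not how the attention computation actually implements it.

```python
def induction_prediction(tokens):
    """Toy rule an induction head approximates: find an earlier occurrence of the
    current token and predict the token that followed it ("A B ... A" -> "B")."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards through the context
        if tokens[i] == current:
            return tokens[i + 1]
    return None                                # no earlier occurrence to copy from

context = ["A", "B", "C", "D", "A"]
print(induction_prediction(context))           # -> "B"
```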
The Challenge of Polysemanticity
Despite these breakthroughs, our understanding of these models remains partial. A significant barrier to full transparency is polysemanticity. Many attention heads are not dedicated to a single, clean task; instead, they may be responsible for multiple unrelated functions simultaneously, such as tracking grammatical gender in one context while identifying specific punctuation patterns in another. This overlapping of duties makes it incredibly difficult to isolate a single “meaning” for most heads in a high-parameter model like GPT-4.
The Engineering of Complexity: Defining Head Counts
The specific number of attention heads in a model, whether it is 12 in a smaller model or 96 in a massive one, is a hyperparameter. This is a configuration setting determined by human engineers before the training process begins.
Engineers choose the number of heads based on a balance of three primary factors:
- Representational Capacity: As a design intuition, more heads generally allow the model to track a greater variety of simultaneous linguistic features, though this does not scale infinitely.
- Computational Efficiency: Because heads run in parallel, increasing the count must be balanced against the memory and processing power of the GPUs used for training.
- Model Dimensionality: Typically, the total embedding size is divided equally among the heads. For instance, a 768-dimensional embedding split across 12 heads results in 64-dimensional vectors per head.
While the number of heads is an engineering choice, the purpose those heads eventually serve is determined entirely by the data and the optimization process during training.
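As a small illustration of this engineering choice, a hypothetical configuration object (the names here are invented, not from any particular framework) might record the head count and model width and enforce the even split described above.

```python
from dataclasses import dataclass

@dataclass
class AttentionConfig:
    """Illustrative hyperparameters chosen by engineers before training begins."""
    d_model: int = 768        # total embedding width
    n_heads: int = 12         # number of parallel attention heads

    @property
    def d_head(self) -> int:
        # The embedding is split evenly, so d_model must be divisible by n_heads.
        assert self.d_model % self.n_heads == 0, "d_model must divide evenly across heads"
        return self.d_model // self.n_heads

print(AttentionConfig().d_head)            # 64, matching the example above
print(AttentionConfig(12288, 96).d_head)   # 128 per head in a much larger, 96-head configuration
```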
The Nature of Modern AI
LLMs represent a significant milestone in Artificial Intelligence, yet their fundamental nature remains the subject of intense debate. While the underlying mechanics of these models are mathematically clear, the implications of their performance at scale have divided researchers into two primary philosophical camps. Understanding this divide is essential to recognizing what LLMs actually are, and what they are not.
The Stochastic Parrot
One influential perspective, often referred to as the “Stochastic Parrot” view, argues that LLMs are essentially sophisticated statistical engines. In this framework, the model does not “understand” language in a human sense; rather, it calculates the mathematical probability of a specific sequence of words based on its training data.
Under this interpretation:
- Pattern Matching: When a model provides a correct answer or writes a coherent essay, it is performing a high-level form of data interpolation. It is essentially “copy-pasting” the logic and structure it has seen billions of times before.
- Lack of Grounding: Because LLMs are trained solely on text, they lack a physical body or real-world experience. Critics argue that a model cannot truly know what an “apple” is; it only knows that the word “apple” is statistically likely to appear near words like “red,” “crunchy,” or “fruit.”
- The Illusion of Thought: The apparent creativity or reasoning of the AI is viewed as an illusion created by the sheer volume of its training data, where complex outputs are simply the result of extremely high-dimensional pattern recognition.
The World Model
An opposing and increasingly prominent view suggests that next-word prediction, when scaled to trillions of tokens, necessitates the development of an internal “world model.” Proponents of this view argue that for a neural network to achieve near-perfect prediction accuracy across diverse scenarios, it cannot rely on memorization alone. It must instead learn the underlying rules and structures that govern the information it processes.
A compelling piece of evidence for this view comes from research involving games like chess. When models are trained purely on text-based transcripts of chess moves, they don’t just learn which moves are common. Empirical analyses suggest that these models develop an internal representation of the 64-square board that lets them track where pieces are likely to be, because maintaining an approximate board state helps them predict legal and plausible next moves.
From this perspective, the model is not just predicting words; it is simulating the reality those words describe. By learning the patterns of human language, the model inadvertently learns the patterns of human logic, physics, and social interaction.
Reconciling the Perspectives
The reality of LLMs likely exists in the tension between these two views. These models frequently exhibit “emergent abilities”: skills that were not explicitly programmed or expected. A model might learn to translate between two languages it has rarely seen together, or solve a novel logic puzzle it hasn’t encountered in its training set. These behaviors suggest a level of abstraction that goes beyond simple statistical parroting.
However, the limitations of LLMs are equally revealing. The phenomenon of “hallucination,” where a model confidently asserts a factual falsehood, shows that their internal world models are often incomplete or inconsistent. Furthermore, LLMs frequently struggle with basic physical reasoning, such as understanding the spatial constraints of stacking objects. These failures indicate that while the model has a deep map of language, that map is occasionally disconnected from the fundamental laws of the physical world.
Conclusion
Ultimately, Large Language Models are an entirely new type of entity. They do not think like humans, nor are they conscious, but dismissing them as mere “word predictors” fails to account for the profound complexity of the mathematical representations they build.
They are best understood as vast, high-dimensional maps of human knowledge. While they lack the biological grounding that defines human understanding, they have managed to encode the structure of our thoughts, our logic, and our world within billions of parameters. As these models continue to evolve, the debate over their true nature will likely remain one of the most important questions in the history of technology.