Introduction to LLMs: Part One
Introduction
Large Language Models (LLMs) represent a milestone in Artificial Intelligence, marking a shift from rigid, rule-based systems to fluid, deep-learning models capable of mimicking human-like communication. By processing massive datasets, these systems have moved beyond simple pattern recognition to master the complexities of linguistic structure, context, and intent.
The primary driver behind the success of modern LLMs is the Transformer architecture. Unlike earlier models that processed text sequentially, Transformers utilize a self-attention mechanism to look at an entire sequence of words simultaneously. This enables the model to weigh the relevance of different words regardless of their distance from one another, capturing long-range dependencies in parallel.
The core of this “attention” can be expressed mathematically through Query ($Q$), Key ($K$), and Value ($V$) vectors:
$$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
This formula allows the model to track complex relationships within data. The “large” in LLM refers to the billions or even trillions of adjustable parameters the model learns during training, which act as the “synapses” of the system.
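To make this concrete, here is a minimal NumPy sketch of the attention formula above. It is illustrative only: a real model applies this to learned, batched projections rather than to random vectors.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of Values

# Toy example: 3 tokens with 4-dimensional Q/K/V vectors
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)       # (3, 4)
```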
Because LLMs capture deep context and reasoning rather than relying on simple keyword matching, they are highly adaptable:
- Content Generation: Autonomously creating everything from creative prose to technical documentation.
- Coding & Logic: Writing and debugging computer code, often acting as a “co-pilot” for developers.
- Multistep Reasoning: When equipped with agentic capabilities, modern LLMs can plan and execute complex workflows—such as conducting research or managing files—that previously required human intervention.
- Explainability: Translating complex machine learning outputs into natural language narratives that humans can easily understand.
The Evolution of LLMs
The landscape of large language models has shifted rapidly from early research to the multi-agent, trillion-parameter systems of 2026.
- BERT (2018): Introduced by Google, BERT pioneered bidirectional training, allowing models to understand a word’s context based on both its left and right surroundings.
- GPT Series (2020–2025): OpenAI’s Generative Pre-trained Transformers pushed the industry toward ever-larger models. While GPT-4 set the standard for reasoning, the release of GPT-5 in summer 2025 marked the transition into highly reliable, autonomous systems.
- Claude 4 Series (2025–2026): Anthropic’s models, including the recent Claude Opus 4.7, are renowned for their massive parameter counts and Constitutional AI framework, which ensures safer, more helpful responses.
- Gemini 3.1 & Deep Research (2026): These systems utilize collaborative planning to synthesize information across thousands of documents and various modalities (video, audio, and code) in real-time.
- Llama 4 & Open Source (2025–2026): Meta’s Llama 4 Maverick brought Mixture of Experts (MoE) architecture to the open-source community, offering hundreds-of-billions-parameter performance with the efficiency of much smaller models.
- GPT-5.5 & The Agentic Era (2026): Released in April 2026, GPT-5.5 represents the pinnacle of Agentic AI, featuring advanced capabilities for coding, research, data analysis, document production, and software operation.
The Attention Mechanism
The attention mechanism is a breakthrough in deep learning that mimics a fundamental human trait: selective focus. Just as you might focus on a specific voice in a crowded room while ignoring background noise, the attention mechanism allows neural networks to prioritize the most relevant parts of their input data.
Before its inception, sequential models like Recurrent Neural Networks (RNNs) suffered from a memory bottleneck. They processed information piece by piece, attempting to compress an entire sentence’s meaning into a single, fixed-length vector. This often resulted in the loss of detail from the beginning of a sentence by the time the model reached the end. Attention solved this by providing the model with direct, simultaneous access to the entire input, using attention weights to determine which pieces of information are critical for the task at hand.
To understand how attention works, it is helpful to view the data through the lens of a retrieval system. Every input word is transformed into three distinct vectors:
- Query ($Q$): What the model is currently “looking for” (e.g., the word it is trying to translate or predict).
- Key ($K$): A label that describes the identity and characteristics of each word in the sequence.
- Value ($V$): The actual semantic information or content contained within those words.
The mechanism calculates a compatibility score between the Query and all the Keys. These scores determine how much “attention” should be paid to each word’s Value.
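In practice, each word’s embedding vector is multiplied by three learned weight matrices to produce its Query, Key, and Value. A rough sketch, with random weights and illustrative sizes standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_k = 8, 4                    # embedding size and Q/K/V size (illustrative)
X = rng.normal(size=(2, d_model))      # embeddings for a 2-word sequence

# Learned projection matrices (random stand-ins here)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # what each word is "looking for"
K = X @ W_K   # how each word describes itself
V = X @ W_V   # the content each word carries
print(Q.shape, K.shape, V.shape)       # (2, 4) (2, 4) (2, 4)
```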
Step-by-Step Example: “The Cat”
To see the math in action, let’s look at a simplified example using the two-word phrase: “The cat.” We want to calculate how much attention the word “The” (Word 1) should pay to itself and the word “cat” (Word 2).
1. Vector Initialization
In a real model, these vectors are derived from learned weights. For our example, we will use simplified values:
- “The” (Word 1): $Q_1 = [1, 0]$, $K_1 = [1, 0]$, $V_1 = [2, 3]$
- “cat” (Word 2): $K_2 = [0, 1]$, $V_2 = [4, 5]$ (we omit $Q_2$, since we only compute the attention that “The” pays to other words)
2. Calculating Similarity (The Dot-Product)
We calculate the similarity score by taking the dot product of the Query for “The” ($Q_1$) against all available Keys ($K_1, K_2$). This measures how similar the words are in a specific context.
- Score 1 (“The” to “The”): $(1 \times 1) + (0 \times 0) = 1$
- Score 2 (“The” to “cat”): $(1 \times 0) + (0 \times 1) = 0$
3. Scaling for Stability
To prevent numbers from becoming too large (which can destabilize the training process), we divide the scores by the square root of the vector dimension ($d_k$). Since our dimension is 2:
- Scaled Score 1: $1 / \sqrt{2} \approx 0.707$
- Scaled Score 2: $0 / \sqrt{2} = 0$
4. Normalization (The Softmax Function)
After scaling the scores, we pass them through the Softmax function. The purpose of Softmax is to transform a vector of raw numerical scores (which can be any real numbers) into a probability distribution where every value is between 0 and 1, and the total sum is exactly 1.
The Softmax function for an input vector $z$ is expressed as:
$$ \sigma_i(\vec{z}) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} $$
By using the exponential function ($e^z$), Softmax magnifies higher scores while still giving even a score of zero a small amount of “attention” (since $e^0 = 1$).
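Translated directly into code (with the standard trick of subtracting the maximum score for numerical stability, which does not change the result):

```python
import numpy as np

def softmax(z):
    """Turn raw scores into probabilities between 0 and 1 that sum to 1."""
    exp_z = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return exp_z / exp_z.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ≈ [0.66, 0.24, 0.10]
```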
Numerical Example: Using our scaled scores from the previous step ($0.707$ and $0$):
- Calculate Exponentials:
- $e^{0.707} \approx 2.028$
- $e^0 = 1$
- Calculate the Total Sum: $2.028 + 1 = 3.028$
- Calculate Final Weights:
- Weight for “The”: $2.028 / 3.028 \approx \mathbf{0.67}$ (67%)
- Weight for “cat”: $1 / 3.028 \approx \mathbf{0.33}$ (33%)
This tells the model that to understand “The” in this context, it should focus 67% on the word itself and 33% on the word “cat.”
5. Synthesizing the Final Context
Finally, we multiply these weights by the original Value vectors and sum them to create a new contextualized vector for “The.”
- From “The”: $0.67 \times [2, 3] = [1.34, 2.01]$
- From “cat”: $0.33 \times [4, 5] = [1.32, 1.65]$
Final Output for “The”: $[1.34 + 1.32, 2.01 + 1.65] = \mathbf{[2.66, 3.66]}$
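The entire worked example fits in a few lines of NumPy; up to rounding, the output matches the hand-calculated values:

```python
import numpy as np

Q1 = np.array([1.0, 0.0])           # Query for "The"
K = np.array([[1.0, 0.0],           # Key for "The"
              [0.0, 1.0]])          # Key for "cat"
V = np.array([[2.0, 3.0],           # Value for "The"
              [4.0, 5.0]])          # Value for "cat"

scores = K @ Q1 / np.sqrt(2)                      # scaled dot products: [0.707, 0.0]
weights = np.exp(scores) / np.exp(scores).sum()   # softmax: ≈ [0.67, 0.33]
context = weights @ V                             # weighted sum of Values

print(weights)   # ≈ [0.67 0.33]
print(context)   # ≈ [2.66 3.66]
```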
The Transformer Block
Calculating the context-aware vector (such as the $[2.66, 3.66]$ output for the word “The”) is a significant milestone, but it is not the end of the journey. In a Large Language Model, this vector must pass through a series of refining stages within a Transformer Block to transform raw contextual data into high-level understanding.
Here is the step-by-step process of how a vector is polished, processed, and eventually turned into a prediction.
The Residual Connection (The “Add” Step)
Immediately after the attention mechanism, the model performs a Residual Connection. It takes the original input vector (the one we had before the attention math) and adds it back to the newly calculated attention vector.
- The Logic: Think of this as a memory failsafe. While the attention mechanism is busy mixing information from other words, the residual connection ensures the model doesn’t “forget” the identity of the original word.
- The Benefit: Mathematically, this prevents the vanishing gradient problem. It creates a direct highway for information to flow through the network, making it much easier and more stable to train deep models with dozens of layers.
Layer Normalization (The “Norm” Step)
After adding the vectors, the result undergoes Layer Normalization. In a neural network, numbers can occasionally become extremely large or infinitesimally small as they are multiplied across layers, which can cause the model to crash or stop learning.
Normalization rescales the values in the vector to a consistent range (usually with a mean of zero and a specific variance). This “tidying up” of the data ensures that the signals remain stable as they move deeper into the architecture.
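As a rough sketch (real implementations also learn a per-dimension scale and shift), layer normalization subtracts the mean of the vector and divides by its standard deviation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and (roughly) unit variance."""
    mean = x.mean()
    var = x.var()
    return (x - mean) / np.sqrt(var + eps)

print(layer_norm(np.array([2.66, 3.66])))  # ≈ [-1.  1.]
```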
The Feed-Forward Network (FFN)
Once normalized, the vector enters the Position-wise Feed-Forward Network. This is a classic neural network layer that operates on each word vector independently.
While the attention layer’s job is to let words “talk” to one another, the Feed-Forward layer is where the model does its individual thinking. It applies a series of linear transformations and non-linear activations (like ReLU or GELU) to ask: “Now that I have all this context about how this word relates to its neighbors, what does it actually mean in this specific instance?” It essentially extracts deeper features and patterns from the context gathered during the attention phase.
The Final “Add & Norm”
To wrap up the block, the output of the Feed-Forward Network goes through one last round of residual connection and normalization.
$$\text{Final Block Output} = \text{LayerNorm}(\text{FFN Output} + \text{FFN Input})$$
By the end of this stage, the vector for “The” has been thoroughly processed: it has looked at its neighbors, integrated that context, been mathematically stabilized, and had its features extracted by the FFN.
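Putting these pieces together, a single (post-norm) Transformer block can be sketched in NumPy roughly as follows. The weight matrices are random stand-ins for learned parameters, and LayerNorm’s learned scale/shift and multi-head splitting are omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def transformer_block(X, p):
    # 1. Attention: each word gathers context from its neighbours
    attn = self_attention(X, p["W_q"], p["W_k"], p["W_v"])
    # 2. Add & Norm: the residual connection preserves the original word identity
    X = layer_norm(X + attn)
    # 3. Feed-forward network: per-word processing with a non-linearity (ReLU here)
    ffn = np.maximum(0.0, X @ p["W_1"]) @ p["W_2"]
    # 4. Final Add & Norm
    return layer_norm(X + ffn)

# Illustrative usage: a 5-token sequence with 8-dimensional embeddings
d = 8
rng = np.random.default_rng(2)
p = {"W_q": rng.normal(size=(d, d)), "W_k": rng.normal(size=(d, d)),
     "W_v": rng.normal(size=(d, d)),
     "W_1": rng.normal(size=(d, 4 * d)), "W_2": rng.normal(size=(4 * d, d))}
X = rng.normal(size=(5, d))
print(transformer_block(X, p).shape)  # (5, 8)
```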
The Path to Prediction
A single pass through the steps above constitutes one Transformer Layer. In modern LLMs, this refined vector is not done; it is immediately fed into the next layer to repeat the entire process (Attention → Add/Norm → FFN → Add/Norm). Models like GPT-4 or Claude 3 stack dozens of these layers (anywhere from 12 to 96+). As the vector travels through deeper layers, the representation moves from simple grammar to complex logic, sentiment, and abstract reasoning.
How LLMs Choose Their Next Word
After a Large Language Model has processed an input through its many Transformer layers, it arrives at a final numerical representation—a dense vector. However, this vector is not a word; it is a mathematical summary of everything the model has analyzed. To turn this math back into human language, the model must navigate a final decision-making process to select the most appropriate next token.
The final vector emerges from the last Transformer block and enters what is known as the Language Model Head. This consists of a final linear layer that maps the vector against the model’s entire vocabulary—a dictionary typically containing 50,000 to 100,000 unique tokens (words or word fragments).
The model then applies the Softmax function, which converts raw numerical scores into a probability distribution. Every single word in the dictionary is assigned a percentage chance of being the “correct” next word.
For a prompt like “The cat sat on the,” the vocabulary probability distribution might look like this (oversimplified example):
- “mat”: 40%
- “sofa”: 30%
- “floor”: 20%
- “moon”: 10%
The sum of these probabilities (over the entire dictionary) is always exactly 1.0 (100%). At this stage, the model knows the likelihood of every word, but it hasn’t actually chosen one yet.
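In code, the language model head is just one more matrix multiplication followed by a softmax; the sizes and random weights below are purely illustrative:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

rng = np.random.default_rng(3)
d_model, vocab_size = 16, 50_000                  # illustrative sizes
final_vector = rng.normal(size=d_model)           # output of the last Transformer block
W_head = rng.normal(size=(d_model, vocab_size))   # language model head

logits = final_vector @ W_head   # one raw score per token in the vocabulary
probs = softmax(logits)          # probability distribution over the vocabulary
print(probs.sum())               # 1.0 (up to floating-point rounding)
print(probs.argmax())            # index of the most likely next token
```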
Decoding Strategies
How a model selects a word from that probability list is determined by its decoding strategy. Depending on the settings, the same model can be made to act like a rigid logician or a creative storyteller.
Greedy Search
The simplest strategy is Greedy Search. The model always picks the single word with the highest probability (e.g., “mat”).
- Pros: Highly logical and computationally efficient.
- Cons: Often results in repetitive, robotic phrasing and can get stuck in loops where it repeats the same sentence structure over and over.
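With the toy distribution above, greedy decoding is a single argmax:

```python
import numpy as np

# Toy distribution over four candidate words
vocab = ["mat", "sofa", "floor", "moon"]
probs = np.array([0.40, 0.30, 0.20, 0.10])

next_word = vocab[int(np.argmax(probs))]  # always take the single most likely word
print(next_word)  # "mat"
```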
Temperature
To avoid the pitfalls of greedy search, models use Temperature ($T$), a mathematical scaling factor that reshapes the probability distribution before a word is picked.
- Low Temperature ($T < 1.0$): This sharpens the distribution. It makes the high-probability words even more likely and suppresses the lower ones. This is ideal for factual tasks where accuracy is paramount.
- High Temperature ($T > 1.0$): This flattens the distribution. It brings the probabilities closer together, increasing the chance that the model will pick a less obvious word. This produces more creative, varied output, though it increases the risk of hallucinations; the sketch below shows the effect.
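Under the hood, temperature divides the raw scores (logits) before the softmax is applied. A minimal sketch, using logits chosen so that $T = 1.0$ reproduces the toy distribution above:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

# Raw scores for "mat", "sofa", "floor", "moon"
logits = np.log(np.array([0.40, 0.30, 0.20, 0.10]))

for T in (0.5, 1.0, 2.0):
    print(T, softmax(logits / T).round(2))
# T=0.5 sharpens:  [0.53 0.3  0.13 0.03]
# T=1.0 unchanged: [0.4  0.3  0.2  0.1 ]
# T=2.0 flattens:  [0.33 0.28 0.23 0.16]
```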
Top-K and Top-P Sampling
Even with temperature, we want to prevent the model from picking completely nonsensical words (like “moon” in our cat example). To ensure the output remains coherent, two popular filtering methods are used:
Top-K Sampling
The model is instructed to only consider the top K most likely words. If $K=50$, the model ignores everything except the top 50 candidates and chooses from that small pool. This provides a hard limit on how weird the model can get.
Top-P (Nucleus) Sampling
Regarded as a more sophisticated approach, Top-P sampling looks at the cumulative probability. If $P=0.90$, the model gathers words (starting from the most likely) until their combined probabilities add up to 90%. It then discards the remaining 10% of the vocabulary.
Why it works: Top-P is dynamic. If the model is very confident, the pool of words to choose from will be small. If the model is uncertain, the pool expands, allowing for more varied expression.
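Both filters take only a few lines. This sketch renormalizes the surviving probabilities of the toy distribution; a real decoder would then sample from the filtered distribution rather than printing it:

```python
import numpy as np

vocab = np.array(["mat", "sofa", "floor", "moon"])
probs = np.array([0.40, 0.30, 0.20, 0.10])

def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then renormalize."""
    keep = np.argsort(probs)[::-1][:k]
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    return mask / mask.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # number of tokens to keep
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = probs[order[:cutoff]]
    return mask / mask.sum()

print(top_k_filter(probs, k=2).round(2))    # [0.57 0.43 0.   0.  ]
print(top_p_filter(probs, p=0.9).round(2))  # [0.44 0.33 0.22 0.  ]  ("moon" is dropped)
```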
The Autoregressive Loop: Building the Sentence
Once a word is finally selected (let’s say the model chose “sofa”), that word is appended to the original prompt. The new sequence is now: “The cat sat on the sofa.”
This entire updated sequence is then fed back into the very beginning of the LLM. The model processes the entire string again to predict the next word. This cycle, called autoregressive generation, continues token by token until the model generates a special “End of Sequence” (EOS) token or reaches a user-defined token limit.
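Put together, the autoregressive loop looks roughly like the following sketch, where `model`, `tokenizer`, and `sample` are hypothetical stand-ins for a real model’s forward pass, its tokenizer, and the decoding strategies described above:

```python
def generate(model, tokenizer, prompt, max_tokens=50):
    """Hypothetical autoregressive loop: feed the growing sequence back in each step."""
    tokens = tokenizer.encode(prompt)                # prompt -> list of token ids
    for _ in range(max_tokens):
        probs = model.next_token_probs(tokens)       # forward pass over the whole sequence
        next_token = sample(probs)                   # greedy / temperature / top-k / top-p
        if next_token == tokenizer.eos_token_id:     # stop at the End-of-Sequence token
            break
        tokens.append(next_token)                    # extend the sequence and repeat
    return tokenizer.decode(tokens)
```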