Introduction to LLMs: Part Two
23 min read
How LLMs Prevent Infinite Looping
Large Language Models function through a process called autoregressive generation. Essentially, they predict the next token in a sequence based on all the tokens that came before it. Without specific guardrails, this process could theoretically continue forever, leading to infinite loops that drain computational resources and produce repetitive, useless text.
To ensure every conversation has an end, developers employ a multi-layered strategy involving training logic, hard constraints, and mathematical penalties.
The End-of-Sequence (EOS) Token
The most natural way a model stops talking is through the End-of-Sequence (EOS) token. During the training phase, a specific, invisible “stop word” (such as <|endoftext|> or </s>) is appended to every document or dialogue in the dataset.
Through this exposure, the model learns the structural patterns of how human thoughts and sentences conclude. When you ask a factual question, the model predicts the answer and then calculates that the most statistically probable next step is the EOS token. Once the inference engine detects this token, it immediately terminates the generation and delivers the final response to the user.
Hard Limits
Even a well-trained model can occasionally lose its way, failing to predict an EOS token and falling into a “rambling” state. To prevent this from consuming infinite memory and processing power, programmers set Hard Limits, such as a maximum number of tokens per response.
These are strict system-level constraints. If a model is assigned a limit of 500 tokens, the software will forcefully cut off the generation at that exact point, regardless of whether the model has finished its sentence or predicted an EOS token. This acts as a final safety net for the system’s infrastructure.
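Putting these first two mechanisms together, the decoding loop can be sketched as follows. This is a minimal illustration, not a real inference engine: predict_next_token is a stand-in for the model, and the exact EOS ID varies from model to model.

```python
# A minimal sketch of an autoregressive decoding loop with both stopping rules:
# the model's own EOS prediction and a hard system-level token limit.
# `predict_next_token` and the EOS ID are placeholders, not a specific library's API.
import random

EOS_TOKEN_ID = 50256        # e.g. GPT-2's <|endoftext|>; the ID varies by model
MAX_NEW_TOKENS = 500        # hard limit enforced by the serving software

def generate(prompt_ids, predict_next_token):
    output_ids = list(prompt_ids)
    for _ in range(MAX_NEW_TOKENS):                 # safety net: never exceed the hard limit
        next_id = predict_next_token(output_ids)
        if next_id == EOS_TOKEN_ID:                 # natural stop: the model predicted EOS
            break
        output_ids.append(next_id)
    return output_ids

# Stand-in "model" that eventually emits EOS, just to make the sketch runnable.
fake_model = lambda ids: random.choice([42, 7, EOS_TOKEN_ID])
print(generate([1, 2, 3], fake_model))
```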
Mathematical Penalties
One of the most common failure modes for an LLM is a repetition loop, where the model gets stuck repeating the same phrase indefinitely (e.g., “and then, and then, and then…”). In these cases, the model is trapped in a local statistical loop and may never naturally reach an EOS token.
To break these cycles, developers apply mathematical penalties during the decoding process:
- Frequency Penalty: This reduces the probability of a word being chosen again based on how many times it has already appeared in the text. The more a word is used, the less likely the model is to select it next, forcing the AI to explore new vocabulary.
- Presence/Repetition Penalty: This applies a penalty regardless of frequency; if a token has appeared at all, its probability is suppressed. This is particularly effective at forcing the model to move on to new ideas and eventually reach a natural conclusion.
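As a rough sketch, here is how both penalties might be applied to the model's raw scores (logits) before the next token is sampled. The formula mirrors the commonly documented approach of subtracting penalties from repeated tokens' scores; the function itself is illustrative, not any particular library's implementation.

```python
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty=0.5, presence_penalty=0.3):
    """Lower the scores of tokens that already appear in the output.

    logits: dict mapping token_id -> raw score for the next position.
    generated_ids: list of token IDs produced so far.
    """
    counts = Counter(generated_ids)
    adjusted = dict(logits)
    for token_id, count in counts.items():
        if token_id in adjusted:
            adjusted[token_id] -= count * frequency_penalty   # grows with how often it appeared
            adjusted[token_id] -= presence_penalty            # flat penalty for appearing at all
    return adjusted

# Token 42 has appeared three times, so its score drops the most.
print(apply_penalties({42: 5.0, 7: 4.8}, [42, 42, 42, 7]))
```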
Custom Stop Sequences
For specialized applications, developers can define Custom Stop Sequences. These are specific strings of text, such as User:, ###, or a double newline, that the software monitors in real time.
If the model generates a sequence that matches one of these triggers, the software intercepts the process and stops the generation immediately. This is frequently used in chatbot interfaces to prevent the model from “hallucinating” both sides of a conversation or continuing to write after it has provided the requested information.
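The monitoring itself is simple string matching. Here is a minimal sketch, with the stop strings chosen purely as examples:

```python
STOP_SEQUENCES = ["User:", "###", "\n\n"]   # example triggers; chosen per application

def check_stop(generated_text: str) -> str | None:
    """Return the text truncated at the first stop sequence, or None if no trigger was hit."""
    for stop in STOP_SEQUENCES:
        idx = generated_text.find(stop)
        if idx != -1:
            return generated_text[:idx]      # cut everything from the trigger onward
    return None

print(check_stop("Paris is the capital of France.\n\nUser: What about Spain?"))
```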
How AI Acquires Vocabulary and Intelligence
LLMs do not emerge with an innate understanding of human language. Before a model can engage in a single conversation, it must undergo two distinct, rigorous phases: the construction of its vocabulary and the optimization of its internal weights. Together, these processes transform a collection of random numbers into a sophisticated system capable of reasoning and communication.
Constructing the Lexicon
The model’s vocabulary is established during a preparatory phase called tokenization. This occurs before the neural network begins its actual training. Because computers process numbers rather than letters, researchers must first create a dictionary that maps segments of text to specific numerical IDs.
To build this dictionary, developers utilize an algorithm known as Byte-Pair Encoding (BPE). This process is driven entirely by the frequency of characters and patterns within a massive dataset.
The BPE Evolution
- Initial State: The algorithm starts at the most granular level: individual characters (a, b, c) or raw bytes.
- Pattern Recognition: It scans billions of tokens of text to identify the two most frequently occurring adjacent units. For instance, it might find that “t” and “h” appear together more often than any other pair.
- Merging: These two units are merged into a single new entry: “th.” This new entry is added to the vocabulary.
- Iteration: This cycle repeats. Eventually, “th” might merge with “e” to form “the,” or common suffixes like “-ing” and “-tion” are consolidated into their own unique tokens.
This process continues until the vocabulary reaches a predetermined limit, typically around 100,000 tokens. This hybrid approach is powerful because it allows the model to handle whole words, common fragments, and individual letters. Even if the model encounters a word it has never seen before, it can reconstruct it using smaller fragments, ensuring it is never at a loss for words.
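One iteration of this counting-and-merging cycle can be sketched in a few lines. The corpus below is a tiny made-up example rather than billions of tokens, but the logic is the same:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of (symbols -> frequency)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words start as tuples of characters, with their frequencies.
corpus = {tuple("the"): 5, tuple("then"): 2, tuple("this"): 3}
pair = most_frequent_pair(corpus)      # ('t', 'h') is the most common adjacent pair
corpus = merge_pair(corpus, pair)      # 'the' becomes ('th', 'e'), and so on
print(pair, corpus)
```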
How Weights are Learned
While the vocabulary provides the model with words, the learned weights provide the “intelligence” to use them. These weights are numerical parameters within the model’s architecture that dictate how information flows through the system.
During the training phase, these weights are optimized. When an LLM is first initialized, these billions of weights are essentially random numbers. If you were to prompt a “raw” model, it would produce nothing but chaotic, meaningless gibberish.
The transition from chaos to coherence happens through a four-step cycle:
- The Prediction: The model is given a sequence of text (e.g., “The cat sat on the…”) and is asked to predict the next token. With random weights, it might guess something irrelevant, like “bicycle.”
- Loss Calculation: The system compares the model’s guess to the actual word in the training data (“mat”). The mathematical difference between the wrong guess and the correct answer is called the Loss.
- Backpropagation: Using calculus, the system works backward through the layers of the neural network. It identifies exactly which weights were most responsible for the incorrect prediction.
- Updating Parameters: The system applies a tiny adjustment to those weights, “nudging” them in a direction that makes the correct answer (“mat”) more likely the next time the model sees that sequence.
This loop is repeated trillions of times across vast supercomputing clusters. Over months of processing, these “microscopic nudges” accumulate. The random numbers gradually evolve into highly tuned weights that represent a deep understanding of grammar, factual relationships, and logical patterns.
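In code, a single pass through this cycle looks roughly like the sketch below. It assumes PyTorch and a deliberately tiny toy model; the token IDs standing in for “The cat sat on the” and “mat” are invented for illustration.

```python
# A highly simplified sketch of one prediction / loss / backprop / update cycle.
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM = 1000, 64

model = nn.Sequential(
    nn.Embedding(VOCAB_SIZE, EMBED_DIM),    # token IDs -> vectors
    nn.Flatten(),
    nn.Linear(EMBED_DIM * 5, VOCAB_SIZE),   # context of 5 tokens -> next-token scores
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

context = torch.tensor([[3, 17, 52, 9, 41]])   # toy IDs for "The cat sat on the"
target = torch.tensor([7])                     # toy ID for the correct next token, "mat"

logits = model(context)            # 1. Prediction: scores over the whole vocabulary
loss = loss_fn(logits, target)     # 2. Loss: how wrong was the guess?
loss.backward()                    # 3. Backpropagation: gradients for every weight
optimizer.step()                   # 4. Update: nudge weights toward the correct answer
optimizer.zero_grad()
```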
In Transformer architectures, this weight optimization applies to all layers, including the attention mechanisms. Backpropagation refines the weights that determine which tokens should attend toâor focus onâwhich other tokens in the sequence. This allows the model to learn which relationships matter for accurate predictions.
The Mechanics of Tokenization
Before an LLM can read a prompt, your words must undergo a complex translation process. Despite their conversational prowess, neural networks do not understand human languages in their raw form; they operate exclusively in the realm of mathematics. The bridge between our words and the model’s math is a specialized software component known as the Tokenizer.
The following is a step-by-step breakdown of how your text is transformed into data that AI can process.
Normalization
When you submit a prompt, the tokenizer first performs normalization. This is essentially a “cleanup” phase to ensure consistency. Because computers are sensitive to even the smallest variations, the tokenizer must standardize the input so that “Apple” and “apple” (or words with different types of punctuation) are handled predictably.
During this stage, the tokenizer may:
- Standardize capitalization (some tokenizers do this, but not all).
- Manage trailing punctuation.
- Convert invisible spaces into visible placeholder characters. This ensures the model knows exactly where a word begins and ends, treating a space as a piece of data rather than empty air.
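Exactly which steps run varies from tokenizer to tokenizer. The toy function below is not any specific library's behavior, but it illustrates the flavor of this cleanup, borrowing the visible-space placeholder convention popularized by SentencePiece:

```python
import unicodedata

def normalize(text: str) -> str:
    """A toy normalization pass illustrating typical cleanup steps."""
    text = unicodedata.normalize("NFC", text)    # canonicalize Unicode (e.g. accented characters)
    text = text.strip()                          # manage stray leading/trailing whitespace
    text = text.replace(" ", "\u2581")           # make spaces visible as the ▁ placeholder
    return text

print(normalize("  Apple and apple.  "))   # 'Apple▁and▁apple.'
```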
Splitting into Tokens
Once the text is cleaned, it is cut into smaller units called tokens. To do this, the tokenizer uses the pre-built dictionary established during the training phase.
In the English language, a helpful rule of thumb is that one token is roughly equal to four characters, or approximately 75% of a word.
Token IDs
After the text is fragmented into tokens, the tokenizer acts as a lookup table. Every unique token in the model’s massive vocabulary is assigned a specific, unique integer called a Token ID.
At this point, the text vanishes and is replaced by an array of numbers. For example, the phrase “Hello, how are you?” might be converted into a sequence like [15339, 11, 1268, 527, 499, 30]. These numbers serve as the “address” for each token in the model’s internal database.
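You can reproduce this lookup, and the reverse lookup described later in this section, with OpenAI's open-source tiktoken library. The exact integers depend on which vocabulary you load, so treat the numbers above as illustrative:

```python
# Assumes the open-source `tiktoken` library is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one widely used vocabulary
ids = enc.encode("Hello, how are you?")
print(ids)              # a short list of integers; exact values depend on the vocabulary
print(enc.decode(ids))  # round-trips back to 'Hello, how are you?'
```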
Tensors
The final step is formatting these integers into a tensor: a multi-dimensional mathematical structure (similar to a matrix) that the GPU can process at high speeds.
Only once the data is in tensor form is it passed into the neural network. The LLM takes these Token IDs and maps them to Embeddings, which are long lists of numbers that represent the word’s meaning in a high-dimensional space. The model then performs its calculations and predicts a new Token ID as a response.
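Here is a minimal sketch of those final steps, assuming PyTorch: the IDs are packed into a tensor, and an embedding layer maps each ID to a vector (the 768-dimension width is just a common choice, not a fixed rule):

```python
import torch
import torch.nn as nn

token_ids = torch.tensor([[15339, 11, 1268, 527, 499, 30]])   # shape: (batch=1, sequence=6)
embedding = nn.Embedding(num_embeddings=100_000, embedding_dim=768)

vectors = embedding(token_ids)   # shape: (1, 6, 768) -- one 768-number vector per token
print(vectors.shape)
```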
The process then runs in reverse:
- The model outputs a numerical ID.
- The Detokenizer looks up that ID in its dictionary.
- The ID is converted back into a human-readable string (e.g., 499 becomes “you”).
- The final text is rendered on your screen.
Understanding AI Embeddings
While humans perceive words through a lens of culture, emotion, and experience, a Large Language Model initially sees text as mere numerical identifiers. If you tell a computer that “apple” is ID 15339, it has no inherent way of knowing if an apple is a fruit or a tech company.
Embeddings are the solution to this meaning gap. They act as a sophisticated bridge, translating those raw ID numbers into dense, high-dimensional mathematical vectors that capture the actual essence of human language.
What is a Word Vector?
At its core, an embedding is a list of numbers (a vector) that represents a word’s position in a multi-dimensional "semantic space".
Imagine a traditional dictionary where each word has a single definition. Now, imagine a massive mathematical spreadsheet where every word is assigned hundreds or even thousands of specific coordinates. For example, the word “apple” might be represented by an array like [0.12, -0.45, 0.89, ...] extending across 300 or more dimensions.
While humans can easily visualize three dimensions (length, width, and height), an LLM operates in a “high-dimensional” space. Individual numbers have no inherent meaning. Rather, a word’s coordinates define its position in semantic space, where proximity reflects similarity and vector operations reveal analogies. This distributed encoding allows the model to represent “apple” as a precise point where nearby locations contain semantically related words.
The Geography of Meaning
The true power of embeddings lies in spatial relationships. In this mathematical map, the physical distance between two points (vectors) indicates how similar their meanings are.
- Semantic Clustering: Words that share a context, such as “dog,” “puppy,” and “hound,” are assigned coordinates that place them very close to one another. Conversely, the vector for “dog” would be mathematically distant from unrelated concepts like “microchip” or “existentialism.”
- Linguistic Algebra: Because these words are now numbers, we can perform math on them to reveal logical relationships. A famous example from AI research on static embeddings is the equation:
Vector("King") - Vector("Man") + Vector("Woman") â Vector("Queen")
The model understands this relationship because the embedding space encodes it geometrically: the direction from “man” to “king” parallels the direction from “woman” to “queen.” Both displacements encode the same semantic shift (a change in status) while preserving the gender dimension.
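A toy demonstration of this arithmetic, using made-up three-dimensional vectors (real embeddings span hundreds of dimensions, and the relationship is approximate rather than exact):

```python
import numpy as np

# Invented 3-D coordinates purely for illustration.
vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

result = vec["king"] - vec["man"] + vec["woman"]
print(cosine(result, vec["queen"]))   # close to 1.0: the result lands near "queen"
```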
From Static to Contextual Embeddings
Early AI models used static embeddings (like Word2Vec), where a word had one fixed vector regardless of how it was used. This created a significant problem with polysemy: words with multiple meanings. In a static system, the word “bank” had the same coordinates whether you were talking about a “river bank” or a “bank account.”
Modern LLMs utilize the Transformer architecture to create Contextual Embeddings. The process works in two stages:
- Lookup: The model pulls the “base” vector for a word from its embedding matrix.
- Refinement: As the word passes through the model’s “attention” layers, its coordinates are dynamically shifted based on the words surrounding it.
By the time the model processes the full sentence, the vector for “bank” has moved in mathematical space to hover near other financial terms or, alternatively, near geographical terms, depending on the context.
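You can observe this shift directly with a small open model. The sketch below assumes the Hugging Face transformers library and the bert-base-uncased checkpoint; it extracts the contextual vector for “bank” in two sentences and compares them:

```python
# A sketch, assuming `transformers` and `torch` are installed and the model can be downloaded.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the contextual embedding of the token 'bank' within the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (sequence_length, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

river = bank_vector("The boat drifted toward the river bank.")
money = bank_vector("She deposited the cheque at the bank.")

# The same word now has two different vectors; their similarity is noticeably below 1.0.
print(torch.cosine_similarity(river, money, dim=0).item())
```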
How the Map is Drawn
These embedding matrices are not manually constructed; they are learned parameters. During initialization, the model populates the embedding matrix with random values, essentially noise. Through billions of prediction tasks, backpropagation computes gradients and updates these matrix parameters.
When the model incorrectly predicts “apple” in a mechanical context, the system calculates the loss and performs a gradient update, shifting the “apple” embedding vector away from mechanical feature dimensions toward fruit-related dimensions. This iterative optimization gradually transforms the random matrix into a learned semantic representation.
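As a minimal illustration, the sketch below (assuming PyTorch, and using a contrived loss rather than a real language-modeling objective) shows that only the embedding rows actually used in an example receive gradient updates:

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(100, 8)            # starts out as random noise
ids = torch.tensor([3, 7])                  # two token IDs from a training example
target = torch.randn(2, 8)                  # pretend "correct" vectors, purely for illustration

loss = ((embedding(ids) - target) ** 2).mean()   # a contrived loss involving those embeddings
loss.backward()

# Only the rows for tokens 3 and 7 receive non-zero gradients and will be nudged.
print(embedding.weight.grad.abs().sum(dim=1).nonzero().flatten())
```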