Introduction to LLMs: Part Two
23 min read
How LLMs Prevent Infinite Looping
Large Language Models function through a process called autoregressive generation. Essentially, they predict the next token in a sequence based on all the tokens that came before it. Without specific guardrails, this process could theoretically continue forever, leading to infinite loops that drain computational resources and produce repetitive, useless text.
To ensure every conversation has an end, developers employ a multi-layered strategy involving training logic, hard constraints, and mathematical penalties.
The End-of-Sequence (EOS) Token
The most natural way a model stops talking is through the End-of-Sequence (EOS) token. During the training phase, a specific, invisible “stop word” (such as <|endoftext|> or </s>) is appended to every document or dialogue in the dataset.
Through this exposure, the model learns the structural patterns of how human thoughts and sentences conclude. When you ask a factual question, the model predicts the answer and then calculates that the most statistically probable next step is the EOS token. Once the inference engine detects this token, it immediately terminates the generation and delivers the final response to the user.
Hard Limits
Even a well-trained model can occasionally lose its way, failing to predict an EOS token and falling into a “rambling” state. To prevent this from consuming infinite memory and processing power, programmers set Hard Limits, such as a maximum number of tokens per response.
These are strict system-level constraints. If a model is assigned a limit of 500 tokens, the software will forcefully cut off the generation at that exact point, regardless of whether the model has finished its sentence or predicted an EOS token. This acts as a final safety net for the system’s infrastructure.
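Putting these first two mechanisms together, the decoding loop can be sketched as follows. This is a minimal illustration, not a real inference engine: predict_next_token is a stand-in for the model, and the exact EOS ID varies from model to model.

```python
# A minimal sketch of an autoregressive decoding loop with both stopping rules:
# the model's own EOS prediction and a hard system-level token limit.
# `predict_next_token` and the EOS ID are placeholders, not a specific library's API.
import random

EOS_TOKEN_ID = 50256        # e.g. GPT-2's <|endoftext|>; the ID varies by model
MAX_NEW_TOKENS = 500        # hard limit enforced by the serving software

def generate(prompt_ids, predict_next_token):
    output_ids = list(prompt_ids)
    for _ in range(MAX_NEW_TOKENS):                 # safety net: never exceed the hard limit
        next_id = predict_next_token(output_ids)
        if next_id == EOS_TOKEN_ID:                 # natural stop: the model predicted EOS
            break
        output_ids.append(next_id)
    return output_ids

# Stand-in "model" that eventually emits EOS, just to make the sketch runnable.
fake_model = lambda ids: random.choice([42, 7, EOS_TOKEN_ID])
print(generate([1, 2, 3], fake_model))
```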
Mathematical Penalties
One of the most common failure modes for an LLM is a repetition loop, where the model gets stuck repeating the same phrase indefinitely (e.g., “and then, and then, and then…”). In these cases, the model is trapped in a local statistical loop and may never naturally reach an EOS token.
To break these cycles, developers apply mathematical penalties during the decoding process:
- Frequency Penalty: This reduces the probability of a word being chosen again based on how many times it has already appeared in the text. The more a word is used, the less likely the model is to select it next, forcing the AI to explore new vocabulary.
- Presence/Repetition Penalty: This applies a penalty regardless of frequency; if a token has appeared at all, its probability is suppressed. This is particularly effective at forcing the model to move on to new ideas and eventually reach a natural conclusion.
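As a rough sketch, here is how both penalties might be applied to the model's raw scores (logits) before the next token is sampled. The formula mirrors the commonly documented approach of subtracting penalties from repeated tokens' scores; the function itself is illustrative, not any particular library's implementation.

```python
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty=0.5, presence_penalty=0.3):
    """Lower the scores of tokens that already appear in the output.

    logits: dict mapping token_id -> raw score for the next position.
    generated_ids: list of token IDs produced so far.
    """
    counts = Counter(generated_ids)
    adjusted = dict(logits)
    for token_id, count in counts.items():
        if token_id in adjusted:
            adjusted[token_id] -= count * frequency_penalty   # grows with how often it appeared
            adjusted[token_id] -= presence_penalty            # flat penalty for appearing at all
    return adjusted

# Token 42 has appeared three times, so its score drops the most.
print(apply_penalties({42: 5.0, 7: 4.8}, [42, 42, 42, 7]))
```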
Custom Stop Sequences
For specialized applications, developers can define Custom Stop Sequences. These are specific strings of text, such as User:, ###, or a double newline, that the software monitors in real time.
If the model generates a sequence that matches one of these triggers, the software intercepts the process and stops the generation immediately. This is frequently used in chatbot interfaces to prevent the model from “hallucinating” both sides of a conversation or continuing to write after it has provided the requested information.
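The monitoring itself is simple string matching. Here is a minimal sketch, with the stop strings chosen purely as examples:

```python
STOP_SEQUENCES = ["User:", "###", "\n\n"]   # example triggers; chosen per application

def check_stop(generated_text: str) -> str | None:
    """Return the text truncated at the first stop sequence, or None if no trigger was hit."""
    for stop in STOP_SEQUENCES:
        idx = generated_text.find(stop)
        if idx != -1:
            return generated_text[:idx]      # cut everything from the trigger onward
    return None

print(check_stop("Paris is the capital of France.\n\nUser: What about Spain?"))
```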
How AI Acquires Vocabulary and Intelligence
LLMs do not emerge with an innate understanding of human language. Before a model can engage in a single conversation, it must undergo two distinct, rigorous phases: the construction of its vocabulary and the optimization of its internal weights. Together, these processes transform a collection of random numbers into a sophisticated system capable of reasoning and communication.
Constructing the Lexicon
The model’s vocabulary is established during a preparatory phase called tokenization. This occurs before the neural network begins its actual training. Because computers process numbers rather than letters, researchers must first create a dictionary that maps segments of text to specific numerical IDs.
To build this dictionary, developers utilize an algorithm known as Byte-Pair Encoding (BPE). This process is driven entirely by the frequency of characters and patterns within a massive dataset.
The BPE Evolution
- Initial State: The algorithm starts at the most granular level: individual characters (a, b, c) or raw bytes.
- Pattern Recognition: It scans billions of tokens of text to identify the two most frequently occurring adjacent units. For instance, it might find that “t” and “h” appear together more often than any other pair.
- Merging: These two units are merged into a single new entry: “th.” This new entry is added to the vocabulary.
- Iteration: This cycle repeats. Eventually, “th” might merge with “e” to form “the,” or common suffixes like “-ing” and “-tion” are consolidated into their own unique tokens.
This process continues until the vocabulary reaches a predetermined limit, typically around 100,000 tokens. This hybrid approach is powerful because it allows the model to handle whole words, common fragments, and individual letters. Even if the model encounters a word it has never seen before, it can reconstruct it using smaller fragments, ensuring it is never at a loss for words.
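One iteration of this counting-and-merging cycle can be sketched in a few lines. The corpus below is a tiny made-up example rather than billions of tokens, but the logic is the same:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of (symbols -> frequency)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words start as tuples of characters, with their frequencies.
corpus = {tuple("the"): 5, tuple("then"): 2, tuple("this"): 3}
pair = most_frequent_pair(corpus)      # ('t', 'h') is the most common adjacent pair
corpus = merge_pair(corpus, pair)      # 'the' becomes ('th', 'e'), and so on
print(pair, corpus)
```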
How Weights are Learned
While the vocabulary provides the model with words, the learned weights provide the “intelligence” to use them. These weights are numerical parameters within the model’s architecture that dictate how information flows through the system.
During the training phase, these weights are optimized. When an LLM is first initialized, these billions of weights are essentially random numbers. If you were to prompt a “raw” model, it would produce nothing but chaotic, meaningless gibberish.
The transition from chaos to coherence happens through a four-step cycle:
- The Prediction: The model is given a sequence of text (e.g., “The cat sat on the…”) and is asked to predict the next token. With random weights, it might guess something irrelevant, like “bicycle.”
- Loss Calculation: The system compares the model’s guess to the actual word in the training data (“mat”). The mathematical difference between the wrong guess and the correct answer is called the Loss.
- Backpropagation: Using calculus, the system works backward through the layers of the neural network. It identifies exactly which weights were most responsible for the incorrect prediction.
- Updating Parameters: The system applies a tiny adjustment to those weights, “nudging” them in a direction that makes the correct answer (“mat”) more likely the next time the model sees that sequence.
This loop is repeated trillions of times across vast supercomputing clusters. Over months of processing, these “microscopic nudges” accumulate. The random numbers gradually evolve into highly tuned weights that represent a deep understanding of grammar, factual relationships, and logical patterns.
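In code, a single pass through this cycle looks roughly like the sketch below. It assumes PyTorch and a deliberately tiny toy model; the token IDs standing in for “The cat sat on the” and “mat” are invented for illustration.

```python
# A highly simplified sketch of one prediction / loss / backprop / update cycle.
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM = 1000, 64

model = nn.Sequential(
    nn.Embedding(VOCAB_SIZE, EMBED_DIM),    # token IDs -> vectors
    nn.Flatten(),
    nn.Linear(EMBED_DIM * 5, VOCAB_SIZE),   # context of 5 tokens -> next-token scores
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

context = torch.tensor([[3, 17, 52, 9, 41]])   # toy IDs for "The cat sat on the"
target = torch.tensor([7])                     # toy ID for the correct next token, "mat"

logits = model(context)            # 1. Prediction: scores over the whole vocabulary
loss = loss_fn(logits, target)     # 2. Loss: how wrong was the guess?
loss.backward()                    # 3. Backpropagation: gradients for every weight
optimizer.step()                   # 4. Update: nudge weights toward the correct answer
optimizer.zero_grad()
```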
In Transformer architectures, this weight optimization applies to all layers, including the attention mechanisms. Backpropagation refines the weights that determine which tokens should attend toâor focus onâwhich other tokens in the sequence. This allows the model to learn which relationships matter for accurate predictions.
The Mechanics of Tokenization
Before an LLM can read a prompt, your words must undergo a complex translation process. Despite their conversational prowess, neural networks do not understand human languages in their raw form; they operate exclusively in the realm of mathematics. The bridge between our words and the model’s math is a specialized software component known as the Tokenizer.
The following is a step-by-step breakdown of how your text is transformed into data that AI can process.
Normalization
When you submit a prompt, the tokenizer first performs normalization. This is essentially a “cleanup” phase to ensure consistency. Because computers are sensitive to even the smallest variations, the tokenizer must standardize the input so that “Apple” and “apple” (or words with different types of punctuation) are handled predictably.
During this stage, the tokenizer may:
- Standardize capitalization (some tokenizers do this, but not all).
- Manage trailing punctuation.
- Convert invisible spaces into visible placeholder characters. This ensures the model knows exactly where a word begins and ends, treating a space as a piece of data rather than empty air.
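Exactly which steps run varies from tokenizer to tokenizer. The toy function below is not any specific library's behavior, but it illustrates the flavor of this cleanup, borrowing the visible-space placeholder convention popularized by SentencePiece:

```python
import unicodedata

def normalize(text: str) -> str:
    """A toy normalization pass illustrating typical cleanup steps."""
    text = unicodedata.normalize("NFC", text)    # canonicalize Unicode (e.g. accented characters)
    text = text.strip()                          # manage stray leading/trailing whitespace
    text = text.replace(" ", "\u2581")           # make spaces visible as the ▁ placeholder
    return text

print(normalize("  Apple and apple.  "))   # 'Apple▁and▁apple.'
```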
Splitting into Tokens
Once the text is cleaned, it is cut into smaller units called tokens. To do this, the tokenizer uses the pre-built dictionary established during the training phase.
In the English language, a helpful rule of thumb is that one token is roughly equal to four characters, or approximately 75% of a word.
Token IDs
After the text is fragmented into tokens, the tokenizer acts as a lookup table. Every unique token in the model’s massive vocabulary is assigned a specific, unique integer called a Token ID.
At this point, the text vanishes and is replaced by an array of numbers. For example, the phrase “Hello, how are you?” might be converted into a sequence like [15339, 11, 1268, 527, 499, 30]. These numbers serve as the “address” for each token in the model’s internal database.
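You can reproduce this lookup, and the reverse lookup described later in this section, with OpenAI's open-source tiktoken library. The exact integers depend on which vocabulary you load, so treat the numbers above as illustrative:

```python
# Assumes the open-source `tiktoken` library is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one widely used vocabulary
ids = enc.encode("Hello, how are you?")
print(ids)              # a short list of integers; exact values depend on the vocabulary
print(enc.decode(ids))  # round-trips back to 'Hello, how are you?'
```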
Tensors
The final step is formatting these integers into a tensor: a multi-dimensional mathematical structure (similar to a matrix) that the GPU can process at high speeds.
Only once the data is in tensor form is it passed into the neural network. The LLM takes these Token IDs and maps them to Embeddings, which are long lists of numbers that represent the word’s meaning in a high-dimensional space. The model then performs its calculations and predicts a new Token ID as a response.
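Here is a minimal sketch of those final steps, assuming PyTorch: the IDs are packed into a tensor, and an embedding layer maps each ID to a vector (the 768-dimension width is just a common choice, not a fixed rule):

```python
import torch
import torch.nn as nn

token_ids = torch.tensor([[15339, 11, 1268, 527, 499, 30]])   # shape: (batch=1, sequence=6)
embedding = nn.Embedding(num_embeddings=100_000, embedding_dim=768)

vectors = embedding(token_ids)   # shape: (1, 6, 768) -- one 768-number vector per token
print(vectors.shape)
```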
The process then runs in reverse:
- The model outputs a numerical ID.
- The Detokenizer looks up that ID in its dictionary.
- The ID is converted back into a human-readable string (e.g., 499 becomes “you”).
- The final text is rendered on your screen.
Understanding AI Embeddings
While humans perceive words through a lens of culture, emotion, and experience, a Large Language Model initially sees text as mere numerical identifiers. If you tell a computer that “apple” is ID 15339, it has no inherent way of knowing if an apple is a fruit or a tech company.
Embeddings are the solution to this meaning gap. They act as a sophisticated bridge, translating those raw ID numbers into dense, high-dimensional mathematical vectors that capture the actual essence of human language.
What is a Word Vector?
At its core, an embedding is a list of numbers (a vector) that represents a word’s position in a multi-dimensional "semantic space".
Imagine a traditional dictionary where each word has a single definition. Now, imagine a massive mathematical spreadsheet where every word is assigned hundreds or even thousands of specific coordinates. For example, the word “apple” might be represented by an array like [0.12, -0.45, 0.89, ...] extending across 300 or more dimensions.
While humans can easily visualize three dimensions (length, width, and height), an LLM operates in a “high-dimensional” space. Individual numbers have no inherent meaning. Rather, a word’s coordinates define its position in semantic space, where proximity reflects similarity and vector operations reveal analogies. This distributed encoding allows the model to represent “apple” as a precise point where nearby locations contain semantically related words.
The Geography of Meaning
The true power of embeddings lies in spatial relationships. In this mathematical map, the physical distance between two points (vectors) indicates how similar their meanings are.
- Semantic Clustering: Words that share a context, such as “dog,” “puppy,” and “hound,” are assigned coordinates that place them very close to one another. Conversely, the vector for “dog” would be mathematically distant from unrelated concepts like “microchip” or “existentialism.”
- Linguistic Algebra: Because these words are now numbers, we can perform math on them to reveal logical relationships. A famous example from AI research on static embeddings is the equation:
Vector("King") - Vector("Man") + Vector("Woman") â Vector("Queen")
The model understands this relationship because the embedding space encodes it geometrically: the direction from “man” to “king” parallels the direction from “woman” to “queen.” Both displacements encode the same semantic shift (a change in status) while preserving the gender dimension.
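A toy demonstration of this arithmetic, using made-up three-dimensional vectors (real embeddings span hundreds of dimensions, and the relationship is approximate rather than exact):

```python
import numpy as np

# Invented 3-D coordinates purely for illustration.
vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

result = vec["king"] - vec["man"] + vec["woman"]
print(cosine(result, vec["queen"]))   # close to 1.0: the result lands near "queen"
```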
From Static to Contextual Embeddings
Early AI models used static embeddings (like Word2Vec), where a word had one fixed vector regardless of how it was used. This created a significant problem with polysemy: words with multiple meanings. In a static system, the word “bank” had the same coordinates whether you were talking about a “river bank” or a “bank account.”
Modern LLMs utilize the Transformer architecture to create Contextual Embeddings. The process works in two stages:
- Lookup: The model pulls the “base” vector for a word from its embedding matrix.
- Refinement: As the word passes through the model’s “attention” layers, its coordinates are dynamically shifted based on the words surrounding it.
By the time the model processes the full sentence, the vector for “bank” has moved in mathematical space to hover near other financial terms or, alternatively, near geographical terms, depending on the context.
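You can observe this shift directly with a small open model. The sketch below assumes the Hugging Face transformers library and the bert-base-uncased checkpoint; it extracts the contextual vector for “bank” in two sentences and compares them:

```python
# A sketch, assuming `transformers` and `torch` are installed and the model can be downloaded.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the contextual embedding of the token 'bank' within the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (sequence_length, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

river = bank_vector("The boat drifted toward the river bank.")
money = bank_vector("She deposited the cheque at the bank.")

# The same word now has two different vectors; their similarity is noticeably below 1.0.
print(torch.cosine_similarity(river, money, dim=0).item())
```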
How the Map is Drawn
These embedding matrices are not manually constructed; they are learned parameters. During initialization, the model populates the embedding matrix with random values, essentially noise. Through billions of prediction tasks, backpropagation computes gradients and updates these matrix parameters.
When the model incorrectly predicts “apple” in a mechanical context, the system calculates the loss and performs a gradient update, shifting the “apple” embedding vector away from mechanical feature dimensions toward fruit-related dimensions. This iterative optimization gradually transforms the random matrix into a learned semantic representation.
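As a minimal illustration, the sketch below (assuming PyTorch, and using a contrived loss rather than a real language-modeling objective) shows that only the embedding rows actually used in an example receive gradient updates:

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(100, 8)            # starts out as random noise
ids = torch.tensor([3, 7])                  # two token IDs from a training example
target = torch.randn(2, 8)                  # pretend "correct" vectors, purely for illustration

loss = ((embedding(ids) - target) ** 2).mean()   # a contrived loss involving those embeddings
loss.backward()

# Only the rows for tokens 3 and 7 receive non-zero gradients and will be nudged.
print(embedding.weight.grad.abs().sum(dim=1).nonzero().flatten())
```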