How LLMs Work: From Raw Text to Predictive Generation & Training
Large language models (LLMs) look like magic from the outside — you type text, and the model writes a thoughtful answer. Under the hood, the process is a four-stage pipeline: raw text is broken into tokens, each token becomes a vector, a transformer predicts the next token, and during training a loss is used to adjust the model’s weights. The same loop that trained the model is the loop that generates every output.
This article walks through the four stages end-to-end and shows a working .NET example using the Microsoft.ML.Tokenizers library that implements the first stage and simulates the rest for teaching purposes.
Stage 1 — Tokenization & ID Assignment
The first thing a model does with text is chop it into tokens. A token can be a whole word, part of a word, a single character, or punctuation. The tokenizer is the component that performs this split, and it also assigns every unique token a numeric ID. The full set of IDs the tokenizer knows is called the vocabulary.
Example — the sentence "The quick brown fox." might split into the tokens The, " quick", " brown", " fox", and "." — five tokens, each mapped to its own numeric ID (the exact IDs depend on the tokenizer's vocabulary).
From this point on, the model never sees text. It only sees numbers. Every downstream stage operates on these IDs.
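Stage 1 is the one part of the pipeline you can run directly with the real library. A minimal sketch using Microsoft.ML.Tokenizers (the library used in the demo below); the exact IDs printed depend on the tokenizer vocabulary, so treat the output as illustrative:

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

class TokenizeDemo
{
    static void Main()
    {
        // Load the GPT-4o (o200k_base) tokenizer.
        Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

        string text = "The quick brown fox.";

        // Text → token IDs: from here on, the model only sees numbers.
        IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);
        Console.WriteLine(string.Join(", ", ids));

        // IDs → text round-trips losslessly.
        Console.WriteLine(tokenizer.Decode(ids));
    }
}
```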
Stage 2 — Contextual Embedding (Semantic Representation)
A raw integer ID says nothing about meaning: ID 464 isn’t “close to” ID 465 in any useful way. To capture meaning, the model passes each ID through an embedding layer — essentially a big lookup table that maps every token ID to a high-dimensional vector of floating-point numbers.
These vectors are learned during training so that tokens that appear in similar contexts end up with similar vectors. The word quick sits near fast and speedy; fox sits near dog and animal. The model does not know what a fox is, but it knows which other tokens behave like it.
Embeddings are what let the model generalize: a sentence it has never seen can still be represented in roughly the same region of vector space as sentences it was trained on.
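A toy version of an embedding layer is just a table lookup. The vectors below are made up for illustration — a real model learns them, and uses hundreds or thousands of dimensions — but they show why "similar" tokens can be compared numerically with cosine similarity:

```csharp
using System;
using System.Collections.Generic;

class EmbeddingDemo
{
    // Hypothetical 4-dimensional embeddings; a real model learns
    // these values during training.
    static readonly Dictionary<string, double[]> Embeddings = new()
    {
        ["quick"] = new[] { 0.9, 0.1, 0.3, 0.0 },
        ["fast"]  = new[] { 0.8, 0.2, 0.4, 0.1 },
        ["fox"]   = new[] { 0.1, 0.9, 0.0, 0.5 },
    };

    // Cosine similarity: 1.0 means same direction, near 0.0 unrelated.
    static double Cosine(double[] a, double[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
    }

    static void Main()
    {
        // "quick" scores much closer to "fast" than to "fox".
        Console.WriteLine($"quick~fast: {Cosine(Embeddings["quick"], Embeddings["fast"]):F2}");
        Console.WriteLine($"quick~fox:  {Cosine(Embeddings["quick"], Embeddings["fox"]):F2}");
    }
}
```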
Stage 3 — Next Token Prediction (Iterative Inference)
With the input represented as a sequence of embedding vectors, the transformer block comes next. It applies attention, which computes how much influence each prior token should have on what comes next, and then a feed-forward network mixes that information to produce a vector representing the model’s prediction for the next token.
The predicted vector is scored against every token in the vocabulary, and a softmax turns those scores into a probability distribution. The highest-probability candidate (or one sampled from the distribution) becomes the next token. That token is appended to the context and the whole process repeats. This is the autoregressive loop:
1. Take all tokens so far.
2. Run them through the transformer.
3. Pick the next token.
4. Append it and go back to step 1.
Generating one token at a time is why LLMs “stream” their answers word by word.
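The loop can be sketched with a stand-in for the transformer. Here NextTokenScores is a hypothetical function returning hard-coded scores (a real model computes them with attention and feed-forward layers), but the softmax, greedy selection, and append-and-repeat mechanics are the real shape of inference:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class AutoregressiveDemo
{
    static readonly string[] Vocab = { "The", " quick", " brown", " fox", "." };

    // Stand-in for the transformer: hard-coded scores that strongly
    // prefer the token following the last one in the context.
    static double[] NextTokenScores(List<int> context)
    {
        var scores = new double[Vocab.Length];
        int next = Math.Min(context[^1] + 1, Vocab.Length - 1);
        scores[next] = 5.0;
        return scores;
    }

    // Softmax: turn raw scores into a probability distribution.
    static double[] Softmax(double[] scores)
    {
        double max = scores.Max();
        var exp = scores.Select(s => Math.Exp(s - max)).ToArray();
        double sum = exp.Sum();
        return exp.Select(e => e / sum).ToArray();
    }

    static void Main()
    {
        var context = new List<int> { 0 }; // start with "The"
        for (int step = 0; step < 4; step++)
        {
            double[] probs = Softmax(NextTokenScores(context));
            int next = Array.IndexOf(probs, probs.Max()); // greedy pick
            context.Add(next); // append and repeat — the autoregressive loop
        }
        // Emits the tokens one at a time: "The quick brown fox."
        Console.WriteLine(string.Join("", context.Select(id => Vocab[id])));
    }
}
```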
Stage 4 — Training Loop (Learning from Loss)
Stages 1–3 describe how a trained model generates text. Stage 4 is how the model became trained in the first place.
During training, the model is given a real piece of text and asked to predict the next token at every position. Its prediction (a probability distribution over the vocabulary) is compared against the actual next token from the training data using a loss function — usually cross-entropy, which for a single position is simply the negative log of the probability the model assigned to the correct token: loss = −log p(correct token).
If the model gave the correct token a high probability, the loss is small. If it gave it a low probability, the loss is large. Backpropagation then computes how each weight in the network contributed to that loss and nudges every weight in the direction that would have reduced it. Repeat this across billions of training tokens and the model gradually becomes good at predicting what comes next.
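The loss arithmetic is easy to check by hand. This toy computes cross-entropy for one position under a made-up probability distribution:

```csharp
using System;

class LossDemo
{
    // Cross-entropy for one position: -log of the probability the
    // model assigned to the token that actually came next.
    static double CrossEntropy(double[] probs, int correctId)
        => -Math.Log(probs[correctId]);

    static void Main()
    {
        // Made-up distribution over a 4-token vocabulary.
        double[] probs = { 0.05, 0.80, 0.10, 0.05 };

        // Confident and correct → small loss.
        Console.WriteLine($"correct=1: {CrossEntropy(probs, 1):F3}"); // ≈ 0.223

        // Actual next token was one the model thought unlikely → large loss.
        Console.WriteLine($"correct=0: {CrossEntropy(probs, 0):F3}"); // ≈ 2.996

        // Backpropagation would now nudge every weight in the direction
        // that raises p(correct token) at this position.
    }
}
```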
Why This Matters for Developers
- Billing is tied to Stage 1 — you pay per token, not per character.
- Semantic search, retrieval, and RAG use Stage 2 — embeddings let you compare meaning.
- Streaming responses and latency come from Stage 3 — one token per forward pass.
- Fine-tuning is Stage 4 applied to your own data, usually with a much smaller loss-and-update loop.
A .NET Demo of the Pipeline
The example below uses Microsoft.ML.Tokenizers for Stage 1 with the real GPT-4o tokenizer, and it simulates the other three stages with clearly illustrative (not real) math so you can trace a single sentence through every stage. The scenario is deliberately simple: short descriptions of constellations.
Install the packages:
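Assuming the standard NuGet package names — the tokenizer library itself, plus the o200k_base vocabulary data package that the GPT-4o tokenizer uses:

```shell
dotnet add package Microsoft.ML.Tokenizers
dotnet add package Microsoft.ML.Tokenizers.Data.O200kBase
```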
Full Example
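A minimal end-to-end sketch along these lines — real tokenization via Microsoft.ML.Tokenizers for Stage 1, and deliberately illustrative (not real) math for stages 2–4. The constellation sentence, the FakeEmbedding function, and the fixed 0.8 probability in the loss step are all invented for the walkthrough:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.Tokenizers;

class PipelineDemo
{
    static void Main()
    {
        // Stage 1 — real tokenization (GPT-4o / o200k_base vocabulary).
        Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
        string text = "Orion is a bright winter constellation.";
        IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);
        Console.WriteLine($"Stage 1 — {ids.Count} tokens: {string.Join(", ", ids)}");

        // Stage 2 — simulated embeddings: derive a small fake vector from
        // each ID (a real model looks these up in a learned table).
        List<double[]> embeddings = ids.Select(FakeEmbedding).ToList();
        Console.WriteLine($"Stage 2 — {embeddings.Count} vectors of dim {embeddings[0].Length}");

        // Stage 3 — simulated prediction: average the vectors as a stand-in
        // for attention + feed-forward, then score each token against that
        // context vector and pick the best match.
        double[] context = new double[4];
        foreach (double[] e in embeddings)
            for (int i = 0; i < context.Length; i++)
                context[i] += e[i] / embeddings.Count;
        int predicted = ids[ArgMax(embeddings.Select(e => Dot(e, context)).ToArray())];
        Console.WriteLine($"Stage 3 — 'predicted' next token id: {predicted}");

        // Stage 4 — simulated loss: pretend the model assigned the correct
        // token probability 0.8 and compute the cross-entropy.
        Console.WriteLine($"Stage 4 — loss = {-Math.Log(0.8):F3}");
    }

    // Deterministic fake embedding, so the demo is reproducible.
    static double[] FakeEmbedding(int id) =>
        Enumerable.Range(0, 4).Select(i => Math.Sin(id * (i + 1))).ToArray();

    static double Dot(double[] a, double[] b) => a.Zip(b, (x, y) => x * y).Sum();

    static int ArgMax(double[] xs) => Array.IndexOf(xs, xs.Max());
}
```

Only Stage 1 uses the real library; everything after the tokenizer call is a teaching simulation, which is exactly the split described above.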
Reference
How generative AI and LLMs work - Microsoft Learn