Context Window: The LLM's Working Memory
Introduction
When you interact with a large language model (LLM), it does not have unlimited memory. The context window is the maximum amount of text the model can "see" at one time. It includes everything: the system instructions, conversation history, your current message, any injected data, and the model's own generated response. All of this must fit within a single token budget.
Understanding the context window is critical for developers because it directly affects how much information you can include in prompts, how many turns a conversation can retain, and how the model handles large documents or datasets.
What Is a Context Window?
A context window is defined as the maximum number of tokens an LLM can process in a single request. Think of it as the model's working memory budget: everything the model reads and writes during one interaction must fit inside this window.
The context window is shared between input and output. If a model has a 128,000-token context window and your prompt uses 2,000 tokens, you have up to 126,000 tokens remaining for the response. In practice, you also reserve some capacity for internal overhead and to avoid hitting the hard limit.
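The shared input/output budget can be sketched with simple arithmetic. The sketch below is illustrative only: the window size, output reservation, and safety margin are assumptions, and the ~4-characters-per-token heuristic is a rough stand-in for a real tokenizer.

```python
# Rough context-budget check before sending a request.
# All constants are illustrative assumptions, not real API limits.

CONTEXT_WINDOW = 128_000
RESERVED_FOR_OUTPUT = 4_096   # hypothetical cap reserved for the model's reply
SAFETY_MARGIN = 500           # headroom to avoid hitting the hard limit

def estimate_tokens(text: str) -> int:
    """Very rough estimate: ~4 characters per English token."""
    return max(1, len(text) // 4)

def remaining_input_budget(prompt: str) -> int:
    """Tokens still available for injected data after the prompt,
    the reserved output, and the safety margin are accounted for."""
    used = estimate_tokens(prompt) + RESERVED_FOR_OUTPUT + SAFETY_MARGIN
    return CONTEXT_WINDOW - used

print(remaining_input_budget("Summarize the attached report."))
```

In a real application you would replace the heuristic with the model's actual tokenizer, since token counts vary by model and language.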
What Fills the Context Window?
Every request to an LLM fills the context window from several sources:
| Source | Description |
|---|---|
| System instructions | The system prompt that defines the model's role and behavior |
| Conversation history | Previous user messages and assistant replies in multi-turn conversations |
| Injected data | Retrieved documents (RAG), tool descriptions, function results, or any data added to the prompt |
| User's current message | The latest prompt or question from the user |
| Generated response | The tokens the model produces as its output — these also count against the window |
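The breakdown above can be made concrete by estimating each source's contribution separately. This is a minimal sketch: the message contents are placeholders and the ~4-chars-per-token estimate stands in for a real tokenizer.

```python
# Sketch: estimate how each source contributes to the context window.
# Contents are placeholder examples; the heuristic is a rough assumption.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)   # ~4 chars per token, very rough

request = {
    "system": "You are a helpful assistant.",
    "history": ["Hi!", "Hello, how can I help?"],
    "injected": ["<retrieved document text>"],
    "user": "Summarize the document above.",
}

usage = {
    "system": estimate_tokens(request["system"]),
    "history": sum(estimate_tokens(m) for m in request["history"]),
    "injected": sum(estimate_tokens(d) for d in request["injected"]),
    "user": estimate_tokens(request["user"]),
}
total_input = sum(usage.values())
print(usage, total_input)
```

Remember that the generated response also counts against the window, so the input total above is not the whole story.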
Context Window Sizes by Model
Context window sizes have grown dramatically over time. Early models were limited to a few thousand tokens, while modern models can process over 1 million tokens in a single request.
| Model | Context Window | Approximate Pages of Text |
|---|---|---|
| GPT-3.5 Turbo | 16,385 tokens | ~20 pages |
| GPT-4 Turbo | 128,000 tokens | ~160 pages |
| GPT-4o | 128,000 tokens | ~160 pages |
| GPT-4o mini | 128,000 tokens | ~160 pages |
| GPT-4.1 | 1,047,576 tokens | ~1,300 pages |
| Claude 3.5 Sonnet | 200,000 tokens | ~250 pages |
Note: "pages" is a rough estimate assuming ~800 tokens per page of English text.
Why the Context Window Matters
1. Information Capacity
Larger context windows let you include more information in a single request. With a small context window you might only fit a short question. With a large context window you can include entire documents, database records, or lengthy conversation histories.
2. Multi-Turn Conversations
In a chatbot, every previous message (both user and assistant) is sent to the model on each turn. As the conversation grows, it consumes more of the context window. When the history exceeds the window, the oldest messages must be dropped or summarized — and the model "forgets" them.
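The "dropping the oldest messages" behavior can be sketched as a simple history-fitting function. This is an illustrative sketch, not a production implementation; the ~4-chars-per-token estimate is an assumption.

```python
# Keep the most recent messages whose combined estimated token
# count fits within a budget; the oldest messages are dropped first.

def fit_history(messages: list[dict], max_tokens: int) -> list[dict]:
    kept, total = [], 0
    for msg in reversed(messages):           # newest first
        cost = max(1, len(msg["content"]) // 4)  # rough token estimate
        if total + cost > max_tokens:
            break                            # oldest remaining messages fall out
        kept.append(msg)
        total += cost
    return list(reversed(kept))              # restore chronological order
```

Anything dropped here is simply gone from the model's point of view, which is why summarization is often layered on top of plain trimming.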
3. RAG and Tool Use
Retrieval-Augmented Generation (RAG) injects retrieved documents into the prompt. Tool descriptions and function call results also consume tokens. A limited context window constrains how many documents or tool results you can provide.
4. Cost
LLM APIs charge per token. A longer prompt means more input tokens, which means higher cost. Sending the same large context on every request can become expensive. Efficient context management directly reduces operational costs.
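The cost effect is easy to quantify. The sketch below uses placeholder per-1K-token prices, not any provider's real published rates.

```python
# Illustrative per-request cost estimate. The prices are
# placeholders; check your provider's pricing page for real rates.

def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_1k: float = 0.0025,
                 price_out_per_1k: float = 0.01) -> float:
    return (input_tokens / 1000 * price_in_per_1k
            + output_tokens / 1000 * price_out_per_1k)
```

Note that resending a large fixed context (a long system prompt, the same documents) on every turn multiplies the input cost by the number of turns, which is exactly what the management strategies below try to avoid.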
Strategies for Managing the Context Window
| Strategy | How It Works | Trade-off |
|---|---|---|
| Trimming / Truncation | Cut old messages or long text to a token limit | Simple but loses information abruptly |
| Sliding Window | Keep only the most recent N messages | Easy to implement; oldest context is discarded |
| Summarization | Summarize older conversation turns into a compact block | Preserves key facts but costs an extra LLM call |
| Selective Injection (RAG) | Only inject the most relevant documents rather than all available data | Requires a retrieval pipeline; relevance depends on the search quality |
| Chunking | Split large documents into smaller pieces and process them individually | Works well for extraction tasks; cross-chunk context can be lost |
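As one example from the table, chunking can be sketched in a few lines. This is a naive character-based splitter under the same rough ~4-chars-per-token assumption; real implementations usually split on sentence or paragraph boundaries using an actual tokenizer.

```python
# Naive chunker: split text into pieces of at most max_tokens
# (estimated at ~4 characters per token), with optional overlap
# so context isn't lost entirely at chunk boundaries.

def chunk_text(text: str, max_tokens: int, overlap: int = 0) -> list[str]:
    max_chars = max_tokens * 4
    step = max_chars - overlap * 4   # overlapping chunks share some text
    return [text[i:i + max_chars] for i in range(0, len(text), step)]
```

Each chunk is then processed in its own request, and the per-chunk results are merged afterwards.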
Counting Tokens in .NET
Before sending a request, you can measure exactly how many tokens your prompt will consume using the Microsoft.ML.Tokenizers library. This lets you:
- Calculate remaining context budget after system instructions and history
- Decide how many documents or records to inject
- Trim text precisely to a token boundary using GetIndexByTokenCount
- Estimate costs before making API calls
Full Example
Reference
LLM Fundamentals - Microsoft Learn