Microsoft Agent Framework · Microsoft.Extensions.AI · Created: 01 Mar 2026 · Updated: 01 Mar 2026

Text Tokenization with Microsoft.ML.Tokenizers

The Microsoft.ML.Tokenizers library provides tokenization tools for .NET AI applications. Tokenization converts text into integer token IDs that AI models actually process. Knowing the token count of a piece of text lets you stay within a model's context limit, estimate API costs, and trim prompts before sending them.
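
For instance, a token count translates directly into an input-cost estimate. A minimal sketch, assuming a hypothetical rate of $2.50 per million input tokens (the prompt text and the price are illustrative; check your provider's current rate card):

```csharp
using Microsoft.ML.Tokenizers;

// Hypothetical pricing; providers publish per-million-token rates.
const decimal pricePerMillionInputTokens = 2.50m;

Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
string prompt = "Summarise the following clinical trial abstract for a regulatory reviewer.";

int tokens = tokenizer.CountTokens(prompt);
decimal estimatedCost = tokens / 1_000_000m * pricePerMillionInputTokens;
Console.WriteLine($"{tokens} tokens, estimated input cost ${estimatedCost:F6}");
```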

Key Concepts

1. Creating a Tokenizer

Use TiktokenTokenizer.CreateForModel with the target model name. For gpt-4o this requires the Microsoft.ML.Tokenizers.Data.O200kBase NuGet package, which embeds the o200k_base encoding data. Create the instance once and reuse it: loading the encoding data makes construction expensive.

Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

2. CountTokens

Returns the number of tokens in a string without allocating the full ID list. Use this to check whether text fits within a model's context window before sending:

int count = tokenizer.CountTokens(text);
bool fits = count <= 8192;

3. EncodeToIds / Decode

Convert text to a list of integer token IDs, then decode them back to the original string:

IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);
string? decoded = tokenizer.Decode(ids);

4. EncodeToTokens

Returns EncodedToken objects with both the ID and the string value of each token. Use this to inspect exactly where token boundaries fall:

IReadOnlyList<EncodedToken> tokens = tokenizer.EncodeToTokens(text, out string? normalised);
foreach (EncodedToken token in tokens)
Console.WriteLine($"ID {token.Id} '{token.Value}'");

5. GetIndexByTokenCount / GetIndexByTokenCountFromEnd

Find the character index that corresponds to a given token budget, measured from the start or the end of the string. Use these to trim text to a precise token limit:

// First 20 tokens
int endIdx = tokenizer.GetIndexByTokenCount(text, 20, out string? processed, out _);
string first20 = (processed ?? text)[..endIdx];

// Last 10 tokens
int startIdx = tokenizer.GetIndexByTokenCountFromEnd(text, 10, out processed, out _);
string last10 = (processed ?? text)[startIdx..];
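
Applied repeatedly, GetIndexByTokenCount can also split a long document into successive fixed-budget chunks. A minimal sketch, assuming a tokenizer instance created as above (the helper name ChunkByTokens is illustrative, not part of the library):

```csharp
using Microsoft.ML.Tokenizers;

static List<string> ChunkByTokens(Tokenizer tokenizer, string text, int budget)
{
    var chunks = new List<string>();
    string remaining = text;
    while (remaining.Length > 0)
    {
        // Character index covered by the first `budget` tokens of `remaining`.
        int idx = tokenizer.GetIndexByTokenCount(remaining, budget, out string? processed, out _);
        string source = processed ?? remaining;
        if (idx <= 0)
            break; // budget too small to make progress
        chunks.Add(source[..idx]);
        remaining = source[idx..];
    }
    return chunks;
}
```

Each chunk can then be sent as a separate request while staying under the model's per-request token limit.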

Full Example

using Microsoft.ML.Tokenizers;

namespace MicrosoftAgentFrameworkLesson.ConsoleApp.Tokenizers;

/// <summary>
/// Demonstrates Microsoft.ML.Tokenizers with the Tiktoken (gpt-4o) tokenizer.
/// Scenario: Pharmaceutical research assistant that checks token counts of
/// clinical study summaries before sending them to an AI model.
/// </summary>
public static class TokenizersDemo
{
    public static void Run()
    {
        // Create the Tiktoken tokenizer for gpt-4o (o200k_base encoding).
        // Reuse this instance throughout the app; creation is expensive.
        Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

        Console.WriteLine("====== Microsoft.ML.Tokenizers — Pharmaceutical Research Assistant ======\n");

        // -----------------------------------------------------------------------
        // Demo 1: CountTokens — check whether a study abstract fits the limit
        // -----------------------------------------------------------------------
        Console.WriteLine("--- Demo 1: CountTokens ---");

        string studyAbstract =
            "A randomised double-blind placebo-controlled trial evaluated the efficacy " +
            "of compound XR-42 in 340 adult patients diagnosed with moderate persistent asthma. " +
            "Primary endpoint was FEV1 improvement at week 12. Secondary endpoints included " +
            "exacerbation rate, rescue inhaler use, and quality-of-life scores. " +
            "Patients receiving XR-42 showed a 28% mean increase in FEV1 compared to 4% in the " +
            "placebo group (p < 0.001). No serious adverse events were attributed to the compound.";

        int tokenCount = tokenizer.CountTokens(studyAbstract);
        const int contextLimit = 8192;

        Console.WriteLine($"Abstract token count : {tokenCount}");
        Console.WriteLine($"Context limit        : {contextLimit}");
        Console.WriteLine($"Fits in context      : {tokenCount <= contextLimit}");
        Console.WriteLine();

        // -----------------------------------------------------------------------
        // Demo 2: EncodeToIds / Decode — round-trip a drug interaction warning
        // -----------------------------------------------------------------------
        Console.WriteLine("--- Demo 2: EncodeToIds / Decode ---");

        string warning =
            "Contraindicated with MAO inhibitors. Concurrent use with serotonergic " +
            "agents may increase risk of serotonin syndrome. Avoid grapefruit juice.";

        IReadOnlyList<int> ids = tokenizer.EncodeToIds(warning);
        Console.WriteLine($"Token IDs  : {string.Join(", ", ids)}");
        Console.WriteLine($"Token count: {ids.Count}");

        string? decoded = tokenizer.Decode(ids);
        Console.WriteLine($"Decoded    : {decoded}");
        Console.WriteLine();

        // -----------------------------------------------------------------------
        // Demo 3: EncodeToTokens — inspect the token boundaries of a dosage string
        // -----------------------------------------------------------------------
        Console.WriteLine("--- Demo 3: EncodeToTokens ---");

        string dosage = "XR-42: 50 mg twice daily (maximum 200 mg/day)";

        IReadOnlyList<EncodedToken> tokens = tokenizer.EncodeToTokens(dosage, out string? normalised);
        Console.WriteLine($"Input     : {dosage}");
        Console.WriteLine($"Normalised: {normalised ?? dosage}");
        Console.WriteLine("Tokens:");
        foreach (EncodedToken token in tokens)
            Console.WriteLine($"  ID {token.Id,6}  '{token.Value}'");
        Console.WriteLine();

        // -----------------------------------------------------------------------
        // Demo 4: GetIndexByTokenCount / GetIndexByTokenCountFromEnd — trim a
        // long patient history to the first 20 and last 10 tokens
        // -----------------------------------------------------------------------
        Console.WriteLine("--- Demo 4: GetIndexByTokenCount / GetIndexByTokenCountFromEnd ---");

        string patientHistory =
            "Patient DOB 1978-03-14. Diagnosed hypertension 2015, type-2 diabetes 2019. " +
            "Current medications: metformin 500 mg, lisinopril 10 mg, atorvastatin 20 mg. " +
            "Allergies: penicillin (anaphylaxis), sulfa drugs (rash). " +
            "Last HbA1c: 7.2% (Jan 2026). Last eGFR: 74 mL/min/1.73 m² (Jan 2026).";

        int totalTokens = tokenizer.CountTokens(patientHistory);
        Console.WriteLine($"Full history: {totalTokens} tokens");

        // First 20 tokens
        int endIdx = tokenizer.GetIndexByTokenCount(patientHistory, 20, out string? proc1, out _);
        proc1 ??= patientHistory;
        Console.WriteLine($"First 20 tokens: \"{proc1[..endIdx]}\"");

        // Last 10 tokens
        int startIdx = tokenizer.GetIndexByTokenCountFromEnd(patientHistory, 10, out string? proc2, out _);
        proc2 ??= patientHistory;
        Console.WriteLine($"Last 10 tokens : \"{proc2[startIdx..]}\"");
        Console.WriteLine();
    }
}