The Microsoft.ML.Tokenizers library provides tokenization tools for .NET AI applications. Tokenization converts text into integer token IDs that AI models actually process. Knowing the token count of a piece of text lets you stay within a model's context limit, estimate API costs, and trim prompts before sending them.
Key Concepts
1. Creating a Tokenizer
Use TiktokenTokenizer.CreateForModel with the target model name. For gpt-4o this requires the Microsoft.ML.Tokenizers.Data.O200kBase NuGet package, which bundles the o200k_base encoding data. Create the instance once and reuse it; construction loads and parses the encoding data, which is relatively expensive.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
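One way to follow the create-once advice is to cache the tokenizer in a lazily initialised static field. A minimal sketch (the SharedTokenizer class name is illustrative, not part of the library):

```csharp
using Microsoft.ML.Tokenizers;

public static class SharedTokenizer
{
    // Built on first access, then reused for the lifetime of the process.
    private static readonly Lazy<Tokenizer> _gpt4o =
        new(() => TiktokenTokenizer.CreateForModel("gpt-4o"));

    public static Tokenizer Gpt4o => _gpt4o.Value;
}
```

Lazy&lt;T&gt; is thread-safe by default, so the instance is created exactly once even when several callers hit the property concurrently.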
2. CountTokens
Returns the number of tokens in a string without allocating the full ID list. Use this to check whether text fits within a model's context window before sending:
int count = tokenizer.CountTokens(text);
bool fits = count <= 8192;
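Token counts also feed cost estimates. A sketch using a placeholder per-token rate (the figure below is illustrative only; check your provider's current pricing):

```csharp
using Microsoft.ML.Tokenizers;

Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

// Placeholder rate for illustration only, not a real price.
const decimal dollarsPerMillionInputTokens = 2.50m;

string prompt = "Summarise the attached clinical trial report in three bullet points.";
int promptTokens = tokenizer.CountTokens(prompt);
decimal estimatedCost = promptTokens * dollarsPerMillionInputTokens / 1_000_000m;

Console.WriteLine($"{promptTokens} tokens, estimated input cost ${estimatedCost:F6}");
```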
3. EncodeToIds / Decode
Convert text to a list of integer token IDs, then decode them back to the original string:
IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);
string? decoded = tokenizer.Decode(ids);
4. EncodeToTokens
Returns EncodedToken objects carrying both the ID and the string value of each token. Use this to inspect exactly where token boundaries fall:
IReadOnlyList<EncodedToken> tokens = tokenizer.EncodeToTokens(text, out string? normalised);
foreach (EncodedToken token in tokens)
Console.WriteLine($"ID {token.Id} '{token.Value}'");
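In recent versions of the library, each EncodedToken also carries an Offset property (a System.Range into the input text), which maps a token back to the character span it came from. A sketch, assuming that property is available in your version:

```csharp
using Microsoft.ML.Tokenizers;

Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
string text = "XR-42: 50 mg twice daily";

IReadOnlyList<EncodedToken> tokens = tokenizer.EncodeToTokens(text, out string? normalised);
string source = normalised ?? text; // offsets refer to the normalised text when one exists

foreach (EncodedToken token in tokens)
{
    // token.Offset is a System.Range covering this token's characters.
    Console.WriteLine($"ID {token.Id,6} chars {token.Offset} -> '{source[token.Offset]}'");
}
```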
5. GetIndexByTokenCount / GetIndexByTokenCountFromEnd
Find the character index that corresponds to a given token budget, measured from the start or the end of the string. Use these to trim text to a precise token limit:
// First 20 tokens
int endIdx = tokenizer.GetIndexByTokenCount(text, 20, out string? processed, out _);
string first20 = (processed ?? text)[..endIdx];
// Last 10 tokens
int startIdx = tokenizer.GetIndexByTokenCountFromEnd(text, 10, out processed, out _);
string last10 = (processed ?? text)[startIdx..];
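These calls can be combined into a small helper that trims arbitrary text to a token budget. A sketch; the TrimToTokenBudget name is ours, not part of the library:

```csharp
using Microsoft.ML.Tokenizers;

static string TrimToTokenBudget(Tokenizer tokenizer, string text, int maxTokens)
{
    if (tokenizer.CountTokens(text) <= maxTokens)
        return text; // already within budget, nothing to trim

    // Find the character index where the first maxTokens tokens end.
    int endIdx = tokenizer.GetIndexByTokenCount(text, maxTokens, out string? processed, out _);
    return (processed ?? text)[..endIdx];
}

Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
string history = "Diagnosed hypertension 2015, type-2 diabetes 2019. " +
                 "Current medications: metformin 500 mg, lisinopril 10 mg.";
string trimmed = TrimToTokenBudget(tokenizer, history, 10);
Console.WriteLine(trimmed);
```

Trimming on a token boundary keeps the result decodable and guarantees the budget is respected, unlike trimming by character count.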
Full Example
using Microsoft.ML.Tokenizers;

namespace MicrosoftAgentFrameworkLesson.ConsoleApp.Tokenizers;

/// <summary>
/// Demonstrates Microsoft.ML.Tokenizers with the Tiktoken (gpt-4o) tokenizer.
/// Scenario: Pharmaceutical research assistant that checks token counts of
/// clinical study summaries before sending them to an AI model.
/// </summary>
public static class TokenizersDemo
{
    public static void Run()
    {
        // Create the Tiktoken tokenizer for gpt-4o (o200k_base encoding).
        // Reuse this instance throughout the app — creation is expensive.
        Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

        Console.WriteLine("====== Microsoft.ML.Tokenizers — Pharmaceutical Research Assistant ======\n");

        // -------------------------------------------------------------------
        // Demo 1: CountTokens — check whether a study abstract fits the limit
        // -------------------------------------------------------------------
        Console.WriteLine("--- Demo 1: CountTokens ---");

        string studyAbstract =
            "A randomised double-blind placebo-controlled trial evaluated the efficacy " +
            "of compound XR-42 in 340 adult patients diagnosed with moderate persistent asthma. " +
            "Primary endpoint was FEV1 improvement at week 12. Secondary endpoints included " +
            "exacerbation rate, rescue inhaler use, and quality-of-life scores. " +
            "Patients receiving XR-42 showed a 28% mean increase in FEV1 compared to 4% in the " +
            "placebo group (p < 0.001). No serious adverse events were attributed to the compound.";

        int tokenCount = tokenizer.CountTokens(studyAbstract);
        const int contextLimit = 8192;

        Console.WriteLine($"Abstract token count : {tokenCount}");
        Console.WriteLine($"Context limit        : {contextLimit}");
        Console.WriteLine($"Fits in context      : {tokenCount <= contextLimit}");
        Console.WriteLine();

        // -------------------------------------------------------------------
        // Demo 2: EncodeToIds / Decode — round-trip a drug interaction warning
        // -------------------------------------------------------------------
        Console.WriteLine("--- Demo 2: EncodeToIds / Decode ---");

        string warning =
            "Contraindicated with MAO inhibitors. Concurrent use with serotonergic " +
            "agents may increase risk of serotonin syndrome. Avoid grapefruit juice.";

        IReadOnlyList<int> ids = tokenizer.EncodeToIds(warning);
        Console.WriteLine($"Token IDs  : {string.Join(", ", ids)}");
        Console.WriteLine($"Token count: {ids.Count}");

        string? decoded = tokenizer.Decode(ids);
        Console.WriteLine($"Decoded    : {decoded}");
        Console.WriteLine();

        // -------------------------------------------------------------------
        // Demo 3: EncodeToTokens — inspect the token boundaries of a dosage string
        // -------------------------------------------------------------------
        Console.WriteLine("--- Demo 3: EncodeToTokens ---");

        string dosage = "XR-42: 50 mg twice daily (maximum 200 mg/day)";
        IReadOnlyList<EncodedToken> tokens = tokenizer.EncodeToTokens(dosage, out string? normalised);

        Console.WriteLine($"Input     : {dosage}");
        Console.WriteLine($"Normalised: {normalised ?? dosage}");
        Console.WriteLine("Tokens:");
        foreach (EncodedToken token in tokens)
            Console.WriteLine($"  ID {token.Id,6} '{token.Value}'");
        Console.WriteLine();

        // -------------------------------------------------------------------
        // Demo 4: GetIndexByTokenCount / GetIndexByTokenCountFromEnd — trim a
        //         long patient history to the first 20 and last 10 tokens
        // -------------------------------------------------------------------
        Console.WriteLine("--- Demo 4: GetIndexByTokenCount / GetIndexByTokenCountFromEnd ---");

        string patientHistory =
            "Patient DOB 1978-03-14. Diagnosed hypertension 2015, type-2 diabetes 2019. " +
            "Current medications: metformin 500 mg, lisinopril 10 mg, atorvastatin 20 mg. " +
            "Allergies: penicillin (anaphylaxis), sulfa drugs (rash). " +
            "Last HbA1c: 7.2% (Jan 2026). Last eGFR: 74 mL/min/1.73 m² (Jan 2026).";

        int totalTokens = tokenizer.CountTokens(patientHistory);
        Console.WriteLine($"Full history: {totalTokens} tokens");

        // First 20 tokens
        int endIdx = tokenizer.GetIndexByTokenCount(patientHistory, 20, out string? proc1, out _);
        proc1 ??= patientHistory;
        Console.WriteLine($"First 20 tokens: \"{proc1[..endIdx]}\"");

        // Last 10 tokens
        int startIdx = tokenizer.GetIndexByTokenCountFromEnd(patientHistory, 10, out string? proc2, out _);
        proc2 ??= patientHistory;
        Console.WriteLine($"Last 10 tokens : \"{proc2[startIdx..]}\"");
        Console.WriteLine();
    }
}