Retrieval-Augmented Generation (RAG)
In this lesson you will learn how to ground AI responses in your own data so that answers are accurate, current, and relevant to your specific domain — even when that information was never part of the model's training set.
The Knowledge Problem
Language models are trained on a fixed snapshot of the world. They cannot know about:
- Your organisation's internal documents and policies
- Events that occurred after their training cut-off
- Private data they were never trained on
- Domain-specific knowledge unique to your business
Retrieval-Augmented Generation (RAG) solves this by fetching relevant documents from a knowledge base at query time and injecting them into the prompt before asking the model to answer.
Part 1: How RAG Works
RAG has three phases — retrieve, augment, and generate — that execute every time a user asks a question. Below, the last two are grouped into a single step because they happen together.
Phase 1 — Retrieve
The user's question is converted into an embedding vector. That vector is compared against a pre-built index of document embeddings. The most semantically similar documents are returned.
Phase 2 — Augment and Generate
The retrieved documents are inserted into the system prompt as context. The chat model reads the question together with that context and generates a grounded answer.
The RAG Flow
Part 2: Building a RAG System
Our demo uses a fitness center member portal as its domain — completely separate from the standard customer-support examples you see in tutorials.
Step 1 — Define the Data Model
Decorate a plain C# class with attributes from Microsoft.Extensions.VectorData.
- [VectorStoreKey] — the unique identifier for each document
- [VectorStoreData] — plain metadata stored alongside the vector
- [VectorStoreVector(1536)] — the embedding field; the dimension must match the model (text-embedding-3-small produces 1536-dimensional vectors)
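A minimal sketch of such a record type. The class and property names (GymPolicyDocument, Title, Content) are illustrative choices for the fitness-center domain, not taken from the demo source:

```csharp
using Microsoft.Extensions.VectorData;

// Hypothetical document type for the fitness-center knowledge base.
public class GymPolicyDocument
{
    [VectorStoreKey]
    public string Id { get; set; } = string.Empty;

    [VectorStoreData]
    public string Title { get; set; } = string.Empty;

    [VectorStoreData]
    public string Content { get; set; } = string.Empty;

    // Dimension must match the embedding model:
    // text-embedding-3-small produces 1536-dimensional vectors.
    [VectorStoreVector(1536)]
    public ReadOnlyMemory<float> Embedding { get; set; }
}
```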
Step 2 — Set Up the Services
IEmbeddingGenerator and IChatClient are provider-neutral interfaces from Microsoft.Extensions.AI. Swapping from OpenAI to Azure OpenAI or a local Ollama model only requires changing the construction code — the rest of the RAG pipeline is identical.
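A sketch of the construction code, assuming the Microsoft.Extensions.AI.OpenAI package. The extension-method names have shifted between preview versions, so treat these as indicative rather than exact:

```csharp
using Microsoft.Extensions.AI;
using OpenAI;

// API key and model names are example values.
var openAi = new OpenAIClient(
    Environment.GetEnvironmentVariable("OPENAI_API_KEY"));

IChatClient chatClient =
    openAi.GetChatClient("gpt-4o-mini").AsIChatClient();

IEmbeddingGenerator<string, Embedding<float>> embeddingGenerator =
    openAi.GetEmbeddingClient("text-embedding-3-small").AsIEmbeddingGenerator();
```

Only these lines change when you swap providers; everything downstream depends solely on the two interfaces.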
Step 3 — Populate the Knowledge Base
Each document's content is embedded and stored alongside its metadata. This indexing step happens once (at application start, or as a background job when documents are added or updated), not on every query.
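A sketch of the indexing loop, assuming the in-memory connector and a hypothetical GymPolicyDocument class decorated with the attributes from Step 1. Method names such as EnsureCollectionExistsAsync and GenerateEmbeddingVectorAsync vary across connector and library versions:

```csharp
using Microsoft.Extensions.VectorData;
using Microsoft.SemanticKernel.Connectors.InMemory;

var vectorStore = new InMemoryVectorStore();
var collection = vectorStore.GetCollection<string, GymPolicyDocument>("policies");
await collection.EnsureCollectionExistsAsync();

foreach (var doc in documents)
{
    // Embed the content text; the vector is stored next to the metadata fields.
    doc.Embedding = await embeddingGenerator.GenerateEmbeddingVectorAsync(doc.Content);
    await collection.UpsertAsync(doc);
}
```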
Step 4 — Retrieve Relevant Documents
SearchAsync returns the top documents whose embedding vectors are closest to the query vector (cosine distance by default).
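A retrieval sketch under the same assumptions as above; the exact SearchAsync signature differs slightly between connector versions:

```csharp
// Embed the question with the SAME model used to build the index —
// vectors from different models are not comparable.
var queryVector = await embeddingGenerator.GenerateEmbeddingVectorAsync(question);

// Stream back the top-3 records closest to the query vector.
await foreach (var result in collection.SearchAsync(queryVector, top: 3))
{
    Console.WriteLine($"{result.Record.Title} (score {result.Score:F3})");
}
```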
Step 5 — Build the Augmented Prompt
Four key constraints in the system prompt:
- Define the assistant's role clearly
- Restrict the model to only the provided context
- Tell it what to say when the answer is absent from the context
- Separate the context block from the user question visually
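The four constraints above can be sketched as a prompt template (wording and delimiters are illustrative):

```csharp
// Assemble the retrieved documents into a clearly delimited context block.
var context = string.Join("\n\n",
    retrievedDocs.Select(d => $"[{d.Title}]\n{d.Content}"));

var systemPrompt = $"""
    You are an assistant for a fitness center member portal.
    Answer ONLY from the context below.
    If the context does not contain the answer, reply:
    "I don't have that information in the member handbook."

    === CONTEXT ===
    {context}
    === END CONTEXT ===
    """;
```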
Step 6 — Generate the Answer
GetResponseAsync calls the chat model with the augmented prompt. The model reads the injected policy text and produces a grounded reply. A well-constrained prompt makes it far less likely to hallucinate details that are absent from the context, though no prompt eliminates that risk entirely.
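The final call is provider-neutral; a sketch, assuming the systemPrompt built in Step 5:

```csharp
using Microsoft.Extensions.AI;

var messages = new List<ChatMessage>
{
    new(ChatRole.System, systemPrompt),
    new(ChatRole.User, question),
};

// Works against any IChatClient implementation: OpenAI, Azure OpenAI, Ollama, …
var response = await chatClient.GetResponseAsync(messages);
Console.WriteLine(response.Text);
```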
Part 3: Chunking Strategy for Large Documents
Embedding models have an input-token limit (typically 8,192 tokens for text-embedding-3-small). Long documents must be split into overlapping chunks so that each chunk is small enough to embed and specific enough to be retrieved precisely.
The overlap parameter ensures that a sentence split across two chunk boundaries still appears in at least one complete chunk, preventing retrieval gaps.
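A minimal word-window chunker illustrating the overlap idea. Splitting on words rather than tokens is a simplification; production code would count tokens with the embedding model's tokenizer:

```csharp
// Split text into overlapping word-windows.
// chunkSize and overlap are measured in words, not tokens.
static IEnumerable<string> ChunkByWords(string text, int chunkSize = 200, int overlap = 40)
{
    var words = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
    for (int start = 0; start < words.Length; start += chunkSize - overlap)
    {
        yield return string.Join(' ', words.Skip(start).Take(chunkSize));
        // The last window may be shorter; stop once it has been emitted.
        if (start + chunkSize >= words.Length) break;
    }
}
```

With chunkSize 200 and overlap 40, each new chunk repeats the final 40 words of its predecessor, so a sentence straddling a boundary survives intact in at least one chunk.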
Chunking Trade-offs
| Factor | Smaller chunks | Larger chunks |
|---|---|---|
| Retrieval precision | Higher — tightly focused matches | Lower — noisy context |
| Context completeness | Lower — may miss surrounding context | Higher — more background included |
| Token cost per query | Lower | Higher |
| Index size | Larger (more chunks to embed) | Smaller |
Part 4: RAG Best Practices
Retrieval Quality Levers
| Technique | Effect |
|---|---|
| Increase top-K | More context provided; may add noise |
| Metadata filtering | Filter by category, date, or author before vector search |
| Reranking | Score retrieved docs by a cross-encoder for higher precision |
| Hybrid search | Combine vector search with keyword (BM25) search |
Provider Flexibility
Because both IEmbeddingGenerator and IChatClient are abstractions, you can run the same RAG pipeline locally with Ollama:
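A sketch of the local setup, assuming an Ollama-backed implementation of the two interfaces (the class names below come from the Microsoft.Extensions.AI.Ollama package; the current recommendation is the OllamaSharp library, whose types differ). Model names are example values:

```csharp
using Microsoft.Extensions.AI;

// Point both clients at a locally running Ollama server.
IChatClient chatClient =
    new OllamaChatClient(new Uri("http://localhost:11434"), "llama3.2");

IEmbeddingGenerator<string, Embedding<float>> embeddingGenerator =
    new OllamaEmbeddingGenerator(new Uri("http://localhost:11434"), "all-minilm");
```

Remember that the vector dimension attribute on the data model must match whatever embedding model you choose; all-minilm does not produce 1536-dimensional vectors.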
Production Vector Stores
| Store | Package | Use Case |
|---|---|---|
| InMemory | Microsoft.SemanticKernel.Connectors.InMemory | Local dev and prototyping |
| Azure AI Search | Microsoft.SemanticKernel.Connectors.AzureAISearch | Enterprise, hybrid search |
| Qdrant | Microsoft.SemanticKernel.Connectors.Qdrant | Open-source, self-hosted |
| Weaviate | Microsoft.SemanticKernel.Connectors.Weaviate | Open-source, GraphQL API |
| Postgres (pgvector) | Microsoft.SemanticKernel.Connectors.Postgres | Existing Postgres databases |
Let's Review: What You Learned
| Concept | Summary |
|---|---|
| The Knowledge Problem | LLMs only know what they were trained on; RAG fills the gap at query time |
| Retrieve phase | Embed the question, search the vector store, return top-K documents |
| Augment phase | Inject retrieved documents into the system prompt as context |
| Generate phase | The chat model answers using only the injected context |
| Chunking | Split large docs into overlapping word-windows before embedding |
| Provider flexibility | IEmbeddingGenerator and IChatClient abstractions allow any backend |
Quick Self-Check
- What are the three phases of a RAG pipeline?
- Why do we chunk large documents instead of embedding them whole?
- What should the model reply when the answer is not in the retrieved context?
- Which NuGet package provides InMemoryVectorStore?
Full Example
Complete, verbatim source of RagDemo.cs: