Generative AI for Beginners: AI Patterns and Applications in .NET
Created: 04 Apr 2026 | Updated: 04 Apr 2026

Vision and Document Understanding

In this lesson you learn how to send images to AI models, extract structured information from photographs, compare multiple images in a single request, and maintain a multi-turn conversation about a visual document.

Reference: Generative AI for Beginners .NET — 03 Vision and Document Understanding

The Visual Gap

Every AI pattern we have used so far operates on text. The real world, however, is filled with content that exists only as images: site photographs, engineering drawings, scanned contracts, product labels, and aerial maps. Forcing humans to transcribe these into text before an AI can reason about them is slow and error-prone.

Modern multimodal models accept both text and images in the same request. You describe what you want in words and then simply hand the model a picture. It sees both and reasons across them simultaneously.

Practical examples where this matters:

  1. A civil engineering firm inspecting bridges from field photographs
  2. A logistics company reading shipping labels from camera images
  3. A legal team extracting clauses from scanned contracts
  4. A manufacturer detecting defects on a production line

Part 1 — Sending an Image to the Model

Microsoft.Extensions.AI represents image data with the DataContent class. You can supply either raw bytes plus a MIME type, or a public URL that the model fetches itself.

Loading an Image from a Public URL

DataContent's URI constructor only accepts data: scheme URIs, not regular https:// links. To use a publicly hosted image you must download the bytes first and supply them together with the MIME type:

// Download bytes — DataContent does NOT accept https:// URIs directly.
// Many CDNs (e.g. Wikimedia) also require a User-Agent header; use SendAsync.
using var http = new HttpClient();
using var request = new HttpRequestMessage(HttpMethod.Get, "https://example.com/site-photo.jpg");
request.Headers.UserAgent.ParseAdd("Mozilla/5.0 (compatible; MyApp/1.0)");
var httpResponse = await http.SendAsync(request);
httpResponse.EnsureSuccessStatusCode();
byte[] bytes = await httpResponse.Content.ReadAsByteArrayAsync();
var photo = new DataContent(bytes, "image/jpeg");

var messages = new List<ChatMessage>
{
    new(ChatRole.System,
        "You are a licensed civil engineer. Provide professional assessments."),
    new(ChatRole.User, new AIContent[]
    {
        new TextContent("Identify this structure and describe its primary materials."),
        photo // image travels alongside the text question
    })
};

var response = await chatClient.GetResponseAsync(messages);
Console.WriteLine(response.Text);

The key difference from a plain text call is the message content: instead of a single string you supply an AIContent[] array that mixes TextContent and DataContent items.

Re-using Images — Fetch Once, Pass Everywhere

If the same image is used in multiple requests, download it once and keep the DataContent object in a variable:

// Helper: download once → reuse as many times as needed.
// A User-Agent header prevents 403 responses from CDNs like Wikimedia.
private static async Task<DataContent> FetchImageAsync(HttpClient http, string url)
{
    using var request = new HttpRequestMessage(HttpMethod.Get, url);
    request.Headers.UserAgent.ParseAdd("Mozilla/5.0 (compatible; MyApp/1.0)");
    var response = await http.SendAsync(request);
    response.EnsureSuccessStatusCode();
    var bytes = await response.Content.ReadAsByteArrayAsync();
    return new DataContent(bytes, "image/jpeg");
}

// At startup
using var http = new HttpClient();
var photo = await FetchImageAsync(http, "https://example.com/site-photo.jpg");

// Reuse in as many calls as you like
await AnalyzeStructureAsync(chatClient, photo);
await ExtractAssetRecordAsync(chatClient, photo);

Loading an Image from a Local File

For images on disk, read the bytes directly — the pattern is the same:

byte[] bytes = await File.ReadAllBytesAsync("inspection-photo.jpg");
var photo = new DataContent(bytes, "image/jpeg");
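If you load files in several formats, the MIME type can be derived from the extension. A small sketch — `MimeTypeFor` is a hypothetical helper, not a library call; extend the switch for the formats you actually accept:

```csharp
using System;
using System.IO;

// Hypothetical helper: map a file extension to the MIME type DataContent expects.
// Covers common image formats only.
static string MimeTypeFor(string path) =>
    Path.GetExtension(path).ToLowerInvariant() switch
    {
        ".jpg" or ".jpeg" => "image/jpeg",
        ".png" => "image/png",
        ".gif" => "image/gif",
        ".webp" => "image/webp",
        _ => throw new NotSupportedException($"Unrecognised image extension: {path}")
    };

Console.WriteLine(MimeTypeFor("inspection-photo.jpg")); // image/jpeg
```

With this in place, loading becomes `new DataContent(bytes, MimeTypeFor(path))` regardless of format.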

Part 2 — Extracting Structured Data from Images

Vision models can read text visible in an image and reason about what it means. Combine that with a schema in your prompt and you get structured JSON out of a photograph — no OCR library required.

// Use FetchImageAsync (shown above) — handles the User-Agent header required by most CDNs
using var http = new HttpClient();
var photo = await FetchImageAsync(http, "https://example.com/bridge-photo.jpg");

var messages = new List<ChatMessage>
{
    new(ChatRole.System,
        "You are an asset data-capture tool. " +
        "Respond with valid JSON only — no markdown, no extra explanation."),
    new(ChatRole.User, new AIContent[]
    {
        new TextContent("""
            Analyse this infrastructure image and return exactly this JSON shape:
            {
                "asset_type": "bridge|building|road|tunnel|dam|other",
                "structure_name": "name or 'Unknown'",
                "estimated_year_built": "year or decade",
                "primary_material": "steel|concrete|masonry|timber|composite",
                "span_count_or_floors": 0,
                "condition_rating": "excellent|good|fair|poor|critical",
                "risk_flags": ["flag1", "flag2"]
            }
            """),
        photo
    })
};

var response = await chatClient.GetResponseAsync(messages);
// response.Text contains the JSON — parse with JsonSerializer if needed

The system prompt instructs the model to return only JSON so downstream code can parse it without string manipulation.
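Parsing that output needs nothing beyond System.Text.Json. A minimal sketch — the sample payload here is an illustrative stand-in for `response.Text`:

```csharp
using System;
using System.Text.Json;

// Stand-in for response.Text — in a real run this string comes from the model.
var json = """
    {
      "asset_type": "bridge",
      "structure_name": "Unknown",
      "estimated_year_built": "1930s",
      "primary_material": "steel",
      "span_count_or_floors": 3,
      "condition_rating": "good",
      "risk_flags": ["corrosion", "seismic"]
    }
    """;

// JsonDocument gives read-only access without declaring a full type.
using var doc = JsonDocument.Parse(json);
var assetType = doc.RootElement.GetProperty("asset_type").GetString();
var rating = doc.RootElement.GetProperty("condition_rating").GetString();
var flagCount = doc.RootElement.GetProperty("risk_flags").GetArrayLength();

Console.WriteLine($"{assetType} rated {rating} with {flagCount} risk flag(s)");
```

For a fixed schema like this one, the typed alternative is `JsonSerializer.Deserialize<T>` onto a record whose properties carry `[JsonPropertyName]` attributes matching the snake_case keys.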

Part 3 — Comparing Multiple Images

You can include more than one DataContent item in the same AIContent[] array. The model sees all images together and can compare, contrast, or find relationships between them.

// FetchImageAsync (defined in Part 1) sends a User-Agent header to avoid 403s
using var http = new HttpClient();
var imageA = await FetchImageAsync(http, "https://example.com/bridge.jpg");
var imageB = await FetchImageAsync(http, "https://example.com/tower.jpg");

var messages = new List<ChatMessage>
{
    new(ChatRole.System, "You are a comparative structures analyst. Be factual and concise."),
    new(ChatRole.User, new AIContent[]
    {
        new TextContent(
            "Compare these two structures: " +
            "(1) construction type and era, " +
            "(2) primary function, " +
            "(3) greatest construction challenge, " +
            "(4) maintenance complexity."),
        imageA,
        imageB // both images travel in the same request
    })
};

var response = await chatClient.GetResponseAsync(messages);

Useful scenarios: before-and-after change detection, A/B defect comparison, multi-site inspection comparison.

Part 4 — Conversational Document Q&A

The most powerful vision pattern is treating an image (or a scanned document) as a long-lived context that you interrogate across multiple turns. You load the image once and then ask follow-up questions that build on each other, just like a person leafing through a report.

The trick is to maintain a List<ChatMessage> as your conversation history and re-attach the image with each question so the model always has visual context when it answers.

The AssetAdvisor Pattern

public sealed class AssetAdvisor
{
    private readonly IChatClient _chatClient;
    private readonly List<ChatMessage> _history;
    private DataContent? _photo; // holds the pre-downloaded image bytes

    public AssetAdvisor(IChatClient chatClient)
    {
        _chatClient = chatClient;
        _history =
        [
            new ChatMessage(ChatRole.System,
                """
                You are a senior infrastructure asset advisor.
                Reference specific visual details from the photo in every answer.
                Keep each answer under 100 words.
                """)
        ];
    }

    // Accepts a DataContent already loaded from bytes (not a Uri)
    public async Task LoadAssetAsync(DataContent photo)
    {
        _photo = photo;

        _history.Add(new ChatMessage(ChatRole.User, new AIContent[]
        {
            new TextContent("Please give me a one-paragraph structural overview of this asset."),
            _photo
        }));

        var response = await _chatClient.GetResponseAsync(_history);
        _history.Add(new ChatMessage(ChatRole.Assistant, response.Text));

        Console.WriteLine($"Asset loaded:\n{response.Text}");
    }

    // Ask follow-up questions — conversation history provides context
    public async Task<string> AskAsync(string question)
    {
        _history.Add(new ChatMessage(ChatRole.User, new AIContent[]
        {
            new TextContent(question),
            _photo! // re-attach so model can see the image again
        }));

        var response = await _chatClient.GetResponseAsync(_history);
        _history.Add(new ChatMessage(ChatRole.Assistant, response.Text));

        return response.Text;
    }
}

Using AssetAdvisor

// FetchImageAsync sends a User-Agent header — plain GetByteArrayAsync causes 403 on CDNs
using var http = new HttpClient();
var bridgePhoto = await FetchImageAsync(http, "https://example.com/bridge.jpg");

var advisor = new AssetAdvisor(chatClient);
await advisor.LoadAssetAsync(bridgePhoto); // pass DataContent, not a Uri

Console.WriteLine(await advisor.AskAsync("What rehabilitation work would you prioritise?"));
Console.WriteLine(await advisor.AskAsync("How does salt-air exposure affect this structure type?"));
Console.WriteLine(await advisor.AskAsync("Which sensors would you install and where?"));

Each call to AskAsync builds on all previous answers because the full conversation history travels with every request, enabling a deep technical dialogue about a single visual document.

Part 5 — PDF and Multi-Page Document Processing

Vision models accept images, not PDF files directly. The standard approach for PDFs is:

  1. Convert each page to a PNG or JPEG (using a library such as PdfiumViewer or ImageMagick)
  2. Send each page image to the model
  3. Collect per-page results and combine them

async Task<string> ProcessPdfAsync(IEnumerable<byte[]> pageImages, IChatClient chatClient)
{
    var results = new System.Text.StringBuilder();
    int pageNumber = 1;

    foreach (var pageBytes in pageImages)
    {
        var pageImage = new DataContent(pageBytes, "image/png");

        var messages = new List<ChatMessage>
        {
            new(ChatRole.User, new AIContent[]
            {
                new TextContent($"Extract all text and describe diagrams on page {pageNumber}."),
                pageImage
            })
        };

        var response = await chatClient.GetResponseAsync(messages);
        results.AppendLine($"## Page {pageNumber}");
        results.AppendLine(response.Text);
        results.AppendLine();
        pageNumber++;
    }

    return results.ToString();
}

This page-by-page strategy respects context boundaries — each page is a self-contained unit — and lets you process arbitrarily long documents.

Part 6 — Vision Best Practices

Image Quality

Tip                             Why it matters
Use high resolution             Small or blurry details get missed or hallucinated
Avoid heavy JPEG compression    Artefacts look like dirt or damage to the model
Good, even lighting             Shadows obscure text and structural features
Straighten and crop             Tilted documents are harder to read reliably

Token Budget

Images consume tokens just like text. A high-resolution image can cost thousands of tokens. Resize large images before sending when exact pixel detail is not required:

// Example: cap width at 1920 px to reduce token cost while preserving readability
// Use SixLabors.ImageSharp or System.Drawing for the actual resize operation
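The resize itself needs an imaging library, but computing the target dimensions is plain arithmetic. A sketch of the aspect-preserving calculation — `CapWidth` is illustrative, not a library call:

```csharp
using System;

// Cap width at maxWidth while preserving aspect ratio.
// Feed the result to your image library's resize call (e.g. ImageSharp).
static (int Width, int Height) CapWidth(int width, int height, int maxWidth) =>
    width <= maxWidth
        ? (width, height)
        : (maxWidth, (int)Math.Round(height * (double)maxWidth / width));

var (w, h) = CapWidth(4032, 3024, 1920); // typical phone photo scales to 1920 x 1440
Console.WriteLine($"{w} x {h}");
```

Images already narrower than the cap pass through unchanged, so the helper is safe to apply unconditionally before upload.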

Sensitive Documents

Never log raw responses when processing documents that may contain personal data. Use the system prompt to instruct the model to mask sensitive fields:

new(ChatRole.System,
    "Process this document but do not retain any personal information. " +
    "Mask SSNs, credit card numbers, and other sensitive identifiers in your response.")

Provider Flexibility

Because all code uses the IChatClient abstraction from Microsoft.Extensions.AI, swapping from a cloud model to a local vision model (e.g., llava via Ollama) requires changing only the client construction line — the rest of the code is identical.

// Cloud (OpenAI)
IChatClient chatClient = new OpenAIClient(apiKey)
    .GetChatClient("gpt-4o-mini")
    .AsIChatClient();

// Local (Ollama — same interface, no code changes elsewhere)
// IChatClient chatClient = new OllamaApiClient("http://localhost:11434")
//     .AsIChatClient("llava");

Let's Review — What You Learned

  DataContent: wraps image bytes (+ MIME type) so they can travel alongside text in a chat message. Requires bytes — it does not accept https:// URIs. When downloading from CDNs (e.g. Wikimedia), use HttpRequestMessage + UserAgent via SendAsync — bare GetByteArrayAsync can return 403.
  AIContent[]: array of mixed content items (TextContent + DataContent) passed to a ChatMessage.
  Multimodal request: a single request that contains both natural-language instructions and one or more images.
  Structured extraction: prompt the model with a JSON schema; it returns filled-in JSON from the image.
  Multi-image comparison: include multiple DataContent items in one AIContent[] to compare images in one request.
  Document Q&A pattern: maintain a conversation history; re-attach the image with each follow-up question.
  PDF processing: convert pages to images first, then process each page through the vision model.

Quick Self-Check

  1. What class in Microsoft.Extensions.AI carries image data alongside text in a message?
  2. How can you send two images to a model for comparison in a single request?
  3. Why is conversation history important in the Document Q&A pattern?
  4. Why does processing a PDF require converting pages to images first?

Full Example

using Microsoft.Extensions.AI;
using OpenAI;

namespace MicrosoftAgentFrameworkLesson.ConsoleApp;

/// <summary>
/// Lesson 4 — Vision and Document Understanding
/// Domain: Civil Infrastructure Asset Assessment
///
/// An engineering consultancy uses AI vision to inspect bridges, towers, and
/// other public structures from field photographs. Patterns covered:
/// Part 1 — Analyse a single image (bytes fetched from a public URL)
/// Part 2 — Extract structured JSON data from an image
/// Part 3 — Compare two structures supplied in the same request
/// Part 4 — Maintain a multi-turn conversation about one structure (AssetAdvisor)
/// </summary>
public static class VisionDocumentDemo
{
    // Stable Wikimedia Commons images used as stand-ins for field photographs
    private const string BridgePhotoUrl =
        "https://upload.wikimedia.org/wikipedia/commons/thumb/0/0c/GoldenGateBridge-001.jpg/1200px-GoldenGateBridge-001.jpg";

    private const string TowerPhotoUrl =
        "https://upload.wikimedia.org/wikipedia/commons/a/a8/Tour_Eiffel_Wikimedia_Commons.jpg";

    // DataContent requires raw bytes + MIME type; it does NOT accept https:// URIs.
    // Wikimedia requires a User-Agent header; without it the server returns 403.
    private static async Task<DataContent> FetchImageAsync(HttpClient http, string url)
    {
        using var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Headers.UserAgent.ParseAdd("Mozilla/5.0 (compatible; LessonBot/1.0)");
        var response = await http.SendAsync(request);
        response.EnsureSuccessStatusCode();
        var bytes = await response.Content.ReadAsByteArrayAsync();
        return new DataContent(bytes, "image/jpeg");
    }

    public static async Task RunAsync()
    {
        var apiKey = Environment.GetEnvironmentVariable("OPEN_AI_KEY")
            ?? throw new InvalidOperationException("OPEN_AI_KEY environment variable is not set.");

        IChatClient chatClient = new OpenAIClient(apiKey)
            .GetChatClient("gpt-4o-mini")
            .AsIChatClient();

        using var http = new HttpClient();

        // Pre-fetch images once so each demo part can reuse the bytes
        Console.WriteLine("Fetching field photographs...");
        var bridgePhoto = await FetchImageAsync(http, BridgePhotoUrl);
        var towerPhoto = await FetchImageAsync(http, TowerPhotoUrl);
        Console.WriteLine("Photos ready.\n");

        // ── Part 1: Basic image recognition ───────────────────────────────
        Console.WriteLine("═══════════════════════════════════════════════════");
        Console.WriteLine("PART 1 — Single-Image Structural Analysis");
        Console.WriteLine("═══════════════════════════════════════════════════\n");
        await AnalyzeStructureAsync(chatClient, bridgePhoto);

        // ── Part 2: Structured JSON extraction from image ─────────────────
        Console.WriteLine("\n═══════════════════════════════════════════════════");
        Console.WriteLine("PART 2 — Asset Record Extraction (JSON)");
        Console.WriteLine("═══════════════════════════════════════════════════\n");
        await ExtractAssetRecordAsync(chatClient, bridgePhoto);

        // ── Part 3: Multi-image comparison ────────────────────────────────
        Console.WriteLine("\n═══════════════════════════════════════════════════");
        Console.WriteLine("PART 3 — Comparative Structural Analysis");
        Console.WriteLine("═══════════════════════════════════════════════════\n");
        await CompareStructuresAsync(chatClient, bridgePhoto, towerPhoto);

        // ── Part 4: Conversational asset Q&A ─────────────────────────────
        Console.WriteLine("\n═══════════════════════════════════════════════════");
        Console.WriteLine("PART 4 — Conversational Asset Advisory Session");
        Console.WriteLine("═══════════════════════════════════════════════════\n");
        await RunAdvisorySessionAsync(chatClient, bridgePhoto);
    }

    // Part 1 ── Describe a structure from a single photo (bytes)
    private static async Task AnalyzeStructureAsync(IChatClient chatClient, DataContent photo)
    {
        var messages = new List<ChatMessage>
        {
            new(ChatRole.System,
                "You are a licensed civil engineer specialising in infrastructure inspections. " +
                "Provide concise, professional assessments based on photographs."),
            new(ChatRole.User, new AIContent[]
            {
                new TextContent(
                    "Identify this structure and provide a brief assessment covering: " +
                    "structure type, approximate construction era, primary materials, " +
                    "and any immediately visible maintenance concerns."),
                photo
            })
        };

        var response = await chatClient.GetResponseAsync(messages);
        Console.WriteLine($"Field Inspection Report:\n{response.Text}");
    }

    // Part 2 ── Extract a structured asset record as JSON from an image
    private static async Task ExtractAssetRecordAsync(IChatClient chatClient, DataContent photo)
    {
        var messages = new List<ChatMessage>
        {
            new(ChatRole.System,
                "You are an asset data-capture tool. " +
                "Respond with valid JSON only — no markdown, no extra explanation."),
            new(ChatRole.User, new AIContent[]
            {
                new TextContent("""
                    Analyse this infrastructure image and return exactly this JSON shape:
                    {
                        "asset_type": "bridge|building|road|tunnel|dam|other",
                        "structure_name": "name or 'Unknown'",
                        "estimated_year_built": "year or decade",
                        "primary_material": "steel|concrete|masonry|timber|composite",
                        "span_count_or_floors": 0,
                        "condition_rating": "excellent|good|fair|poor|critical",
                        "risk_flags": ["flag1", "flag2"]
                    }
                    """),
                photo
            })
        };

        var response = await chatClient.GetResponseAsync(messages);
        Console.WriteLine($"Asset Record:\n{response.Text}");
    }

    // Part 3 ── Send two images in one message for side-by-side comparison
    private static async Task CompareStructuresAsync(IChatClient chatClient, DataContent bridge, DataContent tower)
    {
        var messages = new List<ChatMessage>
        {
            new(ChatRole.System, "You are a comparative structures analyst. Be factual and concise."),
            new(ChatRole.User, new AIContent[]
            {
                new TextContent(
                    "You are looking at two iconic engineering structures. " +
                    "Compare them across four dimensions — " +
                    "use one sentence per dimension per structure: " +
                    "(1) construction type and era, " +
                    "(2) primary engineering purpose, " +
                    "(3) greatest construction challenge, " +
                    "(4) estimated ongoing maintenance complexity."),
                bridge,
                tower
            })
        };

        var response = await chatClient.GetResponseAsync(messages);
        Console.WriteLine($"Comparative Report:\n{response.Text}");
    }

    // Part 4 ── Run a multi-turn advisory conversation using AssetAdvisor
    private static async Task RunAdvisorySessionAsync(IChatClient chatClient, DataContent photo)
    {
        var advisor = new AssetAdvisor(chatClient);
        await advisor.LoadAssetAsync(photo);

        string[] questions =
        [
            "What rehabilitation strategies would you recommend for a structure of this age and type?",
            "How would prolonged exposure to salt air and high winds accelerate deterioration here?",
            "Which structural monitoring sensors would you install, and at what locations on this structure?"
        ];

        foreach (var question in questions)
        {
            Console.WriteLine($"\nEngineer: {question}");
            var answer = await advisor.AskAsync(question);
            Console.WriteLine($"Advisor: {answer}");
        }
    }
}

/// <summary>
/// Maintains a multi-turn conversation about a single structure photograph.
/// Demonstrates the Document Q&A pattern: load a visual document once,
/// then ask as many follow-up questions as needed while preserving context.
/// </summary>
public sealed class AssetAdvisor
{
    private readonly IChatClient _chatClient;
    private readonly List<ChatMessage> _history;
    private DataContent? _photo; // holds the pre-downloaded image bytes

    public AssetAdvisor(IChatClient chatClient)
    {
        _chatClient = chatClient;
        _history =
        [
            new ChatMessage(ChatRole.System,
                """
                You are a senior infrastructure asset advisor with 30 years of field experience.
                When answering follow-up questions always reference specific visual details
                visible in the uploaded photograph. Keep each answer under 100 words.
                """)
        ];
    }

    /// <summary>
    /// Loads a photo and asks the model for an initial structural overview.
    /// The overview is added to conversation history so subsequent questions
    /// can refer back to it.
    /// </summary>
    public async Task LoadAssetAsync(DataContent photo)
    {
        _photo = photo;

        _history.Add(new ChatMessage(ChatRole.User, new AIContent[]
        {
            new TextContent(
                "I am uploading a field photograph of an infrastructure asset. " +
                "Please give me a one-paragraph structural overview."),
            _photo
        }));

        var response = await _chatClient.GetResponseAsync(_history);
        _history.Add(new ChatMessage(ChatRole.Assistant, response.Text));

        Console.WriteLine($"Asset loaded. Overview:\n{response.Text}");
    }

    /// <summary>
    /// Asks a follow-up question. The photo is re-attached so the model can
    /// reference specific visual details in its answer.
    /// </summary>
    public async Task<string> AskAsync(string question)
    {
        _history.Add(new ChatMessage(ChatRole.User, new AIContent[]
        {
            new TextContent(question),
            _photo! // re-attach bytes so model always has visual context
        }));

        var response = await _chatClient.GetResponseAsync(_history);
        _history.Add(new ChatMessage(ChatRole.Assistant, response.Text));

        return response.Text;
    }
}