Microsoft Agent Framework Agents Created: 16 Feb 2026 Updated: 18 Feb 2026

Multimodal Input with Agents

1. Introduction

The Microsoft Agent Framework supports multimodal input — you can send images alongside text to an agent, and the agent can analyze and respond to the image content. This opens up use cases like image description, visual comparison, document analysis, and more.

In this lesson, you will learn how to create a ChatMessage that includes both text and image content using TextContent and UriContent. The agent (backed by a vision-capable model like gpt-4o) can then analyze the image and respond accordingly.

2. Prerequisites

  1. .NET 10 SDK installed
  2. An OpenAI API key (set the OPEN_AI_KEY environment variable)
  3. A vision-capable model (e.g., gpt-4o)
  4. The following NuGet packages: Microsoft.Agents.AI and Microsoft.Extensions.AI.OpenAI

3. Core Concepts

3.1. ChatMessage with Mixed Content

A ChatMessage can contain multiple content items. For multimodal input, you combine TextContent (your text prompt) with UriContent (an image URL) in a single message.

ChatMessage message = new(ChatRole.User, [
    new TextContent("What do you see in this image?"),
    new UriContent("https://example.com/image.jpg", "image/jpeg")
]);

3.2. Content Types

Type        | Description                            | Use Case
TextContent | Plain text content in a message        | Prompts, instructions, questions
UriContent  | Content referenced by a URI (URL)      | Images from the web, publicly accessible files
DataContent | Raw binary data (e.g., base64-encoded) | Local images, generated images
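For images that are not reachable by URL, DataContent lets you embed the raw bytes directly in the message. A minimal sketch (the file path is a placeholder; the DataContent constructor takes the bytes and a MIME type):

```csharp
// Load a local image and attach it as raw bytes instead of a URL.
byte[] imageBytes = File.ReadAllBytes("photo.jpg"); // placeholder path

ChatMessage message = new(ChatRole.User, [
    new TextContent("What do you see in this image?"),
    new DataContent(imageBytes, "image/jpeg")
]);
```

This is the approach to use for files on disk or images generated at runtime, where no public URL exists.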

3.3. Creating a Vision Agent

Any agent backed by a vision-capable model can process images. You create one the same way as a text agent — no special configuration is required for image support:

var agent = chatClient.AsAIAgent(new ChatClientAgentOptions
{
    Name = "VisionAgent",
    Instructions = "You are a helpful agent that can analyze images."
});

4. Step-by-Step: Passing Images to an Agent

Step 1 — Create the Chat Client

Create a chat client using a vision-capable model like gpt-4o:

var chatClient = new OpenAIClient(apiKey)
    .GetChatClient("gpt-4o")
    .AsIChatClient();

Step 2 — Create the Agent

var agent = chatClient.AsAIAgent(new ChatClientAgentOptions
{
    Name = "VisionAgent",
    Instructions = "You are a helpful agent that can analyze images and provide detailed descriptions."
});

Step 3 — Build the Message with Image Content

Create a ChatMessage that contains both a text prompt and an image URL. Use TextContent for the text and UriContent for the image:

ChatMessage message = new(ChatRole.User, [
    new TextContent("What do you see in this image?"),
    new UriContent(
        "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
        "image/jpeg")
]);

Note: The second parameter of UriContent is the MIME type of the image (e.g., "image/jpeg", "image/png").

Step 4 — Run the Agent

var response = await agent.RunAsync(message);
Console.WriteLine(response);

The agent will analyze the image and return a text description.
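Putting the four steps together, a minimal end-to-end sketch (the image URL is a placeholder; the client and agent setup match the snippets above, and the OPEN_AI_KEY environment variable is assumed to be set):

```csharp
using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;
using OpenAI;

var apiKey = Environment.GetEnvironmentVariable("OPEN_AI_KEY")
    ?? throw new InvalidOperationException("Set the OPEN_AI_KEY environment variable.");

// Step 1 — chat client backed by a vision-capable model
var chatClient = new OpenAIClient(apiKey)
    .GetChatClient("gpt-4o")
    .AsIChatClient();

// Step 2 — agent (no special configuration needed for image support)
var agent = chatClient.AsAIAgent(new ChatClientAgentOptions
{
    Name = "VisionAgent",
    Instructions = "You are a helpful agent that can analyze images."
});

// Step 3 — message combining a text prompt with an image URL
ChatMessage message = new(ChatRole.User, [
    new TextContent("What do you see in this image?"),
    new UriContent("https://example.com/image.jpg", "image/jpeg") // placeholder URL
]);

// Step 4 — run the agent and print the description
var response = await agent.RunAsync(message);
Console.WriteLine(response);
```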

5. Demo 1 — Basic Image Analysis

This demo shows the simplest use case: sending an image URL to the agent and receiving a text description. The agent analyzes the visual content and describes what it sees.

What it demonstrates:

  1. Creating a ChatMessage with TextContent + UriContent
  2. Using agent.RunAsync(message) with image input
  3. Receiving a text response describing the image

ChatMessage message = new(ChatRole.User, [
    new TextContent("What do you see in this image? Describe it in detail."),
    new UriContent(imageUrl, "image/jpeg")
]);

var response = await agent.RunAsync(message);
Console.WriteLine(response);

Use Cases: Image cataloging, accessibility descriptions, content moderation.

6. Demo 2 — Image Comparison

This demo shows how to send multiple images in a single message. The agent compares the images and identifies similarities and differences.

What it demonstrates:

  1. Including multiple UriContent items in one ChatMessage
  2. Asking the agent to compare visual content

ChatMessage message = new(ChatRole.User, [
    new TextContent("Compare these two images. What are the similarities and differences?"),
    new UriContent(imageUrl1, "image/jpeg"),
    new UriContent(imageUrl2, "image/jpeg")
]);

var response = await agent.RunAsync(message);
Console.WriteLine(response);

Use Cases: Before/after comparison, quality control, visual diff analysis.

7. Demo 3 — Image + Structured Output

This demo combines multimodal input (from this lesson) with structured output (from Lesson 4). The agent analyzes an image and returns the results as a strongly-typed C# object instead of free-form text.

What it demonstrates:

  1. Combining UriContent with ChatResponseFormat.ForJsonSchema()
  2. Deserializing image analysis into a typed ImageAnalysisResult object
  3. Programmatic access to extracted image data (subject, mood, colors, objects)

public class ImageAnalysisResult
{
    [JsonPropertyName("subject")]
    public string? Subject { get; set; }

    [JsonPropertyName("setting")]
    public string? Setting { get; set; }

    [JsonPropertyName("colors")]
    public List<string>? Colors { get; set; }

    [JsonPropertyName("mood")]
    public string? Mood { get; set; }

    [JsonPropertyName("objects")]
    public List<string>? Objects { get; set; }

    [JsonPropertyName("description")]
    public string? Description { get; set; }
}

// Configure structured output
JsonElement schema = AIJsonUtilities.CreateJsonSchema(typeof(ImageAnalysisResult));
var chatOptions = new ChatOptions
{
    ResponseFormat = ChatResponseFormat.ForJsonSchema(
        schema: schema,
        schemaName: "ImageAnalysisResult",
        schemaDescription: "Structured analysis of an image")
};

// Send image and get structured response
ChatMessage message = new(ChatRole.User, [
    new TextContent("Analyze this image and extract structured information."),
    new UriContent(imageUrl, "image/jpeg")
]);

// Pass the chat options to the run so the model returns JSON matching the schema
var response = await agent.RunAsync(message, options: new ChatClientAgentRunOptions(chatOptions));
var analysis = response.Deserialize<ImageAnalysisResult>(JsonSerializerOptions.Web);

Console.WriteLine($"Subject: {analysis.Subject}");
Console.WriteLine($"Mood: {analysis.Mood}");
Console.WriteLine($"Colors: {string.Join(", ", analysis.Colors ?? [])}");

Use Cases: Automated image tagging, visual search indexing, content management systems.

8. Demo 4 — Conversational Image Analysis

This demo shows a multi-turn conversation about an image. The agent receives an image in the first message, then answers follow-up questions about it in later turns. The image stays in context because each subsequent call resends the full conversation history, including the original image message.

What it demonstrates:

  1. Sending an image in the first message
  2. Building a conversation history with List<ChatMessage>
  3. Asking follow-up questions that reference the original image

// Turn 1: Send the image
ChatMessage imageMessage = new(ChatRole.User, [
    new TextContent("What is in this image?"),
    new UriContent(imageUrl, "image/jpeg")
]);
var response1 = await agent.RunAsync(imageMessage);

// Turn 2: Follow-up question (text only; the history carries the image forward)
var messages = new List<ChatMessage>
{
    imageMessage,
    new ChatMessage(ChatRole.Assistant, response1.Text),
    new ChatMessage(ChatRole.User, "What architectural style is the building?")
};
var response2 = await agent.RunAsync(messages);
Console.WriteLine(response2);

Use Cases: Interactive image exploration, educational tools, customer support with visual context.

9. Demo 5 — Streaming Image Analysis

This demo shows how to stream the agent's response while it analyzes an image. Streaming is useful for long analyses, as the user sees progressive results instead of waiting for the complete response.

What it demonstrates:

  1. Using agent.RunStreamingAsync(message) with image input
  2. Processing streaming updates with await foreach
  3. Displaying results progressively in the console

ChatMessage message = new(ChatRole.User, [
    new TextContent("Provide a very detailed analysis of this image."),
    new UriContent(imageUrl, "image/jpeg")
]);

await foreach (var update in agent.RunStreamingAsync(message))
{
    Console.Write(update.Text);
}

Use Cases: Real-time analysis dashboards, chat UIs, long-running visual inspections.

10. Best Practices

Do's

  1. Use a vision-capable model (e.g., gpt-4o)
  2. Always specify the correct MIME type in UriContent
  3. Write clear, specific text prompts alongside images
  4. Use publicly accessible image URLs
  5. Combine with structured output for machine-readable results
  6. Use streaming for detailed image analyses

Don'ts

  1. Don't send extremely large images (resize or compress first)
  2. Don't use models that don't support vision (e.g., text-only models)
  3. Don't send more than 5-10 images in a single message (performance)
  4. Don't expect pixel-perfect accuracy for text extraction from images

11. Troubleshooting

Problem: Agent cannot analyze the image.

Solution: Ensure you are using a vision-capable model like gpt-4o. Text-only models cannot process images.

Problem: Image URL returns an error.

Solution: Verify the URL is publicly accessible. Private or authenticated URLs will fail. Check the MIME type matches the actual image format.
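One way to verify accessibility before handing the URL to the agent is a quick pre-check from your own code. A hedged sketch using HttpClient (the URL is a placeholder):

```csharp
using var http = new HttpClient();

// Issue a HEAD request so we check the URL without downloading the whole image.
var request = new HttpRequestMessage(HttpMethod.Head, "https://example.com/image.jpg");
using var response = await http.SendAsync(request);

Console.WriteLine($"Status: {(int)response.StatusCode}");
Console.WriteLine($"Content-Type: {response.Content.Headers.ContentType}");
// A 200 status with an image/* content type should work with UriContent;
// 401/403 means the image requires authentication and the agent's fetch will fail.
```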

Problem: Response is generic or inaccurate.

Solution: Write more specific prompts. Instead of "describe this image", try "list all objects visible in this image and estimate their distance from the camera".

Problem: Multi-turn conversation loses image context.

Solution: Include the full conversation history (including the original image message) in each subsequent call.

12. Summary

In this lesson, we learned how to use images with agents in the Microsoft Agent Framework:

  1. Creating ChatMessage with TextContent + UriContent
  2. Analyzing single images and multiple images
  3. Combining multimodal input with structured output (JSON schema)
  4. Building multi-turn conversations with image context
  5. Streaming image analysis responses

Useful Resources

  1. Official Documentation — Using Images with Agents
  2. Structured Output (Lesson 4)
  3. Microsoft Agent Framework GitHub

Running the Application

# Set the OPEN_AI_KEY environment variable
$env:OPEN_AI_KEY = "your-api-key-here"

# Run the project
dotnet run

# Select Lesson 5 from the main menu, then pick a demo

© 2026 Microsoft Agent Framework Lessons | Lesson 5: Multimodal