Multimodal Input with Agents
1. Introduction
The Microsoft Agent Framework supports multimodal input — you can send images alongside text to an agent, and the agent can analyze and respond to the image content. This opens up use cases like image description, visual comparison, document analysis, and more.
In this lesson, you will learn how to create a ChatMessage that includes both text and image content using TextContent and UriContent. The agent (backed by a vision-capable model like gpt-4o) can then analyze the image and respond accordingly.
2. Prerequisites
- .NET 10 SDK installed
- An OpenAI API key (set the `OPEN_AI_KEY` environment variable)
- A vision-capable model (e.g., `gpt-4o`)
- The following NuGet packages:
  - `Microsoft.Agents.AI`
  - `Microsoft.Extensions.AI.OpenAI`
3. Core Concepts
3.1. ChatMessage with Mixed Content
A ChatMessage can contain multiple content items. For multimodal input, you combine TextContent (your text prompt) with UriContent (an image URL) in a single message.
3.2. Content Types
| Type | Description | Use Case |
|---|---|---|
| `TextContent` | Plain text content in a message | Prompts, instructions, questions |
| `UriContent` | Content referenced by a URI (URL) | Images from the web, publicly accessible files |
| `DataContent` | Raw binary data (e.g., base64-encoded) | Local images, generated images |
3.3. Creating a Vision Agent
Any agent backed by a vision-capable model can process images. You create one the same way as a text agent — no special configuration is required for image support:
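A minimal sketch of such an agent, assuming the `OPEN_AI_KEY` environment variable is set and using the `AsIChatClient()`/`CreateAIAgent()` extensions from the packages listed above (the instructions string is illustrative; the step-by-step section below breaks this into parts):

```csharp
using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;
using OpenAI;

// No image-specific configuration — a vision-capable model is all that is required.
AIAgent agent = new OpenAIClient(Environment.GetEnvironmentVariable("OPEN_AI_KEY"))
    .GetChatClient("gpt-4o")
    .AsIChatClient()
    .CreateAIAgent(instructions: "You are a helpful assistant that analyzes images.");
```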
4. Step-by-Step: Passing Images to an Agent
Step 1 — Create the Chat Client
Create a chat client using a vision-capable model like gpt-4o:
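One way this step might look, assuming the API key is read from the `OPEN_AI_KEY` environment variable:

```csharp
using Microsoft.Extensions.AI;
using OpenAI;

// Wrap the OpenAI chat client as an IChatClient for the agent framework.
IChatClient chatClient = new OpenAIClient(Environment.GetEnvironmentVariable("OPEN_AI_KEY"))
    .GetChatClient("gpt-4o")
    .AsIChatClient();
```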
Step 2 — Create the Agent
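A sketch of the agent creation, assuming the `CreateAIAgent` extension from `Microsoft.Agents.AI` (the name and instructions are illustrative):

```csharp
using Microsoft.Agents.AI;

// Build the agent on top of the chat client from Step 1.
AIAgent agent = chatClient.CreateAIAgent(
    name: "VisionAgent",
    instructions: "You are a helpful assistant that analyzes images.");
```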
Step 3 — Build the Message with Image Content
Create a ChatMessage that contains both a text prompt and an image URL. Use TextContent for the text and UriContent for the image:
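A sketch of the mixed-content message (the image URL is a placeholder):

```csharp
using Microsoft.Extensions.AI;

// One user message carrying both a text prompt and an image reference.
var message = new ChatMessage(ChatRole.User,
[
    new TextContent("What do you see in this image? Describe it in detail."),
    new UriContent("https://example.com/photo.jpg", "image/jpeg"), // placeholder URL
]);
```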
Note: The second parameter of `UriContent` is the MIME type of the image (e.g., `"image/jpeg"`, `"image/png"`).
Step 4 — Run the Agent
The agent will analyze the image and return a text description.
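Running the agent with the message from Step 3 might look like this:

```csharp
// RunAsync sends the multimodal message; the response text is the description.
AgentRunResponse response = await agent.RunAsync(message);
Console.WriteLine(response.Text);
```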
5. Demo 1 — Basic Image Analysis
This demo shows the simplest use case: sending an image URL to the agent and receiving a text description. The agent analyzes the visual content and describes what it sees.
What it demonstrates:
- Creating a `ChatMessage` with `TextContent` + `UriContent`
- Using `agent.RunAsync(message)` with image input
- Receiving a text response describing the image
Use Cases: Image cataloging, accessibility descriptions, content moderation.
6. Demo 2 — Image Comparison
This demo shows how to send multiple images in a single message. The agent compares the images and identifies similarities and differences.
What it demonstrates:
- Including multiple `UriContent` items in one `ChatMessage`
- Asking the agent to compare visual content
Use Cases: Before/after comparison, quality control, visual diff analysis.
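A sketch of the comparison message, assuming `agent` was created as in the steps above (both URLs are placeholders):

```csharp
using Microsoft.Extensions.AI;

// Multiple UriContent items in a single user message.
var compareMessage = new ChatMessage(ChatRole.User,
[
    new TextContent("Compare these two images. What are the similarities and differences?"),
    new UriContent("https://example.com/before.jpg", "image/jpeg"), // placeholder
    new UriContent("https://example.com/after.jpg", "image/jpeg"),  // placeholder
]);

var response = await agent.RunAsync(compareMessage);
Console.WriteLine(response.Text);
```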
7. Demo 3 — Image + Structured Output
This demo combines multimodal input (from this lesson) with structured output (from Lesson 4). The agent analyzes an image and returns the results as a strongly-typed C# object instead of free-form text.
What it demonstrates:
- Combining `UriContent` with `ChatResponseFormat.ForJsonSchema()`
- Deserializing image analysis into a typed `ImageAnalysisResult` object
- Programmatic access to extracted image data (subject, mood, colors, objects)
Use Cases: Automated image tagging, visual search indexing, content management systems.
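A sketch of this combination, assuming `agent` and the multimodal `message` from the earlier steps; `ImageAnalysisResult` is an illustrative type, and the schema is generated with `AIJsonUtilities.CreateJsonSchema` from `Microsoft.Extensions.AI`:

```csharp
using System.Text.Json;
using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;

public record ImageAnalysisResult(string Subject, string Mood, string[] Colors, string[] Objects);

// Constrain the response to JSON matching the ImageAnalysisResult schema.
var options = new ChatClientAgentRunOptions(new ChatOptions
{
    ResponseFormat = ChatResponseFormat.ForJsonSchema(
        AIJsonUtilities.CreateJsonSchema(typeof(ImageAnalysisResult)),
        schemaName: "ImageAnalysisResult"),
});

var response = await agent.RunAsync(message, options: options);

// Deserialize the JSON text into the typed result (web defaults for casing).
var result = JsonSerializer.Deserialize<ImageAnalysisResult>(
    response.Text, JsonSerializerOptions.Web);
Console.WriteLine($"Subject: {result?.Subject}, Mood: {result?.Mood}");
```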
8. Demo 4 — Conversational Image Analysis
This demo shows a multi-turn conversation about an image. The agent receives an image in the first message, then answers follow-up questions about it in subsequent turns — without needing to re-send the image.
What it demonstrates:
- Sending an image in the first message
- Building a conversation history with `List<ChatMessage>`
- Asking follow-up questions that reference the original image
Use Cases: Interactive image exploration, educational tools, customer support with visual context.
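A sketch of the multi-turn flow, assuming `agent` from the earlier steps (the image URL is a placeholder):

```csharp
using Microsoft.Extensions.AI;

// First turn: the image goes into the history once.
var history = new List<ChatMessage>
{
    new(ChatRole.User,
    [
        new TextContent("Describe this image."),
        new UriContent("https://example.com/photo.jpg", "image/jpeg"), // placeholder
    ]),
};

var first = await agent.RunAsync(history);
history.AddRange(first.Messages); // keep the agent's reply in the history

// Follow-up turn: the image stays in context via the history, not a re-send.
history.Add(new(ChatRole.User, "What colors dominate the image?"));
var second = await agent.RunAsync(history);
Console.WriteLine(second.Text);
```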
9. Demo 5 — Streaming Image Analysis
This demo shows how to stream the agent's response while it analyzes an image. Streaming is useful for long analyses, as the user sees progressive results instead of waiting for the complete response.
What it demonstrates:
- Using `agent.RunStreamingAsync(message)` with image input
- Processing streaming updates with `await foreach`
- Displaying results progressively in the console
Use Cases: Real-time analysis dashboards, chat UIs, long-running visual inspections.
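A sketch of the streaming loop, assuming `agent` and the multimodal `message` from the earlier steps:

```csharp
// Stream updates as the model produces them instead of waiting for the full response.
await foreach (var update in agent.RunStreamingAsync(message))
{
    Console.Write(update.Text); // print each chunk as it arrives
}
Console.WriteLine();
```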
10. Best Practices
Do's
- Use a vision-capable model (e.g., `gpt-4o`)
- Always specify the correct MIME type in `UriContent`
- Write clear, specific text prompts alongside images
- Use publicly accessible image URLs
- Combine with structured output for machine-readable results
- Use streaming for detailed image analyses
Don'ts
- Don't send extremely large images (resize or compress first)
- Don't use models that don't support vision (e.g., text-only models)
- Don't send more than 5-10 images in a single message (performance)
- Don't expect pixel-perfect accuracy for text extraction from images
11. Troubleshooting
Problem: Agent cannot analyze the image.
Solution: Ensure you are using a vision-capable model like gpt-4o. Text-only models cannot process images.
Problem: Image URL returns an error.
Solution: Verify the URL is publicly accessible. Private or authenticated URLs will fail. Check the MIME type matches the actual image format.
Problem: Response is generic or inaccurate.
Solution: Write more specific prompts. Instead of "describe this image", try "list all objects visible in this image and estimate their distance from the camera".
Problem: Multi-turn conversation loses image context.
Solution: Include the full conversation history (including the original image message) in each subsequent call.
12. Summary
In this lesson, we learned how to use images with agents in the Microsoft Agent Framework:
- Creating a `ChatMessage` with `TextContent` + `UriContent`
- Analyzing single images and multiple images
- Combining multimodal input with structured output (JSON schema)
- Building multi-turn conversations with image context
- Streaming image analysis responses
Useful Resources
- Official Documentation — Using Images with Agents
- Structured Output (Lesson 4)
- Microsoft Agent Framework GitHub
Running the Application
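Assuming a standard console project layout, running the lesson typically amounts to setting the API key and invoking `dotnet run` (the key value is a placeholder):

```shell
# Set the API key (value is a placeholder) and run the project.
export OPEN_AI_KEY="sk-..."
dotnet run
```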
© 2026 Microsoft Agent Framework Lessons | Lesson 5: Multimodal