Chapter 3. Advanced Live Interactions: Video, Tools, and System Instructions
In the last chapter, we built a modern, real-time voice application from the ground up. We replaced the rigid, turn-based model of old with a fluid, streaming architecture built on WebSockets, creating an AI that could hold a natural, interruptible conversation. We successfully built an assistant that can listen and speak fluently.
But a truly useful assistant must do more than just talk. It must perceive the world and act within it. This chapter is about teaching our assistant to do just that. We will give it senses and appendages, evolving it from a simple chatbot into a capable, mobile-first companion.
Our first step will be to give our assistant a unique personality. You will learn to use system instructions and voice configurations to shape its character, controlling not just what it says but how it sounds.
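As a preview of where we are headed, a session's personality and voice are typically declared once, in the setup message sent over the WebSocket when the session opens. The sketch below assumes a Gemini-style Live API payload; the model name and exact field names are illustrative and may differ in your SDK version.

```python
import json

def build_setup_message(persona: str, voice_name: str) -> str:
    """Build a hypothetical session-setup payload.

    The system instruction shapes *what* the assistant says;
    the speech config shapes *how* it sounds.
    """
    setup = {
        "setup": {
            # Assumed model identifier -- substitute your own.
            "model": "models/gemini-2.0-flash-live-001",
            "system_instruction": {
                "parts": [{"text": persona}]
            },
            "generation_config": {
                "response_modalities": ["AUDIO"],
                "speech_config": {
                    "voice_config": {
                        "prebuilt_voice_config": {"voice_name": voice_name}
                    }
                },
            },
        }
    }
    return json.dumps(setup)

message = build_setup_message(
    "You are a cheerful, concise travel guide.", "Puck"
)
```

Because both settings live in the same setup message, changing the assistant's character later means starting a new session rather than patching the current one, a constraint we will work with throughout this chapter.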
Next, we will give it eyes. You will learn to integrate live video from a webcam or a mobile phone’s camera, ...