AI Agent Memory Management Bootcamp
Published by O'Reilly Media, Inc.
Empowering AI agents with robust memory
Course outcomes
- Differentiate between short-term, long-term, episodic, semantic, and procedural memory and explain how each maps to agent behaviour
- Implement core memory operations (initialization, segmentation, retrieval, update, deletion, and creation) within AI agents
- Apply lexical, vector, and hybrid RAG techniques to integrate memory into LLM-based applications
- Configure Oracle AI Database as a memory provider with document storage, indexing, and vector search
- Distinguish between memory-augmented and memory-aware agents, and between agent-triggered and deterministic memory operations
- Build end-to-end memory-aware agents using Oracle AI Database, LangGraph, and LangMem
- Evaluate RAG pipelines using IR metrics, RAGAS, and Galileo, isolating retrieval failures from generation failures
- Assess agent performance across task completion, tool use, trajectory quality, cost, and safety using AgentBench, AgentEval, and LangSmith
- Benchmark memory systems using LoCoMo, LongMemEval, and MemBench, and apply the RBC framework to assess system-level memory quality
- Run structured competitive benchmarks across memory architectures and translate results into architecture decisions
Join expert Richmond Alake for a two-day deep dive into the theory, research, and practical implementation of memory management and evaluation in AI agent systems. Day 1 establishes the conceptual and engineering foundations. It starts with what memory means for an AI agent and how it parallels human cognition, then moves through the core components of a robust memory management system: initialization, segmentation, retrieval, updating, deletion, and creation. You will examine advanced retrieval-augmented generation techniques, including lexical, vector-based, and hybrid approaches as well as agentic RAG, and gain hands-on experience implementing context engineering and memory engineering techniques from the ground up. The day closes with a practical session on implementing memory-aware agent workflows using Oracle AI Database, LangGraph, and LangMem, with a clear focus on two distinctions: memory-augmented versus memory-aware agents, and agent-triggered versus deterministic memory operations.
Day 2 shifts focus to evaluation. You will build the skills to measure and diagnose memory system performance at every layer, from classical IR metrics and RAG pipeline evaluation through to agent-level assessment and memory-specific benchmarking. Coverage includes production-grade evaluation frameworks such as RAGAS, Galileo, AgentBench, and LangSmith, as well as dedicated memory benchmarks including LoCoMo, LongMemEval, and MemBench. The day concludes with a structured benchmarking and competitive analysis module, where you will run comparative evaluations across memory system architectures and translate results into concrete architecture decisions.
What you’ll learn and how you can apply it
- Explain what constitutes memory in AI agents, including memory types and analogies with human cognition
- Implement memory operations (initialization, segmentation, retrieval, update, deletion, and creation) within agent workflows
- Apply context engineering and memory engineering techniques and understand where each discipline begins and ends
- Distinguish memory-augmented from memory-aware agents, and agent-triggered from deterministic memory operations
- Use generative agents, A-Mem, MemGPT, and agent workflow memory as reference architectures for memory system design
- Implement lexical, vector, and hybrid RAG approaches with hands-on examples
- Configure Oracle AI Database for persistent memory storage with document modeling, indexing, and native vector search
- Embed memory into agent workflows for real-time decision-making and long-horizon planning
- Apply IR and ranking metrics to evaluate and diagnose retrieval quality in production memory systems
- Evaluate RAG pipelines with RAGAS and Galileo and isolate retrieval from generation failures
- Assess agent behaviour across task completion, tool use, trajectory, latency, cost, and safety
- Apply LoCoMo, LongMemEval, and MemBench to assess memory quality across the full lifecycle
- Run competitive benchmarks across memory architectures and produce analysis reports that inform design decisions
- Build context-aware, adaptive agents with persistent and dynamic memory
- Design systems for long-term learning, personalization, and reduced computational overhead
This live event is for you because...
- You’re an AI or ML engineer who works on LLM-based applications and needs to implement persistent memory for complex tasks.
- You’re a data scientist who builds intelligent systems and wants to understand how to enhance models with robust memory management.
- You’re a software developer in the AI space who needs to implement scalable and efficient memory solutions for agent-based applications.
- You’re an experienced AI practitioner looking to advance your knowledge of state-of-the-art memory architectures for agentic systems.
Prerequisites
- Familiarity with large language models (e.g., GPT-3.5, GPT-4) and their limitations (context windows, tokenization)
- Proficiency in Python, along with experience using common AI/ML libraries
- Basic experience with databases (preferably Oracle AI Database) and data storage concepts
- A general understanding of concepts like embeddings, vector search, and retrieval-augmented generation (RAG) will be helpful
- (Optional) Exposure to cognitive architectures (e.g., Soar, ACT-R) or previous experience with AI agents can provide additional context but is not required
Recommended preparation:
- Install Python (version 3.8 or higher)
- Ensure you have access to an Oracle AI Database instance (either local or cloud-hosted) and familiarize yourself with its basic operations
Recommended follow-up:
- Read LLM Engineer’s Handbook (book)
- Read Building Applications with AI Agents (book)
- Read AI Engineering (book)
- Read RAG-Driven Generative AI (book)
- Read Hands-On Large Language Models (book)
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Day 1
AI agent memory (55 minutes)
- Presentation: Overview of AI memory concepts: short-term (context window) versus long-term (conversational memory), episodic versus semantic, and their differences; analogies to human cognition and memory; challenges of memory in AI agents and agentic systems; real-world examples and intuition behind AI memory
- Group discussion: Which aspect of memory do you find most challenging to conceptualize?
- Q&A
- Break
Agent memory management (65 minutes)
- Presentation: Overview of retrieval-augmented generation (RAG) methods: lexical, vector, hybrid, agentic RAG, and rerankers; core components of memory management: initialization, segmentation, structure, creation, retrieval, update, deletion; best practices and design patterns in agentic systems; overview of memory design architectures from generative agents, A-Mem, MemGPT, and agent workflow memory
- Group discussion: Which memory component do you believe is most critical for long-term agent coherence?
- Q&A
- Break
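As a rough illustration of the hybrid retrieval idea named in this module, the sketch below merges a lexical ranking and a vector ranking with reciprocal rank fusion (RRF). RRF is one common fusion option and is an assumption here; the course materials may use a different method. The function name, the constant k=60, and the document IDs are all hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list.

    Each ID scores 1 / (k + rank) for every list it appears in
    (rank starts at 1); scores are summed across lists.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword search and a vector search.
lexical_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([lexical_hits, vector_hits])
print(fused)
```

Items that rank well in both lists rise to the top, which is the main appeal of a hybrid approach over either retriever alone.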
AI retrieval (60 minutes)
- Presentation: Overview of AI retrieval: definition, concepts, and techniques; introduction to the concept of "memory providers" and their role in AI application architecture; examples of memory providers
- Hands-on exercises: Setting up the environment; indexing and vector search implementation; building retrieval functions; end-to-end agentic RAG system; performance monitoring and logging implementation
- Q&A
- Break
Agent memory technique implementation (50 minutes)
- Hands-on exercises: Implementing memory management components using Oracle AI Database, LangGraph, and LangMem
- Hands-on exercises: Implementing context engineering and memory engineering techniques from the ground up: encoding strategies, context window management, and memory injection patterns
- Hands-on exercises: Understanding the difference between memory-augmented and memory-aware agents: architectural distinctions, capability boundaries, and when each pattern applies
- Hands-on exercises: Overview and implementation of agent-triggered and deterministic memory operations: when memory actions are initiated by the agent versus enforced by the system, and the trade-offs between the two
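To make the agent-triggered versus deterministic distinction concrete, here is a minimal sketch using a plain in-memory store. It does not use Oracle AI Database, LangGraph, or LangMem; the MemoryStore class, the run_turn function, the "save_memory" tool name, and the agent_decision dictionary (a stand-in for an LLM tool call) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Toy store covering create, retrieve, update, and delete."""
    items: dict = field(default_factory=dict)
    next_id: int = 1

    def create(self, text):
        mem_id = self.next_id
        self.items[mem_id] = text
        self.next_id += 1
        return mem_id

    def retrieve(self, query):
        # Naive case-insensitive substring match, for illustration only.
        q = query.lower()
        return [t for t in self.items.values() if q in t.lower()]

    def update(self, mem_id, text):
        if mem_id not in self.items:
            raise KeyError(mem_id)
        self.items[mem_id] = text

    def delete(self, mem_id):
        self.items.pop(mem_id, None)


def run_turn(store, user_message, agent_decision):
    # Deterministic operation: the system records every user message,
    # whatever the agent decides.
    store.create(f"user said: {user_message}")
    # Agent-triggered operation: a write happens only when the agent
    # chooses to call the (hypothetical) save_memory tool.
    if agent_decision.get("tool") == "save_memory":
        store.create(agent_decision["text"])


store = MemoryStore()
run_turn(store, "I prefer tea", {"tool": "save_memory", "text": "preference: tea"})
run_turn(store, "hello", {})
print(store.retrieve("tea"))
```

The deterministic path gives predictable coverage at the cost of storing noise, while the agent-triggered path stores less but depends on the agent judging correctly what is worth remembering.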
Wrap-up and Q&A (10 minutes)
Day 2
Traditional evaluation metrics and RAG pipeline metrics (55 minutes)
- Presentation: Anatomy of a memory-aware agent and the case for evaluation: memory store, ingestion pipeline, retrieval layer, context assembly layer, and agent reasoning loop, each mapped to its failure modes and evaluation targets; classical IR foundations: Precision, Recall, and F1; ranking quality metrics: MRR, nDCG, and MAP; RAG-specific evaluation dimensions: Faithfulness, Answer Relevance, Context Precision, and Context Recall; RAG evaluation frameworks: RAGAS and Galileo; chunking and indexing quality signals
- Notebook: Computing IR metrics from scratch in Python; RAG evaluation with RAGAS; production tracing and evaluation with Galileo
- Q&A
- Break
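The classical IR metrics listed in this module can be computed with a few lines of plain Python. The sketch below assumes binary relevance and hypothetical memory IDs; it is an illustration of the standard definitions, not the course notebook.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result; averaged over queries this is MRR."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """nDCG with binary gains: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical retrieval result and ground truth for one query.
ranked = ["m3", "m1", "m7", "m2"]
relevant = {"m1", "m2"}
p = precision_at_k(ranked, relevant, 2)
r = recall_at_k(ranked, relevant, 2)
print(p, r, f1_score(p, r), reciprocal_rank(ranked, relevant))
print(round(ndcg_at_k(ranked, relevant, 4), 4))
```

Computing these on the retrieval step alone, before any generation, is one way to separate retrieval failures from generation failures.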
Agent evaluation metrics (55 minutes)
- Presentation: Why agent evaluation differs from RAG evaluation; task completion and goal achievement: binary and partial-credit scoring; tool use accuracy: precision and recall applied to tool invocation sequences; trajectory and decision quality: trajectory faithfulness metrics; latency, efficiency, and cost metrics; agent evaluation frameworks: AgentBench, AgentEval, and LangSmith; safety and alignment metrics: adversarial test sets and red-teaming
- Notebook: Scoring task completion; tool use evaluation; trajectory evaluation with LangSmith; efficiency and cost profiling
- Q&A
- Break
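One simple way to apply precision and recall to tool invocations is to compare the tools an agent actually called against a reference list. The sketch below is an assumption about how this could be scored: it treats calls as multisets of tool names and ignores both order and arguments, which a fuller trajectory metric would need to consider. The function and tool names are hypothetical.

```python
from collections import Counter


def tool_use_scores(expected_calls, actual_calls):
    """Return (precision, recall) over tool names, counting duplicates."""
    expected = Counter(expected_calls)
    actual = Counter(actual_calls)
    # Multiset intersection: each expected call can be matched at most once.
    matched = sum((expected & actual).values())
    precision = matched / len(actual_calls) if actual_calls else 0.0
    recall = matched / len(expected_calls) if expected_calls else 0.0
    return precision, recall


# Hypothetical reference trajectory versus what the agent did.
expected = ["search_memory", "write_memory"]
actual = ["search_memory", "search_memory", "web_search"]
print(tool_use_scores(expected, actual))
```

Low precision points to unnecessary or wrong tool calls, while low recall points to required calls the agent skipped.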
Memory evaluation as a distinct discipline (MemEval) (55 minutes)
- Presentation: Memory-specific failure modes not captured by RAG or agent metrics; unit-level memory evaluation: precision and recall at the memory-block level; memory retrieval evaluation: MRR, nDCG, and Hit Rate applied to memory queries, including suppression quality; continuity and consistency evaluation: cross-session correctness, persona stability, and contradiction detection; memory lifecycle evaluation: ingestion, consolidation, forgetting, and archiving; LoCoMo, LongMemEval, and MemBench; the RBC lens for memory evaluation: Reliable, Believable, Capable
- Notebook: Evaluating memory units in practice; memory retrieval scoring; running LoCoMo, LongMemEval, and MemBench; building a continuous memory evaluation pipeline
- Q&A
- Break
Memory benchmarking and competitive analysis (65 minutes)
- Presentation: What memory benchmarking means in practice: defining comparison criteria, selecting representative workloads, and establishing baseline performance; overview of memory system architectures under comparison: filesystem-based, vector database, relational, and hybrid approaches; evaluation dimensions for competitive analysis: retrieval latency, recall quality, consistency under load, memory lifecycle support, and cost per operation; interpreting benchmark results: avoiding benchmark gaming, understanding workload sensitivity, and translating results into architecture decisions
- Notebook: Running a structured benchmark across two or more memory system configurations; scoring each system against retrieval quality, latency, and lifecycle metrics; producing a comparative analysis report with findings and recommendations
- Q&A
Wrap-up and Q&A (10 minutes)
Your Instructor
Richmond Alake
Richmond Alake is a highly experienced Machine Learning Architect and Engineer with over five years of expertise in the field. He specializes in Computer Vision and Deep Learning and has a proven track record of successfully developing and integrating deep learning models to solve a wide range of problems, such as motion detection, object detection, and pose estimation. Throughout his career, he has worked with a diverse range of clients, including large conglomerates, financial institutions, and small startups. In addition to his professional work, Richmond also serves as an AI advisor to a number of startups in the UK and the US.
With a background in building websites and mobile applications, Richmond is a firm believer in using technology to solve everyday problems. He has extensive knowledge of Machine Learning and has written over 200 articles on the subject, gaining over a million views. He was recognized as one of Medium's top AI writers in 2020/2021 and has collaborated with companies such as O'Reilly, BuiltIn and Nvidia to develop effective educational and informative learning materials on AI.
Currently, Richmond Alake is a Machine Learning Architect at Slalom Build UK. As the first hire of the machine learning practice in the UK division, he is responsible for helping organizations move from machine learning research to productionisation and assisting maturing organizations in promoting AI models into existing infrastructure to drive commercial and business value. His main role as an ML Architect is to assist organizations in developing and maintaining machine learning pipelines by implementing MLOps principles, techniques, and tooling. He is well-versed in Feature Stores and has conducted internal training for Data Engineers, Data Scientists, and ML Engineers.