
On May 8, O’Reilly Media will be hosting Coding with AI: The End of Software Development as We Know It—a live virtual tech conference spotlighting how AI is already supercharging developers, boosting productivity, and providing real value to their organizations. If you’re in the trenches building tomorrow’s development practices today and interested in speaking at the event, we’d love to hear from you by March 12. You can find more information and our call for presentations here. Just want to attend? Register for free here.
Join Shelby Heinecke, senior research manager at Salesforce, and Ben Lorica as they talk about agents, AI models that can take action on behalf of their users. Are they the future—or at least the hot topic for the coming year? Where are we with smaller models? And what do we need to improve the agent stack?
Check out other episodes of this podcast or the full-length version of this episode on the O’Reilly learning platform.
About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.
Timestamps
- 0:29: Introduction—Our guest is Shelby Heinecke, senior research manager at Salesforce.
- 0:44: The hot topic of the year is agents. Agents are increasingly capable of GUI-based interactions. Is this my imagination?
- 1:21: The research community has made tremendous progress to make this happen. We’ve made progress on function calling: we’ve trained LLMs to call the correct functions to perform tasks like sending emails. My team has built large action models that, given a task, write a plan and the API calls to execute it. That’s one piece. A second piece, for when you don’t know the functions a priori, is giving the agent the ability to reason about images and video.
- 3:10: We released multimodal action models. They take an image and text and produce API calls. That makes navigating GUIs a reality.
- 3:34: A lot of knowledge work relies on GUI interactions. Is this just robotic process automation rebranded?
- 4:06: We’ve been automating forever. What’s special is that automation is driven by LLMs, and that combination is particularly powerful.
- 4:33: The earlier generation of RPA was very tightly scripted. Multimodal models that can see the screen can really understand what’s happening. Now we’re beginning to see reasoning-enhanced models. Inference scaling will be important.
- 5:52: Multimodality and reasoning-enhanced models will make agents even more powerful.
- 6:01: I’m very interested in how much reasoning we can pack into a smaller model. Just this week, DeepSeek released smaller distilled versions of its model.
- 7:08: Every month, the capabilities of smaller models are pushed further. Smaller models may not yet compare to large models, but this year we can push the boundaries.
- 7:39: What’s missing from the agent stack? You have the model and some notion of memory. You have tools the agent can call, and there are agent frameworks. You need monitoring and observability. Everything depends on the model’s capabilities. There’s a lot of fragmentation, and the vocabulary is still unclear. Where do agents usually fall short?
- 9:00: There’s a lot of room for improvement with function calling and multistep function calling. Earlier in the year, function calling was single-step; now it’s multistep. That expands our horizons.
- 9:59: We need to think about deploying agents that solve complex tasks that take multiple steps. We will need to think more about efficiency and latency. With increased reasoning abilities, latency increases.
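The function-calling pattern discussed at 1:21 and 9:00 can be illustrated with a minimal sketch: the LLM is given a registry of tool schemas, emits a structured call naming one of them, and the application dispatches it. Everything here (the `send_email` tool, the registry layout, the JSON call format) is a hypothetical illustration, not any specific vendor's API.

```python
import json

# Hypothetical tool registry: each entry pairs a schema (shown to the LLM)
# with an implementation (run by the application, never by the model).
TOOLS = {
    "send_email": {
        "description": "Send an email to a recipient.",
        "parameters": {"to": "string", "subject": "string", "body": "string"},
        "fn": lambda to, subject, body: f"Email to {to}: {subject}",
    },
}

def execute_tool_call(model_output: str) -> str:
    """Parse the model's JSON tool call and dispatch it to the registry."""
    call = json.loads(model_output)  # e.g. {"name": ..., "arguments": {...}}
    tool = TOOLS[call["name"]]
    return tool["fn"](**call["arguments"])

# Simulated model output selecting the send_email function:
output = '{"name": "send_email", "arguments": {"to": "ada@example.com", "subject": "Hi", "body": "Hello"}}'
print(execute_tool_call(output))  # -> Email to ada@example.com: Hi
```

A multistep agent, as described at 9:00, would run this dispatch in a loop, feeding each tool result back to the model so it can plan the next call.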

