Everyday AI Podcast – An AI and ChatGPT Podcast - EP 545: How to build reliable AI agents for mission-critical tasks

Starting point is 00:00:00 This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips. Listen daily for practical advice to boost your career, business, and everyday life. Meet Firefly AI Assistant, now live in Adobe Firefly, the All In One Creative AI Studio. Just describe what you want to create and the assistant handles the rest, orchestrating multi-step workflows across Photoshop, Premiere Express, and more in one conversational interface. You direct the outcome. The assistant accelerates execution. It seems like the entire business world is sprinting to implement agentic AI.

Starting point is 00:00:52 But what's crazy is there's no real playbook per se, right? Because not only is the concept and even kind of the definition of agents changing, but so too are the models that generally power them. So I think it's an important conversation for us to have today to talk about, well, how do you just build more reliable AI agents for mission critical tasks? And luckily, like I said, even though there's not an official playbook, because this technology and the developments in AI and large language models are so fast moving, we at least have today a great guest, the co-founder of Galileo, and they're kind of writing the unofficial

Starting point is 00:01:38 playbook at least for how so many enterprises are using agents and agentic AI. All right, I'm excited for today's conversation. I hope you are too. What's going on, y'all? If you're new here, my name's Jordan Wilson. Welcome to Everyday AI. This is your daily live stream podcast and free daily newsletter, helping us all not just learn AI, but how we can leverage it to grow our companies and our careers. It starts here with this unedited, unscripted live stream and podcast. But if you really want to leverage what we're going to be going over today in this conversation, make sure, if you haven't already, to go to youreverydayai.com and sign up for the free daily newsletter. In that newsletter, we're going to be recapping the best insights from today's conversation,

Starting point is 00:02:18 as well as keeping you up to date with everything else that's happening in the world of AI because it doesn't sleep, kind of like me. All right. So if you haven't done that, make sure you do that. And if you're looking for the AI news, sometimes we start the show with that. Today, not so much. Technically pre-recorded, we're debuting it live. So make sure if you want the AI news, just go check the newsletter. All right, enough from me.

Starting point is 00:02:40 This is a conversation I think you need to hear. All right, we're going to talk about a lot of things, but just where we're at with AI agents, where we're headed, how to increase trust and reliability. But luckily, you don't have to hear me ramble on about it. We have a great guest. I'm excited. So live stream audience, please help me welcome to the stage.

Starting point is 00:02:58 There we got them. We have Yash Sheth, the co-founder and COO of Galileo. Yash, thank you so much for joining the Everyday AI show. Thanks, Jordan. Thanks for having me here. All right. Can you explain to everyone, like, what is Galileo? What is it that you all do? Yeah. So at Galileo, we are building the reliability platform for agentic development. And so basically, think of Galileo as the AI reliability platform for helping developers build, ship, and scale their agentic applications reliably.

Starting point is 00:03:30 You know, we started the company about four years ago. That's, you know, a while before ChatGPT even came out. And that's because, you know, me and my co-founders actually realized, when we were working on LLMs back at Google, that it's going to be extremely hard for enterprises to adopt these models for mission critical tasks. And, you know, I love the topic that we have today because we're finally here. We're here where, you know, AI is being adopted in the mainstream. You know, we really want to see what is the extent of the ROI that we can extract from this exciting new technology.

Starting point is 00:04:15 And large language models, you know, are amazing at doing so many things. But when it comes to enterprise applications, these models, as we all know, may not have ever seen the data that we are working with or the systems that we are trying to automate in our enterprise. And then, you know, as we talk more, I would love to, you know, shed some light on what we're seeing there, Jordan. But, you know, that's the gist of what Galileo does. Yeah. And yeah, I do want to dive into a lot of aspects here.

Starting point is 00:04:50 But I kind of maybe want to start at the end. And then we can work our way backwards. But, you know, ultimately, and I know it's not an easy answer, right, because it's probably the billion-dollar answer of 2025 when it comes to AI implementation. But how do companies build more reliable AI agents, right? It seems like no one quite knows. There's no official playbook per se. How do companies do this? What are the correct steps they should be taking? Yeah, and that's a great question because we're thinking about this all the time, Jordan. And I think this is really top of mind for everyone building agents today.

Starting point is 00:05:35 I mean, I think I'll start off with at least why we're doing that, just so that, you know, at least everyone understands what all the hype about agents is, right? You know, we're all seeing how ChatGPT, Gemini, Claude as well, and others can be super helpful in our day-to-day tasks. Like, they can be great assistants, right? But you still have a human in the loop when it comes to really large-scale tasks that are being done.

Starting point is 00:06:05 And in order to see real ROI from LLMs, from AI, I think the big shift or the big excitement is all about, can we automate a lot of things that we do today without even involving a human in the loop? And can these systems, you know, just leverage the true intelligence that they have to understand the right intent behind what a human is doing, talk to systems behind the scenes, and actually get end-to-end workflows done? And so that's why the world is so excited about building agents. And if I were to put it bluntly, you know, today, the world of software, like, you know,

Starting point is 00:06:50 we talk about, like, having microservices and small software services that do specific things, even websites, right? So we're moving from a world of microservice-based architectures and software to microagents and microagentic software, where each component of the software we build today will just become more intelligent and smarter and more independent at doing tasks than us, as developers, having to hard-code heuristics within it. So that's why I wanted to start with, like, you know, why all the hype about building agents is actually there. And, you know, why is it so important for us?

Starting point is 00:07:42 Yeah. And I think even that, you know, like zooming out even more is maybe even helpful, right? Because I think as the actual large language models themselves, especially via like an AI chatbot interface, as they change, right? So as an example, you know, you mentioned Claude, you know, with ChatGPT, Google Gemini, obviously. Now the base models have, you know, agentic capabilities, right? You know, these reasoning models, they'll start to plan. You know, they'll say, oh, wait, I need to go look at more documents.

Starting point is 00:08:16 Oh, I can tap into all of your company's dynamic data. So it's almost like the lines are blurring between traditional AI chatbots and, you know, more quote unquote traditional agents. So like where do you even draw the line, right? Like what is technically like, oh, I'm just chatting with a, you know, an AI chatbot that has agentic capabilities connected to my data versus like, oh, I'm working with a full-blown agent? Yeah, I think, I mean, just going to some of the more academic definitions

Starting point is 00:08:48 of agents. Like, typically a chatbot ends with an answer as the final action. An agent typically has three stages, and I would think at least the first two go like this: firstly, there's a planning phase where the agent is trying to be curious about what you want, and, like, getting you as a user

Starting point is 00:09:13 to the right spot before it actually takes the action for you itself. And then even after taking the action, there's kind of a reflection step, typically, on like, did I do it well, or is there anything missing? For example, like, you know, as a human, if I am ordering something for you, like, let's say,

Starting point is 00:09:34 you know, we have a waiter in the restaurant, someone who is actually ordering food for you. Even after they complete the task, they do ask, like, is that all? Would you want anything else? Like, you know, there's actually a feedback loop happening all the time.
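
The three stages described here (plan, act, reflect) can be sketched in a few lines of Python. This is only an illustrative sketch, not how Galileo or any particular framework implements agents: the `plan`, `act`, and `reflect` functions and the `AgentRun` record are hypothetical stand-ins, and a real agent would call an LLM and real tools where this code uses simple string handling.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """Record of one pass through the plan -> act -> reflect loop."""
    request: str
    plan: list[str] = field(default_factory=list)
    results: list[str] = field(default_factory=list)
    follow_up: str = ""


def plan(request: str) -> list[str]:
    # Hypothetical planner: a real agent would ask an LLM to break the
    # request into steps; here we just split on " and " for illustration.
    return [step.strip() for step in request.split(" and ") if step.strip()]


def act(step: str) -> str:
    # Hypothetical action: a real agent would call a tool or an API here.
    return f"done: {step}"


def reflect(run: AgentRun) -> str:
    # Reflection: check that every planned step produced a result, then
    # ask whether anything is missing, like the waiter in the example.
    if len(run.results) < len(run.plan):
        return "Some steps did not complete. Should I retry them?"
    return "Is that all, or would you like anything else?"


def run_agent(request: str) -> AgentRun:
    run = AgentRun(request=request)
    run.plan = plan(request)                         # stage 1: planning
    run.results = [act(step) for step in run.plan]   # stage 2: action
    run.follow_up = reflect(run)                     # stage 3: reflection
    return run
```

Calling `run_agent("order a pizza and order a salad")` plans two steps, executes both, and ends with the follow-up question rather than with a final answer, which is the difference from a chatbot being drawn in the conversation.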

Starting point is 00:09:51 And that's the big difference from a chatbot. And, you know, me as a consumer, I can relate to that very easily when I'm talking to ChatGPT these days as a user. And now it's starting to do some actions. Like, it can take over, you know, my screen. It can do stuff for me. So that's agentic in nature.

Starting point is 00:10:17 But when you build these capabilities into your apps, that becomes an agentic experience. So where would you say we are at today? Because obviously, as someone that talks about AI every day, and I'm lucky enough to get to talk to smart people like you, sometimes I feel like I'm in a little agentic bubble, right? Like, are enterprises at scale actually using agents today? If so, how? If not, why?

Starting point is 00:10:47 Yeah, that's a great question. So I think Galileo works with hundreds of enterprise teams and organizations, so we are able to see agentic development across the board, whether it's financial services, healthcare, retail, telco, and even, you know, cutting-edge startups, right? And there's a whole spectrum of adoption. You know, there are so many teams and, I mean,

Starting point is 00:11:15 obviously, startups are all going to be all about agentic workflows. But we're even seeing a lot of investment from enterprises as well. Now, we already have several customers who have agents that are live in production. And we have customers who are even thinking, like, this year is just going to be about us productionizing RAG applications and chatbots, and agents are going to be next year, because these are heavily regulated industries,

Starting point is 00:11:50 and, you know, basically chatbots and RAG applications are a stepping stone. You know, it's like the crawl, walk, run analogy, right? Like, they're still in the crawl phase, and next year with agents is going to be walk, and then multi-agents is going to be the true run, right? But it's so exciting to see that the adoption curve is extremely fast. It's something that we've never seen before, honestly. Even within regulated industries and organizations, a lot of teams are already building agentic applications, and they will be productionized this year.

Starting point is 00:12:33 So it's incredible to see the arc of innovation that even regulated organizations, large organizations, are able to do with this technology now. So that's where we see, we've seen agents that can actually preempt

Starting point is 00:12:50 internet outages in production. We've seen agents that can really manage the data platform for an organization. We've seen supply chain agents that can look at multiple warehouses and automatically place orders. And these three are just, you know, the starting point, right? And all three use cases, as I mentioned, are far away from being a chatbot-style use case. And, you know, one thing that I wanted to dive into a little bit deeper there that you

Starting point is 00:13:27 said is, you know, talking about some of the companies you work with that are in highly regulated industries, right? So whether that's, you know, finance, health care, et cetera. But when it comes to agents working on mission critical tasks, right, aside from, you know, certain regulation that you might not be able to out-engineer or, you know, over-engineer your way around, right? But aside from that, what are those most important steps for enterprises that do have that in their control to be able to start unleashing agents on those actually mission critical tasks versus just the low-hanging fruit, you know, content

Starting point is 00:14:06 creation, research, right? Areas where I think agentic AI has already done very well in 2025. So what are those most crucial next steps to actually get to those mission critical tasks for enterprise? Yeah. Like, the last three examples I gave you were actually mission critical tasks that run businesses, right? And if anything were to go wrong there, you know, you have real world systems that are going to fail. And we've seen enterprises actually, you know, be successful in deploying these mission critical agents. And this brings us back to your very first question, Jordan, on, you know, again, how do we make agentic AI reliable, right? And that's

Starting point is 00:14:49 because that's the number one thing. And I'll talk about, like, there are other things as well, and we can get to what will truly make multi-agentic systems successful. But the number one thing for agents right now is trust and reliability. How can we make sure of that, when mission critical tasks means that these agents have access to and control over real world systems? They can make API calls

Starting point is 00:15:19 to tools, backend databases, ticketing systems, to update them, to change them. And if these agents were to behave incorrectly, you know, and to preface this as well, the reason why we keep talking about if these agents were to, you know, go wrong is because we are entering a world of non-deterministic software. And enterprises as a whole need to adapt to a world of non-deterministic software. It's a new world. None of us have built these agents at scale before.

Starting point is 00:15:57 And so reliability, setting up a reliable pipeline for building, shipping, and scaling these agents, is absolutely critical, and I'm happy to share more there. Yeah, and I think it would be a good time to talk about the agent leaderboard. But real quick, before we do, first, a quick word from our sponsors at Google. This podcast is supported by Google. Hey, everyone. David here, one of the product leads for Google Gemini.

Starting point is 00:16:29 Check out Veo 3, our state-of-the-art AI video generation model in the Gemini app, which lets you create high-quality, eight-second videos with native audio generation. Try it with a Google AI Pro plan or get the highest access with the Ultra plan. Sign up at Gemini.com to get started and show us what you create. All right, let's dig in a little bit on this trust and reliability aspect, right? Because, you know, even if we're talking AI chatbots, right, obviously trust and reliability is still huge. But when we start talking about agents, it becomes even more paramount.

Starting point is 00:17:11 So for our live stream audience, I'm sharing Galileo's agent leaderboard that's on a Hugging Face space here. But Yash, maybe if you could walk us through, what is this agent leaderboard? And what does it help people understand, at least when it comes to trust and reliability? Yeah. So before we walk through this, Jordan, I want to at least preface this by creating the understanding that reliability in agents comes through a foundation of really test-driven development for these agents and having high-quality evals for an agent.

Starting point is 00:17:53 And building agent evals is not just like evaluating an LLM. You got it. It's more about, you know, creating unit tests for an agent and integration tests for an agent. Just like, you know, software engineering best practices. When we build these agents, the eval tests that we're building are extremely catered towards the use case that you're building for. It includes the metrics

Starting point is 00:18:20 that understand what the agent is doing and what is the expectation from the agent. And it includes a data set that is actually applicable to that use case. Now, once you start with that strong foundation, you can apply these eval tests in real time as well. Because in production, your agents are going to be non-deterministic. So these strong evals power really valuable observability,

Starting point is 00:18:49 and then, even using these strong eval tests, you can create strong guardrails that can prevent those outcomes altogether. Imagine if your agent started hallucinating, it started making the wrong tool calls, API calls, and if you can prevent it in under 300 milliseconds, that is, super real time,

Starting point is 00:19:11 then you have the end-to-end reliability pipeline. And what we're doing here with the agent leaderboard is actually helping teams understand just how do we start evaluating LLMs and agents. And so with the agent leaderboard, you know, our teams at Galileo, like, we've selected many, many models

Starting point is 00:19:33 and actually applied them to several use cases. There are data sets, there are agent prototypes, and we run them across different LLMs. And then we evaluate them for the task that they're doing. So we have the entire GitHub repo linked there. Anyone can go in and start seeing that and actually start doing their own evals with the GitHub repo, and coming up with their own sort of ranking of, like, how can I architect the best, most accurate agent for my use case? This leaderboard is extremely popular.
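
The idea of treating agent evals like unit tests, with a use-case-specific data set and a metric such as whether the agent picked the right tool, can be sketched as follows. This is a minimal illustration under stated assumptions, not the leaderboard's actual code or metric definition: the tool names, the `EvalCase` rows, and the `keyword_agent` stand-in (which replaces an LLM-backed agent) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One row of a use-case-specific eval data set."""
    user_message: str
    expected_tool: str


def tool_selection_accuracy(
    choose_tool: Callable[[str], str], cases: list[EvalCase]
) -> float:
    """Fraction of cases where the agent picked the expected tool."""
    if not cases:
        raise ValueError("eval data set is empty")
    correct = sum(
        1 for case in cases if choose_tool(case.user_message) == case.expected_tool
    )
    return correct / len(cases)


def keyword_agent(message: str) -> str:
    # Hypothetical stand-in for an LLM-backed agent: routes on keywords only.
    text = message.lower()
    if "refund" in text:
        return "issue_refund"
    if "order" in text:
        return "lookup_order"
    return "search_docs"


# Hypothetical eval data set; a real one would come from the target use case.
CASES = [
    EvalCase("Where is my order?", "lookup_order"),
    EvalCase("I want a refund for my order", "issue_refund"),
    EvalCase("Who won the last World Cup in football?", "search_docs"),
    EvalCase("Cancel my subscription", "cancel_subscription"),
]

score = tool_selection_accuracy(keyword_agent, CASES)
```

Here the stand-in agent gets three of the four cases right, so `score` is 0.75. Running the same cases against different models or agent designs gives a ranking that reflects this one use case rather than a generic benchmark.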

Starting point is 00:20:12 You know, we've received millions of hits on this. And this just shows that developers and teams really want to go in and see how these LLMs apply to real world agents, and not have some academic benchmark, you know, rank these models for them that doesn't represent real world use cases. So, you know, as an example, maybe help us simplify this for our non-technical audience, but there's a data set here, right, that kind of tells users a little bit more about what's actually going on under the hood. Let me see if I can get it in the correct window. There we go for our live stream audience. But as an example, if I scroll down here in the data set, you know, one of these conversation pieces is, you know, who won the last World Cup in football, right? And then there's a way to test if these different models in an agentic framework are able to retrieve the information,

Starting point is 00:21:16 correctly. So number one, is that how it works? But number two, maybe can you just tell us how it works with this data set and then how you're able to tell, hey, which models are maybe most reliable or trustworthy in these agentic settings? Yeah. And I'll answer your last question first, because this is an ever-moving target, right? Like, new models, new capabilities are coming out so fast. And, you know, our teams are even having a hard time keeping up with that, honestly. But which model is the best, that answer can only be gotten by going to the leaderboard and seeing what is the live, you know, comparison. Like, we keep updating these fairly regularly.

Starting point is 00:21:56 In terms of the data set, right, you'll see how the most common pattern for an agent is to make a tool call, right? And that's where, you know, we've seen how popular MCP has become, like the open protocol from Anthropic, because it really reduced the overhead of an LLM calling a tool. It just simplifies that at the most basic level, right? And that makes agent building a lot more exciting and easy. So in this data set, you know, we're seeing agents make a bunch of tool calls, and the most basic requirement we want is for an LLM to not make the wrong tool call, to understand the action of tool calling really well. So that's

Starting point is 00:22:46 what we're seeing in this data set. And then even the evaluation metrics are based on tool selection quality for this particular data set. And then, you know, as we talk about, right, you kind of went through this crawl, walk, run. You kind of mentioned there, you know, multi-agent setups or scenarios. You know, what should business leaders, even though, yeah, for some industries, it might be a while until we get there. But what should companies be keeping in mind when it comes to

Starting point is 00:23:20 multi-agentic AI and making sure that they have reliable agents? You know, very soon, we're going to get to a world of, like, having small agents talking to each other. So, you know, the kind of microagentic architecture that I talk about, which is truly multi-agentic systems. For example, and this is a very simple example that, you know, a lot of people would have heard about, but for those who haven't, let's say there's a travel agent application. Like, it's helping me firstly create my itinerary plan for my vacation. And then it also helps me book things end to end and then coordinate any logistics for me if I had to ship things when I'm traveling,

Starting point is 00:24:04 etc. Right. So it can be an end-to-end thing. Even the supply chain agent that I talked about, which, you know, one of our customers has implemented, is literally, you know, looking at the inventory in their warehouses and automatically placing orders. And so there are multiple tasks that this entire workflow is doing, right? From planning to making reservations or making orders, and then execution of those orders, right? Or tracking of those orders. These can be implemented as individual agents that talk to each other. And yes, over time, the planning agent's capabilities can just grow.

Starting point is 00:24:47 It can not only help me book hotels and flights, but also restaurants and excursions. And for that, it has to talk to other agents that are specialized in making reservations for those things. So now, in that world of multi-agentic systems, there are three things that need to be really solid. First is, again, trust. When an agent

Starting point is 00:25:13 is talking to another agent, how can it trust that other agent? What kind of observability can we get from the other agent? Like, Jordan, right now I can see you, so I know it's you. But if it was some bot that I was talking to, I wouldn't trust it as much as I can trust you, because I can see you. I'm getting that observability feedback that it's you.

Starting point is 00:25:29 Similarly, you know, the other challenge beyond trust is authentication. Like, how do I know that, you know, an agent handing off the task to another agent can authenticate me as an end user? And the third thing is communication.

Starting point is 00:25:54 And, like, these three are kind of the top priorities for most of us in the agentic space. And communication is solved by a bunch of things. You know, you'll see there are MCP-based patterns that are emerging. Google's announced the A2A protocol. You know, Galileo has been part of even founding this AGNTCY organization, which is, you know, a truly open organization

Starting point is 00:26:26 that is making multi-agentic systems easier to solve for. And you'll see that, overall, the world will kind of converge on one protocol where any heterogeneous agents, built in any system using any LLM, are able to talk to each other. So Yash, we've covered a lot in today's episode, but as we wrap up, what is the one most important takeaway

Starting point is 00:26:53 that our audience needs to know when it comes to building reliable agents, but specifically for mission critical tasks? Yeah, thanks, Jordan. I think the one thing, as we head towards a multi-agentic world, is that let's get our single agents to be more reliable now. You know, it's very important. Even if you're not launching them in production, there is a whole stack. There is a whole playbook being created on how to build, how to launch, like just the

Starting point is 00:27:25 CI/CD aspect of things. How do we make them reliable in production? There's prevention. There is mitigation. How can we prevent bad outcomes from happening? How can we mitigate them? Because in a non-deterministic world of software, again, good evaluations followed by good, reliable prevention and mitigation is going to be absolutely critical to being successful. All right. Some important information, not just for the future, but things you need to start doing today to prepare for the multi-agentic future of tomorrow.
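
The prevention and mitigation steps mentioned in this answer, together with the earlier point about stopping a wrong tool call in under 300 milliseconds, might look roughly like the sketch below. Everything in it is an assumption made for illustration: the allowlist, the quantity limit, the tool names, and the fail-closed behavior are hypothetical, and this is not Galileo's guardrail implementation.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    reason: str
    elapsed_ms: float


ALLOWED_TOOLS = {"lookup_order", "place_order"}  # hypothetical allowlist
MAX_ORDER_QUANTITY = 1000                        # hypothetical business rule


def check_tool_call(call: ToolCall, budget_ms: float = 300.0) -> Verdict:
    """Prevention: decide whether a proposed tool call may run at all."""
    start = time.perf_counter()
    allowed, reason = True, "ok"
    if call.name not in ALLOWED_TOOLS:
        allowed, reason = False, f"unknown tool: {call.name}"
    elif (
        call.name == "place_order"
        and call.arguments.get("quantity", 0) > MAX_ORDER_QUANTITY
    ):
        allowed, reason = False, "quantity exceeds limit"
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > budget_ms:
        # Fail closed: a check too slow to be real-time blocks the call.
        allowed, reason = False, "guardrail exceeded latency budget"
    return Verdict(allowed, reason, elapsed_ms)


def execute_with_guardrail(call: ToolCall, tools: dict) -> str:
    verdict = check_tool_call(call)
    if not verdict.allowed:
        # Mitigation: skip the call and hand the task to a person instead.
        return f"blocked ({verdict.reason}); escalating to a human"
    return tools[call.name](**call.arguments)
```

The design choice worth noting is that the check runs before the tool does, so a bad call never reaches the real system, and anything blocked is escalated rather than silently dropped.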

Starting point is 00:28:02 Yash, thank you so much for your time and for joining the Everyday AI show. We really appreciate it. Pleasure being here, Jordan. Thank you. All right. As a reminder, y'all, that was a lot. If you missed anything, don't worry. We're going to be sharing it in our newsletter, some resources that Yash talked about, the AI agent leaderboard.

Starting point is 00:28:17 It's all going to be in there. So if you haven't already, please sign up for our free daily newsletter at youreverydayai.com. So thank you for tuning in. Please join us tomorrow and every day for more Everyday AI. Thanks, y'all. Meet Firefly AI Assistant. Now live in Adobe Firefly, the All-in-One Creative AI Studio. Just describe what you want to create in your own words and the assistant handles the rest,

Starting point is 00:28:43 orchestrating multi-step workflows across Adobe Creative Cloud apps, including Photoshop, Premiere Express, and more in one conversational interface. You direct the outcome while the assistant accelerates execution. Stay in control with the ability to step in and refine at any time. See it today at firefly.adobe.com. And that's a wrap for today's edition of Everyday AI. Thanks for joining us. If you enjoyed this episode, please subscribe and leave us a rating.

Starting point is 00:29:17 It helps keep us going. For a little more AI magic, visit youreverydayai.com and sign up to our daily newsletter so you don't get left behind. Go break some barriers and we'll see you next time.
