Everyday AI Podcast – An AI and ChatGPT Podcast - EP 575: Preparing Enterprises for Reliable AI Agent Deployment

Episode Date: July 25, 2025

Every enterprise is legit rushing to build AI agents. But there's no instructions. So, what do you do? How do you make sure it works? How do you track reliability and traceability? We dive in... and find out.

Newsletter: Sign up for our free daily newsletter
More on this Episode: Episode Page
Join the discussion: Have a question? Join the convo here.
Upcoming Episodes: Check out the upcoming Everyday AI Livestream lineup
Website: YourEverydayAI.com
Email The Show: info@youreverydayai.com
Connect with Jordan on LinkedIn

Topics Covered in This Episode:
Google Gemini's Veo 3 Video Creation Tool
Trust & Reliability in AI Agents
Building Reliable AI Agents Guide
Agentic AI for Mission-Critical Tasks
Micro Agentic System Architecture Discussion
Nondeterministic Software Challenges for Enterprises
Galileo's Agent Leaderboard Overview
Multi-Agent Systems: Future Protocols

Timestamps:
00:00 "Building Reliable Agentic AI"
05:23 The Future of Autonomous AI Agents
08:43 Chatbots vs. Agents: Key Differences
10:48 "Galileo Drives Enterprise AI Adoption"
13:24 Utilizing AI in Regulated Industries
18:10 Test-Driven Development for Reliable Agents
22:07 Evolving AI Models and Tools
24:05 "Multi-Agent Systems Revolution"
27:40 Ensuring Reliability in Single Agents

Keywords:
Google Gemini, Agentic AI, reliable AI agents, mission-critical tasks, large language models, AI reliability platform, AI implementation, microservices, micro agents, ChatGPT, AI observability, enterprise applications, nondeterministic software, multi-agentic systems, AI trust, AI authentication, AI communication, AI production, test-driven development, agent EVALS, Hugging Face space, tool calls, expert protocol, MCP protocol, Google A2A protocol, multi-agent systems, agent reliability, real-time prevention, CICD aspect, mission-critical agents, nondeterministic world, reliable software, Galileo, agent leaderboard, AI planning, AI execution, observability feedback, API calls, tool selection quality.

Send Everyday AI and Jordan a text message. (We can't reply back unless you leave contact info)

Start Here ▶️
Not sure where to start when it comes to AI? Start with our Start Here Series. You can listen to the first drop -- Episode 691 -- or get free access to our Inner Circle community and all episodes: StartHereSeries.com
Also, here's a link to the entire series on a Spotify playlist.

Transcript
Starting point is 00:00:00 This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips. Listen daily for practical advice to boost your career, business, and everyday life. Meet Firefly AI Assistant, now live in Adobe Firefly, the all-in-one creative AI studio. Just describe what you want to create and the assistant handles the rest, orchestrating multi-step workflows across Photoshop, Premiere Express, and more in one conversational interface. You direct the outcome. The assistant accelerates execution. It seems like the entire business world is sprinting to implement agentic AI.
Starting point is 00:00:52 But what's crazy is there's no real playbook per se, right? Because not only is the concept and even kind of the definition of agents changing, but so too are the models that generally power them. So I think it's an important conversation for us to have today to talk about, well, how do you just build more reliable AI agents for mission-critical tasks? And luckily, like I said, even though there's not an official playbook, because this technology and the developments in AI and large language models are so fast moving, we at least have today a great guest, the co-founder of Galileo,
Starting point is 00:01:36 who is kind of writing the unofficial playbook, at least, for how so many enterprises are using agents and agentic AI. All right, I'm excited for today's conversation. I hope you are too. What's going on, y'all? If you're new here, my name's Jordan Wilson. Welcome to Everyday AI. This is your daily live stream podcast and free daily newsletter, helping us all not just learn AI, but how we can leverage it to grow our companies and our careers. It starts here with this unedited, unscripted live stream and podcast. But if you really want to leverage what we're going to be going over today in this conversation, make sure, if you haven't already, to go to YourEverydayAI.com.
Starting point is 00:02:12 Sign up for the free daily newsletter. In that newsletter, we're going to be recapping the best insights from today's conversation, as well as keeping you up to date with everything else that's happening in the world of AI because it doesn't sleep, kind of like me. All right. So if you haven't done that, make sure you do that. And if you're looking for the AI news, sometimes we start the show with that. Today, not so much.
Starting point is 00:02:32 Technically pre-recorded, we're debuting it live. So make sure if you want the AI news, just go check the newsletter. All right. From me, this is a conversation I think you need to hear. All right, we're going to talk about a lot of things, but just where we're at with AI agents, where we're headed, how to increase trust and reliability. But luckily, you don't have to hear me ramble on about it. We have a great guest.
Starting point is 00:02:54 I'm excited. So live stream audience, please help me welcome to the stage. There we got them. We have Yash Sheth, the co-founder and COO of Galileo. Yash, thank you so much for joining the Everyday AI show. Thanks, Jordan. Thanks for having me here. All right.
Starting point is 00:03:08 Can you explain to everyone, like, what is Galileo? What is it that you all do? Yeah. So at Galileo, we are building the reliability platform for agentic development. And so basically, think of Galileo as the AI reliability platform for helping developers build, ship, and scale their agentic applications reliably. You know, we started the company about four years ago. That's, you know, a while before ChatGPT even came out.
Starting point is 00:03:37 And that's because, you know, me and my co-founders actually realized, when we were working on LLMs back at Google, that it's going to be extremely hard for enterprises to adopt these models for mission-critical tasks. And, you know, I love the topic that we have today because we're finally here. We're here where, you know, AI is being adopted in the mainstream. You know, we really want to see what is the extent of the ROI that we can deliver, that we can extract from this exciting new technology. And large language models, you know, are amazing at doing so many things. But when it comes to enterprise applications, these models, as we all know, may not have
Starting point is 00:04:25 ever seen the data that we are working with or the systems that we are trying to automate in our enterprise. And then, you know, as we talk more, I would love to, you know, even shed some light on what we're seeing there, Jordan. But, you know, that's the gist of what Galileo does. Yeah. And yeah, I do want to dive into a lot of aspects here. But I kind of maybe want to start at the end. And then we can work our way backwards.
Starting point is 00:04:57 But, you know, ultimately, and I know it's not an easy answer, right, because it's probably the billion-dollar answer of 2025 when it comes to AI implementation. But how do companies build more reliable AI agents, right? It seems like no one quite knows. There's no official playbook per se. How do companies do this? What are the correct steps they should be taking? Yeah, and that's a great question because we're thinking about this all the time, Jordan. And I think this is really top of mind for everyone building agents today. I mean, I think I'll start off with at least why we are doing that, just so that, you know, at least everyone understands what all the hype about agents is, right? You know, we're all seeing how ChatGPT, Gemini,
Starting point is 00:05:48 and others, and Claude as well, you know, can be super helpful in our day-to-day tasks. Like, they can be great assistants, right? But you still have a human in the loop when it comes to really large-scale tasks that are being done. And in order to see real ROI from LLMs, from AI, I think the big shift or the big excitement is all about: can we automate a lot of things that we do today without even involving a human in the loop? And can these systems just leverage the true intelligence that they have to understand the right intent behind what a human is doing, talk to systems behind the scenes, and actually get end-to-end workflows done?
Starting point is 00:06:39 And so that's why the world is so excited about building agents. And if I were to put it bluntly, you know, today, in the world of software, like, you know, we talk about having microservices and small software services that do specific things, even websites, right? So we're moving from a world of microservice-based architectures and software to microagents and microagentic software, where each component of the software we build today will just become more intelligent and smarter and more independent at doing tasks, rather than us as developers having to hard code heuristics within it.
Starting point is 00:07:26 So that's why I wanted to start with, like, you know, why all the hype of building agents is actually there. And, you know, why is it so important for us? Yeah. And I think even that, you know, like zooming out even more is maybe even helpful, right? Because I think as the actual large language models themselves, especially via like an AI chatbot interface, as they change, right? So as an example, you know, you mentioned Claude, you know, with ChatGPT, Google Gemini, obviously.
Starting point is 00:08:05 Now the base models have, you know, agentic capabilities, right? You know, these reasoning models, they'll start to plan. You know, they'll say, oh, wait, I need to go look at more documents. Oh, I can tap into all of your company's dynamic data. So it's almost like the lines are blurring between traditional AI chatbots and, you know, more quote unquote traditional agents. So like where do you even draw the line, right? Like what is technically like, oh, I'm just chatting with an AI chatbot that has access
Starting point is 00:08:37 to my data versus like, oh, I'm working with a full-blown agent? Yeah, I think, I mean, just going to some of the more academic definitions of agents, typically a chatbot ends with an answer as the final action. An agent typically has three stages. Firstly, I would think, there's a planning phase where the agent is trying to be curious about what you want, getting you as a user to the right spot before it actually takes the action for you itself. And then even after taking the action, there's typically a reflection step on, like, did I do it well? Or is there anything missing? For example, as a human, if I am ordering something for you, or let's say, you know, we have a waiter in the restaurant,
Starting point is 00:09:38 someone is actually ordering food for you. Even after they complete the task, they do ask, like, is that all? Would you want anything else? Like, you know, there's actually a feedback loop happening all the time. And that's the big difference from a chatbot. And, you know, as a consumer, I can relate to that very easily when I'm talking to ChatGPT these days as a user. Now it's starting to do some actions. Like, it can take over, you know, my screen, it can do stuff
Starting point is 00:10:14 for me, so that's agentic in nature. But when you build these capabilities into your apps, that becomes an agentic experience. So where would you say, where are we at today? Right? Because obviously, as someone that talks about AI every day and I'm lucky enough to get to talk to, you know,
Starting point is 00:10:33 smart people like you, sometimes I feel like I'm in a little agentic bubble, right? Like, are enterprises at scale actually using agents today? If so, how? If not, why?
Starting point is 00:10:47 Yeah, that's a great question. So I think, you know, Galileo works with, you know, hundreds of enterprise teams and,
Starting point is 00:10:54 organizations, so we're able to see agentic development across the board, whether it's financial services, healthcare, retail, telco, and even, you know, cutting-edge startups, right? And there's a whole spectrum of adoption. You know, there are so many teams, and I mean, obviously, startups are all going to be all about agentic workflows. But we're even seeing a lot of investment from enterprises as well. Now, we have
Starting point is 00:11:30 several customers who already have agents that are live in production. And we have customers who are even thinking, like, this year is just going to be about us productionizing RAG applications and chatbots, and agents are going to be next year, because these are heavily regulated industries. And, you know, basically chatbots and RAG applications are a stepping stone. You know, it's like the crawl, walk, run analogy, right?
Starting point is 00:12:01 Like, you know, they're still in the crawl phase and, you know, next year with agents is going to be walk, and then multi-agents is going to be the true run, right? But it's so exciting to see that the adoption curve is extremely fast. It's something that we've never seen before, honestly. Even with the regulated industries and organizations, a lot of teams are already building agentic applications
Starting point is 00:12:29 and they will be productionized this year. So it's incredible to see the arc of innovation that even regulated organizations, large organizations are able to do with this technology now. So that's where
Starting point is 00:12:45 we see, we've seen agents that can actually preempt internet outages in production. We've seen agents that can really manage the data platform for an organization. We've seen supply chain agents that can look at multiple warehouses and automatically place orders. And these three are just, you know, the starting point, right? And all three use cases, as I mentioned, they're far away from being a
Starting point is 00:13:20 chatbot-style use case. And, you know, one thing that I wanted to dive into a little bit deeper there that you said is, you know, talking about some of the companies you work with that are in highly regulated industries, right? So whether that's, you know, finance, healthcare, et cetera. But when it comes to agents working on mission-critical tasks, right, aside from, you know, certain regulation that you might not be able to out-engineer or, you know, engineer around, right? But aside from that, what are those most important
Starting point is 00:13:54 steps for enterprises that do have that in their control to be able to start unleashing agents on those actually mission-critical tasks versus just the low-hanging fruit, you know, content creation, research, right? Areas where I think agentic AI has already done very well in 2025. So what are those most crucial next steps to actually get to those mission-critical tasks for enterprise? Yeah, and the last three examples I gave you were actually mission-critical tasks that run businesses, right? And if anything were to go wrong there, like, you know, you have real-world systems
Starting point is 00:14:31 that are going to fail. And we've seen enterprises actually, you know, be successful in deploying these mission-critical agents. And this brings us back to your very first question, Jordan, on, you know, again, how do we make agentic AI reliable? Right? Because that's the number one thing. And there are other things as well, which will truly make multi-agentic systems successful, and we can go there. But the number one thing for agents right now is trust and reliability. How can we make these agents reliable?
Starting point is 00:15:07 Mission-critical tasks means that these agents have access to and control over real-world systems. They can make API calls to tools, backend databases, ticketing systems, to update them, to change them. And if these agents were to behave incorrectly, you know, and to preface this as well, the reason why we keep talking about
Starting point is 00:15:36 if these agents were to, you know, go wrong is because we are entering a world of non-deterministic software. Enterprises as a whole need to adapt to a world of non-deterministic software. It's a new world. None of us have built these agents at scale before. And so reliability, setting up a reliable pipeline for building, shipping, and scaling these agents, is absolutely critical, and I'm happy to share more there.
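To make the idea of non-deterministic software concrete, here is a minimal sketch in Python. It is not Galileo's method; `flaky_agent`, the tool names, the 90% success figure, and the 0.95 threshold are all hypothetical, chosen only to show why a single passing run says little and why a pass rate over many runs is a more useful signal.

```python
import random


def flaky_agent(rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM-backed agent.

    It picks the correct tool most of the time, but not always,
    which is what makes the software non-deterministic.
    """
    return "update_ticket" if rng.random() < 0.9 else "delete_ticket"


def pass_rate(trials: int, seed: int = 0) -> float:
    """Run the same request many times and report the share of correct runs."""
    rng = random.Random(seed)  # seeded so the experiment itself is repeatable
    passes = sum(flaky_agent(rng) == "update_ticket" for _ in range(trials))
    return passes / trials


def meets_threshold(rate: float, threshold: float = 0.95) -> bool:
    """Assumed release gate: ship only if the pass rate clears the threshold."""
    return rate >= threshold


if __name__ == "__main__":
    rate = pass_rate(1000)
    print(f"pass rate: {rate:.3f}, ship: {meets_threshold(rate)}")
```

The design choice here is to treat each test as a measurement rather than a yes/no check, so a team can decide how much failure it is willing to tolerate for a given task.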
Starting point is 00:16:09 Yeah, and I think it would be a good time to talk about the agent leaderboard. But real quick, before we do, first a quick word from our sponsors at Google. This podcast is supported by Google. Hi, folks. Paige Bailey here from the Google DeepMind DevRel team. For our developers out there, we know there's a constant trade-off between model intelligence, speed, and cost. Gemini 2.5 Flash aims right at that challenge.
Starting point is 00:16:35 It's got the speed you expect from Flash, but with upgraded reasoning power. And crucially, we've added controls like setting thinking budgets, so you can decide how much reasoning to apply, optimizing for latency and cost. So try out Gemini 2.5 Flash at aistudio.google.com and let us know what you build. Adobe just introduced an entirely new way to create, bringing the power and precision of its creative suite into one conversational experience. Meet Firefly AI Assistant, now live in the Adobe Firefly app, the all-in-one creative AI studio. Powered by Adobe's Creative Agent, Firefly AI Assistant lets you start with your vision, just describe what you want, and shape the outcome as it takes form with the assistant. The assistant orchestrates multi-step workflows, drawing on 60 plus pro-grade tools across
Starting point is 00:17:28 Adobe Creative Cloud apps, including Photoshop, Illustrator, Premiere, Lightroom Express, and more to help bring your ideas to life. You can also get started with creative skills, a growing library of pre-built workflows for common creative tasks, like batch editing photos, creating mood boards, portrait retouching, and creating social variations. Every step the assistant takes is visible, so you can refine, redirect, or take over at any time. You stay in the driver's seat as the creative director. Adobe Firefly AI Assistant, now in public beta. See it today at firefly.adobe.com.
Starting point is 00:18:08 All right, let's dig in a little bit on this trust and reliability aspect, right? Because, you know, even if we're talking AI chatbots, right, obviously trust and reliability is still huge. But when we start talking about agents, it becomes even more paramount. So for our live stream audience, I'm sharing Galileo's agent leaderboard that's on a Hugging Face space here. But Yash, maybe if you could walk us through, what is this agent leaderboard? And what does it help people understand, at least when it comes to trust and reliability? Yeah. So before we walk through this, Jordan, I want to at least preface this
Starting point is 00:18:53 just by creating the understanding that reliability in agents comes through, you know, a foundation of really test-driven development for these agents. Like having high-quality evals for an agent. And building agent evals is not just like evaluating an LLM. You got it. It's more about, you know, creating unit tests for an agent and integration tests for an agent. Just like, you know, software engineering best practices. When we build these agents, the eval tests that we're building
Starting point is 00:19:29 are extremely catered towards the use case that you're building for. They include the metrics that understand what the agent is doing and what the expectation from the agent is. And they include a dataset that is actually applicable to that use case. Now, once you start with that strong foundation, you can apply these eval tests in real time as well, because in production, your agents are going to be non-deterministic. So these strong evals power, you know, really valuable observability. And then even using these strong eval tests,
Starting point is 00:20:09 you can create strong guardrails that can prevent bad outcomes altogether. Like, imagine if your agent started hallucinating, or started making the wrong tool calls, API calls. If you can prevent that in under, you know, 300 milliseconds, that is super real time. Then you have the end-to-end reliability pipeline in place. And what we're doing here with the agent leaderboard is actually helping teams understand just how do we start getting LLMs and agents.
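The two ideas described here, eval tests written like unit tests and a guardrail that blocks a wrong tool call before it runs, can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not Galileo's product or API: `toy_agent`, `EVAL_CASES`, `ALLOWED_TOOLS`, and the tool names are all hypothetical, and the agent is a keyword router standing in for a real LLM-backed agent.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """A tool invocation the agent wants to make."""
    name: str
    args: dict = field(default_factory=dict)


# Hypothetical use-case-specific eval dataset:
# (user request, tool the agent is expected to select).
EVAL_CASES = [
    ("Where is my order 1234?", "lookup_order"),
    ("Cancel my subscription", "cancel_subscription"),
    ("Refund order 1234", "issue_refund"),
]

# Hypothetical allow-list used by the guardrail.
ALLOWED_TOOLS = {"lookup_order", "cancel_subscription", "issue_refund"}


def toy_agent(request: str) -> ToolCall:
    """Stand-in for a real agent: routes on keywords only."""
    text = request.lower()
    if "refund" in text:
        return ToolCall("issue_refund", {"request": request})
    if "cancel" in text:
        return ToolCall("cancel_subscription", {"request": request})
    return ToolCall("lookup_order", {"request": request})


def tool_selection_accuracy(agent, cases) -> float:
    """Eval metric: fraction of cases where the expected tool was chosen."""
    hits = sum(agent(request).name == expected for request, expected in cases)
    return hits / len(cases)


def guardrail(call: ToolCall) -> bool:
    """Runtime check: allow the call only if the tool is on the allow-list."""
    return call.name in ALLOWED_TOOLS


if __name__ == "__main__":
    print("tool selection accuracy:", tool_selection_accuracy(toy_agent, EVAL_CASES))
    print("unknown tool allowed:", guardrail(ToolCall("delete_database")))
```

The same check serves twice: offline as a test over a dataset, and online as a gate in front of each tool call. A production guardrail would also have to fit the latency budget mentioned in the conversation, which this sketch does not measure.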
Starting point is 00:20:42 And so with the agent leaderboard, you know, our teams at Galileo, like, you know, we've selected many, many models and actually applied them to several use cases, with datasets that are more like, you know, agent prototypes, and we run them across different LLMs, and then we evaluate them for the task that they're doing. So we have the entire GitHub repo link there. Anyone can go in and start seeing that and actually start doing their own evals with the GitHub repo and coming up with their own sort of ranking of, like, how can I architect the best, most accurate agent for my use case. This leaderboard is extremely popular.
Starting point is 00:21:28 You know, we've received millions of hits on this. And this just shows that developers and teams, they really want to go in and see how these LLMs apply to real-world agents and not have some academic benchmark, you know, rank these models for them that doesn't represent real-world use cases. So, as an example, maybe help us simplify this for our non-technical audience, but there's a data set here, right, that kind of tells users a little bit more about what's actually going on under the hood.
Starting point is 00:22:10 Let me see if I can get it in the correct window. There we go for our live stream audience. But, you know, as an example, if I scroll down here in the data set, you know, one of these conversation pieces is, you know, who won the last World Cup in football, right? And then there's a way to test if these different models in an agentic framework are able to retrieve the information correctly. So number one, is that how it works? But number two, maybe can you just tell us how it works with this data set and then how you're
Starting point is 00:22:38 able to tell, hey, which models are maybe most reliable or trustworthy in these agentic settings? Yeah. I mean, I'll answer your last question first because this is an ever-moving target, right? Like new models, new capabilities are coming out so fast. And, you know, our teams are even having a hard time keeping up with that, honestly. But which model is the best? That answer can only be gotten by going to the leaderboard and seeing what is the
Starting point is 00:23:07 live comparison. Like, we keep updating these fairly regularly. In terms of the dataset, right, you'll see that the most common pattern for an agent is to make a tool call, right? And that's where, you know, we've seen how popular MCP has become, the Model Context Protocol from Anthropic, because it really reduces the overhead of an LLM calling a tool. It just simplifies that at the most basic level, right? And that makes agent building a lot more exciting and easy. So in this data set, you know, we're seeing agents make a bunch of
Starting point is 00:23:50 tool calls. And the most basic requirement we want is for an LLM to not make the wrong tool call, to understand the action of tool calling really well. So that's what we're seeing in this data set. And then even the evaluation metrics are based on tool selection quality for this particular dataset. And then, you know, as we talk about, right, you kind of went through this crawl, walk, run. You know, you kind of mentioned there, you know, multi-agent setups or scenarios. You know, what should business leaders, even though, yeah, for some industries, it might be a while until we get there, but what should companies be keeping in mind when it
Starting point is 00:24:35 comes to multi-agentic AI and making sure that they have reliable agents? You know, very soon, we're going to get to a world of, like, having small agents talking to each other. So, you know, the kind of microagentic architecture that I talked about, which is truly multi-agentic systems. For example, and this is a very simple example that, you know, a lot of people would have heard about, but for those who haven't, let's say there's a travel agent application. Like, it's helping me firstly create my itinerary plan for my vacation. And then it also helps me book things end to end and then coordinate any logistics for me if I had to ship things when I'm traveling, etc. Right. So it can be an end-to-end thing,
Starting point is 00:25:23 even a supply chain agent that I talked about, which, you know, one of our customers has implemented, is literally, like, you know, looking at the inventory in their warehouses and automatically placing orders. And so there are multiple tasks that this entire workflow is doing, right from planning to making reservations or making orders and then execution of those orders, right? Or tracking of those orders. These can be implemented as individual agents that talk to each other. And yes, over time, the planning agent's capabilities can just grow. It can not only help me book hotels and flights, but also restaurants and excursions. And for that, it has to talk to other agents that specialize in booking,
Starting point is 00:26:15 making reservations for those things. So now in that world of multi-agent existence, there are three things that need to be really solved for. First is, again, trust. When an agent is talking to another agent, how can it trust that other agent? What kind of observability can we get from the other agent? To say, like, Jordan, right now I can see you, so I know it's you. But if it was some bot that I was talking to, then I wouldn't trust it as much as I can trust you
Starting point is 00:26:47 because I can see you. I'm getting that observability feedback that it's you. Similarly, you know, there are also other challenges beyond trust. Second is authentication. Like, how do I know that an agent passing and handing
Starting point is 00:27:03 off the task to another agent can authenticate me as an end user? And the third thing is communication. These three are kind of the top priorities for most of us in the agentic space. And communication is solved by a bunch of things. You know, you'll see there are MCP-based patterns that are emerging. Google's announced the A2A protocol. You know, Galileo has been part of even founding this AGNTCY organization, which is, you know, a truly open organization that is making multi-agentic systems easier to solve for, and
Starting point is 00:27:46 you'll see that overall the world will kind of converge on one protocol where any heterogeneous agents, built in any system, built using any LLM, are able to talk to each other. So, Yash, we've covered a lot in today's episode, but as we wrap up, what is the one most important takeaway that our audience needs to know when it comes to building reliable agents, specifically for mission-critical tasks? Yeah, thanks. I think the one thing, as we head towards a multi-agentic world, is that let's get our single agents to be more reliable now. You know, it's very important.
Starting point is 00:28:30 Even if you're not launching them in production, there is a whole stack, there is a whole playbook being created on how to build, how to launch, like just the CI/CD aspect of things. How do we make them reliable in production? There's prevention. There is mitigation. How can we prevent bad outcomes from happening? How can we mitigate them? Because in a non-deterministic world of software, again, good evaluations, followed by good
Starting point is 00:29:00 reliable preventions and mitigations, are going to be absolutely critical to be successful in the world of non-deterministic software. All right. Some important information, not just for the future, but things you need to start doing today to prepare for the multi-agentic future of tomorrow. Yash, thank you so much for your time and for joining the Everyday AI show.
Starting point is 00:29:21 We really appreciate it. Pleasure being here, Jordan. Thank you. All right. Now, as a reminder, y'all, that was a lot. If you missed anything, don't worry. We're going to be sharing it in our newsletter, some resources that Yash talked about, the AI agent leaderboard.
Starting point is 00:29:33 It's all going to be in there. So if you haven't already, please sign up for our free daily newsletter at YourEverydayAI.com. So thank you for tuning in. Please join us tomorrow and every day for more Everyday AI. Thanks, y'all. Meet Firefly AI Assistant. Now live in Adobe Firefly, the all-in-one creative AI studio.
Starting point is 00:29:55 Just describe what you want to create in your own words and the assistant handles the rest, orchestrating multi-step workflows across Adobe Creative Cloud apps, including Photoshop, Premiere Express, and more in one conversational interface. You direct the outcome while the assistant accelerates execution. Stay in control with the ability to step in and refine at any time. See it today at firefly.adobe.com. And that's a wrap for today's edition of Everyday AI.
Starting point is 00:30:27 Thanks for joining us. If you enjoyed this episode, please subscribe and leave us a rating. It helps keep us going. For a little more AI magic, visit YourEverydayAI.com and sign up to our daily newsletter so you don't get left behind. Go break some barriers and we'll see you next time. Thank you.
