Everyday AI Podcast – An AI and ChatGPT Podcast - EP 545: How to build reliable AI agents for mission-critical tasks
Episode Date: June 12, 2025
Every enterprise is legit rushing to build AI agents. But there's no instructions. So, what do you do? How do you make sure it works? How do you track reliability and traceability? We dive in ...and find out.
Website: YourEverydayAI.com
Email The Show: info@youreverydayai.com
Topics Covered in This Episode:
Google Gemini's Veo 3 Video Creation Tool
Trust & Reliability in AI Agents
Building Reliable AI Agents Guide
Agentic AI for Mission-Critical Tasks
Micro Agentic System Architecture Discussion
Nondeterministic Software Challenges for Enterprises
Galileo's Agent Leaderboard Overview
Multi-Agent Systems: Future Protocols
Timestamps:
00:00 "Building Reliable Agentic AI"
05:23 The Future of Autonomous AI Agents
08:43 Chatbots vs. Agents: Key Differences
10:48 "Galileo Drives Enterprise AI Adoption"
13:24 Utilizing AI in Regulated Industries
18:10 Test-Driven Development for Reliable Agents
22:07 Evolving AI Models and Tools
24:05 "Multi-Agent Systems Revolution"
27:40 Ensuring Reliability in Single Agents
Keywords:
Google Gemini, Agentic AI, reliable AI agents, mission-critical tasks, large language models, AI reliability platform, AI implementation, microservices, micro agents, ChatGPT, AI observability, enterprise applications, nondeterministic software, multi-agentic systems, AI trust, AI authentication, AI communication, AI production, test-driven development, agent EVALS, Hugging Face space, tool calls, expert protocol, MCP protocol, Google A2A protocol, multi-agent systems, agent reliability, real-time prevention, CICD aspect, mission-critical agents, nondeterministic world, reliable software, Galileo, agent leaderboard, AI planning, AI execution, observability feedback, API calls, tool selection quality.
This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips.
Listen daily for practical advice to boost your career, business, and everyday life.
Meet Firefly AI Assistant, now live in Adobe Firefly, the All In One Creative AI Studio.
Just describe what you want to create and the assistant handles the rest,
orchestrating multi-step workflows across Photoshop, Premiere Express, and more in one conversational interface.
You direct the outcome.
The assistant accelerates execution.
It seems like the entire business world is sprinting to implement agentic AI.
But what's crazy is there's no real playbook per se, right?
Because not only is the concept and even kind of the definition of agents changing,
but so too are the models that generally power them.
So I think it's an important conversation for us to have today to talk about, well,
how do you just build more reliable AI agents for mission critical tasks?
And luckily, even though, like I said, there's not an official playbook, because this technology
and the developments in AI and large language models are so fast moving, we at least have
today a great guest, the co-founder of Galileo, and they're kind of writing the unofficial
playbook, at least for how so many enterprises are using agents and agentic AI. All right, I'm excited for
today's conversation. I hope you are too. What's going on, y'all? If you're new here, my name's
Jordan Wilson. Welcome to Everyday AI. This is your daily live stream podcast and free daily newsletter,
helping us all not just learn AI, but how we can leverage it to grow our companies and our careers.
It starts here with this unedited, unscripted live stream and podcast. But if you really want to
leverage what we're going to be going over today in this conversation,
make sure, if you haven't already, to go to youreverydayai.com and sign up for the free daily newsletter.
In that newsletter, we're going to be recapping the best insights from today's conversation,
as well as keeping you up to date with everything else that's happening in the world of AI because it doesn't sleep, kind of like me.
All right.
So if you haven't done that, make sure you do that.
And if you're looking for the AI news, sometimes we start the show with that.
Today, not so much.
Technically pre-recorded, we're debuting it live.
So make sure if you want the AI news, just go check the newsletter.
All right, enough from me.
This is a conversation I think you need to hear.
All right, we're going to talk about a lot of things,
but just where we're at with AI agents,
where we're headed, how to increase trust and reliability.
But luckily, you don't have to hear me ramble on about it.
We have a great guest.
I'm excited.
So live stream audience, please help me welcome to the stage.
There we got them.
We have Yash Sheth, the co-founder and COO of Galileo.
Yash, thank you so much for joining the Everyday AI show.
Thanks, Jordan.
Thanks for having me here.
All right. Can you explain to everyone like what is Galileo? What is it that you all do?
Yeah. So at Galileo, we are building the reliability platform for agentic development.
So basically, think of Galileo as the AI reliability platform for helping developers build, ship, and scale their agentic applications reliably.
We started the company about four years ago. That's a while before ChatGPT even came out.
And that's because my co-founders and I
were working on LLMs back at Google, and we realized that it's going to be extremely hard
for enterprises to adopt these models for mission-critical tasks.
And I love the topic that we have today because we're finally here.
We're here where AI is being adopted in the mainstream.
We really want to see what is the extent of the ROI that we can
extract from this exciting new technology.
And large language models are amazing at doing so many things.
But when it comes to enterprise applications, these models, as we all know, may not have
ever seen the data that we are working with or the systems that we are
trying to automate in our enterprise.
And as we talk more, I would love to shed some light on what we're seeing there, Jordan.
But that's the gist of what Galileo does.
Yeah.
And yeah, I do want to dive into a lot of aspects here.
But I kind of maybe want to start at the end.
And then we can work our way backwards.
But, you know, ultimately, and I know it's not an easy answer, right, because it's probably the billion-dollar
answer of 2025 when it comes to AI implementation. But how do companies build more reliable AI agents,
right? It seems like no one quite knows. There's no official playbook per se. How do companies
do this? What are the correct steps they should be taking?
Yeah, and that's a great question
because we're thinking about this all the time, Jordan. And I think this is really top of mind for
everyone building agents today.
I think I'll start off with at least why we're doing that,
just so that everyone understands what all the hype is about
agents, right?
We're all seeing how ChatGPT, Gemini, Claude, and others can be
super helpful in our day-to-day tasks.
They can be great assistants, right?
But you still have a human in the loop when it comes to really large-scale
tasks that are being done.
And in order to see real ROI from LLMs, from AI, I think the big shift or the big
excitement is all about can we automate a lot of things that we do today without even
involving a human in the loop?
And can these systems just leverage the true intelligence that they have
to understand the right intent behind
what a human is doing, talk to systems behind the scenes, and actually get end-to-end workflows done?
And so that's why the world is so excited about building agents.
And if I were to put it bluntly, today, in the world of software,
we talk about having microservices, small software services that do specific things,
even websites, right?
So we're moving from a world of microservice-based architectures and software to microagents and microagentic software,
where each component of the software we build today
will become more intelligent, smarter, and more independent at doing tasks, rather than us as developers
having to hard-code heuristics within it.
So that's why I wanted to start with why all the hype about building agents is actually there,
and why it is so important for us.
Yeah.
And I think even that, you know, like zooming out even more is maybe even helpful, right?
Because I think as the actual large language models themselves, especially via like an AI chatbot interface,
as they change, right?
So as an example, you know, you mentioned Claude, you know, with ChatGPT, Google Gemini, obviously.
Now the base models have, you know, agentic capabilities, right?
You know, these reasoning models, they'll start to plan.
You know, they'll say, oh, wait, I need to go look at more documents.
Oh, I can tap into all of your company's dynamic data.
So it's almost like the lines are blurring between traditional AI chatbots and, you know, more quote unquote
traditional agents.
So like where do you even draw the line, right?
Like what is technically like, oh, I'm just chatting with a, you know, an AI chatbot
that has agentic capabilities connected to my data versus like, oh, I'm working with a
full-blown agent?
Yeah, I mean, just going to some of the more academic definitions
of agents:
typically a chatbot ends with an answer as the final action.
An agent typically has three stages.
Firstly, there's a planning phase,
where the agent is trying to be
curious about what you want
and getting you as a user
to the right spot before it actually takes the action
for you itself. And then even after taking the action,
there's typically a reflection step
on, like, did I do it well,
or is there anything missing?
For example, as a human,
if I am ordering something for you,
let's say
we have a waiter in the restaurant,
someone who is actually ordering food for you.
Even after they complete the task,
they do ask, like, is that all?
Would you want anything else?
There's actually
a feedback loop happening all the time.
And that's the big difference between a chatbot
and an agent.
And I can relate to that very easily as a consumer
when I'm talking to ChatGPT these days as a user.
Now it's starting to do some actions.
Like it can take over my screen.
It can do stuff for me.
So that's agentic in nature.
But when you build these capabilities into your apps, that becomes an agentic experience.
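The three stages described here (plan, act, reflect) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names, the `AgentRun` record, and the toy "ordering" logic are all hypothetical, and a real agent would call an LLM and real tools where the comments indicate.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """Record of one pass through the plan -> act -> reflect loop."""
    request: str
    plan: list[str] = field(default_factory=list)
    results: list[str] = field(default_factory=list)
    follow_up: str = ""


def plan(request: str) -> list[str]:
    # Planning phase: break the request into steps before acting.
    # A real agent would ask an LLM (and possibly the user) here.
    return [item.strip() for item in request.split(",") if item.strip()]


def act(step: str) -> str:
    # Action phase: a real agent would call a tool or API here.
    return f"ordered {step}"


def reflect(run: AgentRun) -> str:
    # Reflection phase: check the work and ask whether anything is missing,
    # like the waiter asking "is that all?" after taking an order.
    if len(run.results) == len(run.plan):
        return "Is that all? Would you like anything else?"
    return "Some steps did not complete; a retry is needed."


def run_agent(request: str) -> AgentRun:
    run = AgentRun(request=request)
    run.plan = plan(request)
    run.results = [act(step) for step in run.plan]
    run.follow_up = reflect(run)
    return run
```

A chatbot, by contrast, would stop after producing a single answer; the reflection step is what closes the feedback loop.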
So where would you say where are we at today?
Because obviously, as someone that talks about AI every day
and I'm lucky enough to get to talk to smart people like you,
sometimes I feel like I'm in a little agentic bubble, right?
Like are enterprises at scale actually using agents today?
If so, how?
If not, why?
Yeah, that's a great question.
So Galileo works with hundreds of enterprise teams
and organizations, so we're able to see
agentic development across the board,
whether it's financial services, healthcare, retail, telco,
and even cutting-edge startups, right?
And there's a whole spectrum of adoption.
Obviously, startups are all going to be all about agentic workflows.
But we're seeing a lot
of investment from enterprises as well.
Now,
we already have several customers who have agents that are live in production.
And we have customers who are thinking, this year is just going to be about
us productionizing RAG applications and chatbots, and agents are going to be next year,
because these are heavily regulated industries.
Chatbots and RAG applications are basically a stepping stone. It's like the crawl, walk, run analogy, right? They're still in the crawl phase, next year with agents is going to be walk, and then multi-agents is going to be the true run. But it's so exciting to see that the adoption curve is extremely fast. It's something that we've never seen before, honestly.
Even within the regulated
industries and
organizations,
a lot of teams are already
building agentic applications,
and they will be
productionized this year.
So it's incredible to see
the arc of innovation
that even regulated organizations,
large organizations,
are able to achieve with this technology now.
We've seen agents
that can actually preempt
internet outages in production.
We've seen agents that can really manage the data platform for an organization.
We've seen supply chain agents that can look at multiple warehouses and automatically place
orders.
And these three are just the starting point, right?
And all three use cases, as I mentioned, are far away from being a
chatbot-style use case.
And, you know, one thing that I wanted to dive into a little bit deeper there that you
said is, you know, talking about some of the companies you work with
that are in highly regulated industries, right?
So whether that's, you know, finance, health care, et cetera.
But when it comes to agents working on mission-critical tasks, right, aside from, you know,
certain regulation that you might not be able to
engineer around, right? Aside from that, what are those most important
steps for enterprises that do have that in their control to be able to start unleashing agents on
those actually mission-critical tasks versus just the low-hanging fruit, you know, content
creation, research, right? Areas where I think agentic AI has already done very well in 2025.
So what are those most crucial next steps to actually get to those mission-critical
tasks for enterprise?
Yeah.
Like, the last three examples I gave you are actually mission-critical
tasks that run businesses, right? And if anything were to go wrong there,
you have real-world systems that are going to fail. And we've seen enterprises actually
be successful in deploying these mission-critical agents. And this brings us back to your
very first question, Jordan, on, again, how do we make agentic AI reliable, right? And that's
because that's the number one thing.
There are other things as well,
and we can get to those,
which will truly make multi-agentic systems successful.
But the number one thing for agents right now is trust and reliability.
How can we make sure of that? Agents on mission-critical tasks
means that these agents have access to and control over real-world systems.
They can make API calls
to tools, backend databases, ticketing systems, to update them, to change them.
And if these agents were to behave incorrectly, and to preface this as well,
the reason why we keep talking about if these agents were to go wrong is because
we are entering a world of non-deterministic software.
And enterprises as a whole need to adapt to a world
of non-deterministic software.
It's a new world.
None of us have built these agents at scale before.
And so setting up a reliable pipeline for building, shipping, and scaling
these agents is absolutely critical, and I'm happy to share more there.
Yeah, and I think it would be a good time to talk about the agent leaderboard.
But real quick, before we do, first, a quick word from
our sponsors at Google.
This podcast is supported by Google.
Hey, everyone.
David here, one of the product leads for Google Gemini.
Check out Veo 3, our state-of-the-art AI video generation model
in the Gemini app, which lets you create high-quality,
eight-second videos with native audio generation.
Try it with a Google AI Pro plan or get the highest access with the Ultra plan.
Sign up at Gemini.com to get started and show us what you create.
All right, let's dig in a little bit on this trust and reliability aspect, right?
Because, you know, even if we're talking AI chatbots, right, obviously trust and reliability is still huge.
But when we start talking about agents, it becomes even more paramount.
So for our live stream audience, I'm sharing Galileo's agent leaderboard that's on a Hugging Face Space here.
But Yash, maybe if you could walk us through, what is this agent leaderboard?
And what does it help people understand, at least when it comes to trust and reliability?
Yeah.
So before we walk through this, Jordan, I want to at least preface this by creating
the understanding that reliability in agents comes through a foundation of really test-driven
development for these agents and having high-quality
evals for an agent.
And building agent evals is not just like evaluating an LLM.
It's more about creating unit tests for an agent and integration tests for
an agent.
Just like software engineering best practices.
When we build these agents, the eval tests that we're building
are extremely catered towards the use case that you're building for.
That includes the metrics
that understand what the agent is doing
and what the expectation from the agent is.
And it includes a data set that is actually applicable
to that use case.
Now, once you start with that strong foundation,
you can apply these eval tests in real time as well.
Because in production, your agents are going to be non-deterministic.
So these strong evals power really valuable observability.
And then, even using these strong eval tests,
you can create strong guardrails
that can prevent bad outcomes altogether.
Imagine if your agent started hallucinating,
or it started making the wrong tool calls,
API calls,
and you could prevent it in under 300 milliseconds,
which is super real time.
Then you have the end-to-end reliability pipeline.
And what we're doing here with the agent leaderboard
is actually helping
teams understand just how to get started
evaluating LLMs and agents.
And so with the agent leaderboard,
our teams at Galileo
have selected many, many models
and actually applied them to several use cases.
There are data sets, there are agent prototypes,
and we run them across different LLMs.
And then we evaluate them for the task that they're doing.
So we have the entire GitHub repo linked there.
Anyone can go in and start seeing that and actually start doing their own evals with the GitHub repo,
and come up with their own sort of ranking of, like, how can I architect the best, most accurate agent for my use case?
This leaderboard is extremely popular.
We've received millions of hits on this.
And this just shows that developers and teams really want to go in and see how these LLMs apply to real-world agents, and not have some academic benchmark that doesn't represent real-world use cases rank these models for them.
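The idea of treating evals like unit tests, and then reusing the same checks as a runtime guardrail, can be sketched as follows. This is a hedged illustration, not Galileo's product or API: the dataset, the tool names, `toy_agent`, `run_evals`, and `guarded_call` are all hypothetical stand-ins.

```python
from typing import Callable

# Hypothetical use-case dataset: user request -> tool the agent should call.
EVAL_CASES = [
    {"request": "Refund order 1182", "expected_tool": "issue_refund"},
    {"request": "Where is my package?", "expected_tool": "track_shipment"},
]

# Hypothetical allow-list of tools this agent may touch in production.
ALLOWED_TOOLS = {"issue_refund", "track_shipment"}


def toy_agent(request: str) -> str:
    """Stand-in for a real agent; returns the name of the tool it would call."""
    return "issue_refund" if "refund" in request.lower() else "track_shipment"


def run_evals(agent: Callable[[str], str]) -> float:
    """Unit-test style eval: fraction of cases where the expected tool was chosen."""
    passed = sum(agent(c["request"]) == c["expected_tool"] for c in EVAL_CASES)
    return passed / len(EVAL_CASES)


def guarded_call(agent: Callable[[str], str], request: str) -> str:
    """Runtime guardrail: block any tool call outside the allow-list."""
    tool = agent(request)
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"blocked unexpected tool call: {tool}")
    return tool
```

The same check runs twice: offline in `run_evals` before shipping, and online in `guarded_call`, where the agent's behavior is non-deterministic and a bad tool call has to be stopped before it reaches a real system.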
So, you know, as an example, maybe help us simplify this for our non-technical audience, but there's a data set here.
right, that kind of tells users a little bit more about what's actually going on under the hood.
Let me see if I can get it in the correct window.
There we go for our live stream audience.
But as an example, if I scroll down here in the data set, you know, one of these conversation pieces is, you know, who won the last World Cup in football, right?
And then there's a way to test if these different models in an agentic framework are able to retrieve the information
correctly. So number one, is that how it works? But number two, maybe can you just tell us how
it works with this data set and then how you're able to tell, hey, which models are maybe
most reliable or trustworthy in these agentic settings?
Yeah. And I'll answer your last question first, because this is an ever-moving
target, right? New models, new capabilities are coming out so fast. And, you know,
our teams are even having a hard time keeping up with that, honestly. But which
model is the best, that answer can only be gotten by going to the leaderboard and seeing
what the live comparison is. We keep updating these fairly regularly.
In terms of the data set, you'll see the most common pattern for an agent
is to make a tool call, right? And that's where, you know, we've seen how popular
MCP has become, the protocol from Anthropic, because it really reduced
the overhead of an LLM calling a tool. It just simplifies that at the
most basic level, right? And that makes agent building a lot more exciting and easy.
So in this data set, we're seeing agents make a bunch of tool calls, and
the most basic requirement we want from an LLM is to not make the wrong tool call,
to understand the action of tool calling really well. So that's
what we're seeing in this data set.
And then even the evaluation metrics are based on tool selection quality for this particular
data set.
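To make "tool selection quality" concrete, here is a deliberately simplified sketch: an exact-match score over the tool chosen at each turn, used to rank models. This is an assumption for illustration only and is not the leaderboard's actual metric; the function names, tool names, and model names are hypothetical.

```python
from typing import Optional


def tool_selection_score(
    expected: list[Optional[str]], actual: list[Optional[str]]
) -> float:
    """Fraction of turns where the model picked the expected tool.

    None means "no tool call should be made" for that turn.
    """
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same length")
    if not expected:
        return 0.0
    correct = sum(e == a for e, a in zip(expected, actual))
    return correct / len(expected)


def rank_models(
    expected: list[Optional[str]], runs: dict[str, list[Optional[str]]]
) -> list[tuple[str, float]]:
    """Score each model's tool calls and sort best first."""
    scores = {
        name: tool_selection_score(expected, calls) for name, calls in runs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For example, with four turns where the expected calls are `["web_search", None, "get_weather", "web_search"]`, a model that gets three of them right scores 0.75 and ranks above one that gets only one right.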
And then, you know, as we talk about, right, you kind of went through this
crawl, walk, run.
You know, you kind of mentioned there, you know, multi-agent setups or scenarios.
You know, what should business leaders, even though, yeah, for some industries, it might be a
while until we get there. But what should companies be keeping in mind when it comes to
multi-agentic AI and making sure that they have reliable agents?
You know, very soon, we're going to get to a world of small agents talking to each
other. So, the kind of microagentic architecture that I talk about is
truly multi-agentic systems. For example, and this is a very simple example
that a lot of people would have heard about, but for those who haven't,
let's say there's a travel agent application. It's helping me
firstly create my itinerary plan for my vacation. And then it also helps me book things end to end,
and then coordinates any logistics for me if I had to ship things when I'm traveling,
et cetera. Right. So it can be an end-to-end thing. Even the supply chain agent that I talked about,
which one of our customers has implemented,
is literally looking at the inventory in their warehouses and automatically placing orders.
And so there are multiple tasks that this entire workflow is doing, right?
From planning to making reservations or making orders, and then execution of those orders,
or tracking of those orders. These can be implemented as individual agents that talk to each other.
And over time, the planning agent's capabilities
can just grow.
It can not only help me book hotels and flights,
but also restaurants and excursions.
And for that, it has to talk to other agents
that are specialized in making reservations for those things.
So now in that world of multi-agentic systems,
there are three things that need to be really solid.
First is, again,
trust. When an agent
is talking to another agent, how can it
trust that other agent?
What kind of observability can we
get from the other agent? Say,
like, Jordan, right now I can see you,
so I know it's you.
But if it was some bot
that I was talking to, then I wouldn't trust
it as much as I can trust
you, because I can see you. I'm getting that
observability feedback that
it's you.
Similarly,
the other challenge beyond trust is authentication.
Like, how do I know that an agent handing off the task to another agent can authenticate me as an end user?
And the third thing is communication.
These three are kind of the top priorities for most of us in the agentic space.
And communication is being solved by a bunch of things.
You'll see
there are MCP-based patterns
that are emerging.
Google's announced the A2A protocol.
Galileo has been part of even founding this AGNTCY organization,
which is a truly open organization
that is making multi-agentic systems easier to solve for.
And you'll see that overall, the world will kind of converge on one protocol
where heterogeneous agents
built in any system
using any LLM are able to talk to each other.
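The trust and authentication concerns in an agent-to-agent handoff can be illustrated with a signed message. This sketch is not MCP, A2A, or any real agent protocol; it only assumes two agents share a secret key, and the message fields and function names are hypothetical. It uses Python's standard `hmac` and `hashlib` modules.

```python
import hashlib
import hmac
import json


def sign_handoff(payload: dict, secret: bytes) -> dict:
    """Wrap a task handoff with an HMAC-SHA256 signature over its contents."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def verify_handoff(message: dict, secret: bytes) -> bool:
    """Return True only if the payload still matches its signature."""
    body = json.dumps(message["payload"], sort_keys=True).encode("utf-8")
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, message["signature"])
```

A planning agent would sign a handoff such as `{"from": "planner", "to": "booking", "user": "u1", "task": "book hotel"}`, and the booking agent would call `verify_handoff` before acting. A tampered task or a sender without the shared secret fails verification.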
So Yash, we've covered a lot in today's episode,
but as we wrap up,
what is the one most important takeaway
that our audience needs to know
when it comes to building reliable agents,
but specifically for mission-critical tasks?
Yeah, thanks, Jordan.
I think the one thing, as we head towards a multi-agentic
world, is this: let's get our single agents to be more reliable now.
It's very important. Even if you're not launching them in production, there is a whole
stack. There is a whole playbook being created on how to build, how to launch, like just the
CI/CD aspect of things. How do we make them reliable in production? There's prevention. There is mitigation.
How can we prevent bad outcomes from happening? How can
we mitigate them? Because in a non-deterministic world of software, again,
good evaluations followed by good, reliable prevention and mitigation
are going to be absolutely critical to being successful.
All right.
Some important information, not just for the future,
but things you need to start doing today to prepare for the multi-agentic future of tomorrow.
Yash, thank you so much for your time and for joining the Everyday AI show.
We really appreciate it.
Pleasure being here, Jordan.
Thank you.
All right.
As a reminder, y'all, that was a lot.
If you missed anything, don't worry.
We're going to be sharing it in our newsletter, along with some resources that Yash talked about, like the AI agent leaderboard.
It's all going to be in there.
So if you haven't already, please sign up for our free daily newsletter at youreverydayai.com.
So thank you for tuning in.
Please join us tomorrow and every day for more everyday AI.
Thanks, y'all.
And that's a wrap for today's edition of Everyday AI.
Thanks for joining us.
If you enjoyed this episode, please subscribe and leave us a rating.
It helps keep us going.
For a little more AI magic, visit YourEverydayAI.com and sign up to our daily newsletter so you don't get left behind.
Go break some barriers and we'll see you next time.
