Everyday AI Podcast – An AI and ChatGPT Podcast - EP 575: Preparing Enterprises for Reliable AI Agent Deployment
Episode Date: July 25, 2025
Every enterprise is legit rushing to build AI agents. But there are no instructions. So, what do you do? How do you make sure it works? How do you track reliability and traceability? We dive in and find out.
Website: YourEverydayAI.com
Email The Show: info@youreverydayai.com
Topics Covered in This Episode:
Google Gemini's Veo 3 Video Creation Tool
Trust & Reliability in AI Agents
Building Reliable AI Agents Guide
Agentic AI for Mission-Critical Tasks
Micro Agentic System Architecture Discussion
Nondeterministic Software Challenges for Enterprises
Galileo's Agent Leaderboard Overview
Multi-Agent Systems: Future Protocols
Timestamps:
00:00 "Building Reliable Agentic AI"
05:23 The Future of Autonomous AI Agents
08:43 Chatbots vs. Agents: Key Differences
10:48 "Galileo Drives Enterprise AI Adoption"
13:24 Utilizing AI in Regulated Industries
18:10 Test-Driven Development for Reliable Agents
22:07 Evolving AI Models and Tools
24:05 "Multi-Agent Systems Revolution"
27:40 Ensuring Reliability in Single Agents
Keywords:
Google Gemini, Agentic AI, reliable AI agents, mission-critical tasks, large language models, AI reliability platform, AI implementation, microservices, micro agents, ChuckGPT, AI observability, enterprise applications, nondeterministic software, multi-agentic systems, AI trust, AI authentication, AI communication, AI production, test-driven development, agent EVALS, Hugging Face space, tool calls, expert protocol, MCP protocol, Google A2A protocol, multi-agent systems, agent reliability, real-time prevention, CICD aspect, mission-critical agents, nondeterministic world, reliable software, Galileo, agent leaderboard, AI planning, AI execution, observability feedback, API calls, tool selection quality.
Transcript
This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips.
Listen daily for practical advice to boost your career, business, and everyday life.
Meet Firefly AI Assistant, now live in Adobe Firefly, the all-in-one creative AI studio.
Just describe what you want to create and the assistant handles the rest,
orchestrating multi-step workflows across Photoshop, Premiere Express, and more in one conversational interface.
You direct the outcome.
The assistant accelerates execution.
It seems like the entire business world is sprinting to implement agentic AI.
But what's crazy is there's no real playbook per se, right?
Because not only is the concept and even kind of the definition of agents changing,
but so too are the models that generally power them.
So I think it's an important conversation for us to have today to talk about, well,
how do you just build more reliable AI agents for mission critical tasks?
And luckily, like I said, even though there's not an official playbook,
because this technology and the developments in AI and large language models are so fast-moving,
we at least have a great guest today, the co-founder of Galileo,
and they're kind of writing the unofficial playbook, at least, for how so many
enterprises are using agents and agentic AI. All right, I'm excited for today's conversation.
I hope you are too. What's going on, y'all? If you're new here, my name's Jordan Wilson.
Welcome to Everyday AI. This is your daily live stream podcast and free daily newsletter,
helping us all not just learn AI, but how we can leverage it to grow our companies and our careers.
It starts here with this unedited, unscripted, live stream and podcast. But if you really want to
leverage what we're going to be going over today in this conversation, make sure if you haven't already
to go to YourEverydayAI.com.
Sign up for the free daily newsletter.
In that newsletter, we're going to be recapping the best insights from today's conversation,
as well as keeping you up to date with everything else that's happening in the world of AI
because it doesn't sleep, kind of like me.
All right.
So if you haven't done that, make sure you do that.
And if you're looking for the AI news, sometimes we start the show with that.
Today, not so much.
Technically pre-recorded, we're debuting it live.
So make sure if you want the AI news, just go check the newsletter.
All right.
from me, this is a conversation I think you need to hear.
All right, we're going to talk about a lot of things, but just where we're at with AI
agents, where we're headed, how to increase trust and reliability.
But luckily, you don't have to hear me ramble on about it.
We have a great guest.
I'm excited.
So live stream audience, please help me welcome to the stage.
There we got them.
We have Yash Sheth, the co-founder and COO of Galileo.
Yash, thank you so much for joining the Everyday AI show.
Thanks, Jordan.
Thanks for having me here.
All right.
Can you explain to everyone, like, what is Galileo?
What is it that you all do?
Yeah.
So at Galileo, we are building the reliability platform for agentic development.
And so basically, think of Galileo as the AI reliability platform for helping developers
build, ship, and scale their agentic applications reliably.
You know, we started the company about four years ago.
That's, you know, a while before ChatGPT even came out.
And that's because, you know, me and my co-founders were working on LLMs back at Google,
and we realized that it's going to be extremely hard for enterprises to adopt these models for mission-critical tasks.
And, you know, I love the topic that we have today because we're finally here.
We're here where, you know, AI is being adopted in the mainstream.
You know, we really want to see what is the extent of the ROI that we can
extract from this exciting new technology.
And large language models, you know, are amazing at doing so many things.
But when it comes to enterprise applications, these models, as we all know, may not have
ever seen the data that we are working with or the systems that we are
trying to automate in our enterprise.
And then, you know, as we talk more, I would love to, you know, shed some light on what we're seeing there, Jordan.
But, you know, that's the gist of what Galileo does.
Yeah.
And yeah, I do want to dive into a lot of aspects here.
But I kind of maybe want to start at the end.
And then we can work our way backwards.
But, you know, ultimately, and I know it's not an easy answer, right, because it's probably the billion-dollar
answer of 2025 when it comes to AI implementation. But how do companies build more reliable
AI agents, right? It seems like no one quite knows. There's no official playbook per se. How do
companies do this? What are the correct steps they should be taking?
Yeah, and that's a great question because we're thinking about this all the time, Jordan. And I think this is really
top of mind for everyone building agents today. I think I'll start off with at least
why we are doing that, just so that, you know, everyone understands what all
the hype about agents is, right? You know, we're all seeing how ChatGPT, Gemini,
Claude, and others can be super helpful in our day-to-day
tasks. Like, they can be great assistants, right? But you still have a human in the loop when it comes
to really large-scale tasks that are being done. And in order to see real ROI from LLMs, from
AI, I think the big shift or the big excitement is all about: can we automate a lot of things
that we do today without even involving a human in the loop? And can these systems,
you know, just leverage the true intelligence that they have to understand
the right intent behind what a human is doing,
talk to systems behind the scenes, and actually get end-to-end workflows done?
And so that's why the world is so excited about building agents.
And if I were to put it bluntly, you know, today,
in the world of software, you know, we talk about having microservices,
small software services that do specific things, even websites, right?
So we're moving from a world of microservice-based architectures and software to micro-agents and micro-agentic software,
where each component of the software we build today
will just become more intelligent and smarter and more independent at doing tasks, rather than us, as developers,
having to hard-code heuristics within it.
So that's why I wanted to start with, you know, why all the hype about building agents is actually there.
And, you know, why it is so important for us.
Yeah.
And I think even that, you know, like zooming out even more is maybe even helpful, right?
Because I think as the actual large language models themselves, especially via like an AI chatbot interface,
as they change, right?
So as an example, you know, you mentioned Claude, you know, with ChatGPT, Google Gemini,
obviously.
Now the base models have, you know, agentic capabilities, right?
You know, these reasoning models, they'll start to plan.
You know, they'll say, oh, wait, I need to go look at more documents.
Oh, I can tap into all of your company's dynamic data.
So it's almost like the lines are blurring between traditional AI chatbots and, you know, more quote
unquote traditional agents.
So like where do you even draw the line, right?
Like what is technically like, oh, I'm just chatting with an AI chatbot that has access
to my data versus like, oh, I'm working with a full-blown agent?
Yeah, I mean, just going to some of the more academic definitions of
agents:
Typically, a chatbot ends with an answer as the final action.
An agent typically has three stages. Firstly, there's a planning phase, where the agent is trying to be curious about what you want and getting you as a user to the right spot before it actually takes the action for you itself. And then, even after taking the action, there's typically a reflection step: did I do it well?
Or is there anything missing?
For example, you know, as a human, if I am ordering something for you,
let's say, you know, we have a waiter in the restaurant,
someone who is actually ordering food for you.
Even after they complete the task, they do ask, like, is that all?
Would you want anything else?
Like, you know, there's actually a feedback loop happening all the time.
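The plan, act, reflect stages described above can be sketched as a toy Python example built on the waiter analogy. This is only an illustration under stated assumptions: the `OrderAgent` class, its methods, and the menu are hypothetical and do not come from Galileo or any agent framework, and a real agent would call an LLM at each stage instead of using simple string logic.

```python
from dataclasses import dataclass, field


@dataclass
class OrderAgent:
    """Toy agent with plan, act, and reflect stages (all names hypothetical)."""

    menu: set
    log: list = field(default_factory=list)

    def plan(self, request: str) -> list:
        # Planning: turn the raw request into concrete items before acting.
        items = [part.strip().lower() for part in request.split(",") if part.strip()]
        self.log.append(f"plan: {items}")
        return items

    def act(self, items: list) -> list:
        # Action: "place" the order for the items that exist on the menu.
        placed = [item for item in items if item in self.menu]
        self.log.append(f"act: {placed}")
        return placed

    def reflect(self, items: list, placed: list) -> list:
        # Reflection: check what is still missing, like asking "is that all?"
        missing = [item for item in items if item not in placed]
        self.log.append(f"reflect: missing={missing}")
        return missing

    def run(self, request: str) -> dict:
        items = self.plan(request)
        placed = self.act(items)
        missing = self.reflect(items, placed)
        return {"placed": placed, "missing": missing}


agent = OrderAgent(menu={"soup", "salad"})
result = agent.run("Soup, pasta")
print(result)
```

A chatbot, by contrast, would stop after producing an answer; here the reflection step feeds back what was not completed so the agent (or the user) can follow up.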
And that's the big difference from a chatbot.
And you can easily relate to that, you know, as a consumer. I can relate to that
very easily when I'm talking to ChatGPT these days as a user, and
now it's starting to do some actions. Like, it can take over, you know, my screen, it can do stuff
for me. So that's agentic in nature. But when you build these capabilities
into your apps, that becomes an agentic experience.
So where would you say we are at today?
Right?
Because obviously, as someone that talks about AI every day,
and I'm lucky enough to get to talk to, you know, smart people like you,
sometimes I feel like I'm in a little agentic bubble, right?
Like, are enterprises at scale actually using agents today?
If so, how? If not, why?
Yeah, that's a great question.
So I think, you know, Galileo works with, you know, hundreds of enterprise teams
and organizations, so we're able to see agentic development across the board, whether it's financial services,
healthcare, retail, telco, and even, you know, cutting-edge startups, right? And there's a whole
spectrum of adoption. You know, there are so many teams, and I mean, obviously, startups are
all going to be all about agentic workflows. But we're seeing a lot of
investments from enterprises as well.
Now, we already have several customers who have agents that are live in production.
And we have customers who are even thinking, like, this year is just going to be about
us productionizing RAG applications and chatbots, and agents are going to be next year,
because these are heavily regulated industries.
And, you know, basically chatbots and RAG applications are a stepping stone.
You know, it's like the crawl, walk, run analogy, right?
Like, you know, they're still in the crawl phase and, you know, next year with agents is going to be walk, and then multi-agents is going to be the true run, right?
But it's so exciting to see the adoption curve is extremely fast.
It's something that we've never seen before, honestly.
Even with the regulated industries and organizations,
a lot of teams are already building agentic applications,
and they will be productionized this year.
So it's incredible to see the arc of innovation
that even regulated organizations, large organizations, are able to achieve
with this technology now.
We've seen agents that can actually preempt
internet outages in production.
We've seen agents that can really manage the data platform for an organization.
We've seen supply chain agents that can look at multiple warehouses and automatically place
orders.
And these three are just, you know, the starting point, right?
And all three use cases, as I mentioned, are far away from being
chatbot-style use cases.
And, you know, one thing that I wanted to dive into a little bit deeper there that you
said is, you know, talking about, you know, some of the companies you work with
that are in highly regulated industries, right?
So whether that's, you know, finance, healthcare, et cetera.
But when it comes to agents working on mission-critical tasks, right, aside from, you know,
certain regulation that, you know, you might not be able to
engineer around, right? Aside from that, what are those most important
steps for enterprises that do have that in their control to be able to start unleashing agents
on those actually mission-critical tasks versus just the low-hanging fruit, you know, content
creation, research, right? Areas where I think agentic AI has already done very well in 2025.
So what are those most crucial next steps to actually get to those mission-critical
tasks for enterprise?
Yeah, and the last three examples I gave you were actually mission-critical;
they're mission-critical tasks that run businesses, right?
And if anything were to go wrong there, you know, you have real-world systems
that are going to fail.
And we've seen enterprises actually, you know, be successful in deploying
these mission-critical agents.
And this brings us back to your very first question, Jordan, on, you know, again, how do we
make agentic AI reliable?
Right, because that's the number one thing. And there are other things
as well, which we can get to, that will truly make multi-agentic systems successful.
But the number one thing for agents right now is trust and reliability. How can we make these
agents reliable? Mission-critical tasks mean that these agents have access to and control
over real-world systems. They can make
API calls to tools,
backend databases, ticketing systems,
to update them, to change them.
And if these agents were to behave incorrectly,
you know, and to preface this as well,
the reason why we keep talking about
if these agents were to, you know, go wrong
is because we are entering a world of non-deterministic software.
Enterprises as a whole need to adapt
to a world of non-deterministic software.
It's a new world.
None of us have built these agents at scale before.
And so reliability, setting up a reliable pipeline for building, shipping, and scaling these agents,
is absolutely critical, and I'm happy to share more there.
Yeah, and I think it would be a good time to talk about the agent leaderboard.
But real quick, before we do, first, a quick word
from our sponsors at Google.
This podcast is supported by Google.
Hi, folks. Paige Bailey here from the Google DeepMind DevRel team.
For our developers out there, we know there's a constant trade-off between model intelligence,
speed, and cost.
Gemini 2.5 Flash aims right at that challenge.
It's got the speed you expect from Flash, but with upgraded reasoning power.
And crucially, we've added controls like setting thinking budgets,
so you can decide how much reasoning to apply, optimizing for latency and costs.
So try out Gemini 2.5 Flash at AIS Studio.com and let us know what you build.
All right, let's dig in a little bit on this trust, this trust and reliability aspect, right?
Because that, you know, even if we're talking AI chatbots, right, obviously trust and reliability is still huge.
But when we start talking about agents, it becomes even more paramount.
So for our live stream audience, I'm sharing Galileo's agent leaderboard that's on the Hugging Face space here.
But Yash, maybe if you could walk us through, what is this agent leaderboard?
And what does it help people understand, at least when it comes to trust and reliability?
Yeah.
So before we walk through this, Jordan, I want to at least preface this,
just by creating the understanding that reliability in agents comes through, you know,
a foundation of really test-driven development for these agents.
Like having high-quality evals for an agent.
And building agent evals is not just like evaluating an LLM.
You got it.
It's more about, you know, creating unit tests for an agent and integration tests for an agent.
Just like, you know, software engineering best practices.
When we build these agents, the eval tests that we're building
are extremely catered towards the use case that you're building for.
They include the metrics that capture what the agent is doing and what the expectation from the agent is.
And they include a dataset that is actually applicable to that use case.
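To make the idea of an eval as a unit test concrete, here is a minimal Python sketch with a use-case-specific dataset and one metric. Everything in it is an assumption for illustration: the supply-chain style requests, the tool names, the keyword-based `toy_agent` (a stand-in for a real LLM-backed agent), and the 0.9 pass threshold are all hypothetical and are not Galileo's evals or leaderboard code.

```python
# Hypothetical eval dataset: (user request, tool the agent is expected to pick).
EVAL_CASES = [
    ("What is the stock level in warehouse 7?", "inventory_lookup"),
    ("Order 200 more units of part A12", "place_order"),
    ("Where is shipment 4471 right now?", "track_shipment"),
]


def toy_agent(request: str) -> str:
    """Stand-in for a real agent: picks a tool by keyword (illustrative only)."""
    text = request.lower()
    if "order" in text:
        return "place_order"
    if "shipment" in text:
        return "track_shipment"
    return "inventory_lookup"


def tool_selection_accuracy(agent, cases) -> float:
    """Metric: fraction of cases where the agent picked the expected tool."""
    correct = sum(1 for request, expected in cases if agent(request) == expected)
    return correct / len(cases)


def test_agent_picks_expected_tools():
    # Unit-test style eval: fail the build if accuracy drops below a threshold.
    assert tool_selection_accuracy(toy_agent, EVAL_CASES) >= 0.9


score = tool_selection_accuracy(toy_agent, EVAL_CASES)
test_agent_picks_expected_tools()
print(f"tool selection accuracy: {score:.2f}")
```

Because a real agent is non-deterministic, a test like this would typically be run several times, or over a larger dataset, rather than relying on a single pass.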
Now, once you start with that strong foundation, you can apply these eval tests
in real time as well, because in production,
your agents are going to be non-deterministic.
So these strong evals power, you know, really valuable observability.
And then, using these strong eval tests,
you can create strong guardrails that can prevent bad outcomes altogether.
Like, imagine if your agent started hallucinating,
or started making the wrong tool calls, API calls.
If you can prevent that in under, you know,
300 milliseconds, that is super real-time.
Then you have the end-to-end reliability pipeline in place.
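A guardrail of the kind described here can be sketched as a check that runs before any tool call is executed. This is a simplified illustration, not Galileo's product or API: the allow-list, the order-quantity limit, and the function names are hypothetical, and only the 300 millisecond budget comes from the conversation.

```python
import time
from dataclasses import dataclass

# Assumed policy values for illustration only.
ALLOWED_TOOLS = {"inventory_lookup", "place_order"}
MAX_ORDER_QUANTITY = 1000
LATENCY_BUDGET_S = 0.3  # the "under 300 milliseconds" target mentioned above


@dataclass
class ToolCall:
    name: str
    args: dict


def check_tool_call(call: ToolCall):
    """Return (allowed, reason) for a proposed tool call."""
    if call.name not in ALLOWED_TOOLS:
        return False, f"unknown tool: {call.name}"
    if call.name == "place_order" and call.args.get("quantity", 0) > MAX_ORDER_QUANTITY:
        return False, "order quantity exceeds limit"
    return True, "ok"


def guarded_execute(call: ToolCall, execute):
    """Run the guardrail first; only execute the call if it passes in time."""
    start = time.perf_counter()
    allowed, reason = check_tool_call(call)
    elapsed = time.perf_counter() - start
    if elapsed > LATENCY_BUDGET_S:
        # Fail closed: a check that is too slow is treated as a block.
        return {"status": "blocked", "reason": "guardrail exceeded latency budget"}
    if not allowed:
        return {"status": "blocked", "reason": reason}
    return {"status": "ok", "result": execute(call)}


def fake_execute(call: ToolCall) -> str:
    # Stand-in for a real API call to a backend system.
    return f"executed {call.name}"


good = guarded_execute(ToolCall("place_order", {"quantity": 200}), fake_execute)
bad = guarded_execute(ToolCall("delete_database", {}), fake_execute)
too_big = guarded_execute(ToolCall("place_order", {"quantity": 50000}), fake_execute)
print(good, bad, too_big)
```

The design choice in this sketch is to fail closed: if the check is slow or the call looks wrong, the real-world system is never touched.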
And what we're doing with the agent leaderboard is actually helping teams understand
just how do we get started with LLMs and agents.
And so with the agent leaderboard, you know, our teams at Galileo
have selected many, many models and actually applied them to several use cases, to
datasets that are more, you know, like agent prototypes. We run them across different
LLMs and then we evaluate them for the task that they're doing. We have the entire GitHub
repo link there, so anyone can go in and start seeing that and actually start doing their own
evals with the GitHub repo and coming up with their own sort of ranking of, like, how can I
architect the best, most accurate agent for my use case.
This leaderboard is extremely popular.
You know, we've received millions of hits on this.
And this just shows that developers and teams,
they really want to go in and see how these LLMs apply to real world agents
and not have some academic benchmark, you know, rank these models for them
that don't represent real-world use cases.
So, as an example, maybe help us simplify this for our non-technical audience,
but there's a data set here, right, that kind of tells users a little bit more
about what's actually going on under the hood.
Let me see if I can get it in the correct window.
There we go for our live stream audience.
But, you know, as an example, if I scroll down here in the data set, you know,
one of these conversation pieces is, you know, who won the last World Cup in football, right?
And then there's a way to test if these different models in an agentic framework are able to
retrieve the information correctly.
So number one, is that how it works?
But number two, maybe can you just tell us how it works with this data set and then how you're
able to tell, hey, which models are maybe most reliable or trustworthy in these agentic settings?
Yeah.
I mean, I'll answer your last question first because
this is an ever-moving target, right?
Like new models, new capabilities are coming out so fast.
And, you know, our teams are even having a hard time keeping up with that, honestly.
But which model is the best?
That answer can only be gotten by going to the leaderboard and seeing what the
live comparison is.
Like, we keep updating these fairly regularly.
In terms of the dataset, right,
you'll see that the most common pattern for an agent is to make a tool call, right? And
that's why, you know, we've seen how popular MCP has become, the protocol from Anthropic,
because it really reduces the overhead of an LLM calling a tool. It just simplifies that
at the most basic level, right? And that makes agent building
a lot more exciting and easy. So in this dataset, you know, we're seeing agents make a bunch of
tool calls. And the most basic requirement we want is for an LLM to not make the wrong tool call,
to understand the action of tool calling really well. So that's what we're seeing in this dataset.
And even the evaluation metrics are based on tool selection quality
for this particular dataset.
And then, you know, as we talk about, right, you kind of went through this crawl, walk, run.
You know, you kind of mentioned there, you know, multi-agent setups or scenarios.
You know, what should business leaders, even though, yeah, for some industries,
it might be a while until we get there, but what should companies be keeping in mind when it
comes to multi-agentic AI and making sure that they have reliable agents?
You know, very soon, we're going to get to a world of having small agents talking to each other.
So, you know, the kind of micro-agentic architecture that I talked about, which is truly multi-agentic systems.
For example, and this is a very simple example that, you know, a lot of people would have heard about,
but for those who haven't: let's say there's a travel agent application.
It's helping me, firstly, create my itinerary plan for
my vacation. And then it also helps me book things end to end and then coordinates any logistics
for me if I had to ship things when I'm traveling, etc. Right. So it can be an end-to-end thing.
Even the supply chain agent that I talked about, which, you know, one of our customers
has implemented, is literally, you know, looking at the inventory in their warehouses and
automatically placing orders. And so there are multiple tasks that this entire workflow is doing,
right, from planning to making reservations or making orders, and then execution of those
orders, right, or tracking of those orders. These can be implemented as individual
agents that talk to each other. And over time, the planning agent's capabilities
can just grow. It can not only help me book hotels and flights, but also restaurants and
excursions. And for that, it has to talk to other agents that specialize in
making reservations for those things. So now in that world of multi-agent existence, there are three
things that need to be really solved for. First is, again, trust. When an agent is talking to another
agent, how can it trust that other agent? What kind of observability can we get from
the other agent? Like, Jordan, right now I can see you,
so I know it's you.
But if it was some bot that I was talking to,
I wouldn't trust it
as much as I can trust you,
because I can see you. I'm getting that observability
feedback that it's you.
Similarly, you know, there are also
other challenges beyond trust.
The second is authentication.
Like, how do I know that
an agent handing
off the task to another agent
can authenticate me as an end user?
And the third thing is
communication. These three are kind of the top priorities for most of us in the
agentic space. Communication is being solved by a bunch of things. You know, you'll see there are
MCP-based patterns that are emerging. Google's announced the A2A protocol. You know, Galileo has
been part of even founding this AGNTCY organization, which is, you know, a truly open
organization that is making multi-agentic systems easier to solve for. And
you'll see that overall the world will kind of converge on one protocol where
any heterogeneous agent, built in any system, built using any LLM, is able to talk to
each other.
So, Yash, we've covered a lot in today's episode, but as we wrap up, what is the one
most important takeaway that our audience needs to know when it comes to building reliable agents,
but for specifically mission critical tasks.
Yeah, thanks. I think the one thing, as we head towards a multi-agentic world, is:
let's get our single agents to be more reliable now.
You know, it's very important.
Even if you're not launching them in production, there is a whole stack, there is a whole
playbook being created on how to build, how to launch, like, just the CI/CD aspect of things.
How do we make them reliable in production?
There's prevention.
There is mitigation.
How can we prevent bad outcomes from happening?
How can we mitigate them?
Because, again, good evaluations, followed by good,
reliable preventions and mitigations, are going to be absolutely critical
to be successful in the world of non-deterministic software.
All right.
Some important information, not just for the future,
but things you need to start doing today
to prepare for the multi-agentic future of tomorrow.
Yash, thank you so much for your time
and for joining the Everyday AI show.
We really appreciate it.
Pleasure being here, Jordan. Thank you.
All right.
Now, as a reminder, y'all, that was a lot.
If you missed anything, don't worry.
We're going to be sharing it in our newsletter,
some resources that Yash talked about,
the AI agent leaderboard.
It's all going to be in there.
So if you haven't already, please sign up for our free daily newsletter at YourEverydayAI.com.
So thank you for tuning in.
Please join us tomorrow and every day for more everyday AI.
Thanks, y'all.
And that's a wrap for today's edition of Everyday AI.
Thanks for joining us.
If you enjoyed this episode, please subscribe and leave us a rating.
It helps keep us going.
For a little more AI magic, visit YourEverydayAI.com and sign up to our daily newsletter
so you don't get left behind.
Go break some barriers and we'll see you next time.
Thank you.
