Everyday AI Podcast – An AI and ChatGPT Podcast - EP 628: What’s the best LLM for your team? 7 Steps to evaluate and create ROI for AI
Episode Date: October 9, 2025
How can you measure ROI on GenAI for your team? 🤔
Internal evaluations and intentionality. We've helped thousands of orgs put LLMs to work and ACTUALLY save time. On today's show, we're dishing the 7 steps you need to follow.
What’s the best LLM for your team? 7 Steps to evaluate and create ROI for AI -- An Everyday AI chat with Jordan Wilson
Newsletter: Sign up for our free daily newsletter
More on this Episode: Episode Page
Join the discussion on LinkedIn: Thoughts on this? Join the convo on LinkedIn and connect with other AI leaders.
Upcoming Episodes: Check out the upcoming Everyday AI Livestream lineup
Website: YourEverydayAI.com
Email The Show: info@youreverydayai.com
Connect with Jordan on LinkedIn
Topics Covered in This Episode:
Choosing the Right Large Language Model
Evaluating LLMs for Business ROI
Front-End AI Operating Systems Explained
Common Traps in AI Model Evaluation
Public Benchmarks for LLM Evaluation
Seven-Step LLM Evaluation Framework
Measuring Pre-GenAI Human Baselines
Building Realistic AI Test Datasets
Calculating ROI for GenAI Implementation
Monthly Retesting and AI Model Updates
Timestamps:
00:00 Choosing the Right AI Model
07:02 Adapting Workflows for AI Integration
10:58 "Gemini's Versatile Modes Overview"
14:30 Avoiding AI Shiny Object Syndrome
15:36 AI Evaluation for Reliability and Improvement
20:36 "Data Testing Guide Essentials"
25:15 Realistic and Messy Data Essentials
26:06 "Building Effective AI Workspaces"
31:08 AI Evaluation and ROI Calculation
34:11 Human Oversight in AI Testing
35:52 Evaluating GenAI Use Cases
39:00 "NotebookLM: AI-Powered Idea Organizer"
Keywords:
Large Language Model, LLM, generative AI, AI operating system, front end AI models, AI evaluation, model ROI, model evaluation steps, AI benchmarks, scientific benchmarks, API connection, enterprise AI, ChatGPT, Claude, Gemini, Copilot, team AI adoption, knowledge worker AI, operating system choice, productivity modes, connectors, deep research mode, agent mode, image generation, web search, Canvas mode, advanced voice mode, business process automation, workflow evaluation, change management, AI training
Send Everyday AI and Jordan a text message. (We can't reply back unless you leave contact info)
Transcript
This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips.
Listen daily for practical advice to boost your career, business, and everyday life.
How do you go about choosing the right large language model for your company, and the right model
for the right job? And how can you evaluate it and understand if you're actually getting an
ROI on gen AI?
This is something I've literally talked to hundreds of companies about over the past three years.
And today I think it's important now more than ever as large language models are slowly morphing under our eyes into AI operating systems.
It's important we talk about how to do this for front end AI models because I do think that's the future.
That's where work happens, as your everyday ChatGPT, Claude, Gemini, etc.
are becoming places of work and where teams go to get work done.
So today we're going to go over what's the best LLM for your team and these seven steps
to evaluate and create ROI for AI.
I'm excited for today's show.
I hope you are too.
What's going on, y'all?
Welcome to Everyday AI.
My name's Jordan Wilson.
We do this thing every single day.
It's a live stream podcast and free daily newsletter helping us all not just make sense of AI, but how
we can actually leverage it to grow our companies
and our careers.
So if you're a stressed out business leader,
trying to keep up and you're like,
how the heck can I when there's new AI advancements every day?
Starts here with the live stream podcast.
But to take it to the next level,
make sure to go to our website at YourEverydayAI.com.
Sign up for the free daily newsletter.
We're going to be recapping both today's show
and keeping you up to date with all of the latest AI news from today.
So let's get into it.
ROI is important, right?
But how do you evaluate it?
Don't get me wrong.
There's great resources out there, scientific benchmarks,
evaluation sites that look at large language models.
But here is the issue.
They're just looking at the models themselves.
They're looking mainly if you're using one of these big APIs.
But that's not how companies are using AI now.
Something I've been thinking about a lot over the last year or so.
I've been one of the first people, I think, shouting and screaming about this AI operating system thing.
And now as these AI chatbots are turning into full-fledged operating systems,
now people are understanding, oh, maybe we shouldn't just be using models on the backend via an API connection.
And maybe we should be using them in their interface because they're so much more powerful when you have access to all of these different modes and features that aren't available when you're using an API.
So now we've seen this influx of large enterprise teams moving away from using models just via the API and now bringing more of their team over, putting them on a team or an enterprise plan for ChatGPT, Gemini, Copilot, Claude, etc.
But now we're left with this problem again.
How do we evaluate this?
Well, on today's show, we're going to go over the common traps that businesses fall into when trying to evaluate AI models on the front end.
We're going to share the best publicly available large language model evals to at least get you a jump start.
And we're going to lay out in detail our seven-step plan for evaluating AI models for ROI.
Live stream audience, good to see you.
What's going on?
If you guys have any questions, good morning, Jason on YouTube.
Douglas, good to see you.
If you guys have a question, feel free to drop it.
I like when I can get in a couple of questions.
So if you have any, let's get them in.
But first of all, who is this show for?
I think this is for teams and organizations that are evaluating using front-end AI chatbots as their AI operating system of choice.
And I'm going to be talking a little bit more about this concept of an AI operating system that's finally now picking up steam, even though I've been talking about it for more than a year.
And why I think now is a more important time than ever to be talking about this concept of model evaluations on the front end and how you can create and
protect ROI. Because if your team is not properly trained and doesn't understand how large
language models work, especially on the front end with all of these new modes that are being
added all of the time, you might end up spending more time with AI than you would if you weren't
using AI, which is why I think it's important to evaluate these models and to go through
these steps. Why now? Right. Well, now I think the rest of the world has finally caught up to what I've been
saying for a very long time.
I think more and more smart organizations are starting to move their day-to-day business
processes inside the front end of a large language model.
So if you're brand new here, let me explain the difference between this backend API,
front-end operating system, right?
So I think in the earlier days, you know, in 2023 and 2024, larger enterprise
organizations essentially fine-tuned these models from Google, OpenAI,
Anthropic, et cetera, for their own use.
They built RAG pipelines,
and they essentially created versions
of these large language models
for their employees to use internally, right?
And those are a little bit easier to evaluate.
You know, it's strictly you're working with a model.
There's an API.
There's benchmarks for certain categories of work, right?
There's great benchmarks out there.
You know, these models are great for coding.
These models are great for creative writing.
These models are great for multi-step research, et cetera.
So when you're using an API and you have a little bit more of a narrow scope, it's a little
bit easier to evaluate these models and to measure ROI.
But I don't think that's how most people should be using them, right?
I like to think of an actual operating system.
In the 90s or early 2000s, you know, most companies had to make a choice on their operating
system. Are you going to be a Microsoft Windows organization? Are you going to be Mac OS? Are you going to be
Linux? Right. And then from there, you essentially built your processes around those operating systems,
right? When operating systems became, you know, popularized in the 90s, you know, work before computers
kind of had to be reworked and reorganized around what an operating system offered. And I think
that's the junction that we're at right now. Smart
organizations are understanding this, at least for, you know, your average
knowledge worker, right? Obviously, if you're a coding shop, you know, if you're a niche company
or a niche department, this is not for you. I'm talking about for the everyday knowledge
worker, right? People in marketing, HR, sales, right? Not necessarily, you know, people just doing
coding tasks because I think that's a little more cut and dry.
And now there's this whole concept, especially after OpenAI earlier this week announced a couple of things at their DevDay.
But one of the biggest ones was apps.
So bringing entire apps into the ChatGPT experience.
I'm going to do a show on that later, but essentially what that means,
and I think this is something that the other frontier labs are going to be doing, is you're going to be working with your own business context and using entire
user interfaces from other websites all within ChatGPT or Gemini or Claude, etc. So we need to start
unlearning the old way of working with AI and we need to relearn how to move our day-to-day
processes inside one of these systems because that is going to be the best and fastest way to
work. Just like, well, you probably could have figured out in the 90s
how to MacGyver a way to not use one of those three big operating systems,
but if you wanted to succeed, you had to.
I think you have to make that same choice now.
So here's the problem.
Even front end AI systems are complicated, confusing,
and change too often without notice.
Here's a good example.
So for the podcast audience,
I have something on my screen here.
If you ever want the video version,
make sure to go to our website at YourEverydayAI.com.
You can watch the video
version there, click on episodes.
Right now, this whole ChatGPT thing going to GPT-5 was supposed to make things easier.
Right.
It didn't.
It made things more complicated.
And this is why I think it's important to consider that maybe our company shouldn't just be using an API, and maybe we should be moving our processes inside of one of these team modes,
right? So obviously with all the big AI models, you can have a team account or an enterprise account,
bring thousands of users, you know, you can share projects, share GPTs, share chats, right,
through most of the, you know, front-end AI chatbots. But it's confusing. So even if you
look at ChatGPT, on my ChatGPT screen, let me count them here. I have one, two, three, four, five,
six, seven, eight, nine. I have ten different models that I can choose
from. Okay. That's not easy. Right. You have all the different variations of GPT-5: auto, instant,
thinking mini, thinking, and pro. And then you have your legacy models, GPT-4o, GPT-4.1,
GPT-4.5, o3, o4-mini. Confusing. And then even in the thinking mode, you have, I think, four or five
different layers of thinking. So I have more than a dozen. And many teams have more than a dozen
models they can choose from just in ChatGPT.
All right.
It's not as bad in Gemini, right?
There's essentially two.
You know, Claude is a little, you know, in between.
I think there's about six, six different models.
But then you talk about modes.
And this is where you don't get this if you're just working on the back end via API.
And these modes are where the magic happens.
So using modes like connectors, you know, and all the major players
have their version of connectors.
This brings in your dynamic data,
essentially creating a mini version of retrieval-augmented generation,
or RAG pipelines, with your company's dynamic data.
There's the deep research mode, agent mode, right?
I'm going over the modes in ChatGPT,
image gen mode, web search, Canvas, which is so underrated,
study and learn, right?
Advanced voice mode.
So going back, right,
think of those
of you that, you know, worked in the 90s and the early internet days. Imagine if you had
a computer with no programs, right? No office programs, no terminal, no nothing. That's
kind of what it feels like if your company is only using something via an API. And this is why I think
the smartest teams have already made the jump to doing a ChatGPT Business plan, a ChatGPT
Enterprise plan, Google Gemini team plan, Claude Enterprise plan,
because you need to take advantage of these modes.
These modes, right, difference between models and modes, are where work happens.
So why then, if this is where work needs to happen, and we know AI is so good and so powerful
and we see all these studies, oh, it's so much faster, more efficient, better than humans.
Well, why do most companies never find an ROI?
Let me give you the reasons.
The pilots are always too long.
You can't do a year-long pilot in AI.
You're failing before you start.
Change management and training are basically non-existent in most, even large organizations.
I'm constantly shocked at the lack of investment in change management, as well as training.
Companies aren't training their people.
It's weird.
Also, companies aren't properly measuring pre-gen AI human input effort.
How long did these projects take before there was AI, right?
We don't have baselines.
We don't have, you know, if you want to talk about ROI, what is it, right?
You have to calculate the time that it takes for your people to do a project.
You have to calculate any hard costs, right?
Software, expenditures, whatever.
But people didn't measure this before AI.
So there's no baseline to compare it to.
Also, I think sometimes one lucky run is celebrated when it comes to AI.
But variance and reliability are never measured.
Generative AI is generative, right?
So a lot of times companies find like one use case and then they just roll that very small
use case out to everyone without properly testing it, right?
Generative AI can be kind of like a roll of the dice sometimes.
So you have to keep that in mind.
And then last but definitely not least,
one of the most common traps of companies not finding ROI on gen AI
is shiny object AI syndrome, right?
Every single week, and multiple times a week, there are shiny distractions, right?
I just talked about OpenAI's big announcements.
We're going to have big announcements this week.
We've already had some big announcements from Google.
I'm sure we're going to see more next week, right?
I do this every day.
I've been doing this every day for three years.
There's a shiny object every single week multiple times.
And companies get too easily distracted by that.
So let's talk about evaluations, right?
If we're going to go over these seven steps to evaluate,
what are evals, right?
So AI evals are essentially structured quality checks for AI
systems. They're important, especially for generative AI, because you can run the same input
and get wildly different outputs. That's how generative AI works, which is why evaluations are
extremely important. So using these ensures reliability: evals verify the AI performs its task
consistently and correctly. You're able to catch risk: evals identify issues like bias and errors
before the AI is used by the rest of your company. And also this helps guide
improvements, right? To get the most out of gen AI, you have to constantly be iterating and kind of
going in a cycle of improvement and feedback. And that's what evals help do. So they provide data
driven insights to help your team fix and refine the AI that you're using. And then there are
some great publicly available eval sites to get started, right? Because yes, I mentioned
ChatGPT, and they currently have like 12 different models, right? Claude, when you're using it on the front end,
and I'm talking front end, chatgpt.com, claude.ai, gemini.google.com, there are multiple
models to choose from. But there are good places to start looking at evals, okay?
Yes, there are scientific benchmarks, but there are other great resources to at least get started, right?
because I'm not saying that one model or one mode is going to be best for every single project,
task, or deliverable for your company.
You might have to use multiple, right?
You might have some people, you know, you might have 100 employees on a ChatGPT team plan,
and you might have, you know, 200 on a Claude Enterprise plan.
So where do you get started?
Well, most of these publicly available AI eval sites look at different scientific benchmarks as well as user benchmarks.
And then depending, you know, like LM Arena is a good example.
We talk about this a lot on the show, but with LM Arena, you can go on there.
You essentially have a battle.
Okay, you put in one prompt.
You get two different responses from a large language model.
You choose which one is better.
You don't see which model it is.
It's a blind LLM taste test.
And then after millions of votes, obviously, you start to see which models are best.
And then also they classify what this was about.
Was this a creative writing prompt?
Was it a factual test?
Was it a coding task, right?
Was it math?
Was it science?
So not only do you get scores, these are called Elo scores,
but you're also able to classify them across different arenas, right?
So if you're looking for the right, if you're a web development team, you can go look at that.
If you're a creative writing department, you can go look at that.
So different evals classified across different categories.
LM Arena, I think, is one of the best public eval sites.
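For anyone curious how blind head-to-head votes turn into a score, here is a minimal sketch of the classic Elo update. It is an illustration only: the starting rating of 1000, the K-factor of 32, and the model names are assumptions, and a public leaderboard may well use a different statistical method to compute its published numbers.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one blind battle."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    # B's change is the mirror image of A's, so total rating is conserved.
    new_b = rating_b - k * (score_a - exp_a)
    return new_a, new_b


# Hypothetical models and votes; each entry is the winner of one battle.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = ["model_a", "model_a", "model_b"]
for winner in votes:
    ratings["model_a"], ratings["model_b"] = update_elo(
        ratings["model_a"], ratings["model_b"], a_won=(winner == "model_a")
    )
```

An upset (a low-rated model beating a high-rated one) moves the ratings more than an expected win, which is why millions of votes are needed before the ranking settles.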
Next, LiveBench.
Similarly, they look across, you know, reasoning, coding, agentic coding,
mathematics, data analysis, all these different things.
Then they look at all the different variations of the models as well.
Another great one is Epoch AI.
They have their AI benchmarking across different categories as well.
And then last but not least, this is a newer one.
I like this from Scale.
Their SEAL LLM leaderboards.
So a lot of these look at different aspects, right?
Because your team, especially if you're a larger organization, like I said, you might end up using multiple of the big platforms, right?
And I don't think that's a bad thing necessarily.
Although it's always best to bring everyone under one roof, sometimes, right, especially with, let's say, something like coding, right?
Claude is very popular for software development teams.
So maybe you have a team running on that and the rest of your team is on ChatGPT or Gemini.
All right.
We're going to get into the seven steps.
I had to lay the groundwork first.
But before we get into them, very quick word from our sponsors.
This podcast is supported by Google.
Hey, folks, Steven Johnson here, co-founder of NotebookLM.
As an author, I've always been obsessed with how software could help organize ideas and make connections.
So we built NotebookLM as an AI-first tool for anyone trying to make sense of complex information.
Upload your documents and NotebookLM instantly becomes your personal expert, uncovering insights and helping you brainstorm.
Try it at notebooklm.com.
All right.
I had to go through the prerequisites of laying the groundwork first.
I couldn't just jump in and give you the seven steps.
But now I think we're ready to go through them.
So before we get started, just a few precursor things to remember.
All right.
You have to plan your evaluation sprint first.
You need to get written buy in from, you know, executive sponsors, whoever that may be,
end users, IT security, all that, right?
So here I am crossing my T's, dotting my I's, right?
You need to get the clearance.
If you have to go take this through legal, right?
What data can we upload?
What can we not?
Right.
This is all my, like, I've got to get my fine print out of the way because you can't
just follow exactly what I say.
You've got to get the
fine print. All right. Also first, like I said, use public benchmarks to start to see which models
you should start to evaluate. And then you need to plan a two to four week sprint testing just
one workflow. Freeze your model choice and ignore all shiny new objects. All right. Here we go. Step one.
Define your success criteria before you start testing. You need to literally, like, write a job
description for this pilot, right, for this test. You need to
explicitly identify the specific outcome, the constraints, the tools that
are allowed, and then any do-not-do actions. You need to set clear ground rules because, again,
these models and modes are always changing. So, a good example, right,
ChatGPT just launched their apps. If you were midway through and you're like, oh yeah,
we could use this app for that, no. You said, this is what we're doing. We're not doing anything
else. Okay. You've got to set your dos and your don'ts. Then you need to create a rubric, a grading
rubric, you know, one to five, one to ten, whatever works best for you, defining exactly what
makes each eval or test pass or fail. All right. So essentially, what you're going to be doing
is you're going to be identifying use cases, right? It could be, you know, parsing certain information
from long PDFs and turning it into a table and a
chart, right? That's an easy example. That's a use case that probably a lot of companies
might do, right? There's a big industry white paper that comes out every month and your team has
historically gone through it manually pulled out certain insights and put those insights into a chart
or a table, right? You need something measurable, so that you can define the success criteria before
testing, something that has a very finite beginning and end that you can measure, right? And then you need
to choose, for each of these different use cases, three to five simple KPIs that the workflow
impacts. Number one, obviously, is human time spent, then accuracy, revisions required, value created,
et cetera, right? Because there are some things that might be gray area. Try not to,
with your success criteria and your KPIs, choose anything that's too gray. Try to
choose as many black and white things as you can.
Right.
And there's different ways you can grade it.
And I'll talk a little bit more about that later.
But that's step one, define your success criteria before you start testing.
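One way to make that "job description" concrete is to write it down as data before any testing starts. The sketch below is only an illustration: the `PilotSpec` class, its field names, the example values, and the pass threshold of 4 on a one-to-five rubric are all assumptions, not something prescribed on the show.

```python
from dataclasses import dataclass


@dataclass
class PilotSpec:
    """Frozen ground rules for one workflow in a two-to-four-week sprint."""
    outcome: str
    constraints: list[str]
    allowed_tools: list[str]
    do_not_do: list[str]
    kpis: list[str]
    pass_threshold: int = 4  # assumed cutoff on a 1-5 grading rubric

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the spec is usable."""
        problems = []
        if not 3 <= len(self.kpis) <= 5:
            problems.append("choose three to five KPIs")
        if not self.do_not_do:
            problems.append("list at least one do-not-do action")
        return problems

    def passes(self, rubric_score: int) -> bool:
        return rubric_score >= self.pass_threshold


# Hypothetical example based on the white paper use case.
spec = PilotSpec(
    outcome="Monthly white paper turned into an insights table and a chart",
    constraints=["finish within one working day"],
    allowed_tools=["file upload", "canvas"],
    do_not_do=["use features launched after the sprint started"],
    kpis=["human time spent", "accuracy", "revisions required"],
)
```

Writing the do-not-do list into the spec is what lets you say no when a shiny new feature shows up mid-sprint.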
Step two, measure your human baseline first.
I don't understand why no one's doing this.
This is mind boggling to me.
Like literally you have people out there running big organizations, running
AI pilots.
And, you know, they'll tell me about it.
They'll be like, okay, you know, we ended up doing this project.
you know, here's, here's how long it took, you know, et cetera.
And then I'm like, okay, what was it pre-gen AI?
They're like, oh, well, you know, we didn't benchmark it.
It's like, okay, benchmark it now.
Do that exact same thing without any AI, right?
I've been talking about this for years.
It's very simple.
You need to time multiple employees.
It depends on how big your team is, right?
If it's a smaller team, you know, if you're running this through a champion team,
a lot of times, you know, people identify their champion teams, and it's like eight to
people. So it depends on the size of your team, but you need to have multiple employees run
through the entire process. And a lot of companies aren't willing to do that. Y'all, you need to
invest in this in the same way that you need to invest in training and education, learning and development around
AI. You need to sometimes have multiple employees do this. And I think so many businesses are
hesitant because they look at this as waste. They're like, okay, this is a 40-hour
project, right? Why would I have three to five employees do this same project if it's
just for an internal use case?
Okay, well, you see all these studies,
about 60 to 70% time savings when using large language models.
Well, do you want that?
Well, if so, you've got to get the pre-gen AI human benchmark.
You need to calculate the average time, error rate, rework minutes,
cost per completed task.
You need to measure the human input and output.
This is your baseline.
And this will ultimately prove whether the AI actually saves time or not.
Without it, you're just guessing.
Stop guessing.
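The baseline math is simple enough to sketch. In the example below, the `HumanRun` record, the field names, the timings, and the $60 hourly rate are all made up for illustration; the only part taken from the steps above is which four numbers to compute.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class HumanRun:
    """One employee completing the whole task with no AI."""
    minutes: float
    errors: int
    items_checked: int
    rework_minutes: float


def human_baseline(runs: list[HumanRun], hourly_rate: float) -> dict:
    avg_minutes = mean(r.minutes for r in runs)
    avg_rework = mean(r.rework_minutes for r in runs)
    error_rate = sum(r.errors for r in runs) / sum(r.items_checked for r in runs)
    # Cost per completed task includes the time spent fixing mistakes.
    cost_per_task = (avg_minutes + avg_rework) / 60 * hourly_rate
    return {
        "avg_minutes": avg_minutes,
        "avg_rework_minutes": avg_rework,
        "error_rate": error_rate,
        "cost_per_task": cost_per_task,
    }


# Hypothetical timings from three employees.
runs = [
    HumanRun(minutes=120, errors=2, items_checked=50, rework_minutes=30),
    HumanRun(minutes=90, errors=1, items_checked=50, rework_minutes=15),
    HumanRun(minutes=150, errors=3, items_checked=50, rework_minutes=45),
]
baseline = human_baseline(runs, hourly_rate=60.0)
```

With these made-up numbers the baseline works out to 120 minutes, a 4% error rate, and $150 per completed task, which is the figure the AI run has to beat.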
Step three, you need to build a realistic and controllable test data set.
All right.
So whether this is synthetic data related to your company, whether it's
publicly available data, you know, in your industry, whether it's data about competitors,
whatever it is.
You need a finite and realistic data set.
You need to gather 20 to 40 actual work examples.
You need to have messiness, right?
Don't start with something that's too clean, too structured, too organized, right?
Both for the humans and the AI.
You need to really have a real data set.
Also, I would add six drift cases, you know, renamed files, dead links.
You need to put actual traps in there, both for humans and for AI.
So again, you know, you're not going to have the team that's ultimately testing this on the human side
building this use case.
You need to have other people, right?
But you need to build traps,
because especially when we talk about agentic models,
you can't just hold their hand, right?
You have to hope that if they encounter a problem,
they will adapt and overcome it,
just like hopefully a human would.
And then you also need to create a pass-fail checklist
for each case tied directly to your rubric.
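A quick sanity check on the test set can catch gaps before the sprint begins. This is a sketch under assumptions: the `EvalCase` structure and the file names are hypothetical, and it treats the six drift cases as being on top of the 20 to 40 regular examples.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    case_id: str
    source_file: str
    is_drift: bool = False  # a trap: renamed file, dead link, etc.
    checklist: list[str] = field(default_factory=list)  # pass/fail items


def check_dataset(cases: list[EvalCase]) -> list[str]:
    """Return a list of problems; an empty list means the set is ready."""
    problems = []
    regular = [c for c in cases if not c.is_drift]
    drift = [c for c in cases if c.is_drift]
    if not 20 <= len(regular) <= 40:
        problems.append(f"need 20 to 40 real examples, found {len(regular)}")
    if len(drift) < 6:
        problems.append(f"need at least 6 drift cases, found {len(drift)}")
    missing = [c.case_id for c in cases if not c.checklist]
    if missing:
        problems.append(f"cases without a pass/fail checklist: {missing}")
    return problems


# Hypothetical set: 20 messy real examples plus 6 traps.
cases = [
    EvalCase(f"case-{i:02d}", f"whitepaper_{i}.pdf", checklist=["table matches source"])
    for i in range(1, 21)
] + [
    EvalCase(f"drift-{i}", f"renamed_{i}.pdf", is_drift=True,
             checklist=["flags the problem instead of guessing"])
    for i in range(1, 7)
]
problems = check_dataset(cases)
```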
Step four.
You need to configure your workspace like production.
Okay, so you need to set up properly.
You need to set up the shared workspace matching the real permissions for the right tools
and the models that users will ultimately have.
So what do I mean by that?
I don't know why.
So many organizations, they're like, I'm not going to pay for AI.
I'm going to use the free one.
Stop.
Right.
You need to be using the exact same mode or model that you want to be using
once this goes live with the rest of your team.
Right?
Sometimes that might mean paying for the super expensive $200 a month plan, right?
Because there are certain features that are only available on those plans.
So it might mean that.
All right.
Also, you need to document the configuration, which model, which tools can be used, which connectors, you know, who has what access level, make sure that the people on the team have access to the same documents, right?
Then you need to verify during the test that the AI uses the expected tools, not workarounds or
guessing. This is the other thing. So many people don't know prompt engineering basics. So many people
don't. All right. This is probably where, even if you follow this, steps one through seven,
this is probably where you're going to fail. Most people have no clue how to prompt an AI. They
don't. And well, how do I know this? Well, people don't even know how to select the right model,
right? OpenAI CEO Sam Altman recently said around the GPT-5 launch that only, I think, seven percent
of paid users used thinking models, which is an absolutely asinine thing to think about, right?
Why would anyone choose models that are worse? This just shows most humans have no clue what they're
doing when they're using large language models. You need to use the right model, the right
mode, and the right prompting techniques.
You've got to know the basics.
All right.
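Documenting the configuration and checking it against what actually happened in a test chat could look something like this. Everything here is a hypothetical sketch: the `WorkspaceConfig` fields, the plan and connector names, and the idea that a reviewer notes down the tools used in each run by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkspaceConfig:
    """Written record of the setup the test must match in production."""
    plan: str
    model: str
    modes: frozenset[str]
    connectors: frozenset[str]


def unexpected_tools(config: WorkspaceConfig, tools_used: list[str]) -> list[str]:
    """Tools seen in a run that the documented configuration does not allow."""
    allowed = config.modes | config.connectors
    return sorted(set(tools_used) - allowed)


# Hypothetical configuration for one pilot.
config = WorkspaceConfig(
    plan="enterprise",
    model="GPT-5 Thinking",
    modes=frozenset({"canvas", "web search"}),
    connectors=frozenset({"shared drive"}),
)

# Tools a reviewer logged while reading one test chat.
flagged = unexpected_tools(config, ["web search", "code sandbox"])
```

Any non-empty result means the run used a workaround, so it should not count toward the score.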
Step five.
Generative AI is generative.
You don't do one-offs.
You run this multiple times.
I'd say you need to run this trial at least three times,
both with your humans and on the AI side.
And demand proof. You need to repeat each test case three times in separate chats, right?
So, depending on what, you know,
front-end AI chatbot you're using, you're probably going to want to turn off memory.
If you're using ChatGPT as an example, and all the front-end AI large language models have
something like this, you're going to want to turn off memory. You're going to want to turn off
past chat history, you know, to make sure that it's not pulling from other things.
You also need to require working citations, file paths, or artifacts for every accepted answer,
no exceptions. And then you need to calculate the reliability score, right? So when you are
creating this rubric on the front end, you need to know how it's going to be scored on the back end
because again, y'all, generative AI is generative. You need to be very detailed in how you're
going to score this thing. And then you need to run it through multiple times with humans and with
AI systems because sometimes generative AI might get it right. Sometimes it might get it wrong.
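There is no single standard reliability score, so the sketch below shows one possible definition: the overall pass rate across all runs, plus the share of cases that passed every single run. The function name, the case IDs, and the results are assumptions for illustration.

```python
def reliability(results: dict[str, list[bool]], runs_required: int = 3) -> dict:
    """Score repeated runs; each case maps to its pass/fail outcomes."""
    for case_id, outcomes in results.items():
        if len(outcomes) < runs_required:
            raise ValueError(f"{case_id} has only {len(outcomes)} runs")
    total_runs = sum(len(o) for o in results.values())
    passed_runs = sum(sum(o) for o in results.values())
    always_passed = sum(1 for o in results.values() if all(o))
    return {
        "pass_rate": passed_runs / total_runs,
        "fully_reliable_share": always_passed / len(results),
    }


# Hypothetical outcomes: three separate chats per case, memory turned off.
results = {
    "case-01": [True, True, True],
    "case-02": [True, False, True],
}
score = reliability(results)
```

The second number is the one that exposes the lucky run: case-02 passes most of the time, but it is not something you can roll out to everyone yet.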
Step six, you need to calculate the real ROI with objective grading, right?
depending on the resources you have internally,
I would have the grading be from someone that doesn't know which ones are human
and which ones are AI.
So this might also require you having a certain output to where human graders wouldn't
be able to tell which one came from a human and which one came from an AI.
So what that means, right,
until the models get perfect,
is you might have one person or a small team go through
and make sure that the formatting and the layout are the same
between the AI-created outcomes and the human-created outcomes,
just to make sure you remove biases from whoever is ultimately grading.
So a simple example, right?
A lot of times large language models will cite things in line.
So you don't want to just copy and paste that
over because someone's going to know, oh, this is from an LLM.
And then, depending on their objective and their agenda, they might grade it accordingly, either better or worse.
So you need to have that kind of blind taste test, right?
Whoever is testing or grading on the rubric can't know what came from a human and what came from a large language model.
So same thing, right?
Maybe it's a page minimum, page maximum, whatever it is, you need to make sure that it is consistent
and that human graders won't know the difference.
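The blind part can be as simple as shuffling the submissions and swapping the source for a neutral label, with the answer key kept away from the graders. This is a minimal sketch; the function name and label format are made up, and it assumes the formatting cleanup has already been done by hand.

```python
import random


def blind_submissions(outputs: list[tuple[str, str]], seed=None):
    """Shuffle (source, text) pairs and hide the source behind a label.

    Returns the view the graders see and a separate answer key.
    """
    rng = random.Random(seed)
    shuffled = outputs[:]  # copy so the original order is untouched
    rng.shuffle(shuffled)
    graders_view = []
    answer_key = {}
    for i, (source, text) in enumerate(shuffled, start=1):
        label = f"submission-{i:03d}"
        graders_view.append((label, text))
        answer_key[label] = source  # keep this away from the graders
    return graders_view, answer_key


# Hypothetical outputs, already normalized to the same layout.
outputs = [("human", "Report A"), ("ai", "Report B"), ("human", "Report C")]
graders_view, answer_key = blind_submissions(outputs, seed=7)
```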
So then ultimately, you need to use the automated checks for citations and accuracy,
and the humans can grade for tone and quality.
You need to convert time savings to dollars using a fully loaded rate and then subtract the
subscription costs, right, whatever you're paying for these AI systems for net ROI, right?
So do this across the course of a month because that's, you know, you're paying for these,
you're paying for these AI systems monthly.
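Here is the net ROI arithmetic as a short sketch. Every number in the example is invented, and the function and parameter names are assumptions; the structure (time saved, converted at a fully loaded rate, minus the monthly subscription) is what the step describes.

```python
def net_monthly_roi(
    baseline_hours_per_task: float,
    ai_hours_per_task: float,  # should include human review and rework time
    tasks_per_month: int,
    loaded_hourly_rate: float,
    seats: int,
    seat_cost_per_month: float,
) -> dict:
    hours_saved = (baseline_hours_per_task - ai_hours_per_task) * tasks_per_month
    gross_savings = hours_saved * loaded_hourly_rate
    subscription_cost = seats * seat_cost_per_month
    net_savings = gross_savings - subscription_cost
    return {
        "hours_saved": hours_saved,
        "gross_savings": gross_savings,
        "subscription_cost": subscription_cost,
        "net_savings": net_savings,
        "roi_multiple": net_savings / subscription_cost,
    }


# Hypothetical month: 8 tasks, 10 hours each before AI, 4 hours each with AI.
roi = net_monthly_roi(
    baseline_hours_per_task=10,
    ai_hours_per_task=4,
    tasks_per_month=8,
    loaded_hourly_rate=75.0,
    seats=5,
    seat_cost_per_month=30.0,
)
```

With these made-up inputs, 48 hours saved at $75 is $3,600 gross, and after $150 in seats the net is $3,450. If the AI hours come out higher than the baseline, the same function returns a negative number, which is exactly the case worth catching.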
And then you need to report at least seven
factors. And this is going to vary depending on, you know, your industry, your department,
your type of work, et cetera. But I think these seven factors are pretty good
to keep an eye on. So cost, latency, accuracy, stability, safety, integration, and compliance,
right? Because you don't know, like, you don't want an AI going off the rails and doing something
against company policies in the same way you wouldn't want a human doing something, right?
Safety, same thing. Accuracy, same thing. All right.
Stability: will an AI system do great the first three runs and then go off the rails the next three?
You need to be measuring these things and reporting on them as well.
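One way to report the seven factors, as a rough sketch. The field names, the shape of each run record, and the choice to measure stability as the spread of accuracy across repeated runs are all assumptions for illustration, not a standard.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class FactorReport:
    cost_usd: float             # monthly spend on the system
    latency_s: float            # average seconds per run
    accuracy: float             # average graded accuracy, 0 to 1
    stability: float            # spread of accuracy across runs (lower is steadier)
    safety_violations: int      # runs flagged as unsafe
    integration_ok: bool        # does it work with the tools the team already uses
    compliance_violations: int  # runs that broke company policy


def summarize_runs(runs, monthly_cost_usd, integration_ok):
    """Roll repeated runs of the same task into one seven-factor report.

    Each run is a dict with hypothetical keys: "accuracy", "latency_s",
    "safety_violation", and "compliance_violation".
    """
    scores = [r["accuracy"] for r in runs]
    return FactorReport(
        cost_usd=monthly_cost_usd,
        latency_s=mean([r["latency_s"] for r in runs]),
        accuracy=mean(scores),
        stability=pstdev(scores),
        safety_violations=sum(1 for r in runs if r["safety_violation"]),
        integration_ok=integration_ok,
        compliance_violations=sum(1 for r in runs if r["compliance_violation"]),
    )
```

A system that does great on three runs and goes off the rails on the next three would show a decent average accuracy but a large `stability` number, which is why both get reported.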
All right.
And then step seven, you need to retest this monthly to track changes.
All right.
Here's something most people don't know.
Let me just pull this out as an example.
Let's say for whatever reason, you're using GPT-5 Thinking as your model of choice
and then you're using canvas mode.
Okay?
Canvas mode gets updated often.
Most people, unless you're a dork like me, don't know this.
GPT-5, it's gotten updated, or sorry, GPT-5 Thinking has gotten updated multiple times since it came out in August.
Most people don't know, right?
There's actually a thinking slider now, right?
GPT-5 Auto just got updated this week and you probably didn't know it.
Right?
That's why you have to retest monthly
to track changes.
You don't necessarily have to retest the human input output, right?
Because that, in theory, is not going to change.
Or maybe you need to, you know, do that at least quarterly or yearly because, you know,
it depends on, again, it depends on your use case.
But at least on the AI side, you don't just do this once, set it and forget it.
Because it could get marginally better.
It could get increasingly worse.
Again, it depends.
It depends on what these big AI companies are doing under the hood.
All right, you need to retest though immediately anytime a provider ships model updates.
So a lot of times they're doing these under the hood and you might not know unless you're reading the change log,
but you probably should be reading the change log.
And if there is a noted model change, you need to rerun the test immediately.
Not doing so is negligence, right?
AI changes all the time.
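The retest rule can be written down as a small check. This is a sketch under assumptions: the function name and the version strings are hypothetical, and it assumes you record the model version yourself from the provider's change log each time you run the test.

```python
from datetime import date


def needs_retest(last_test, today, tested_version, current_version, max_age_days=30):
    """Decide whether the evaluation should be rerun.

    Retest immediately if the provider shipped a model update,
    otherwise retest once roughly a month has passed.
    Returns (should_retest, reason).
    """
    if tested_version != current_version:
        return True, "model version changed"
    if (today - last_test).days >= max_age_days:
        return True, "monthly retest due"
    return False, "up to date"
```

The version check comes first on purpose: a noted model change triggers a rerun even if the last test was yesterday.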
And the other thing is you still need human in the loop, right?
You still need human in the loop, both on the testing phase and ultimately,
once you deploy this, and once you get to the point where, yes, we've measured ROI in step six,
right, we've taken this to production.
It doesn't end there.
You still need your human in the loop or expertise driven loop, like I like to say.
And then you need to retest monthly to track the changes.
And also, you need to track your trends versus a three month average,
and you need to investigate if the accuracy or savings drop significantly.
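Tracking against a three-month average can be sketched as below. The 10 percent threshold is an assumption, since what counts as a significant drop depends on your use case, and the function name is made up for illustration. It works the same for accuracy or for dollar savings.

```python
from statistics import mean


def check_trend(history, current, max_drop=0.10):
    """Compare this month's value to the average of the previous three months.

    history: earlier monthly values, oldest first (accuracy or savings)
    current: this month's value
    max_drop: relative drop that triggers an investigation (assumed 10%)
    """
    if len(history) < 3:
        raise ValueError("need at least three prior months")
    baseline = mean(history[-3:])
    change = (current - baseline) / baseline
    return {
        "baseline": baseline,
        "change": change,
        "investigate": change <= -max_drop,
    }
```

A small dip stays under the threshold and passes, while a drop from roughly 0.91 average accuracy to 0.80 gets flagged for investigation.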
Like I said, what I just took you through, this seven-step plan, this is for one type of project.
Again, let's go back to my easy, simple example, right?
You have a team or one individual that spends 40 hours a month going through this large industry white paper, pulling out, you know, synthesizing,
personalizing information for your company and turning it into a little bit of a document,
right? Going through this 100-page report, here's everything that applies to us, here's
some spreadsheets, here's some graphs, et cetera, right? That's a simple example. A large language
model can obviously do that. But what's the three-month trend line look like? Is it getting
better over time? Is it getting worse over time? Because if it's slowly getting marginally worse
month over month, then you probably need to revisit the model, right? Going back to looking at the
publicly available model evals.
So you need to also go through this, depending on your use case or potential use cases,
you might need to run this through multiple models.
And sometimes you might want to do this in tandem.
You might want to be testing, you know, Gemini 2.5 Pro alongside, you know, GPT-5 Thinking as an example, right?
And you need to be constantly reevaluating.
One thing people are almost confused about, at least those companies that have already seen
and measured their ROI on gen AI, right?
They're like, okay, well, what do we do now?
Right?
You have two choices.
You can cut people.
Well, technically three.
But one is you can reduce head count and you do that either by laying people off or you
stop hiring and let churn and attrition do its thing, and the latter, I think, is
what a lot of bigger companies are doing.
They just stopped hiring.
And then they're just, you know, eating up these efficiency gains with AI.
Or you can do something else, right?
You can say, hey, we have huge time savings from AI.
So let's start investing in new lines of revenue.
Well, how do you do that?
This, right?
Every single medium-sized business and up.
Not all small businesses can do this.
But if you have 100 employees or more, which is the majority of everyone listening to this,
this process that I outlined, this takes humans.
You need to have AI testing and deployment
teams. Even if you're a medium-sized organization, right? I'm not talking, oh, this is only for
Fortune 1000 companies. No, you need a team of me's. That's what you need. You need a team of
people who are constantly paying attention to AI updates, constantly evaluating different models
and keeping a very close eye on these use cases. That's how you get successful AI implementation.
You have to invest in the people that are pushing you forward.
All right.
I hope this was helpful, y'all. As always, I put together a little guide. All right.
So if something in this struck you and you're like, yeah, my team needs to hear this,
but we need a little bit more than this podcast, I obviously put together an extended guide.
So if you're listening on LinkedIn, go ahead and repost this show. I'll send it over to you.
It's hot, fresh like Little Caesars pizza. Ready to go.
But if you're listening on the podcast, always
check the show notes. Okay. In the show notes, there's a link to this LinkedIn post, right?
Each of our live stream podcasts goes out on LinkedIn. So if you want access to the guide,
go find in your show notes on Spotify or Apple Music, whatever you're listening to,
go click that and then go repost this and I'll send that over to you. So I hope today's show was
helpful. If so, make sure you go to YourEverydayAI.com. Sign up for the free daily newsletter.
We're going to be recapping the highlights and main points from today's show and keeping you
the smartest person in AI at your company. So thank you for tuning in. Hope to see you back tomorrow
and every day for more everyday AI. Thanks y'all.
And that's a wrap for today's edition of Everyday AI.
Thanks for joining us.
If you enjoyed this episode, please subscribe and leave us a rating.
It helps keep us going.
For a little more AI magic, visit YourEverydayAI.com
and sign up to our daily newsletter so you don't get left behind.
Go break some barriers and we'll see you next time.
