How I AI - Evals, error analysis, and better prompts: A systematic approach to improving your AI products | Hamel Husain (ML engineer)
Episode Date: October 13, 2025

Hamel Husain, an AI consultant and educator, shares his systematic approach to improving AI product quality through error analysis, evaluation frameworks, and prompt engineering. In this episode, he demonstrates how product teams can move beyond “vibe checking” their AI systems to implement data-driven quality improvement processes that identify and fix the most common errors. Using real examples from client work with Nurture Boss (an AI assistant for property managers), Hamel walks through practical techniques that product managers can implement immediately to dramatically improve their AI products.

What you’ll learn:
1. A step-by-step error analysis framework that helps identify and categorize the most common AI failures in your product
2. How to create custom annotation systems that make reviewing AI conversations faster and more insightful
3. Why binary evaluations (pass/fail) are more useful than arbitrary quality scores for measuring AI performance
4. Techniques for validating your LLM judges to ensure they align with human quality expectations
5. A practical approach to prioritizing fixes based on frequency counting rather than intuition
6. Why looking at real user conversations (not just ideal test cases) is critical for understanding AI product failures
7. How to build a comprehensive quality system that spans from manual review to automated evaluation

Brought to you by:
GoFundMe Giving Funds: One account. Zero hassle: https://gofundme.com/howiai
Persona: Trusted identity verification for any use case: https://withpersona.com/lp/howiai

Where to find Hamel Husain:
Website: https://hamel.dev/
Twitter: https://twitter.com/HamelHusain
Course: https://maven.com/parlance-labs/evals
GitHub: https://github.com/hamelsmu

Where to find Claire Vo:
ChatPRD: https://www.chatprd.ai/
Website: https://clairevo.com/
LinkedIn: https://www.linkedin.com/in/clairevo/
X: https://x.com/clairevo

In this episode, we cover:
(00:00) Introduction to Hamel Husain
(03:05) The fundamentals: why data analysis is critical for AI products
(06:58) Understanding traces and examining real user interactions
(13:35) Error analysis: a systematic approach to finding AI failures
(17:40) Creating custom annotation systems for faster review
(22:23) The impact of this process
(25:15) Different types of evaluations
(29:30) LLM-as-a-Judge
(33:58) Improving prompts and system instructions
(38:15) Analyzing agent workflows
(40:38) Hamel’s personal AI tools and workflows
(48:02) Lightning round and final thoughts

Tools referenced:
• Claude: https://claude.ai/
• Braintrust: https://www.braintrust.dev/docs/start
• Phoenix: https://phoenix.arize.com/
• AI Studio: https://aistudio.google.com/
• ChatGPT: https://chat.openai.com/
• Gemini: https://gemini.google.com/

Other references:
• Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: https://dl.acm.org/doi/10.1145/3654777.3676450
• Nurture Boss: https://nurtureboss.io
• Rechat: https://rechat.com/
• Your AI Product Needs Evals: https://hamel.dev/blog/posts/evals/
• A Field Guide to Rapidly Improving AI Products: https://hamel.dev/blog/posts/field-guide/
• Creating a LLM-as-a-Judge That Drives Business Results: https://hamel.dev/blog/posts/llm-judge/
• Lenny’s List on Maven: https://maven.com/lenny

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email jordan@penname.co.
Transcript
What are the fundamental concepts folks need to know of getting to higher quality products?
The most important thing is looking at data.
Looking at data has always been a thing even before AI.
There's just a little bit of a twist on it for AI, but really the same thing applies.
When you see a real user input like this, you actually look at what users are prompting your AI with.
You realize it's very vague.
Absolutely.
That's the whole interesting bit.
It's like once you see that people are talking like that, you might actually want to simulate stuff that looks like that,
if that's the real distribution of the data or that's what the real world looks like.
I'm sure our listeners expect some like magical system that does this automatically.
And you're like, no, man, just spend three hours of your afternoon, go through, read some of these
chats, look at some of them with your human eyes, put one sentence notes on all of them,
and then run a quick categorization exercise and get to work.
And you see this have actual real impact on quality and reducing these errors?
Yeah, it has an immense impact on quality.
It's so powerful that some of my clients are so happy
with just this process that they're like, that's great, Hamel, we're done.
And I'm like, no, wait, we can do more.
Welcome back to How I AI.
I'm Claire Vo, product leader and AI obsessive,
here on a mission to help you build better with these new tools.
Today, I have such an educational episode for people like me that are building AI products.
We have Hamel Husain, who is going to demystify debugging errors in your AI product,
writing good evals, and show us how he runs his entire business,
using Claude and a GitHub repo.
Let's get to it.
This episode is brought to you by GoFundMe Giving Funds, the Zero-Fee DAF.
I want to tell you about a new product GoFundMe has launched called Giving Funds,
a smarter, easier way to give, especially during tax season, which is basically here.
GoFundMe Giving Fund is the DAF, or donor-advised fund,
from the world's number one giving platform, trusted by 200 million people.
It's basically your own mini foundation without the lawyers or admin costs.
You contribute money or appreciated assets, get the tax deduction right away,
potentially reduce capital gains, and then decide later where to donate from 1.4 million nonprofits.
There are zero admin or asset fees and while the money sits there, you can invest and grow it tax-free,
so you have more to give later, all from one simple hub with one clean tax receipt.
Lock in your deduction now and decide where to give later. Perfect for tax season. Join the GoFundMe
community of 200 million and start saving money on your tax bill, all while helping the causes you
care about the most. Start your giving fund today in just minutes at gofundme.com/howiai.
We'll even cover the DAF pay fees if you transfer your existing DAF over. That's
gofundme.com/howiai to start your giving fund.
Hamel, I'm really excited for this particular episode because I have been building products
for a very long time.
And this has been one of a few times in my career where the how and what of products that
I'm building are so different than what I've built in the past.
They're technically different.
They're different from a user experience perspective,
and then they have these non-deterministic models on the back end, and I'm somehow, as a product
leader, responsible for making the output high quality, consistent, reliable, interesting user experiences.
And it's such a challenging problem.
And what I love about what you're going to show us today is how to approach that systematically,
that quality of product building in an AI world, and how you use
different techniques to get AI products, which are new to all of us, from good to great.
Yeah, I'm happy to be here.
I'm excited to talk about it.
So, you know, this is such a new thing for product managers.
I'm curious if you could start with the fundamentals.
What are the fundamental concepts or things that you think folks building AI products
really need to know about the process of getting to higher quality products?
And then I know you're going to show us a couple examples of how to do that.
So the fundamentals really come down to this: the most important thing is looking at data.
And I believe, from working with many product managers in the past, that looking at data has always been a thing, even before AI.
You know, I'm pretty sure product managers can write a little bit of SQL, are okay with spreadsheets, looking at numbers, looking at metrics.
You know, that feels like it's kind of table stakes for being a good product manager nowadays.
And so there's just a little bit of a twist on it for AI, but really the same thing applies.
And it's just like, okay, how do you do that for AI? And that's what we teach. And that's what I'm
going to show you today. Great. And I cannot agree more. I think one of the most transformational skills
I learned as a young baby chicken product manager was being able to write SQL and actually do my own
data analysis and exploration. But I think the surface area is so broad now.
with AI and the data is different.
So why don't you show us what we should be looking at
when we're building these AI products?
Yeah.
So let me share my screen a bit.
Let me give you some background first.
So this is one of my clients.
The name of the company is Nurture Boss.
And as you can see, it's an AI assistant for apartment managers
or property managers.
And really, you know, you can kind of get an idea from
their website, which I'm showing right now.
You know, it's a virtual leasing assistant.
So, you know, they help with the whole top of funnel of, like, helping set up appointments,
helping prospective residents, like find their apartments, setting up appointments,
questions about rent, so on and so forth, kind of like trying to reduce the toil of property
managers, still having humans in the loop.
And so when they came to me, they had already prototyped something out, you know, kind of vibe
checking it, just like everyone does, and put everything together, but they wanted to know, like,
okay, how do we actually make it work well? Because the AI fails in weird ways and it doesn't always
do the right thing, but it feels like, okay, every time you fix a prompt, we're not really sure,
like maybe we're breaking something else. Or is it really improving things as a whole? We don't really
know, we're just guessing. We're just kind of looking at it and going on vibes. And that is a very
uncomfortable feeling when you're trying to scale a product. Okay, so the first thing that I'll jump right
into is this idea of traces. So traces are a concept from engineering, but it doesn't have to be
scary. It's very topical for AI, because with AI you usually have many
different events. Especially for a chatbot, you have multi-turn conversations where you're
going back and forth with an AI. There might be retrieval
of information, there might be calls to some tools, external tools, internal tools, so on and so forth.
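One way to picture a trace is as an ordered list of events, some of which the user never sees. The sketch below is only an illustration under assumed names (`Event`, `Trace`, `user_visible` and the field names are hypothetical, not the schema of Braintrust, Phoenix, or Nurture Boss); the sample content paraphrases the conversation discussed in this episode.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    role: str                        # "system", "user", "assistant", or "tool"
    content: str
    tool_name: Optional[str] = None  # set on tool calls and tool results only


@dataclass
class Trace:
    trace_id: str
    channel: str                     # e.g. "sms", "email", "web_chat"
    events: List[Event] = field(default_factory=list)

    def user_visible(self) -> List[Event]:
        """Events the end user typed or saw (no system prompt, no tool traffic)."""
        return [
            e for e in self.events
            if e.role in ("user", "assistant") and e.tool_name is None
        ]


# Hypothetical trace: a system prompt, a vague user message, a hidden tool
# call with its result, and the reply that goes back to the user.
example_trace = Trace(
    trace_id="t-001",
    channel="sms",
    events=[
        Event("system", "You are an AI assistant working as a leasing team member."),
        Event("user", "Hello there, what's up to four month rent?"),
        Event("assistant", "", tool_name="get_communities_information"),
        Event("tool", "<community details>", tool_name="get_communities_information"),
        Event("assistant", "We are currently offering up to eight weeks rent free."),
    ],
)

for event in example_trace.user_visible():
    print(f"{event.role}: {event.content}")
```

Keeping the hidden events in the same record is what lets a reviewer see that a tool was called at all, and whether it was the right one.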
And so you want to log these traces. And there's many different ways to go about it, but just to
show you exactly what happened at Nurture Boss, let's go into what that looks like. So this is
a platform called Braintrust. There's a lot of them. This is one called Phoenix, which has the same exact
data in here. It doesn't really matter. You can see they're both the same, right?
So what we have here, let me just go into a single trace. So this is what I would call a
trace. I can make this bigger so you can see it in full screen. And you can see what an AI interaction
looks like in this product. So you have, okay, the system prompt: you are an AI assistant working
as a leasing team member at some apartment. These are all fictitious because these have all been
scrubbed for PII.
Your primary role is to respond to text messages.
So this is receiving text messages.
Okay.
And you have a whole host of rules like respond, you know, provide accurate information,
answer any question for residents.
Do the following.
You know, provide this website, for example.
If asked for a rental application, provide this, and so on and so forth.
All these rules, right?
And this is a real user saying: hello there,
what's up to four month rent?
I don't even know what that means.
I got you.
I got you.
Let me read it.
Hello.
Hello there.
What's up to four month rent?
I thought I had it.
I thought I had it.
Yeah.
It's unclear, but okay.
I mean, like, it's fine.
This is real.
This is the real world.
These are real traces.
So, you know, and then there's a tool call here.
Get communities information.
It's calling this tool.
this internal tool, and the tool call result comes back with this information. And this is all hidden
from the user. The user is not seeing this tool call result. You're like, okay, here's the information you can
use about the community, blah, blah. It's not even clear that this is the right tool call. We'll get to
that in a moment. And then the assistant goes, hello, we are currently offering up to... So this is
back to the user. This is what the AI responds to the user with: hello, we are currently offering up
to eight weeks rent free as a special promotion. Please note the applicable lease specials and concessions
can vary, blah, blah. Okay, so is this right? And I have a cheat sheet for myself about what is actually
right and wrong. Okay, so the comment here is the user is probably asking about lease terms
and stuff like that, not about specials. So this is not
what we want. And this is so realistic, right? Everyone has experienced
AI like this, where it's kind of being helpful, but it's not really doing what you want it to.
And it's actually pretty challenging because it's not really clear what the user wants. You can go
in a lot of different directions with this. You know, when I'm testing my own AI, this is such an
eye opening example. Because when I'm testing my own AI, I ask it good questions and I spell correctly
and I'm very clear. But when you see a real user input like this, and you actually look at what
users are prompting your AI with, you realize it's very vague. They say stuff like,
what's up. There's no clear question. And so I really do think looking at real
user data kind of can get a developer or PM out of their own mind on how they think
users are going to interact with the system. Absolutely. It's very critical that you do this.
And so now you might not have this data. And I just jumped right into,
a real example just to set things off
and we can go into all these different rabbit holes
or like what if you don't have data and stuff?
I just want to like ground it and like, okay.
So set the stage like this is kind of one foundation
is like you have to have data.
There's different ways to get it.
One is you can log it from your real system
and you have these things to look at.
Another way is, okay, you can have synthetic data,
where you generate questions like this
with an LLM.
You know, hello...
It might be hard to generate stuff that looks like that, because we don't even know what it means.
And probably an LLM won't generate stuff like that.
But that's the whole interesting bit.
It's like once you see that people are talking like that, you might actually want to simulate stuff that looks like that.
Because if that's the real distribution of the data or that's what the real world looks like, you might want to challenge your LLM or your AI system appropriately.
Okay, so let's step back here.
So you have the system.
It's doing stuff.
It's like there's stuff like this happening.
We can look at another trace if you want just to kind of get an idea.
And this is, you know, this is not pre-scripted.
I didn't memorize what's going on in these traces.
We're just looking at them naturally.
So this one is another apartment complex, Meadowbrook Apartments.
Same idea.
So we won't read the whole system prompt again.
Okay.
So we'll scroll down here.
Let's get to what the user is asking.
walk-in T-O-R.
So this must be another text message situation.
And the assistant says,
our team tries their best to accommodate walk-ins,
me get you...
No, that's hilarious.
Like, I don't know.
Why is the LLM doing that?
That's surprising.
Like, why is it saying me get you to someone who can help?
Maybe it's trying to mimic the, you know, the user somehow.
And then it does, like, yes.
And then, okay, great.
So it seems like this one.
maybe is okay. Let's see what we ended up annotating. Yeah, we said this one is okay. There's some
metadata down here about our labels, which we'll talk about next. But yeah, so you can see,
this is a real system. There's many different things that can happen here. So the question
becomes, okay, we talked about writing SQL and data, but how do you
take that same mindset to this? What do you even do with this? Right. You have this
crazy interaction. How do you analyze this without getting stuck? Because
this seems intractable, right, at first pass. No, I was just thinking, what is the
SQL query I write to get the first prompt? How do you query for: give me all the
first prompts that include typos, give me all the first prompts that are ambiguous questions?
It just feels almost insurmountable. And then, you know, you showed
us two examples, and it's two of probably thousands and thousands and thousands. So going through it
manually is probably not super scalable. So I'm curious, what is the systematic kind of solution here?
Okay. So the systematic solution is something called error analysis. So error analysis just means
it's kind of a counterintuitive process that's extremely effective and it's dumb,
but it's accessible to everybody and it works.
And it's not something that I made up.
It's been around in machine learning for a really long time.
Because actually machine learning has the same problem, like before, like, generative
AI.
We had these stochastic systems that can do, like, a whole number of things.
And, like, how do you actually, like, analyze that and, like, figure out, like, what's
going wrong and improve it?
So error analysis has two steps.
The first step is writing notes.
It's called open coding,
and it's basically journaling
what is wrong. So if we go back to
that other trace that we saw, the first one,
we would step into this trace.
Every observability tool has its own
way to take notes.
I already have a note in here:
assistant should have asked follow-up questions
about the question, what's up with four month rent, because the user intent is unclear.
You're just writing notes about what is going on. Okay, and you do that for about a hundred traces.
Randomly sample 100 traces, do that for each one, and stop at the most upstream error you find.
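The sampling-and-notes step described here can be sketched in a few lines. This is a minimal illustration, assuming traces are identified by string IDs; `sample_for_review`, `add_note`, and the ID format are made-up names, and in practice the notes would live in whatever observability or annotation tool is in use.

```python
import random
from typing import Dict, Iterable, List


def sample_for_review(trace_ids: Iterable[str], k: int = 100, seed: int = 0) -> List[str]:
    """Pick up to k traces uniformly at random; the seed makes the sample repeatable."""
    ids = list(trace_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


def add_note(notes: Dict[str, str], trace_id: str, note: str) -> None:
    """Record one free-text note per trace: the most upstream error the reviewer saw."""
    notes[trace_id] = note.strip()


# Hypothetical pool of logged trace IDs.
all_trace_ids = [f"t-{i:04d}" for i in range(5000)]
review_queue = sample_for_review(all_trace_ids, k=100)

notes: Dict[str, str] = {}
add_note(
    notes,
    review_queue[0],
    "Assistant should have asked follow-up questions; user intent was unclear.",
)
print(len(review_queue), "traces queued,", len(notes), "annotated so far")
```

Random sampling matters because hand-picked traces tend to over-represent the failures the team already knows about.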
So you read this and you see what's going on and you're like, hmm, okay, the user intent: it seems like
we didn't do a good job of clarifying what the hell they need. Yeah. And
so I think that's the most upstream problem in the sequence of events. So I'm going to go ahead and
just write that as a note. Yeah. And you say focus on the most upstream problem because you presume that
if you can get early intent, early kind of clarity, correctness right, the rest of the system is more
likely to be correct downstream. Yeah, because it's causal in nature. So as we have the sequence of
events, whether it's like user prompts, tool calls, retrieval for rag, whatever it may be,
any error at any point along the chain will cause downstream problems.
And so, to simplify our lives for the purposes of error analysis, it's a heuristic.
You know, eventually you do want to care about the different errors downstream,
but when you're starting out, just focus on the upstream error, because we're trying to make it
tractable. And this is the way that you're going to get results fast. So basically,
what you do is you go through and you collect a bunch of notes. And then you can
take these notes, download them or whatever, and categorize them.
You can even put these notes into ChatGPT. It's like, hey, here's all my notes.
Can you bucket these into categories? And you kind of have to go back and forth with it a little bit.
Like, hey, these are my notes.
These are the categories.
I think like you're missing a category, whatever.
Now, with Nurture Boss, one of the things we did, and that we highly recommend a lot of people think about, is to make your own custom annotation tool.
You see this here in Braintrust.
And it's also here in Arize Phoenix.
They're very similar.
You can see this is a very similar looking UI,
and they even called it error analysis here.
And you can like add your notes, like, you know, whatever.
And you can save those notes and same thing.
If you're going to be looking at a lot of data, you don't want to slow yourself down.
And you want to be able to have like very human readable sort of, you know, output.
And sometimes like this markdown stuff is like not that readable.
And you want to make sure that, okay, like it makes sense to you.
you can fly through it as fast as possible.
So, you know, it's really easy to vibe code this stuff.
Because ultimately what you're doing is, like, showing data.
So in the Nurture Boss situation,
as you might have gathered,
they have multiple channels that customers can contact them on.
They have text message, which we saw.
They have email.
They have a chatbot on the website, so on and so forth.
So they just wanted something they could navigate fast.
It's just vibe coded, essentially.
I mean, we were developers, but, you know, we were using AI in our process to do this very fast.
It's, okay, what channel is the trace from?
And then some other filters about, hey, did we already annotate this or not?
And then just some statistics at the top.
You know, this is what the annotation looks like.
It's very similar, but just dialed into what we wanted.
And, you know, we just took notes.
And then for Nurture Boss, what we did is, okay, we had an automated process that would categorize those notes into what the biggest issues are.
And then we would just do something very simple, like counting.
Counting is always powerful.
As you know as a product manager, from your experience writing SQL queries,
you know how powerful counting is.
Counting remains powerful.
And so you can count these issues.
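The counting step is about as simple as it sounds. Here is a minimal sketch using Python's `collections.Counter`; the category names echo the ones mentioned in this episode, but the annotation records and their counts are invented for illustration and are not Nurture Boss data.

```python
from collections import Counter

# Hypothetical output of the categorization step: one category per annotated
# trace, with None meaning the reviewer found no problem.
annotations = [
    {"trace_id": "t-001", "category": "transfer/handoff"},
    {"trace_id": "t-002", "category": "tour scheduling"},
    {"trace_id": "t-003", "category": None},
    {"trace_id": "t-004", "category": "transfer/handoff"},
    {"trace_id": "t-005", "category": "follow-up"},
    {"trace_id": "t-006", "category": "tour scheduling"},
    {"trace_id": "t-007", "category": "transfer/handoff"},
    {"trace_id": "t-008", "category": "incorrect information"},
]

issue_counts = Counter(a["category"] for a in annotations if a["category"] is not None)

# Most frequent issue first: this ordering is the prioritized list.
for category, count in issue_counts.most_common():
    print(f"{count:3d}  {category}")
```

The SQL equivalent would be a `GROUP BY category ORDER BY COUNT(*) DESC`, which is why the habit carries over so directly.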
Right. So it's like, okay, for Nurture Boss, I don't know if you can see my screen or if it's too small.
I can try to zoom it more. Yeah, yeah, that's great.
It's, okay, what are the biggest issues after doing that error analysis exercise, which only took, you know, a few hours?
Yeah.
It's like, okay, we're having a lot of transfer and handoff issues. We're trying to transfer the customer to a human.
We're having a lot of tour scheduling issues. So, like, they're trying to schedule a tour,
or reschedule tours.
In this case, we found that, like,
someone's asking to reschedule.
There is no rescheduled tour.
But, like, the AI doesn't know that.
It just keeps scheduling more tours, which is bad.
You know, follow up.
So, you know, AI not following up
when the user has a question,
you know, sometimes incorrect information provided.
Okay, so you see, these are kind of the counts,
and now we're not lost.
Now we know what we should be working on.
We know, okay, you know what?
We should fix this transfer handoff issue and this tour scheduling issue.
We have confidence.
Like, you know what?
We're not paralyzed anymore.
We know, okay, this is what we need to fix in our AI.
This episode is brought to you by Persona,
the B2B Identity Platform helping product, fraud, and trust and safety teams
protect what they're building in an AI-first world.
In 2024, bot traffic officially surpassed human activity online.
And with AI agents projected to drive nearly 90% of all traffic by the end of the decade,
it's clear that most of the internet won't be human for much longer.
That's why trust and safety matters more than ever.
Whether you're building a next-gen AI product or launching a new digital platform,
Persona helps ensure it's real humans, not bots or bad actors, accessing your tools.
With Persona's building blocks, you can verify
users, fight fraud, and meet compliance requirements, all through identity
flows tailored to your product and risk needs. You may have already seen Persona in action
if you've verified your LinkedIn profile or signed up for an Etsy account. It powers identity
for the internet's most trusted platforms, and now it can power yours too. Visit withpersona.com/howiai
to learn more. I love this. Just a recap, so you're taking these traces of these
real conversations. And you know, you don't even have to read all of it. You have to read till you
hit a snag, right? Till you hit an obvious incorrect or high-friction part of the
experience. You have vibe coded an app that makes it really easy for the team generally to go in,
annotate these, rate them sort of like good quality, bad quality, automatically categorize them,
count them. And then you have a prioritized list and you're like, here are the problems that I need
to go solve. And what I love about this is, you know, I'm sure our listeners expect some like,
magical system that does this automatically. And you're like, no, man, just spend three hours of your
afternoon, go through, read some of these chats, look at some of them with your human eyes,
put one sentence notes on all of them, and then run a quick categorization exercise and get to work.
And you see this have actual real impact on quality and reducing these errors?
Yeah, it has an immense impact on quality.
It's so powerful that some of my clients are so happy with just this process that they're like,
that's great, Hamel, we're done.
And I'm like, no, wait, like, we can do more.
You know, you've paid for more, like, you know, whatever.
I know, this is so great.
Like, I just feel like I know what to do.
And so they find so much value
in this process. And it is very important. This is something that no one talks about.
When people talk about evals, it's like, well, how do you write an eval? What eval do you do? What tools
should you use? Before you get into all that stuff, you need to have some grounding in what
eval you should even write, because there's infinite evals. So in this case, we wrote
an eval about tour scheduling issues and we wrote an eval about transfer handoff issues, and we felt
really good about that because we knew that that is a real problem.
And we knew how to write the eval because we saw that error.
And we knew how to find data to test that eval because, again, we already tagged it and we saw that error, which is exactly the way you want to do it.
Yeah.
And what I also like about this is it does take the burden off your users.
I mean, so many people try to collect this data by, like, putting a little thumbs up and thumbs down or little comments.
Like, I even have that on parts of my product.
And yes, it is useful, but it only gives you a sliver of the
self-identified errors in the app.
And users are highly tolerant of systems.
And so sometimes those errors just don't get escalated by the user.
They'll either abandon or they'll just work through too many steps to get to the outcome that they want.
They'll have a low-quality experience.
And so I like just taking the burden on yourself and saying you're responsible for looking at
the data.
You can create simple ways to categorize it.
And then you have a prioritized list.
Now, if your client is willing to go the next
step and do something about this and write evals and fix prompts,
what are your kind of next steps here?
What's another example of where we're going to come here?
I just want to talk about this for a minute.
Okay, so this particular technique is so powerful and not many people know about it.
So I actually recently did a training with OpenAI, showing the people at OpenAI, you
know, how this works for domain-specific evals.
If you want to learn more about this, we had Jacob, the founder of Nurture Boss, walk through this whole process in like two minutes.
So you can find it on this page if you like.
Okay, so to get to your question, what do you do now?
Okay, so you have like, you know, you've done your error analysis.
You have like prioritized these things.
So like now what do you do?
So now you get into writing the evals.
So now you have to decide what kind of evals you want.
There's different kinds of evals.
So there's reference-based evals,
which is where you know what the right answer is.
And maybe you can write some code.
You don't need an LLM to do the eval for you.
Or if it's more subjective in nature,
you know, maybe like this transfer handoff issue,
maybe it's more subjective in nature,
then you need an LLM judge.
And so what you can do is
you can start to write those evals. And so I have this blog post here about evals in general.
So there's this diagram. It's really hard to put this whole thing into a diagram, honestly,
because it's kind of a nonlinear process.
But really what you want to do is, okay, we already covered logging traces,
and there's different kinds of
evaluators or evaluations. There's kind of like unit tests, which I would call
code-based evals. And then there's models, so LLMs. You know, code-based evals: so,
for example, what kinds of things would be good for a code-based eval?
Okay, if you have user IDs showing up in the response or something like that, you can test
for that in code. I have to say you're saving my life here because I was thinking, what is one of
these unit tests I need to write, and that is exactly one of them, which is my tool calls need
UIDs and users definitely do not. So that's a great example of one for anybody that's writing
a chatbot that does a lot of kind of tool calling. Yeah, because they can show up by accident.
Like, maybe you have the UID in the system prompt, and it inadvertently shows up in the output
for some reason or another, and you don't want that. Okay, you want to write these tests.
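A code-based eval for the leaked-ID case can be a plain assertion over the assistant's reply. The sketch below assumes the internal IDs are UUID-formatted, which the episode does not state; if the IDs have another shape, the pattern would need to change. `leaks_internal_id` is a hypothetical helper name.

```python
import re

# Assumption: internal user IDs look like UUIDs (8-4-4-4-12 hex digits).
UUID_PATTERN = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)


def leaks_internal_id(assistant_reply: str) -> bool:
    """Return True if the user-facing reply contains something shaped like an internal ID."""
    return UUID_PATTERN.search(assistant_reply) is not None


clean_reply = "Sure, I can help you schedule a tour for Saturday at 2pm."
leaky_reply = "Tour booked for user 3f2b8c1e-9a4d-4e7b-8c21-5d6f7a8b9c0d on Saturday."

print(leaks_internal_id(clean_reply))
print(leaks_internal_id(leaky_reply))
```

A check like this is cheap enough to run on every logged reply, so there is no need to bring in an LLM judge for it.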
No matter what kinds of tests you write, you want to create test cases. And sometimes you can
gather those from your traces. Sometimes you might want to generate synthetic data.
And so, you know, this is a prompt for a different real estate assistant called Rechat,
which is for residential real estate. And this is kind of a simplified version of a prompt:
write 50 different instructions that a real estate agent can give to their assistant
to create contacts on their CRM.
Contact details can include name, phone, email, whatever.
And basically, you know, it can generate synthetic inputs to a system that you can then
log traces from.
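As a rough sketch of how a prompt like that turns into test inputs: build the prompt, send it to whichever LLM you use, and split the numbered reply into individual instructions. The function names and the sample reply below are made up for illustration, and the model call itself is deliberately left out so that no particular vendor API is implied.

```python
import re
from typing import List


def build_synthetic_prompt(n: int = 50) -> str:
    """Paraphrase of the simplified prompt read out in the episode."""
    return (
        f"Write {n} different instructions that a real estate agent can give "
        "to their assistant to create contacts on their CRM. "
        "Contact details can include name, phone, and email. "
        "Return them as a numbered list, one instruction per line."
    )


def parse_numbered_list(llm_reply: str) -> List[str]:
    """Extract items from lines that look like '1. text' or '1) text'."""
    items = []
    for line in llm_reply.splitlines():
        match = re.match(r"\s*\d+[.)]\s+(.+)", line)
        if match:
            items.append(match.group(1).strip())
    return items


# Hypothetical model reply, hard-coded here in place of a real API call.
sample_reply = """Here are the instructions:
1. Add John Smith, phone 555-0100, to my contacts.
2) Create a contact for jane@example.com

3. new contact: Maria, cell 555-0142
"""

synthetic_inputs = parse_numbered_list(sample_reply)
print(build_synthetic_prompt(3))
print(synthetic_inputs)
```

Each parsed instruction can then be fed to the system under test, and the resulting conversation logged as a trace like any real one.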
I'm going to jump around a little bit, so we'll kind of come back to that.
Okay, we already covered logging traces.
You know, this is another custom log annotation thing yet again, because we
really emphasize that it's really important to remove all friction in doing this. So I won't
linger on this too much. And basically, one thing that's usually skipped
when we talk about LLM as a judge is that people are just using LLM as a judge off the shelf. They're
writing a prompt, they're saying, okay, judge it, and then reporting that. Let me actually
go to a different blog post that is a little bit better for LLM judges, which is this one. Okay, so
LLM as a judge. You often see in LLM eval land a dashboard that looks like this:
helpfulness, truthfulness, conciseness, core, tone, whatever. What the hell does that mean? Does anyone know what
that means? Nobody knows. No one understands it
concretely. Like, if the helpfulness score is 4.2
and it goes to 4.7, do you really know
what's wrong, what changed? No.
And so there's a lot of guidance on how to create an LLM
as a judge.
It's probably too much for this podcast to, like, tell you all of the things.
And this blog post is quite long, like,
enumerating, like, how to do it correctly.
But the main things that you need to keep in mind are: one, you need to have binary outputs, like,
is it good or bad for a specific problem? So for, you know, the handoff problem for Nurture Boss,
it's like, okay, was there a problem or not? And you want specific evaluators for specific problems.
Number two, you need to hand-label some data, which you already kind of do in error
analysis, and you want to compare the judge to the hand-labeled data
so that you can trust the judge.
The last thing you want to do is, like, throw up a judge on a dashboard like this,
and then, like, people don't know if they can trust it.
And the worst thing you can do as a product manager is start showing people evals,
and then at some point people's perception of the product or their experience of the product
doesn't match the evals.
So it's like, hey, it's broken, but the evals are showing that it's good.
And that's the moment people lose trust in you.
And then it's going to be really hard to regain that trust.
And so the way that you make sure you can trust these automated LLM evals is to, you know,
measure agreement with these hand labels.
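One way this comparison could look in Python, as a sketch only. The conversation just says to measure agreement with hand labels; splitting the result into true positive and true negative rates is an added assumption, included because raw agreement can look high when failures are rare. The label lists are made-up data.

```python
from typing import Dict, List


def judge_agreement(human: List[bool], judge: List[bool]) -> Dict[str, float]:
    """Compare binary judge verdicts with hand labels (True = failure present)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    # Judge verdicts, grouped by what the human said about the same trace.
    positives = [j for h, j in pairs if h]
    negatives = [j for h, j in pairs if not h]
    nan = float("nan")
    return {
        "agreement": sum(h == j for h, j in pairs) / len(pairs),
        # Of the traces a human marked as failures, how many did the judge flag?
        "true_positive_rate": sum(positives) / len(positives) if positives else nan,
        # Of the traces a human marked as fine, how many did the judge pass?
        "true_negative_rate": (
            sum(not j for j in negatives) / len(negatives) if negatives else nan
        ),
    }


# Made-up labels for five traces, for illustration only.
human_labels = [True, True, False, False, True]
judge_labels = [True, False, False, True, True]
scores = judge_agreement(human_labels, judge_labels)
print(scores)
```

If either rate is low, the judge prompt needs more work before its numbers go on a dashboard.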
Yep.
So what I'm hearing from you in terms of LLM as a judge is that these general buckets with arbitrary ratings against them
are not useful and will often work against you.
You want to write specific binary-outcome evals for specific tasks.
So you want a set of evals that are like, does this get scheduled correctly?
Yes or no?
And so you're making a list of evals that the LLM as a judge is evaluating, each of which gives you a
pass/fail, yes/no, true/false binary outcome. Very simple.
And then you're doing the additional layer of work of validating that the eval itself is
valid by actually looking at that outcome and saying, do I actually agree with this LLM as a judge
evaluation of the quality of this output? And those steps together are going to give you a
much more comprehensive view of how your product's performing. And then that
second layer of human evaluation is going to give you more confidence that either your LLM
as a judge is good and is evaluating your outputs correctly, or you actually
need to tune that judge itself to get to higher-quality evaluations. Is that kind of the summary
of what you do well?
Yes. And the thing that's really important is that it's really difficult
to write any LLM judge prompt if you don't do this. There's some
research from my co-instructor for the course that I'm teaching, a paper called "Who Validates
the Validators?" And the research shows that people are really bad at
writing specifications or requirements until they need to react to what an LLM is doing,
which clarifies and helps them externalize what they want.
And it's only by going through this process of writing detailed notes
and critiquing things that you can then start refining the LLM judge.
Great.
And so we've covered traces, errors, and annotation.
You have how to build unit tests that are automated tests.
Of course, you're looking at it manually.
You're doing LLM as a judge the correct way.
Now tell me, I've identified all these problems.
I have these evals that give me data.
How do I write a good prompt?
Like, are there some techniques?
Are there things that you've found are consistently effective in the next step of improving your system
instructions and improving your tools, where you actually have to go solve these
problems?
Yeah. So when you get to the errors that you have, you know,
you're going to use these evals and you're going to deploy them at scale. You're not
looking at all your data, you're looking at a sample of data. You're going to score your
LLM as a judge against a sample of labeled data, you're going to deploy that at scale,
and you're going to look at where the errors are.
You have to make a judgment call on how you improve your
system based on the errors you're finding. Is it a retrieval problem? Is it a prompting issue?
Should you be putting more examples into the prompt? There's not really a silver
bullet there. I would say retrieval is its own sort of beast. Retrieval
tends to be the Achilles heel of a lot of AI products, you know, where things tend to go wrong.
But sometimes, especially in the beginning, you're going to find a lot of low-hanging fruit.
Like, for example, in Nurture Boss, the system prompt didn't contain today's date.
So when the person said, hey, can you do a schedule for tomorrow?
The AI had no idea what tomorrow was, but it didn't tell the user that, right?
It just guessed.
So, like, you know, that's really obvious.
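A fix for the missing-date problem can be as small as the sketch below. The function name and the base prompt are hypothetical; the only idea taken from the conversation is that the system prompt should state today's date.

```python
from datetime import date
from typing import Optional


def with_current_date(system_prompt: str, today: Optional[date] = None) -> str:
    """Append the current date so relative terms like "tomorrow" can be resolved."""
    today = today or date.today()
    return f"{system_prompt}\n\nToday's date is {today.isoformat()}."


# Hypothetical base prompt; a fixed date keeps the example reproducible.
prompt = with_current_date("You are a leasing assistant.", date(2025, 10, 13))
print(prompt)
```

In production the `today` argument would be omitted, so the date is filled in on every request rather than baked into a stored prompt.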
So there'll be, like, obvious things you can fix.
And then there's, like, lesser obvious things you can fix.
You could try, like, prompt engineering.
So there's a spectrum from prompt engineering all the way to fine-tuning.
Most people shouldn't get into fine-tuning.
I will say that if you do all this eval stuff, fine-tuning is basically free,
because you have all this infrastructure set up to do all these measurements
and curate data, like high-signal data that is difficult.
And that difficult data, those difficult examples that your AI is not getting right,
that's exactly the stuff you want to fine-tune on.
That's the very high-value stuff for fine-tuning.
So, yeah, fine-tuning is not so hard.
In the Rechat case, we had to do fine-tuning to go the extra mile.
But in most cases, it's prompt engineering.
There's no magic prompt engineering tricks.
It's really like, I would say, there's a lot of experimentation.
that you should engage in.
Well, and one of the things that I found so interesting
as an AI builder that comes from a software engineering background
is now I have a natural language surface for bugs
in terms of my system instructions and prompts.
And I had this experience recently on ChatPRD
where we were really having a hard time with tool calling.
One of our tools was just intermittently not being called,
no matter what the user would say.
And it was really hard to pin down.
And we have this, you know, monster system prompt,
and I went through it and there were two words in the prompt that were just incorrect. It was
about UUIDs, but it was incorrect. And as soon as I deleted those two words, which had just
been typed in by somebody and pushed to the repo, the quality of that tool calling
shot right up. And so we have to, as product people and as engineers, start
thinking of the full surface area of our product. It's not just the construction of the agent or the chatbot
itself; it really goes down to what words are going in and out of your system. And it's a
complicated surface area to debug and keep track of because it's unstructured, but it's super
high impact in my experience.
Yeah, definitely. You know, when it comes to tool calls, actually,
let me show you one thing that always comes up is people wonder, like, how do you evaluate agents?
Because, like, you know, there's so many different handoffs. Like, how do you actually, like, do
it in real life. So let me see if I can share that. Okay, so I'm sharing the book that we
give students in our class. Let me go to the table of contents. So there's all these different
areas. We'll kind of skim towards the agent part of it. There are analytical tools you
can use for everything. You know, for agents, you can build these transition matrices. So
going from one step to the other, where are the errors located? In which agent handoffs,
which steps being handed off to which other steps? So in this case, okay, we have this
generate SQL to execute SQL transition. That's where a lot of the errors are happening. And then
you can narrow it down. So as you get more advanced into evals, this is a very deep
subject. There are a lot of analytical tools you can use to go about things.
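A small sketch of how such a transition count could be tallied from annotated traces. The trace format here (an ordered list of step names plus the index of the first failing step) and the step name `parse_request` are assumptions made for illustration; only the generate SQL and execute SQL steps come from the example on screen, and the counts are made up.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Assumed trace shape: (ordered step names, index of the first failing step or None).
Trace = Tuple[List[str], Optional[int]]


def failure_transitions(traces: List[Trace]) -> Counter:
    """Count (previous step, failing step) pairs across traces."""
    counts: Counter = Counter()
    for steps, failed_at in traces:
        if failed_at is None or failed_at == 0:
            continue  # no failure, or no preceding step to hand off from
        counts[(steps[failed_at - 1], steps[failed_at])] += 1
    return counts


# Made-up traces for illustration.
steps = ["parse_request", "generate_sql", "execute_sql"]
traces = [(steps, 2), (steps, 2), (steps, 1), (steps, None)]
counts = failure_transitions(traces)
for (src, dst), n in counts.most_common():
    print(f"{src} -> {dst}: {n}")
```

Laying these counts out as a grid, with source steps as rows and failing steps as columns, gives the matrix view and shows which handoff to investigate first.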
It is very interesting.
As a product manager, you can get really far with AI-assisted notebooks.
Yeah, what I was going to say about this from a product manager perspective is that this is really put in the frame of errors and evals,
but it's also useful just as analytics for agentic systems, for figuring out what your users are trying to do.
I hadn't thought of this idea of actually mapping out the different conversation-to-tool or tool-to-tool handoffs.
And even if all of this was working effectively, a product manager's ability to see the data on their agents' behavior from a tool-to-tool handoff perspective, and really identify where users are trying to get value out of the system, can also do things like drive roadmap ideas.
Right? If you're seeing, okay, people are just writing SQL and executing SQL, then we need to dig
into what other things around that we could build for users that are interesting. So I like it from
the error perspective. I also like it just from the product discovery perspective.
Yeah, definitely. That's very true. Yeah, I like that.
Okay. The other thing that I like
that you've shown us is that there's no way to do this other than to just do it. People want these
tricks, they want some hack, they want some off the shelf solution. And you're saying like,
honestly, look at the data, build yourself a solution if you have to, validate it yourself,
do the hard work. And if you do the hard work, you can actually create these leaps in product
quality and experience. But right now, you just got to look at the data and make some
decisions and make things better. So I think this has been super illuminating in terms of helping
people like me that are building AI products, make them higher quality. Let's spend just a couple
minutes on a totally different topic, which you are running this business, you're running a course,
you are clearly an expert in AI. What tools are in your stack for kind of running your day-to-day
life or at least your business life? Yeah, so I do a lot of writing and I do a lot of communication
with clients. And, you know, I also want to reduce my own toil. And so let me show my screen again.
Yeah. It's probably easiest to show you Claude Projects.
I have all these Claude projects.
So, okay, I have like one for copywriting.
I have a legal assistant.
I have consulting proposals.
Consulting proposals is pretty interesting.
So it's basically like an example of consulting proposals.
It's, you know, so it's kind of funny.
I have skill level, partner of Palantir, expert generative AI, blah, blah, blah.
And, you know, I give it some instructions on the other, let's say, proposals I have.
And I have this prompt, you know, whatever: get to the point,
write short sentences, whatever. And basically, I have a lot of examples. And
anytime I have an intake call with a client who wants a proposal, I give this the transcript
and then it's basically almost ready. It takes me about a minute
to edit it and get it going. So that's proposals.
I have one for the course, which is like, you know, a lot of context about my course,
which is like the entire book.
I have an FAQ that's very, like, extensive that I've published.
There's all the transcripts, all the discord messages, office hours, you know.
And again, my prompt is like, hey, your job is to help course instructors to create standalone,
interesting FAQs.
This is like a writing prompt that I have everywhere.
Do not add little words.
Don't repeat yourself.
Get to the point.
Yeah.
Yeah, yeah.
It's very, you have to really, you know.
And so, okay, like, yeah.
It's just, you know, this stuff here.
You know, so this is like one for the course.
There's, you know, there's one to help me create these things called lightning lessons,
which is basically like, you know, this lead magnet.
So there's all kinds of stuff like this.
I see you and I share a
general counsel here.
Oh, okay.
With Claude AI.
Oh yeah, right, exactly.
Yeah, there you go.
So there's that.
And I also have like, you know, my own software that I have.
Yeah.
So I have, let me see if I can find it.
I mean, I'm not really advertising it, but I have YouTube chapter creation.
And I basically have this thing that will create blog posts out of YouTube videos.
So, let me show you an example.
So with this one, basically what I do is I take a YouTube video and it becomes an annotated presentation.
So you don't have to watch the video.
Yep.
Like you can just, especially if the video has slides, what it will do is screenshot all the slides and then have a summary under each slide about what was said.
So you can consume a one-hour presentation in, you know, whatever, five minutes.
And that's really good because, you know, I teach a lot and I have a lot of
content. And so I distribute notes. So a lot of that educational stuff
is part of my workflow. And this uses Gemini. Essentially what it does
is it pulls the transcript. It pulls the video. I can put in the slides all at once.
I have a lot of examples and I give them to it and it produces this.
Yeah, I've heard this in a couple
podcasts that we've done recently: folks really like Gemini for ingesting video information. It seems to
be the fan favorite for taking YouTube videos or other video content and turning it into
text or other applications that you can extract from that. So try the Gemini models for that,
folks.
Yeah, it's absolutely brilliant. It's amazing.
Cool. Okay, so you have Claude projects for every
little part of your business. I love the proposal workflow. It's something that folks who do
enterprise sales could probably make some use out of. I'm about to start doing blog posts on all the
How I AI podcasts, so maybe I will download your repo and give that a little spin. And then you're
using Gemini models to extract content and share it as templates. And then you have,
oh, look at these prompts. We've got a GitHub with prompts.
Yeah. So I have a GitHub with prompts.
This one is private. But just to give you an idea, conceptually, it's basically a monorepo of
everything. The reason is that I like to use Claude Code, OpenHands, you name it,
and all these things are interrelated, right? A lot of
these projects, like, you know, my blog is in here. This is my blog, for example.
This is that YouTube thing I just showed you, this Hamel project. This is something
else that fetches Discord. This is about copywriting, proposals, whatever. And I just point AI
at this repo. And, you know, there are Claude rules in here that say, okay, what is this
repo about, and where do you find stuff? Like, okay, if you need to do writing,
you should look here, and so on and so forth.
So, my friend, you have buried the lede here, because
we could have done an entire episode on just this repo. What this makes me think of is, you know,
five years ago there was this big note-taking, second-brain question: where do you put all your
information so you can have access to it forever? And I see this and my little engineering brain goes,
obviously it should go in a repo, and it should be a combination of data sources, notes, articles,
things that I've written, things that I like, and prompts and tools to actually do something
with that. So you have given me a personal project that I'm going to go work on in the next couple
days, because I think this is how I, as somebody who lives with Cursor or Claude Code
as sort of co-pilots for everything I do, would want to organize my data and my
prompts to be able to do something with them.
Yeah, I don't want to be locked in, right,
to any one provider. And so this is how I do that.
It's amazing. Okay, we might have to
have you back to go through this thing in detail. This has been so great. I have two lightning
round questions for you and then I will get you out of here.
I know you're a busy guy. My first question is, you know, a lot of what you showed us
requires someone, a person to go through with their human eyes, read things, and evaluate.
And I'm curious, whose role do you think this is? Is this the product manager's role? Is it the
engineer's role? Is it the subject matter expert's role? Who does this?
I think the subject matter expert is very central. A lot of times the product manager is
the subject matter expert, the SME, in a lot of organizations. They're kind of the person
that everyone looks to for like the taste of like, hey, this is what should be happening with the user.
So I would say a lot of times it is the product manager that should be doing that annotation.
Now, when it gets into the analysis, it's really interesting.
As a product manager, the more you can do, the better, like the SQL
and the stuff that you know about.
At some point, you probably need a data scientist when it gets advanced.
But the more you learn, the better. And vice versa: the more data scientists learn product skills,
you know, the better. It's hard to predict. There's always this
tension, or this question of, okay, can we collapse roles? Can we collapse the product role and this
data scientist type AI role? I'm not sure. It's yet to be seen. I don't think so. There's a lot of
surface area, actually. There's something called AI engineer. There's AI product manager.
And there's also still this data scientist aspect. So those three roles are still
operating on this problem. And there's definitely a lot of surface area for all of them,
especially as you scale. The one other thing that I would call out, or my hope is in addition to
sort of like the technical building teams who are sort of proxies in my mind for the subject matter
experts. So a lot of times the product manager is a proxy for like the leasing agent in this example.
They understand that user. They understand what high quality is. But, you know, I would really
love to see folks that are in operational or more functional roles come in and actually contribute
to the quality of the products because you know what makes a good user experience. You know
what makes a good leasing agent. You know how they should speak and what they should do. And I think
there is an opportunity for folks to lean in and bring that expertise to bear in
a way that scales across a company. And if you're willing and brave enough to do it, I think product
teams would welcome in kind of like non-technical colleagues into this process to add some more
kind of user empathy and subject matter expertise.
Yeah, definitely. Yeah. The more you can bring like the actual required taste in the product
sense into the process, the more that, yeah, because that's essentially what you're doing when
you're annotating. Yep. Doing this error analysis and the error analysis is the foundation for everything.
Yeah. Okay. And then my final question, which I ask everybody: I know you're very structured and you'll tell me you'll look at the data and then figure out exactly what to say. But you have to admit, sometimes AI is very frustrating and doesn't do what you want it to. Do you have any back-pocket prompting techniques you use? Do you yell? Do you use all caps? What's your strategy?
Where AI has frustrated me the most is writing. Because with writing, I don't want the writing to sound like AI. Yeah. And it's hard. You know,
that's the last thing you want in certain situations for your writing to sound like AI.
Not that AI is like wrong.
It's just that, yeah, you want to make sure your like flavor is coming across.
And so one thing is, okay, I showed you a little bit of my writing prompt,
and I can share it with you separately also: provide lots of examples, but then also take
it step by step.
So for writing, what I do is have it write an outline.
And then I have it write the first one or two
sections, and I edit them very carefully. Now, one tip is to use something like AI Studio that allows you
to edit the output of what the LLM is giving you. That's really important because what that ends
up doing is it creates examples for the LLM kind of right there.
Yeah, inline.
Yep.
Yeah. And so, yeah, you want to edit the output and, you know, yeah, something like a notebook or
AI Studio, there's not too many things that let you edit the output. But once you do that, once you
do that hard work of like those examples, especially like the thing you're trying to write now,
then it starts to work really well.
Yeah, one of the most important things that I built
into my AI product was that every asset that gets generated has a real-time editor for the user
to update. And then those updates go back into the model, because I just think if the central
value proposition of your product is writing, which mine is, it's one of the hardest stylistic challenges
I've seen AI struggle with. It all sounds like slop. I can identify AI writing from a mile away.
And so, yeah, I found this like incremental optimization, first outline, then draft, then edit,
then refine process takes a while. There's some latency in the experience, but it ends up netting
higher quality. And then just like use it as a draft, edit it, get the system, get the system to be
So that's really, really great feedback.
Is this for ChatPRD?
That's what you're doing?
Yep.
Very cool.
Yeah, you know, I have high standards for writing too.
So it was important to me.
Well, this was so great.
Where can we find you and how can we be helpful?
Yeah, hamel.dev is my website.
You can also find me, Hamel Husain, on Twitter.
And yeah, I'm teaching a course on Maven, as you know,
about evals that goes into all these subjects very deeply.
But yeah, that's where to find me.
Great. Yeah. And for our listeners that don't know, Lenny's List is on Maven, including a
How I AI section that I think features your course. So you can check it out there.
Thank you so much for the time. It was super educational, very practical. I'm going to take these
tips right away and go improve my own product. Have a great day.
Yeah. Thank you for having me on.
Thanks so much for watching. If you enjoyed this show, please like and subscribe here on YouTube
or even better, leave us a comment with your thoughts.
You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app.
Please consider leaving us a rating and review, which will help others find the show.
You can see all our episodes and learn more about the show at howiaipod.com.
See you next time.
