The Infra Pod - Helping enterprises to adopt AI the real way - Chat with Hamel Husain
Episode Date: June 3, 2025

Ian and Tim host a conversation with Hamel Husain, an AI influencer and consultant who has worked with major tech companies like GitHub and Airbnb. They delve into Husain's journey to becoming an independent consultant, his insights on the common struggles with AI evaluation and data literacy, and his pragmatic approach to improving AI systems. The discussion also covers the importance of error analysis, measuring AI performance, and the evolving role of AI engineering in enhancing software development.

00:23 Hamel's Background in AI and Machine Learning
01:18 Challenges in AI and Machine Learning
02:15 Importance of Data Literacy in AI
04:02 Common Struggles in AI Development
05:05 The Role of AI Engineers
08:39 Focus on Processes Over Tools
11:10 Error Analysis and Practical Examples
15:52 Case Study: Improving AI Products
26:05 The Role of Agents in AI
Transcript
Welcome to the Infrapod.
This is Tim from Essence and Ian, let's go.
This is Ian, user of LLMs.
Although I don't know how they work, can't wait to talk to Hamel today, the current "it boy" of LLMs.
Hamel, tell us all about yourself and what you're currently doing.
I led research at GitHub that led up to GitHub Copilot, called CodeSearchNet. And then at some point, you know, people started reaching out to me for help with LLMs,
and I became an independent consultant.
It's really interesting because I get to see a lot of different use cases and what's working and what's not working
and get to help people build AI products.
And yeah, that's a little bit about me.
So, I mean, you're pretty well known now.
You've done some courses, you've done a bunch of tweeting, done a bunch of LLM
stuff. I'm curious — you and I have talked at length, and one of the things you said to
me, just to kick us off, is that most people building with LLMs that you end up
working with these days in your consultancy struggle with basic evals. So can you expand on where
you think the current world is at in terms of people's ability to actually use,
you know, this newfangled machine learning technology, where they hit the
hiccups, and, from what you're witnessing, where people actually are in
terms of their ability to use it and how much value they can actually extract
from these things?
Yeah. So, okay, like a key aspect of working with any machine learning AI system,
like pre-generative AI and post-generative AI doesn't really matter,
is that you have this like stochastic system, right, giving you outputs of some kind,
and you have to reason about that. You have to figure out, is it doing what you want?
And how do you improve it?
That's kind of table stakes for building AI products.
If you can't improve it, then you
can't really make any progress.
And you're going to get left behind.
So measuring something is key to improving it.
And it is the case that you don't need a lot of the skills
that you used to need for leveraging AI.
Before you needed to maybe be a machine learning engineer
and know a lot more about how models work,
kind of like the details of all of these things.
And so there's definitely a portion of those skills
that a lot of people don't need.
But there is a portion of those skills
that you absolutely do need.
I would lump those into the category of data literacy. So while you might not need to understand
how the model works, you do need to have like really strong skills on how to analyze the
performance of your AI system, how to poke at it, how to measure it, how to explore
and quickly have a nose for data and figure out what's going on.
And then also being able to design metrics that make sense
and being able to take this blob of, let's say,
stochastic outputs and not be overwhelmed by it — and there's a long history of
a systematic process you can apply
to do this correctly. So that's where people are struggling: you have this really large
influx of software engineers working on AI, which is excellent because again you don't need to
be a full-blown machine learning engineer to make AI products. That's really clear.
But the thing that a lot of people don't have in their training or their background is like,
okay, how do you do data analysis? How do you measure things? How do you test these
stochastic systems? And that's the key failure mode that everyone's experiencing. And that's
when people reach out to me.
It's like, hey, we built this prototype.
It kind of works, but we're sort of stuck on how to make it better.
How do we make it work really, really well?
We tried X, Y, Z things, we tried a lot of stuff.
The software engineering mindset a lot of times is like,
okay, let's try different tools.
Let's try a different vector database.
Let's try a different agent orchestration system.
Let's try whatever.
That becomes a game of whack-a-mole.
And you tend to spin your wheels on that
without having some sort of more principled tests
that you can rely on.
That's not just vibes.
Vibes are useful.
They can take you to a certain point,
but then you plateau really fast. So that's essentially what everyone is struggling with. So it's
like, okay, how do we rip out the data literacy component of, you know, like machine learning
and data science and kind of bring the pieces that matter into AI engineering? That's where
I spent a lot of time.
And this is sort of, you know — Sean, more commonly known as Swyx, has, with Alessio at Latent Space, sort of termed this concept AI engineering. And I think they
define AI engineers as basically: you took a full-stack dev, you brought in some data science
skills, and then you're like, go build products with LLMs. Does that match your mental model
of what people need to become, in this sort of concept of AI engineers, in order to utilize these things?
Basically what you're saying is the default software engineer spends a lot of time building
very deterministic systems, where they didn't spend any time thinking about things like data quality
or distribution, or how the quality of the data impacted what could be predicted
from it, right? They're much more interested in how many rows can be pulled back by some query or something along those lines.
So what do you think the future of software engineering looks like, given that it's likely
that LLMs will be at the core of many of our apps? And how does this concept of AI engineering fit
into that? And if you were to become one, or take on
whatever the skill sets are
that an AI engineer has, how does that actually give you
superpowers to take advantage of these new tools
or features?
Yeah, so okay, that was a lot — let me try to break down
the question a little bit.
So yeah, Swyx has done an excellent job
of kind of creating this community of AI engineers.
And it's really good, because it's like some kind of way for people to gather and
like exchange thoughts. It's really hard to like get your hands around all the skills.
It is true like to get started, the full-stack engineering skills are really important.
And I'll actually admit that a lot of data scientists and machine learning engineers
are kind of weak at the full-stack engineering stuff.
So going from like zero to one, it's actually like really good that you're a full stack engineer and you have like that mindset.
And the tools are very different. It's not really about tools — tools and culture and knowledge are all intermingled very tightly, you know; they have different communities, right? So it's like, TypeScript doesn't have the greatest data tools, but Python
is not as mature in its web frameworks and, you know, product engineering stuff.
And people might get mad about this — there's people that are like, no,
Python is fine. But you know what I mean, right? Those tools have different purposes.
And it's not a surprise, right?
You need both skills.
But on the other hand, I think you can learn them.
Basic data literacy, I think people can learn.
It's not too onerous.
We're not talking about the full breadth of everything
you need to know from machine learning engineering.
Talking about very basic things, like even counting
and manipulating data
is really helpful. I don't know if that answers the question or not. I think it definitely
answers some of the question. The next question I have is: a lot of the clients
who are coming in are asking you, basically, how can I improve these systems? How can I make
them more efficient? Basically, how can I make them more accurate, right? At the end
of the day, how can I make it respond more correctly, with some definition
of correct? And that question basically isn't a question about, am I using Postgres, or have I
vertically or horizontally scaled it? It's a question of what systems you have — data systems
you've built to quantify and measure effectiveness, and improve effectiveness over time, of the data set
that things are being trained on.
So it's like, how do I use things like in-context learning?
How do I do RAG?
How do I do these different things?
Is there a part of that stack
that people struggle with the most?
Like that you see of this equation
people struggle with the most?
Yeah, the thing that people struggle with the most
is they focus on tools, not processes.
So 99% of the time when I engage
with someone who's stuck, they say, hey, I need help improving my AI. And they immediately launch
into, okay, here's my agent architecture. Here's my tools. Here's my — I'm using whatever,
LanceDB or Chroma or whatever, and I'm doing this kind of hybrid search with this weighting here and this parameter here.
And they give me this like kind of avalanche
of tools focus.
And the first question is like, what tools do I use?
It's really pervasive and it goes really deep.
And people will say, oh, hey, like,
I heard that you said that we should focus
on processes, not tools,
because I repeat the same thing all the time,
not just on this podcast.
And they say, okay, like, great, we want evals.
What tools should I use for that?
Should I use Braintrust?
Should I use LangSmith?
Should I use Arize?
What tools should I use?
And then even when they start using those tools,
like, okay, you know, like what model is the best model
to do evaluations?
So all of these questions are a smell
that you have the wrong mindset.
It's like saying, I wanna lose weight,
what gym membership should I buy?
What gym membership you should buy
is the least important question
you should be asking yourself.
You need to go to the gym.
You need to figure out the process
and that is where people fail.
It's like, wait, don't show me any agent architecture
diagrams or really anything. We need to measure,
okay, what is working and what's not. You should go through the process of doing that — and it's a process.
A lot of times you don't need tools. Tools can help a little bit, just like a gym can help you.
But just going through the process with an Excel spreadsheet, with a Jupyter notebook,
with just like a REPL, it doesn't really matter.
Going through the right process of looking at your data, doing basic things like error
analysis, figuring out like what actual errors are occurring in your AI system, and then
justifying every bit of complexity that you have instead of just kind of looking at tools.
I think that that's where people get stuck,
because they don't know.
And so there's a structured way to look at your data.
People are like, the question comes up really fast.
How much data do I look at?
Or what part of my data do I look at?
You know, this system is complicated.
There's RAG.
There's tool calling.
There's agent orchestration.
Do I measure everything? Do I test everything?
And there is a concrete answer to all of these things.
It starts with error analysis. Error analysis is not something I made up.
It's been around in machine learning for decades.
And that's how people debugged AI for the longest time.
And actually it works really well.
What does it mean? It's like you look at the data physically,
like you put your eyes on it, you go row by row
or trace by trace, and you write notes
about what errors you may or may not be seeing.
And then what you do is you categorize,
and you can use an LLM to help you categorize
the kinds of errors you're seeing and then count those.
It's a simple exercise,
but it actually drives insane clarity.
That's just an example.
There's a lot of different techniques like that,
like little techniques that are simple.
Like I said, you can use an Excel spreadsheet a lot of times.
You can use a pivot table.
It doesn't have to be crazy.
You can use a notebook, okay, or whatever.
You can use whatever the hell you want.
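As a minimal sketch of that error-analysis loop — review a sample of traces, write a note per trace, bucket the notes into categories, count them. The trace IDs, notes, and category rules below are made up for illustration; they stand in for what a human reviewer, helped by an LLM, would produce:

```python
# Minimal error-analysis loop: review a sample of traces, write a free-form
# note per trace, bucket the notes into categories, then count the categories.
# The IDs, notes, and category rules here are hypothetical placeholders.
import pandas as pd

# In practice: pull ~30 traces from your logs into a spreadsheet or notebook,
# read each one, and write a short note about anything that looks wrong.
notes = pd.DataFrame([
    {"trace_id": "t-001", "note": "promised a reschedule, but no reschedule tool exists"},
    {"trace_id": "t-002", "note": "parsed 'two weeks from now' as tomorrow"},
    {"trace_id": "t-003", "note": ""},  # looked fine
])

# Bucket the free-form notes into coarse error categories.
# (An LLM can help propose and apply categories once you have the notes.)
def categorize(note: str) -> str:
    note = note.lower()
    if not note:
        return "ok"
    if "parsed" in note or "date" in note:
        return "date_handling"
    if "no" in note and "tool" in note:
        return "hallucinated_capability"
    return "other"

notes["category"] = notes["note"].map(categorize)

# Counting the categories is the step that produces the clarity described above.
print(notes["category"].value_counts())
```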
People don't want to do that.
They're like, well, can't something just automate looking at the data for me? Like,
why do I have to look at the data? The truth is, it actually doesn't take that much time and
you learn a lot. And no, you can't completely automate it. Because at the end of the day,
like if you think about it logically, the only way that you can trust the AI is if you
look at it. Trust is not free, right?
Until we have AGI, until you generally trust the intelligence of something to that extent,
where you just willingly, just blindly trust its judgment.
That's like some other conversation.
Then you need to be looking at data.
People don't know that.
And so one area where that shows up, for example,
is LLM as a judge.
So people use LLM as a judge all the time.
But there's no point in using LLM as a judge
if you don't go through the exercise
in measuring the judge's agreement with you.
So that's a kind of an exercise you can go through to say,
OK, I'm going to judge some things.
I'm going to let the LLM judge judge those same things.
And I'm going to see whether I agree with it or not.
There's actual ways to do that that are very effective.
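For a rough sketch of what measuring judge agreement can look like — the labels here are invented, and the judge_label column stands in for whatever your judge prompt produced on the same examples you graded by hand:

```python
# Sketch: how often does an LLM judge agree with your own labels?
# The data is hypothetical; judge_label would come from running your judge
# prompt over the same examples you graded yourself.
import pandas as pd

graded = pd.DataFrame([
    {"example_id": 1, "human_label": "pass", "judge_label": "pass"},
    {"example_id": 2, "human_label": "fail", "judge_label": "pass"},
    {"example_id": 3, "human_label": "fail", "judge_label": "fail"},
    {"example_id": 4, "human_label": "pass", "judge_label": "pass"},
])

# Overall agreement rate between the judge and the human grader.
agreement = (graded["human_label"] == graded["judge_label"]).mean()
print(f"judge agrees with human labels on {agreement:.0%} of examples")

# Confusion table: where does the judge disagree, and in which direction?
print(pd.crosstab(graded["human_label"], graded["judge_label"]))
```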
Again, this is where people get stuck.
Very basic things, honestly.
Hearing you talk about it — all these AI, LLM,
are-we-close-to-AGI discussions
are getting so rampant.
And it sounds like the best way to solve your current AI problem
is to go back to pen and paper, almost.
Which is a funny sort of thing to think about.
I'm actually curious because you published a field guide
to improve AI applications, right?
Yeah, I actually tried to, I was publishing it
right before this podcast.
And then I deleted it because I messed up the social card
so I have to do it over after this.
But at least I've seen it, and it's really cool that you have a bunch of things in there.
The thing I got out of reading that post — because you already mentioned it, it's the very first bullet point — was the tools versus the process.
Hmm. I was like, wow, you read it that fast? Just quickly, just quickly.
No AI here — I'm using my own eyes to read it.
What I got from the general theme is everyone's process
may be roughly the same, but what you need will be domain specific.
So there's no one tool fits all.
Not just the tools you use to do the AI,
but even the tools you use to help you look at the data.
You have a bunch of different examples
of how you combine your current usage and context.
And you also need your experts to be in that process as well.
For people reading this, or people using AI today —
because LLMs are such a black box anyway, and so magical — you just assume you can dump
a bunch of things in and it would just work.
And that was kind of really the mindset at this point:
we just treat it as basically a magic black box.
And now you're like, okay, you really need to dig in
to understand it. That scares people — just that, basically,
how do I even get to understand and set things up?
My question is really — maybe you can help here — is there maybe an example of when you engage with
one of your customers that tried their own AI and it doesn't work, right?
It doesn't really get you what you want.
You said you just sit down with them and look at data, right?
And I'm sure you're getting these questions.
If there's maybe a very specific example — maybe one of your customers you listed, real
estate or whatever —
what is the sort of thing you have to kind of walk through with them
to say, okay, this is actually what you need, right?
Because maybe there's like a general process you start with,
looking at the error, look at data,
and maybe give an example like,
okay, we see this kind of categorical errors now.
This is what I suggest next.
And kind of walk through an example,
because I think people don't even understand, end to end, how to even improve anything right now
besides just buying more products or doing something very black-boxy.
Yeah, one example that I mentioned in that post —
it's called "A Field Guide to Rapidly Improving AI Products" — I put a video in there with one of my clients.
The name of the client is Nurture Boss.
It's an AI assistant for the apartment industry,
for like property managers, and it helps them with,
you know, like lead management, sales,
interactions with tenants, payments, stuff like that.
And you know, when they came to me, it was like, okay,
like we have this system, but you know,
like how do we make it better?
And so one of the first things we did is like,
we just started doing error analysis.
At first there was some resistance.
They were like, okay, I just paid you a bunch of money.
You want to look at data together?
He's like, this feels like it's like some kind of weird
wax on wax off exercise, like what is going on?
And he actually says that in the video embedded in the post.
You can hear him say that.
He's like, he's very skeptical.
Then we started looking at the data
and we're doing a very basic error analysis. There's many different
flavors of error analysis, but it's like, okay, one of the issues that we
saw right away is date issues — like, tenants would say, okay, I need to schedule
a meeting two weeks from now, and the date wasn't being handled correctly with
the scheduling. Another issue we saw is, you know, tenants would schedule a meeting for an apartment viewing,
but then say they wanted to reschedule.
But like rescheduling is not a function call
that exists in their platform.
But they would be like, no problem.
We have rescheduled this for you.
You know, things like that is like very concrete failures
that actually mattered and that we could quantify.
Like these are happening at a high frequency.
And then like either one,
we can just go fix them right away.
Or two, like, you know,
we might need, you know,
might want to like do some eval.
So for like the date handling thing,
okay, like what we did is we generated a bunch of
synthetic data that tested
all the different tricky edge cases that might come up
in the way that someone might express dates. Like all the different crazy shit that we could think of. You know,
leap years and not being precise with the dates or, you know, crossing year boundaries
or whatever the hell, like all kinds of stuff. And then, you know, we iterated on that
really fast. And it wasn't — you know, you can use an LLM to help you create synthetic
data. So this stuff is not that manual. Compare that to machine learning prior to generative AI —
that stuff was kind of painful.
You had to gather data meticulously,
and you had to do a lot of these things.
It was way more manual.
So to me, it's crazy that you don't do it because it's not hard.
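As one possible sketch of that synthetic-data step — the prompt, the model name, and the `resolve_date` stub below are placeholders, not the actual Nurture Boss pipeline:

```python
# Sketch: use an LLM to draft tricky date expressions, then check that your
# scheduling pipeline resolves each one to the expected date.
# The prompt, model name, and resolve_date stub are hypothetical.
import json
from datetime import date
from openai import OpenAI

client = OpenAI()

prompt = (
    "Today is 2024-12-20. Generate 10 tricky ways a tenant might express a "
    "meeting date (vague phrasing, leap years, crossing year boundaries). "
    'Return only a JSON array like [{"text": "...", "expected_date": "YYYY-MM-DD"}].'
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # any capable model; validate the output in practice
    messages=[{"role": "user", "content": prompt}],
)
cases = json.loads(resp.choices[0].message.content)

def resolve_date(text: str, today: date) -> str:
    # Placeholder: call your real date-handling / scheduling pipeline here
    # and return the ISO date it extracted. Stubbed so the sketch stays self-contained.
    return today.isoformat()

# Score the pipeline against the synthetic cases.
failures = []
for case in cases:
    got = resolve_date(case["text"], today=date(2024, 12, 20))
    if got != case["expected_date"]:
        failures.append((case["text"], got, case["expected_date"]))

print(f"{len(failures)} / {len(cases)} synthetic date cases failed")
```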
One of the things that we talk about in the blog post is,
looking at data is so important that you should build
your own data viewer specific to you so that you can on one
single page render all the metadata that matters for
a specific — in this case — customer interaction or a
customer thread, right? Like, render it very precisely and
highlight the information that matters to you.
Dial in the filters and everything, you know, because they want like a property filter and a channel filter,
like is it interacting through voice, text, you know, whatever.
You know, all these things is really important.
And it's like almost free because, you know, like Cursor, Lovable, whatever.
That's the exact thing that AI is really, really good at making.
You know, simple web applications that render data, you can't think of anything that AI
can do easier than that.
That's basically, you know, throwing a softball to AI.
So it starts there.
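A custom viewer like that can be very small. Here's one possible shape, sketched with Streamlit — the file name, columns, and filters are invented stand-ins for whatever metadata matters in your domain:

```python
# Minimal custom data viewer sketch (run with: streamlit run viewer.py).
# The file name and columns (property, channel, transcript) are hypothetical;
# the point is a one-page view of a thread plus the filters you care about.
import pandas as pd
import streamlit as st

threads = pd.read_json("threads.jsonl", lines=True)  # your exported traces

# Domain-specific filters in the sidebar.
prop = st.sidebar.selectbox("Property", ["All"] + sorted(threads["property"].unique()))
channel = st.sidebar.selectbox("Channel", ["All", "voice", "sms", "email"])

view = threads
if prop != "All":
    view = view[view["property"] == prop]
if channel != "All":
    view = view[view["channel"] == channel]

# Pick one thread and render the whole conversation plus its metadata.
thread_id = st.selectbox("Thread", view["thread_id"])
thread = view[view["thread_id"] == thread_id].iloc[0]

st.subheader(f"{thread['property']} · {thread['channel']}")
for turn in thread["transcript"]:  # list of {"role": ..., "text": ...}
    st.markdown(f"**{turn['role']}:** {turn['text']}")
st.json(thread.get("metadata", {}))  # tool calls, timings, model, etc.
```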
And you know, every single client I interact with, it makes a huge difference, like immediately.
It's kind of fun, actually.
It's like, let me spend 30 minutes looking at your data.
I'm going to tell you all kinds of stuff you didn't know.
But it's just counterintuitive, right?
It's like you're standing on a beach, and you ask me, OK,
what's the way we're going to transform this landscape?
I'm like, OK, let's pick up one grain of sand.
It doesn't feel like that's going to do anything.
But in this case, looking at your data makes a lot of sense.
Now, you don't have to look at all your data.
You don't have to look at every single piece of data.
You can sample your data.
You can cluster your data, and you
can explore different regions.
There's all kinds of ways to be more
sophisticated with the analysis.
But honestly, if you just start, then you know.
And people get really caught up in, like, oh my god, how many do I have to look at — without
even looking.
They're like, oh my god, like how many do you want me to look at?
And sometimes I just throw out numbers like, okay, start with 30.
I don't know the number.
The heuristic is keep looking until you're not learning anything new.
Once people start looking, no one asks me how many do I need to keep looking at?
Because they're like, wow, this is really valuable. It's like the gym thing, right?
It's like, you just got to do it. It's not that hard, honestly.
Yeah. Yeah. It's so, so interesting that it used to be ML data scientists doing these
exercises, and we'd hire a bunch of ML people. Now with AI,
I guess the whole world is on fire and everybody is assuming they can
just easily add AI to the mix of whatever they're doing.
But then you're talking about just empirical, simple analysis to start with.
But it does get complicated over time, though, depending on where you want to
go and how much of this really matters.
And I think there's so many angles of trying to get AI to work in your company.
Like you mentioned, there's domain experts trying to be part of your processes.
There's like a big piece you're talking about and a huge amount of your blog posts in the
end was more like data and evaluation and just like different types of ways to kind
of build your own toolbox of ways to test edge cases
and build trust around your evaluation systems and stuff like that.
I imagine just looking at the data alone is a little daunting,
but once you get used to it, it'll probably be a little better.
Building evaluations, though — this is a pretty new thing for most people, right?
And I think we are now at this point in the ecosystem where the value of AI is so high.
So therefore a lot of vendors, of course, are selling lots of high-value things —
evaluation products and all the kinds of judge products.
I'm curious, where do you see the gap between the products that are being sold, especially around like evaluation testing, because this is a pretty hot space as I see it.
I think the products are actually pretty helpful in a lot of ways.
It's really nice to be able to log your data to a system where you can see it,
that kind of has a nice UX and sort of has different things to get you started.
You know, a lot of the observability tools like the ones I mentioned,
like Braintrust, Arize, LangSmith, so on and so forth, to help you get started. And you know, you probably want to use
some kind of tool, honestly, it's just a matter of like how you use the tool, the tool is like just
there. And so, you know, for example, one of the things I talk about in my blog post is, okay,
like a lot of tools offer a prompt playground.
So a prompt playground is basically
you can have your prompt and maybe a data set
where you can template out, let's say,
different inputs into that prompt.
And you can play around with your prompt.
You can version that prompt, so on and so forth.
So that's really useful when you're starting out, right?
And a lot of these prompt playgrounds,
you can do multiple ones in parallel, you know?
You can compare them side by side.
And even sometimes you can put measurements in there
and do some tests or whatever.
The only issue is, I would say most prompts
don't live in a silo, okay?
They're part of your AI system.
So you have prompts, you might have RAG, you might have function calling,
and those are all application-specific code.
That's inherently part of your software.
It's going to be really difficult to have a prompt playground
from a third-party tool run your code.
Because to execute RAG, to execute function calls,
it has to run your code, and so on and so forth.
And so basically like what I
advocate for is like integrated prompt playgrounds.
Basically the same interface that your users see,
but basically an admin mode
where you can edit the prompt.
So that you can actually see what
your system is going to do, it's not in a silo.
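One minimal way to sketch that idea — the retrieval and model calls below are stubs standing in for your real pipeline; the point is that the prompt override flows through the same code path your users hit:

```python
# Sketch of an "integrated prompt playground": the production pipeline accepts
# an optional prompt override, so an admin view can edit the prompt and see
# what the whole system (RAG, tools, and all) actually does -- not a prompt
# tested in a silo. retrieve() and call_llm() are stubs for your real code.

DEFAULT_PROMPT = "You are a leasing assistant.\nContext:\n{context}\n\nUser: {question}"

def retrieve(question: str) -> str:
    return "(retrieved documents would go here)"  # stub for your RAG step

def call_llm(prompt: str) -> str:
    return f"(model answer for a {len(prompt)}-char prompt)"  # stub for the model call

def answer(question: str, prompt_template: str | None = None) -> str:
    # Normal traffic passes no override; the admin playground passes an edited
    # template and renders the result in the same UI the user would see.
    template = prompt_template or DEFAULT_PROMPT
    context = retrieve(question)
    return call_llm(template.format(context=context, question=question))

print(answer("Can I reschedule my tour?"))
print(answer("Can I reschedule my tour?",
             prompt_template="Answer tersely.\nContext:\n{context}\nUser: {question}"))
```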
But that doesn't mean the tools are not useful.
It's a good starting point and it's useful for people
before they have this integrated prompt playground.
But these are the kind of things like tools are useful
but you should figure out what works for you
and where you are in your journey.
And there's different parts of these tools
that are helpful along all parts of the journey.
So yeah, I like them.
Yeah, but I guess my question is — and I guess you already alluded to it a little bit — but how
do you think these tools, or evaluation as a whole, can be like 90%, 95% automated? Because
that's the promise of these products. It's almost like these will take over more and
more. Maybe the judge will be so much more powerful, or there's SDKs or generated SDKs, and just somehow,
magically, these things would become less consulting- and service-based.
Yeah. It feels like these tools have to be bundled with services quite a bit, actually, to even implement
it correctly sometimes. So I think right now they're 25% automated. Let's say in the absence of AGI, I think you can automate it maybe like 60%,
maybe 65%, maybe even 70%.
So there is like a lot of things that vendors can do
to make the process more automated,
but there is a piece of it in there that you can't,
there's no free lunch.
Like you have to look at the data and provide feedback,
and you have to inject your taste into your product.
Like you have to communicate to the AI in some way
to inject your taste.
And it's actually well known from several different studies,
and I also cite this in a blog post,
is like people are inherently not good
at specifying their tastes and criteria upfront. You have to react to
what you see to even know what that is yourself, right? And so the only way to do
that is to interact with the AI and give feedback. You're gonna have to have some
amount of either services or education because people, you know, will have to do
that process no matter what.
And so I really want to ask you this.
And I feel like in most of your posts —
and correct me if I'm wrong, I haven't read every single word you wrote —
Okay.
The word agent doesn't come up that often yet, you know?
And as you know, the whole world is crazy about agents now.
Like, I feel like LLM is just the past already, you know?
Like, we're already in this agents-are-everything era.
Yeah.
And I think definitely
there are a lot of people in companies both selling agent-specific frameworks and products
or adopting agents everywhere. And I'm sure people were probably asking you, like,
hey, help me with my agents. Do you feel like agents are something where there's
another way to look at this whole approach of helping to figure out the problems and accuracy stuff?
Or it's really pretty much the same?
Because the complexity now is like —
the agents are sometimes, of course,
all in prompt and model space, right?
But they're doing more interactions.
And more and more of them are also doing automations as well.
And so it gets a little bit more hairy
on, like, does it actually do the thing or not?
So do you have opinions on how you'd go about improving agent quality overall, and is it the same methodology?
It's really the same thing. Honestly, it actually makes evals even more important
It makes the process more important because now you have more complexity
So, you know, Tim — you know from engineering that you need to justify the complexity.
You shouldn't just take it on, you know?
Like it has to have some payoff.
And like, so how do you know that there's a payoff?
Like you should do the simplest thing that works.
I think everyone can agree on that.
And so the way to keep yourself honest is evals.
Now, another thing is as you ramp up the complexity,
you're gonna have more surface area.
With more surface area, you will have more failures.
So if you just take a blind approach to using
generic methods to try to figure out how to make your AI better,
you're going to get lost really fast
because there's going to be way too much noise.
So if you go through these processes,
it will cut through all the noise and you'll
figure out exactly what is wrong, and
then you'll also know where to fix it — as opposed to, like, oh, my hallucination score is
whatever, or my toxicity score is whatever. Do you even have a toxicity problem? Do you even have a hallucination problem?
I don't know. So, you know, these things lead people astray.
And I think they lead them even more astray
when the complexity is high.
So I think it's like, it doesn't really matter
like what you're using.
We were already living on the spectrum of agents
before the word agent, as far as I'm concerned.
What is an agent?
Is an agent something that can act
on your behalf autonomously?
There's no like agreed upon definition,
but there's a blog post from Anthropic
that's like different levels of agent.
I don't remember the exact levels,
but we were already using a few of those levels
before the word agent was popularized.
So I think it's really the same
in terms of how you think about testing.
I'm kind of curious when you think about,
I agree with you, there's a spectrum.
And I think of like agent level zero is the same as,
I always use the self-driving car analogy,
which is like level zero, self-driving,
just the human with a pedal and a steering wheel, right?
And like, if you think about it broadly speaking,
actually what we were building,
if you think of like, I was building a service,
I don't even know if that was like level zero,
it might've been level one,
like you're already still automating something
you would have had to do that was highly static.
It's just that what that thing could do was static,
and now we've kind of moved up the curve towards agency.
It's like your thing gains broader and broader ability
to do dynamic things and do those dynamic things
asynchronously according to some like trust boundary, right?
Some trust equations.
It's like, I trust this thing to operate
within this box on its own, you know,
whatever it decides to do, but this is the box. That said,
I'm really curious from your perspective, because you talked about how,
you know, getting to level five — let's say a level-five agent,
you can tell it a one-line sentence and it goes off and does it for five years,
comes back to you and, you know,
has built you a house, ordered a lot of stuff, figured it all out, and designed
the whole home or whatever it's done. How do you think about, you know,
the way that software engineers dealt with complexity in software applications
and modularization?
We tried to modularize the layers of abstraction
and that helped us think through some of these,
like the complexity box,
and then you have unit tests, integration tests.
And that's how we've been able to build
like incredibly complex software
with millions and millions and millions
and millions of lines of code
that do very complex things in a very trusted way.
I'm curious, like, does that model of abstraction
translate to the way that you think through
how to build more complex apps backed by LLMs?
Is it similar ways, is that where the mixture of expert stuff comes in,
or is there more to be done?
I'm kind of curious to get your perspective,
considering you're out there doing it.
Yeah, it's an interesting question.
Like, okay, to me, agents, I don't really
focus too much on the definition of it, to be honest.
It's really about capabilities.
Again, I try to do the simplest thing that works.
I don't even try to define it, honestly.
To me, the only thing that matters is, OK,
does it do what you want to do?
How you do that is kind of not as interesting to me.
So I haven't spent too much time thinking about definitions for whatever reason.
Maybe I'm just weird.
Sounds good, sir.
Well, I think this is the perfect time.
Let's go on to the spicy future.
You've been working with a bunch of customers.
I think you're seeing the front row view of how people are struggling and using this stuff.
And of course, you know, there's the other side — the marketing that's happening all the time.
So give us your spicy hot take.
What is something you think you believe that most people don't believe yet?
Yeah, maybe it's not as spicy after this conversation.
But you know, like, most people don't know how to look at data and don't have data literacy, and that stops them
from making progress.
And I think most people don't know that.
I kind of don't know if I like the word data literacy because it feels like an insult to
have the word literacy in there.
Like, you're not literate — and I don't really like that aspect; it feels bad.
But I mean, it's a word that exists. And, you know, people feel like, hey, we have all this
AI — why do I need to look at the data?
It seems very counterintuitive.
Can't something somewhere automate that for me?
Why do I need to do this?
But yeah, people don't know how to do that, where to start.
There's not that much education about evals out there.
Certainly, the foundation model labs —
they are really good at evals for their foundation models.
But for whatever reason, in terms of education and sort of just broadly
speaking for domain specific situations, there's not that much guidance out there
for people like how to do evals. And I don't know why.
It's kind of like the dark art, though. The super secret sauce of a great ML team
has always been sort of the eval framework.
And I mean, I always have thought about evals
like the integration tests for a model
for lack of like a better thought process.
I'm curious.
I mean, it sounds like what you're saying is basically
at the end of the day, like if we're gonna have
large adoption of LLMs as core parts of products
and workflows in the future,
we need to take the vast majority of developers today — who have, whether we use the word data literacy or whatever word we want to use, very little experience doing what we might
have called some formulation of data science, and no broad understanding of statistics — and teach them
about statistics and how to generate good evals and how to use those evals as a part of their workflow.
Do you think that's a result of the fact that most people never get to building
sophisticated enough things that they need such complicated evals?
And so now this is just a new thing.
Like how many organizations in the world would have had to do this
prior to like 2022, right?
Like, not many organizations actually had sophisticated machine learning use cases,
much less sophisticated machine learning use cases at scale.
So I'm kind of like, I look at them just like it's a chicken and egg thing.
It's like, well, there's no reason to do it because it was really hard to figure out where
to even apply machine learning in the first place.
And so now we just have this new thing that like has changed the dynamics of the applicable
use cases — gone from like one in a million to one in a thousand use cases, whatever it is, some
drastic expansion in terms of the number of use cases you can apply this stuff.
And it just turns out that actually the skill set you need to do this well was something
you'd only ever learn in a place you got to actual scale with, which were very few places
in the first place.
Yeah. So, I mean, I think it's important, like, you don't need necessarily to learn the full
breadth of statistics.
Even saying like, hey, you need to be like a data scientist doesn't feel right.
Just that word alone, I think, feels pretty scary, like data scientist.
Oh my God.
Like, what are all the things?
If you Google "data scientist," you can get slapped in the face with a curriculum that is gonna overwhelm you.
And I don't think you need that to begin.
What I'm saying is very basic stuff, like just counting. And if you really dig deep into data science,
and you ask a lot of data scientists, like, Hamel said that counting is really important —
they'll probably nod their head and be like, yeah, that's like 80%.
So, you know, I'm trivializing it, right?
Like counting, like, okay, like,
what's so hard about counting?
It's not really just the counting,
it's like, what do you count?
What questions do you ask?
You know, building that muscle.
For example, working with this company
that provides like these study cards, like Anki cards,
and you can like search for them, you know, semantically.
And they had some idea that maybe the retrieval was not as good as they wanted it to be — the
semantic search.
And they had a data set that was labeled to some extent of, okay, search queries with
relevancy scores by hand that they did of a handful of queries.
And so I asked — I'm just giving you a concrete example of asking questions —
okay, give me like a month's worth of queries,
and then also let's do a little bit of analysis on
this data that you graded. And so I started asking questions,
Like how many queries are keyword searches?
Like, what is the median length of a query? So on and so forth.
This is the way I was trained, right — to ask questions. And so you would see
really fast, like, okay, 30% of your queries are keyword searches. So what if, instead of
doing semantic search, you did keyword search? Are the cards that are returned
more relevant? Turns out, according to the graded dataset, yes — a lot more relevant
for the 30% that's keyword search. And I'm like, okay, the median query length in terms of tokens or
words, it's like something around 200.
So it's like, okay, you know what, people are not asking questions,
they're just copying and pasting their textbook in here, or
copying and pasting some slide.
And so maybe we need to do query rewriting.
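The counting itself is tiny. Here's a sketch of that exercise with pandas — the file, column names, and the keyword heuristic are hypothetical, but the questions are the ones described above:

```python
# Sketch of the counting exercise on a hand-graded query set.
# File and column names are hypothetical: query, relevance_semantic, relevance_keyword.
import pandas as pd

queries = pd.read_csv("graded_queries.csv")

# What fraction of queries look like keyword searches (short, not a question)?
def looks_like_keywords(q: str) -> bool:
    return len(q.split()) <= 4 and not q.strip().endswith("?")

queries["is_keyword"] = queries["query"].map(looks_like_keywords)
print("keyword-style queries:", queries["is_keyword"].mean())

# How long are queries, really? A large median suggests pasted slides/textbook.
print("median query length (words):", queries["query"].str.split().str.len().median())

# For the keyword-style slice, which retrieval mode returns more relevant cards
# according to the hand-graded relevance labels?
keyword_slice = queries[queries["is_keyword"]]
print(keyword_slice[["relevance_semantic", "relevance_keyword"]].mean())
```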
Now, pre-LLMs, that exercise would maybe take me,
I don't know, like 45 minutes,
but it takes me like maybe a minute now,
because I can just, I can say like,
hey, like, please do this analysis, write the code,
and I can check the code obviously, but it's very painless.
You know, you have to know what questions to ask.
Some wise person will
like be listening to the podcast and be like, okay, this is stupid. Like the hardest thing
in life is to know what questions to ask. So I'm just trivializing it. But what I'm
saying is it's accessible with practice. You know, just even counting, knowing what questions
to ask — that takes you incredibly far. So, like, okay, the result of that was, okay,
we knew that 30% of the time it was keyword search,
so route it to keyword search, because we know
when it's a keyword search. And then,
okay, maybe let's do query rewriting.
And by the way, hybrid search —
let's benchmark it
just based on this data you graded.
Look, there's a result, whatever.
All these different things within the span of
a very short amount of time.
This is very valuable, right? Because if you're stuck,
you can answer a lot of questions with like some basic data analysis and counting.
I don't know if that example is useful, but again, like it's some process of like
education and teaching people. I think it will happen. It'll just take a little bit of time.
To be honest, this is probably one of the most valuable and fun conversations around AI
for me personally, you know, because instead of just keeping on talking about
what's the latest and greatest hype, the reality of making stuff work comes
down to literacy and counting.
I feel like I'm teaching my son to drop his phone and
just do the basics, right?
Yeah.
It's the most uncool things like counting, evals, you know?
But yeah, that's yeah.
Yeah, it's so amazing.
I know you're not into nicknames yet, or you don't have a nickname,
but I really feel like you should be called the AI doctor here, almost, you know —
like you're here and you're just like, hey, everybody calm down, and this is what's happening.
Go back, bring your pen and paper, and let's count, you know, and ask the right questions.
We have so much we could ask, but you know, we're running out of time.
Where can people find you?
What are the social channels and stuff?
Yeah.
The best place to find me is my blog, hamel.dev.
And so you can find all of my contact information there.
I put everything there.
So you can go into the rabbit hole from there if you like.
Amazing.
Thank you so much, Hamel.
It's been such a pleasure.
Yeah, likewise.
Thank you.