The Infra Pod - Helping enterprises to adopt AI the real way - Chat with Hamel Husain

Episode Date: June 3, 2025

Ian and Tim host a conversation with Hamel Husain, an AI influencer and consultant who has worked with major tech companies like GitHub and Airbnb. They delve into Husain's journey to becoming an independent consultant, his insights on the common struggles with AI evaluation and data literacy, and his pragmatic approach to improving AI systems. The discussion also covers the importance of error analysis, measuring AI performance, and the evolving role of AI engineering in enhancing software development.

00:23 Hamel's Background in AI and Machine Learning
01:18 Challenges in AI and Machine Learning
02:15 Importance of Data Literacy in AI
04:02 Common Struggles in AI Development
05:05 The Role of AI Engineers
08:39 Focus on Processes Over Tools
11:10 Error Analysis and Practical Examples
15:52 Case Study: Improving AI Products
26:05 The Role of Agents in AI

Transcript
Starting point is 00:00:00 Welcome to the Infra Pod. This is Tim from Essence and Ian, let's go. This is Ian, user of LLMs. Although I don't know how they work, can't wait to talk to Hamel today, the current it boy of LLMs. Hamel, tell us all about yourself. ...into what you're currently doing, I led research at GitHub that led up to GitHub Copilot, called CodeSearchNet. And then at some point I decided to, you know, people started reaching out to me for help with LLMs and I became an independent consultant. It's really interesting because I get to see a lot of different use cases and what's working and what's not working
Starting point is 00:01:01 and get to help people build AI products. And yeah, that's a little bit about me. So, I mean, you're pretty well known now. You've done some courses, you've done a bunch of tweeting, done a bunch of LLM stuff. I'm curious, you and I've talked at length, and one of the things you said to me, just to kick us off, is that most people building with LLMs you end up working with these days for your consultancy, they struggle with like basic evals. So can you expand upon where you think the current world is at in terms of their ability to actually use,
Starting point is 00:01:33 you know, this newfangled machine learning technology, where they find the hiccups, and what you're witnessing is sort of like where people are actually at in terms of their ability to use it and how much value they can actually extract from these things based on that? Yeah. So, okay, like a key aspect of working with any machine learning AI system, like pre-generative AI and post-generative AI doesn't really matter, is that you have this like stochastic system, right, giving you outputs of some kind, and you have to reason about that. You have to figure out, is it doing what you want?
Starting point is 00:02:05 And how do you improve it? That's kind of table stakes for building AI products. If you can't improve it, then you can't really make any progress. And you're going to get left behind. So measuring something is key to improving it. And it is the case that you don't need a lot of the skills that you used to need for leveraging AI.
Starting point is 00:02:26 Before you needed to maybe be a machine learning engineer and know a lot more about how models work, kind of like the details of all of these things. And so there's definitely a portion of those skills that a lot of people don't need. But there is a portion of those skills that you absolutely do need. That I would lump it in this category of data literacy. So while you might not need to understand
Starting point is 00:02:51 how the model works, you do need to have like really strong skills on how to analyze the performance of your AI system, how to poke at it, how to measure it, how to explore and quickly have a nose for data and figure out what's going on. And then also being able to design metrics that make sense and being able to take this blob of, let's say, stochastic outputs and not be overwhelmed by it, which there's a long history of like a systematic process you can apply to do this correctly. So that's where the people are struggling is like you have this really large
Starting point is 00:03:33 influx of software engineers working on AI, which is excellent because again you don't need to be a full-blown machine learning engineer to make AI products. That's really clear. But the thing that a lot of people don't have in their training or their background is like, okay, how do you do data analysis? How do you measure things? How do you test these stochastic systems? And that's the key failure mode that everyone's experiencing. And that's when people reach out to me. It's like, hey, we built this prototype. It kind of works, but we're sort of stuck on how to make it better.
Starting point is 00:04:11 How do we make it work really, really well? We tried XYZ things, we tried a lot of people. The software engineering mindset a lot of times is like, okay, let's try different tools. Let's try a different vector database. Let's try a different agent orchestration system. Let's try whatever. That becomes a game of whack-a-mole.
Starting point is 00:04:32 And you tend to spin your wheels on that without having some sort of more principled tests that you can rely on. That's not just vibes. Vibes are useful. They can take you to a certain point, but then you plateau really fast. So that's essentially what everyone is struggling with. So it's
Starting point is 00:04:49 like, okay, how do we rip out the data literacy component of, you know, like machine learning and data science and kind of bring the pieces that matter into AI engineering? That's where I spent a lot of time. And this is sort of, you know, Sean, more commonly known as Swyx, and Alessio of Latent Space have sort of termed this concept AI engineering. And I think they define AI engineers basically, it's like you took a full stack dev, you brought in some data science skills, and then you're like, go build products with LLMs. Does that match your mental model of sort of what people need to become, in this sort of concept of AI engineers, in order to utilize these things?
Starting point is 00:05:29 Basically what you're saying is the default software engineers spend a lot of time building very deterministic systems, where they didn't spend any time thinking about things like data quality or distribution, or how the quality of the data impacted what could be predicted from it, right? They're much more interested in how many rows can be pulled back by some query or something along those lines. So what do you think the future of software engineering looks like, given that it's likely that LLMs will be at the core of many of our apps? And how does this concept of AI engineering fit into that? And how is that concept sort of like, if you were to become one or you were to take on whatever the skill sets are
Starting point is 00:06:05 that an engineer has, how that actually gives you superpowers to take advantage of these new tools or features. Yeah, so OK, let me try to break down the question a little bit. So yeah, Swyx has done an excellent job, like kind of creating this community of AI engineers. And it's really good, because it's like some kind of way for people to gather and
Starting point is 00:06:28 like exchange thoughts. It's really hard to like get your hands around all the skills. It is true, like to get started, the full-stack engineering skills are really important. And I'll actually admit that a lot of data scientists, machine learning engineers, are kind of weak at the full stack engineering stuff. So going from like zero to one, it's actually really good that you're a full stack engineer and you have that mindset. And the tools are very different. It's not really about tools. Tools and culture and knowledge are all intermingled very tightly, you know, like they have different communities, right? So it's like TypeScript doesn't have the greatest data tools, you know, but Python is not as mature in like, its web frameworks and, you know, product engineering stuff, you know, and people might get mad about this, like, there's people that are like, I know
Starting point is 00:07:20 Python is fine. But you know what I mean, right? It's like those tools have different purposes. And it's not a surprise, right? You need both skills. But on the other hand, I think you can learn them. Basic data literacy, I think people can learn. It's not too onerous. We're not talking about the full breadth of everything
Starting point is 00:07:38 you need to know from machine learning engineering. Talking about very basic things, like even counting and manipulating data is really helpful. I don't know if that answers the question or not. I think that it definitely helps answer some of the question. The next question I have is, with a lot of the clients who are coming in, they're asking you basically, how can I improve these systems? How can I make them more efficient? Basically they're like, how can I make them more accurate? Right? At the end of the day, how can I make it respond more correctly with some definition
Starting point is 00:08:08 of correct? And that question basically isn't a question about, am I using Postgres or have I vertically or horizontally scaled it? It's a question of what systems you have, data systems you've built, to quantify and measure effectiveness and improve effectiveness over time of the data set that things are being trained on. So it's like, how do I use things like in-context learning? How do I do RAG? How do I do these different things? Is there a part of that stack
Starting point is 00:08:34 that people struggle with the most? Like that you see of this equation people struggle with the most? Yeah, the thing that people struggle with the most is they focus on tools, not processes. So 99% of the time when I engage with someone who's stuck, they say, hey, I need help improving my AI. And they immediately launch into, okay, here are my agent architecture. Here's my tools. Here's my I'm using whatever
Starting point is 00:08:59 Lance, you know, LanceDB or Chroma or whatever, and I'm doing this kind of hybrid search with like this weighting here and this parameter here. And they give me this like kind of avalanche of tools focus. And the first question is like, what tools do I use? It's really pervasive and it goes really deep. And people will say, oh, hey, like, I heard that you said that we should focus on processes, not tools,
Starting point is 00:09:23 because I repeat the same thing all the time, not just on this podcast. And they say, okay, like, great, we want evals. What tools should I use for that? Should I use Braintrust? Should I use LangSmith? Should I use Arize? What tools should I use?
Starting point is 00:09:35 And then even when they start using those tools, like, okay, you know, like what model is the best model to do evaluations? So all of these questions are a smell that you have the wrong mindset. It's like saying, I wanna lose weight, what gym membership should I buy? What gym membership you should buy
Starting point is 00:09:55 is the least important question you should be asking yourself. You need to go to the gym. You need to figure out the process, and that is where people fail. It's like, wait, don't show me any agent architecture diagrams or really anything. We need to measure, okay, what is working and what's not. You should go through the process of doing that, and it's a process.
Starting point is 00:10:17 A lot of times you don't need tools. Tools can help a little bit, just like a gym can help you. But just going through the process with an Excel spreadsheet, with a Jupyter notebook, with just like a REPL, it doesn't really matter. Going through the right process of looking at your data, doing basic things like error analysis, figuring out like what actual errors are occurring in your AI system, and then justifying every bit of complexity that you have instead of just kind of looking at tools. I think that that's where people get stuck, because they don't know.
Starting point is 00:10:49 And so there's a structured way to look at your data. People are like, the question comes up really fast. How much data do I look at? Or what part of my data do I look? You know, this system is complicated. There's RAG. There's tool calling. There's agent orchestration.
Starting point is 00:11:04 Do I measure everything? Do I test everything? And there is a concrete answer to all of these things. It starts with error analysis. Error analysis is not something I made up. It's been around in machine learning for decades. And that's how people debugged AI for the longest time. And actually it works really well. What does it mean? It's like you look at the data physically, like you put your eyes on it, you go row by row
Starting point is 00:11:30 or trace by trace, and you write notes about what errors you may or may not be seeing. And then what you do is you categorize, and you can use an LLM to help you categorize the kinds of errors you're seeing and then count those. It's a simple exercise, but it actually drives insane clarity. That's just an example.
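To make the counting step concrete, here is a minimal sketch of that error-analysis tally in Python, assuming you have exported your hand-reviewed traces and free-form notes to a CSV; the file name, column names, and categories are all hypothetical.

```python
import pandas as pd

# One row per trace you reviewed by hand: an identifier, the conversation (or a link
# to it), and the free-form note you wrote while reading it.
traces = pd.read_csv("reviewed_traces.csv")  # hypothetical columns: trace_id, conversation, note

# After reading the notes, collapse them into a handful of error categories.
# (An LLM can help propose and apply these categories; you still spot-check its work.)
def categorize(note: str) -> str:
    note = note.lower()
    if "reschedul" in note:
        return "missing_tool:reschedule"
    if "date" in note or "schedul" in note:
        return "date_handling"
    if "tone" in note:
        return "tone"
    return "other"

traces["error_category"] = traces["note"].fillna("").map(categorize)

# The payoff is just counting: which failure modes show up most often?
print(traces["error_category"].value_counts())
```

The categories always come out of your own notes; the value is in the counts, which tell you where to spend your time first.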
Starting point is 00:11:49 There's a lot of different techniques like that, like little techniques that are simple. Like I said, you can use an Excel spreadsheet a lot of times. You can use a pivot table. It doesn't have to be crazy. You can use a notebook, okay, or whatever. You can use whatever the hell you want. People don't want to do that.
Starting point is 00:12:04 They're like, well, can't something just automate looking at the data for me? Like, why don't you look at data? The truth is, it actually doesn't take that much time and you learn a lot. And no, you can't completely automate it. Because at the end of the day, like if you think about it logically, the only way that you can trust the AI is if you look at it. Trust is not free, right? Until we have AGI, until you generally trust the intelligence of something to that extent, where you just willingly, just blindly trust its judgment. That's like some other conversation.
Starting point is 00:12:39 Then you need to be looking at data. People don't know that. And so one area where that shows up, for example, is LLM as a judge. So people use LLM as a judge all the time. But there's no point in using LLM as a judge if you don't go through the exercise in measuring the judge's agreement with you.
Starting point is 00:12:56 So that's a kind of an exercise you can go through to say, OK, I'm going to judge some things. I'm going to let the LLM as a judge judge those same things. And I'm going to see whether I agree with it or not. There's actual ways to do that that are very effective. Again, this is where people get stuck. Very basic things, honestly.
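As one illustration of measuring agreement with an LLM judge, here is a small sketch in Python; it assumes a hypothetical judge_labels.jsonl file where each record holds your own pass/fail label and the judge's label for the same output.

```python
import json

# Hypothetical export: one JSON object per example, with the model output you graded,
# your own pass/fail label, and the LLM judge's pass/fail label for that same example.
with open("judge_labels.jsonl") as f:
    records = [json.loads(line) for line in f]

agree = sum(r["human_label"] == r["judge_label"] for r in records)
print(f"Raw agreement with the judge: {agree / len(records):.0%} over {len(records)} examples")

# Also look at agreement split by your own label: a judge that rubber-stamps
# everything as "pass" can still show high raw agreement if most outputs are fine.
for label in ("pass", "fail"):
    subset = [r for r in records if r["human_label"] == label]
    if subset:
        hits = sum(r["judge_label"] == label for r in subset)
        print(f"Judge matches you on {hits}/{len(subset)} of the examples you labeled '{label}'")
```

If agreement is low, you iterate on the judge prompt (or your own criteria) before trusting the judge's scores at any scale.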
Starting point is 00:13:19 Hearing you talk about it, all these AI, LLM, are-we-close-to-AGI discussions are getting so rampant. And it sounds like the best way to solve your current AI problem is to go back to pen and paper, almost. Which is a funny sort of thing to think about. I'm actually curious, because you published a field guide to improving AI applications, right? Yeah, I actually tried to, I was publishing it
Starting point is 00:13:41 right before this podcast. And then I deleted it because I messed up the social card, so I have to do it over. But at least I've seen it, and it's really cool that you have a bunch of things here. The thing I got out of reading that post, because you already mentioned it, on the very first bullet point, was the tools versus the process. Hmm. I was like, wow, you read it that fast. Just quickly, just quickly. I know, AI here. I'm using my AIs to agree with it. What I got from the general theme is everyone's process
Starting point is 00:14:14 may be roughly the same, but what you need will be domain specific. So there's no one tool fits all. Even from the tools you use to do the AI, but even the tools you use to help you look at the data. You have a bunch of different examples of how you combine your current usage and context. And you also need your experts to be in that process as well. People reading this, or people using AI today,
Starting point is 00:14:40 because LLMs are such a black box anyways, and so magical, you just assume you can just dump a bunch of things in and it would just work. And that was kind of really the mindset at this point, is like, we just treat it as basically a magic black box. And now you're like, okay, you really need to dig in to understand. That scares people, just that basically, how do I even get to understand and set things up? My question is really, maybe you can, is there maybe an example of how, when you engage with
Starting point is 00:15:08 one of your customers that tried their own AI and it doesn't work, right? It doesn't really get you what you want. You said you just sit down with them and look at data, right? And I'm sure you're getting these questions. If there is maybe a very specific example, maybe one of the customers you listed, real estate or whatever, what is the sort of thing you have to kind of walk through with them to say, okay, this is actually what you need, right?
Starting point is 00:15:31 Because maybe there's like a general process you start with, looking at the errors, looking at data, and maybe give an example like, okay, we see this kind of categorical error now, this is what I suggest next. And kind of walk through an example, because I think people don't even understand how, end to end, to even improve anything right now, besides just buying more products or doing something very black-boxy.
Starting point is 00:15:51 Yeah, one example that I mentioned in that post, it's called A Field Guide to Rapidly Improving AI Products. I put a video in there with one of my clients. The name of the client is Nurture Boss. It's an AI assistant for the apartment industry, for like property managers, and helps them with, you know, like lead management, sales, interactions with tenants, payments, stuff like that. And you know, when they came to me, it was like, okay,
Starting point is 00:16:15 like we have this system, but you know, like how do we make it better? And so one of the first things we did is like, we just started doing error analysis. At first there was some resistance. Be like, okay, I just paid you a bunch of money. You want to look at data together? He's like, this feels like it's like some kind of weird
Starting point is 00:16:34 wax on, wax off exercise, like what is going on? And he actually says that in the video embedded in the post. You can hear him say that. He's like, he's very skeptical. Then we started looking at the data and we're doing a very basic error analysis. There's many different flavors of error analysis, but it's like, okay, you know, one of the issues that we saw right away is like date issues, like tenants would say, okay, I need to schedule
Starting point is 00:16:56 a meeting two weeks from now, and the date wasn't being handled correctly with the scheduling. Another issue we saw is like, you know, tenants would schedule a meeting for an apartment viewing, but then say they wanted to reschedule. But like rescheduling is not a function call that exists in their platform. But they would be like, no problem. We have rescheduled this for you. You know, things like that is like very concrete failures
Starting point is 00:17:20 that actually mattered and that we could quantify. Like these are happening at a high frequency. And then like either one, we can just go fix them right away. Or two, like, you know, we might need, you know, might want to like do some eval. So for like the date handling thing,
Starting point is 00:17:36 okay, like what we did is we generated a bunch of synthetic data that tested all the different tricky edge cases that might come up in the way that someone might express dates. Like all the different crazy shit that we could think of. You know, leap years and not being precise with the dates or, you know, crossing year boundaries or whatever the hell, like all kinds of stuff. And then, you know, we like iterated on that really fast. And you know, you can use an LLM to help you create synthetic data. So this stuff is not that manual.
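Just as a sketch of what that kind of synthetic date data can look like, here is a small hand-written seed set in Python; the scheduling-assistant call is a placeholder, and in practice you would also use an LLM to paraphrase and expand these into many more natural variations.

```python
from datetime import date, timedelta

today = date(2024, 2, 20)  # pin "today" so every case has a deterministic expected answer

# Tricky ways someone might express a date, paired with the date the system should book.
cases = [
    ("two weeks from now",             today + timedelta(weeks=2)),  # relative phrasing
    ("this coming Friday",             date(2024, 2, 23)),           # weekday reference
    ("the 29th",                       date(2024, 2, 29)),           # leap-year day that only sometimes exists
    ("the first Monday of next month", date(2024, 3, 4)),            # crosses a month boundary
    ("the week after Christmas",       date(2024, 12, 30)),          # near a year boundary
]

for utterance, expected in cases:
    prompt = f"Today is {today.isoformat()}. Tenant: 'Can I tour the apartment {utterance}?'"
    # reply = scheduling_assistant(prompt)   # placeholder for the real system under test
    # ...then check that the booked date in `reply` equals `expected`
    print(prompt, "->", expected.isoformat())
```

Each case encodes one of the edge cases mentioned here, with an expected answer you can assert against every time the prompt or model changes.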
Starting point is 00:18:08 Compared to this, machine learning prior to generative AI was kind of painful. You had to gather data meticulously, and you had to do a lot of these things. It was way more manual. So to me, it's crazy that you don't do it, because it's not hard. One of the things that we talk about in the blog post is, looking at data is so important that you should build your own data viewer specific to you, so that you can on one
Starting point is 00:18:31 single page render all the metadata that matters for a specific, in this case, customer interaction or customer thread, right? Like render it very precisely and highlight the information that matters to you. Dial in the filters and everything, you know, because they want like a property filter and a channel filter, like is it interacting through voice, text, you know, whatever. You know, all these things are really important. And it's like almost free because, you know, like Cursor, Lovable, whatever.
Starting point is 00:19:01 That's the exact thing that AI is really, really good at making. You know, simple web applications that render data, you can't think of anything that AI can do easier than that. That's basically, you know, throwing a softball to AI. So it starts there.
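For a sense of scale, a bespoke viewer like the one being described can be a few dozen lines; this is a rough sketch using Flask and a JSONL trace log, where the field names (property, channel, conversation, reviewer_note) are made up for illustration.

```python
import json
from flask import Flask, request, render_template_string

app = Flask(__name__)

# Hypothetical trace log: one JSON object per line with whatever metadata matters to you.
with open("traces.jsonl") as f:
    TRACES = [json.loads(line) for line in f]

PAGE = """
<h1>Traces ({{ traces|length }})</h1>
<form>channel: <input name="channel" value="{{ channel }}"> <button>filter</button></form>
{% for t in traces %}
  <hr>
  <b>{{ t.get("property") }} / {{ t.get("channel") }}</b>
  <pre>{{ t.get("conversation") }}</pre>
  <i>note: {{ t.get("reviewer_note", "") }}</i>
{% endfor %}
"""

@app.route("/")
def index():
    # Dial in whatever filters matter: property, channel (voice or text), date range, ...
    channel = request.args.get("channel", "")
    shown = [t for t in TRACES if not channel or t.get("channel") == channel]
    return render_template_string(PAGE, traces=shown, channel=channel)

if __name__ == "__main__":
    app.run(debug=True)
```

The point is not this particular stack; it is that rendering your own traces, with your own filters, is cheap enough that there is little excuse not to have it.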
Starting point is 00:19:26 And you know, every single client I interact with, it makes a huge difference, like immediately. It's kind of fun, actually. It's like, let me spend 30 minutes looking at your data. I'm going to tell you all kinds of stuff you didn't know. But it's just counterintuitive, right? It's like you're standing on a beach, and you ask me, OK, what's the way we're going to transform this landscape? I'm like, OK, let's pick up one grain of sand. It doesn't feel like that's going to do anything. But in this case, looking at your data makes a lot of sense. Now, you don't have to look at all your data.
Starting point is 00:19:48 You don't have to look at every single piece of data. You can sample your data. You can cluster your data, and you can explore different regions. There's all kinds of ways to be more sophisticated with the analysis. But honestly, if you just start, then you know. And people get really caught up in like, oh my god, like how many do I have to look at, without
Starting point is 00:20:08 even looking. They're like, oh my god, like how many do you want me to look at? And sometimes I just throw out numbers like, okay, start with 30. I don't know the number. The heuristic is keep looking until you're not learning anything new. Once people start looking, no one asks me how many do I need to keep looking at? Because they're like, wow, this is really valuable. It's like the gym thing, right? It's like, you just got to do it. It's not that hard, honestly.
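One way to turn that heuristic into something mechanical is a saturation loop: sample a batch, note the error categories you see, and stop once a couple of batches in a row teach you nothing new. This is only an illustrative sketch in Python; the batch size and patience are exactly the kind of numbers he says not to agonize over.

```python
import random

def review_until_saturated(traces, batch_size=30, patience=2):
    """Review random batches of traces and stop once `patience` batches in a row
    surface no new error categories (the 'not learning anything new' heuristic)."""
    random.shuffle(traces)
    seen, quiet, i = set(), 0, 0
    while i < len(traces) and quiet < patience:
        batch = traces[i : i + batch_size]
        new = set()
        for trace in batch:
            note = input(f"{trace}\nerror category (leave blank if fine): ").strip()
            if note:
                new.add(note)
        quiet = quiet + 1 if new <= seen else 0  # no new categories this batch
        seen |= new
        i += batch_size
    return seen
```

In practice the loop usually stops itself: once the same few categories keep repeating, you already know what to go fix.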
Starting point is 00:20:34 Yeah. Yeah. It's so, so interesting that it used to be ML data scientists doing these exercises, and we'd hire a bunch of ML people. Now with AI, I guess the whole world is on fire and everybody is like, assuming they can just easily add AI to the mix of whatever they're doing. But then you're talking about like just empirical, simple analysis to start with. But it does get complicated over time though, depending on where you want to go and how much of this really matters.
Starting point is 00:21:05 And I think there's so many angles of trying to get AI to work in your company. Like you mentioned, there's domain experts trying to be part of your processes. There's like a big piece you're talking about and a huge amount of your blog posts in the end was more like data and evaluation and just like different types of ways to kind of build your own toolbox of ways to test edge cases and build trust around your evaluation systems and stuff like that. I imagine just looking at the data alone is a little daunting, but once you get used to it, it'll probably be a little better.
Starting point is 00:21:37 Building evaluations though, this is a pretty new thing for most people, right? And I think we are now in this phase of the ecosystem where the value of AI is so high. So therefore a lot of vendors, of course, are selling lots of high value things. Evaluation products and all the kind of like judge products. I'm curious, where do you see the gap between the products that are being sold, especially around like evaluation testing, because this is a pretty hot space as I see it. I think the products are actually pretty helpful in a lot of ways. It's really nice to be able to log your data to a system where you can see it,
Starting point is 00:22:14 that kind of has a nice UX and sort of has different things to get you started. You know, a lot of the observability tools like the ones I mentioned, like Braintrust, Arize, LangSmith, so on and so forth, help you get started. And you know, you probably want to use some kind of tool, honestly, it's just a matter of like how you use the tool, the tool is like just there. And so, you know, for example, one of the things I talk about in my blog post is, okay, like a lot of tools offer a prompt playground. So a prompt playground is basically you can have your prompt and maybe a data set
Starting point is 00:22:53 where you can template out, let's say, different inputs into that prompt. And you can play around with your prompt. You can version that prompt, so on and so forth. So that's really useful when you're starting out, right? And a lot of these prompt playgrounds, you can do multiple ones in parallel, you know? You can compare them side by side.
Starting point is 00:23:11 And even sometimes you can put measurements in there and do some tests or whatever. The only issue is, I would say most prompts don't live in a silo, okay? They're part of your AI system. So you have prompts, you might have RAG, you might have function calling, and those are all application-specific code. That's inherently part of your software.
Starting point is 00:23:34 It's going to be really difficult to have a prompt playground from a third-party tool run your code. Because to execute RAG, to execute function calls, it has to run your code, and so on and so forth. And so basically what I advocate for is integrated prompt playgrounds. Basically the same interface that your users see, but basically an admin mode
Starting point is 00:23:54 where you can edit the prompt. So that you can actually see what your system is going to do, it's not in a silo.
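A minimal sketch of that "admin mode" idea in Python, with every name made up: the same answer() path the product uses, plus an override that lets an admin swap in an edited prompt template and see what the full pipeline would actually do with it.

```python
# Stand-ins for the real pipeline pieces; in a real system these do retrieval,
# tool calling, and the model call.
def retrieve(question: str) -> list[str]:
    return ["(stand-in for your real RAG retrieval)"]

def call_model(prompt: str) -> str:
    return "(stand-in for your real model call)\n" + prompt

DEFAULT_TEMPLATE = (
    "Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
)

def answer(question: str, prompt_template: str = DEFAULT_TEMPLATE) -> str:
    # The one code path everyone goes through: retrieval, prompt assembly, generation.
    docs = retrieve(question)
    prompt = prompt_template.format(context="\n".join(docs), question=question)
    return call_model(prompt)

def handle_request(payload: dict, is_admin: bool) -> str:
    # Admins can pass an edited template and see exactly what production would do with it;
    # everyone else always gets the deployed template.
    if is_admin and payload.get("prompt_override"):
        return answer(payload["question"], payload["prompt_override"])
    return answer(payload["question"])
```

Because the override rides the same retrieval and tool-calling code as production, what you see in the playground is what users would get, which is the point being made about prompts not living in a silo.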
Starting point is 00:24:15 But that doesn't mean the tools are not useful. It's a good starting point, and it's useful for people before they have this integrated prompt playground. But these are the kind of things, like, tools are useful, but you should figure out what works for you and where you are in your journey. And there's different parts of these tools that are helpful along all parts of the journey. So yeah, I like them. Yeah, but I guess my question is, and I guess you already alluded to it a little bit, but how do you think these tools, or evaluation as a whole, can be like 90%, 95% automated? Because that's the promise of these products. It's almost like these will take over more and more. Maybe the judge will be so much more powerful, or there's SDKs or generated SDKs, and just somehow
Starting point is 00:24:46 it's just magically like these things would be less, almost, consulting and service based. Yeah. It feels like these tools have to be bundled with services quite a bit actually to even implement it correctly sometimes. So I think right now they're 25% automated. Let's say in the absence of AGI, I think you can automate it maybe like 60%, maybe 65%, maybe even 70%. So there is like a lot of things that vendors can do to make the process more automated, but there is a piece of it in there that you can't, there's no free lunch.
Starting point is 00:25:19 Like you have to look at the data and provide feedback, and you have to inject your taste into your product. Like you have to communicate to the AI in some way to inject your taste. And it's actually well known from several different studies, and I also cite this in a blog post, is like people are inherently not good at specifying their tastes and criteria upfront. You have to react to
Starting point is 00:25:46 what you see to even know what that is yourself, right? And so the only way to do that is to interact with the AI and give feedback. You're gonna have to have some amount of either services or education because people, you know, will have to do that process no matter what. And so I really want to ask you this. And I feel like in most of your posts, and forgive me, I haven't read every single word you wrote. Okay.
Starting point is 00:26:12 The word agent doesn't come up that often yet, you know? And as you know, the whole world is crazy about agents now. Like, I feel like LLMs are just the past already, you know? Like, we're already in this agents-are-everything era. Yeah. And I think definitely there are a lot of people and companies both selling agent-specific frameworks and products or adopting agents everywhere. And I'm sure people are probably asking you, like,
Starting point is 00:26:35 hey, help me on my agents. Do you feel like agents are something where there's another way to look at this whole approach of helping to figure out the problems and accuracy stuff? Or is it really pretty much the same? Because the complexity now is like, the agents are sometimes, of course, all prompt and model space, right? But it's doing more interactions. But more and more of them are also doing automations as well.
Starting point is 00:27:00 And so it gets a little bit more hairy on like, does it actually do the thing or not? So do you have opinions on how you go about improving agents' quality overall, and is it the same methodology? Yeah, it's really the same thing. Honestly, it actually makes evals even more important. It makes the process more important, because now you have more complexity. So, you know, Tim, from engineering, you know that you need to justify the complexity. You shouldn't just take it on, you know?
Starting point is 00:27:28 And like, so how do you know that there's a payoff? Like you should do the simplest thing that works. I think everyone can agree on that. And so the way to keep yourself honest is evals. Now, another thing is as you ramp up the complexity, you're gonna have more surface area. With more surface area, you will have more failures. So if you just take a blind approach to using
Starting point is 00:27:50 generic methods to try to figure out how to make your AI better, you're going to get lost really fast because there's going to be way too much noise. So if you go through these processes, it will cut through all the noise and you'll figure out exactly what is wrong, and then you'll also know where to fix it, as opposed to like, oh, my hallucination score is whatever, or my toxicity score is whatever. Do you even have a toxicity problem? Do you even have a hallucination problem?
Starting point is 00:28:18 I don't know. So, you know, these things lead people astray. And I think they lead them even more astray when the complexity is high. So I think it's like, it doesn't really matter like what you're using. We were already living on the spectrum of agents before the word agent, as far as I'm concerned. What is an agent?
Starting point is 00:28:38 Is an agent something that can act on your behalf autonomously? There's no like agreed upon definition, but there's a blog post from Anthropic that's like different levels of agent. I don't remember the exact levels, but we were already using a few of those levels before the word agent was popularized.
Starting point is 00:28:57 So I think it's really the same in terms of how you think about testing. I'm kind of curious when you think about, I agree with you, there's a spectrum. And I think of like agent level zero is the same as, I always use the self-driving car analogy, which is like level zero, self-driving, just the human with a pedal and a steering wheel, right?
Starting point is 00:29:13 And like, if you think about it broadly speaking, actually what we were building, if you think of like, I was building a service, I don't even know if that was like level zero, it might've been level one, like you're already still automating something you would have had to do that was highly static. It's just like, what that thing could do was static,
Starting point is 00:29:29 and now, as we've kind of moved up the curve towards agency, it's like your thing gains broader and broader ability to do dynamic things and do those dynamic things asynchronously according to some like trust boundary, right? Some trust equations. It's like, I trust this thing to operate within this box on its own, you know, whatever it decides to do, but this is the box. That said,
Starting point is 00:29:46 I'm really curious from your perspective, because you talked about how, you know, getting to level five, let's say a level five agent, you can tell it a one-line sentence, it goes off and does it for five years, comes back to you, and, you know, has built you a house and ordered a lot of stuff, figured out what cake, and designed the whole home or whatever it's done. How do you think about, you know, the way that software engineers dealt with complexity in software applications and modularization?
Starting point is 00:30:07 We tried to modularize the layers of abstraction and that helped us think through some of these, like the complexity box, and then you hit unit tests, integration tests. And that's how we've been able to build like incredibly complex software with millions and millions and millions and millions of lines of code
Starting point is 00:30:22 that do very complex things in a very trusted way. I'm curious, like, does that model of abstraction translate to the way that you think through how to build more complex apps backed by LLMs? Is it similar ways, is that where the mixture of expert stuff comes in, or is there more to be done? I'm kind of curious to get your perspective, considering you're out there doing it.
Starting point is 00:30:42 Yeah, it's an interesting question. Like, okay, to me, agents, I don't really focus too much on the definition of it, to be honest. It's really about capabilities. Again, I try to do the simplest thing that works. I don't even try to define it, honestly. To me, the only thing that matters is, OK, does it do what you want to do?
Starting point is 00:31:02 How you do that is kind of not as interesting to me. So I haven't spent too much time thinking about definitions for whatever reason. Maybe I'm just weird. Sounds good, sir. Well, I think this is the perfect time. Let's go on to the spicy future. You've been working with a bunch of customers. I think you're seeing the front row view of how people are struggling and using this stuff.
Starting point is 00:31:27 And of course, you know, there's the other side, the marketing that's happening all the time. So give us your spicy hot take. What is something you believe that most people don't believe yet? Yeah, maybe it's not as spicy after our conversation. But you know, like most people don't know how to look at data and don't have data literacy. And that stops them from making progress. And I think most people don't know that. I kind of don't know if I like the word data literacy, because it feels like an insult to
Starting point is 00:31:53 have the word literacy in there. Like, you're not literate, and I don't really like that aspect, it feels bad. But I mean, it's a word that exists. And you know, people feel like, hey, we have all this AI, why do I need to look at data? It seems very counterintuitive. Can't something somewhere automate that for me? Why do I need to do this? But yeah, people don't know how to do that, where to start.
Starting point is 00:32:16 There's not that much education about evals out there. Certainly, the foundation model labs, they are really good at evals for their foundation models. But for whatever reason, in terms of education, and sort of just broadly speaking for domain-specific situations, there's not that much guidance out there for people on how to do evals. And I don't know why. It's kind of like the dark art though. The super secret sauce of a great ML team
Starting point is 00:32:45 has always been sort of the eval framework. And I mean, I always have thought about evals like the integration tests for a model for lack of like a better thought process. I'm curious. I mean, it sounds like what you're saying is basically at the end of the day, like if we're gonna have large adoption of LLMs as core parts of products
Starting point is 00:33:03 and workflows in the future, we need to take the vast majority of developers today who have, whether we use the word data literacy or whatever word we want to use, very little experience doing what, personally, we might have called some formulation of data science, not a broad understanding of statistics and teach them about statistics and how to generate good evals and how to use those evals as a part of their workflow. Do you think that's a result of the fact that most people never get to building sophisticated enough things that they need such complicated evals? And so now this is just a new thing. Like how many organizations in the world would have had to do this
Starting point is 00:33:41 prior to like 2022, right? Like not many organizations actually had sophisticated machine learning use cases, much less sophisticated machine learning use cases at scale. So I'm kind of like, I look at it just like it's a chicken and egg thing. It's like, well, there's no reason to do it because it was really hard to figure out where to even apply machine learning in the first place. And so now we just have this new thing that has changed the dynamics of the applicable use cases, gone from like one in a million to one in a thousand use cases, whatever it is, some
Starting point is 00:34:10 drastic expansion in terms of the number of use cases you can apply this stuff. And it just turns out that actually the skill set you need to do this well was something you'd only ever learn in a place you got to actual scale with, which were very few places in the first place. Yeah. So, I mean, I think it's important, like, you don't need necessarily to learn the full breadth of statistics. Even saying like, hey, you need to be like a data scientist doesn't feel right. Just that word alone, I think, feels pretty scary, like data scientist.
Starting point is 00:34:39 Oh my God. Like, what is all the things? If you Google data scientist, you can get slapped in the face with a curriculum that is gonna overwhelm you. And I don't think you need that to begin. What I'm saying is like very basic stuff, like just counting. And if you really dig deep into data science, and you ask a lot of data scientists, like, Hamel said that counting is really important, they'll probably nod their head and be like, yeah, that's like 80%. So, you know, I'm trivializing it, right?
Starting point is 00:35:06 Like counting, like, okay, like, what's so hard about counting? It's not really just the counting, it's like, what do you count? What questions do you ask? You know, building that muscle. For example, working with this company that provides like these study cards, like Anki cards,
Starting point is 00:35:22 And they had some idea that maybe the retrieval was not as good as they wanted it to be, the semantic search. And they had a data set that was labeled to some extent: okay, search queries with relevancy scores, done by hand for a handful of queries. And so, I'm just giving you a concrete example, I asked questions. Okay, like give me a month's worth of queries, and then also let's do a little bit of analysis on
Starting point is 00:35:55 like this data that you graded. And so I started asking questions. Like, how many queries are keyword searches? What is the median length of a query? So on and so forth. This is the way I was trained, right? Is to ask questions. And so you would see really fast, like, okay, 30% of your queries are keyword searches. So what if instead of doing semantic search, you did keyword search? Like, the cards that are returned, are they more relevant? Turns out, according to the graded dataset, yes, a lot more relevant in the 30% that's keyword search. And I'm like, okay, the median query length in terms of tokens or
Starting point is 00:36:30 words, it's like something around 200. So it's like, okay, you know what, people are not asking questions, they're just copying and pasting their textbook in here, or copying and pasting some slide. And so maybe we need to do query rewriting. Now, pre-LLMs, that exercise would maybe take me, I don't know, like 45 minutes, but it takes me like maybe a minute now,
Starting point is 00:36:52 because I can just, I can say like, hey, like, please do this analysis, write the code, and I can check the code obviously, but it's very painless. You know, you have to know what questions to ask. Some wise person will be listening to the podcast and be like, okay, this is stupid. Like the hardest thing in life is to know what questions to ask. So I'm just trivializing it. But what I'm saying is it's accessible with practice. You know, just even counting, knowing what questions
Starting point is 00:37:18 to ask, that takes you incredibly far. So like, okay, the result of that was like, okay, we knew that 30% of the time it was keyword search, route it to keyword search, because we know when we're doing keyword search. And then, okay, maybe let's do query rewriting. And by the way, hybrid search, let's benchmark it just based on this data you graded.
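The counting described here is only a few lines once the queries are in a table; below is an illustrative sketch in Python, where the CSV, column names, and the keyword-query heuristic are all assumptions to be adapted to your own data.

```python
import pandas as pd

# Hypothetical export: one row per search, with hand-graded relevance for what each
# retrieval strategy returned (only available for the graded subset).
queries = pd.read_csv("graded_queries.csv")
# assumed columns: query, relevance_semantic, relevance_keyword

def looks_like_keyword_search(q: str) -> bool:
    # Crude heuristic: short and not phrased as a question. Tune against your own data.
    return len(q.split()) <= 3 and not q.strip().endswith("?")

queries["is_keyword"] = queries["query"].map(looks_like_keyword_search)
queries["length_words"] = queries["query"].str.split().str.len()

print("share of keyword-style queries:", queries["is_keyword"].mean())
print("median query length (words):  ", queries["length_words"].median())

# For the keyword-style slice, which strategy returned more relevant cards?
keyword_slice = queries[queries["is_keyword"]]
print(keyword_slice[["relevance_semantic", "relevance_keyword"]].mean())
```

That handful of numbers is what backs decisions like "route keyword-style queries to keyword search" or "add query rewriting for pasted-in slides."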
Starting point is 00:37:38 Look, there's a result, whatever. All these different things within the span of a very short amount of time. This is very valuable, right? Because if you're stuck, you can answer a lot of questions with some basic data analysis and counting. I don't know if that example is useful, but again, it's some process of education and teaching people. I think it will happen. It'll just take a little bit of time. To be honest, this is probably one of the most valuable and fun conversations around AI
Starting point is 00:38:04 personally, you know, because instead of just keep talking about what's the latest and greatest hype, the reality of making stuff work is down to literacy and counting. I feel like I'm teaching my son to drop his phone and just do the basics, right? Yeah. It's the most uncool things, like counting, evals, you know? But yeah, that's it, yeah.
Starting point is 00:38:29 Yeah, it's so amazing. I know you're not into nicknames yet, or you don't have a nickname, but I really feel like you should be called the AI doctor here, almost, you know? Like you're here and you're just like, hey, everybody calm down, this is what's happening. Go back, bring your paper, and let's count, you know, and ask the right questions. We have so much we could ask, but you know, we're running out of time. Where can people find you? What are the social channels and stuff?
Starting point is 00:38:54 Yeah. The best place to find me is my blog, hamel.dev. And so you can find all of my contact information there. I put everything there. So you can go into the rabbit hole from there if you like. Amazing. Thank you so much, Hamel. It's been such a pleasure.
Starting point is 00:39:10 Yeah, likewise. Thank you.
