The Data Stack Show - 177: AI-Based Data Cleaning, Data Labelling, and Data Enrichment with LLMs Featuring Rishabh Bhargava of Refuel

Episode Date: February 14, 2024

Highlights from this week's conversation include:
- The overview of Refuel (0:33)
- The evolution of AI and LLMs (3:51)
- Types of LLM models (12:31)
- Implementing LLM use cases and cost considerations (00:15:...52)
- User experience and fine-tuning LLM models (21:49)
- Categorizing search queries (22:44)
- Creating internal benchmark framework (29:50)
- Benchmarking and evaluation (35:35)
- Using Refuel for documentation (44:18)
- The challenges of analytics (46:45)
- Using customer support ticket data (48:17)
- The tagging process (50:18)
- Understanding confidence scores (59:22)
- Training the model with human feedback (1:02:37)
- Final thoughts and takeaways (1:05:48)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. We are here on the Data Stack Show with Rish Bhargava. Rish, thank you for giving us some of your time. Thank you for having me. All right. Well, give us the overview. You are running a company that is in a space that is absolutely insane right now, which is AI and LLMs.
Starting point is 00:00:44 But give us the brief background. How did you get into this? And then give us a quick overview of Refuel. Awesome. Yeah. So look, I'm currently the CEO and co-founder of Refuel, but been generally in the space of data, machine learning, and AI for about eight years. Was at grad school at Stanford, studying computer science, researching machine learning, and then spent a few years at a company called Prima.ai, where I was an early ML engineer. The problems we were trying to solve back then were, how do you take in the world's unstructured text, allow people to ask any questions, get a two-pager to read, so all
Starting point is 00:01:20 of these kind of interesting NLP problems. And then spent a few years after that solving data infrastructure problems, how do you move terabytes of data point A to point B, lots of data pipeline stuff. And that led into starting Refuel. You know, one of the key reasons why we started was just how do you make good, clean, reliable data accessible to teams and businesses? And that's the genesis of the company. And here we are.
Starting point is 00:01:46 Very cool. to teams and businesses. And that's the genesis of the company. And here we are. And Eric, you know, I know Rish from the COVID days, so for me, it's very exciting to see the evolution through all this and see him today building in this space. I remember talking
Starting point is 00:02:01 almost two and a half years ago about what he was thinking. Back then, LLMs were not the thing that they are today. So for me, at least, it's very fascinating because I have the journey of the person in front of me here, and I'm really happy to get into more details about that. So definitely, we have to chat about it. But also, I think we have the perfect person here to talk about what it means to build a product in the Combami in such an explosive, almost, environment. Things are changing literally from day to day when it comes to these technologies like LLMs and AI and machine learning. And just keeping up with the pace from the perspective of a founder,
Starting point is 00:02:48 I think it's a very unique experience. So I'd love to talk also about that with you, Rhys. And here also what's happening. You're probably having a much better understanding of what is going on with all these technologies out there, but also how you experience that, trying to build something. And of course, talk about the product itself. What about you?
Starting point is 00:03:11 What are some topics that you are really excited about talking today with us? Look, I'm super excited to talk about just the world of generative AI, how quickly it's evolving. But Kostas and I, both of you have spent so much time talking to folks in data and how the world of LLMs impacts the world of data, right? How do you get better data, cleaner data, all of those fun topics? And frankly, what does it mean, right? What are the opportunities for businesses and enterprises when this, as you said, explosive
Starting point is 00:03:39 technology is really taking off? So excited to dig into these topics. Yep. I think we are ready for an amazing episode here. What do you think, Eric? Let's do it. Let's do it. All right. We have so much ground to cover, but I want to first start off with just a little bit of your background. So you have been working in the ML space for quite some time now, I guess, you know, sort of in and
Starting point is 00:04:05 around it for close to a decade. And you could say maybe you were working on, you know, LLM flavored stuff a couple of years ago before it was cool, which is pretty awesome. That's, I would say, a badge of honor. But what changed? What were you doing back then? And what changed in the last couple of years to where it's a frenzy, you know, and the billions of dollars that are being poured into it is just crazy. Yeah, it's been such an incredible ride, Eric. You know, just a little bit on my background, you know, post-grad school at Stanford, I joined this company called Primer. This is about seven years ago at this point. And the problem that we were trying to solve back then was how do you take in the world's unstructured text information, take in all of news articles, social media, SEC filings,
Starting point is 00:04:56 and then build this simple interface for users where they can search for anything. And instead of Google style results, right? Here are the 10 links. Instead of that, what you get is a two-pager to read, right? So how do you assimilate all of this knowledge and be able to put it together in a form that is easy to consume? And this used to be, this was a really hard problem. This used to be, this is many months of effort, maybe years of effort, getting it into a place where it works. And if you compare that to what the world looks like today, I would bet you this is 10 lines of code today using OpenAI and GPT-4. So truly,
Starting point is 00:05:38 some meaningful changes have happened here. And I think at a high level, it's not one thing. It's many things that have sort of come together. It's new machine learning model architectures that have been developed, things like the Transformers models. The data volumes that we're able to collect and gather, that has gone up significantly. Hardware has improved. Cost of compute has gone down. And it's a marriage of all of these factors coming together that today we have these incredibly powerful models to just understand so much of the world. And you just ask them questions and you get answers that are pretty good. And it just works. So it's been an incredible ride these last few years. Very cool.
Starting point is 00:06:13 And give us an overview of where does Refuel fit into the picture when it comes to AI? Yeah. So look, at Refuel, we're building this platform for data cleaning, data labeling, data enrichment using large language models, and importantly, at better than human accuracy. And the reason why we look at this way of building the product, it's at the end of the day, data is such a crucial asset for companies, right? It's like the lifeblood of making good decisions, training better models, having better insights about how the business is working. But one of the challenges is, and people still complain, hey, we're collecting all of this data, but if only I had the right data, right? I could do things X, Y, or Z, right? People still complain about it. And the reason is, working with data, it's an incredibly
Starting point is 00:07:06 manual activity. It's very time-consuming. People are spending tons of time just looking at individual data points. They're writing simple rules, heuristics. And so doing that data work is actually, it's hard and it's time consuming. And what if, right? Like, you know, the question that we ask is, you know, with LLMs becoming this powerful, what if like, you know, the way of working wasn't, you know, we do that, you know, we look at data ourselves and we write these simple rules and, you know, we do that manual work ourselves. But what if we were to just write instructions for some large machine learning model, some
Starting point is 00:07:46 LLM to do that work for us? And writing instructions for how work should be done is significantly easier, significantly faster than doing the work itself. And so that just is a massive leap in productivity. And what we want to build with Refuel is being able to do a lot of these data activities, data cleaning, data labeling, data enrichment, where the mode of operation is as humans, as the experts, we write instructions, but this smart system goes out and does this work for us. Makes total sense. I have so many questions about Refuel, and I'm sure Costas does.
Starting point is 00:08:23 But before we dive into the specifics, you live and breathe this world of AI and LLMs every day. And so I'd love to pick your brain on where we're at in the industry. And so one of the things, and maybe a good place to start would be, I'm interested in what you see as far as implementation of LLM-based technology inside of companies. And I'll give you just a high-level maybe prompt, if you will. It seems like an appropriate term. But I think everyone who has tried chat GPT is convinced that this is going to be game changing, right? But that, for many people, is largely sort of a productivity. There's a productivity element to it, right? Or you have these companions like GitHub's co-pilot, right? That, again, sort of fall into this almost personal or team level productivity
Starting point is 00:09:26 category, right? Tons and tons of tools out there. But then you have these larger projects within organizations, right? So let's say we want to build an AI chat bot for support, right? We want to adjust the customer journey that we're taking someone on with sort of a next best action type approach, right? It seems to me that there's a pretty big gap between those two modes, right? And there's a huge opportunity in the middle. Is that accurate? What are you seeing out there?
Starting point is 00:10:02 I think that's a great way to look at it, Eric. I think you're absolutely right. Look, folks who have spent a meaningful amount of time with ChatGPT, you go through this experience of like, my God, it's magical. I asked it to do something and this poem that it generated is so perfect. You go through these moments of it works so incredibly well. There's, as you mentioned, there's co-pilot like applications that are almost plugged into where, you know, that individual is doing their work and it's assisting them. It's, you know, in the most basic form, it's doing autocomplete, but in the more kind of advanced form, it's almost offering suggestions of how to be able to rewrite
Starting point is 00:10:41 something or just a slightly higher order activity. But there is a jump from going from something that assists an individual person accomplish their task 5%, 10% better or faster to how do you deploy applications where these large language models are a core key component at scale at a level where the team that is actually developing and building this feels like, you know what? Our users are not going to be let down because the performance and the accuracy and there aren't going to be hallucinations and it's going to scale. There's a whole set of challenges to deal with, to go from that individual use case to something that looks and feels like this is production ready.
Starting point is 00:11:24 And I think as we kind of roll this back a little bit, we're very early in this cycle, right? The core technologies, you know, the substrate, these LLMs, they themselves are changing so rapidly. Of course, you know, OpenAI has been building and, you know, sort of deploying these models for a while now. Google is, you know, we're recording on in December. So Google has just announced sort of their next set of models.
Starting point is 00:11:49 There's a few open source models that are now coming out that are competitive, but the substrate itself, the LLMs themselves, they're so new, right? Yeah. And so the types of applications that we expect to be built, you know, this is going to be a cycle of, you know, somewhere in the, you know, two to five years where we'll truly see a lot of mainstream adoption, but we're early. But the interesting thing is, I think there is still an interesting playbook to follow
Starting point is 00:12:16 for folks who are experimenting and want to build high quality pipelines that use LLMs, that are applications that use LLMs. So I think there are playbooks that are being built out, but I think in the curve, we're still kind of early. Yeah. Yeah. That makes total sense. What are the ways, I mean, if you just read the news headlines and every company that's come out with a survey about adoption of LLMs, you would think that most companies are running something pretty complex in production, which I think that's probably a little bit clickbaity. Maybe even that's generous, but what are you seeing on the ground? What are the most
Starting point is 00:12:57 common types of things that companies are trying to use LLMs for beyond the sort of personal or sort of small team productivity? So the way we would look at, the way we're seeing the types of applications that are going live today, you know, the first cut that enterprises typically take is what are the applications that are internal only, right? That have no meaningful impact on, you know, at least no direct impact on users, but can drive efficiency gains internally. So things that might, if there are a hundred documents that need to be reviewed every single week, can we make sure that maybe only 10 need to be reviewed because 90 of those can be analyzed
Starting point is 00:13:39 by some LLM-based system. That's an example. I think a second example that teams are starting to think about is places where they can almost act like a co-pilot or almost offer suggestions to the user while the user is using the main application. Almost it's helpful suggestions. I think one of my favorite examples is if you've just created, let's say you've captured a video, right? Something good automatically suggests a title, right? It's like a small kind of, it's a small tweak, but makes a nice kind of difference to the user and it doesn't make or break anything. The third thing that we're starting to see, and I think we're still early, but this is where we believe a lot of business value is going to be driven,
Starting point is 00:14:33 is frankly, existing workflows where data is being processed or where data consistently gets reviewed by teams internally, where the goal is, how do we do it more accurately, cheaper, faster, by essentially reducing the amount of human time involved, right? And these are typically, if they're more business critical, the bar for success is going to be a little bit higher, right? So teams will have to invest a little bit more time and effort getting to the levels of accuracy and reliability that they care about. But those become sort of core, let's say data pipelines, they become core product features.
Starting point is 00:15:15 But that's the direction that we're seeing businesses sort of head towards. Yeah, super interesting. You mentioned efficiency and cost. Can you tell us what you're seeing out there in terms of what it takes to actually implement an LLM use case? You know, it's one of those super easy to start, and then very difficult to, A, just, you know, sort of understand the infrastructure you need for your use case among all the options out there, and then B, figure out what it actually will cost to run something at production scale, right? I mean, you can query GPT, even if you pay for the API, you know, it's pretty cheap, you know, to send some queries through, right? When you start processing, you know, hundreds of millions or billions of data points,
Starting point is 00:16:07 it can get pretty serious. So how are companies thinking about it? You know, it's such an interesting question. In some ways, we look at, the way we're seeing it, developing new applications with LLMs, it's a journey. It's a journey that you have to go on
Starting point is 00:16:23 where, you know, as with a journey, you want, you know, somebody who's accompanying you. And in this particular case, it's, you know, it's one LLM or like a set of LLMs that you start out with. And typically, you know, the place where people start is there's a business problem that I need to solve. And we were discussing prompting initially. It's like, can I, would it be amazing if I just wrote down some prompt and the LLM just solved my problem for me? That would be amazing, right? And so that's where people start. And it turns out that, you know, some, you know, in many use cases, it can take you 50%, 60% of the way there. And then you have to sort of layer on
Starting point is 00:17:02 other techniques almost from the world of LLMs that help you sort of go from that 50 to 60% to 70 to 80 and progressively higher. And sometimes it's easier to think about working with LLMs and not to anthropomorphize sort of LLMs too much, but sometimes it's easier to think about LLMs as like, you know, like a human companion almost, right? My favorite analogy here is, sorry, this is a bit of a tangent, right? Winding way to kind of talking about how to develop LLMs, but bear with me. You know, sometimes it's easier to think about how to get LLMs to do what you want them to do by thinking of what would it take a human to succeed at a test? Okay. Let's say we were to kind of go in for a math test in algebra tomorrow.
Starting point is 00:17:47 Right. Of course, you know, we've taken courses in our past. We could just show up. Right. And go take the test, but we'd probably get to,
Starting point is 00:17:55 you know, 50 to 60%. In terms of like how well we do. If you wanted to improve in terms of performance, we would, we would go in with sort of a textbook, right? We'd treat it as like an open book test, right? And the analogy for that in the world of LLMs is things like few-shot prompting, where you show the LLM examples of how you want that work to be done, and then the LLM does it better, right?
Starting point is 00:18:21 Or you introduce new knowledge, right, which is what bringing your textbook does, right? And so that is the next step that typically developers take, right? In terms of improving performance. And then the final thing, if you truly wanted to ace the test, you wouldn't just show up with a textbook. You'd spend the previous week actually preparing, right?
Starting point is 00:18:41 Actually doing a bunch of problems yourself. And that's very similar to how fine tuning works, right? Or training the know, doing a bunch of problems yourself. And that's very similar to how fine tuning works, right? Or training the LLM works. And so typically the journey of building these LLM applications, it takes this path where teams will just, you know, they'll pick an LLM, they'll start prompting, they'll get somewhere and then it won't be enough. And then they'll start to introduce these new techniques that folks are developing on how to work with LLMs, whether it's few shot prompting or retrieval augmented generation where you're introducing new knowledge. And then finally, you're getting
Starting point is 00:19:15 to a place where you've collected enough data and you're training your own models because that drives the best performance for your application. So that's the path that teams take from an accuracy perspective. But then, of course, you were also running this in production. It's not just about accuracy. We have to think about costs. We have to think about latency. We have to think about where is this deployed. And I think the nice thing about this ecosystem is the costs look something today, but the rate at which costs are going down, it's extremely promising.
Starting point is 00:19:50 So we can start deploying something today, but odds are that in three months or six months time, the same API will just cost 3x less. Or there might be an equivalent open source model that is already as good, but it's 10x cheaper. So the cost curve is, I can see in your eyes that you have questions on the tip of your tongue, and I want to know what they are. Yeah, of course I have. So, Rhys, let's go through the experience that someone gets with Refuel. I'm asking that for two reasons. One is because, obviously, I'm very curious to see how the product itself feels like for someone
Starting point is 00:20:50 who is new in working with LLMs because it's one thing... I think most of the people, and you mentioned that with Eric previously, the first impression of an LLM is through something like Chat chat GPT,
Starting point is 00:21:06 right? Which is a very different experience compared to going and fine tuning a model or like building something that is like much more fundamental with these models, right? So I'm sure there's like a gap there in terms of like the experience probably the industry is still trying to figure out what's the right way for people to interact and be productive with fine-tuning and building these models. So tell us a little bit about that, how it happens today. And if you can, also tell us a little bit of how it has changed since you started, right? Because it will help us understand what you've learned also also like by building something like for the market out there absolutely cost us so look to the experience in you know let's maybe i think it's sometimes easier
Starting point is 00:21:55 to take an example right let's say you know the the type of problem that we're trying to solve let's say you're let's say you're an e-commerce company or a marketplace and you're trying to solve. Let's say you're an e-commerce company or a marketplace, and you're trying to understand what are people searching for? Given a list of search queries, what is the thing that they're actually looking for? Is it a specific category of product? Is it a specific product? What is the thing that they're looking for? And this is a classic example of like a classification or categorization type of task. So the way refuel works is, you know, you point us to wherever your data lives, right? So we'll be able to kind of read it from different cloud storage, as warehouses, you can do data uploads. And then you pick from one of our templates of the type of thing that you want to accomplish, the type of tasks you want to accomplish.
Starting point is 00:22:44 So in this particular case, it would be, let's say, categorizing search queries. That's the template that you pick. And the interface of working with Refuel once you've plugged in your data and you've picked the template is just write simple natural language instructions on how you want that categorization to happen. And I think that's similar to how, you know, working, you know, exploring or playing around with what chat GPT feels like, which is there's just a text box. And what it's asking you for is,
Starting point is 00:23:12 hey, you want to do, you want to categorize search queries. Help us understand what are the categories that you're interested in. And, you know, if you were to explain this to another human, what would you write to explain and to get that message across? And that's the starting point here. So when a user will just describe that, hey, these are the categories that matter to me, and this is how I want you to
Starting point is 00:23:36 categorize, essentially, the refuel product will start churning through that data, start categorizing every single search query, and then we'll start highlighting the examples that the LLM found confusing. And this is actually like a big difference to what a simple use of chat GPD would do because LLMs are this kind of, they're an incredible piece of technology, but you give them something and they will give you back something. And without regard for, is it correct or not correct? But if you want to get things right, right, it is important to know and understand where is the LLM actually confused. And so we'll start highlighting those examples to the user to say, hey, this query and your instructions didn't quite make sense.
Starting point is 00:24:19 Can you review this? And at that point, you know, the ones that are confusing, the user can sort of go in, they can provide almost very simple thumbs down, thumbs up feedback to say, hey, you got this wrong, you got this right. Or they can go and adjust the guidelines a little bit and iteratively refine how they want this categorization task to be done. And the goal really is that if in the world without LLMs, right, if you had to do this manually, and you're having to do this categorization every single time for every single one of those search queries, instead of that, you're maybe having to review 1%,
Starting point is 00:24:58 maybe 0.1% of the data points that are most helpful for the LLM to understand and essentially do that task better into the future. So that's what the experience of working with it looks and feels like, where it's this system that is trying to understand the task that you're setting up. It's surfacing up whatever is confusing and iteratively getting to something that is going to be extremely accurate. And whenever folks are, let's say, happy with the quality that they're seeing, it's a one click button and then you get sort of an endpoint.
Starting point is 00:25:33 And then you can just go and plug it in production and continue to kind of serve this categorization maybe for real traffic as well. That's the experience of working with the system. And it's often useful to compare it with how it would be done in the world without LLMs. In the world without LLMs, you're either manually doing it or you're writing simple rules and then you're managing rules. But instead, the game with LLMs is write good instructions and then give almost thumbs up, thumbs down feedback. And that's enough
Starting point is 00:26:05 to get the ball rolling and get it to be good. Now, I think the second part of your question was, how has this changed and evolved as we've been building this out? Actually, there's two interesting things there. The first is, for us, the problems that we've been interested in have always remained the same, which is how do we get better data, cleaner data in less time and so forth, right? So the problem of good, clean data has always remained the same. I think the interesting changes that we've learned is, frankly, which LLMs to pick for a given task. There's more options that are available now know, there's more options that are available now.
Starting point is 00:26:46 And there are more techniques that are available that can almost squeeze the juice out of from an accuracy perspective. So we've essentially just learned a lot in terms of how to maneuver these LLMs. Because, you know, at the very beginning, a lot of the onus was on the end user to be able to drive the LLM in a particular direction.
Starting point is 00:27:05 But at this point, having seen many of these problems, we generally understand what you have to do to get the LLMs to work successfully so that teams are not spending too much time prompt engineering, which is its own kind of sort of ball of wax. So that's one interesting thing that we've learned. And I think the second thing that we've learned is, and I think we're going to see this in industry as well, that the future of the industry, it's not going to look like a single model that is just super capable at every single thing. We are generally headed in a direction where there's going to be different models that are capable, some bigger, some smaller, that are capable at individual things. And almost being able to get there quickly and manage that process and manage those systems becomes the important kind of factor. Because for many reasons, from accuracy to scalability to cost to flexibility, being
Starting point is 00:28:00 able to get to that sort of smaller custom model ends up being super important here. Yeah, that makes a lot of sense. Okay, so when someone starts with trying to build an application with LLMs, and here we are talking about open-source LLMs, right? We're talking about models that are open-source. There are a couple of things that someone needs to decide upon. One is, which model should I use as a base model to go and
Starting point is 00:28:30 train it? The other thing is that all these models come up in different flavors, which usually has to do with their size. You have 7 billion parameters. You have 75 billion parameters. I don't know. In the future, we're probably going to have even more variations. So when someone starts and they
Starting point is 00:28:51 have a problem in their mind and they need to start experimenting to figure out what to do, how do they reason about that stuff? First of all, how do I choose between Lama and Mistral? Why I would use one or the other? Because apparently, and my feeling is that, as you said, there's no one model that does everything, right? So I'm sure that Mistral might be better in some use cases, Lama might be better in some other use cases. But at the end, if you read the literature,
Starting point is 00:29:23 all these models, are always like published with the same benchmarks right so that doesn't really help like someone to decide what's the best for their use case right so how a user should reason about that without wasting and spending hours and hours of training to figure out at the end which model is best for their use case. Yeah, it's such an important problem and still so hard to kind of get right. In some ways, there's a few kind of questions that are, that are kind of underneath. There's a few things that need to be answered here.
Starting point is 00:30:09 At a, at a super high level, you know, the goal is, you know, for somebody who's building that LLM application, it's to figure out almost viability, right? The thing that we're trying to do, like, is this even doable? Is this even possible? Right? And so if I were in that person's shoes, right, the first thing that I would do is I would pick a small amount of data
Starting point is 00:30:31 and I would pick the most powerful model that is available. And I would see, can this problem be solved by the most powerful model today? If, you know, giving it as much information as possible, you know, trying it as much information as possible, try, you know, trying to simplify the problem as much as possible, but what is, you know, can this problem even be solved
Starting point is 00:30:52 by DLM? That's one thing that I would try first and foremost. The second thing, you know, if, you know, if I started to kind of look into open source, you know, the benchmarks that folks publish, it's, these are very academic benchmarks. They don't really tell you too much about how well this is going to do on my data, right? Or let's say my customer support data, right? Like, how is Mistral going to know? Or how is sort of Lama going to know about, you know, what my customers care about? It's hard. So the kind of the way to understand kind of open source LLMs and to start to get a flavor of that,
Starting point is 00:31:28 I think would be first create a small, pick a small data set that is representative of your data and the thing that you want to accomplish. Can be, you know, a couple of hundred examples, maybe, you know, a thousand examples or so forth. And then almost, you know,000 examples or so forth. And then almost if, for example, infrastructure was available to the team, then use some of the hugging phase and some of these other kind of frameworks that are available to spin those models up.
Starting point is 00:31:58 Although today we're starting to see sort of a rise of sort of just inference kind of provider companies that can make this available through an API as well. But I would start playing around with like the smaller models, right? Like, can this problem be solved by a 1 billion parameter model, a 7 billion parameter model, right? And just see like, you know, at what scale does this problem get solved for me? Because odds are that if you're truly interested in open source models, and you're thinking of deploying these open source models into production, you probably don't want to be deploying the biggest model, because it's just a giant pain,
Starting point is 00:32:35 right? So then the question becomes, if we do want to solve this problem, what is the smallest model that we can get away with? And there's a few kind of architectures and there's a few kind of flavors from a few different kind of providers that are the right ones to pick at any given moment in time. And I don't even want to offer suggestions because the time from now when we're recording this
Starting point is 00:33:00 to when this might actually go live, there might be new options that are available. So picking something that from one of the, this to when this might actually go live, right? There might be new options that are available, right? So picking something that from one of the, you know, let's say from Meta or Mistral is a good enough starting point, but then trying it out like the smaller model and seeing how far that takes us almost gives us a good indication of like, for the latencies that we want and the costs that we want, what is the accuracy that is possible? Yep. yep. That makes sense.
Starting point is 00:33:27 So from what I hear from you, it almost sounds like the user needs to come up with their own benchmark, internal benchmark framework, right? Like they need to somehow, before they start working with the models, to have some kind of taxonomy of like what it means for a result to be good or bad and ideally to have some way of measuring that like i don't know if it can be just black and white like it's good or bad and that's it right like it might
Starting point is 00:33:59 needs to be more let's say something like in between, like zero and one. But how can users do that? Because that's, I mean, that's like always like the problem with benchmarks, right? Like, and even in academia, like there is a reason that benchmarks tend to be so well established and rigid, and it's not that easy to bring something new. Or if you bring something new, usually that's a publication also, right? Because figuring out like all the nuances of like creating something that can benchmark, let's say, in a representative way and have like a good understanding of like what might go wrong with the benchmark is important, right? So how do someone who has no idea about benchmarking actually, but they are domain experts in a way, right?
Starting point is 00:34:45 Like the person who is interested in marketing to go and solve the problem, they are the domain experts. It's not you. It's not me. It's not the engineers who go and build that stuff, right? But probably never had you think about benchmarks in their lives or what it means, like specifically a benchmark for a model, right? So can you give us a little bit of hints there? I mean, I'm sure like there's no, probably not an answer to that.
Starting point is 00:35:09 If there was like, probably you would be public already with your company, but how you can help your users like to reason about these things and avoid some common pitfalls, let's say, or at least not be scared of going and trying to build this kind of benchmark infra that they need in order to guide their work. It's a great question, Kostas. Actually, I'll ask you guys the question. In one way,
Starting point is 00:35:41 I can answer in the direction of if Refuel actually makes this possible but i don't want to just show refill here so i can also just chat about just generally how teams should think about it the answer probably is like along the lines of like there should be tools that do so i'm curious like if you guys have like a sense on how you'd want this answered here yeah i'll tell you my opinion and like it comes from a person who has like experience with benchmarks from a little bit of a different domain. Because benchmark is one of the more long-lasting marketing tools in database systems. With a lot of interesting and spicy things happening there with specific clauses in some of them. People cannot publish the names of the vendors and like all that stuff which indicates like how even in something that it's like so deterministic in a way as like building a database system right still like
Starting point is 00:36:35 figuring out like what the right benchmark is is very like almost like an art more than like you know science but what I've learned is that no benchmark out there from academia or from the industry either can survive the use case of the user the user always has some
Starting point is 00:37:00 like small unique nuances to them that can literally render a benchmark completely useless. So it is, at the end, I think more of a product problem, in my opinion. And I say product not because there's no engineering involved. There's a lot of engineering involved there. But it has to be guided by user input or figuring out the right trade dots. And I think what we see here compared to building systems that are supposed to be completely
Starting point is 00:37:33 deterministic is that this is a continuous process. It's part of the product experience itself. The user needs you as they create their data sets and all that stuff they also need to create some kind of like benchmark that's like uniquely aligned to their problems now how do we do that i don't know it's something that i think is like a very fascinating problem to solve and i think something that can deliver like tremendous value for whatever like venture like comes up with that but that's my take on that. What do you think, Eric? I think you might have like,
Starting point is 00:38:08 you're more of like customer side. So you probably have more knowledge than any of us on that. Yeah, I mean, I think, you know, we've done a number of different projects actually trying to, you know, trying to actually leverage this technology in a way. I mean, it's funny, Rish, I think we followed a little bit of the pathway that you talked about,
Starting point is 00:38:32 right? I mean, there's a sort of personal productivity and then there's sort of this, you know, trying to use it almost as like an assistant as part of an existing process. And I think the specificity is really important, right? It actually, I think one of the places that a lot of, that I've seen things go wrong in my sort of limited view is, well, we have an LLM, let's just find a problem, right? And you end up sort of, I don't know. I think you end up sort of solving problems that don't necessarily exist for the business. And so for us, it's really, I think one of the key things for us is defining the specific KPIs that a project like this can actually impact, right? And sort of describing that ahead of time. So that, I don't know, at least that's the way that we've approached it. Makes sense. Yeah. I mean, look, Kostas, I think the question on, by the way, we can
Starting point is 00:39:39 probably kind of start the kind of recording process here again.. But Costas, look, I think benchmarking is a pretty hard problem because every specific customer problem, every specific company, there's so much uniqueness in their data, in how they view the world, that in the world of LLMs, the term that gets used is evaluation, which is what is on a given data set and with a specific metric in mind. The metric might be as simple as accuracy, but with that kind of metric in mind, right? And accuracy is still easier when there's a yes or no clear answer. In many cases, there might not be a clear answer. So what is that right metric becomes a hard problem. So benchmarking is hard.
Starting point is 00:40:32 And I think there's maybe a couple of things to kind of think through for most teams as they go down this process. The first is what dataset, right? What dataset that is small enough that they can maybe manually look at and review, but that still feels representative of their problem, right? And their production traffic that they imagine getting. And of course, that's not going to be a static dataset. So that has to evolve over time as we see more kind of data points come through. But that's almost question number one, which is, what is the data set? Then how can maybe a good product or a
Starting point is 00:41:12 good tool help me find and isolate that data set from a massive table that might exist in a data warehouse? So that's question number one around benchmarking and evaluation. And I think the second question is, what is the right metric? In some cases, it might be a metric that is more technical, something like an accuracy or a precision. Sometimes that metric might be more driven by what users care about and what that product team is thinking about, that this is the thing that matters to a user. And so thinking about like, you know, you know, I'll throw out an example, but in the case of sort of applications where data is being generated, did we generate any fact that was not available in the source text, right? That is a
Starting point is 00:42:00 metric that you could write down matters a lot to users. And so then it's a combination of how's that data set evolving over time? And what is the metric and the threshold that we think is going to be success or failure for this application? It's a combination of those things that teams end up thinking about. The best teams think about this before a single line of code is written, right? But sometimes it's hard, right? Sometimes you don't know what are the bounds of what the technology can offer and how the data set might evolve over time. Or sometimes the threshold that somebody sets is just because they heard it from somebody, right? From another
Starting point is 00:42:42 company, but it turns out it's not meaningful enough in that particular business. And so you're right. It is, it's a super hard problem. It's very complicated, but I think, you know, with better tools, this will become easier for people. But in many ways, this is the, this is one of the most important things to get right.
Starting point is 00:43:00 Because if the more time that gets spent here, you know, some of the infrastructure problems and the tooling downstream of it and which LLMs to use, they are driven by decisions that are made at this stage of the problem statement. Yep, yep, 100%. I think that's like the right time to...
Starting point is 00:43:19 We have the luxury here to have actually a vendor, you, in this space, and also a user, which is Eric. So RadarStack is evaluating some tools that they are trying to build using LLMs, and they are doing that through the field. So I think it's an amazing opportunity to go through this experience by having both the person who builds the solution, but also the person who's trying to solve the problem and see how this works at the end, with very unique
Starting point is 00:43:55 depth and detail. So I'll give it to you, Eric, because you know all the details here. But I'd love, as a now of like the episode here, I'd like to hear your experience with trying to solve a problem using LLMs and how this happened by interacting and using like Refuel as the product. Sure. Maybe I'll get to ask Eric a couple of questions as well about his experience here. Yeah, totally.
Starting point is 00:44:24 Reveal all. Live customer feedback. That's the best. Let's do it. Sure. It really has been fascinating. I'll just go through the high-level use case. We had actually met, Rish, we met talking about the show and having you on. And as I learned what Refuel did, this light bulb kind of went off. And I think I remember asking you in that initial introductory discussion, hey, would it work for something like this? And you said, yeah.
Starting point is 00:45:00 So we hopped on a call. But Kassus, the use case is, you know, one of the things that I am responsible for in my job is our documentation. And documentation is a really interesting part of a software business, right? There are many different aspects to it. There are many different ways that people use it, right? They may read documentation to educate themselves about what the product does, but it's also used very heavily and in large part intended for people who are implementing the product and actively using it. And so one discussion that we've had a lot on the documentation team is how do we define success with the docs, right? And that sort of, you know, that sort of comes from a process of quarterly planning. What are the key things that we want to do in the documentation? And one of the things that we discovered was that there's a lot of low-hanging fruit where
Starting point is 00:46:05 if you have documentation that's been developed over a number of years, and you have thousands of different documents in your portfolio, there are some that are old and need to be updated, or that were done quickly and need to be updated, etc. But once you sort of address the things that are objective problems, which thankfully you have a lot of customers and customer success people can sort of point those out for you and provide the feedback there. One of the challenges is where do you go next in order to improve it, right? Because there are obviously opportunities for improvement, but it's hard to find those out. And analytics themselves are a challenge because you can have lots of
Starting point is 00:46:53 false positives and false negatives. And so I'll give you just an example of one metric, like time on site. If you have a blog and you're analyzing blog traffic, time on site is generally you want more time on site, right? Because it means that people are spending a longer time reading should take a certain amount of time. And so they're on the page for, you know, they can be on the page for a long time, but it could also indicate that they don't understand what they're reading and they keep trying things that aren't working and returning to the documentation. So how do you know, how do you know that's the case? And there are a number of ways that you can determine that or attempt to determine that. But one of the things that we thought a lot about was how we can narrow down those problem areas or opportunity areas and how we can hold the docs
Starting point is 00:48:02 accountable to some sort of metric that is measurable over time, where we can see sort of true, you know, if we make it, if we uncover one of those, and then fix it, and then how do we measure that over time going beyond just sort of raw metrics. And one of the richest repositories that we believe is like a compass for this project is our customer support ticket data, right? Because if we can triangulate, you know, if there are enough customer support tickets at a certain with certain sentiment or a certain outcome that align to a metric like time on site or other some other metric, then that will indicate to us whether it's a good thing or a bad thing, right? And if it's a bad thing, then we can fix it. And then subsequently,
Starting point is 00:48:51 we should see customer support tickets related to that specific documentation or set of documentation decline over time, right? And so that was a high-level project. The challenge is that the customer support team, so I went to the customer support team and said, hey, this is what we want to do with the documentation. And they loved the idea, but they said, the problem is we've tried to do this before. And it just was untenable, right? I mean, you're talking about you know thousands tens of thousands i can't remember what the exact number is but it's a lot right and so even if you try to pull a random sample and have a couple of you know technical account managers go through and try to
Starting point is 00:49:37 label the tickets there's all sorts of challenges right the first one is you have to decide on a taxonomy if you want to change that you have to decide on a taxonomy. If you want to change that, you have to go back and redo all the work. I mean, they basically said we tried this and it didn't work. And so that's when we, literally around that time was when I talked to you, Rish, and we had that initial conversation. And so I said, hey, we have a ton of unstructured data and we essentially need to tag it according to categories. And so, yeah, that's been interesting, actually. It's been a super interesting project.
Starting point is 00:50:18 Okay. And so tell us a little bit more about like the tagging itself. You mentioned, first of all, like the taxonomy, right? What does the taxonomy is in this context? Yeah. Yeah, that's a great question. So when you think about tagging data, I'm not an expert in tagging data, but for our particular use case, when you think about tagging data, you need to be able to aggregate and sort the data according to a structure so that you can identify areas where a certain tag may over-index for tickets that are negative in sentiment or however you want to define that, right? I almost think about it as, you know, if you were creating a pivot table on a spreadsheet, how would you structure
Starting point is 00:51:10 the columns such that you can create a pivot table with drill downs that would allow you to, you know, to group the results? And one thing that, so we actually started out with a very simple idea that's proved to be very helpful, but it's been trickier than we thought to nail down a taxonomy. Actually, Rish, we haven't, I don't think we've talked about this since we kicked off the project. So here's some new information for you. Initially, we just took the navigation of the docs, you know, in the sidebar as our
Starting point is 00:51:43 taxonomy, because we thought that would be, even though we actually need to update some of that information architecture, at least we have a consistent starting point that maps tickets one-to-one with the actual documentation. The challenge that we face, and actually one of the things that refuel has been very helpful with is that the groupings in this if you just list out all the you know essentially the navigation
Starting point is 00:52:15 or even you know one or two layers down in the navigation as the as essentially the tags or the labels that you want to use for each ticket, you quickly start to get into what is technically fine for navigation, but practically needs to be grouped differently, if that makes sense. And so a great example would be, you know, something like the categorization of sources, mobile sources, server-side sources, you know, that sort of thing. And you may want to, you know, like for SDKs or, you know, whatever.
Starting point is 00:52:56 There just may be ways that you practically want to categorize things differently or group things differently, if that makes sense. Or another good example is like all of our integrations, you know, we have hundreds of integrations and, you know, in documentation, they're just sort of all listed, right? But it actually can be helpful to think about groups of those as like analytics destinations or marketing destinations or whatever. And so like, so what refuel has allowed us to do is actually test multiple different taxonomies,
Starting point is 00:53:29 which has been really helpful. And so the practical way that we did that was we took a couple hundred tickets as a just random sample. And we wrote a prompt that defined the taxonomy and gave the LLM an overview of what it's looking for, you know, related to sort of each label that we wanted. And we just tested it, right? And we sort of got the results back and have been able to modify that over time, which has been really helpful.
Starting point is 00:54:06 And so that was interesting to me. Initially, I thought, we'll just have a simple taxonomy. It doesn't matter. But then from a practical standpoint, the output data does really matter for the people who are going to be trying to, you know, sort of use it.
Starting point is 00:54:18 And this is something like for, when you said the user who's going to use it, is this like internal or external? Is this like taxonomy, primarily primarily interpreted by the customer success folks in Rutherstock? Both the documentation team and the customer success team, actually. Okay. And how do they use this taxonomy? So let's say you found the perfect taxonomy there using all these A, B, C, D, whatever, testing
Starting point is 00:54:47 with LLMs. What's next? You feed a new ticket in there and it's mapped in one of the taxonomy categories there. How does this work for the user? Yeah. So I think there are a couple of things that the initial thing that we want to do, and we're fairly close now. I'll actually say one of the other things that we've learned is that going through iterations really helps with the level of confidence that the model
Starting point is 00:55:18 provides back. So one really nice thing, but I'll actually tell you one of the things that we tried really early on before we started using Refuel was just wiring the GPT API up to a Google sheet and sort of dumping in the unstructured data and a list of tags or whatever. But the hallucination is a severe problem in that context because it's just going to provide you an answer either way. And so one of the things about refuel that was very helpful for us is that you can essentially define a confidence threshold and it just won't return a label if it doesn't reach a certain threshold. And one of the things that is really nice about that is, and I don't know if this is the intention, Rish, but the percentage of unlabeled
Starting point is 00:56:06 tickets is kind of a proxy of how well we're defining the taxonomy and sort of the instructions we're giving it, which is a very helpful, like, you know, even just this morning, actually, you know, we've been sort of making iterations to this and we have an extremely high level of confidence across most tickets now, which is really nice, right? Whereas before we may have gotten, and we were, you know, when we were iterating, we were, we had very sort of primitive prompts, I would say. And so maybe you get like 60 or 70% of the tickets labeled, right? Or something like that. And now we're like into the high nineties, which is pretty nice. And so the first step was sort of getting confidence and aligning with the customer success team on,
Starting point is 00:56:47 let's spot check these and see if this is relevant. And we're now at the point where we're going to run the entire set of unstructured tickets. And the first thing we're going to do is actually take that and do planning around a couple of things. So on the documentation side, which I'm closer to, identifying the docs that we need to improve and then setting up a structure to track on a monthly basis, we'll basically operationalize tickets going into refuel and coming back to the labels. And then we'll track over time, the quantity of tickets for a particular label or set of labels. And so on the documentation side, that's sort of how we'll measure these key updates that we do. And then the customer success team, I think, has a number of ways that they're going to use this, right? So if you imagine a new customer is onboarding and they can see the
Starting point is 00:57:33 sources and destinations that they're using or the particular use case that they have, but they already know both quantitatively. And then the interesting thing for them is qualitatively, okay, I have a group of tickets. I can browse through a couple hundred tickets related to this problem and figure out anecdotally where did they run into problems at which point in the process, and they can actually update their onboarding processes. Hopefully, the documentation helps a lot, but it can only go so far, right? So the customer success team can actually update their processes to say, here's a customer,
Starting point is 00:58:09 here's the tech stack that they're running, here are the use cases that they want to implement, and they'll know ahead of time, we need to watch out for these things, do these things, you know, to sort of smooth out that process. Yeah, that makes total sense. And Rish, one question from me,
Starting point is 00:58:22 and then I'm done. I'm not going to ask anything more, but sorry, it's just so interesting. So there's a very key piece of information here that Eric talked about, and that's the confidence, right? And that's something that Refuel returns, the confidence level of the model in terms of the job that it did with the data. But what does this mean? Because it boils down to a number at the end, right? But there's a lot going on behind the scenes to get down to this number. And probably it also has to be interpreted
Starting point is 00:59:01 in a different way, depending on many different factors. Why do we need 0.9 instead of 0.99, or 0.7? I don't know. So tell us a little bit more about what this confidence level is that we're talking about here and how people should think about it. Yeah, great question, Costas. And Eric, thanks for the story. I mean, honestly, I just loved hearing your thought process and experience as you went through it. Maybe I'll have a question or two for you in a second. Yeah, Costas, on the confidence bit, you know, confidence can get sort of technical pretty quickly. But the main reason for trying to have rigorous ways of assessing confidence is, again, it just comes back to the fact that LLMs are text in, text out,
Starting point is 00:59:53 they'll produce an answer for you. And so then the question becomes, when do we trust this output? When do we trust this response? And I'll tell you a little bit about how we do it internally, which is we actually have custom LLMs that we've fine-tuned and trained that are purpose-built to produce accurate and reliable confidence scores. And the way to think about and interpret this number, at the end of the day, with the example of the support ticket tagging use case that Eric was mentioning, is, let's say with RudderStack tickets,
Starting point is 01:00:31 it's either about ETL or reverse ETL, or it's about the SDK. The confidence is a measure of how likely the output is to be correct. If we give a particular output, let's say reverse ETL with 90% confidence, the model's confidence is that there's a 90% chance of it being correct. So the goal is for the confidence score to be calibrated to correctness, if that makes sense. That's the eventual end goal of having these confidence scores. So when you then get these scores and these outputs, you should be able to set a threshold for your specific task and say, hey, I want to hit a threshold of 90% confidence, because what that means is that everything above that is going to be 90% correct or more, right? And so you get that sort of calibrated level.
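To make that concrete, here is a minimal sketch in Python of the thresholding idea being described. This is not Refuel's actual API; the function names and data shapes are hypothetical. The point is simply that labels below the cutoff are withheld, and the share of withheld labels can double as a rough signal of how well the taxonomy and prompt are defined, which is the proxy Eric mentioned earlier.

```python
# Minimal sketch (hypothetical helpers, not Refuel's actual API).
# Accept a label only when the model's confidence clears the threshold,
# and track the unlabeled fraction as a proxy for taxonomy/prompt quality.
from typing import Optional

CONFIDENCE_THRESHOLD = 0.90  # if calibrated, labels above this should be ~90%+ correct

def accept_label(label: str, confidence: float,
                 threshold: float = CONFIDENCE_THRESHOLD) -> Optional[str]:
    """Return the label only if the model is confident enough, otherwise None."""
    return label if confidence >= threshold else None

def unlabeled_fraction(predictions: list[tuple[str, float]],
                       threshold: float = CONFIDENCE_THRESHOLD) -> float:
    """Share of tickets left unlabeled, a hint that the taxonomy or the
    prompt instructions may need another iteration."""
    if not predictions:
        return 0.0
    below = sum(1 for _, conf in predictions if conf < threshold)
    return below / len(predictions)

# Example: (label, confidence) pairs as they might come back from a labeling run.
preds = [("Reverse ETL", 0.97), ("SDK", 0.62), ("ETL", 0.91)]
print(unlabeled_fraction(preds))  # one of three tickets stays unlabeled at 0.90
```

The calibration point in the conversation is what makes the threshold meaningful: if the scores are well calibrated, setting the cutoff at 0.9 means the accepted labels should be right roughly 90% of the time or better.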
Starting point is 01:01:23 Of course, getting confidence scores to be very calibrated and very correct is an ongoing research problem and something that we invest a lot of our technical resources into, but it's absolutely critical to get that right and productized, otherwise being able to rely on these outputs becomes hard. That's how we think about confidence scores. Yeah, I guess I forgot to also add in a very important detail, Costas, but one thing, so the way that this works, and there may be more going on under the hood, Rish, I mean, I'm sure there's a lot more going on under the hood, but as a user, you can actually go in and look at the individual tickets for us, right?
Starting point is 01:02:09 And, you know, each one would be a data point. And you can interact with that ticket and essentially tell the LLM that this is actually this label, or that this is mislabeled, right? And so you're basically training the model on the pieces that it's not confident on. And so it kind of makes sense that initially, especially with the primitive prompt, you get stuff back that has a low confidence level, but then it's a human in the loop, essentially, right? You can go in and literally tag them and interact with the tickets. And then,
Starting point is 01:02:48 you know, let's say we put in a couple hundred tickets, and then someone can go in and tag 20 or 30 tickets or whatever. You get through a couple of pages, and then Refuel essentially tells you, okay, it's ready to rerun based on this feedback, right? And so then the confidence increases. And so you can iterate through that and give the LLM feedback on whether its confidence level is accurate or not.
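As a rough sketch of the loop Eric is describing (again with hypothetical function names and record shapes, not Refuel's actual API), the workflow amounts to surfacing the low-confidence predictions, collecting human corrections on a small sample, and rerunning the batch with that feedback:

```python
# Sketch of the human-in-the-loop flow described above (hypothetical helpers,
# not Refuel's actual API): review low-confidence predictions, record corrected
# labels, and hand that feedback back before rerunning the batch.

def low_confidence_items(predictions, threshold=0.90):
    """Return the predictions a human reviewer should look at first."""
    return [p for p in predictions if p["confidence"] < threshold]

def collect_feedback(items, get_human_label):
    """Attach a reviewer-provided label to each low-confidence item."""
    return [
        {"ticket_id": item["ticket_id"], "label": get_human_label(item)}
        for item in items
    ]

# After a reviewer corrects, say, 20 to 30 tickets, the feedback would be used
# to rerun the labeling job; confidence on similar tickets should then rise.
```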
Starting point is 01:03:19 Yeah, exactly. That's such a good way to put it, Eric. The goal is you spend a little bit of time on the ones that are less confident, where the model is not sure, but every single piece of feedback that you collect helps the next data point become better. And eventually you get to a place where you just start plugging in new data as it's being generated and get high-quality outputs out. You know, one of the other interesting things, actually, now that I'm thinking through all the details of this, that makes it tricky to use an LLM with unstructured data, and you asked about the taxonomy, Costas, and one of the other reasons that has been a very iterative process, is that users will often use generic terms, or separate terms, that are different from what you have in the title of your documentation page. And so over time, we've actually had to adjust the prompt to include these conditions. If we
Starting point is 01:04:25 notice, again, just sort of doing high level review, we say SDK, but someone may say JavaScript snippet or something like that, right? And so that is actually pretty difficult. That is very difficult. The nice thing is is we i don't know it's made that process faster but we've noticed multiple categories where people just use terminology that isn't in our documentation and like we don't really use but that's just how they refer to it because they have a they're familiar with a related concept yeah well that's super interesting okay i think we should make a promise here that in a couple of weeks, as this project progresses, we'll get both people from Refuel and people from Radarstack that were involved in the project and actually go through the project. I think it's going to be super, super helpful for the people out there. I think like one of the problems that problems, I mean, from my perspective, at least one of the issues with like LLABs right now is that there's so much noise
Starting point is 01:05:27 out there and so much very high level information that yeah, everything like sounds exciting, but when you get into like the gory details of trying to implement something in production, things are like very different and having, you know, like people who actually did it, I think can can drive tremendous value for the people out there. So if both of you guys are fine with that, I think we should have an episode dedicated to this and go through the use case itself and hear from the people who actually made this happen. Sure. That would be awesome. We could get customer success on too.
Starting point is 01:06:04 All right. I think we're at the buzzer here. What do you think, Eric? That's your part, so I'm giving you... Oh, yeah. You stole my line. That was the next best action. Yeah, we are at the buzzer. Rish, this has been great. This has been so great.
Starting point is 01:06:19 It's just been so helpful to sort of orient us to the LLM space and, you know, get practical, which I think is really helpful. And congrats on everything you're doing with Refuel. That's awesome. Thank you so much. It's been so fun chatting with the both of you and, yeah, excited for the next time.
Starting point is 01:06:41 We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers.
Starting point is 01:07:05 Learn how to build a CDP on your data warehouse at Rudderstack dot com.
