The Data Stack Show - 177: AI-Based Data Cleaning, Data Labelling, and Data Enrichment with LLMs Featuring Rishabh Bhargava of Refuel
Episode Date: February 14, 2024

Highlights from this week’s conversation include:
- The overview of Refuel (0:33)
- The evolution of AI and LLMs (3:51)
- Types of LLM models (12:31)
- Implementing LLM use cases and cost considerations (15:52)
- User experience and fine-tuning LLM models (21:49)
- Categorizing search queries (22:44)
- Creating internal benchmark framework (29:50)
- Benchmarking and evaluation (35:35)
- Using Refuel for documentation (44:18)
- The challenges of analytics (46:45)
- Using customer support ticket data (48:17)
- The tagging process (50:18)
- Understanding confidence scores (59:22)
- Training the model with human feedback (1:02:37)
- Final thoughts and takeaways (1:05:48)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
We are here on the Data Stack Show with Rish Bhargava. Rish, thank you for
giving us some of your time. Thank you for having me. All right. Well, give us the overview. You
are running a company that is in a space that is absolutely insane right now, which is AI and LLMs.
But give us the brief background. How did
you get into this? And then give us a quick overview of Refuel. Awesome. Yeah. So look,
I'm currently the CEO and co-founder of Refuel, but been generally in the space of data,
machine learning, and AI for about eight years. Was at grad school at Stanford,
studying computer science, researching machine learning, and then spent a few years at a company called Primer.ai, where I was an early
ML engineer.
The problems we were trying to solve back then were, how do you take in the world's
unstructured text, allow people to ask any questions, get a two-pager to read, so all
of these kind of interesting NLP problems.
And then spent a few years after that solving data infrastructure problems, how do you move terabytes of data from point A
to point B, lots of data pipeline stuff.
And that led into starting Refuel.
You know, one of the key reasons why we started was just how do you make good, clean, reliable
data accessible to teams and businesses?
And that's the genesis of the company.
And here we are.
Very cool. And Eric, you know,
I know Rish from
the COVID days, so
for me, it's very
exciting to see the evolution
through all this and
see him today building in
this space. I remember talking
almost two and a half years ago
about what he was thinking.
Back then, LLMs were not the thing that they are today. So for me, at least, it's very
fascinating because I have the journey of the person in front of me here, and I'm really
happy to get into more details about that. So definitely, we have to chat about it.
But also, I think we have the perfect person here to talk about what it means to build a product and a company in such an explosive environment.
Things are changing literally from day to day when it comes to these technologies like LLMs and AI and machine learning. And just keeping up with the pace
from the perspective of a founder,
I think it's a very unique experience.
So I'd love to talk also about that with you, Rish.
And also hear what's happening. You probably have a much better understanding
of what is going on with all these technologies out there,
but also how you experience that, trying to build something.
And of course, talk about the product itself.
What about you?
What are some topics that you are really excited about talking today with us?
Look, I'm super excited to talk about just the world of generative AI, how quickly it's
evolving.
But Kostas, Eric, both of you have spent so much time talking to folks in data and how the world of LLMs impacts the world of data, right?
How do you get better data, cleaner data, all of those fun topics?
And frankly, what does it mean, right?
What are the opportunities for businesses and enterprises when this, as you said, explosive
technology is really taking off?
So excited to dig into these topics.
Yep.
I think we are ready for an amazing episode here. What do you think, Eric?
Let's do it.
Let's do it.
All right. We have so much ground to cover, but I want to first start off with just a little bit
of your background. So you have been working in the ML space for quite some time now, I guess, you know, sort of in and
around it for close to a decade. And you could say maybe you were working on, you know, LLM
flavored stuff a couple of years ago before it was cool, which is pretty awesome. That's,
I would say, a badge of honor. But what changed? What were you doing back then? And what changed
in the last couple of years to where it's a frenzy, you know, and the billions of dollars that are being poured
into it is just crazy. Yeah, it's been such an incredible ride, Eric. You know,
just a little bit on my background, you know, post-grad school at Stanford, I joined this
company called Primer. This is about seven years ago at this point.
And the problem that we were trying to solve back then was how do you take in the world's unstructured text information, take in all of news articles, social media, SEC filings,
and then build this simple interface for users where they can search for anything.
And instead of Google style results, right?
Here are the 10 links.
Instead of that, what you get is a two-pager to read, right? So how do you assimilate all of this knowledge
and be able to put it together in a form that is easy to consume? And this used to be a really hard problem. This used to be many months of effort, maybe years of effort, getting it into a place where it works. And if you compare that to what the world looks like today,
I would bet you this is 10 lines of code today using OpenAI and GPT-4. So truly,
some meaningful changes have happened here. And I think at a high level, it's not one thing. It's many things that have sort of come together. It's new machine learning model architectures that have
been developed, things like the Transformers models. The data volumes that we're able to collect and gather,
that has gone up significantly. Hardware has improved. Cost of compute has gone down. And
it's a marriage of all of these factors coming together that today we have these incredibly
powerful models to just understand so much of the world. And you just ask them questions and you get answers that are pretty good.
And it just works.
So it's been an incredible ride these last few years.
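As a rough illustration of the "10 lines of code" Rish mentions, here is a minimal sketch of the summarize-unstructured-text idea, assuming the OpenAI Python SDK (v1-style client) with an API key in the environment; the model choice, prompt wording, and placeholder documents are illustrative, not from the episode.

```python
# Minimal sketch of "unstructured text in, two-pager out" with a hosted LLM.
# Assumes the OpenAI Python SDK (>=1.0) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

documents = ["<news article text>", "<SEC filing excerpt>"]  # placeholder inputs

response = client.chat.completions.create(
    model="gpt-4",  # any capable chat model works for the sketch
    messages=[
        {"role": "system", "content": "Summarize the provided documents into a concise two-page briefing."},
        {"role": "user", "content": "\n\n".join(documents)},
    ],
)

print(response.choices[0].message.content)
```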
Very cool.
And give us an overview of where does Refuel fit into the picture when it comes to AI?
Yeah.
So look, at Refuel, we're building this platform for data cleaning, data labeling, data enrichment using large language models, and importantly, at better than human accuracy.
And the reason why we look at building the product this way is, at the end of the day, data is such a crucial asset for companies, right? It's like the lifeblood of making good decisions,
training better models, having better insights about how the business is working.
But one of the challenges is, and people still complain, hey, we're collecting all of this data,
but if only I had the right data, right? I could do things X, Y, or Z, right? People still complain
about it. And the reason is, working with data, it's an incredibly
manual activity. It's very time-consuming. People are spending tons of time just looking at
individual data points. They're writing simple rules, heuristics. And so doing that data work
is actually, it's hard and it's time consuming. And what if, right?
Like, you know, the question that we ask is, with LLMs becoming this powerful, what if the way of working wasn't that we look at data ourselves and we write these simple rules and we do that manual work ourselves?
But what if we were to just write instructions for some large machine learning model, some
LLM to do that work for us?
And writing instructions for how work should be done is significantly easier, significantly
faster than doing the work itself.
And so that just is a massive leap in productivity.
And what we want to build with Refuel is being able to do a lot of these
data activities, data cleaning, data labeling, data enrichment, where the mode of operation is
as humans, as the experts, we write instructions, but this smart system goes out and does this work
for us. Makes total sense. I have so many questions about Refuel, and I'm sure Costas does.
But before we dive into the specifics,
you live and breathe this world of AI and LLMs every day. And so I'd love to pick your brain
on where we're at in the industry. And so one of the things, and maybe a good place to start would be, I'm interested in what you see as far as implementation
of LLM-based technology inside of companies. And I'll give you just a high-level maybe prompt,
if you will. It seems like an appropriate term. But I think everyone who has tried ChatGPT is convinced that
this is going to be game changing, right? But that, for many people, is largely sort of a
productivity. There's a productivity element to it, right? Or you have these companions like
GitHub Copilot, right? That, again, sort of falls into this almost personal or team level productivity
category, right? Tons and tons of tools out there. But then you have these larger projects within
organizations, right? So let's say we want to build an AI chat bot for support, right? We want to
adjust the customer journey that we're taking someone on with sort of a next
best action type approach, right?
It seems to me that there's a pretty big gap between those two modes, right?
And there's a huge opportunity in the middle.
Is that accurate?
What are you seeing out there?
I think that's a great way to look at it, Eric.
I think you're absolutely right. Look, folks who have spent a meaningful amount of time with
ChatGPT, you go through this experience of like, my God, it's magical. I asked it to do something
and this poem that it generated is so perfect. You go through these moments of it works so
incredibly well. There's, as you mentioned, there's co-pilot
like applications that are almost plugged into where, you know, that individual is doing their
work and it's assisting them. It's, you know, in the most basic form, it's doing autocomplete,
but in the more kind of advanced form, it's almost offering suggestions of how to be able to rewrite
something or just a slightly higher order activity.
But there is a jump from going from something that assists an individual person accomplish their task 5%, 10% better or faster to how do you deploy applications where these large
language models are a core key component at scale at a level where the team that is actually developing and building this
feels like, you know what?
Our users are not going to be let down because the performance and the accuracy and there
aren't going to be hallucinations and it's going to scale.
There's a whole set of challenges to deal with, to go from that individual use case
to something that looks and feels like this is production ready.
And I think as we kind of roll this back a little bit, we're very
early in this cycle, right?
The core technologies, you know, the substrate, these LLMs, they themselves are changing so
rapidly.
Of course, you know, OpenAI has been building and, you know, sort of deploying these models
for a while now.
Google is, you know, we're recording in December.
So Google has just announced sort of their next set of models.
There's a few open source models that are now coming out that are competitive,
but the substrate itself, the LLMs themselves, they're so new, right?
Yeah.
And so the types of applications that we expect to be built,
you know, this is going to be a cycle of, you know,
somewhere in the, you know, two to five years where we'll
truly see a lot of mainstream adoption, but we're early.
But the interesting thing is, I think there is still an interesting playbook to follow
for folks who are experimenting and want to build high quality pipelines that use LLMs,
that are applications that use LLMs.
So I think there are playbooks that are being built out, but I think in the curve,
we're still kind of early. Yeah. Yeah. That makes total sense.
What are the ways, I mean, if you just read the news headlines and every company that's
come out with a survey about adoption of LLMs, you would think that most companies are
running something pretty complex in production, which I think that's probably a little bit
clickbaity. Maybe even that's generous, but what are you seeing on the ground? What are the most
common types of things that companies are trying to use LLMs for beyond the sort of personal or sort of
small team productivity?
So the way we would look at, the way we're seeing the types of applications that are
going live today, you know, the first cut that enterprises typically take is what are
the applications that are internal only, right?
That have no meaningful impact on, you know, at least no direct impact on users, but can drive efficiency gains internally.
So things like, if there are a hundred documents that need to be reviewed every single
week, can we make sure that maybe only 10 need to be reviewed because 90 of those can be analyzed
by some LLM-based system. That's an example. I think a second example that teams are starting to think
about is places where they can almost act like a co-pilot or almost offer suggestions to the user
while the user is using the main application. Almost it's helpful suggestions. I think one of
my favorite examples is if you've just created, let's say you've captured a video, right? Something that automatically suggests a title, right? It's like a small kind of,
it's a small tweak, but makes a nice kind of difference to the user and it doesn't make or
break anything. The third thing that we're starting to see, and I think we're still early,
but this is where we believe a lot of business value is going to be driven,
is frankly, existing workflows where data is being processed or where data consistently gets reviewed by teams internally, where the goal is, how do we do it more accurately,
cheaper, faster, by essentially reducing the
amount of human time involved, right?
And these are typically, if they're more business critical, the bar for success is going to
be a little bit higher, right?
So teams will have to invest a little bit more time and effort getting to the levels
of accuracy and reliability that they care about.
But those become sort of core, let's say data pipelines, they become core product features.
But that's the direction that we're seeing businesses sort of head towards.
Yeah, super interesting.
You mentioned efficiency and cost. Can you tell us what you're seeing out there in terms of what it takes to actually implement an LLM use case? You know, it's one of those super easy to start,
and then very difficult to, A, just, you know, sort of understand the infrastructure you need for your use case
among all the options out there, and then B, figure out what it actually will cost to run
something at production scale, right? I mean, you can query GPT, even if you pay for the API,
you know, it's pretty cheap, you know, to send some queries through, right? When you start
processing, you know, hundreds of millions or billions of data points,
it can get pretty serious.
So how are companies thinking about it?
You know, it's such an interesting question.
In some ways, we look at,
the way we're seeing it,
developing new applications with LLMs,
it's a journey.
It's a journey that you have to go on
where, you know, as with a journey,
you want, you know, somebody who's accompanying you. And in this particular case, it's, you know,
it's one LLM or like a set of LLMs that you start out with. And typically, you know, the place where
people start is there's a business problem that I need to solve. And we were discussing prompting
initially. It's like, can I, would it be amazing if I just
wrote down some prompt and the LLM just solved my problem for me? That would be amazing, right?
And so that's where people start. And it turns out that, you know, in many use cases, it can take you 50%, 60% of the way there. And then you have to sort of layer on
other techniques almost from the world
of LLMs that help you sort of go from that 50 to 60% to 70 to 80 and progressively higher.
And sometimes it's easier to think about working with LLMs and not to anthropomorphize sort of
LLMs too much, but sometimes it's easier to think about LLMs as like, you know, like a human companion almost, right? My favorite analogy
here is, sorry, this is a bit of a tangent, right? Winding way to kind of talking about how to
develop LLMs, but bear with me. You know, sometimes it's easier to think about how to get LLMs to do
what you want them to do by thinking of what would it take a human to succeed at a test?
Okay. Let's say we were to go in for a math test in algebra tomorrow. Of course, we've taken courses in our past. We could just show up and go take the test, but we'd probably get to 50 to 60% in terms of how well we do. If we wanted to improve in terms of performance, we would go in with sort of a textbook, right? We'd treat it as like an open book test, right?
And the analogy for that in the world of LLMs is things like few-shot prompting, where you show the LLM examples of how you want that work to be done, and then the LLM does it better, right?
Or you introduce new knowledge, right, which is what bringing your textbook does, right?
And so that is the next step
that typically developers take, right?
In terms of improving performance.
And then the final thing,
if you truly wanted to ace the test,
you wouldn't just show up with a textbook.
You'd spend the previous week actually preparing, right?
Actually doing a bunch of problems yourself.
And that's very similar to how fine-tuning works, right? Or training the LLM works. And so typically the journey of building
these LLM applications, it takes this path where teams will just, you know, they'll pick an LLM,
they'll start prompting, they'll get somewhere and then it won't be enough. And then they'll
start to introduce these new techniques
that folks are developing on how to work with LLMs, whether it's few shot prompting or retrieval
augmented generation where you're introducing new knowledge. And then finally, you're getting
to a place where you've collected enough data and you're training your own models because that
drives the best performance for your application. So that's the path that teams take from an accuracy perspective.
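Rish's open-book-test analogy maps to few-shot prompting: show the model a handful of worked examples before the real input. Here is a minimal sketch of that technique under the same assumptions as before (OpenAI Python SDK, API key in the environment); the categories and example queries are invented for illustration.

```python
# Few-shot prompting sketch: worked examples go into the prompt before the new query.
from openai import OpenAI

client = OpenAI()

# Invented examples of how the categorization should be done.
few_shot_examples = [
    ("running shoes size 10", "Footwear"),
    ("iphone 15 case", "Phone Accessories"),
    ("gift card", "Gift Cards"),
]

def categorize(query: str) -> str:
    # Instructions first, then the examples, then the query we actually care about.
    lines = ["Categorize each search query into exactly one category."]
    for example_query, label in few_shot_examples:
        lines.append(f"Query: {example_query}\nCategory: {label}")
    lines.append(f"Query: {query}\nCategory:")
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "\n\n".join(lines)}],
    )
    return response.choices[0].message.content.strip()

print(categorize("wireless earbuds under $50"))
```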
But then, of course, you were also running this in production.
It's not just about accuracy.
We have to think about costs.
We have to think about latency.
We have to think about where is this deployed.
And I think the nice thing about this ecosystem is the costs look something today, but the rate at which costs are going down, it's extremely promising.
So we can start deploying something today, but odds are that in three months or six months time, the same API will just cost 3x less.
Or there might be an equivalent open source model that is already as good, but it's 10x cheaper. So the cost curve is coming down fast. But I can see in your eyes that you have questions on the tip of your
tongue, and I want to know what they are.
Yeah, of course I have.
So, Rish, let's go through the experience that someone gets with Refuel.
I'm asking that for two reasons. One is because, obviously, I'm very curious
to see how the product itself
feels like for someone
who is new
in working with LLMs
because it's one thing...
I think most of the people, and you mentioned
that with Eric previously,
the first
impression of an LLM is through
something like ChatGPT,
right? Which is a very different experience compared to going and fine tuning a model or
like building something that is like much more fundamental with these models, right?
So I'm sure there's a gap there in terms of the experience. Probably the industry is still trying to figure out what's the right way for people
to interact and be productive with fine-tuning and building these models.
So tell us a little bit about that, how it happens today.
And if you can, also tell us a little bit of how it has changed since you started, right?
Because it will help us understand what you've also learned by building something for the market out there.
Absolutely, Kostas. So look, to the experience, I think it's sometimes easier to take an example, right? Let's say the type of problem that we're trying to solve, let's say you're an e-commerce company or a marketplace, and you're
trying to understand what are people searching for? Given a list of search queries, what is the
thing that they're actually looking for? Is it a specific category of product? Is it a specific
product? What is the thing that they're looking for? And this is a classic example of like a classification or categorization type of task.
So the way refuel works is, you know, you point us to wherever your data lives, right?
So we'll be able to kind of read it from different cloud storages or warehouses, or you can do data uploads.
And then you pick from one of our templates of the type of thing that you want to accomplish, the type of tasks you want to accomplish.
So in this particular case, it would be, let's say, categorizing search queries. That's the
template that you pick. And the interface of working with Refuel once you've plugged in your
data and you've picked the template is just write simple natural language instructions
on how you want that categorization to happen. And I think that's similar to what exploring or playing around with ChatGPT feels like, which is there's just a text box. And what it's asking you for is, hey, you want to categorize search queries. Help us understand what are the categories that you're interested in. And if you were to explain this to another human, what would you write to explain and to get that message across? And that's the starting point here. So a user will just
describe that, hey, these are the categories that matter to me, and this is how I want you to
categorize, essentially, the Refuel product will start churning through that data, start categorizing
every single search query, and then we'll start highlighting the examples that the LLM found confusing. And this is actually like a big
difference from what a simple use of ChatGPT would do, because LLMs are an incredible piece of technology, but you give them something and they will give you back something, without regard for whether it's correct or not.
But if you want to get things right, right, it is important to know and understand where is the LLM actually confused.
And so we'll start highlighting those examples to the user to say, hey, this query and your
instructions didn't quite make sense.
Can you review this?
And at that point, you know, the ones that are confusing, the user can sort of go in,
they can provide almost very simple thumbs down, thumbs up feedback to say, hey, you got this
wrong, you got this right. Or they can go and adjust the guidelines a little bit and iteratively
refine how they want this categorization task to be done. And the goal really is that
if in the world without LLMs, right,
if you had to do this manually, and you're having to do this categorization every single time for
every single one of those search queries, instead of that, you're maybe having to review 1%,
maybe 0.1% of the data points that are most helpful for the LLM to understand and essentially
do that task better into the future.
So that's what the experience of working with it looks and feels like, where it's this
system that is trying to understand the task that you're setting up.
It's surfacing up whatever is confusing and iteratively getting to something that is going
to be extremely accurate.
And whenever folks are, let's say, happy with the quality that they're seeing, it's a one
click button and then you get sort of an endpoint.
And then you can just go and plug it in production and continue to kind of serve this categorization
maybe for real traffic as well.
That's the experience of working with the system.
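Refuel's own interface and API aren't shown in this conversation, so the snippet below is only a generic sketch of the workflow shape Rish describes: write natural-language guidelines, let an LLM categorize each record, and route anything it can't place to a human for review. All names, categories, and the UNSURE convention are hypothetical.

```python
# Generic sketch of "instructions in, labels out, confusing items surfaced for review".
# This is NOT Refuel's API; it is an illustration of the workflow described above.
from openai import OpenAI

client = OpenAI()

GUIDELINES = """You categorize e-commerce search queries.
Allowed categories: Footwear, Electronics, Home, Other.
If a query does not clearly fit one category, answer UNSURE."""

def label_queries(queries: list[str]) -> tuple[dict[str, str], list[str]]:
    labeled, needs_review = {}, []
    for query in queries:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": GUIDELINES},
                {"role": "user", "content": query},
            ],
        )
        answer = response.choices[0].message.content.strip()
        if answer == "UNSURE":
            needs_review.append(query)  # surface to a human, then refine the guidelines
        else:
            labeled[query] = answer
    return labeled, needs_review

labeled, needs_review = label_queries(["trail running shoes", "thing for the kitchen"])
print(labeled, needs_review)
```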
And it's often useful
to compare it with how it would be done in the world without LLMs. In the world without LLMs,
you're either manually doing it or you're writing simple rules and then you're managing rules.
But instead, the game with LLMs is write good instructions and then give almost thumbs up,
thumbs down feedback. And that's enough
to get the ball rolling and get it to be good. Now, I think the second part of your question was,
how has this changed and evolved as we've been building this out? Actually,
there's two interesting things there. The first is, for us, the problems that we've been
interested in have always remained the same, which is how do we get better data, cleaner data in less time and so forth, right?
So the problem of good, clean data has always remained the same.
I think the interesting changes that we've learned is, frankly, which LLMs to pick for
a given task.
There's more options that are available now.
And there are more techniques that are available
that can almost squeeze the juice out from an accuracy perspective.
So we've essentially just learned a lot
in terms of how to maneuver these LLMs.
Because, you know, at the very beginning,
a lot of the onus was on the end user
to be able to drive the LLM in a particular direction.
But at this point, having seen many of these problems, we generally understand what you
have to do to get the LLMs to work successfully so that teams are not spending too much time
prompt engineering, which is its own kind of sort of ball of wax.
So that's one interesting thing that we've learned.
And I think the second thing that we've learned is, and I think we're going to see this in industry as well, that the future of the industry, it's not going to look like a single model that is just super capable at every single thing.
We are generally headed in a direction where there's going to be different models that are capable, some bigger, some smaller, that are capable at individual things.
And almost being able to get there quickly and manage that process and manage those systems becomes the important kind of factor.
Because for many reasons, from accuracy to scalability to cost to flexibility, being
able to get to that sort of smaller custom model ends up being super important here.
Yeah, that makes a lot of sense.
Okay, so when someone starts with trying to build an application
with LLMs, and here we are talking about open-source LLMs, right?
We're talking about models that are open-source.
There are a couple of things that someone needs to decide upon.
One is, which model should I use
as a base model to go and
train it?
The other thing is that all these models
come up in different flavors,
which usually has to do with their size.
You have 7 billion
parameters. You have 75
billion parameters. I don't know. In the future,
we're probably going to have even more variations. So when someone starts and they
have a problem in their mind and they need to start experimenting to figure out what to do,
how do they reason about that stuff? First of all, how do I choose between Llama and Mistral?
Why I would use one or the other?
Because apparently, and my feeling is that, as you said,
there's no one model that does everything, right?
So I'm sure that Mistral might be better in some use cases,
Llama might be better in some other use cases.
But at the end, if you read the literature,
all these models are always published with the same benchmarks, right? So that doesn't really help someone decide what's the best for their use case, right? So how should a user reason about that without wasting and spending hours and hours of training to figure out at the end which model is best for their use case?
Yeah, it's such an important problem and still so hard to kind of get right.
In some ways, there's a few kind of questions that are underneath. There's a few things that need to be answered here. At a super high level, the goal for somebody who's building that LLM application is to figure out almost viability, right?
The thing that we're trying to do, like, is this even doable?
Is this even possible?
Right?
And so if I were in that person's shoes, right,
the first thing that I would do
is I would pick a small amount of data
and I would pick the most powerful model
that is available.
And I would see,
can this problem be solved
by the most powerful model today?
Giving it as much information as possible, trying to simplify the problem as much as possible, can this problem even be solved by the LLM? That's one thing that I would try first and foremost. The second thing: if I started to kind of look into open source, the benchmarks that folks publish, these are very academic benchmarks.
They don't really tell you too much about how well this is going to do on my data, right? Or let's
say my customer support data, right? Like, how is Mistral going to know? Or how is Llama
going to know about, you know, what my customers care about? It's hard. So the kind of the way to
understand kind of open source LLMs
and to start to get a flavor of that,
I think would be first create a small,
pick a small data set that is representative of your data
and the thing that you want to accomplish.
Can be, you know, a couple of hundred examples,
maybe, you know, a thousand examples or so forth.
And then, if, for example, infrastructure was available to the team, then use some of the Hugging Face and some of these other kind of frameworks that are available to spin
those models up.
Although today we're starting to see sort of a rise of sort of just inference kind of
provider companies that can make this available
through an API as well. But I would start playing around with like the smaller models, right? Like,
can this problem be solved by a 1 billion parameter model, a 7 billion parameter model,
right? And just see like, you know, at what scale does this problem get solved for me?
Because odds are that if you're truly interested in open source models,
and you're thinking of deploying these open source models into production,
you probably don't want to be deploying the biggest model, because it's just a giant pain,
right? So then the question becomes, if we do want to solve this problem, what is the smallest
model that we can get away with? And there's a few kind of architectures
and there's a few kind of flavors
from a few different kind of providers
that are the right ones to pick
at any given moment in time.
And I don't even want to offer suggestions
because the time from now when we're recording this
to when this might actually go live,
there might be new options that are available.
So picking something from one of them, you know, let's say from Meta or Mistral, is a good enough starting point, but then trying out the smaller model and seeing how
far that takes us almost gives us a good indication of like, for the latencies that we want and
the costs that we want, what is the accuracy that is possible?
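Here is a sketch of that "try the smallest model that works" step, assuming the Hugging Face transformers library; the checkpoint name is just one example of a small open-source chat model available around the time of recording and will likely be superseded.

```python
# Sketch: run a representative sample through a small open-source model first,
# and only move up to a 7B (or larger) checkpoint if the accuracy isn't there.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # example ~1B-parameter checkpoint
)

sample_queries = ["running shoes size 10", "usb-c charging cable"]  # a few hundred in practice

for query in sample_queries:
    prompt = (
        "Categorize this e-commerce search query into one of: "
        f"Footwear, Electronics, Other.\nQuery: {query}\nCategory:"
    )
    output = generator(prompt, max_new_tokens=5, do_sample=False)
    # The pipeline returns the prompt plus the continuation; keep only the continuation.
    print(query, "->", output[0]["generated_text"][len(prompt):].strip())
```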
Yep. yep.
That makes sense.
So from what I hear from you,
it almost sounds like the user needs to come up
with their own benchmark, internal benchmark framework, right?
Like they need to somehow,
before they start working with the models,
to have some kind of taxonomy of like what it
means for a result to be good or bad, and ideally to have some way of measuring that. I don't know if it can be just black and white, like it's good or bad and that's it, right? It might need to be more, let's say, something in between, like between zero and one. But how can users do that?
Because that's, I mean, that's like always like the problem with benchmarks, right?
Like, and even in academia, like there is a reason that benchmarks tend to be so well
established and rigid, and it's not that easy to bring something new.
Or if you bring something new, usually that's a publication also, right? Because figuring out like all the nuances of like creating something that can benchmark,
let's say, in a representative way and have like a good understanding of like what might
go wrong with the benchmark is important, right?
So how does someone who has no idea about benchmarking, but who is a domain expert in a way, right?
Like the person who is interested in marketing to go and solve the problem, they are the domain experts.
It's not you.
It's not me.
It's not the engineers who go and build that stuff, right?
But they probably never had to think about benchmarks in their lives or what it means, like specifically a benchmark for a model, right?
So can you give us a little bit of hints there?
I mean, I'm sure like there's no,
probably not an answer to that.
If there was like,
probably you would be public already with your company,
but how you can help your users
like to reason about these things
and avoid some common pitfalls, let's say,
or at least not be scared of going and trying to
build this kind of benchmark infra that they need in order to guide their work.
It's a great question, Kostas. Actually, I'll ask you guys the question. In one way,
I can answer in the direction of whether Refuel actually makes this possible, but I don't want to just show Refuel here, so I can also just chat about generally how teams should think about it. The answer probably is along the lines of there should be tools that do so. I'm curious if you guys have a sense of how you'd want this answered here. Yeah, I'll tell you my opinion, and it comes from a person who has
like experience with benchmarks from a little bit of a different domain.
Because benchmarking is one of the more long-lasting marketing tools in database systems, with a lot of interesting and spicy things happening there, with specific clauses in some of them where people cannot publish the names of the vendors and all that stuff. Which indicates how, even in something that is so deterministic in a way, like building a database system, figuring out what the right benchmark is is almost an art more than a science. But
what I've learned is that no benchmark out there, from academia or from the industry either, can survive the use case of the user. The user always has some small, unique nuances to them that can literally render a benchmark completely useless.
So it is, at the end, I think more of a product problem,
in my opinion. And I say product not because there's no engineering
involved. There's a lot of engineering involved there. But it has to be guided by user input for figuring out the right trade-offs.
And I think what we see here compared to building systems that are supposed to be completely
deterministic is that this is a continuous process.
It's part of the product experience itself.
The user, as they create their data sets and all that stuff, also needs to create some kind of benchmark that's uniquely aligned to their problems. Now, how do we do that? I don't know. It's something that I think is a very fascinating problem to solve, and I think something that can deliver tremendous value for whatever venture comes up with that. But that's my take on that. What do you think, Eric?
I think you might have like,
you're more of like customer side.
So you probably have more knowledge
than any of us on that.
Yeah, I mean, I think, you know,
we've done a number of different projects
actually trying to, you know,
trying to actually leverage this technology in a way.
I mean, it's funny, Rish, I think we followed a little bit of the pathway that you talked about,
right? I mean, there's a sort of personal productivity and then there's sort of this,
you know, trying to use it almost as like an assistant as part of an existing process.
And I think the specificity is really important, right?
It actually, I think one of the places that a lot of, that I've seen things go wrong in my sort of limited view is, well, we have an LLM, let's just find a problem, right? And you end up sort of, I don't know. I think you end up sort of solving problems that
don't necessarily exist for the business. And so for us, it's really, I think one of the key
things for us is defining the specific KPIs that a project like this can actually impact, right? And sort of describing
that ahead of time. So that, I don't know, at least that's the way that we've approached it.
Makes sense. Yeah. I mean, look, Kostas, I think benchmarking is a pretty hard problem because every specific
customer problem, every specific company, there's so much uniqueness in their data,
in how they view the world, that in the world of LLMs, the term that gets used is evaluation,
which is what is on a given data set and with a specific metric in mind. The metric might be
as simple as accuracy, but with that kind of metric in mind,
right? And accuracy is still easier when there's a yes or no clear answer. In many cases, there might not be a clear answer.
So what is that right metric becomes a hard problem.
So benchmarking is hard.
And I think there's maybe a couple of things to kind of think through for most teams as
they go down this process.
The first is what dataset, right? What dataset that is small
enough that they can maybe manually look at and review, but that still feels representative of
their problem, right? And their production traffic that they imagine getting. And of course, that's
not going to be a static dataset. So that has to evolve over time as we see more kind of data
points come through. But that's
almost question number one, which is, what is the data set? Then how can maybe a good product or a
good tool help me find and isolate that data set from a massive table that might exist in a data
warehouse? So that's question number one around benchmarking and evaluation. And I think the second question is,
what is the right metric? In some cases, it might be a metric that is more technical,
something like an accuracy or a precision. Sometimes that metric might be
more driven by what users care about and what that product team is thinking about,
that this is the thing that matters to a user. And so thinking about like, you know,
you know, I'll throw out an example, but in the case of sort of applications where data is being
generated, did we generate any fact that was not available in the source text, right? That is a
metric that you could write down matters a lot to users. And so then it's a combination of how's that data set evolving over time?
And what is the metric and the threshold that we think is going to be success or failure
for this application?
It's a combination of those things that teams end up thinking about.
The best teams think about this before a single line of code is
written, right? But sometimes it's hard, right? Sometimes you don't know what are the bounds of
what the technology can offer and how the data set might evolve over time. Or sometimes the
threshold that somebody sets is just because they heard it from somebody, right? From another
company, but it turns out it's not meaningful enough in that particular business.
And so you're right.
It is, it's a super hard problem.
It's very complicated,
but I think, you know, with better tools,
this will become easier for people.
But in many ways, this is the,
this is one of the most important things to get right.
Because the more time that gets spent here, the better: some of the infrastructure problems
and the tooling downstream of it
and which LLMs to use,
they are driven by decisions that are made
at this stage of the problem statement.
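A minimal sketch of the evaluation setup Rish describes: a small hand-reviewed dataset, one agreed-upon metric, and a success threshold decided up front. The tickets, labels, threshold, and the placeholder predict function are all hypothetical.

```python
# Minimal evaluation harness: representative labeled examples + one metric + a threshold.

def accuracy(predictions: list[str], labels: list[str]) -> float:
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

# Hand-reviewed evaluation set: (input, expected label). A few hundred rows in practice.
eval_set = [
    ("refund for duplicate charge", "Billing"),
    ("sdk not sending events", "SDK"),
    ("how do I set up reverse ETL", "Reverse ETL"),
]

def predict(text: str) -> str:
    # Placeholder: swap in the LLM pipeline that is actually being evaluated.
    return "Billing"

THRESHOLD = 0.90  # agreed with stakeholders before any pipeline code is written

predictions = [predict(text) for text, _ in eval_set]
score = accuracy(predictions, [label for _, label in eval_set])
print(f"accuracy={score:.2%}, good enough={'yes' if score >= THRESHOLD else 'not yet'}")
```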
Yep, yep, 100%.
I think that's like the right time to...
We have the luxury here
to have actually a vendor, you, in this space,
and also a user,
which is Eric.
So RudderStack is evaluating some tools that they are trying to build using LLMs, and they are doing that through Refuel.
So I think it's an amazing opportunity to go through this experience by having both
the person who builds the solution, but also the
person who's trying to solve the problem and see how this works at the end, with very unique
depth and detail.
So I'll give it to you, Eric, because you know all the details here.
But for now, for this part of the episode, I'd like to hear your experience with trying to solve a problem using LLMs and how this happened by interacting with and using Refuel as the product.
Sure.
Maybe I'll get to ask Eric a couple of questions as well about his experience here.
Yeah, totally.
Reveal all.
Live customer feedback. That's the best. Let's do it.
Sure. It really has been fascinating.
I'll just go through the high-level use case.
We had actually met, Rish, we met
talking about the show and having you on. And as I learned what
Refuel did, this light bulb kind of went off. And I think I remember asking you in that initial
introductory discussion, hey, would it work for something like this? And you said, yeah.
So we hopped on a call. But Kostas, the use case is, you know, one of the things that I am responsible for in my job is our documentation. And documentation is a really interesting part of a software business, right? There are many different aspects to it. There are many different ways that people
use it, right? They may read documentation to educate themselves about what the product does,
but it's also used very heavily and in large part intended for people who are implementing
the product and actively using it. And so one discussion that we've had a lot on the documentation team is how do we define
success with the docs, right?
And that sort of, you know, that sort of comes from a process of quarterly planning.
What are the key things that we want to do in the documentation?
And one of the things that we discovered was that there's a lot of low-hanging fruit where
if you have documentation that's been developed over a number of years, and you have thousands
of different documents in your portfolio, there are some that are old and need to be
updated, or that were done quickly and need to be updated, etc.
But once you sort of address the things that are objective
problems, which thankfully you have a lot of customers and customer success people
can sort of point those out for you and provide the feedback there.
One of the challenges is where do you go next in order to improve it, right? Because there
are obviously opportunities for improvement, but it's hard to find those out. And analytics themselves are a challenge because you can have lots of
false positives and false negatives. And so I'll give you just an example of one metric,
like time on site. If you have a blog and you're analyzing blog traffic, with time on site, generally you want more time on site, right? Because it means that people are spending a longer time reading. But with documentation, completing a task should take a certain amount of time. And so
they're on the page for, you know, they can be on the page for a long time, but it could also
indicate that they don't understand what they're reading and they keep trying things that aren't
working and returning to the documentation. So how do you know, how do you know that's the case?
And there are a number of ways that you can determine that
or attempt to determine that. But one of the things that we thought a lot about was
how we can narrow down those problem areas or opportunity areas and how we can hold the docs
accountable to some sort of metric that is measurable over time,
where we can see sort of true improvement, you know, if we uncover one of those and then fix it, and then how do we measure that over time, going beyond just sort of raw metrics.
And one of the richest repositories that we believe is like a compass for this project is our customer support ticket
data, right? Because if we can triangulate, you know, if there are enough customer support tickets
with a certain sentiment or a certain outcome that align to a metric like time on site or some other metric, then that will indicate to us whether it's a good thing or
a bad thing, right? And if it's a bad thing, then we can fix it. And then subsequently,
we should see customer support tickets related to that specific documentation or set of
documentation decline over time, right? And so that was a high-level project.
The challenge is that the customer support team,
so I went to the customer support team and said, hey, this is what we want to do with
the documentation. And they loved the idea, but they said, the problem is we've tried to do this
before. And it just was untenable, right? I mean, you're talking about, you know, thousands, tens of thousands, I can't remember what the exact number is, but it's a lot, right? And so even if you try to pull a random sample and have a couple of, you know, technical account managers go through and try to label the tickets, there's all sorts of challenges, right? The first one is you have to decide on a taxonomy. If you want to change that, you have to go back and redo all the work. I mean, they basically said we tried this and it didn't work. And so that's when we,
literally around that time was when I talked to you, Rish, and we had that initial conversation. And so I said, hey, we have a ton of unstructured data and we essentially need to tag it
according to categories.
And so, yeah, that's been interesting, actually.
It's been a super interesting project.
Okay.
And so tell us a little bit more
about like the tagging itself.
You mentioned, first of all,
like the taxonomy, right? What is the taxonomy in this context?
Yeah. Yeah, that's a great question. So when you think about tagging data, I'm not
an expert in tagging data, but for our particular use case, when you think about tagging data, you need to be able to aggregate and sort the data according to a structure so that you can identify areas where a certain tag may over-index for tickets that are negative in sentiment or however you want to define that, right? I almost think about it as,
you know, if you were creating a pivot table on a spreadsheet, how would you structure
the columns such that you can create a pivot table with drill downs that would allow you to,
you know, to group the results? And one thing that, so we actually started out with a very
simple idea that's proved
to be very helpful, but it's been trickier than we thought to nail down a taxonomy.
Actually, Rish, we haven't, I don't think we've talked about this since we kicked off
the project.
So here's some new information for you.
Initially, we just took the navigation of the docs, you know, in the sidebar as our
taxonomy, because we thought that would be,
even though we actually need to update
some of that information architecture,
at least we have a consistent starting point
that maps tickets one-to-one with the actual documentation.
The challenge that we face,
and actually one of the things that Refuel has been very helpful with, is that if you just list out essentially the navigation, or even one or two layers down in the navigation, as essentially the tags or the labels that you want to use for each ticket, you quickly start to get into what is technically fine for navigation,
but practically needs to be grouped differently, if that makes sense.
And so a great example would be, you know,
something like the categorization of sources,
mobile sources, server-side sources,
you know, that sort of thing.
And you may want to, you know,
like for SDKs or, you know, whatever.
There just may be ways that you practically
want to categorize things differently
or group things differently, if that makes sense.
Or another good example is like all
of our integrations, you know, we have hundreds of integrations and, you know, in documentation,
they're just sort of all listed, right? But it actually can be helpful to think about groups
of those as like analytics destinations or marketing destinations or whatever.
And so what Refuel has allowed us to do is actually test multiple different taxonomies,
which has been really helpful.
And so the practical way that we did that was we took a couple hundred tickets as a
just random sample.
And we wrote a prompt that defined the taxonomy and gave the LLM an overview of what it's looking for,
you know, related to sort of each label that we wanted.
And we just tested it, right? And we sort of got the results back
and have been able to modify that over time,
which has been really helpful.
And so that was interesting to me.
Initially, I thought,
we'll just have a simple taxonomy.
It doesn't matter.
But then from a practical standpoint,
the output data does really matter
for the people who are going to be trying to,
you know, sort of use it.
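What follows is a generic sketch of the kind of taxonomy experiment Eric describes: run the same ticket sample through two candidate label sets and compare how well each covers the data. This is not Refuel's API, and the taxonomies and tickets are invented.

```python
# Generic sketch of comparing candidate taxonomies on a random ticket sample.
from openai import OpenAI

client = OpenAI()

TAXONOMIES = {
    "docs_navigation": ["Mobile Sources", "Server-side Sources", "Destinations", "Other"],
    "practical_grouping": ["SDKs", "Analytics Destinations", "Marketing Destinations", "Other"],
}

def tag_ticket(ticket: str, labels: list[str]) -> str:
    prompt = (
        "Tag this support ticket with exactly one label from: "
        + ", ".join(labels)
        + ". Answer UNLABELED if none fit.\n\nTicket: "
        + ticket
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

sample_tickets = ["The JavaScript snippet isn't firing page events."]  # random sample in practice

for name, labels in TAXONOMIES.items():
    tags = [tag_ticket(ticket, labels) for ticket in sample_tickets]
    coverage = sum(tag != "UNLABELED" for tag in tags) / len(tags)
    print(name, tags, f"coverage={coverage:.0%}")
```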
And this is something, like, when you said the user who's going to use it, is this internal or external? Is this taxonomy primarily interpreted by the customer success folks at RudderStack?
Both the documentation team and the customer success team, actually.
Okay.
And how do they use this taxonomy?
So let's say you found the perfect taxonomy there using all these A, B, C, D, whatever, testing
with LLMs.
What's next?
You feed a new ticket in there and it's mapped in one of the taxonomy categories there.
How does this work for the user?
Yeah.
So I think there are a couple of things. The initial thing that we want to do, we're fairly close on now. I'll actually say one of the other things that we've
learned is that going through iterations really helps with the level of confidence that the model
provides back. So one really nice thing, but I'll actually tell you one of the things that we tried
really early on before we started using Refuel was just wiring the GPT API up to a Google sheet and sort of dumping in the
unstructured data and a list of tags or whatever. But the hallucination is a severe problem in that
context because it's just going to provide you an answer either way. And so one of the things
about Refuel that was very helpful for us is that you
can essentially define a confidence threshold and it just won't return a label if it doesn't
reach a certain threshold. And one of the things that is really nice about that is,
and I don't know if this is the intention, Rish, but the percentage of unlabeled
tickets is kind of a proxy of how well we're defining the taxonomy and sort of the instructions
we're giving it, which is a very helpful, like, you know, even just this morning, actually,
you know, we've been sort of making iterations to this and we have an extremely high level of
confidence across most tickets now, which is really nice, right? Whereas before we may have
gotten, and we were, you know, when we were iterating, we were, we had very sort of primitive
prompts, I would say. And so maybe you get like 60 or 70% of the tickets labeled, right? Or something
like that. And now we're like into the high nineties, which is pretty nice. And so the first
step was sort of getting confidence and aligning with the customer success team on,
let's spot check these and see if this is relevant.
And we're now at the point where we're going to run
the entire set of unstructured tickets.
And the first thing we're going to do is actually take that
and do planning around a couple of things.
So on the documentation side, which I'm closer to,
identifying the docs that we need to improve and then setting up a structure to track on a monthly basis, we'll basically operationalize tickets going into refuel and coming back to the labels. And then we'll track over time, the quantity of tickets for a particular label or set of labels. And so on the documentation side, that's sort of how we'll measure these key updates that we do. And then the customer success team, I think, has a number of ways that they're going
to use this, right? So if you imagine a new customer is onboarding and they can see the
sources and destinations that they're using or the particular use case that they have,
but they already know both quantitatively. And then the interesting thing for them is
qualitatively,
okay, I have a group of tickets. I can browse through a couple hundred tickets related to this
problem and figure out anecdotally where did they run into problems at which point in the process,
and they can actually update their onboarding processes. Hopefully, the documentation helps a
lot, but it can only go so far, right? So the customer success team can actually update their processes to say,
here's a customer,
here's the tech stack that they're running,
here are the use cases that they want to implement,
and they'll know ahead of time,
we need to watch out for these things,
do these things, you know,
to sort of smooth out that process.
Yeah, that makes total sense.
And like, Rish, one question from me,
and then I'm done.
I'm not going to ask anything more, but sorry, it's like so interesting.
So there's like a very key piece of information here that Eric talked about, and that's like
the confidence, right?
And that's something that Refuel returns, like the confidence level of like the model
in terms of the job that it did with the data.
But what does this mean? Because that boils down to a number at the end, right? But there's a lot
going on behind the scenes to get down to this number. And probably it has to be also interpreted
in a different way, depending on many different factors. Why do we need 0.9
instead of 0.99 or 0.7? I don't know. So tell us a little bit more about what is this confidence
level we're talking about here and how people should think about it.
Yeah, great question, Kostas. And Eric, thanks for the story. I mean, honestly, I just loved hearing kind of your thought process and experience as you went through it. Maybe I'll have a question or two for you in a second. Yeah, Kostas, on the confidence bit, you know, confidence can get sort of technical pretty quickly. But the main reason for trying to have rigorous ways of
assessing confidence is, again, it just comes back to LLMs are, they're text in, they're text out,
they'll produce an answer for you. And so then the question becomes,
when do we trust this output? When do we trust this response? And I'll tell you a little bit
about how we do it internally, which is we
actually have custom LLMs that we've sort of fine-tuned and trained that are purpose-built
to produce accurate and reliable confidence scores. And the confidence, you know, the way
to think about and interpret this number is at the end of the day, you know, with an example of
the support ticket tagging use case that Eric was mentioning, you know, let's say with RudderStack tickets, it's either about ETL or reverse ETL, or it's about the SDK. The confidence is a measure of how likely the output is to be correct: if we say a particular output, let's say reverse ETL, with 90% confidence, the model's confidence of being correct is 90%. So the goal is for the confidence score to be calibrated to correctness, if that
makes sense. That's the eventual end goal of having these confidence scores. So when you then
get these scores and these outputs, you should be able to almost set a threshold for your specific tasks
and say, hey, I want to be able to, you know, I want to hit like a threshold of 90% confidence,
because what that means is that everything that is above that, right, is going to be 90%
confidence or like 90% correct or more, right? And so you get that sort of calibrated sort of level.
Of course, getting confidence scores to
be very calibrated and to be very correct, it's an ongoing kind of research problem and something that we invest a lot of our technical resources into. But it's absolutely critical to get that right and to productize it, otherwise being able to rely on these outputs becomes hard.
That's how we think about confidence scores. Yeah, and I guess I forgot to also add in a very important detail there, Kostas. But one thing, so the way that this works, and there may be more going on under the hood, Rish, I mean, I'm sure there's a lot more going on under the hood, but as a user you can
actually go in and look at the individual tickets for us, right?
But, you know, it'd be a data point.
And you can interact with that ticket and essentially tell the LLM,
you know, that this is actually this label or that this is mislabeled, right?
And so you basically can,
you know, you're sort of training the model on the pieces that it's not confident on.
And so it kind of makes sense that initially you get, you know, especially with the primitive
prompt that you get stuff back that has a low confidence level, but then you, it's a human
in the loop essentially, right? You can go in and literally like tag them and interact with the tickets. And then,
you know, so let's say, you know, we put in a couple hundred tickets and then someone can go
in and tag, you know, 20, 30 tickets or whatever. And then the model, you get through a couple of
pages and then Refuel essentially tells you, like, okay, it's ready to rerun it, you know, based on this feedback, right? And so then the confidence level increases.
And so you can sort of iterate through that and give the LLM feedback on whether its confidence
level is accurate or not.
Yeah.
And exactly.
That's such a good way to put it, Eric.
The goal is you spend a little bit of time on the ones that are less confident where the model is not sure, but every single piece of feedback that you collect
helps the next data point become better. And eventually you get to a place where you just
start plugging in new data as it's sort of being generated and get high quality outputs out.
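Refuel's confidence scoring itself is proprietary, but the "calibrated to correctness" idea Rish describes can be illustrated with a small check: among predictions whose confidence clears the chosen threshold, what fraction turn out to be correct? The numbers below are made up.

```python
# Illustration of confidence thresholding and a rough calibration check
# (not Refuel's implementation; the review data is invented).

predictions = [  # (predicted_label, confidence, label_from_human_review)
    ("Reverse ETL", 0.97, "Reverse ETL"),
    ("SDK", 0.93, "SDK"),
    ("ETL", 0.91, "Reverse ETL"),
    ("SDK", 0.62, "Destinations"),
]

THRESHOLD = 0.90

accepted = [p for p in predictions if p[1] >= THRESHOLD]
needs_review = [p for p in predictions if p[1] < THRESHOLD]

# If the scores are well calibrated, roughly THRESHOLD (or more) of the accepted
# predictions should match the human-reviewed label.
correct_share = sum(pred == truth for pred, _, truth in accepted) / len(accepted)
print(f"accepted={len(accepted)}, correct among accepted={correct_share:.0%}")
print(f"sent to human review: {len(needs_review)}")
```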
You know, one of the other interesting things, actually, now I'm thinking through all the
details of this that makes it tricky to use an LLM with unstructured data, is that, and
you asked about the taxonomy, Kostas, and one of the other reasons that has been a very iterative process is that users will often use generic terms or separate terms that are different from what you have in the title of your documentation page.
And so over time, we've actually had to adjust the prompt where we sort of include these conditions. If we
notice, again, just sort of doing high level review, we say SDK, but someone may say
JavaScript snippet or something like that, right? And so that is actually pretty difficult.
That is very difficult. The nice thing is, I don't know, it's made that process faster, but we've noticed multiple categories where people just use terminology that isn't in our documentation and that we don't really use, but that's just how they refer to it, because they're familiar with a related concept. Yeah, well, that's super interesting.
Okay, I think we should make a promise here that in a couple of weeks, as this project progresses, we'll get both people from Refuel and people from RudderStack that were involved in the project and actually go through the project.
I think it's going to be super, super helpful for the people out there. I think, I mean, from my perspective at least, one of the issues with LLMs right now is that there's so much noise out there and so much very high-level information that everything sounds exciting, but when you get into the gory details of trying to implement something in production, things are very different. And having, you know, people who actually did it, I think, can drive tremendous value for the people out there.
So if both of you guys are fine with that, I think we should have an episode dedicated to this and go through the use case itself and hear from the people who actually made this happen.
Sure.
That would be awesome.
We could get customer success on too.
All right. I think we're at the
buzzer here. What do you think, Eric? That's your
part, so I'm giving you... Oh, yeah. You stole my
line.
That was the next best action.
Yeah, we are at the buzzer.
Rish, this has been great.
This has been so great.
It's just been so helpful to sort of
orient us to
the LLM space and, you know, get practical,
which I think is really helpful.
And congrats on everything you're doing with Refuel.
That's awesome.
Thank you so much.
It's been so fun chatting with the both of you and, yeah, excited for the next time.
We hope you enjoyed this episode of The Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack,
the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.