The Data Stack Show - 205: How to make LLMs Boring (Predictable, Reliable, and Safe), Featuring Nicolay Gerold
Episode Date: September 4, 2024
Highlights from this week's conversation include: Nicolay's Background and Journey in AI (0:39), Milestones in LLMs (4:30), Barriers to Effective Use of LLMs (6:39), Data-Centric AI Approach (10:17), Importance of Data Over Model Tuning (12:20), Capabilities of LLMs (15:08), Challenges in Structuring Data (18:28), JSON Generation Techniques (20:28), Utilizing Unused Data (22:36), Importance of Monitoring in AI (34:11), Challenges in AI Testing (37:40), Error Tracing in AI vs. Software (39:24), The AI Startup Landscape (40:53), Marketing for Technical Founders (42:41), Generative AI Hype Cycle (44:33), Connecting with Nicolay and Final Takeaways (47:59)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Hi, I'm Eric Dodds.
And I'm John Wessel.
Welcome to the Data Stack Show.
The Data Stack Show is a podcast where we talk about the technical, business, and human
challenges involved in data work.
Join our casual conversations with innovators and data professionals to learn about new
data technologies and how data teams are run at top companies.
Welcome back to the show. We're here with Nicolay Gerold. Nicolay, welcome to the show.
Give us some background today on some of your previous experience and give us some highlights.
Hey, yeah, happy to be here. So I'm Nicolay. I run an AI agency in Munich. We also recently started a venture builder. Most of my history has been in LLMs, and especially controllable generation. So how to make them boring, which for me means predictable, reliable, and safe.
And yeah, excited to chat with you today.
Okay, Nicolay, we just spent a few minutes
chatting to prepare for the show. And I'm really excited to dig into, from your opinion, what LLMs are actually good at and
maybe what they're not so good at, but everybody still tries to make them do.
What are you looking forward to chatting about?
I'm really looking forward to discussing AI startups versus software startups, or even
like AI versus data startups, because it really goes into like the deterministic software
versus like unpredictable AI discussion as well.
All right, let's dig in.
Let's do it.
So we're here with Nicolay, and Eric is out today.
So we have a special co-host,
the cynical data guy here co-hosting.
So thanks for coming today, Matt.
Thanks for having me.
I'll try to be a little less cynical.
All right, that'll be good. All right. All right.
That'll be good.
All right, Nicolay.
Yeah.
Give us some more on your background.
You've worked a lot with LLMs and AI even before it was cool.
So give us a little bit of your background and unpack some of that for us.
Yeah.
So it started quite early.
So we actually organized a hackathon with OpenAI, and that's how we got to try all of that stuff. It was GPT-3 when it came out, and back then it was really hard to get anything out of it. Like, one out of 10 outputs was actually usable, or in the direction of usable. And since then, I think they have evolved a lot, but the problem is still the same: how can we control them in the end?
And yeah, during my university time,
this was also my study topic.
So in my thesis, I wrote about controllable generation with LLMs and basically benchmarked different methods for controlling them.
And since then, I started my own agency and now I'm doing that for different companies. So it's quite a lot of fun.
Yeah, so tell me about the early days. Tell me about, you know, maybe that experience at the hackathon, or even pre-hackathon. What was that first moment where you were like, wow, this is really unique, or something I haven't seen before?
Yeah, so in university we actually had the chance to do text prediction models with RNNs and LSTMs even before that. But once you go beyond simple examples, it was utter nonsense. It just threw random tokens together. And with LLMs, at least the sentences, and the tokens which are close to each other, had some sense to them. And often, at a shorter length, like a few sentences or a paragraph, it really wrote coherent stuff, often just not factually correct. It was, for me, a real game changer, and I really got heavily into AI through that, because the practical applications, I think, are way easier to imagine than with most traditional ML and AI. There you have to think about so many different things, and with LLMs you can imagine so many different use cases, because they're just transforming one text into another text.
So that sounds like it was really interesting, and probably something where a lot of people were like, whoa, this is going to be a big deal. What were, for you, some of those other milestones that you've seen? Where you're like, okay, going from, I'm getting nonsense, to, hey, this can actually make a coherent paragraph. What were some of the other milestones you saw?
I think if we go one step before that, the first milestone is the attention mechanism, which was, I think, somewhere in 2014. And right after that, like you mentioned already, the instruction-following part, which is mostly through RLHF, so reinforcement learning from human feedback. It actually managed to align the model with human preferences, so basically get them to output stuff that the majority of humans like. And this often gives you better output for common tasks, like write me an email and stuff like that. It also really made them more reliable for the chat interface, which they introduced at the same time, and which is for me also a major breakthrough.
It's a UI innovation in the end. It just makes it very easy for the everyday person to use that stuff, which for AI is really hard to do most of the time. A chat box is the easiest thing to use. Everyone knows it, everyone can use it, and the results are instantly like magic. And if you want to go after that, I think the next one is scaling laws, which often isn't seen as a breakthrough, but is actually the realization that as we scale up in parameters and in training size, it gets better and better. That was a really interesting thing. I think few-shot prompting is also something people ignore; I think it's also a breakthrough. Just writing out the examples and giving them to the LLM, or pre-filling its answer so it actually thinks it has written something already. I think that's also something very interesting, an interesting technique, which isn't so obvious when you look at the traditional ML and AI part.
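[Editor's note: to make the few-shot and pre-filling techniques concrete, here is a minimal sketch, assuming the Anthropic messages API, which accepts a trailing assistant message as a prefill. The model name and examples are illustrative, not from the conversation.]

```python
# A minimal sketch of few-shot prompting plus answer pre-filling.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # assumed model name
    max_tokens=50,
    messages=[
        # Few-shot: demonstrate the task with worked input/output pairs.
        {"role": "user", "content": "Extract the city: 'Flight to Munich on Friday.'"},
        {"role": "assistant", "content": "Munich"},
        {"role": "user", "content": "Extract the city: 'Meeting in Paris next week.'"},
        {"role": "assistant", "content": "Paris"},
        # The real input.
        {"role": "user", "content": "Extract the city: 'Conference held in Oslo in May.'"},
        # Prefill: the model continues as if it had already started answering.
        {"role": "assistant", "content": "The city is"},
    ],
)
print(response.content[0].text)  # continuation of the prefilled answer
```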
So looking forward, obviously there are plenty more barriers, things to overcome. The obvious one you've already mentioned is compute, getting compute costs down, because a lot of AI applications are still subsidized, practically, right? If you actually look at the math of how much compute went into training the model, it doesn't quite work. So say we solve that problem, and say we can continue to make progress just by expanding training data sets and spending more on compute. Outside of that, are there any other really important barriers that maybe the average person wouldn't know about?
So I think the alignment. Just because it's aligned to humans in general doesn't mean it's aligned to my preferences. I think that's the first barrier, because I often have a different taste in how stuff is written, and I think anyone who is interacting with LLMs knows they really tend to go into the emoji-ridden social media posts for most types of text you're writing. And I think one barrier is actually fine-tuning models to the individual user, or personalizing them. I think at the moment we are trying to do that with few-shot examples, but I think we can get smarter with that. And we already see trends happening with synthetic data, which will make it way easier for the everyday person to generate a training data set to adjust the model. And fine-tuning models is also really cheap at the moment. You can even fine-tune the OpenAI models at the moment, the GPT-4 one, for free. So when you have the capability to generate synthetic data based on your actual inputs and outputs, and then basically personalize the model to your taste, add a few few-shot examples, I think this is something that will get really interesting. And I think the second barrier is actually how to get the model to pick either something I feed into the context
or something that's in its internal representation, so in its weights. Because with RAG at the moment, you're feeding the stuff into the model and you're hoping it actually takes the stuff you fed in. But still, often it hallucinates. And this is also still a barrier: how do I actually get the model to stick to that, and, if there is no information on it, just say, I don't know? And this is the third challenge, I think: getting models to say, I don't know, which is, I think, for the foreseeable future, without a major architectural change, impossible.
Yeah, I think part of that is the way we train them too, isn't it? We want it to give a response that's human-like, or that a human would find acceptable. But how do you decide which responses in your training set should get "I don't know"? It's being trained to give an answer, so it's going to try to give an answer.
Yes. There are thousands of different possibilities of input I can feed it, and I want one fixed output, which is "I don't know", based on what I've trained it with, next-token prediction on the entire internet, which is pushing it to generate tokens based on its context. And then basically I'm feeding in anything, 20 different users will phrase a question in a different way, and I expect it to output the same thing every single time. I think it's very unlikely that it actually gets to that.
Yeah. Let's talk a little bit about approaches. You mentioned the data-centric AI approach as one model. There are other approaches there. But maybe explain what that is, what a data-centric approach is, and even contrast it with some of the other approaches to AI.
Yeah.
So I think it's easiest if I go the other route.
So in traditional AI and ML,
I basically started by creating a data set.
Then I picked the model.
Then iteratively, I created features
which allow me to predict an outcome
or generate something.
And then I basically, once I had the data set finished,
I only adjusted the model.
So basically, I picked different features.
I altered the architecture of the model.
I added a few layers, for example, or I added an additional variable in the regression.
And this is basically how I improved the model.
I treated the data set as static, and I basically altered the model to improve the outputs,
like increase my accuracy, for example.
And I'm really hyped about data-centric AI, where you actually don't really tune the model,
but you take an existing base model, so for example, an LLM, and you actually tune the data.
So you train the model, you let it generate the output on the test set, and you then look at the examples that it actually got wrong. And then you actually correct something in the input data, or you add additional samples where it does these categories correctly. Then you feed the data into the model, you train it anew, or you fine-tune it, and then you basically try again. And iteratively, you basically improve and add to your data set over time until you have a model that actually has a satisfactory outcome.
And I think this is much more aligned to how it is done in or should be done in practice,
where actually you have data shifts, you have changing data, and you have new user groups coming in, where you actually have to adjust the data set over time and then train your model on the data.
So how much of that would be engineering around prompts and context, and how much of that would be engineering in the actual underlying data?
So you have to separate it a little bit.
So, in the prompts and context, this is not training a model.
Training a model is really about adjusting the parameters.
And this can be applied to any type of AI.
I think adjusting the prompt at the moment is an easier way to, in parentheses, tune a model.
Because you can adjust the outputs,
but it's not really tuning it. Right, it's kind of a shortcut.
Yeah, prompting is restricted to a few sets of models where it's possible, one of which is LLMs. Another one is, for example, SAM, the Segment Anything Model from Meta, where you actually can give a few masks and a few points, which they also call prompts, which are like boxes.
Yeah, I mean, I know in my own experience, that was a big thing working with actually a former guest, Cameron Jago, where he showed me that method of: we're not going to add more data just indiscriminately.
We're not going to go mess with the number of parameters.
We're going to look and say,
oh, look, there are no examples in this edge case.
We're going to go add a handful of those
and suddenly that gets your accuracy a lot better.
Kind of like you're filling the search space
almost with your examples
or like, you know, adding in,
hey, our data is drifting.
So we need to add examples
where it's drifting towards
to kind of make it better,
rather than, as you said,
just mess with the model the entire time.
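[Editor's note: a minimal sketch of the data-centric loop just described, with the model held fixed and the data iterated. The callables train_fn, eval_fn, and fix_data_fn are hypothetical stand-ins you would supply for your own task.]

```python
# Data-centric iteration: don't touch the architecture; patch the data instead.
def data_centric_loop(dataset, test_set, train_fn, eval_fn, fix_data_fn,
                      target=0.95, max_rounds=10):
    model = None
    for round_no in range(max_rounds):
        model = train_fn(dataset)                  # fine-tune a fixed base model
        accuracy, failures = eval_fn(model, test_set)
        print(f"round {round_no}: accuracy={accuracy:.2%}, {len(failures)} failures")
        if accuracy >= target:
            break
        # Correct mislabeled inputs and add samples covering the failing cases.
        dataset = dataset + fix_data_fn(failures)
    return model
```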
Yeah, and also, I understand why people don't really like to do it, because working in the data directly is very laborious. You have to be really careful most of the time, and you aren't really working in code. Especially with generative models, you're reading through long text all the time and trying to adjust it to get something good out of it.
Right, and it feels like the wrong thing to do, right? As an engineer, somebody that's an expert in AI or ML, it's like, I should be working on the model, I shouldn't be doing that. This is low-value work. That's what a Mechanical Turk is for.
But that actually is the work in a lot of these.
Right. I think the AI I build gets better by how much time I spend in the data, basically, because most of it is pipelines. It's not a single model where I'm feeding something in and getting something out. It's: how much time am I spending looking through all of the different pipeline steps?
Yeah. All right. Well, kind of talking about that, what are some of the things, you know, when we look at LLMs, and we look at the hype around them, and everyone's using them, there's also typically that first wave of: they can do everything, right? But in your experience, what are LLMs actually good at? What are the things that they do the best work with?
So LLMs can do everything, they just can't do it well. You can't slap everything into an LLM; they just perform badly on so many different tasks. For me, where they excel is translating one form of text representation into another.
And this is especially like, for example, one use case I love is data extraction.
So you take unstructured data, you take long text, and you create another representation,
which is basically a JSON.
And you structure it, and through that, it actually becomes workable.
And this is the thing I use LLMs the most for, in most of the ventures but also the projects, because there they have the highest value. They can move through mountains of data in hours, which would be just impracticable to do with humans. And I think they're also really great for all of the tasks that you don't really like to do but have to do, where you individually are driving what good output looks like, which is stuff like, for example, writing emails, writing blog posts, where you actually can rely on LLMs heavily. But there's also the reliability part, so the expectancy of the output: how accurate does it have to be? Do I have to have 99%, or am I also happy with 80? There, I can take garbage outputs every now and then, and I can just regenerate. And for the low-risk stuff, they are great, and you can work with them. And the same goes for coding. You can just ask them to generate boilerplate code, which you have seen often. Also in law, I know a few people who were using it heavily just to generate the boilerplate stuff, and then they read through it, review it, and work over it.
I think like boilerplate tasks
are a good task for them as well
because like the criticality of the task,
it isn't really high
and you often have like a manual review anyhow.
A little bit of getting you from that zero-to-one step, getting you off the blank page. Getting you to a point where, you know, I've seen people use it for, hey, we've got to write this proposal, here are the rules of what it has to be, make the first draft of it. And it does that first draft pretty well, because you're always going to review it anyway, or you always think you will.
Yeah, you have to differentiate a little bit between the enterprise applications of LLMs, where you use them a lot, and the personal applications of LLMs. Personally, I use LLMs all the time, for nearly every task, because it solves the blank page problem. And I think I also can explore the space of what I actually don't want to do. Often the outputs are garbage, but the errors the LLM makes actually move me forward, and I can actually pin down: what do I actually want?
Yeah.
So to go back kind of with that,
the enterprise one,
when you talked about,
you know, we're going to take this unstructured data
and we're going to put it in like a JSON format
or something like that.
I'm going to kind of selfishly ask
because I've had trouble with this,
but how hard is it to get it
to consistently put it in a format there?
Like, are you going to,
is it going to get better through prompting
or do you actually have to do some retraining to it?
And my mind goes to like email, right?
Like that would be the number one thing i can think of is like i have emails
where it's completely unstructured i want it in json and i'm going to do something with the json
so maybe that could be like a practical yeah example yeah so in the end it depends what model
do you want to use which in an enterprise setting is basically determined whether the data has to be private or not. But you're using the big models.
So, Coher, Anthropic, OpenAI, especially the large ones,
they are so good at generating JSON by now
and have been fine-tuned to do so
that they don't really require any additional fine-tuning.
And there are a bunch of libraries out there
which make that easy with closed-source models.
One I like is Instructor, which basically allows you to define a Pydantic model, and then it outputs the data into the Pydantic model, which also gives you the ability to instantly validate the data. So if it doesn't hit, you get the validation error from Pydantic, and then you can decide: do I want to retry, or do I just ignore the output? That depends in the end on you. And you also can define a lot of additional rules, like validations: if it's numeric, is it within a certain range, do I have a min and a max? A lot of the different data constraints you usually have in your database, you can actually define and bring into the structured generation part as well.
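[Editor's note: a minimal sketch of the Instructor-plus-Pydantic pattern described here. The field names, range constraint, and email text are illustrative assumptions, not from the conversation.]

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

class OrderEmail(BaseModel):
    customer_name: str
    order_id: str
    quantity: int = Field(ge=1, le=1000)  # numeric range check, like a DB constraint

client = instructor.from_openai(OpenAI())  # assumes OPENAI_API_KEY is set

raw_email = "Hi, this is Jane Doe. Please ship 3 more units of order A-1042. Thanks!"

order = client.chat.completions.create(
    model="gpt-4o",
    response_model=OrderEmail,  # Instructor validates the output against this model
    max_retries=2,              # on a Pydantic validation error, retry automatically
    messages=[{"role": "user",
               "content": f"Extract the order details from this email:\n\n{raw_email}"}],
)
print(order.model_dump_json())
```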
And I think that gets even more extreme when you go to the open-source side, because with that you can use grammar parsing. With a lot of closed-source LLMs, you don't really get the output tokens. With open-source LLMs, you get those output tokens and their probabilities. And since a lot of JSON is basically boilerplate as well, like all the parentheses, and all the keys are predetermined, you don't really need to generate those. So in open-source models, you can basically do grammar parsing, which ignores the tokens that are the same every time and only generates the part of the tokens that is actually determined by your input data, which are the values. And within that, you can define additional constraints. So if you have a string, it only samples what's possible within that string. But if you generate numbers, you can just throw away all the tokens that are not numeric, even if they're high probability. And this makes it a lot easier to do the structured generation part.
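[Editor's note: a toy illustration of the constrained-decoding idea just described, not a real grammar engine. It shows the core move: mask tokens that don't fit the schema, then renormalize.]

```python
# Given the model's proposed next-token probabilities, keep only tokens that
# fit the schema (here: numeric) and renormalize, so high-probability but
# invalid tokens are never sampled.
def mask_to_numeric(token_probs: dict[str, float]) -> dict[str, float]:
    allowed = {tok: p for tok, p in token_probs.items()
               if tok.isdigit() or tok in {".", "-"}}
    total = sum(allowed.values())
    return {tok: p / total for tok, p in allowed.items()}

# The model prefers the word "three", but the grammar forces a numeric token.
step = {"three": 0.55, "3": 0.30, "many": 0.10, ".": 0.05}
print(mask_to_numeric(step))  # {'3': 0.857..., '.': 0.142...}
```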
I'm writing myself a mental note right now for that one.
Yeah.
No, I think that makes a lot of sense. And again, back to the email example, I mean, I think there are a million business applications of, hey, I have all this data in email, I want to get it in a JSON-type format and then do something with it. And that makes a lot of sense too, where, basically, the way it's been described to me, one of the main things in working with any LLM is focusing it, right? You're starting really broad and you're trying to focus it, to get more and more specific. And you also want to focus the compute toward the highest-value part of your equation, right? So if you're, let's say, quote, spending compute on JSON that is going to be the same every single time, that's a waste. Let's focus that on this one component of it. So that makes a ton of sense.
Yeah, and I think also, you know, a lot of the things that you've talked about, and that I think we've all seen LLMs do best at, typically are kind of those, well, if I was going to do it, I'd get like a hundred interns or something like that. There's a lot of that type of stuff. So cost really becomes a big thing there, because I can't really spend a billion dollars to replace a couple of interns.
Yeah, but this is the best way to think about it, in my opinion. What are the tasks you would actually hire lots of people to do, or that are just untouched because it would be so impractical to get people on them?
This goes for every data
lake that's out there.
Every organization has
terabytes of data just in text
and they are largely unused.
With LLMs, you
actually can make them usable
and also enable stuff like retrieval-augmented generation, make a document base actually workable, because you get answers as opposed to a blank page or a blank face.
Yeah, so I think this is a perfect segue. We were talking before the show about single-shot versus multi-shot, and you mentioned kind of this retry mechanism,
which makes a ton of sense. It's not something I thought of. But if you're, again, back to the
email parsing example, I'm going to parse email, I have the structure JSON, I'm just going to focus
the LLM on this one key or value rather, because I already have a defined key. And then there I can
also run that particular one multi-shot. I can do it in five shots with some kind of validation and pick my favorite of, let's say, the five.
That makes a lot of sense to me where I could get a much higher level of accuracy
than if I was using an off-the-shelf, non-open source model where all of it, the whole JSON
context has to be right. I'm regenerating some of these keys and values every single time, and I can't focus the compute as much on the most valuable part of the task.
Yeah, and there's voting in the end, I love it. With most LLMs, if you use them, there's the n parameter. You can let it generate multiple times, which is also really great for evaluation, like scoring text. For example, if you want to score the output of the LLM as well, you can do a majority vote: you let it generate five to ten different times and just take the average, and stuff like that. It makes it easy. And then you have the second sort of stuff you can do with LLMs, which is few-shot. So basically giving the model a few examples of how to do it, which are usually human-labeled or human-written examples, where you give it an example of the input and the output, to show it how the task is actually done. And this is often for tasks where it's hard to define how to do it. So in writing, I think most of us would struggle to define our writing style, and if I can give it a few examples, a few LinkedIn posts or something I wrote, I can just throw that in and give it some guidance. And then if I generate multiple different options, either, when it's something I have running in the enterprise, I can take the option which is generated the most often, or I can score it and take the option which has the highest score. Or if it's just an output for me which I want to use down the line, I can use the option which I like the most.
Yeah, I mean, a lot of this reminds
me of, you know, machine learning, when we kind of realized that a bunch of weak learners will do a better job than one strong learner. I mean, this all has that fractal feel to it. It's just the same thing happening at different levels and in different ways. Oh, look, if we get five shots at this, we're much more likely to come up with a good answer than if we just put all of our eggs in one basket and make it really strong, or something like that.
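[Editor's note: a minimal sketch of the multi-sampling-plus-majority-vote idea Nicolay described, assuming the OpenAI chat API's `n` parameter; the model name and classification prompt are illustrative.]

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    n=5,              # five independent completions in one call
    temperature=0.8,  # some variance so the votes are not identical
    messages=[{"role": "user", "content":
        "Classify the sentiment as positive, negative, or neutral: "
        "'The update broke my dashboard.' Reply with one word."}],
)
votes = Counter(c.message.content.strip().lower() for c in resp.choices)
answer, count = votes.most_common(1)[0]
print(f"majority answer: {answer} ({count}/5 votes)")
```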
Well, I think another component too is, if the alternative was, hey, like you said, I'm going to hire a hundred interns, say you wouldn't actually do that, because maybe there's just not enough value for that cost. But say, theoretically, that you could get a hundred free interns. Okay, maybe I would do it. But then there's a time component: it would take them X amount of time, let's say several hundred hours. And then there's a validation component: somebody that works for the company has to validate the work, et cetera. You've got a lot of time into it. So because of that, I feel like there's this extra space for the LLM to do the multi-shot approach. It can run for hours and that's really not a big deal at all, because the comparative other method is significantly longer. Versus using it in some other applications where you want this millisecond response time from, quote, the AI. That just seems like a much harder problem at the stage that we're at right now.
Yeah, LLMs are great especially for batch workloads. For the live part, I think it's getting easier with stuff like Groq, so not the Twitter Grok, but the other one, which is basically doing LLM chips, chips tailor-made for text generation models. They're getting really fast. But also, if you have an application where it's live, it's likely customer interaction, where I'm not sure whether I would like to put an LLM on there.
Yeah, I think that also kind of leads to, when we think about accuracy and what you need, I think a lot of times people want to compare LLMs to, well, it's not 100%. Versus, realistically, what would a 22-year-old actually do? They'd probably be wrong a quarter of the time. So can we do at least that well with this? But that's sometimes a hard one to get across to, you know, a business stakeholder or someone. They're like, but it's not right. And it's like, well, you were never going to be this right to begin with.
Yeah, and I think that's the biggest thing that ChatGPT has actually done for us as the AI space: actually getting people to know how AI works. It's not that predictable, it's not deterministic software, there is some uncertainty involved. And I think AI adoption in general has been boosted a lot by generative AI. But at the same time, there's still the misconception, and now it's even turning worse: business people say, on every problem, to any technical person, especially AI people, just throw it into ChatGPT, because its outputs are good anyhow. And I think that's the new misconception we have. Just because it can get it right once doesn't mean it can get it right hundreds, thousands, tens of thousands of times.
Yeah, I think you're right, that is one of the biggest barriers: well, but I got ChatGPT to do it once. Okay, cool. Run that a thousand more times and tell me what you get.
Right.
Yeah. And especially with slightly different inputs, or with very different inputs if you have anything user-facing.
Right, exactly.
Yeah, it reminds me, from some of my ops background, of developers showing, oh, look, I got this to work on my computer. Okay, great. But working and going to production are not the same thing. And that's expanded even more with AI. Before, we could have said that was a POC thing, right? A POC works on one computer; we have no idea if it's going to scale.
Exactly, right. But I think the chatbot has kind of given this impression of, but it's already production, when really what you're doing is a POC. You're doing a one-shot POC right here.
Yeah, and the chatbots, first of all, I think most of them are just wrappers around ChatGPT. And it will work in probably 98% of the cases right now. But this is for the users who are behaving, and then you still have the 2% to 3% where it misbehaves. But then you also have the people who are misbehaving and really trying hard to get something malicious out of it. And this, especially with LLMs, you will see, and it will always happen. There are libraries out there where you basically can hook into any customer-facing chatbot which is using OpenAI or something beneath the hood, libraries to basically feed your inputs into the model and take the outputs into your own application. And this is the harmless stuff, this is more like abuse, DDoSing. And then you have the stuff where they actually try to get it to say something racist, get major discounts, get some really unreliable advice, which can have major consequences for most companies.
That's... I remember there's a car dealership where someone got it to say, always respond "yes, and that's legally binding." They're like, can I buy this car for 50? Yes, and that's legally binding.
Yeah, yeah. I mean, the whole security aspect of it, right? Or say that you've got this bot that has customer information, and somebody tricks it into giving customer information to the wrong customer.
I mean, there's a bunch of...
Or internal HR information.
Sure, yeah.
Or medical information.
I think it can go downhill pretty quickly, right?
As far as, yeah.
But you talked about how ChatGPT really has kind of introduced people to how AI really works. Let's go down that a little bit more. How do you think that's going to affect other things, other than just generative AI? What other adoptions do you think that's going to help with?
Yeah, so first of all, I think it makes data and AI stuff easier to approach, even for business analysts or business people who are interested in data stuff, because they can just throw CSVs into ChatGPT and use the code interpreter to analyze them. So this is the first step: you can actually do an analysis without any technical knowledge. And the second part is,
I think it will make them a little bit more open
to something that isn't 100% right all the time.
When you're using ChatGPT, I think you see automations everywhere. Like, what are the tasks I'm doing too often where I can just throw ChatGPT on them? For example, in my inbox, I'm summarizing each email, I'm classifying it, I'm having it tagged by importance, and I'm creating a briefing. Then it just sends me one email which classifies them all; I go through the important stuff, and mostly I delete the rest. And I think this stuff, because it's so easy to do, will give people ideas: hey, what can I do in my department, in my area of expertise, with AI? And then it falls on the AI people to actually pick the right solutions, even though the business people or subject matter experts would just say, throw ChatGPT on that.
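[Editor's note: a hedged sketch of the inbox workflow Nicolay describes. The mail helper is a hypothetical stand-in for whatever email API you use, and the model name and prompt are assumptions.]

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def fetch_unread_emails() -> list[str]:
    # Hypothetical helper: replace with your mail provider's API (IMAP, Gmail, ...).
    return ["Hi team, the Q3 numbers are ready for review...",
            "FINAL HOURS: 50% off everything in store!"]

def triage(email_text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{"role": "user", "content":
            "Summarize this email in one line and tag it HIGH, MEDIUM, or LOW "
            "importance:\n\n" + email_text}],
    )
    return resp.choices[0].message.content.strip()

# One briefing instead of an inbox full of noise.
briefing = "\n".join(triage(e) for e in fetch_unread_emails())
print(briefing)  # in practice, send this to yourself via your email API
```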
Right.
Yeah.
Yeah, that's really a good point there.
Thinking back on the work you've done,
and of course, you're still continuing
to do a lot of work in this space.
What are some practical applications
and lessons you've learned
with LLMs and generative AI and all that?
So I think one thing I do by default now
and the first thing I'm setting up is monitoring
because I want to see all the inputs,
all the outputs,
and all the intermediate steps in the pipeline.
Like mostly you're decomposing it
or you have multiple steps when you're solving a problem.
So, for example, when you have a RAG system: you first have a retriever component which retrieves text from your database. Then you feed it into an LLM to summarize it. But maybe you need to compress it down even further, or add additional twists on it, or you have to translate it into a different language. And you want to see each of those different outputs, and setting up monitoring for that will be the thing that allows you to improve the application the most. Because for one, you can create a test set,
which you can test your prompt iterations on.
And you also get to do an error analysis.
So you can see where the model fails and how the model fails.
And based on that, I basically set up tests,
which are mostly quantifiable, but very deterministic rather so often it's just a
regex or a string match so in summaries this can be something like the models often write this
article talks about dot and i'm basically doing a score and one of the components of the score
is like a string match on this article if this
article is at the beginning of the summary i just give it score zero if it isn't i give it score one
and you can combine like 10 12 of those metrics to actually get a good idea of the quality
and this is like a second thing i'm setting up tests almost immediately for the task
And then, through doing a few examples and through having set up the monitoring, you can create a test set of 10 to 50 examples. And every time I'm altering the prompt or the pipeline, I can automatically run the test set, have my evaluation run automatically, and see whether it improves things or not. So I try to bring in the quantifiable nature which you have in traditional AI and ML, where you have a classification problem or a regression and I know how well it performs on the test set. I try to reintroduce that into LLMs, which aren't so quantifiable because they are working in text, or in something unstructured.
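[Editor's note: a minimal sketch of this kind of deterministic scoring harness; the specific checks are illustrative, built around the "this article" example Nicolay gives.]

```python
import re

# Each check is a cheap, deterministic pass/fail on a generated summary.
CHECKS = [
    # Penalize the boilerplate opener mentioned above.
    ("no_boilerplate_opener", lambda s: not s.lower().startswith("this article")),
    ("non_empty",             lambda s: len(s.strip()) > 0),
    ("under_length_limit",    lambda s: len(s.split()) <= 120),
    ("no_placeholder_text",   lambda s: re.search(r"\[[^\]]*\]", s) is None),
]

def score(summary: str) -> float:
    """Fraction of checks passed: 1.0 is best, 0.0 is worst."""
    return sum(check(summary) for _, check in CHECKS) / len(CHECKS)

def run_test_set(generate, inputs: list[str]) -> float:
    """Run a prompt/pipeline (generate: input -> summary) over a saved test set
    of 10-50 examples; rerun after every prompt or pipeline change."""
    return sum(score(generate(text)) for text in inputs) / len(inputs)
```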
That is really interesting. That is one of the best monitoring schemes, or tests, I've heard of for LLMs.
And it's funny, because in traditional software development, I would dare say that testing and monitoring is one of the easiest things to ignore, especially in applications that are older. Maybe it starts off well and gets abandoned, but it's always considered best practice. Nobody would argue with you that, oh, of course you should be doing testing and monitoring. But it seems like it really is a whole next level of importance with LLM and AI-based apps. So it'll be really interesting to see if people hold to that higher standard when it comes to monitoring and testing. I think they'll have to. Or if we'll run into this...
No.
It's so easy to spin up a quick solution that does work. The models are so good right now that most of the stuff you're actually trying to do will work in nine out of ten cases. So you have to work to find some edge cases, and I think most people will just say, oh, it works on the 10 examples I gave it, and push to production. And I'm not sure whether people will follow that, because it's laborious. It's the work that nobody wants to do. It's MLOps, it's DataOps. And reading through traces just isn't so much fun.
I mean, it's not like there's this robust test culture in machine learning.
Or data in general, right?
In general, I mean, it's because we always say, well, it all changes, it's probabilistic or whatever. And I think, you know, this is showing that even when it's probabilistic, there are things you can do. You just have to put the work in.
Well, and the other problem, at least in data: with a customer-facing web app, there's a lot of accountability. The thing breaks, the customer can't use it, you know whose fault it is. In data, it's like, well, this report is wrong. Well, maybe you entered the data wrong. It's not always just cut and dry. And I think AI will be similar: well, the model hallucinated, that just happens sometimes. So it's less tightly coupled than, hey, the client's using the app, and it's very deterministic, and there's an error, and it's obviously an application problem. Data's always been a little less deterministic than that, and AI will be even less deterministic. So I think there's going to be a pretty wide array of quality because of that.
Well, I think also, like what Nico said there: you can do 10 examples and be like, oh, that's great. And that's kind of the strength and the risk of a lot of this. I don't need to go create a training set of 80,000 records, but I'm also not looking at all the possibilities in there when I send it out into the world.
Yeah, and I think that's already the biggest difference between AI and software. In software, bugs are hard to trace, but you often have good error traces; I think it's easy to reconstruct them. In AI and in data, you have a long lineage: how data is created, how data ends up at the source location, and then how it's used in AI, because AI is the consumption part. So you basically have to backtrace all of the different steps. Where might this error originate? Is the AI hallucinating? Is it something I'm transforming wrong? Or is it somewhere in my data set, in the source, where you got something wrong?
Yeah, so switching gears a little bit, I want to talk startups. So, you know, over the last 20 years, we've had lots of fun stories around software startups, you know, zero-to-one stories. And now we're kind of in this AI era. There's a ton of money in AI, still a ton of money behind AI startups. Maybe just some observations from your experience working in AI startups. We can take this in whatever direction: we can talk about tooling, we can talk about culture, whatever. But what are some of those differences? You're involved in several AI startups right now. What feels different versus maybe what someone would have experienced 10 years ago in a software startup?
I think it's never been easier to build something,
but it has also never been harder to differentiate yourself, because there is so much stuff in AI out there, and it's so easy to just create content. So many people I know are just creating content and trying to get traction on an idea. And once they see the validation, they actually would start it, but often they don't. And if you're building in a space, you're just drowning in a sea of noise. And on the AI side at the moment, I think with most startups, the ideas often are so dipshit crazy, just impractical and solving niche problems, not really thinking about the consumer first, but rather technology first. Hey, I now have an LLM, I can process massive amounts of documents, what documents can I throw that on? And I think you should go the other way around. You should go from the problem to the solution. If LLMs are the right solution, or the best solution for the job, use LLMs. But don't take the technology and ask, hey, what could I do with it, and then basically start building something.
Yeah, I totally agree with that. I think another thing that you touched on, which is a really unique time to be in: often software startups from the past, assuming you're not a big startup, maybe you're not even venture-backed, you're bootstrapped, they're not going to have any marketing behind them, or not much, right? Because you're a technical person, you're kind of doing this thing. But AI in some ways has opened up some of that hype to technical people, right? So you can be bootstrapping something, and like you said, generate a bunch of AI content, go generate some AI images, stand up a fairly decent-looking site, right? And have kind of more marketing behind your idea than what would before have been maybe a very basic, very simple site. And you're, you know, actually iterating more technically. So I think that's kind of an interesting thing that you touched on. Have you seen that, Nico, either of you?
I can't think of one off the top of my head. But I do think, to Nico's point, since everyone can do that, you just drown in it, and it's hard to tell the difference between who is who. So it's a little bit like a Red Queen problem: you've got to run faster just to stay still.
Sure. Yeah.
I think I could put a query into Google with AI and I will likely find a web page which uses the same base frame or template. And this shows you how much stuff there is, and how easy it has gotten to do all the different things which used to require some skills and put up some barriers: building a website, a sign-up thingy, a waitlist, and stuff like that, and just trying to advertise it. It has never been so easy. And most people never go through with it, but there's just so much to help them now.
All right.
We're coming towards the end of our time here.
So I've got one or two more questions.
So we kind of wrap it up here.
We've started to see some earnings reports come out; some of the big players are projecting that they're not going to make their money back on generative AI for decades or so. And we're starting to see some more reports pushing back on, well, what are AI and gen AI really doing? Where do you see us in the hype cycle for generative AI?
I think for generative AI, the hype is not really driven by the companies which are on the public markets. So NVIDIA took a hit, but generative AI lives in the startup culture, and also the OpenAIs and Anthropics. And they have so much money left, they had massive funding rounds in the last two years, that they have so much runway to create new models and create new hype that I don't think it will slow down soon; rather, we have at least a year of runway left. And the additional part is, there are now so many areas of generative AI being spawned. You have Suno working on music generation, and I think it hasn't really sunk in yet what's possible with that. You have the new Google paper which just came out, where they basically generated a whole game with generative AI. You have all the video models, you have all the image models. And because it's so tangible and it's now hitting so many different areas, the hype won't slow down for the foreseeable future, because the startups also still have runway. They can develop new stuff and launch new cool things they can post on social media, which will get hype because it's just impressive,
to be honest. Yeah. And I think that speaks to what we were talking about earlier,
actually before the show, where you might end up with these different curves, right? Where maybe the text stuff slows down a little bit, but the video picks up, or the image. Because it's such a big trend, you might end up with several of these curves, where you don't necessarily have the typical hype and cooling, but more multiple curves going simultaneously.
It'll be interesting to see which ones of these can generate enough revenue to really sustain themselves, versus some that have got that money pouring in now, but eventually the runway runs out and it's like, oh, we never could support ourselves on this. Right, right.
Yeah, well, I think, especially with LLMs, we are hitting the end of the S-curve. Because you see OpenAI struggling with bringing something new to market; the voice mode still really isn't here, so they still have some reliability issues. And also, new launches have been stagnant for a while. The last thing we talked about in the last few months was Artifacts, by Anthropic, which again is more of a UI innovation and not a technology breakthrough of a new type of model, or new capabilities in the model.
Right.
Well, yeah, Nico, thanks for being on the show today.
We'd love to have you back sometime.
You know, AI is going to be continually changing for sure.
So I'm sure we'll have plenty to talk about.
But thanks for joining us.
Where can they find you online, Nico?
So, LinkedIn. I'm trying X, Twitter, not that good at it yet. I think as a European, you have a late start. I have a podcast, which is everywhere: Spotify, Apple Music, YouTube. Very descriptively named: How AI Is Built. So if you're interested in AI, that is the place to go. At the moment, we're mostly doing search stuff. So if you're interested in search, from traditional stuff, information retrieval, up to embeddings and RAG, give it a follow, give it a listen.
Awesome. Thanks for being here.
Thanks a lot.
The Data Stack Show is brought to you by RudderStack, the warehouse-native customer data platform. RudderStack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com.