The Data Stack Show - 205: How to make LLMs Boring (Predictable, Reliable, and Safe), Featuring Nicolay Gerold
Episode Date: September 4, 2024
Highlights from this week's conversation include: Nicolay's Background and Journey in AI (0:39), Milestones in LLMs (4:30), Barriers to Effective Use of LLMs (6:39), Data-Centric AI Approach (10:17), Importance of Data Over Model Tuning (12:20), Capabilities of LLMs (15:08), Challenges in Structuring Data (18:28), JSON Generation Techniques (20:28), Utilizing Unused Data (22:36), Importance of Monitoring in AI (34:11), Challenges in AI Testing (37:40), Error Tracing in AI vs. Software (39:24), The AI Startup Landscape (40:53), Marketing for Technical Founders (42:41), Generative AI Hype Cycle (44:33), Connecting with Nicolay and Final Takeaways (47:59)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Hi, I'm Eric Dodds.
And I'm John Wessel.
Welcome to the Data Stack Show.
The Data Stack Show is a podcast where we talk about the technical, business, and human
challenges involved in data work.
Join our casual conversations with innovators and data professionals to learn about new
data technologies and how data teams are run at top companies.
Welcome back to the show. We're here with Nicolay Gerold. Nicolay, welcome to the show.
Give us some background today on some of your previous experience and give us some highlights.
Hey, yeah, happy to be here. So I'm Nicolay. I run an AI agency in Munich. We also recently started a venture builder. Most of my history has been in LLMs, and especially controllable generation. So how to make them boring, which for me means predictable, reliable, and safe.
And yeah, excited to chat with you today.
Okay, Nicolay, we just spent a few minutes
chatting to prepare for the show. And I'm really excited to dig into, from your opinion, what LLMs are actually good at and
maybe what they're not so good at, but everybody still tries to make them do.
What are you looking forward to chatting about?
I'm really looking forward to discussing AI startups versus software startups, or even
like AI versus data startups, because it really goes into like the deterministic software
versus like unpredictable AI discussion as well.
All right, let's dig in.
Let's do it.
So we're here with Nicolay, and Eric is out today.
So we have a special co-host,
the cynical data guy here co-hosting.
So thanks for coming today, Matt.
Thanks for having me.
I'll try to be a little less cynical.
All right, that'll be good. All right. All right.
That'll be good.
All right, Nicolay.
Yeah.
Give us some more on your background.
You've worked a lot with LLMs and AI even before it was cool.
So give us a little bit of your background and unpack some of that for us.
Yeah.
So it started quite early.
So we actually organized a hackathon with OpenAI, and that's how we got to try all of that stuff. It was GPT-3 when it came out, and back then it was really hard to get anything out of it. Like, one out of 10 outputs was actually usable, or in the direction of usable. And since then, I think they have evolved a lot, but the problem is still the same: how can we control them in the end?
And yeah, during my university time,
this was also my study topic.
So in my thesis, I wrote about controllable generation with LLMs and basically benchmarked different methods for controlling them.
And since then, I started my own agency and now I'm doing that for different companies. So it's quite a lot of fun.
Yeah, so tell me about the early days. Tell me about, you know, maybe that experience at the hackathon, or even pre-hackathon. What was that first moment where you were like, wow, this is really unique, or something I haven't seen before?
Yeah, so in university we actually had the chance to do text prediction models with RNNs and LSTMs even before that. But once you go beyond simple examples, it was utter nonsense. It just threw random tokens together. And with LLMs, at least the sentences, and the tokens which are close to each other, had some sense to them. And often, at a shorter length, like a few sentences or a paragraph, it really wrote coherent stuff, often just not factually correct. It was, for me, a real game changer, and I really got heavily into AI through that, because the practical applications, I think, are way easier to imagine than with most traditional ML and AI. There you have to think about so many different things, and with LLMs you can imagine so many different use cases, because they're just transforming one text into another text.
So that sounds like it was really interesting, and probably something where a lot of people were like, whoa, this is going to be a big deal. What were, for you, some of those other milestones that you've seen? Where you're like, okay, going from, I'm getting nonsense, to, hey, this can actually make a coherent paragraph. What were some of the other milestones you saw?
I think if we go one step before that, the first milestone is the attention mechanism, which was, I think, somewhere in 2014. And right after that, like you mentioned already, the instruction-following part, which is mostly through RLHF, so reinforcement learning from human feedback. It actually managed to align the model with human preferences, so basically get them to output stuff that the majority of humans like. And this often gives you better output for common tasks, like write me an email and stuff like that. It also really made them more reliable for the chat interface, which they introduced at the same time, and which is for me also a major breakthrough.
It's a UI innovation in the end. It just makes it very easy for the everyday person to use that stuff, which for AI is really hard to do most of the time. A chat box is the easiest thing to use. Everyone knows it, everyone can use it, and the results are instantly like magic. And if you want to go after that, I think the next one is scaling laws, which often isn't seen as a breakthrough, but is actually the realization that as we scale up in parameters and in training size, it gets better and better. That was a really interesting thing. I think few-shot prompting is also something people ignore; I think it's also a breakthrough. Just writing out the examples and giving them to the LLM, or pre-filling its answer so it actually thinks it has written something already. I think that's also something very interesting, an interesting technique, which isn't so obvious when you look at the traditional ML and AI part.
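[Editor's note: to make the few-shot and pre-filling techniques concrete, here is a minimal sketch, assuming the Anthropic messages API, which accepts a trailing assistant message as a prefill. The model name and examples are illustrative, not from the conversation.]

```python
# A minimal sketch of few-shot prompting plus answer pre-filling.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # assumed model name
    max_tokens=50,
    messages=[
        # Few-shot: demonstrate the task with worked input/output pairs.
        {"role": "user", "content": "Extract the city: 'Flight to Munich on Friday.'"},
        {"role": "assistant", "content": "Munich"},
        {"role": "user", "content": "Extract the city: 'Meeting in Paris next week.'"},
        {"role": "assistant", "content": "Paris"},
        # The real input.
        {"role": "user", "content": "Extract the city: 'Conference held in Oslo in May.'"},
        # Prefill: the model continues as if it had already started answering.
        {"role": "assistant", "content": "The city is"},
    ],
)
print(response.content[0].text)  # continuation of the prefilled answer
```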
So looking forward, obviously there are plenty more barriers, things to overcome. The obvious one you've already mentioned is compute, getting compute costs down, because a lot of AI applications are still subsidized, practically, right? If you actually look at the math of how much compute went into training the model, it doesn't quite work. So say we solve that problem, and say we can continue to make progress just by expanding training data sets and spending more on compute. Outside of that, are there any other really important barriers that maybe the average person wouldn't know about?
So I think the alignment. Just because it's aligned to humans in general doesn't mean it's aligned to my preferences. I think that's the first barrier, because I often have a different taste in how stuff is written, and I think anyone who is interacting with LLMs knows they really tend to go into the emoji-ridden social media posts for most types of text you're writing. And I think one barrier is actually fine-tuning models to the individual user, or personalizing them. I think at the moment we are trying to do that with few-shot examples, but I think we can get smarter with that. And we already see trends happening with synthetic data, which will make it way easier for the everyday person to generate a training data set to adjust the model. And fine-tuning models is also really cheap at the moment. You can even fine-tune the OpenAI models at the moment, the GPT-4 one, for free. So when you have the capability to generate synthetic data based on your actual inputs and outputs, and then basically personalize the model to your taste, add a few few-shot examples, I think this is something that will get really interesting. And I think the second barrier is actually how to get the model to pick either something I feed into the context
or something that's in its internal representation, so in its weights. Because with RAG at the moment, you're feeding the stuff into the model and you're hoping it actually takes the stuff you fed in. But still, often it hallucinates. And this is also still a barrier: how do I actually get the model to stick to that, and, if there is no information on it, just say, I don't know? And this is the third challenge, I think: getting models to say, I don't know, which is, I think, for the foreseeable future, without a major architectural change, impossible.
Yeah, I think part of that is the way we train them too, isn't it? We want it to give a response that's human-like, or that a human would find acceptable. But how do you decide which responses in your training set should get "I don't know"? It's being trained to give an answer, so it's going to try to give an answer.
Yes. There are thousands of different possibilities of input I can feed it, and I want one fixed output, which is "I don't know", based on what I've trained it with, next-token prediction on the entire internet, which is pushing it to generate tokens based on its context. And then basically I'm feeding in anything, 20 different users will phrase a question in a different way, and I expect it to output the same thing every single time. I think it's very unlikely that it actually gets to that.
Yeah. Let's talk a little bit about approaches. You mentioned the data-centric AI approach as one model. There are other approaches there. But maybe explain what that is, what a data-centric approach is, and even contrast it with some of the other approaches to AI.
Yeah.
So I think it's easiest if I go the other route.
So in traditional AI and ML,
I basically started by creating a data set.
Then I picked the model.
Then iteratively, I created features
which allow me to predict an outcome
or generate something.
And then I basically, once I had the data set finished,
I only adjusted the model.
So basically, I picked different features.
I altered the architecture of the model.
I added a few layers, for example, or I added an additional variable in the regression.
And this is basically how I improved the model.
I treated the data set as static, and I basically altered the model to improve the outputs,
like increase my accuracy, for example.
And I'm really hyped about data-centric AI, where you actually don't really tune the model,
but you take an existing base model, so for example, an LLM, and you actually tune the data.
So you train the model, you let it generate the output on the test set, and you then look at the examples that it actually got wrong. And then you actually correct something in the input data, or you add additional samples where it does these categories correctly. Then you feed the data into the model, you train it anew, or you fine-tune it, and then you basically try again. And iteratively, you basically improve and add to your data set over time until you have a model that actually has a satisfactory outcome.
And I think this is much more aligned to how it is done in or should be done in practice,
where actually you have data shifts, you have changing data, and you have new user groups coming in, where you actually have to adjust the data set over time and then train your model on the data.
So how much of that would be engineering around prompts and context, and how much of that would be engineering in the actual underlying data?
So you have to separate it a little bit.
So, in the prompts and context, this is not training a model.
Training a model is really about adjusting the parameters.
And this can be applied to any type of AI.
I think adjusting the prompt at the moment is an easier way to, in parentheses, tune a model.
Because you can adjust the outputs,
but it's not really tuning it. Right, it's kind of a shortcut.
Yeah, prompting is restricted to a few sets of models where it's possible, one of which is LLMs. Another one is, for example, SAM, the Segment Anything Model from Meta, where you actually can give a few masks and a few points, which they also call prompts, which are like boxes.
Yeah, I mean, I know in my own experience, that was a big thing working with actually a former guest, Cameron Jago, where he showed me that method of: we're not going to add more data just indiscriminately.
We're not going to go mess with the number of parameters.
We're going to look and say,
oh, look, there are no examples in this edge case.
We're going to go add a handful of those
and suddenly that gets your accuracy a lot better.
Kind of like you're filling the search space
almost with your examples
or like, you know, adding in,
hey, our data is drifting.
So we need to add examples
where it's drifting towards
to kind of make it better,
rather than, as you said,
just mess with the model the entire time.
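[Editor's note: a minimal sketch of the data-centric loop just described, with the model held fixed and the data iterated. The callables train_fn, eval_fn, and fix_data_fn are hypothetical stand-ins you would supply for your own task.]

```python
# Data-centric iteration: don't touch the architecture; patch the data instead.
def data_centric_loop(dataset, test_set, train_fn, eval_fn, fix_data_fn,
                      target=0.95, max_rounds=10):
    model = None
    for round_no in range(max_rounds):
        model = train_fn(dataset)                  # fine-tune a fixed base model
        accuracy, failures = eval_fn(model, test_set)
        print(f"round {round_no}: accuracy={accuracy:.2%}, {len(failures)} failures")
        if accuracy >= target:
            break
        # Correct mislabeled inputs and add samples covering the failing cases.
        dataset = dataset + fix_data_fn(failures)
    return model
```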
Yeah, and also, I understand why people don't really like to do it, because working in the data directly is very laborious. You have to be really careful most of the time, and you aren't really working in code. Especially with generative models, you're reading through long text all the time and trying to adjust it to get something good out of it.
Right, and it feels like the wrong thing to do, right? As an engineer, somebody that's an expert in AI or ML, it's like, I should be working on the model, I shouldn't be doing that. This is low-value work. That's what a Mechanical Turk is for.
But that actually is the work in a lot of these.
Right. I think the AI I build gets better by how much time I spend in the data, basically, because most of it is pipelines. It's not a single model where I'm feeding something in and getting something out. It's: how much time am I spending looking through all of the different pipeline steps?
Yeah. All right. Well, kind of talking about that, what are some of the things, you know, when we look at LLMs, and we look at the hype around them, and everyone's using them, there's also typically that first wave of: they can do everything, right? But in your experience, what are LLMs actually good at? What are the things that they do the best work with?
So LLMs can do everything, they just can't do it well. You can't slap everything into an LLM; they just perform badly on so many different tasks. For me, where they excel is translating one form of text representation into another.
And this is especially like, for example, one use case I love is data extraction.
So you take unstructured data, you take long text, and you create another representation,
which is basically a JSON.
And you structure it, and through that, it actually becomes workable.
And this is the thing I use LLMs the most for, in most of the ventures but also the projects, because there they have the highest value. They can move through mountains of data in hours, which would be just impracticable to do with humans. And I think they're also really great for all of the tasks that you don't really like to do but have to do, where you individually are driving what good output looks like, which is stuff like, for example, writing emails, writing blog posts, where you actually can rely on LLMs heavily. But there's also the reliability part, so the expectancy of the output: how accurate does it have to be? Do I have to have 99%, or am I also happy with 80? There, I can take garbage outputs every now and then, and I can just regenerate. And for the low-risk stuff, they are great, and you can work with them. And the same goes for coding. You can just ask them to generate boilerplate code, which you have seen often. Also in law, I know a few people who were using it heavily just to generate the boilerplate stuff, and then they read through it, review it, and work over it.
I think like boilerplate tasks
are a good task for them as well
because like the criticality of the task,
it isn't really high
and you often have like a manual review anyhow.
A little bit of getting you from that zero-to-one step, getting you off the blank page. Getting you to a point where, you know, I've seen people use it for, hey, we've got to write this proposal, here are the rules of what it has to be, make the first draft of it. And it does that first draft pretty well, because you're always going to review it anyway, or you always think you will.
Yeah, you have to differentiate a little bit between the enterprise applications of LLMs, where you use them a lot, and the personal applications of LLMs. Personally, I use LLMs all the time, for nearly every task, because it solves the blank page problem. And I think I also can explore the space of what I actually don't want to do. Often the outputs are garbage, but the errors the LLM makes actually move me forward, and I can actually pin down: what do I actually want?
Yeah.
So to go back kind of with that,
the enterprise one,
when you talked about,
you know, we're going to take this unstructured data
and we're going to put it in like a JSON format
or something like that.
I'm going to kind of selfishly ask
because I've had trouble with this,
but how hard is it to get it
to consistently put it in a format there?
Like, are you going to,
is it going to get better through prompting
or do you actually have to do some retraining to it?
And my mind goes to like email, right?
Like that would be the number one thing i can think of is like i have emails
where it's completely unstructured i want it in json and i'm going to do something with the json
so maybe that could be like a practical yeah example yeah so in the end it depends what model
do you want to use which in an enterprise setting is basically determined whether the data has to be private or not. But you're using the big models.
So, Coher, Anthropic, OpenAI, especially the large ones,
they are so good at generating JSON by now
and have been fine-tuned to do so
that they don't really require any additional fine-tuning.
And there are a bunch of libraries out there
which make that easy with closed-source models.
One I like is Instructor, which basically allows you to define a Pydantic model, and then it outputs the data into the Pydantic model, which also gives you the ability to instantly validate the data. So if it doesn't hit, you get the validation error from Pydantic, and then you can decide: do I want to retry, or do I just ignore the output? That depends in the end on you. And you also can define a lot of additional rules, like validations: if it's numeric, is it within a certain range, do I have a min and a max? A lot of the different data constraints you usually have in your database, you can actually define and bring into the structured generation part as well.
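[Editor's note: a minimal sketch of the Instructor-plus-Pydantic pattern described here. The field names, range constraint, and email text are illustrative assumptions, not from the conversation.]

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

class OrderEmail(BaseModel):
    customer_name: str
    order_id: str
    quantity: int = Field(ge=1, le=1000)  # numeric range check, like a DB constraint

client = instructor.from_openai(OpenAI())  # assumes OPENAI_API_KEY is set

raw_email = "Hi, this is Jane Doe. Please ship 3 more units of order A-1042. Thanks!"

order = client.chat.completions.create(
    model="gpt-4o",
    response_model=OrderEmail,  # Instructor validates the output against this model
    max_retries=2,              # on a Pydantic validation error, retry automatically
    messages=[{"role": "user",
               "content": f"Extract the order details from this email:\n\n{raw_email}"}],
)
print(order.model_dump_json())
```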
And I think that gets even more extreme when you go to the open-source side, because with that you can use grammar parsing. With a lot of closed-source LLMs, you don't really get the output tokens. With open-source LLMs, you get those output tokens and their probabilities. And since a lot of JSON is basically boilerplate as well, like all the parentheses, and all the keys are predetermined, you don't really need to generate those. So in open-source models, you can basically do grammar parsing, which ignores the tokens that are the same every time and only generates the part of the tokens that is actually determined by your input data, which are the values. And within that, you can define additional constraints. So if you have a string, it only samples what's possible within that string. But if you generate numbers, you can just throw away all the tokens that are not numeric, even if they're high probability. And this makes it a lot easier to do the structured generation part.
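[Editor's note: a toy illustration of the constrained-decoding idea just described, not a real grammar engine. It shows the core move: mask tokens that don't fit the schema, then renormalize.]

```python
# Given the model's proposed next-token probabilities, keep only tokens that
# fit the schema (here: numeric) and renormalize, so high-probability but
# invalid tokens are never sampled.
def mask_to_numeric(token_probs: dict[str, float]) -> dict[str, float]:
    allowed = {tok: p for tok, p in token_probs.items()
               if tok.isdigit() or tok in {".", "-"}}
    total = sum(allowed.values())
    return {tok: p / total for tok, p in allowed.items()}

# The model prefers the word "three", but the grammar forces a numeric token.
step = {"three": 0.55, "3": 0.30, "many": 0.10, ".": 0.05}
print(mask_to_numeric(step))  # {'3': 0.857..., '.': 0.142...}
```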
I'm writing myself a mental note right now for that one.
Yeah.
No, I think that makes a lot of sense. And again, back to the email example, I mean, I think there are a million business applications of, hey, I have all this data in email, I want to get it in a JSON-type format and then do something with it. And that makes a lot of sense too, where, basically, the way it's been described to me, one of the main things in working with any LLM is focusing it, right? You're starting really broad and you're trying to focus it, to get more and more specific. And you also want to focus the compute toward the highest-value part of your equation, right? So if you're, let's say, quote, spending compute on JSON that is going to be the same every single time, that's a waste. Let's focus that on this one component of it. So that makes a ton of sense.
Yeah, and I think also, you know, a lot of the things that you've talked about, and that I think we've all seen LLMs do best at, typically are kind of those, well, if I was going to do it, I'd get like a hundred interns or something like that. There's a lot of that type of stuff. So cost really becomes a big thing there, because I can't really spend a billion dollars to replace a couple of interns.
Yeah, but this is the best way to think about it, in my opinion. What are the tasks you would actually hire lots of people to do, or that are just untouched because it would be so impractical to get people on them?
This goes for every data
lake that's out there.
Every organization has
terabytes of data just in text
and they are largely unused.
With LLMs, you
actually can make them usable
and also enable stuff like retrieval-augmented generation, make a document base actually workable, because you get answers as opposed to a blank page or a blank face.
Yeah, so I think this is a perfect segue. We were talking before the show about single-shot versus multi-shot, and you mentioned kind of this retry mechanism,
which makes a ton of sense. It's not something I thought of. But if you're, again, back to the
email parsing example, I'm going to parse email, I have the structure JSON, I'm just going to focus
the LLM on this one key or value rather, because I already have a defined key. And then there I can
also run that particular one multi-shot. I can do it in five shots with some kind of validation and pick my favorite of, let's say, the five.
That makes a lot of sense to me where I could get a much higher level of accuracy
than if I was using an off-the-shelf, non-open source model where all of it, the whole JSON
context has to be right. I'm regenerating some of these keys and values every single time, and I can't focus the compute as much on the most valuable part of the task.
Yeah, and there's voting in the end, I love it. With most LLMs, if you use them, there's the n parameter. You can let it generate multiple times, which is also really great for evaluation, like scoring text. For example, if you want to score the output of the LLM as well, you can do a majority vote: you let it generate five to ten different times and just take the average, and stuff like that. It makes it easy. And then you have the second sort of stuff you can do with LLMs, which is few-shot. So basically giving the model a few examples of how to do it, which are usually human-labeled or human-written examples, where you give it an example of the input and the output, to show it how the task is actually done. And this is often for tasks where it's hard to define how to do it. So in writing, I think most of us would struggle to define our writing style, and if I can give it a few examples, a few LinkedIn posts or something I wrote, I can just throw that in and give it some guidance. And then if I generate multiple different options, either, when it's something I have running in the enterprise, I can take the option which is generated the most often, or I can score it and take the option which has the highest score. Or if it's just an output for me which I want to use down the line, I can use the option which I like the most.
Yeah, I mean, a lot of this reminds
me of, you know, machine learning, when we kind of realized that a bunch of weak learners will do a better job than one strong learner. I mean, this all has that fractal feel to it. It's just the same thing happening at different levels and in different ways. Oh, look, if we get five shots at this, we're much more likely to come up with a good answer than if we just put all of our eggs in one basket and make it really strong, or something like that.
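[Editor's note: a minimal sketch of the multi-sampling-plus-majority-vote idea Nicolay described, assuming the OpenAI chat API's `n` parameter; the model name and classification prompt are illustrative.]

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    n=5,              # five independent completions in one call
    temperature=0.8,  # some variance so the votes are not identical
    messages=[{"role": "user", "content":
        "Classify the sentiment as positive, negative, or neutral: "
        "'The update broke my dashboard.' Reply with one word."}],
)
votes = Counter(c.message.content.strip().lower() for c in resp.choices)
answer, count = votes.most_common(1)[0]
print(f"majority answer: {answer} ({count}/5 votes)")
```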
Well, I think another component too is, if the alternative was, hey, like you said, I'm going to hire a hundred interns, say you wouldn't actually do that, because maybe there's just not enough value for that cost. But say, theoretically, that you could get a hundred free interns. Okay, maybe I would do it. But then there's a time component: it would take them X amount of time, let's say several hundred hours. And then there's a validation component: somebody that works for the company has to validate the work, et cetera. You've got a lot of time into it. So because of that, I feel like there's this extra space for the LLM to do the multi-shot approach. It can run for hours and that's really not a big deal at all, because the comparative other method is significantly longer. Versus using it in some other applications where you want this millisecond response time from, quote, the AI. That just seems like a much harder problem at the stage that we're at right now.
Yeah, LLMs are great especially for batch workloads. For the live part, I think it's getting easier with stuff like Groq, so not the Twitter Grok, but the other one, which is basically doing LLM chips, chips tailor-made for text generation models. They're getting really fast. But also, if you have an application where it's live, it's likely customer interaction, where I'm not sure whether I would like to put an LLM on there.
Yeah, I think that also kind of leads to, when we think about accuracy and what you need, I think a lot of times people want to compare LLMs to, well, it's not 100%. Versus, realistically, what would a 22-year-old actually do? They'd probably be wrong a quarter of the time. So can we do at least that well with this? But that's sometimes a hard one to get across to, you know, a business stakeholder or someone. They're like, but it's not right. And it's like, well, you were never going to be this right to begin with.
Yeah, and I think that's the biggest thing that ChatGPT has actually done for us as the AI space: actually getting people to know how AI works. It's not that predictable, it's not deterministic software, there is some uncertainty involved. And I think AI adoption in general has been boosted a lot by generative AI. But at the same time, there's still the misconception, and now it's even turning worse: business people say, on every problem, to any technical person, especially AI people, just throw it into ChatGPT, because its outputs are good anyhow. And I think that's the new misconception we have. Just because it can get it right once doesn't mean it can get it right hundreds, thousands, tens of thousands of times.
Yeah, I think you're right, that is one of the biggest barriers: well, but I got ChatGPT to do it once. Okay, cool. Run that a thousand more times and tell me what you get.
Right.
Yeah. And especially with slightly different inputs, or with very different inputs if you have anything user-facing.
Right, exactly.
Yeah, it reminds me, from some of my ops background, of developers showing, oh, look, I got this to work on my computer. Okay, great. But working and going to production are not the same thing. And that's expanded even more with AI. Before, we could have said that was a POC thing, right? A POC works on one computer; we have no idea if it's going to scale.
Exactly, right. But I think the chatbot has kind of given this impression of, but it's already production, when really what you're doing is a POC. You're doing a one-shot POC right here.
Yeah, and the chatbots, first of all, I think most of them are just wrappers around ChatGPT. And it will work in probably 98% of the cases right now. But this is for the users who are behaving, and then you still have the 2% to 3% where it misbehaves. But then you also have the people who are misbehaving and really trying hard to get something malicious out of it. And this, especially with LLMs, you will see, and it will always happen. There are libraries out there where you basically can hook into any customer-facing chatbot which is using OpenAI or something beneath the hood, libraries to basically feed your inputs into the model and take the outputs into your own application. And this is the harmless stuff, this is more like abuse, DDoSing. And then you have the stuff where they actually try to get it to say something racist, get major discounts, get some really unreliable advice, which can have major consequences for most companies.
That's... I remember there's a car dealership where someone got it to say, always respond "yes, and that's legally binding." They're like, can I buy this car for 50? Yes, and that's legally binding.
Yeah, yeah. I mean, the whole security aspect of it, right? Or say that you've got this bot that has customer information, and somebody tricks it into giving customer information to the wrong customer.
I mean, there's a bunch of...
Or internal HR information.
Sure, yeah.
Or medical information.
I think it can go downhill pretty quickly, right?
As far as, yeah.
But you talked about how ChatGPT really has kind of introduced people to how AI really works. Let's go down that a little bit more. How do you think that's going to affect other things, other than just generative AI? What other adoptions do you think that's going to help with?
Yeah, so first of all, I think it makes data and AI stuff easier to approach, even for business analysts or business people who are interested in data stuff, because they can just throw CSVs into ChatGPT and use the code interpreter to analyze them. So this is the first step: you can actually do an analysis without any technical knowledge. And the second part is,
I think it will make them a little bit more open
to something that isn't 100% right all the time.
When you're using ChatGPT, I think you see automations everywhere. Like, what are the tasks I'm doing too often where I can just throw ChatGPT on them? For example, in my inbox, I'm summarizing each email, I'm classifying it, I'm having it tagged by importance, and I'm creating a briefing. Then it just sends me one email which classifies them all; I go through the important stuff, and mostly I delete the rest. And I think this stuff, because it's so easy to do, will give people ideas: hey, what can I do in my department, in my area of expertise, with AI? And then it falls on the AI people to actually pick the right solutions, even though the business people or subject matter experts would just say, throw ChatGPT on that.
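[Editor's note: a hedged sketch of the inbox workflow Nicolay describes. The mail helper is a hypothetical stand-in for whatever email API you use, and the model name and prompt are assumptions.]

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def fetch_unread_emails() -> list[str]:
    # Hypothetical helper: replace with your mail provider's API (IMAP, Gmail, ...).
    return ["Hi team, the Q3 numbers are ready for review...",
            "FINAL HOURS: 50% off everything in store!"]

def triage(email_text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{"role": "user", "content":
            "Summarize this email in one line and tag it HIGH, MEDIUM, or LOW "
            "importance:\n\n" + email_text}],
    )
    return resp.choices[0].message.content.strip()

# One briefing instead of an inbox full of noise.
briefing = "\n".join(triage(e) for e in fetch_unread_emails())
print(briefing)  # in practice, send this to yourself via your email API
```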
Right.
Yeah.
Yeah, that's really a good point there.
Thinking back on the work you've done,
and of course, you're still continuing
to do a lot of work in this space.
What are some practical applications
and lessons you've learned
with LLMs and generative AI and all that?
So I think one thing I do by default now
and the first thing I'm setting up is monitoring
because I want to see all the inputs,
all the outputs,
and all the intermediate steps in the pipeline.
Like mostly you're decomposing it
or you have multiple steps when you're solving a problem.
So, for example, when you have a RAG system: you first have a retriever component which retrieves text from your database. Then you feed it into an LLM to summarize it. But maybe you need to compress it down even further, or add additional twists on it, or you have to translate it into a different language. And you want to see each of those different outputs, and setting up monitoring for that will be the thing that allows you to improve the application the most. Because for one, you can create a test set,
which you can test your prompt iterations on.
And you also get to do an error analysis.
So you can see where the model fails and how the model fails.
And based on that, I basically set up tests,
which are mostly quantifiable, but very deterministic rather so often it's just a
regex or a string match so in summaries this can be something like the models often write this
article talks about dot and i'm basically doing a score and one of the components of the score
is like a string match on this article if this
article is at the beginning of the summary i just give it score zero if it isn't i give it score one
and you can combine like 10 12 of those metrics to actually get a good idea of the quality
and this is like a second thing i'm setting up tests almost immediately for the task
And then, through doing a few examples and through having set up the monitoring, you can create a test set of 10 to 50 examples. And every time I'm altering the prompt or the pipeline, I can automatically run the test set, have my evaluation run automatically, and see whether it improves things or not. So I try to bring in the quantifiable nature which you have in traditional AI and ML, where you have a classification problem or a regression and I know how well it performs on the test set. I try to reintroduce that into LLMs, which aren't so quantifiable because they are working in text, or in something unstructured.
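[Editor's note: a minimal sketch of this kind of deterministic scoring harness; the specific checks are illustrative, built around the "this article" example Nicolay gives.]

```python
import re

# Each check is a cheap, deterministic pass/fail on a generated summary.
CHECKS = [
    # Penalize the boilerplate opener mentioned above.
    ("no_boilerplate_opener", lambda s: not s.lower().startswith("this article")),
    ("non_empty",             lambda s: len(s.strip()) > 0),
    ("under_length_limit",    lambda s: len(s.split()) <= 120),
    ("no_placeholder_text",   lambda s: re.search(r"\[[^\]]*\]", s) is None),
]

def score(summary: str) -> float:
    """Fraction of checks passed: 1.0 is best, 0.0 is worst."""
    return sum(check(summary) for _, check in CHECKS) / len(CHECKS)

def run_test_set(generate, inputs: list[str]) -> float:
    """Run a prompt/pipeline (generate: input -> summary) over a saved test set
    of 10-50 examples; rerun after every prompt or pipeline change."""
    return sum(score(generate(text)) for text in inputs) / len(inputs)
```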
That is really interesting. That is one of the best monitoring schemes, or tests, I've heard of for LLMs.
And it's funny, because in traditional software development, I would dare say that testing and monitoring is one of the easiest things to ignore, especially in applications that are older. Maybe it starts off well and gets abandoned, but it's always considered best practice. Nobody would argue with you that, oh, of course you should be doing testing and monitoring. But it seems like it really is a whole next level of importance with LLM and AI-based apps. So it'll be really interesting to see if people hold to that higher standard when it comes to monitoring and testing. I think they'll have to. Or if we'll run into this...
No.
It's so easy to spin up a quick solution that does work. The models are so good right now that most of the stuff you're actually trying to do will work in nine out of ten cases. So you have to work to find some edge cases, and I think most people will just say, oh, it works on the 10 examples I gave it, and push to production. And I'm not sure whether people will follow that, because it's laborious. It's the work that nobody wants to do. It's MLOps, it's DataOps. And reading through traces just isn't so much fun.
I mean, it's not like there's this robust test culture in machine learning.
Or data in general, right?
In general, I mean, it's because we always say, well, it all changes, it's probabilistic or whatever. And I think, you know, this is showing that even when it's probabilistic, there are things you can do. You just have to put the work in.
Well, and the other problem, at least in data: with a customer-facing web app, there's a lot of accountability. The thing breaks, the customer can't use it, you know whose fault it is. In data, it's like, well, this report is wrong. Well, maybe you entered the data wrong. It's not always just cut and dry. And I think AI will be similar: well, the model hallucinated, that just happens sometimes. So it's less tightly coupled than, hey, the client's using the app, and it's very deterministic, and there's an error, and it's obviously an application problem. Data's always been a little less deterministic than that, and AI will be even less deterministic. So I think there's going to be a pretty wide array of quality because of that.
Well, I think also, like what Nico said there: you can do 10 examples and be like, oh, that's great. And that's kind of the strength and the risk of a lot of this. I don't need to go create a training set of 80,000 records, but I'm also not looking at all the possibilities in there when I send it out into the world.
Yeah, and I think that's already the biggest difference between AI and software. In software, bugs are hard to trace, but you often have good error traces; I think it's easy to reconstruct them. In AI and in data, you have a long lineage: how data is created, how data ends up at the source location, and then how it's used in AI, because AI is the consumption part. So you basically have to backtrace all of the different steps. Where might this error originate? Is the AI hallucinating? Is it something I'm transforming wrong? Or is it somewhere in my data set, in the source, where you got something wrong?
Yeah, so switching gears a little bit, I want to talk startups. So, you know, over the last 20 years, we've had lots of fun stories around software startups, you know, zero-to-one stories. And now we're kind of in this AI era. There's a ton of money in AI, still a ton of money behind AI startups. Maybe just some observations from your experience working in AI startups. We can take this in whatever direction: we can talk about tooling, we can talk about culture, whatever. But what are some of those differences? You're involved in several AI startups right now. What feels different versus maybe what someone would have experienced 10 years ago in a software startup?
I think it's never been easier to build something,
but it has also never been harder to differentiate yourself, because there is so much stuff in AI out there, and it's so easy to just create content. So many people I know are just creating content and trying to get traction on an idea. And once they see the validation, they actually would start it, but often they don't. And if you're building in a space, you're just drowning in a sea of noise. And on the AI side at the moment, I think with most startups, the ideas often are so dipshit crazy, just impractical and solving niche problems, not really thinking about the consumer first, but rather technology first. Hey, I now have an LLM, I can process massive amounts of documents, what documents can I throw that on? And I think you should go the other way around. You should go from the problem to the solution. If LLMs are the right solution, or the best solution for the job, use LLMs. But don't take the technology and ask, hey, what could I do with it, and then basically start building something.
Yeah, I totally agree with that. I think another thing that you touched on, which is a really unique time to be in: often software startups from the past, assuming you're not a big startup, maybe you're not even venture-backed, you're bootstrapped, they're not going to have any marketing behind them, or not much, right? Because you're a technical person, you're kind of doing this thing. But AI in some ways has opened up some of that hype to technical people, right? So you can be bootstrapping something, and like you said, generate a bunch of AI content, go generate some AI images, stand up a fairly decent-looking site, right? And have kind of more marketing behind your idea than what would before have been maybe a very basic, very simple site. And you're, you know, actually iterating more technically. So I think that's kind of an interesting thing that you touched on. Have you seen that, Nico, either of you?
I can't think of one off the top of my head. But I do think, to Nico's point, since everyone can do that, you just drown in it, and it's hard to tell the difference between who is who. So it's a little bit like a Red Queen problem: you've got to run faster just to stay still.
Sure. Yeah.
I think I could put a query into Google with AI and I will likely find a web page which uses the same base frame or template. And this shows you how much stuff there is, and how easy it has gotten to do all the different things which used to require some skills and put up some barriers: building a website, a sign-up thingy, a waitlist, and stuff like that, and just trying to advertise it. It has never been so easy. And most people never go through with it, but there's just so much to help them now.
All right.
We're coming towards the end of our time here.
So I've got one or two more questions.
So we kind of wrap it up here.
We've started to see some earnings reports come out; some of the big players are projecting that they're not going to make their money back on generative AI for decades or so. And we're starting to see some more reports pushing back on, well, what are AI and gen AI really doing? Where do you see us in the hype cycle for generative AI?
I think for generative AI, the hype is not really driven by the companies which are on the public markets. So NVIDIA took a hit, but generative AI lives in the startup culture, and also the OpenAIs and Anthropics. And they have so much money left, they had massive funding rounds in the last two years, that they have so much runway to create new models and create new hype that I don't think it will slow down soon; rather, we have at least a year of runway left. And the additional part is, there are now so many areas of generative AI being spawned. You have Suno working on music generation, and I think it hasn't really sunk in yet what's possible with that. You have the new Google paper which just came out, where they basically generated a whole game with generative AI. You have all the video models, you have all the image models. And because it's so tangible and it's now hitting so many different areas, the hype won't slow down for the foreseeable future, because the startups also still have runway. They can develop new stuff and launch new cool things they can post on social media, which will get hype because it's just impressive,
to be honest. Yeah. And I think that speaks to what we were talking about earlier,
actually before the show, where you might end up with these different curves, right? Where maybe the text stuff slows down a little bit, but the video picks up, or the image. Because it's such a big trend, you might end up with several of these curves, where you don't necessarily have the typical hype and cooling, but more multiple curves going simultaneously.
It'll be interesting to see which ones of these can generate enough revenue to really sustain themselves, versus some that have got that money pouring in now, but eventually the runway runs out and it's like, oh, we never could support ourselves on this. Right, right.
Yeah, well, I think, especially with LLMs, we are hitting the end of the S-curve. Because you see OpenAI struggling with bringing something new to market; the voice mode still really isn't here, so they still have some reliability issues. And also, new launches have been stagnant for a while. The last thing we talked about in the last few months was Artifacts, by Anthropic, which again is more of a UI innovation and not a technology breakthrough of a new type of model, or new capabilities in the model.
Right.
Well, yeah, Nico, thanks for being on the show today.
We'd love to have you back sometime.
You know, AI is going to be continually changing for sure.
So I'm sure we'll have plenty to talk about.
But thanks for joining us.
Where can they find you online, Nico?
So, LinkedIn. I'm trying X, Twitter, not that good at it yet. I think as a European, you have a late start. I have a podcast, which is everywhere: Spotify, Apple Music, YouTube. Very descriptively named: How AI Is Built. So if you're interested in AI, that is the place to go. At the moment, we're mostly doing search stuff. So if you're interested in search, from traditional stuff, information retrieval, up to embeddings and RAG, give it a follow, give it a listen.
Awesome. Thanks for being here.
Thanks a lot.
The Data Stack Show is brought to you by RudderStack, the warehouse-native customer data platform. RudderStack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com.