PurePerformance - How to test, optimize, and reduce hallucinations of AIs with Thomas Natschlaeger
Episode Date: October 13, 2025
While Artificial Intelligence seems to have just popped up when OpenAI brought ChatGPT to the consumer market, it has its roots in the middle of the 20th century. But what is it that all of a sudden made it into every conversation we seem to have? Thomas Natschlaeger, Principal Data Scientist at Dynatrace, who has been working in the AI and machine learning space for the past 30 years, gives us a brief historical overview and describes the critical evolutionary steps and compelling events that made the technology what it is today. Tune in and hear how AIs are trained, how they are optimized and, most importantly, how their outputs can be tested and validated! In our conversation we discuss current trends towards small language models that will help model digital twins of our existing roles, and how AIs are used to validate other AIs, much like we humans do when a senior engineer pair programs with a junior and thereby provides essential feedback on current accuracy and input to improve the outcome of future tasks.
Links we discussed:
LinkedIn profile of Thomas: https://www.linkedin.com/in/thomas-natschlaeger/
Ask Me Anything session on Davis CoPilot: https://www.linkedin.com/posts/grabnerandi_llm-copilot-activity-7373837743971393536-QgxV?utm_source=share&utm_medium=member_desktop&rcm=ACoAAABLhVQBbh8Jkn_K8din5tsQlMCpXRNzlKU
Voxxed Conference Talk: https://amsterdam.voxxeddays.com/talk/?id=39801
Attention Is All You Need paper: https://en.wikipedia.org/wiki/Attention_Is_All_You_Need
Transcript
It's time for pure performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello, everybody, and welcome to another episode of Pure Performance.
My name is Brian.
Wilson. And as always I have with me, my co-host, Andy Grabner, who's here with the Davis AI
figurine. And I just want to let people know, because this is audio only. Andy has not been
mocking my intro lately. I think at least the last two or three podcasts, he's been being a normal
human being. So thank you for that, Andy. And I think we can all have some gratitude.
But Andy's playing with toys today. Yeah, it looks like it, right? I don't know where,
when I picked it up, Davis AI. I think it was the last.
It was perform.
I think it was the perform in 2020 before COVID, the last one before COVID,
because the actress who played our Davis AI, she was also on site, if you remember.
And you could take selfies with her and then you could take a, what's it called, a bubblehead?
Yes, bubble hood.
Bobblehead, yes.
Now, the question is, why do I have this figure in my hand today?
Because you're feeling playful?
Maybe I feel playful, yeah.
But really, because last week, I had a chance to, actually I've been in Poland, and this is a true story, I've been in Poland, went to our office in Gdansk, and then in the evening, got back to my hotel, wanted to go to bed, but then I always make, not a mistake, in this case it was a good mistake, I opened up LinkedIn, and I saw that two of our colleagues, actually three of our colleagues, did an Ask Me Anything session, and these colleagues were.
Gabreda Hassan Birkenmann, Sophia Habip, who moderated the whole thing, and our guest from today,
Thomas Natschlaeger. And you talked about how we internally build the Dynatrace Davis CoPilot,
how it works internally, how we translate the prompts to DQL, how we train the model,
and also how we test these model changes and how we in general deal with testing.
And Brian, I thought because we both have a big history and passion for software quality,
I want to invite Thomas and learn about how we can make sure that the LLMs and whatever AI models we use
and also anybody uses out there, how we can make sure the results produce the right quality.
It's very important topic.
Especially as we move a lot more, everyone's moving more into AI.
And just like we have to trust what our observability is telling us and all that kind of stuff,
there's this trust factor with is what AI is telling me true, right?
And we've seen plenty of cases where, you know, I'm not saying in the tooling,
but in general, if you take a look at Copilot, Grok, some of these others,
sometimes they just come out with these crazy, crazy either hallucinize,
I was about to say hallucinization, hallucinations, or other things.
So, you know, it's interesting you said that.
I don't mean to jump on this one point right here,
but until you said that, I really hadn't thought about this idea of like,
can we trust what our AI is telling us?
So, yes, I think it's really good.
Please continue.
And so with this now, I want to officially welcome Thomas Natschlaeger, Principal Data Scientist at Dynatrace.
And if I look at his LinkedIn profile, it's really fascinating to see you spend a lot of
time at the Technical University in Graz, then at the Software Competence Center in Hagenberg.
And then you were a machine learning engineer at Blue Sky Weather, so I hope you
had a lot of blue-sky predictions, because especially the last summers were pretty nice
here, and now you're at Dynatrace.
But, Thomas, can you please introduce yourself to our audience, and then we'll jump into
the topic.
Yeah, thanks, Andy, for the nice intro.
Yeah, you enumerated all the places where I've been working.
You can guess that I'm not 21 anymore.
And actually, sometimes I make the joke that I have written my first neural network in C something like three decades ago.
So there is some experience in that area, and also my station at the university in
Graz was about computational neuroscience and how the brain works.
We tried to figure out some, let's say, theories about that.
And that as of today comes in kind of handy to understand how all these layers in the brain
work when you compare them to, let's say, all these convolutional neural networks and
deep neural networks and stuff like that.
So this early on knowledge in that area helps me to understand what's going on to some
degree in all these large language models.
Yes, and hopefully also, let's say, the efforts we took back then at the university
to simulate those brains and analyze that data to understand what it is doing, and also come
up with some ideas how to test hypotheses for these, back then they were called
spiking neural networks. So everything ties together here nicely,
and I think, let's dive into this topic of testing those beasts, so to say. Yeah, before actually
talking about testing, because this is the way our podcasts work, we always get an idea and a thought,
and then I want to talk a little bit more about this. You have been in this field for so many years.
I actually remember, it was in the early 2000s.
I did a, not a master thesis, but I did a, what's it called, before the master?
Am I stupid?
Bachelor?
Bachelor?
Thank you so much.
Because I went to high school.
That's why you didn't get a master thesis because you can't learn for bachelor.
Exactly.
Because I went to high school, then started working.
And afterwards, I did a bachelor degree while I was already working.
And one of the things we did back then, in the early 2000s, was
with somebody that you may also know, Ulrich Bodenhofer.
He was my lecturer, and for the bachelor thesis we worked on some
machine learning algorithms to predict and calculate how to best steer a kind of autonomous
driving car, and we were training the model with data on when to brake and things like that.
So that was like 20 something years ago.
And for most of us, and I think especially consumers of this technology, it feels like everything
started like really three years ago because it became tangible.
Can you explain from your perspective a little bit what has changed and why did we all
of a sudden see this, was this a big, was this just luck or was this, why all of a sudden
was this big change where the whole world is now not only talking about it, but also using
LLMs?
What has changed?
What changed?
I mean, it was really, let's say, an accumulation of, I don't know, events and long-term efforts, to be honest.
I mean, the whole neural network community has been working on this for a long time.
The first neural network was invented in 1950-something by McCulloch and Pitts.
And then later on, the groundwork for all these deep neural network stuff,
there was this fundamental paper by Rumelhart and Hinton and guys in 1986 or something like that.
And all of these papers actually were milestones and breakthroughs
at that point in time, but due to the lack of, you know, not having Nvidia back then, right?
It didn't make it into the public because all these great ideas actually demanded lots of computational
power.
And from my point of view, the major breakthrough was actually already earlier, let's say
in the late 2000s, when the first people started to think about reusing GPUs, not for games,
but for neural networks.
I remember a colleague of mine
at the university in Graz, when he
came to me and said,
look here, we are implementing variational autoencoders
on a graphics card.
And I said, oh, maybe a good idea.
And back then, everybody had to do it on their own,
you know, making their hands really dirty
with this really nasty, close to assembler code on graphics cards.
Lots of bugs in there.
Also great opportunity for testing.
Lots of bugs in there, in this very low-level code.
And then the, let's say, this kind of evolution then typically is coming in waves.
And sometimes there is a threshold and some emergent behavior pops out, right?
You know, it goes like this, like a baby when it's learning to walk, right?
For a long time it's crawling around, a few times it tries, and suddenly it walks, right?
It's not that it can first do two steps, then three and four; suddenly it can do all of them.
And that's the same in this evolution of neural network capabilities when everything comes
together, the hardware, the knowledge, the experience, and how to train them.
And then in the computer vision area, this thing exploded around 2010, basically.
And then people from the language modeling community,
which had mostly been using, I don't know, hidden Markov models and whatnot to model
language back then, at some point in time were looking also more and more into this neural
network area, right? And then, let's say starting at 2015 to 2018, one could already see that
these language models evolved over time, right? And then there was, I think it was 2018, there
was this one big paper, which is called Attention Is All You Need, where researchers
published the first transformer model,
and then from there it was like
only three years on,
and then, I mean,
on Hugging Face you could already have
many of those transformer models back then, and then
OpenAI just made business out of it,
so backed it
with big money and an easy-
to-use API,
and that's
where we are now.
It's really interesting, Andy. I never thought of that.
You know, it's stupid that I didn't know.
Like, there's no
reason I shouldn't have known that AI has been being worked on since, what did you say, the
1950s with the first neural network?
Yes.
But I think I fell into the same trap as everybody where it's like, you just think it's a recent
development.
But meanwhile, it's been talked about for forever, right?
Whether it's in science fiction or movies, well, I guess that's science fiction as well.
But, yeah, of course it was being worked on.
But it's not something, I think, any regular person thought.
Like, there is this tremendous impression that it just popped up out of nowhere.
and things don't pop up out of nowhere
and it's been people working
long and hard on this stuff
and I guess it really sounds like
when the technology finally caught up
to give it the powers
when it was able to finally
make that leap
so just interesting question Andy
thanks for asking that
because it makes me feel stupid
for not even thinking that it existed
but yeah of course it did
I have one
maybe one funny anecdote
because there is this
very well-known researcher
in Switzerland, Jürgen Schmidhuber,
and he is famously known to be, how to say, very exact when it comes down to which credit goes to whom, right? And he is always trying to find out what was the earliest citation of some new invention and things like that, right? And in his opinion, for all this kind of TensorFlow, PyTorch neural network machinery which does all the
learning and stochastic gradient descent, the real credit goes back to a student in Finland in
1970 or something like that, who, according to his opinion, invented for the first time
automated gradient descent for some autonomous control machinery or whatnot. So if you are
that serious, the whole journey already started
quite some time ago.
Yeah, and it's always good to look back.
The reason why I asked is because I was fortunate enough, back when I did my thesis,
to work on this topic. Even though, I've got to tell you, back then for me it was interesting,
but still it was so cumbersome to create and compile our models and then to train
them, and then the results were not that great.
I mean, we had limited time, obviously.
And then seeing what we do right now is just amazing.
But, and this is now where I want to kind of transition over, when I now use these large language models, I'm still sometimes frustrated, because I ask a question and I get an answer that makes sense.
I ask the same question again and then I get an answer that is completely not what I was expecting.
And then I'm wondering: why do you give me two different answers for a similar question, what has changed, how is this even possible?
And now kind of to how we are using LLMs, and with "we" I mean the observability space.
If you look at observability, I think every vendor now is basically saying: we have the data and we put our models on top, and then we get better insights and better answers to make sense out of the data, and you don't have to become an expert in analyzing all of this.
I would like to know from you, now a couple of months or maybe a year or two in on how we are applying this to our data,
what are kind of the lessons learned, what are still the challenges, and what do we do with testing to improve the situation?
So before I jump into this whole testing thing, because I will refer to that sometimes: basically we have what I would say we call skills in place. One is this more like a traditional, well, traditional since two years,
chatbot based on a large language model, and the other one is this translation of natural text
into our query language, right? So that caters, from my point of view,
to two distinct use cases. The one is more "get me a summary, get me an explanation,
an easy-to-read text" such that I can navigate, for example, our documentation, that I can navigate
other text. And the other one is more like code generation, you know, all these code generation tools.
And in our case, it's generation of our own invented query language.
So the output is structured, is let's say testable, actually, whether it's, for example,
syntactically correct, whether it's semantically correct, whether it's executable by our
database engine, by our data lake engine.
So from that perspective, when it comes to testing, there may be, or there are actually
then two different approaches there.
One is this kind of, let's say,
free text output,
which you want to check
and to see whether the output
of the system we built,
of our Davis CoPilot,
is giving what you expected, right?
It's not exactly,
as you said,
sometimes it does it a little bit longer,
a little bit more,
longer paragraphs, smaller paragraphs,
what not.
But as soon as it's
close to your expectations, it's fine.
But on the other hand, when you look at the DQL generation, it's a different thing.
When it forgets one comma, right, it's not executable.
So you may want to help the LLM there, in particular to get the syntax
right, and then also to get the semantics right.
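To make that idea concrete for readers, here is a minimal sketch of such a validation gate (in Python; the toy checks and command names below are purely illustrative and not the real DQL grammar or Dynatrace internals):

```python
# Toy validation gate for LLM-generated queries. The syntax check and the
# command list are illustrative stand-ins, not the real DQL grammar.

KNOWN_COMMANDS = {"fetch", "filter", "summarize", "sort", "limit"}

def syntactically_valid(query: str) -> bool:
    """Cheap structural checks: balanced parentheses and known pipeline stages."""
    if query.count("(") != query.count(")"):
        return False
    stages = [part.strip().split()[0] for part in query.split("|") if part.strip()]
    return all(stage in KNOWN_COMMANDS for stage in stages)

def executable(query: str) -> bool:
    """Placeholder for a dry run against the real query engine."""
    # A real system would submit the query in a validate-only mode and
    # surface engine errors back to the test or to the LLM for a retry.
    return True

def validate_generated_query(query: str) -> list[str]:
    """Return a list of problems; an empty list means the query passed."""
    problems = []
    if not syntactically_valid(query):
        problems.append("syntax error")
    elif not executable(query):
        problems.append("not executable")
    return problems

# A missing pipe stage or an unbalanced parenthesis would be flagged here.
print(validate_generated_query("fetch logs | filter status == 500 | summarize count()"))
```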
And then, also from what I have learned, there is this whole
life cycle of such a system which you build.
It's not only the LLM, right?
For example, our Davis CoPilot is what you typically call a retrieval-augmented
generation approach, where under the hood you use an LLM plus knowledge sources, knowledge
bases, right? So you ask the system a question, then you try to find
the relevant document text pieces in your knowledge base, you pull out these relevant
text chunks, and then you ask the large language model to give you a comprehensive summary
of what was found in the knowledge base.
That's this classical retrieval augmented generation.
And so what you need to test is not only the LLM, it's the whole system.
It's more like an integration test at the end.
Obviously, for each of the pieces, you can have a unit test, but overall, it's an integration
test there.
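For readers who want to picture the pipeline being described, here is a stripped-down sketch; the knowledge base, scoring, and model call are toy stand-ins, not the actual Davis CoPilot implementation. The point is that `retrieve` and `build_prompt` can each get unit tests, while `answer_question` is the integration-test surface:

```python
# Toy retrieval-augmented generation pipeline; every piece here is a
# simplified stand-in used only to illustrate the testing surfaces.

KNOWLEDGE_BASE = {
    "doc-1": "Sample documentation chunk about writing queries.",
    "doc-2": "Sample documentation chunk about alerting and problems.",
}

def retrieve(question: str, k: int = 1) -> list[str]:
    """Unit-testable step: return the k chunks sharing the most words with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        KNOWLEDGE_BASE.values(),
        key=lambda chunk: len(q_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question: str, chunks: list[str]) -> str:
    """Unit-testable step: ground the model in the retrieved text."""
    context = "\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def call_llm(prompt: str) -> str:
    """Stand-in for the real model call."""
    return "stubbed summary of: " + prompt.splitlines()[1]

def answer_question(question: str) -> str:
    """The whole chain -- this is what the integration test exercises end to end."""
    return call_llm(build_prompt(question, retrieve(question)))

print(answer_question("How do I write a query?"))
```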
And when you start out with this kind of
chatbot thing, it's a little bit different from what I was
used to when I built, let's say, weather forecasts, right?
Because you mentioned that I was a machine learning engineer at Blue Sky, where we basically
trained each day many, many hundreds and thousands of machine learning models to predict
locally precise temperature, I don't know, energy production, humidity and stuff like that.
So typically numerical values.
For these numerical values, you have well-established KPIs for the quality of a trained model:
root mean squared error, mean absolute error, R-squared, whatnot.
While for this text output thing, those kinds of measures had to be established over the last
two years, while the others have been around for the last 50 years.
And so the community also went through a rapid development process of coming up with various
ways of how you actually measure those KPIs, whether an answer the system is producing
is kind of close to what you were actually expecting.
And this measure of closeness has also evolved over time with the availability of the measurement
tools, I would argue.
As I mentioned, let's do a time travel again and let's go back to
2012, for example, where most of the language modeling was done with, let's say, purely
statistical models.
And back then, the measurement of whether a produced text and the desired text are
close together was based on measures like: you do tokenization and you count how many
tokens are equal, how large is the overlap of tokens, is the order somehow matching?
So, I don't know, say the output is "buy me some butter", for example, or whatnot,
and the word "butter" in the generated output is in the first place, while in the expected answer
it's in the last place.
Even though the same word is there, it's not a perfect match, right?
You try to come up with all kinds of measurements from a statistical point of view, taking all
these things into account, let's say, engineered KPIs.
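As a rough illustration of the kind of engineered, pre-LLM closeness measure described here, the following small sketch combines token overlap with a crude check on token order; real metrics in this family (BLEU, ROUGE, and friends) are considerably more refined:

```python
# Toy text-closeness score: token overlap plus a crude order check.
# Real metrics in this family (BLEU, ROUGE, etc.) are more sophisticated.

def token_closeness(expected: str, generated: str) -> float:
    exp = expected.lower().split()
    gen = generated.lower().split()
    if not exp or not gen:
        return 0.0

    # How many expected tokens show up in the generated text at all?
    shared = set(exp) & set(gen)
    overlap = len(shared) / len(set(exp))

    # Do shared tokens appear in roughly the same relative position?
    positions = [
        abs(exp.index(tok) / len(exp) - gen.index(tok) / len(gen))
        for tok in shared
    ]
    order_penalty = sum(positions) / len(positions) if positions else 1.0

    return overlap * (1.0 - order_penalty)

# "butter" appears in both, but in different places, so the score drops.
print(token_closeness("buy me some butter", "butter is what you should buy"))
```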
And then later on, as of now, the LLMs came alive, right?
They produce text now, but they are now also used as judges, to judge whether the
generated text is actually what you want as an output.
So it's now a very common technique to use, let's say, an LLM which is
more powerful than the LLM you are using to generate the output as a judge, to argue about
whether it's faithful, whether the output is relevant to the found documents. So that's
the question of whether it's faithful, to avoid hallucinations, right? Because if the answer is
completely off from what you retrieved from your knowledge base and it still is making something up,
then it's not the best, because why would I show you the facts if you still ignore them?
So yeah, exactly. That's about how you compute these KPIs nowadays,
mostly again with other LLMs, or with some, let's say, semantic closeness measures,
which are also used in the semantic search thing.
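A minimal sketch of the LLM-as-a-judge idea could look like this; `call_judge_llm` is a hypothetical stand-in for whatever stronger model is used as the judge, and the prompt wording is just an example, not the one used in any real product:

```python
# Sketch of an LLM-as-judge faithfulness check. call_judge_llm() is a
# hypothetical wrapper around the (stronger) model used as the judge.

JUDGE_PROMPT = """You are a strict reviewer.
Context retrieved from the knowledge base:
{context}

Answer produced by the system under test:
{answer}

Is every claim in the answer supported by the context?
Reply with exactly FAITHFUL or UNFAITHFUL, then one sentence of reasoning."""

def call_judge_llm(prompt: str) -> str:
    """Stand-in for a call to the judge model."""
    return "FAITHFUL - the answer only restates the context."

def is_faithful(context: str, answer: str) -> bool:
    verdict = call_judge_llm(JUDGE_PROMPT.format(context=context, answer=answer))
    return verdict.strip().upper().startswith("FAITHFUL")

print(is_faithful("Pipelines start with a fetch stage.",
                  "A pipeline starts with a fetch stage."))
```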
Quick question on this, because this sounds like, you know, to explain this simply, we are creating
digital twins of things we also have in human life.
We have a teacher who is more, let's say an expert, who is, you know, better trained
on a topic.
And then we have, you know, people that try to become experts.
And then it's basically: I create an output and then it gets checked by the expert,
like in pair programming.
I have a junior developer and I have a senior developer and they do pair programming.
And so with this, the kind of cross-check, I guess the only challenge is if we believe that these systems really always work.
But if the expert is actually not an expert, because the expert was trained on wrong data, then in the end, the whole system just doesn't really produce great output because it's, maybe I'm thinking this wrong, but it's also like in human life.
I was thinking the same thing, right?
If you have, like, what's the data model that's being trained, that it's using to train, right?
Because if you were to open it up to everything available on the internet and the deep internet, right?
There's going to be a lot of wrong information out there.
There's going to be a lot of right information out there.
You might have answers that don't apply.
Like, you're, Thomas, what you're talking about sounds like, okay, we're looking at the result and is the report written correctly,
based on the information it used, but are the citations the wrong citations that it used? Like, yes, you
created an answer out of what you pulled back, and that answer, based on what you pulled back, is written
well, but the data you pulled from is bad, right? And how do we gatekeep that? The other
thing I was thinking of, and I forget the name of IBM's one, the one that was on Jeopardy, right? I guess,
what was it, Watson? Was it Watson? Yeah, Watson. And I believe that Watson
model is very similar to what I would call the stack overflow model, where people say this is the
best answer, but then there are some other answers that may apply, right? And the devil is in the
details, as you might say, this is probably the most common condition, but then there are some
other ones that may or may not apply. And Watson, I remember when they were first talking about
Watson back when it was on Jeopardy, the idea was to use it in a hospital scenario. So a doctor
can put in all this stuff, and it comes back: 90% it's this.
But these are some other conditions, right?
So the doctor at least has a list to say, okay, let me check.
Let me go in order of what is most likely.
But maybe it is going to be this thing way down here.
And if we're only ever writing or returning an answer based on the top one and not considering the others, like how do you get that trust?
How do you do the gatekeeping of the data that's being pulled in?
How do you make sure the data that you're building from, as Andy is saying, is proper data, right?
because, again, most of the stuff
it's learning from is coming from people,
right, and people have opinions.
And it's not even just the opinion,
but there could be mass hallucinations among people,
right?
Everyone says, oh, this is the common thing,
this is how you solve it, right?
Someone else comes along,
just like we have breakthroughs in technology
to say, actually, if you make this little tweak,
it's going to work better, you know?
So how is that, I think,
is that where you were going with this, Andy, as well?
Because this was on my mind as this was being discussed.
Like, that seems like the harder part
to gain the trust on.
I mean, what we do is, and what you typically do is divide and conquer, right?
So basically you described it already very well.
It's, to some extent, it's two different things.
First, you need to make sure that you retrieve the right data.
And to ensure that this is working, we have dedicated, built dedicated tests which measure the retrieval quality.
So in this kind of tests, we have test sets in place.
That's basically, at the high level, the only approach in the whole machine learning community.
You need a test set, right?
Somebody needs to sit down and say, okay, for that particular answer, I would like to get these sources.
For these particular answers, I would like to get these sources.
And to make that reliable in our company, we actually worked with D1 and Support, such that
they give us, for the questions coming in from our customers, worked-out RFAs, for example,
and they said, okay, these have been the reliable sources we have been using to actually answer
that question.
And this kind of data source we are using to measure the retrieval quality and the retrieval
precision.
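A sketch of such a retrieval-quality test might look like the following; the question-to-source mapping and document IDs are invented for illustration, and `retrieve` stands in for whatever retrieval step the real system uses:

```python
# Sketch of a retrieval-quality test: each case maps a question to the
# documents that human experts considered the reliable sources for it.
# The IDs and questions below are made up for illustration only.

TEST_SET = [
    {"question": "Why is my host unreachable?",
     "expected_sources": {"kb-hosts-01", "kb-network-03"}},
    {"question": "How do I filter logs by status code?",
     "expected_sources": {"kb-dql-07"}},
]

def retrieval_precision_recall(retrieve, k: int = 5) -> tuple[float, float]:
    precisions, recalls = [], []
    for case in TEST_SET:
        retrieved = set(retrieve(case["question"], k=k))
        expected = case["expected_sources"]
        hits = retrieved & expected
        precisions.append(len(hits) / len(retrieved) if retrieved else 0.0)
        recalls.append(len(hits) / len(expected))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

# Example with a dummy retriever that always returns the same documents.
precision, recall = retrieval_precision_recall(lambda q, k: ["kb-hosts-01", "kb-misc-99"])
print(f"precision={precision:.2f} recall={recall:.2f}")
```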
But then now it comes to the LLM.
The LLM gets to see this, right, and the expected answer.
And now we want to make sure that the LLM, how we build the system prompt, how we build the guardrails and things like that, that it actually reads that stuff and, let's say, summarizes it in the way that we want.
It should not be too short.
It should not be too long.
It should take the important parts from there.
And if you think about this classical summarization task, for that task the LLM actually would not need
factual knowledge, right?
It would just need the capability of, I don't know,
of a linguistically very well-trained person,
which is perfectly trained in doing excerpts and text summarization.
It actually does not need to have the capability anymore
to look up stuff via Google or find something, or
to understand, at a very technical level,
what's there; it needs to be able to do this summarization task.
And to some extent, that's, let's say, a different task than what is needed nowadays
in all these agentic frameworks when it comes to reasoning and planning.
So here you're pulling out of the LLM more this kind of linguist expert, which needs
to do the summarization.
And if you prompt the judge properly, you can kind of turn it into this kind of linguistic person.
Do we end up,
will we maybe end up in a world
in thinking about
now again our problem domain,
observability.
If I think about observability,
we have many different
personas who can benefit
from observability to make better decisions.
On the one side, it's the developer.
They may be looking at the logs,
into the traces.
We have the tester
that looks at, you know,
maybe not as deep into
traces, but into other signals, into how the system scales, into some capacity planning.
Then we have the SRE team, we have the deployment team and so on.
Do you think we will end up, or does it make sense, to train different types of models for
these different types of roles we also currently have in our day-to-day life to really, and I'm
using the word again, digital twin, which I know obviously is not a new word, but is this
where we are ending is this where we're heading to that we're creating smaller language models or
smaller experts digital experts on a certain problem domain and then they can also talk with
each other and argue and like we humans do or that's at least let's say the trend where the
whole industry and community is going to that you have this kind of experts the question is how
you build those experts let's say two or three years ago
there were many, many people kind of fine-tuning these large language models to adhere to a task, to a domain, and things like that.
But now, with the advent of these reasoning models, where you give a model, let's say, a set of tools which is appropriate for its job, to build exactly that expert, right,
that's now another way to build those experts. You don't
necessarily fine-tune a model on a particular task. What is happening now with these
models, which are general purpose, built with high reasoning capabilities and, let's
say, planning which tools to use in which situation, is that you have a trained generic expert: when
he reads the manual, or let's say when he reads the manuals of five tools, he immediately becomes
an expert in using these tools. And that's how you nowadays build
these kinds of experts, and this is what this whole agentic AI is about: that you have, let's say,
a generic brain to which you easily can attach any kind of tool, and this generic
brain immediately reads the manual of each individual tool and a small manual of how they're best used together,
and then it immediately knows how to work with these tools.
And yes, there are also companies or leaders in that space saying that for,
let's say, narrowly focused agents, the brain which controls the tools
does not have to be the biggest; it's not necessary that this is the biggest brain.
It can also be a smaller brain, which is fine-tuned in how to use this constrained set of tools, right?
But from my point of view, it still needs to be shown whether this holds true with the rapid development of these general-purpose models.
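To illustrate the "generic brain that reads the manuals of its tools" idea, here is a small sketch; the tool names, manuals, and the stubbed planner call are all made up for illustration:

```python
# Sketch of the "generic brain plus tool manuals" idea: each tool is
# registered with a short description (its manual), and the model is asked
# to pick the right tool for a task. Everything here is illustrative.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    manual: str                      # the short "manual" the model reads
    run: Callable[[str], str]

TOOLS = [
    Tool("query_metrics", "Fetch time series for a metric and time range.",
         lambda arg: f"metrics for {arg}"),
    Tool("fetch_logs", "Return log lines matching a filter expression.",
         lambda arg: f"logs for {arg}"),
]

def build_planner_prompt(task: str) -> str:
    manuals = "\n".join(f"- {t.name}: {t.manual}" for t in TOOLS)
    return (f"Available tools:\n{manuals}\n\n"
            f"Task: {task}\nReply with the name of the single best tool.")

def call_planner_llm(prompt: str) -> str:
    """Stand-in for the reasoning model that plans which tool to use."""
    return "fetch_logs"

def run_task(task: str) -> str:
    chosen = call_planner_llm(build_planner_prompt(task))
    tool = next(t for t in TOOLS if t.name == chosen)
    return tool.run(task)

print(run_task("show me error logs for the payment service"))
```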
You know, I think this brings up another important question with the data, right?
If you, and I, you know, two observations on what you all just said.
First of all, Andy, I think one of the dangers of the idea that you brought in would be,
do we suddenly go back to siloed, you know, siloed AI, which is what we've been trying to get away from
within organizations with people, right?
People who are completely siloed in their area of expertise, and then there's not the cross-talk.
and that's one of the things, I think, observability
and the whole DevOps movement helped bring about
is getting people talking with each other.
So however that is, whether they're built silos,
but they communicate or it's built not too small.
We don't want to get too granular on the size of it,
but we don't want to get full co-pilot or chat GPT size
where it's got everything, right?
And going back to the idea of the data feeding it,
it definitely feels like,
I feel like it would be important for these tools to not only say this is what I'm built for,
but to have a list of all of its sources of data published along with it, right?
Because one of the challenges you can run into is if you have bad, whoever's in control of the sources of data controls what the output is, right?
And not to get political here for a moment, but if you follow anything that has been going on in this country,
we have organizations like the CDC where they're scrubbing all this data, and some
of our states, you know, in terms of vaccines and all this kind of stuff, where we have a coalition
of states that are sort of rebelling against that to keep true medical information alive versus
the government sources; but traditionally it was always the government source of data that was the
official one. So if you have people swapping in and out data sources for their agendas anywhere,
right, it could be anywhere. I'm not just picking on us. I'm just using that as an example.
That's where it can get, if it's too broad of a model, right? That's where it can get
into a danger zone, right?
And that's where you can have bad actors manipulating that data that's going into it.
Like, I don't see how this would happen, but can you see, like, you know, could there be a way
that AWS would want to position its solutions as the proper ones versus its competitors,
and can it manipulate things, right?
I'm sure all that stuff will come up.
So having those smaller models, but also publishing, making public, what the
data sources are, I think would be critical to this trust of:
is my answer a reliable answer?
If I can see the sources that are being used and say,
okay, these are non-political sources,
these are non-special interest sources,
whether it be for a company or for a government or anything,
because, again, if you just use the super big ones,
if you just went to generic public copilot,
you're going to have way too much data coming in.
So I really do like this idea of having some sort of smaller size,
but maybe not micro
I don't know if that makes sense
if this is stuff
that's
I mean in general
this question
where to cut systems
and put trust boundaries
it's like
the same thing
in software engineering
all over again
how do you
where do you put the system boundaries
where do you put your
module boundaries
your package boundaries
how do you structure
how do you structure the overall system, right? And yeah, I think what we will see in 10 years,
nobody knows today; that's my take. But I still trust the divide-and-
conquer approach. And when it comes to this, let's say, to this whole modularization and data and
talking to data and talking to each other.
And because, at the end, it's not yet decided between this in-context learning, so how you prompt
LLMs, and, in the long term, the full, let's say, need to always train,
the need-to-train-from-scratch thing.
And as far as I see the situation, this pre-training is more like
learning the general skills, but the factual knowledge you can always Google, right?
You can always Bing it. You will find it, and then you just need to interpret it.
And this is, as of now, still something which stands true, unless in three years we end up
with retrained models every day. So I don't know if they can make it. But other than that,
there is still this analogy that a child is developing. That's the training phase.
And then, if it's trained, it has Google and Bing and whatnot,
and it can pull in the facts.
But then it has to be trained how to deal with these facts.
And the question of how, so to say, your brain got polluted is how you interpret the facts.
That's then, I think, what you said, Brian.
If you, let's say, got trained in Austria to
be a farmer, just saying something, right, then you will interpret the facts differently
than if you were trained in, I don't know, San Francisco, being a, I don't know, software
developer, right? And so it's the data lineage, and how the training actually takes place,
and in which direction the large language model has been aligned to react, right?
These are the major training steps in all these large language models, where you have
first the language training, then you have this reinforcement learning with humans in the loop,
then you have, let's say, the next goal-adjustment trainings, so to say.
I also think that there is a danger with all these very big players where you don't
get the details, right?
They don't provide their data lineage, they don't provide all these fine-tuned models,
all these very narrow details of how they actually do their fine-tuning.
So I would agree that it would be very cool to have an approach
where you actually can download a model based on what background knowledge
you would like to have in that model, right?
You say, okay, I would like no biases in this direction.
I don't want a background in, I don't know, astrology, just saying something;
keep that out of my brain, right?
And make me
interpret the new data I receive
in this or in that way.
Yeah, I got
one more question. Sorry, Andy. I know you've probably got a lot of questions,
but a lot of stuff's been coming up in this conversation.
What about feedback loops for
input, right? So if
you get
some good answers from AI,
like, you know, thinking about it in tech,
right? I have a problem,
and my model
is going to suggest I do X, Y, and Z to correct my problem, right?
I go through it, some of that works, but I find I had to make a tweak,
and there's this one thing that was different from what the model gave me, right?
Now, how does that feedback go back into the model so the model can learn from that, right?
And how critical is that?
Because it seems like a lot of what the AI is learning is based on what exists,
but then as we implement things,
it's not getting those updates of what these changes are.
If somebody publishes a paper or pushes into something, yes, the AI can learn from that.
But how do we tackle the issue of it needs that live feedback?
It needs to see this work with a minor tweak in this case
because of this little other variable that the model can learn from and consider.
In my eyes, that can be considered at very different scales, right?
On the small scale, for example, for our generation of DQL queries from natural language,
you have the feedback button in the product, and if you give it a thumbs up, or if you give
it a thumbs down, you're asked what was wrong. And actually, under the hood, we have our
monitoring solution where we log all the prompts, everything which went wrong, everything
which went well, with the thumbs up and down. And then we actually look into the data,
into the generated DQL, and take the mistakes, because sometimes it just isn't able to generate it,
and then we see, okay, it missed a comma, it was not very sure about how to use the join command, just
saying something. And then we take this into account and place these improvements in our, so to say,
in-context learning. But as of now, this doesn't go back to our LLM provider; it still
makes the same mistakes, but we correct them via in-context learning.
The other thing is, if you go to an OpenAI API, you don't get these, let's say, enterprise
SLAs, you cannot opt out from prompt logging, so they will log everything.
And when you then use GitHub Copilot, for example, and you give it a thumbs down,
or you don't accept a code proposal or whatnot, then all this information goes back
to their service, and they will leverage it for the next training.
So, depending on your settings, whether you turn it off or not, the information, the feedback
you're giving when the LLM makes a failure, can enter the improvement
chain at various levels, really down to the next retraining, or the in-context learning as we
do it now for our product.
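A rough sketch of that feedback-to-in-context-learning loop, with invented names and an in-memory log standing in for a real monitoring backend, might look like this:

```python
# Sketch of turning user feedback into in-context corrections: thumbs-down
# cases are logged, reviewed, and the corrected pairs are prepended to the
# prompt as few-shot examples. Names, storage, and queries are illustrative.

import json
import time

FEEDBACK_LOG = []          # in a real system: the monitoring / logging backend
CORRECTED_EXAMPLES = []    # curated (question, corrected query) pairs

def log_feedback(question: str, generated: str, thumbs_up: bool, comment: str = ""):
    FEEDBACK_LOG.append({
        "ts": time.time(), "question": question,
        "generated": generated, "thumbs_up": thumbs_up, "comment": comment,
    })

def promote_correction(question: str, corrected_query: str):
    """After a human reviews a thumbs-down case, keep the fixed pair."""
    CORRECTED_EXAMPLES.append((question, corrected_query))

def build_prompt(question: str) -> str:
    """New prompts carry the latest corrected examples as few-shot guidance."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in CORRECTED_EXAMPLES[-5:])
    return f"{shots}\nQ: {question}\nA:"

log_feedback("count 500s per service", "fetch logs | sumarize count()",
             thumbs_up=False, comment="typo in summarize, missing filter")
promote_correction("count 500s per service",
                   "fetch logs | filter status == 500 | summarize count(), by: {service}")
print(build_prompt("count 404s per service"))
print(json.dumps(FEEDBACK_LOG[-1], indent=2))
```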
So the lesson there is: everyone, give feedback if it doesn't work, because if you're not giving that
feedback, you're not going to get it to improve. That's great, thanks. All right, Andy, I know you have a bunch
of questions.
Just a couple of quick thoughts, and I know we're getting to the end, but maybe some thoughts
also for a follow-up conversation, because, Thomas, I think we want to have you back. A couple of thoughts
that came up. The first thing is, the discussion earlier reminded me of a podcast we recorded,
the previous session with Pini Reznik, because he talked about
the transformation from cloud native to AI native.
And he said, where we are right now, defining what AI native is, is like back in 2014,
if you would have defined what cloud native is.
Because back in the days, there was a new technology, containers, orchestration, Kubernetes,
but people were just repackaging their apps and putting it on Kubernetes, but this is not cloud
native.
It wasn't cloud native, at least the way we see it today.
So there's still a lot of things.
I think we have to learn what cloud native or what AI native is.
I feel like if you're just trying to model AI based on the world as we know it right now
to help us, then maybe we miss the opportunity to really redefine everything because we don't
need to just model the world we live in right now digitally, but maybe there's now an easier
or a different way, right?
Because why do we still need a developer, an SRE, a performance tester?
So maybe we need to do completely different things with AI.
So that's one of the thoughts that I had.
And I had so many other thoughts, but I think this was the most dominant one, just trying to figure out.
Oh, yeah, the last one is right now, and this is true whether I have an AI or not.
If I want to use an AI, I need to know what question to ask.
If I'm a software engineer and I don't even know that I can ask for logs, because logs are where my critical details about problems are,
then how does an AI help me, right?
And I think this is also why we see, I guess,
the move towards these agentic systems
where I don't start prompting,
but I have basically a digital assistant next to me
that gives me proactive hints
on also things that I've never thought about
because I've never learned about these things.
It's then even autonomous agents at the end, right?
And I mean, from the very high level, it's a little bit like if you enter our product, right, and you have the health status on the front screen, you didn't even look, you didn't even ask whether there is something.
You just saw that there is something.
You already get informed.
And now with these large language models, we probably can transform this into something different
than a traffic light, red, green, yellow,
but give direct insights,
direct impact analysis more tailored to human senses,
I would argue, or to human communication channels.
Language is still one of our most prominent communication channels,
where you actually communicate logical steps, right,
and things like that.
So there is a bit of an opportunity if you bring the word autonomous to the agent.
Yeah, exactly.
And ideally the agent also knows who I am and what I typically do.
And then it doesn't bother me with data that is interesting, but doesn't help me right now.
And it's focusing on things that help me right now.
But I think the magic is I don't want to prompt.
I want to get information that helps me to make the next right decision in my day-to-day job.
Hey, Thomas, I know we are all running close on time because we all have other things to go to in the next couple of minutes.
What I would like to do is encourage everyone that is listening to this podcast: as always, there are links; on the one side you had a presentation at a Java conference earlier this year.
You had this Ask Me Anything session on LinkedIn.
All of the videos are there.
I can really encourage you to check this out to learn how we at Dynatrace are leveraging this technology,
what the architecture is, our testing considerations,
also the whole topic of how we are monitoring it
and seeing adoption and learning from this.
So this is why we could probably talk for a much longer time.
And Thomas, the world is not standing still.
As you said, a lot has changed over the last 30, 40, 50 years,
or even longer.
And I'm sure you may not look like 21 anymore.
And I know you're not,
but even in the next years,
you will still be around.
And I'm sure we want to have you back.
because you are much more into this topic
and we can learn from you.
So this is an open invitation to come back.
Thank you, Andy.
Yeah, absolutely.
I want to thank you also for being here.
I think this is our third AI-based podcast in a row, potentially,
although it's hard to keep track.
There's that one I missed.
But I think this is very, very interesting topics.
And there's what we hear about AI
in the news, and then there's what we're learning about the reality of it in these kinds of
conversations.
And I, you know, you could see the big gap, right?
If you take what you hear in the news, it's like, oh, yeah, I'll give you all the right
answers and it's good to go, right?
And then we have these conversations and it's like, well, yeah, it's really helpful and it's
making great strides, but there are these other factors, all these different things that
people are working on to make it more trustworthy, to make it more accurate, to find better
use cases for it, right? I think this agentic
stuff is really fantastic.
You know, those
kind of use cases versus,
again, create an image
of these two people,
right, that it's
kind of like the
novelty AI stuff versus
the meat and potatoes AI stuff.
So really appreciate you.
Hopefully we'll have an ask me more
session with you.
And I hope our listeners are getting
a lot out of this because it's
I'm learning a ton
right and I'm
I'm sure you are Andy
and hopefully our listeners are learning
a whole bunch from this. Also, really appreciate people
like you coming on, Thomas.
Thank you very much.
thank you
All right, until our next episode, everybody.
If you have any thoughts,
ideas for episodes, if you're finding this AI stuff
good, go ahead and send us an email
at pureperformance@dynatrace.com.
We'd love to hear your thoughts or ideas,
especially on these AI topics.
I think this is a really big space.
It does still fit in with performance to a degree,
but it's also everywhere now.
Yeah, that's it.
Thank you, everyone.
Bye-bye.
Thank you.
Bye.
