PurePerformance - How to test, optimize, and reduce hallucinations of AIs with Thomas Natschlaeger
Episode Date: October 13, 2025
While Artificial Intelligence seems to have just popped up when OpenAI brought ChatGPT to the consumer market, it has its roots in the middle of the 20th century. But what is it that all of a sudden made it into every conversation we seem to have? Thomas Natschlaeger, Principal Data Scientist at Dynatrace, who has been working in the AI and machine learning space for the past 30 years, gives us a brief historical overview and describes the critical evolutionary steps and compelling events that made the technology what it is today. Tune in and hear how AIs are trained, how they are optimized and, most importantly, how their outputs can be tested and validated! In our conversation we discuss current trends towards small language models that will help model digital twins of our existing roles, and how AIs are used to validate other AIs, much like we humans do when a senior engineer pair programs with a junior and thereby provides essential feedback on current accuracy and input to improve the outcome of future tasks.
Links we discussed:
LinkedIn profile of Thomas: https://www.linkedin.com/in/thomas-natschlaeger/
Ask Me Anything session on Davis CoPilot: https://www.linkedin.com/posts/grabnerandi_llm-copilot-activity-7373837743971393536-QgxV?utm_source=share&utm_medium=member_desktop&rcm=ACoAAABLhVQBbh8Jkn_K8din5tsQlMCpXRNzlKU
Voxxed Conference Talk: https://amsterdam.voxxeddays.com/talk/?id=39801
Attention Is All You Need paper: https://en.wikipedia.org/wiki/Attention_Is_All_You_Need
Transcript
It's time for pure performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello, everybody, and welcome to another episode of Pure Performance.
My name is Brian.
Wilson. And as always I have with me, my co-host, Andy Grabner, who's here with the Davis AI
figurine. And I just want to let people know, because this is audio only. Andy has not been
mocking my intro lately. I think at least the last two or three podcasts, he's been being a normal
human being. So thank you for that, Andy. And I think we can all have some gratitude.
But Andy's playing with toys today. Yeah, it looks like it, right? I don't know where,
when I picked it up, Davis AI. I think it was the last.
It was perform.
I think it was the perform in 2020 before COVID, the last one before COVID,
because the actress who played our Davis AI, she was also on site, if you remember.
And you could take selfies with her and then you could take a, what's it called, a bubblehead?
Yes, bubble hood.
Bobblehead, yes.
Now, the question is, why do I have this figure in my hand today?
Because you're feeling playful?
Maybe I feel playful, yeah.
But really, because last week, I had a chance to, actually I've been in Poland, and this is a true story, I've been in Poland, went to our office in Gdansk, and then in the evening, got back to my hotel, wanted to go to bed, but then I always make, not a mistake, in this case it was a good mistake, I opened up LinkedIn, and I saw that two of our colleagues, actually three of our colleagues, did an Ask Me Anything session, and these colleagues were.
Gabreda Hassan Birkenmann, Sophia Habip, who moderated the whole thing, and our guest from today,
Thomas Natschlaeger. And you talked about how we internally build the Dynatrace Davis CoPilot,
how it works internally, how we translate the prompts to DQL, how we train the model,
and also how we test these model changes and how we in general deal with testing.
And Brian, I thought because we both have a big history and passion for software quality,
I want to invite Thomas and learn about how we can make sure that the LLMs and whatever AI models we use
and also anybody uses out there, how we can make sure the results produce the right quality.
It's very important topic.
Especially as we move a lot more, everyone's moving more into AI.
And just like we have to trust what our observability is telling us and all that kind of stuff,
there's this trust factor with is what AI is telling me true, right?
And we've seen plenty of cases where, you know, I'm not saying in the tooling,
but in general, if you take a look at Copilot, Grok, some of these others,
sometimes they just come out with these crazy, crazy either hallucinize,
I was about to say hallucinization, hallucinations, or other things.
So, you know, it's interesting you said that.
I don't mean to jump on this one point right here,
but until you said that, I really hadn't thought about this idea of like,
can we trust what our AI is telling us?
So, yes, I think it's really good.
Please continue.
And so with this now, I want to officially welcome Thomas Natschlaeger, Principal Data Scientist at Dynatrace.
And if I look at his LinkedIn profile, it's really fascinating to see you spend a lot of
time at the Technical University in Graz, then at the Software Competence Center in Hagenberg.
And then you were a machine learning engineer at Blue Sky Weather, so I hope you
had a lot of blue-sky predictions, because especially the last summers were pretty nice
here, and now you're at Dynatrace.
But, Thomas, can you please introduce yourself to our audience, and then we'll jump into
the topic.
Yeah, thanks, Andy, for the nice intro.
Yeah, you enumerated all the places where I've been working.
You can guess that I'm not 21 anymore.
And actually, sometimes I make the joke that I have written my first neural network in C something like three decades ago.
So there is some experience in that area, and also my station at the university in
Graz was about computational neuroscience and how the brain works.
We tried to figure out some, let's say, theories about that.
And that as of today comes in kind of handy to understand how all these layers in the brain
work when you compare them to, let's say, all these convolutional neural networks and
deep neural networks and stuff like that.
So this early on knowledge in that area helps me to understand what's going on to some
degree in all these large language models.
Yes, and hopefully also, let's say, the efforts we took back then at the university
to simulate those brains and analyze that data to understand what it is doing, and also come
up with some ideas how to test hypotheses for these, back then they were called
spiking neural networks. So everything ties together here nicely,
and I think, let's dive into this topic of testing those beasts, so to say. Yeah, before actually
talking about testing, because this is the way our podcasts work, we always get an idea and a thought,
and then I want to talk a little bit more about this. You have been in this field for so many years.
I actually remember, it was in the early 2000s.
I did a, not a master thesis, but I did a, what's it called, before the master?
Am I stupid?
Bachelor?
Bachelor?
Thank you so much.
Because I went to high school.
That's why you didn't get a master thesis because you can't learn for bachelor.
Exactly.
Because I went to high school, then started working.
And afterwards, I did a bachelor degree while I was already working.
And one of the things we did back then, in the early 2000s, was
with somebody that you may also know, Ulrich Bodenhofer.
He was my lecturer, and for the bachelor thesis we worked on some
machine learning algorithms to predict and calculate how to best steer a kind of autonomous
driving car, and we were training the model with data on when to brake and things like that.
So that was like 20 something years ago.
And for most of us, and I think especially consumers of this technology, it feels like everything
started like really three years ago because it became tangible.
Can you explain from your perspective a little bit what has changed and why did we all
of a sudden see this, was this a big, was this just luck or was this, why all of a sudden
was this big change where the whole world is now not only talking about it, but also using
LLMs?
What has changed?
What changed?
I mean, it was really, let's say, an accumulation of, I don't know, events and long-term efforts, to be honest.
I mean, the whole neural network community has been working on this for a long time.
The first neural network was invented in 1950-something by McCulloch and Pitts.
And then later on, the groundwork for all these deep neural network stuff,
there was this fundamental paper by Rumelhart and Hinton and guys in 1986 or something like that.
And all of these papers actually were milestones and breakthroughs
at that point in time, but due to the lack of, you know, not having Nvidia back then, right?
It didn't make it into the public because all these great ideas actually demanded lots of computational
power.
And from my point of view, the major breakthrough was actually already earlier, let's say
in the late 2000s, when the first people started to think about reusing GPUs, not for games,
but for neural networks.
I remember a colleague of mine
at the university in Graz, when he
came to me and said,
look here, we are implementing variational autoencoders
on a graphics card.
And I said, oh, maybe a good idea.
And back then, everybody had to do it on their own,
you know, making their hands really dirty
with this really nasty, close to assembler code on graphics cards.
Lots of bugs in there.
Also great opportunity for testing.
Lots of bugs in there, in this very low-level code.
And then the, let's say, this kind of evolution then typically is coming in waves.
And sometimes there is a threshold and some emergent behavior pops out, right?
You know, it goes like this, like a baby when it's learning to walk, right?
For a long time it's crawling around, a few times it tries, and suddenly it walks, right?
It's not that it can first do two steps, then three and four; suddenly it can do all of them.
And that's the same in this evolution of neural network capabilities when everything comes
together, the hardware, the knowledge, the experience, and how to train them.
And then in the computer vision area, this thing exploded around 2010, basically.
And then people from the language modeling community,
which had mostly been using, I don't know, hidden Markov models and whatnot to model
language back then, at some point in time were looking also more and more into this neural
network area, right? And then, let's say starting at 2015 to 2018, one could already see that
these language models evolved over time, right? And then there was, I think it was 2018, there
was this one big paper, which is called Attention Is All You Need, where researchers
published the first transformer model,
and then from there it was like
only three years on,
and then, I mean,
on Hugging Face you could already have
many of those transformer models back then, and then
OpenAI just made business out of it,
so backed it
with big money and an easy-
to-use API,
and that's
where we are now.
It's really interesting, Andy. I never thought of that.
You know, it's stupid that I didn't know.
Like, there's no
reason I shouldn't have known that AI has been being worked on since, what did you say, the
1950s with the first neural network?
Yes.
But I think I fell into the same trap as everybody where it's like, you just think it's a recent
development.
But meanwhile, it's been talked about for forever, right?
Whether it's in science fiction or movies, well, I guess that's science fiction as well.
But, yeah, of course it was being worked on.
But it's not something, I think, any regular person thought.
Like, there is this tremendous impression that it just popped up out of nowhere.
and things don't pop up out of nowhere
and it's been people working
long and hard on this stuff
and I guess it really sounds like
when the technology finally caught up
to give it the powers
when it was able to finally
make that leap
so just interesting question Andy
thanks for asking that
because it makes me feel stupid
for not even thinking that it existed
but yeah of course it did
I have one
maybe one funny anecdote
because there is this
very well-known researcher
in Switzerland, Jürgen Schmidhuber,
and he is famously known to be, how to say, very exact when it comes down to which credit goes to whom, right? And he is always trying to find out what was the earliest citation of some new invention and things like that, right? And in his opinion, for all this kind of TensorFlow, PyTorch neural network machinery which does all the
learning and stochastic gradient descent, the real credit goes back to a student in Finland in
1970 or something like that, who, according to his opinion, invented for the first time
automated gradient descent for some autonomous control machinery or whatnot. So if you are
that serious, the whole journey already started
quite some time ago.
Yeah, and it's always good to look back.
The reason why I asked is because I was fortunate enough, back when I did my thesis,
to work on this topic. Even though, I've got to tell you, back then for me it was interesting,
but still it was so cumbersome to create and compile our models and then to train
them, and then the results were not that great.
I mean, we had limited time, obviously.
And then seeing what we do right now is just amazing.
But, and this is now where I want to kind of transition over, when I now use these large language models, I'm still sometimes frustrated, because I ask a question and I get an answer that makes sense.
I ask the same question again and then I get an answer that is completely not what I was expecting.
And then I'm wondering: why do you give me two different answers for a similar question, what has changed, how is this even possible?
And now kind of to how we are using LLMs, and with "we" I mean the observability space.
If you look at observability, I think every vendor now is basically saying: we have the data and we put our models on top, and then we get better insights and better answers to make sense out of the data, and you don't have to become an expert in analyzing all of this.
I would like to know from you, now a couple of months or maybe a year or two in on how we are applying this to our data,
what are kind of the lessons learned, what are still the challenges, and what do we do with testing to improve the situation?
So before I jump into this whole testing thing, because I will refer to that sometimes: basically we have what I would say we call skills in place. One is this more like a traditional, well, traditional since two years,
chatbot based on a large language model, and the other one is this translation of natural text
into our query language, right? So that caters, from my point of view,
to two distinct use cases. The one is more "get me a summary, get me an explanation,
an easy-to-read text" such that I can navigate, for example, our documentation, that I can navigate
other text. And the other one is more like code generation, you know, all these code generation tools.
And in our case, it's generation of our own invented query language.
So the output is structured, is let's say testable, actually, whether it's, for example,
syntactically correct, whether it's semantically correct, whether it's executable by our
database engine, by our data lake engine.
So from that perspective, when it comes to testing, there may be, or there are actually
then two different approaches there.
One is this kind of, let's say,
free text output,
which you want to check
and to see whether the output
of the system we built,
of our Davis CoPilot,
is giving what you expected, right?
It's not exactly,
as you said,
sometimes it does it a little bit longer,
a little bit more,
longer paragraphs, smaller paragraphs,
what not.
But as soon as it's
close to your expectations, it's fine.
But on the other hand, when you look at the DQL generation, it's a different thing.
When it forgets one comma, right, it's not executable.
So you may want to help the LLM there, in particular to get the syntax
right, and then also to get the semantics right.
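To make that idea concrete for readers, here is a minimal sketch of such a validation gate (in Python; the toy checks and command names below are purely illustrative and not the real DQL grammar or Dynatrace internals):

```python
# Toy validation gate for LLM-generated queries. The syntax check and the
# command list are illustrative stand-ins, not the real DQL grammar.

KNOWN_COMMANDS = {"fetch", "filter", "summarize", "sort", "limit"}

def syntactically_valid(query: str) -> bool:
    """Cheap structural checks: balanced parentheses and known pipeline stages."""
    if query.count("(") != query.count(")"):
        return False
    stages = [part.strip().split()[0] for part in query.split("|") if part.strip()]
    return all(stage in KNOWN_COMMANDS for stage in stages)

def executable(query: str) -> bool:
    """Placeholder for a dry run against the real query engine."""
    # A real system would submit the query in a validate-only mode and
    # surface engine errors back to the test or to the LLM for a retry.
    return True

def validate_generated_query(query: str) -> list[str]:
    """Return a list of problems; an empty list means the query passed."""
    problems = []
    if not syntactically_valid(query):
        problems.append("syntax error")
    elif not executable(query):
        problems.append("not executable")
    return problems

# A missing pipe stage or an unbalanced parenthesis would be flagged here.
print(validate_generated_query("fetch logs | filter status == 500 | summarize count()"))
```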
And then, also from what I have learned, there is this whole
life cycle of such a system which you build.
It's not only the LLM, right?
For example, our Davis CoPilot is what you typically call a retrieval-augmented
generation approach, where under the hood you use an LLM plus knowledge sources, knowledge
bases, right? So you ask the system a question, then you try to find
the relevant document text pieces in your knowledge base, you pull out these relevant
text chunks, and then you ask the large language model to give you a comprehensive summary
of what was found in the knowledge base.
That's this classical retrieval augmented generation.
And so what you need to test is not only the LLM, it's the whole system.
It's more like an integration test at the end.
Obviously, for each of the pieces, you can have a unit test, but overall, it's an integration
test there.
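For readers who want to picture the pipeline being described, here is a stripped-down sketch; the knowledge base, scoring, and model call are toy stand-ins, not the actual Davis CoPilot implementation. The point is that `retrieve` and `build_prompt` can each get unit tests, while `answer_question` is the integration-test surface:

```python
# Toy retrieval-augmented generation pipeline; every piece here is a
# simplified stand-in used only to illustrate the testing surfaces.

KNOWLEDGE_BASE = {
    "doc-1": "Sample documentation chunk about writing queries.",
    "doc-2": "Sample documentation chunk about alerting and problems.",
}

def retrieve(question: str, k: int = 1) -> list[str]:
    """Unit-testable step: return the k chunks sharing the most words with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        KNOWLEDGE_BASE.values(),
        key=lambda chunk: len(q_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question: str, chunks: list[str]) -> str:
    """Unit-testable step: ground the model in the retrieved text."""
    context = "\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def call_llm(prompt: str) -> str:
    """Stand-in for the real model call."""
    return "stubbed summary of: " + prompt.splitlines()[1]

def answer_question(question: str) -> str:
    """The whole chain -- this is what the integration test exercises end to end."""
    return call_llm(build_prompt(question, retrieve(question)))

print(answer_question("How do I write a query?"))
```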
And when you start out with this kind of
chatbot thing, it's a little bit different from what I was
used to when I built, let's say, weather forecasts, right?
Because you mentioned that I was a machine learning engineer at Blue Sky, where we basically
trained each day many, many hundreds and thousands of machine learning models to predict
locally precise temperature, I don't know, energy production, humidity and stuff like that.
So typically numerical values.
For these numerical values, you have well-established KPIs for the quality of a trained model:
root mean squared error, mean absolute error, R-squared, whatnot.
While for this text output thing, those kinds of measures had to be established over the last
two years, while the others have been around for the last 50 years.
And so the community also went through a rapid development process of coming up with various
ways of how you actually measure those KPIs, whether an answer the system is producing
is kind of close to what you were actually expecting.
And this measure of closeness has also evolved over time with the availability of the measurement
tools, I would argue.
As I mentioned, let's do a time travel again and let's go back to
2012, for example, where most of the language modeling was done with, let's say, purely
statistical models.
And back then, the measurement of whether a produced text and the desired text are
close together was based on measures like: you do tokenization and you count how many
tokens are equal, how large is the overlap of tokens, is the order somehow matching?
So, I don't know, say the output is "buy me some butter", for example, or whatnot,
and the word "butter" in the generated output is in the first place, while in the expected answer
it's in the last place.
Even though the same word is there, it's not a perfect match, right?
You try to come up with all kinds of measurements from a statistical point of view, taking all
these things into account, let's say, engineered KPIs.
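As a rough illustration of the kind of engineered, pre-LLM closeness measure described here, the following small sketch combines token overlap with a crude check on token order; real metrics in this family (BLEU, ROUGE, and friends) are considerably more refined:

```python
# Toy text-closeness score: token overlap plus a crude order check.
# Real metrics in this family (BLEU, ROUGE, etc.) are more sophisticated.

def token_closeness(expected: str, generated: str) -> float:
    exp = expected.lower().split()
    gen = generated.lower().split()
    if not exp or not gen:
        return 0.0

    # How many expected tokens show up in the generated text at all?
    shared = set(exp) & set(gen)
    overlap = len(shared) / len(set(exp))

    # Do shared tokens appear in roughly the same relative position?
    positions = [
        abs(exp.index(tok) / len(exp) - gen.index(tok) / len(gen))
        for tok in shared
    ]
    order_penalty = sum(positions) / len(positions) if positions else 1.0

    return overlap * (1.0 - order_penalty)

# "butter" appears in both, but in different places, so the score drops.
print(token_closeness("buy me some butter", "butter is what you should buy"))
```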
And then later on, as of now, the LLMs came alive, right?
They produce text now, but they are now also used as judges, to judge whether the
generated text is actually what you want as an output.
So it's now a very common technique to use, let's say, an LLM which is
more powerful than the LLM you are using to generate the output as a judge, to argue about
whether it's faithful, whether the output is relevant to the found documents. So that's
the question of whether it's faithful, to avoid hallucinations, right? Because if the answer is
completely off from what you retrieved from your knowledge base and it still is making something up,
then it's not the best, because why would I show you the facts if you still ignore them?
So yeah, exactly. That's about how you compute these KPIs nowadays,
mostly again with other LLMs, or with some, let's say, semantic closeness measures,
which are also used in the semantic search thing.
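A minimal sketch of the LLM-as-a-judge idea could look like this; `call_judge_llm` is a hypothetical stand-in for whatever stronger model is used as the judge, and the prompt wording is just an example, not the one used in any real product:

```python
# Sketch of an LLM-as-judge faithfulness check. call_judge_llm() is a
# hypothetical wrapper around the (stronger) model used as the judge.

JUDGE_PROMPT = """You are a strict reviewer.
Context retrieved from the knowledge base:
{context}

Answer produced by the system under test:
{answer}

Is every claim in the answer supported by the context?
Reply with exactly FAITHFUL or UNFAITHFUL, then one sentence of reasoning."""

def call_judge_llm(prompt: str) -> str:
    """Stand-in for a call to the judge model."""
    return "FAITHFUL - the answer only restates the context."

def is_faithful(context: str, answer: str) -> bool:
    verdict = call_judge_llm(JUDGE_PROMPT.format(context=context, answer=answer))
    return verdict.strip().upper().startswith("FAITHFUL")

print(is_faithful("Pipelines start with a fetch stage.",
                  "A pipeline starts with a fetch stage."))
```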
Quick question on this, because this sounds like, you know, to explain this simply, we are creating
digital twins of things we also have in human life.
We have a teacher who is more, let's say an expert, who is, you know, better trained
on a topic.
And then we have, you know, people that try to become experts.
And then it's basically: I create an output and then it gets checked by the expert,
like in pair programming.
I have a junior developer and I have a senior developer and they do pair programming.
And so with this, the kind of cross-check, I guess the only challenge is if we believe that these systems really always work.
But if the expert is actually not an expert, because the expert was trained on wrong data, then in the end, the whole system just doesn't really produce great output because it's, maybe I'm thinking this wrong, but it's also like in human life.
I was thinking the same thing, right?
If you have, like, what's the data model that's being trained, that it's using to train, right?
Because if you were to open it up to everything available on the internet and the deep internet, right?
There's going to be a lot of wrong information out there.
There's going to be a lot of right information out there.
You might have answers that don't apply.
Like, you're, Thomas, what you're talking about sounds like, okay, we're looking at the result and is the report written correctly,
based on the information it used, but are the citations the wrong citations that it used? Like, yes, you
created an answer out of what you pulled back, and that answer, based on what you pulled back, is written
well, but the data you pulled from is bad, right? And how do we gatekeep that? The other
thing I was thinking of, and I forget the name of IBM's one, the one that was on Jeopardy, right? I guess,
what was it, Watson? Was it Watson? Yeah, Watson. And I believe that Watson
model is very similar to what I would call the stack overflow model, where people say this is the
best answer, but then there are some other answers that may apply, right? And the devil is in the
details, as you might say, this is probably the most common condition, but then there are some
other ones that may or may not apply. And Watson, I remember when they were first talking about
Watson back when it was on Jeopardy, the idea was to use it in a hospital scenario. So a doctor
can put in all this stuff, and it comes back: 90% it's this.
But these are some other conditions, right?
So the doctor at least has a list to say, okay, let me check.
Let me go in order of what is most likely.
But maybe it is going to be this thing way down here.
And if we're only ever writing or returning an answer based on the top one and not considering the others, like how do you get that trust?
How do you do the gatekeeping of the data that's being pulled in?
How do you make sure the data that you're building from, as Andy is saying, is proper data, right?
because, again, most of the stuff
it's learning from is coming from people,
right, and people have opinions.
And it's not even just the opinion,
but there could be mass hallucinations among people,
right?
Everyone says, oh, this is the common thing,
this is how you solve it, right?
Someone else comes along,
just like we have breakthroughs in technology
to say, actually, if you make this little tweak,
it's going to work better, you know?
So how is that, I think,
is that where you were going with this, Andy, as well?
Because this was on my mind as this was being discussed.
Like, that seems like the harder part
to gain the trust on.
I mean, what we do is, and what you typically do is divide and conquer, right?
So basically you described it already very well.
It's, to some extent, it's two different things.
First, you need to make sure that you retrieve the right data.
And to ensure that this is working, we have dedicated, built dedicated tests which measure the retrieval quality.
So in this kind of tests, we have test sets in place.
That's basically, at the high level, the only approach in the whole machine learning community.
You need a test set, right?
Somebody needs to sit down and say, okay, for that particular answer, I would like to get these sources.
For these particular answers, I would like to get these sources.
And to make that reliable in our company, we actually worked with D1 and Support, such that
they give us, for the questions coming in from our customers, worked-out RFAs, for example,
and they said, okay, these have been the reliable sources we have been using to actually answer
that question.
And this kind of data source we are using to measure the retrieval quality and the retrieval
precision.
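A sketch of such a retrieval-quality test might look like the following; the question-to-source mapping and document IDs are invented for illustration, and `retrieve` stands in for whatever retrieval step the real system uses:

```python
# Sketch of a retrieval-quality test: each case maps a question to the
# documents that human experts considered the reliable sources for it.
# The IDs and questions below are made up for illustration only.

TEST_SET = [
    {"question": "Why is my host unreachable?",
     "expected_sources": {"kb-hosts-01", "kb-network-03"}},
    {"question": "How do I filter logs by status code?",
     "expected_sources": {"kb-dql-07"}},
]

def retrieval_precision_recall(retrieve, k: int = 5) -> tuple[float, float]:
    precisions, recalls = [], []
    for case in TEST_SET:
        retrieved = set(retrieve(case["question"], k=k))
        expected = case["expected_sources"]
        hits = retrieved & expected
        precisions.append(len(hits) / len(retrieved) if retrieved else 0.0)
        recalls.append(len(hits) / len(expected))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

# Example with a dummy retriever that always returns the same documents.
precision, recall = retrieval_precision_recall(lambda q, k: ["kb-hosts-01", "kb-misc-99"])
print(f"precision={precision:.2f} recall={recall:.2f}")
```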
But then now it comes to the LLM.
The LLM gets to see this, right, and the expected answer.
And now we want to make sure that the LLM, how we build the system prompt, how we build the guardrails and things like that, that it actually reads that stuff and, let's say, summarizes it in the way that we want.
It should not be too short.
It should not be too long.
It should take the important parts from there.
And if you think about this classical summarization task, for that task the LLM actually would not need
factual knowledge, right?
It would just need the capability of, I don't know,
of a linguistically very well-trained person,
which is perfectly trained in doing excerpts and text summarization.
It actually does not need to have the capability anymore
to look up stuff via Google or find something, or
to understand, at a very technical level,
what's there; it needs to be able to do this summarization task.
And to some extent, that's, let's say, a different task than what is needed nowadays
in all these agentic frameworks when it comes to reasoning and planning.
So here you're pulling out of the LLM more this kind of linguist expert, which needs
to do the summarization.
And if you prompt the judge properly, you can kind of turn it into this kind of linguistic person.
Do we end up,
will we maybe end up in a world
in thinking about
now again our problem domain,
observability.
If I think about observability,
we have many different
personas who can benefit
from observability to make better decisions.
On the one side, it's the developer.
They may be looking at the logs,
into the traces.
We have the tester
that looks at, you know,
maybe not as deep into
traces, but into other signals, into how the system scales, into some capacity planning.
Then we have the SRE team, we have the deployment team and so on.
Do you think we will end up, or does it make sense, to train different types of models for
these different types of roles we also currently have in our day-to-day life to really, and I'm
using the word again, digital twin, which I know obviously is not a new word, but is this
where we are ending is this where we're heading to that we're creating smaller language models or
smaller experts digital experts on a certain problem domain and then they can also talk with
each other and argue and like we humans do or that's at least let's say the trend where the
whole industry and community is going to that you have this kind of experts the question is how
you build those experts let's say two or three years ago
there were many, many people kind of fine-tuning these large language models to adhere to a task, to a domain, and things like that.
But now, with the advent of these reasoning models, where you give a model, let's say, a set of tools which is appropriate for its job, to build exactly that expert, right,
that's now another way to build those experts. You don't
necessarily fine-tune a model on a particular task. What is happening now with these
models, which are general purpose, built with high reasoning capabilities and, let's
say, planning which tools to use in which situation, is that you have a trained generic expert: when
he reads the manual, or let's say when he reads the manuals of five tools, he immediately becomes
an expert in using these tools. And that's how you nowadays build
these kinds of experts, and this is what this whole agentic AI is about: that you have, let's say,
a generic brain to which you easily can attach any kind of tool, and this generic
brain immediately reads the manual of each individual tool and a small manual of how they're best used together,
and then it immediately knows how to work with these tools.
And yes, there are also companies or leaders in that space saying that for,
let's say, narrowly focused agents, the brain which controls the tools
does not have to be the biggest; it's not necessary that this is the biggest brain.
It can also be a smaller brain, which is fine-tuned in how to use this constrained set of tools, right?
But from my point of view, it still needs to be shown whether this holds true with the rapid development of these general-purpose models.
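To illustrate the "generic brain that reads the manuals of its tools" idea, here is a small sketch; the tool names, manuals, and the stubbed planner call are all made up for illustration:

```python
# Sketch of the "generic brain plus tool manuals" idea: each tool is
# registered with a short description (its manual), and the model is asked
# to pick the right tool for a task. Everything here is illustrative.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    manual: str                      # the short "manual" the model reads
    run: Callable[[str], str]

TOOLS = [
    Tool("query_metrics", "Fetch time series for a metric and time range.",
         lambda arg: f"metrics for {arg}"),
    Tool("fetch_logs", "Return log lines matching a filter expression.",
         lambda arg: f"logs for {arg}"),
]

def build_planner_prompt(task: str) -> str:
    manuals = "\n".join(f"- {t.name}: {t.manual}" for t in TOOLS)
    return (f"Available tools:\n{manuals}\n\n"
            f"Task: {task}\nReply with the name of the single best tool.")

def call_planner_llm(prompt: str) -> str:
    """Stand-in for the reasoning model that plans which tool to use."""
    return "fetch_logs"

def run_task(task: str) -> str:
    chosen = call_planner_llm(build_planner_prompt(task))
    tool = next(t for t in TOOLS if t.name == chosen)
    return tool.run(task)

print(run_task("show me error logs for the payment service"))
```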
You know, I think this brings up another important question with the data, right?
If you, and I, you know, two observations on what you all just said.
First of all, Andy, I think one of the dangers of the idea that you brought in would be,
do we suddenly go back to siloed, you know, siloed AI, which is what we've been trying to get away from
within organizations with people, right?
People who are completely siloed in their area of expertise, and then there's not the cross-talk.
and that's one of the things, I think, observability
and the whole DevOps movement helped bring about
is getting people talking with each other.
So however that is, whether they're built silos,
but they communicate or it's built not too small.
We don't want to get too granular on the size of it,
but we don't want to get full co-pilot or chat GPT size
where it's got everything, right?
And going back to the idea of the data feeding it,
it definitely feels like,
I feel like it would be important for these tools to not only say this is what I'm built for,
but to have a list of all of its sources of data published along with it, right?
Because one of the challenges you can run into is if you have bad, whoever's in control of the sources of data controls what the output is, right?
And not to get political here for a moment, but if you follow anything that has been going on in this country,
we have organizations like the CDC where they're scrubbing all this data, and some
of our states, you know, in terms of vaccines and all this kind of stuff, where we have a coalition
of states that are sort of rebelling against that to keep true medical information alive versus
the government sources; but traditionally it was always the government source of data that was the
official one. So if you have people swapping in and out data sources for their agendas anywhere,
right, it could be anywhere. I'm not just picking on us. I'm just using that as an example.
That's where it can get, if it's too broad of a model, right? That's where it can get
into a danger zone, right?
And that's where you can have bad actors manipulating that data that's going into it.
Like, I don't see how this would happen, but can you see, like, you know, could there be a way
that AWS would want to position its solutions as the proper ones versus its competitors,
and can it manipulate things, right?
I'm sure all that stuff will come up.
So having those smaller models, but also publishing, making public, what the
data sources are, I think would be critical to this trust of:
is my answer a reliable answer?
If I can see the sources that are being used and say,
okay, these are non-political sources,
these are non-special interest sources,
whether it be for a company or for a government or anything,
because, again, if you just use the super big ones,
if you just went to generic public copilot,
you're going to have way too much data coming in.
So I really do like this idea of having some sort of smaller size,
but maybe not micro
I don't know if that makes sense
if this is stuff
that's
I mean in general
this question
where to cut systems
and put trust boundaries
it's like
the same thing
in software engineering
all over again
how do you
where do you put the system boundaries
where do you put your
module boundaries
your package boundaries
how do you structure
how do you structure the overall system, right? And yeah, I think what we will see in 10 years,
nobody knows today; that's my take. But I still trust the divide-and-
conquer approach. And when it comes to this, let's say, to this whole modularization and data and
talking to data and talking to each other.
And because, at the end, it's not yet decided between this in-context learning, so how you prompt
LLMs, and, in the long term, the full, let's say, need to always train,
the need-to-train-from-scratch thing.
And as far as I see the situation, this pre-training is more like
learning the general skills, but the factual knowledge you can always Google, right?
You can always Bing it. You will find it, and then you just need to interpret it.
And this is, as of now, still something which stands true, unless in three years we end up
with retrained models every day. So I don't know if they can make it. But other than that,
there is still this analogy that a child is developing. That's the training phase.
And then, if it's trained, it has Google and Bing and whatnot,
and it can pull in the facts.
But then it has to be trained how to deal with these facts.
And the question of how, so to say, your brain got polluted is how you interpret the facts.
That's then, I think, what you said, Brian.
If you, let's say, got trained in Austria to
be a farmer, just saying something, right, then you will interpret the facts differently
than if you were trained in, I don't know, San Francisco, being a, I don't know, software
developer, right? And so it's the data lineage, and how the training actually takes place,
and in which direction the large language model has been aligned to react, right?
These are the major training steps in all these large language models, where you have
first the language training, then you have this reinforcement learning with humans in the loop,
then you have, let's say, the next goal-adjustment trainings, so to say.
I also think that there is a danger with all these very big players where you don't
get the details, right?
They don't provide their data lineage, they don't provide all these fine-tuned models,
all these very narrow details of how they actually do their fine-tuning.
So I would agree that it would be very cool to have an approach
where you actually can download a model based on what background knowledge
you would like to have in that model, right?
You say, okay, I would like no biases in this direction.
I don't want a background in, I don't know, astrology, just saying something;
keep that out of my brain, right?
And make me
interpret the new data I receive
in this or in that way.
Yeah, I got
one more question. Sorry, Andy. I know you've probably got a lot of questions,
but a lot of stuff's been coming up in this conversation.
What about feedback loops for
input, right? So if
you get
some good answers from AI,
like, you know, thinking about it in tech,
right? I have a problem,
and my model
is going to suggest I do X, Y, and Z to correct my problem, right?
I go through it, some of that works, but I find I had to make a tweak,
and there's this one thing that was different from what the model gave me, right?
Now, how does that feedback go back into the model so the model can learn from that, right?
And how critical is that?
Because it seems like a lot of what the AI is learning is based on what exists,
but then as we implement things,
it's not getting those updates of what these changes are.
If somebody publishes a paper or pushes into something, yes, the AI can learn from that.
But how do we tackle the issue of it needs that live feedback?
It needs to see this work with a minor tweak in this case
because of this little other variable that the model can learn from and consider.
In my eyes, that can be considered at very different scales, right?
On the small scale, for example, for our generation of DQL queries from natural language,
you have the feedback button in the product, and if you give it a thumbs up, or if you give
it a thumbs down, you're asked what was wrong. And actually, under the hood, we have our
monitoring solution where we log all the prompts, everything which went wrong, everything
which went well, with the thumbs up and down. And then we actually look into the data,
into the generated DQL, and take the mistakes, because sometimes it just isn't able to generate it,
and then we see, okay, it missed a comma, it was not very sure about how to use the join command, just
saying something. And then we take this into account and place these improvements in our, so to say,
in-context learning. But as of now, this doesn't go back to our LLM provider; it still
makes the same mistakes, but we correct them via in-context learning.
The other thing is, if you go to an OpenAI API, you don't get these, let's say, enterprise
SLAs, you cannot opt out from prompt logging, so they will log everything.
And when you then use GitHub Copilot, for example, and you give it a thumbs down,
or you don't accept a code proposal or whatnot, then all this information goes back
to their service, and they will leverage it for the next training.
So, depending on your settings, whether you turn it off or not, the information, the feedback
you're giving when the LLM makes a failure, can enter the improvement
chain at various levels, really down to the next retraining, or the in-context learning as we
do it now for our product.
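A rough sketch of that feedback-to-in-context-learning loop, with invented names and an in-memory log standing in for a real monitoring backend, might look like this:

```python
# Sketch of turning user feedback into in-context corrections: thumbs-down
# cases are logged, reviewed, and the corrected pairs are prepended to the
# prompt as few-shot examples. Names, storage, and queries are illustrative.

import json
import time

FEEDBACK_LOG = []          # in a real system: the monitoring / logging backend
CORRECTED_EXAMPLES = []    # curated (question, corrected query) pairs

def log_feedback(question: str, generated: str, thumbs_up: bool, comment: str = ""):
    FEEDBACK_LOG.append({
        "ts": time.time(), "question": question,
        "generated": generated, "thumbs_up": thumbs_up, "comment": comment,
    })

def promote_correction(question: str, corrected_query: str):
    """After a human reviews a thumbs-down case, keep the fixed pair."""
    CORRECTED_EXAMPLES.append((question, corrected_query))

def build_prompt(question: str) -> str:
    """New prompts carry the latest corrected examples as few-shot guidance."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in CORRECTED_EXAMPLES[-5:])
    return f"{shots}\nQ: {question}\nA:"

log_feedback("count 500s per service", "fetch logs | sumarize count()",
             thumbs_up=False, comment="typo in summarize, missing filter")
promote_correction("count 500s per service",
                   "fetch logs | filter status == 500 | summarize count(), by: {service}")
print(build_prompt("count 404s per service"))
print(json.dumps(FEEDBACK_LOG[-1], indent=2))
```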
So the lesson there is: everyone, give feedback if it doesn't work, because if you're not giving that
feedback, you're not going to get it to improve. That's great, thanks. All right, Andy, I know you have a bunch
of questions.
Just a couple of quick thoughts, and I know we're getting to the end, but maybe some thoughts
also for a follow-up conversation, because, Thomas, I think we want to have you back. A couple of thoughts
that came up. The first thing is, the discussion earlier reminded me of a podcast we recorded,
the previous session with Pini Reznik, because he talked about
the transformation from cloud native to AI native.
And he said, where we are right now, defining what AI native is, is like back in 2014,
if you would have defined what cloud native is.
Because back in the days, there was a new technology, containers, orchestration, Kubernetes,
but people were just repackaging their apps and putting it on Kubernetes, but this is not cloud
native.
It wasn't cloud native, at least the way we see it today.
So there's still a lot of things.
I think we have to learn what cloud native or what AI native is.
I feel like if you're just trying to model AI based on the world as we know it right now
to help us, then maybe we miss the opportunity to really redefine everything because we don't
need to just model the world we live in right now digitally, but maybe there's now an easier
or a different way, right?
Because why do we still need a developer, an SRE, a performance tester?
So maybe we need to do completely different things with AI.
So that's one of the thoughts that I had.
And I had so many other thoughts, but I think this was the most dominant one, just trying to figure out.
Oh, yeah, the last one is right now, and this is true whether I have an AI or not.
If I want to use an AI, I need to know what question to ask.
If I'm a software engineer and I don't even know that I can ask for logs, because logs are where my critical details about problems are,
then how does an AI help me, right?
And I think this is also why we see, I guess,
the move towards these agentic systems
where I don't start prompting,
but I have basically a digital assistant next to me
that gives me proactive hints
on also things that I've never thought about
because I've never learned about these things.
It's then even autonomous agents at the end, right?
And I mean, from the very high level, it's a little bit like if you enter our product, right, and you have the health status on the front screen, you didn't even look, you didn't even ask whether there is something.
You just saw that there is something.
You already get informed.
And now with these large language models, we probably can transform this into something different
than a traffic light, red, green, yellow,
but give direct insights,
direct impact analysis more tailored to human senses,
I would argue, or to human communication channels.
Language is still one of our most prominent communication channels,
where you actually communicate logical steps, right,
and things like that.
So there is a bit of an opportunity if you bring the word autonomous to the agent.
Yeah, exactly.
And ideally the agent also knows who I am and what I typically do.
And then it doesn't bother me with data that is interesting, but doesn't help me right now.
And it's focusing on things that help me right now.
But I think the magic is I don't want to prompt.
I want to get information that helps me to make the next right decision in my day-to-day job.
Hey, Thomas, I know we are all running close on time because we all have other things to go to in the next couple of minutes.
What I would like to do is encourage everyone that is listening to this podcast: as always, there are links; on the one side you had a presentation at a Java conference earlier this year.
You had this Ask Me Anything session on LinkedIn.
All of the videos are there.
I can really encourage you to check this out to learn how we at Dynatrace are leveraging this technology,
what the architecture is, our testing considerations,
also the whole topic of how we are monitoring it
and seeing adoption and learning from this.
So this is why we could probably talk for a much longer time.
And Thomas, the world is not standing still.
As you said, a lot has changed over the last 30, 40, 50 years,
or even longer.
And I'm sure you may not look like 21 anymore.
And I know you're not,
but even in the next years,
you will still be around.
And I'm sure we want to have you back.
because you are much more into this topic
and we can learn from you.
So this is an open invitation to come back.
Thank you, Andy.
Yeah, absolutely.
I want to thank you also for being here.
I think this is our third AI-based podcast in a row, potentially,
although it's hard to keep track.
There's that one I missed.
But I think this is very, very interesting topics.
And there's what we hear about AI
in the news, and then there's what we're learning about the reality of it in these kinds of
conversations.
And I, you know, you could see the big gap, right?
If you take what you hear in the news, it's like, oh, yeah, I'll give you all the right
answers and it's good to go, right?
And then we have these conversations and it's like, well, yeah, it's really helpful and it's
making great strides, but there are these other factors, all these different things that
people are working on to make it more trustworthy, to make it more accurate, to find better
use cases for it, right? I think this agentic
stuff is really fantastic.
You know, those
kind of use cases versus,
again, create an image
of these two people,
right, that it's
kind of like the
novelty AI stuff versus
the meat and potatoes AI stuff.
So really appreciate you.
Hopefully we'll have an ask me more
session with you.
And I hope our listeners are getting
a lot out of this because it's
I'm learning a ton
right and I'm
I'm sure you are Andy
and hopefully our listeners are learning
a whole bunch from this. Also, really appreciate people
like you coming on, Thomas.
Thank you very much.
thank you
All right, until our next episode, everybody.
If you have any thoughts,
ideas for episodes, if you're finding this AI stuff
good, go ahead and send us an email
at pureperformance@dynatrace.com.
We'd love to hear your thoughts or ideas,
especially on these AI topics.
I think this is a really big space.
It does still fit in with performance to a degree,
but it's also everywhere now.
Yeah, that's it.
Thank you, everyone.
Bye-bye.
Thank you.
Bye.
