PurePerformance - OpenLLMetry - Observing the Quality of LLMs with Nir Gazit
Episode Date: January 15, 2024
It's only been a year since ChatGPT was introduced. Since then we have seen LLMs (Large Language Models) and generative AI being integrated into everyday software applications. Developers face the hard choice of picking the right model for their use case to produce the quality of output their end users demand. Tune in to this session where we have Nir Gazit, CEO and Co-founder of Traceloop, educating us about how to observe and quantify the quality of LLMs. Besides performance and cost, engineers need to look into quality attributes such as accuracy, readability, or grammatical correctness. Nir introduces us to OpenLLMetry, a set of open source extensions built on top of OpenTelemetry that provides automated observability into the usage of LLMs, so developers can better understand how to optimize that usage. His advice to every developer: start measuring the quality of your LLMs on day 1 and continuously evaluate as you change your model, your prompt, and the way you interact with your LLM stack!
If you have more questions about LLM observability, check out the following links:
OpenLLMetry GitHub Page: https://github.com/traceloop/openllmetry
Traceloop Website: https://www.traceloop.com/
OpenLLMetry Documentation: https://traceloop.com/docs/openllmetry
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance!
I still have all of my strength to bite through every top.
No, I don't know.
That looked like a shark's tooth, by the way, when you showed it to me.
That might have been the camera on the thing,
but it looked like a really, really large tooth,
which is why I was like, oh, what kind of tooth is that?
But it must have just been the optics on the camera.
So let all of our listeners know we're coming to you from the future,
because right now it's 2024, but it's not really.
I think this is our second-to-last recording of the year.
It is.
Although you're hearing it in the future, everybody,
we hope you all had a good end of year.
Well, I guess we can say that, right?
We can say we hope you had a good end of year,
because they're listening to it afterwards.
See, I'm answering all my own questions, Andy,
by babbling and rambling.
So I'm going to shut up now and let you do your magic.
I will do my magic.
No, thank you so much, Brian.
You're right.
It was another great year. I was using a lot of language.
I know.
It was another great year of podcasting.
And podcasting is only possible
if we actually have great guests.
Great guests that bring new topics
to our communities, to our listeners, and also to us, because we always say we are
the ones that benefit the most, because we listen to every episode and we learn in every episode.
And today we are having a topic that is extremely popular, at least at the end of 2023, and I'm pretty sure it will also be very popular
at the beginning of 2024,
which is we talk about observability
of LLMs, large language models.
And we couldn't have picked a better guest
than Nir Gazit.
I hope I pronounced the name at least roughly correct.
CEO and co-founder of Traceloop.
Nir, do me a favor,
can you quickly introduce yourself to the audience, who you are, what you do, why you think
what you do as a company is important in this day and age? Yeah, so I'm Nir, I'm CEO of Traceloop,
as you said, and we're building a platform for evaluating
and monitoring the quality of LLM applications.
So looking at the outputs you get from LLMs
and telling you whether they're good or bad
and alerting you when they're bad.
And Nir, you mentioned a good word now
because you didn't say performance,
you didn't say resiliency, you didn't say
some other term. You said quality of the output of LLMs.
And that's obviously a big topic. Brian and I, we have a big background
in performance testing where you test the quality of your application under load
if it still produces the same results that you expect from a performance perspective as well.
Can you define, especially for the people that just know LLMs in general, maybe they've just explored it with ChatGPT, can you define how you measure quality and what quality means in terms of LLMs?
That's a good question. I think it's a complex answer. I don't think there's a one-line answer for that
because if you go back to traditional machine learning models
and deep networks,
then you used to have really simple definitions of quality.
You measure precision, you measure recall,
so you know, let's say you're building a model
for classifying images,
whether you have cats or dogs in the image.
So it's really easy when you have a model
that is trained to recognize cats,
just count how many cats you recognize
out of pictures of cats
and how many cats you recognize out of pictures of dogs.
And there you get the precision and recall, right?
And then when you come to the generative AI landscape in general,
by the way, not just large language models, but also for images, then the question of good becomes
really complex. If you take a really long text and ask a GPT to summarize it, what is a good summary?
Even if you ask someone, is it a good summary, is this summary better than the other one, you'll get a mixture of answers. Some people like this one more, some people like that one more. So the answer to what good quality text is, is highly context dependent, and it's not just one answer like this is good, this is bad.
But isn't that then a big challenge? Because if you just look at the three of us, if you say something, then Brian may think differently about what you just said than I do. How can we then automate measuring the quality, or kind of validate the quality, or give it a metric, like a quality indicator? How can we give a quality indicator to the output of an LLM in an automated way if we have so many different opinions, I guess? I mean, how does this work?
So what you usually do is, first, you realize that there's not one metric that can measure everything; there are multiple metrics, and some of them might depend on the specific task that you're trying to score.
And then secondly, most metrics won't be absolute,
but rather will be like relative.
So you can say, okay, this summary,
we'll go back to the summary example.
This summary is better than the other one
because it contains this topic,
which is part of the original text
and the other summary didn't contain that topic.
And this topic is important,
so it should be part of the summary.
So this is one aspect you can think of, of how to compare different text summaries.
But other ways would be like, I don't know, is it grammatically correct? Is it more correct, like higher language than the other one?
Or is it repetitive or not?
How repetitive is it?
Like GPT has this tendency of being overly repetitive.
And I guess this is how it was trained.
So like how repetitive is the text that you got?
And then you get these 10, 20 sets of metrics, and this is how you can compare a text X against a text Y. And then, you know, if you see something weird, something jumps out in the metrics, you know that something is wrong.
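As an illustration of the kind of automated comparison Nir is describing, here is a minimal Python sketch that scores two candidate texts on a few fully automatable metrics (repetitiveness via a distinct n-gram ratio, keyword coverage, length). The metric choices and the keyword list are illustrative assumptions, not what Traceloop actually computes.

```python
# Minimal sketch: a few fully automatable quality metrics for comparing two
# candidate texts (e.g. two summaries of the same document).
# Metric choices and the keyword list are illustrative assumptions.

def distinct_ngram_ratio(text: str, n: int = 3) -> float:
    """Higher = less repetitive: ratio of unique n-grams to total n-grams."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 1.0

def keyword_coverage(text: str, keywords: list[str]) -> float:
    """Fraction of expert-defined keywords that appear in the text."""
    lowered = text.lower()
    return sum(kw.lower() in lowered for kw in keywords) / len(keywords)

def score(text: str, keywords: list[str]) -> dict:
    return {
        "distinct_3gram_ratio": distinct_ngram_ratio(text),
        "keyword_coverage": keyword_coverage(text, keywords),
        "length_words": len(text.split()),
    }

# Compare text X against text Y on the same metric set.
keywords = ["observability", "OpenTelemetry", "quality"]  # hypothetical domain keywords
text_x = "OpenLLMetry adds observability to LLM calls using OpenTelemetry."
text_y = "It adds tracing. It adds tracing. It adds tracing to the LLM calls."
print(score(text_x, keywords))
print(score(text_y, keywords))
```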
You suddenly have a lot of power in your hands if you think about that, right? Because let's just take even the grammar side of it,
obviously you have grammar rules,
but if you look at things written in proper grammar,
a lot of times they may be harder for the regular people to read,
or it's not quite the way regular people might read or write these days.
So number one, you have to think,
is this topic good for proper grammar?
Because if it was maybe a bunch of teens or whatever, and I know I'm an old man talking about teens, right?
But if it was a bunch of teenagers or something, you might not be doing the absolute perfect grammar because they'd be like, this is boring to read.
Or if you think about even 18th century British literature with all its crazy language and it's just a horrible, horrible read. But at the same time, if you're enforcing some of those things,
you're also guiding culture into a way of,
are we going to go more and more casual?
Are we going to try to push more and more?
I don't know.
Just that statement you were saying just opened up the choices you make and the outputs, not just you, but whoever's working on all these things,
and the outputs that get introduced to everybody is going to start influencing culture in some way.
The more and more these things are being used.
And maybe I'm overblowing it a little, but it just dawned on me, yeah, this has to be done.
And you're all picking what that best is going to be.
So that's kind of cool.
I don't know if you thought about it in that way yet, but you know,
maybe you're going to be changing,
you know,
people are going to start using more words and more variation or whatever,
right?
You could go either way with that.
Let me ask you,
let me ask you a question on,
on this,
because I took a couple of notes and you said,
and the quality can be measured by,
for instance,
is it grammatically correct?
Is it repetitive? Is it too repetitive?
Is it accurate? Does it contain certain keywords?
But does this mean, I mean, for grammar,
I guess you can take the output and then automatically
validate if it's grammatically correct
because we have tools that can check this.
Repetitiveness, we can also measure it, right?
How often does a certain word repeat?
Now, accuracy is, I guess, hard because how do you measure the accuracy?
Because you need somebody that actually knows the subject matter and what you expect.
And then also the other one, does it contain certain keywords?
This also means you need to have a domain expert that says, hey, this needs to be in
there.
So my question to you is, what of these aspects
from quality can be
quantified and measured automatically?
And where do we still
potentially need a human being?
I think if you
look, I was
about to say that if you look at the content,
then you probably need some human being
at least to create
some guardrails,
to define the rules that the output text needs to comply with.
But even if you just look at the structure,
we were talking about an AI that generates complete content.
I don't know, you use AI to write an email.
But there's a lot of use cases where you use AI to, let's say, extract some specific, I don't know, subjects or specific words from a really long text, or you use AI to classify a really long text, and then the output is really strict, like you expect the output to be just one word. And then, you know, the whole world of what's a good answer, what's a bad answer becomes completely different.
But even for, if we go back to the text examples,
then it still holds that you need a lot of guidance from humans,
from probably the developers or data scientists working on the project to tell you what do they expect the text to be or to contain.
Even for the grammar, I don't know,
Brian, you talked about us guiding higher or lower grammar
according to the metric that we define,
but actually the person developing the application
and using us, for example, as a monitoring tool
will be the one that actually defines
whether they want a good grammatical,
like a high grammatical text or a low one.
Like, for example, if I'm writing some contract,
then probably the language will have to be super...
Legally used, yeah.
Yeah, legally high words, you know.
And if I'm writing a Reddit post, then I want to lower it down.
I want to make it dumber and, I don't know, dumber, but simpler to understand.
You just offended the whole Reddit community.
Thank you so much for that.
We're going to get emails on this now, Andy.
Yeah, probably.
So, okay, coming back to what you do actually and what motivates you and what drove you to, I guess, you know, start Traceloop.
That means you actually provide a framework,
a tool that allows developers
that are integrating LLMs into their applications
to really measure the quality of the LLMs that they're using.
Is this an accurate definition, description of what you guys do?
Yeah.
The first thing we did, we wanted to measure the quality of the output.
So we need to start collecting those outputs and start collecting how our customers, our
users are using LLMs.
So we needed to build an SDK that can collect these
metrics. And so we have a lot of experience with OpenTelemetry. So it made total sense to just
use the same foundation, the same protocol and extend it to support instrumenting calls to OpenAI,
calls to vector databases, and even frameworks like
LangChain and LlamaIndex and others. So that means what you're providing is some instrumentation for
these, I don't know, handful of SDKs that I would use as a developer to develop against an LLM,
when you mentioned OpenAI, then I can just say I want to enable your tool and then I automatically
get OpenTelemetry, I guess, traces, metrics, what type of data do you expose?
What type of signals of observability?
Is it metrics, traces or logs as well?
So right now it's just traces.
This is the first thing we did.
We want to start sending metrics as well in the next month or two.
And then also logs.
We basically want to cover everything that's in the OpenTelemetry standard.
And do you see...
So when I'm a developer and I want to integrate an LLM,
where and when is the right time to actually look at the quality?
Is this part of my testing kind of process?
Or is this something that I need to do in production
because I can only truly validate the LLM
when it's under real load and real users are using it?
Or do you see more people using it up front
in the test environment?
So we actually see, we were surprised.
We thought that people would want to use it
once they're getting closer to production.
So they want to start monitoring and running it in scale and then looking at outputs.
But we were actually surprised to see many users using it really early on.
This is one of the first tools that they adopt, just installing open...
We haven't talked about the name of the tool, which is OpenLLMetry, the open source.
So this is one of the first things that they installed because it's just you install the SDK
and that's it. You automatically understand, it automatically figures out the calls you made to
OpenAI and monkey patches everything on Python or TypeScript. And then you just get logs of all the
prompts and everything. And so, for example, we've seen users using LlamaIndex, which people use, for example, for building RAG pipelines
where you have like a vector database, you get some prompt
and then you send it to OpenAI to get a response
based on the data you got from the vector database.
And so we've seen users just want to see the prompts and responses that LlamaIndex builds
for them behind the scenes because they can't see it in the code.
So they just, during development, they want to see the prompts, they want to see the responses,
they want to understand why the model is behaving like it behaves.
So just installing OpenLLMetry and getting these traces gives them a lot of, you know, visibility into what's happening in their own program.
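To make that concrete, here is a minimal sketch of what "just installing the SDK" looks like in Python, based on the OpenLLMetry docs; the exact init parameters and decorator names may differ between SDK versions, and the app name and prompt are made up.

```python
# pip install traceloop-sdk openai
# Minimal sketch of auto-instrumenting OpenAI calls with OpenLLMetry.
# Exact init parameters and decorator names may differ between SDK versions.
from openai import OpenAI
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow

# Monkey-patches supported libraries (OpenAI, vector DBs, LangChain, LlamaIndex, ...)
# so that every call is emitted as an OpenTelemetry span.
Traceloop.init(app_name="podcast-summarizer")

client = OpenAI()  # reads OPENAI_API_KEY from the environment

@workflow(name="summarize")
def summarize(text: str) -> str:
    # The prompt, the response, latency, and token usage all end up on the trace.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Summarize this text:\n{text}"}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(summarize("OpenLLMetry is a set of OpenTelemetry extensions for LLM apps."))
```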
Brian, this kind of reminds me a little bit of, if you remember the days when we talked about Hibernate and other mapping tools, frameworks, where there's always a black box. It magically works, or it magically worked,
those entity relationship mappers like Hibernate.
And then we all of a sudden started to get distributed traces,
and we saw what is Hibernate actually doing
when you're requesting a certain type of data.
And then you saw, wow, this is really inefficient.
It's collecting too much data, making too many round trips. And I guess this is the same as what you're saying.
We are dealing now with a new technology. We are dealing with SDKs
that we can make a call to and we get something back, but we don't really know what's happening
inside. And with OpenLMetry,
did I pronounce it correctly? OpenLMetry.
It's a tongue breaker,
you actually get the visibility
in what's happening inside the model
or inside that SDK.
And with this, developers can better understand
that technology and then hopefully also better use it.
Is this a good assumption of what people are doing?
Yeah, exactly.
During development, just seeing what's happening
and being able to debug it
and then later on, once it reaches
production, being able to
continuously monitor the quality
of how your users are using
your already built application.
It's funny, when you described
the two, you described it that way. I think
for a lot of people, all this
AI stuff, especially the
chat GPTs, seems like this magical
mysterious thing that you know sometimes when we're even talking to co-workers and we all work
in it they're like oh it's just amazing it just does this crazy stuff but behind it is really just
all code traces same thing as everything else right uh and getting that visibility into it
just yanks that curtain down right because
suddenly if you're looking at that and i'm seeing how it's being interacted with it's no longer this
oh this you know the terminator is coming or whatever it's just code operating right
um advanced code right but it uh it's pulling the mystery out of it which i think is good because a
lot of people are probably spooked in either way, either really into it or really scared of it.
And I was like, no, it's just code.
And I'm sure Andy will find an N plus one problem in there at some point.
Well, he was talking about Hibernate.
It was interesting.
We saw from database to microservice, N plus one is one of the most common problems.
And we're sure it'll hit there at some point.
So it'll be a common problem in that.
Nir, I have a question, and this might be very basic,
but I assume many of our listeners
have the same misunderstanding or not knowing.
When I am an application developer
and I want to use an LLM, let's say I want to provide an easier natural language interface
to my software.
So I guess I have two options.
I can develop against a publicly available SaaS-hosted version.
I don't know, I think Microsoft provides OpenAI or something.
And then I assume I can also host my own model and train my own model.
These are my two options, right, that I have.
Yeah.
So what type of visibility or observability do I get with OpenLM?
OpenLMetry.
I really need to practice pronouncing it.
Maybe you need to change the name.
I don't know.
Or it's like you can get
people on stage and the one that can say
the best or ten times in a row
without messing up.
It becomes a drinking game, exactly.
Can you say it right every time you say it wrong?
You got to take a drink?
That's viral marketing there.
Yeah.
So my question is, how much do I learn as a developer
from the internals
of the LLM I'm calling when
I'm developing against a SaaS
offering versus something
where I host the model myself?
If you just use a model
as an API,
then you'll get the
same things.
Like basically the observability you get when looking at calls to a model like OpenAI or
your own self-hosted model is you can see the call, you can see the response, you can see the
token usage, but you can't really see what's happening within the model.
So of course, with OpenAI we have no idea how GPT-4 was even built, what the architecture of GPT-4 is. But for open source models we do
have some of these insights
but getting
that extra
bit of visibility, looking at what's
happening within the model, I think it probably won't make sense to most people,
maybe even all, because it's kind of like the magic of the model that was trained.
The model learned some understanding of text, and this is how it was able to generate text. But we have no idea, you know, if you look at the neural network, what each layer is doing, what the role of each layer is in that huge model with billions of parameters. I've seen works, you know, with old models doing image recognition, and it was really cool to see that if you look at the neural network, then you can actually see each layer kind of understanding different aspects of the image. Like if you train a model to recognize dogs, then you have one layer that is really good at recognizing eyes or one layer that's really good at recognizing ears or something.
And then together they're able to distinguish
whether it's a picture of a dog or not a picture of a dog.
So I'm guessing you have something similar like that if you look at an LLM.
But for most users, who cares?
Like it works or it doesn't.
Expanding the topic a little bit from the quality metrics
we talked about earlier, because Brian and I,
we have a history and done a lot of work
around performance engineering.
So performance for me is obviously one dimension of quality.
So how fast a system is responding.
Because if I ask a question and it takes a minute until I get something back,
then the question is, is the system really helping me in my workflow?
Do I get more efficient? Yes or no? Would I have been faster?
Is this a dimension you're also looking into performance?
Like how fast responses are coming back?
Yeah, I think this is one of the reasons why we started with tracing.
Because if you look at the trace of, let's say, like a RAG pipeline
where you have a call to a vector database
and you have maybe even multiple calls to OpenAI,
then looking at the latency, for example, of each call to OpenAI
can tell you a lot about maybe even what can
you optimize.
You can see that you're doing things sequentially, but you can parallelize, like you can do these
three calls in parallel and then save a lot of time because each call to OpenAI can take
like three to four seconds even.
So it's like if you're coming from like a traditional cloud and microservice environment,
three seconds is like forever.
I don't know.
Why would you?
Can you give me an example?
Why would I have?
Because I guess I have a very simplistic view of what an LLM can do for me.
I can say, hey ChatGPT, create an abstract for this podcast,
and here's some input, right?
And I put it in, I hit enter, and then I assume from my naive
way of looking at it, there's this one API call and then one result
comes back. Now you just mentioned that there might be multiple calls to
OpenAI. Why would that be? Give me some use cases where
you would actually then split up the work.
So the simplest use case is one limitation
that we have of the technology today,
which is token limitation.
We can't, like there's a limit on the amount of text
we can input into OpenAI.
So if you have like a really long text
we want to summarize,
sometimes we just need to split it
into multiple shorter texts,
summarize each chunk,
and then take all the summarization
and create one summary
from all of them.
So the first part can be parallelized, right?
And then the last part needs to just take, collect all the summaries and build a summary
of the summaries from them.
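A rough sketch of that map-reduce summarization pattern, with the per-chunk calls run in parallel as Nir suggests; the chunk size, model choice, and helper names are illustrative assumptions.

```python
# Sketch of chunked ("map-reduce") summarization: summarize chunks in parallel,
# then summarize the summaries. Chunk size and model are illustrative choices,
# and naive character chunking can lose context that spans chunk boundaries.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def summarize_long_text(text: str, chunk_chars: int = 8000) -> str:
    # Map: split to stay under the token limit, summarize each chunk in parallel.
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        partial_summaries = list(
            pool.map(lambda chunk: ask(f"Summarize this text:\n{chunk}"), chunks)
        )
    # Reduce: one final call that merges the partial summaries.
    joined = "\n".join(partial_summaries)
    return ask(f"Combine these partial summaries into one coherent summary:\n{joined}")
```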
And I think in that model, you just, at least to me, highlighted where the quality comes in, right?
Because if you think about chunking up a bit of text,
if text part three has some reference that only makes sense in context of text part one,
but those two are analyzed separately, you don't have that context connection.
So that then is the challenge of how do we take those three and when you talk about the
quality of the output, right?
You're going to get an output.
You can get it speedily.
It's, I think, the fascinating thing on this.
It's, you know, most of the time in checking quality, is the answer accurate? And accurate meaning, you know, two plus two is four, and it's not spitting out five.
Search engine searching, that started becoming quantitative or qualitative, sorry.
This is that on much bigger scale.
And then when you talk about those having to tokenize or, you know, break up that data, much more complex.
You're almost going to need an AI or a quantum computer at some point
to handle all this.
It's really, really fascinating performance
and quality issues that,
again, and this is why we love doing this
podcast, because all these new topics
that we haven't thought, my mind is just going
all over the place right now, which is probably why I've been
messing up so much during the episode.
Anyhow,
back to you.
Yeah, yeah. So, Nir, and again, you know, we're coming, at least I'm coming, from a very basic understanding of the whole new technology. When we talk about performance, is there an option, when I use an LLM, that I can say, hey, I'd rather have better performance but less accuracy? So can I make a call to one of these LLMs and say, you know what, I need the response
within a second, but I don't care so much about the quality.
Is there a trade-off? Is there something that I can decide as a developer?
Yeah, definitely. You have faster models. Like, if you go back to OpenAI, then, you know, GPT-3.5 will be, I think, maybe now it's less correct, but until two months ago,
GPT-3.5 was much faster than GPT-4. But it's less accurate, but it depends how you measure
accuracy. So there might be some use cases where you want to use GPT-3.5 because it's faster. And
by the way, it's cheaper rather than GPT-4, but you still need
a way to figure out if it's okay for you to use it.
What do you lose by downgrading to GPT-3.5?
And for that, you need some metrics that you can actually compare and see, okay, this is
what I'm losing. The redundancy increases or the grammaticality decreases.
I don't know why.
But something changes, and you need to be able to make a conscious decision around it.
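As a rough illustration of making that trade-off consciously, you could run the same prompt through both models, record latency and token usage, and then feed the outputs through whatever quality metrics you have defined. The prompt and model names below are just examples.

```python
# Sketch: compare latency and token usage of the same prompt across two models,
# alongside whatever quality metrics you have defined. Prompt and models are examples.
import time
from openai import OpenAI

client = OpenAI()
prompt = "Summarize the plot of Hamlet in three sentences."

for model in ("gpt-3.5-turbo", "gpt-4"):
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    latency = time.perf_counter() - start
    usage = response.usage
    print(f"{model}: {latency:.1f}s, "
          f"{usage.prompt_tokens} prompt + {usage.completion_tokens} completion tokens")
    # ...then run response.choices[0].message.content through your quality metrics
    # before deciding whether the cheaper/faster model is good enough.
```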
One thing that I recently learned, this was just a podcast we published
in the middle of December, where I was on the road in Hungary.
I was speaking at a conference and then I met a guy from a bank and they're obviously
also integrating LLMs into their online banking.
And it was really interesting for me to learn that the Hungarian language is very limited
in terms of material that is available
because there's only, I think, so many, like 10 million or 12 million Hungarians in the
world producing that much content that can be used to train models and therefore the
language models that exist are actually not that great right now, and that's why there's an initiative going on from the individual banks and other industries, I think, to build their own models. So my question to you is, do you see, is it something you see also with your customers, that they're struggling just with the limits because they don't have enough material available to train the model, and therefore you're getting a lot of bad results, and therefore the decision to go into production with an LLM integration means you just need to feed it with more data?
That's a good question. I actually speak a bit of Hungarian.
It's a really difficult language, and I can see why GPT has difficulties understanding Hungarian, as opposed to Japanese or Dutch, which are languages that I was surprised to see it, you know, being able to
answer questions in these languages. The problem is that, you know, we were talking about how hard
it is and how difficult and how much we don't know about how to measure the quality of text
in English. So in other languages, we're in a much worse position. Probably there hasn't been much research done around how to score
Hungarian text, even if you just want to see if it's grammatically correct,
which is a really simple task.
I'm not sure how many researchers that are out there doing that.
I would have to say
this will become a
much bigger challenge and we'll see
a huge gap between
the tools that we have
in the English language as opposed to
other languages.
We just won't have
the same quality of tools
as we have even today with the English language.
Yeah, it's interesting.
There was a lot of talk about cultural bias and facial recognition and things like that.
And this is bringing this up, right?
Depending on what language you speak or, you know,
and how popular that language is,
are people and cultures going to be left behind on this?
Is the technology going to be reachable by them?
It just always fascinates me what doors or questions these things all open up.
Because we think about these things in terms of our own situation.
But then that Hungarian thing, or I'm thinking like, okay, what if it was like Swahili, right? What are you going to do for a Swahili language model? It's going to be
even tinier, right? But yeah, just interesting challenges. And one thing I think we can rely on,
hopefully, as opposed to with the facial recognition thing, but I think what a lot
of stuff goes on in IT and all this is there's such a great community of people sharing
that hopefully we see that spirit, you know, break through all that.
But the problem is cost.
You know, I thought about like, I can compare it to Wikipedia, for example,
like the English Wikipedia is the largest one for sure.
But today you have other languages that have
really big
Wikipedia communities and it's easy
to get started because you just need
people who want to
contribute and
speak that language that can start writing
articles. But when it comes
to training
an LLM, you first need a lot of money, because it's really expensive to train an LLM for a specific language.
And secondly, you need expertise.
And I'm not sure that every country would have the expertise of training its own LLM.
So you might have a lot of languages where no one knows how to even train an LLM for
that specific language.
Can you give me, besides the, let's say common use case,
I'm asking GPT to summarize the text for me or create something.
What are the use cases you have seen that your customers are implementing?
Why? Can you give me one or two examples? Because I just want to understand more than just, you know,
summarize some text for me.
What else is out there?
I think the most exciting use cases I'm starting to see today are the multimodal use cases.
You know, with Gemini from Google and now GPT-4 as well,
GPT-4 Vision,
you're starting to get models that don't just take text
and try to understand it.
You can also feed images to the model,
and then it can kind of communicate with that image,
like understand that image and answer questions about that image.
So you're starting to see, like, we have customers coming from a medical background, using it to try to understand patients and medical conditions.
And this is like, I think this will be an amazing,
a great advancement that we'll see in 2024,
because we'll soon have audio coming into these language models.
And this multidimensionality
will give us a huge amount of new opportunities
and applications we can build.
And this is like, I would say this is, I don't know.
Yeah, I would say this is the most interesting
and most exciting things I'm seeing users are doing today
in our platform.
And you call this again, multimodal LLMs or what did you call them?
Multidimensional or multimodal.
Multidimensional, yeah, cool.
So Brian, if this becomes a reality, then people
can say, there's a new podcast
for Brian and Andy, but it's an hour long
and I don't like all the jokes.
So please give me
a one minute summary
because then you know you can
understand audio. It's really awesome
actually. And then draw a picture so then
you don't need to take the screenshots anymore.
We can actually ask
the large language model to create a perfect picture
of our discussion
based on just the audio track.
I'll be drawn as a clown, I'm sure.
You know,
you brought up an interesting thing there.
Going back to the idea of having trade-offs between speed and accuracy, the different versions.
So that first thought that brings to mind is shopping around and picking which model you want to use for what purposes.
And that, I just think of Mario Kart, the old Nintendo game, where you can get more speed but that makes you slower to accelerate. There's all these trade-offs. But then I'm thinking, and I don't know if this is something you're seeing, is LLMs talking to other LLMs. So maybe one is optimized for grammar, another one is optimized for whatever. Like, let's say you had the picture, one is going to analyze the picture and do all this stuff.
And instead of training that model then
to also write a fantastic summary,
do you see this thing
where we're taking data from one,
sending it to another one
that's more specialized?
Or let me bring it down
to even a more specific question.
Are we, do you think we're going to be seeing
specialized models?
Or is everybody going to be trying to do everything all at once together in their own model?
This is a really good question because I've been thinking about it a lot.
And if you see how the industry evolved in the last year, everything was so fast.
But when OpenAI started,
they used to have a model specific for chat and they had another model for tasks
and they had another model for code understanding.
And now they don't.
Now they have one model with general purpose.
You can do everything and that's it.
And it's a question.
And I even read once, I think it was like a year ago, when they deprecated their code understanding model, the model that can understand code, they said that when they measured it, they saw that their newer model, like GPT-3.5 back then, is better. It's general purpose, so it wasn't trained specifically on code, but it was better at code understanding and text understanding than the model that was trained just on code and was supposedly better at just code understanding. And then they even said that training, because, let me go back.
They used to train one model only on code from GitHub and then another model only on text, like from Wikipedia and Reddit.
So this one model was fluent in text understanding and text generation
and another was understanding code generation.
And when they took the text generating model
and trained it on code,
they saw that it was actually becoming better
at text understanding, which is weird.
Like it wrote a lot of code
and then it suddenly became more proficient
and more fluid in like text understanding
and text generation. It's kind of weird.
Not sure anyone knows
why this happens, but this is
what they saw.
I would say that my guess is that
we will actually
have more generic
models that can do everything
rather than models
specific for specific
kind of tasks.
But on the other hand, we still see some models being better at some highly specific use cases
than others.
For example, I've seen a claim that Anthropic models are more creative than OpenAI's.
So if you want to write a poem,
maybe you should do it with Anthropic
and not with GPT.
Interesting.
So much to
see how it all plays out over time, right?
Yeah.
It's fascinating. I mean, as you said, you're right. We're only a year in.
Yeah, GPT-3.5 was a year ago.
Yeah, it's crazy.
And that has completely changed our world.
It obviously has.
I mean, we are, in our organization, at Dynatrace,
we're also using LLMs now to provide on the
one side natural language interface to the data that we have by asking human questions,
like regular questions, and then get the queries.
Also to help our users to do certain tasks in the tool or just like, you know, get help
on how to do certain things.
And one of the things we are also doing,
and I'm sure everybody's doing this,
but we are redesigning or optimizing also our documentation,
everything that is available publicly, to be more easily consumed by LLMs, right?
Because LLMs are obviously, you know, scratching or scraping websites
and then using this to train the model.
So I guess if you are optimizing the data that you put out there,
then it's going to be easier for large language models also to produce good quality answers and value out of it.
That's funny. That reminds me of like the whole, you know, optimizing for Google
and how everyone would try to cheat it by putting the metadata
and are people going to start trying to cheat on the...
I'm not saying we're cheating, obviously,
but just taking advantage of putting certain words in it
to trick the AI models.
One of the things that I'm always thinking about
and this reminds me of that,
is I think we haven't quite figured out what's the right interface
for interacting with these LLMs.
Right now, we have two types of interfaces.
One is the chat interface, that is widely common.
And the other one is the co-pilot interface.
And these are the two that are kind of succeeding.
I personally am more of a fan of the co-pilot interface. And these are the two that are kind of succeeding. I personally am more
of a fan of the co-pilot because it just integrates with my day-to-day work. I don't
need to do anything. It just works and it's kind of like magic. And it's interesting to see what
will happen there. What types of interfaces will we see in this domain in the next year or so?
Because this is still a novel question, in my opinion.
Nir, because we are kind of getting closer to the end of the recording here,
I wanted to kind of loop back to Traceloop,
which is the company that you founded.
I know you mentioned earlier OpenLLMetry. See, I got it right this time, hopefully.
So that's an open source framework. And what is TraceLoop then? Can you just quickly fill me in
and also the listeners? Yeah, so the open source OpenLLMetry is basically an SDK built on top of open telemetry
that allows you to log traces and soon metrics and logs using open telemetry,
which you can then connect to any observability platform that supports open telemetry.
And then TraceLoop is one of the potential consumers of this SDK.
So if you connect the SDK, if you route the SDK to Traceloop,
then you get metrics and alerts around specifically the quality
of your LLM outputs and generated content.
And this is the platform. Basically, the input is OpenTelemetry, which is coming from
our SDK.
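For readers wiring this up, where the data goes is typically decided at init time or via environment variables; a hedged sketch follows, and the exact variable and parameter names should be double-checked against the OpenLLMetry documentation, since they may differ between versions.

```python
# Sketch: where the traces go is decided at init time / via environment variables.
# Names below are taken from the OpenLLMetry docs but may differ between versions;
# double-check the current documentation.
import os
from traceloop.sdk import Traceloop

# Option 1: send traces to Traceloop's managed platform.
os.environ.setdefault("TRACELOOP_API_KEY", "<your-api-key>")  # placeholder
Traceloop.init(app_name="my-llm-app")

# Option 2: send to any other OpenTelemetry-compatible backend instead, e.g. by
# pointing the SDK at its OTLP endpoint (assumed variable name):
#   TRACELOOP_BASE_URL=https://collector.example.com:4318
# plus whatever authentication headers that backend expects.
```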
So if people listen to this and it's like, wow, okay, we also have a project where
I know we're trying to integrate LLMs
with our applications. You said the first thing that people typically do is download
OpenLLMetry to make sure that they get the telemetry data, then send it to your observability backend. This might be Traceloop or might be any other observability endpoint or backend. Do you know what are kind of like the two, three things, when people use your framework, that we want to give them,
that we want to tell them, hey, this is something you need to do,
this is something you will probably find, or these are some mistakes that you should not make,
just maybe as a good starting point for people
that get started now after this podcast?
I think that the most important thing
is to work methodically with LLMs. Most people,
before they start using our platform, do everything manually. They look at the outputs
that they're getting, and then they decide whether they like it or not. And when they make changes,
like when they upgrade a model, when they change a prompt, or when they make their pipeline more
complex, the way that they quantify how,
like if they've gotten better or not
is just by looking again at the outputs
and seeing, okay, this looks good.
Like this looks like a good summary, for example.
So I would say be like,
start measuring the quality from day one.
Like start, define your metrics,
define the metrics according to the application
that you're using.
And of course, potentially do it with Traceloop.
And then measure it all the time.
And when you make changes, measure the metrics and then see that those metrics that you've chosen have actually improved.
Don't just look at the text with your bare eyes and decide whether this
is what you want or not. You have to work more quantifiably from day one.
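A minimal sketch of what "measure from day one" could look like in practice: a fixed set of test prompts, a couple of chosen metrics, and a comparison against the previous baseline whenever the model or prompt changes. The metrics, prompts, and baseline file here are all illustrative assumptions.

```python
# Sketch: a tiny quality regression check to run whenever the model or prompt changes.
# The metrics, test prompts, and baseline file are illustrative assumptions.
import json
from pathlib import Path

TEST_PROMPTS = [
    "Summarize: OpenLLMetry instruments LLM calls with OpenTelemetry.",
    "Summarize: Traces show prompts, responses, latency, and token usage.",
]
BASELINE_FILE = Path("quality_baseline.json")

def run_pipeline(prompt: str) -> str:
    # Replace with your real LLM pipeline (model + prompt template + post-processing).
    return "placeholder output for " + prompt

def quality_metrics(text: str) -> dict:
    words = text.lower().split()
    return {
        "length_words": len(words),
        "distinct_word_ratio": len(set(words)) / len(words) if words else 1.0,
    }

def evaluate() -> dict:
    scores = [quality_metrics(run_pipeline(p)) for p in TEST_PROMPTS]
    # Average each metric over the test set.
    return {k: sum(s[k] for s in scores) / len(scores) for k in scores[0]}

if __name__ == "__main__":
    current = evaluate()
    if BASELINE_FILE.exists():
        baseline = json.loads(BASELINE_FILE.read_text())
        for metric, value in current.items():
            delta = value - baseline.get(metric, value)
            print(f"{metric}: {value:.3f} ({delta:+.3f} vs baseline)")
    BASELINE_FILE.write_text(json.dumps(current, indent=2))
```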
It's funny, that's the same CD model, right? I mean, like always testing, write
your code, check it. My database query went from one to five queries.
Do I want that, right?
Now you're just putting the quality in there.
Yeah, yeah, yeah.
Yeah, sorry, I said delivery.
I'm an idiot.
It's off today.
Something's wrong with me today, Andy.
No, no, no.
I think I'm dying.
Maybe your model is wrong.
Maybe you have some issues with your model.
Maybe this is the...
Who knows?
Maybe we're not talking with Brian Wilson.
Maybe we're talking with a model that was trained on only a subset of all of the podcasts that you've ever created.
And that's why you're off.
But the jokes are spot on bad.
So I at least got that right.
Exactly.
And you know, Nir, I don't know if this is, we don't have a time for this topic today,
but I wanted to bring it up in case either you know about this stuff, it might be a topic
for a future conversation, or if you know somebody who might be, or also for our listeners.
It got me thinking, obviously there's a rush to get all this stuff out to the market, right?
Faster you are on this.
And we know with speed comes sacrifices.
And one of the things that you said earlier about, you know, training on code and different things, I just started thinking of security in this stuff, right? Like, could people feed in a prompt to the chat which would break it, right? Or, just like SQL injection in a search, are there any different kinds of considerations when it comes to security in these things?
Is the industry around these places integrating security?
We don't see a widespread integration of security, but it's starting to catch on.
Are they picking up one of the special considerations, or are they just wild, wild west and going in hoping nobody starts to exploit?
So I think that's a whole other topic, but I don't know if you know people
or if that's something that you're aware of.
Yeah, it's definitely a huge topic.
And I think this is, again, something we need to deal with in the coming months.
It kind of relates to privacy, like privacy and security.
How do you make sure not to leak your own data to someone by accident or even someone trying to hack into your prompts and trying to extract
internal data that you didn't want to be exposed?
And yeah, I think there's a lot to be discovered there. And if you take traditional cybersecurity,
then we'll probably see hacks first and then solutions
and not the other way around, unfortunately.
Awesome.
Nir, thank you so much for enlightening us,
for telling us about what LLMs really are and do,
and giving us a little more use cases other than just give me a summary of a podcast,
which I sometimes try to do with ChatGPT.
I'm really also, like you, excited about the multi-modal or multi-dimensional
where audio, image, and text can be analyzed,
and then opening up new help for the
medical field, for instance, which you mentioned. I also like the quality metrics, like how you can
measure just multiple dimensions to measure the quality, accuracy, repetitiveness, also obviously
performance and costs, and then your tips in the end about starting to measure from day one and then continuously validating the quality metrics, because every change that you do to your model or to your prompts may have an impact on cost, on performance, or on the output, on the correctness of what's coming out. So that's really nice.
Yeah. And folks, obviously, Traceloop and OpenLLMetry is what you should check out.
We will link to these.
You sent me a couple links.
Any final thoughts?
Anything else that I missed?
I would like to hear Nir say OpenLLMetry, because I heard it a different way. So let's hear from the source, how do you say it?
I say OpenLLMetry.
I'm like, yeah, you hear the M because it sounds like OpenTelemetry, but it sounds like you have three L's in there. OpenLLMetry, yeah.
OpenLLMetry, yeah. LLMetry. LLMetry.
Yeah, it's awesome.
I like that.
Yeah, the other thing, too, is, you know, oftentimes we have new topics on and they quickly become my favorite. This is quickly becoming one of my favorite topics now. The aspect that we started with, which was testing for quality of subjective data, right? But doing that programmatically, it's mind-boggling on its own. And then pulling in the trade-off of quality versus performance, at least for now, right? That might change in the future, obviously, with things getting better. It's just opening all sorts of, you know, thoughts and ideas in my mind, and again, this is why we love doing these podcasts, we love having guests on like you. So can't thank you enough. If you have it on top of your mind, if not, what's the biggest thing you expect to see coming out of this field in this new year? Or are hoping to see come out of this field in this new year?
The Gemini models.
Like we've seen Gemini Pro, and then there's Gemini Ultra, which should be out sometime in the future.
I think it's going to be an exciting year looking at how these models evolve.
In just a year, we now have GPT-4 and Gemini.
Who knows what we'll have in December 2024. Awesome.
All right.
Thank you.
All right, everybody.
Thank you for listening.
We hope you all found this as fascinating as we did.
And thank you once again, Nir, for coming on, and happy new year, everybody.
Thanks.
Thank you so much.