PurePerformance - OpenLLMetry - Observing the Quality of LLMs with Nir Gazit
Episode Date: January 15, 2024
It's only been a year since ChatGPT was introduced. Since then we have seen LLMs (Large Language Models) and generative AI being integrated into everyday software applications. Developers face the hard choice of picking the right model for their use case to produce the quality of output their end users demand. Tune in to this session where we have Nir Gazit, CEO and Co-founder of Traceloop, educating us about how to observe and quantify the quality of LLMs. Besides performance and cost, engineers need to look into quality attributes such as accuracy, readability, or grammatical correctness. Nir introduces us to OpenLLMetry, a set of open source extensions built on top of OpenTelemetry that provides automated observability into the usage of LLMs, so developers can better understand how to optimize that usage. His advice to every developer: start measuring the quality of your LLMs on day 1 and continuously evaluate as you change your model, your prompt, and the way you interact with your LLM stack!
If you have more questions about LLM observability, check out the following links:
OpenLLMetry GitHub Page: https://github.com/traceloop/openllmetry
Traceloop Website: https://www.traceloop.com/
OpenLLMetry Documentation: https://traceloop.com/docs/openllmetry
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance!
I still have all of my strength to bite through every top.
No, I don't know.
That looked like a shark's tooth, by the way, when you showed it to me.
That might have been the camera on the thing,
but it looked like a really, really large tooth,
which is why I was like, oh, what kind of tooth is that?
But it must have just been the optics on the camera.
So let all of our listeners know we're coming to you from the future,
because right now it's 2024, but it's not really.
I think this is our second-to-last recording of the year.
It is.
Although you're hearing it in the future, everybody,
we hope you all had a good end of year.
Well, I guess we can say that, right?
We can say we hope you had a good end of year,
because they're listening to it afterwards.
See, I'm answering all my own questions, Andy,
by babbling and rambling.
So I'm going to shut up now and let you do your magic.
I will do my magic.
No, thank you so much, Brian.
You're right.
It was another great year. I was using a lot of language.
I know.
It was another great year of podcasting.
And podcasting is only possible
if we actually have great guests.
Great guests that bring new topics
to our communities, to our listeners, and also to us, because we always say we are
the ones that benefit the most, because we listen to every episode and we learn in every episode.
And today we are having a topic that is extremely popular, at least at the end of 2023, and I'm pretty sure it will also be very popular
at the beginning of 2024,
which is we talk about observability
of LLMs, large language models.
And we couldn't have picked a better guest
than Nir Gazit.
I hope I pronounced the name at least roughly correct.
CEO and co-founder of Traceloop.
Nir, do me a favor,
can you quickly introduce yourself to the audience, who you are, what you do, why you think
what you do as a company is important in this day and age? Yeah, so I'm Nir, I'm CEO of Traceloop,
as you said, and we're building a platform for evaluating
and monitoring the quality of LLM applications.
So looking at the outputs you get from LLMs
and telling you whether they're good or bad
and alerting you when they're bad.
And Nir, you mentioned a good word now
because you didn't say performance,
you didn't say resiliency, you didn't say
some other term. You said quality of the output of LLMs.
And that's obviously a big topic. Brian and I, we have a big background
in performance testing where you test the quality of your application under load
if it still produces the same results that you expect from a performance perspective as well.
Can you define, especially for the people that just know LLMs in general, maybe they've just explored it with ChatGPT, can you define how you measure quality and what quality means in terms of LLMs?
That's a good question. I think it's a complex answer. I don't think there's a one-line answer for that
because if you go back to traditional machine learning models
and deep networks,
then you used to have really simple definitions of quality.
You measure precision, you measure recall,
so you know, let's say you're building a model
for classifying images,
whether you have cats or dogs in the image.
So it's really easy when you have a model
that is trained to recognize cats,
just count how many cats you recognize
out of pictures of cats
and how many cats you recognize out of pictures of dogs.
And there you get the precision and recall, right?
And then when you come to the generative AI landscape in general,
by the way, not just large language models, but also for images, then the question of good becomes
really complex. If you take a really long text and ask a GPT to summarize it, what is a good summary?
Even if you ask someone, is it a good summary, is this summary better than the other one, you'll get a mixture of answers. Some people like this one more, some people like that one more. So the answer to what good quality text is, is highly context dependent, and it's not just one answer like this is good, this is bad.
But isn't that then a big challenge? Because if you just look at the three of us, if you say something, then Brian may think differently about what you just said than I do. How can we then automate measuring the quality, or kind of validate the quality, or give it a metric, like a quality indicator? How can we give a quality indicator to the output of an LLM in an automated way if we have so many different opinions, I guess? I mean, how does this work?
So what you usually do is, first, you realize that there's not one metric that can measure everything; there are multiple metrics, and some of them might depend on the specific task that you're trying to score.
And then secondly, most metrics won't be absolute,
but rather will be like relative.
So you can say, okay, this summary,
we'll go back to the summary example.
This summary is better than the other one
because it contains this topic,
which is part of the original text
and the other summary didn't contain that topic.
And this topic is important,
so it should be part of the summary.
So this is one aspect you can think of, of how to compare different text summaries.
But other ways would be like, I don't know, is it grammatically correct? Is it more correct, like higher language than the other one?
Or is it repetitive or not?
How repetitive is it?
Like GPT has this tendency of being overly repetitive.
And I guess this is how it was trained.
So like how repetitive is the text that you got?
And then you get these 10, 20 sets of metrics, and this is how you can compare a text X against a text Y. And then, you know, if you see something weird, something jumps out in the metrics, you know that something is wrong.
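As an illustration of the kind of automated comparison Nir is describing, here is a minimal Python sketch that scores two candidate texts on a few fully automatable metrics (repetitiveness via a distinct n-gram ratio, keyword coverage, length). The metric choices and the keyword list are illustrative assumptions, not what Traceloop actually computes.

```python
# Minimal sketch: a few fully automatable quality metrics for comparing two
# candidate texts (e.g. two summaries of the same document).
# Metric choices and the keyword list are illustrative assumptions.

def distinct_ngram_ratio(text: str, n: int = 3) -> float:
    """Higher = less repetitive: ratio of unique n-grams to total n-grams."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 1.0

def keyword_coverage(text: str, keywords: list[str]) -> float:
    """Fraction of expert-defined keywords that appear in the text."""
    lowered = text.lower()
    return sum(kw.lower() in lowered for kw in keywords) / len(keywords)

def score(text: str, keywords: list[str]) -> dict:
    return {
        "distinct_3gram_ratio": distinct_ngram_ratio(text),
        "keyword_coverage": keyword_coverage(text, keywords),
        "length_words": len(text.split()),
    }

# Compare text X against text Y on the same metric set.
keywords = ["observability", "OpenTelemetry", "quality"]  # hypothetical domain keywords
text_x = "OpenLLMetry adds observability to LLM calls using OpenTelemetry."
text_y = "It adds tracing. It adds tracing. It adds tracing to the LLM calls."
print(score(text_x, keywords))
print(score(text_y, keywords))
```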
You suddenly have a lot of power in your hands if you think about that, right? Because let's just take even the grammar side of it,
obviously you have grammar rules,
but if you look at things written in proper grammar,
a lot of times they may be harder for the regular people to read,
or it's not quite the way regular people might read or write these days.
So number one, you have to think,
is this topic good for proper grammar?
Because if it was maybe a bunch of teens or whatever, and I know I'm an old man talking about teens, right?
But if it was a bunch of teenagers or something, you might not be doing the absolute perfect grammar because they'd be like, this is boring to read.
Or if you think about even 18th century British literature with all its crazy language and it's just a horrible, horrible read. But at the same time, if you're enforcing some of those things,
you're also guiding culture into a way of,
are we going to go more and more casual?
Are we going to try to push more and more?
I don't know.
Just that statement you were saying just opened up the choices you make and the outputs, not just you, but whoever's working on all these things,
and the outputs that get introduced to everybody is going to start influencing culture in some way.
The more and more these things are being used.
And maybe I'm overblowing it a little, but it just dawned on me, yeah, this has to be done.
And you're all picking what that best is going to be.
So that's kind of cool.
I don't know if you thought about it in that way yet, but you know,
maybe you're going to be changing,
you know,
people are going to start using more words and more variation or whatever,
right?
You could go either way with that.
Let me ask you,
let me ask you a question on,
on this,
because I took a couple of notes and you said,
and the quality can be measured by,
for instance,
is it grammatically correct?
Is it repetitive? Is it too repetitive?
Is it accurate? Does it contain certain keywords?
But does this mean, I mean, for grammar,
I guess you can take the output and then automatically
validate if it's grammatically correct
because we have tools that can check this.
Repetitiveness, we can also measure it, right?
How often does a certain word repeat?
Now, accuracy is, I guess, hard because how do you measure the accuracy?
Because you need somebody that actually knows the subject matter and what you expect.
And then also the other one, does it contain certain keywords?
This also means you need to have a domain expert that says, hey, this needs to be in
there.
So my question to you is, what of these aspects
from quality can be
quantified and measured automatically?
And where do we still
potentially need a human being?
I think if you
look, I was
about to say that if you look at the content,
then you probably need some human being
at least to create
some guardrails,
to define the rules that the output text needs to comply with.
But even if you just look at the structure,
we were talking about an AI that generates complete content.
I don't know, you use AI to write an email.
But there's a lot of use cases where you use AI to, let's say, extract some specific, I don't know, subjects or specific words from a really long text, or you use AI to classify a really long text, and then the output is really strict, like you expect the output to be just one word. And then, you know, the whole world of what's a good answer, what's a bad answer becomes completely different.
But even for, if we go back to the text examples,
then it still holds that you need a lot of guidance from humans,
from probably the developers or data scientists working on the project to tell you what do they expect the text to be or to contain.
Even for the grammar, I don't know,
Brian, you talked about us guiding higher or lower grammar
according to the metric that we define,
but actually the person developing the application
and using us, for example, as a monitoring tool
will be the one that actually defines
whether they want a good grammatical,
like a high grammatical text or a low one.
Like, for example, if I'm writing some contract,
then probably the language will have to be super...
Legally used, yeah.
Yeah, legally high words, you know.
And if I'm writing a Reddit post, then I want to lower it down.
I want to make it dumber and, I don't know, dumber, but simpler to understand.
You just offended the whole Reddit community.
Thank you so much for that.
We're going to get emails on this now, Andy.
Yeah, probably.
So, okay, coming back to what you do actually and what motivates you and what drove you to, I guess, you know, start Traceloop.
That means you actually provide a framework,
a tool that allows developers
that are integrating LLMs into their applications
to really measure the quality of the LLMs that they're using.
Is this an accurate definition, description of what you guys do?
Yeah.
The first thing we did, we wanted to measure the quality of the output.
So we need to start collecting those outputs and start collecting how our customers, our
users are using LLMs.
So we needed to build an SDK that can collect these
metrics. And so we have a lot of experience with OpenTelemetry. So it made total sense to just
use the same foundation, the same protocol and extend it to support instrumenting calls to OpenAI,
calls to vector databases, and even frameworks like
LangChain and LlamaIndex and others. So that means what you're providing is some instrumentation for
these, I don't know, handful of SDKs that I would use as a developer to develop against an LLM,
when you mentioned OpenAI, then I can just say I want to enable your tool and then I automatically
get OpenTelemetry, I guess, traces, metrics, what type of data do you expose?
What type of signals of observability?
Is it metrics, traces or logs as well?
So right now it's just traces.
This is the first thing we did.
We want to start sending metrics as well in the next month or two.
And then also logs.
We basically want to cover everything that's in the OpenTelemetry standard.
And do you see...
So when I'm a developer and I want to integrate an LLM,
where and when is the right time to actually look at the quality?
Is this part of my testing kind of process?
Or is this something that I need to do in production
because I can only truly validate the LLM
when it's under real load and real users are using it?
Or do you see more people using it up front
in the test environment?
So we actually see, we were surprised.
We thought that people would want to use it
once they're getting closer to production.
So they want to start monitoring and running it in scale and then looking at outputs.
But we were actually surprised to see many users using it really early on.
This is one of the first tools that they adopt, just installing open...
We haven't talked about the name of the tool, which is OpenLLMetry, the open source.
So this is one of the first things that they installed because it's just you install the SDK
and that's it. You automatically understand, it automatically figures out the calls you made to
OpenAI and monkey patches everything on Python or TypeScript. And then you just get logs of all the
prompts and everything. And so, for example, we've seen users using LlamaIndex, which people use, for example, for building RAG pipelines
where you have like a vector database, you get some prompt
and then you send it to OpenAI to get a response
based on the data you got from the vector database.
And so we've seen users just want to see the prompts and responses that LlamaIndex builds
for them behind the scenes because they can't see it in the code.
So they just, during development, they want to see the prompts, they want to see the responses,
they want to understand why the model is behaving like it behaves.
So just installing OpenLLMetry and getting these traces gives them a lot of, you know, visibility into what's happening in their own program.
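To make that concrete, here is a minimal sketch of what "just installing the SDK" looks like in Python, based on the OpenLLMetry docs; the exact init parameters and decorator names may differ between SDK versions, and the app name and prompt are made up.

```python
# pip install traceloop-sdk openai
# Minimal sketch of auto-instrumenting OpenAI calls with OpenLLMetry.
# Exact init parameters and decorator names may differ between SDK versions.
from openai import OpenAI
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow

# Monkey-patches supported libraries (OpenAI, vector DBs, LangChain, LlamaIndex, ...)
# so that every call is emitted as an OpenTelemetry span.
Traceloop.init(app_name="podcast-summarizer")

client = OpenAI()  # reads OPENAI_API_KEY from the environment

@workflow(name="summarize")
def summarize(text: str) -> str:
    # The prompt, the response, latency, and token usage all end up on the trace.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Summarize this text:\n{text}"}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(summarize("OpenLLMetry is a set of OpenTelemetry extensions for LLM apps."))
```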
Brian, this kind of reminds me a little bit of, if you remember the days when we talked about Hibernate and other mapping tools, frameworks, where there's always a black box. It magically works, or it magically worked,
those entity relationship mappers like Hibernate.
And then we all of a sudden started to get distributed traces,
and we saw what is Hibernate actually doing
when you're requesting a certain type of data.
And then you saw, wow, this is really inefficient.
It's collecting too much data, making too many round trips. And I guess this is the same as what you're saying.
We are dealing now with a new technology. We are dealing with SDKs
that we can make a call to and we get something back, but we don't really know what's happening
inside. And with OpenLMetry,
did I pronounce it correctly? OpenLMetry.
It's a tongue breaker,
you actually get the visibility
in what's happening inside the model
or inside that SDK.
And with this, developers can better understand
that technology and then hopefully also better use it.
Is this a good assumption of what people are doing?
Yeah, exactly.
During development, just seeing what's happening
and being able to debug it
and then later on, once it reaches
production, being able to
continuously monitor the quality
of how your users are using
your already built application.
It's funny, when you described
the two, you described it that way. I think
for a lot of people, all this
AI stuff, especially the
chat GPTs, seems like this magical
mysterious thing that you know sometimes when we're even talking to co-workers and we all work
in it they're like oh it's just amazing it just does this crazy stuff but behind it is really just
all code traces same thing as everything else right uh and getting that visibility into it
just yanks that curtain down right because
suddenly if you're looking at that and i'm seeing how it's being interacted with it's no longer this
oh this you know the terminator is coming or whatever it's just code operating right
um advanced code right but it uh it's pulling the mystery out of it which i think is good because a
lot of people are probably spooked in either way, either really into it or really scared of it.
And I was like, no, it's just code.
And I'm sure Andy will find an N plus one problem in there at some point.
Well, he was talking about Hibernate.
It was interesting.
We saw from database to microservice, N plus one is one of the most common problems.
And we're sure it'll hit there at some point.
So it'll be a common problem in that.
Nir, I have a question, and this might be very basic,
but I assume many of our listeners
have the same misunderstanding or not knowing.
When I am an application developer
and I want to use an LLM, let's say I want to provide an easier natural language interface
to my software.
So I guess I have two options.
I can develop against a publicly available SaaS-hosted version.
I don't know, I think Microsoft provides OpenAI or something.
And then I assume I can also host my own model and train my own model.
These are my two options, right, that I have.
Yeah.
So what type of visibility or observability do I get with OpenLM?
OpenLMetry.
I really need to practice pronouncing it.
Maybe you need to change the name.
I don't know.
Or it's like you can get
people on stage and the one that can say
the best or ten times in a row
without messing up.
It becomes a drinking game, exactly.
Can you say it right every time you say it wrong?
You got to take a drink?
That's viral marketing there.
Yeah.
So my question is, how much do I learn as a developer
from the internals
of the LLM I'm calling when
I'm developing against a SaaS
offering versus something
where I host the model myself?
If you just use a model
as an API,
then you'll get the
same things.
Like basically the observability you get when looking at calls to a model like OpenAI or
your own self-hosted model is you can see the call, you can see the response, you can see the
token usage, but you can't really see what's happening within the model.
So of course, with OpenAI we have no idea how GPT-4 was even built, what the architecture of GPT-4 is. But for open source models we do
have some of these insights
but getting
that extra
bit of visibility, looking at what's
happening within the model, I think it probably won't make sense to most people,
maybe even all, because it's kind of like the magic of the model that was trained.
The model learned some understanding of text, and this is how it was able to generate text. But we have no idea, you know, if you look at the neural network, what each layer is doing, what the role of each layer is in that huge model with billions of parameters. I've seen works, you know, with old models doing image recognition, and it was really cool to see that if you look at the neural network, then you can actually see each layer kind of understanding different aspects of the image. Like if you train a model to recognize dogs, then you have one layer that is really good at recognizing eyes or one layer that's really good at recognizing ears or something.
And then together they're able to distinguish
whether it's a picture of a dog or not a picture of a dog.
So I'm guessing you have something similar like that if you look at an LLM.
But for most users, who cares?
Like it works or it doesn't.
Expanding the topic a little bit from the quality metrics
we talked about earlier, because Brian and I,
we have a history and done a lot of work
around performance engineering.
So performance for me is obviously one dimension of quality.
So how fast a system is responding.
Because if I ask a question and it takes a minute until I get something back,
then the question is, is the system really helping me in my workflow?
Do I get more efficient? Yes or no? Would I have been faster?
Is this a dimension you're also looking into performance?
Like how fast responses are coming back?
Yeah, I think this is one of the reasons why we started with tracing.
Because if you look at the trace of, let's say, like a RAG pipeline
where you have a call to a vector database
and you have maybe even multiple calls to OpenAI,
then looking at the latency, for example, of each call to OpenAI
can tell you a lot about maybe even what can
you optimize.
You can see that you're doing things sequentially, but you can parallelize, like you can do these
three calls in parallel and then save a lot of time because each call to OpenAI can take
like three to four seconds even.
So it's like if you're coming from like a traditional cloud and microservice environment,
three seconds is like forever.
I don't know.
Why would you?
Can you give me an example?
Why would I have?
Because I guess I have a very simplistic view of what an LLM can do for me.
I can say, hey ChatGPT, create an abstract for this podcast,
and here's some input, right?
And I put it in, I hit enter, and then I assume from my naive
way of looking at it, there's this one API call and then one result
comes back. Now you just mentioned that there might be multiple calls to
OpenAI. Why would that be? Give me some use cases where
you would actually then split up the work.
So the simplest use case is one limitation
that we have of the technology today,
which is token limitation.
We can't, like there's a limit on the amount of text
we can input into OpenAI.
So if you have like a really long text
we want to summarize,
sometimes we just need to split it
into multiple shorter texts,
summarize each chunk,
and then take all the summarization
and create one summary
from all of them.
So the first part can be parallelized, right?
And then the last part needs to just take, collect all the summaries and build a summary
of the summaries from them.
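A rough sketch of that map-reduce summarization pattern, with the per-chunk calls run in parallel as Nir suggests; the chunk size, model choice, and helper names are illustrative assumptions.

```python
# Sketch of chunked ("map-reduce") summarization: summarize chunks in parallel,
# then summarize the summaries. Chunk size and model are illustrative choices,
# and naive character chunking can lose context that spans chunk boundaries.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def summarize_long_text(text: str, chunk_chars: int = 8000) -> str:
    # Map: split to stay under the token limit, summarize each chunk in parallel.
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        partial_summaries = list(
            pool.map(lambda chunk: ask(f"Summarize this text:\n{chunk}"), chunks)
        )
    # Reduce: one final call that merges the partial summaries.
    joined = "\n".join(partial_summaries)
    return ask(f"Combine these partial summaries into one coherent summary:\n{joined}")
```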
And I think in that model, you just, at least to me, highlighted where the quality comes in, right?
Because if you think about chunking up a bit of text,
if text part three has some reference that only makes sense in context of text part one,
but those two are analyzed separately, you don't have that context connection.
So that then is the challenge of how do we take those three and when you talk about the
quality of the output, right?
You're going to get an output.
You can get it speedily.
It's, I think, the fascinating thing on this.
It's, you know, most of the time in checking quality, is the answer accurate? And accurate meaning, you know, two plus two is four, and it's not spitting out five.
Search engine searching, that started becoming quantitative or qualitative, sorry.
This is that on much bigger scale.
And then when you talk about those having to tokenize or, you know, break up that data, much more complex.
You're almost going to need an AI or a quantum computer at some point
to handle all this.
It's really, really fascinating performance
and quality issues that,
again, and this is why we love doing this
podcast, because all these new topics
that we haven't thought, my mind is just going
all over the place right now, which is probably why I've been
messing up so much during the episode.
Anyhow,
back to you.
Yeah, yeah. So, Nir, and again, you know, we're coming, at least I'm coming, from a very basic understanding of the whole new technology. When we talk about performance, is there an option, when I use an LLM, that I can say, hey, I'd rather have better performance but less accuracy? So can I make a call to one of these LLMs and say, you know what, I need the response
within a second, but I don't care so much about the quality.
Is there a trade-off? Is there something that I can decide as a developer?
Yeah, definitely. You have faster models. Like, if you go back to OpenAI, then, you know, GPT-3.5 will be, I think, maybe now it's less correct, but until two months ago,
GPT-3.5 was much faster than GPT-4. But it's less accurate, but it depends how you measure
accuracy. So there might be some use cases where you want to use GPT-3.5 because it's faster. And
by the way, it's cheaper rather than GPT-4, but you still need
a way to figure out if it's okay for you to use it.
What do you lose by downgrading to GPT-3.5?
And for that, you need some metrics that you can actually compare and see, okay, this is
what I'm losing. The redundancy increases or the grammaticality decreases.
I don't know why.
But something changes, and you need to be able to make a conscious decision around it.
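As a rough illustration of making that trade-off consciously, you could run the same prompt through both models, record latency and token usage, and then feed the outputs through whatever quality metrics you have defined. The prompt and model names below are just examples.

```python
# Sketch: compare latency and token usage of the same prompt across two models,
# alongside whatever quality metrics you have defined. Prompt and models are examples.
import time
from openai import OpenAI

client = OpenAI()
prompt = "Summarize the plot of Hamlet in three sentences."

for model in ("gpt-3.5-turbo", "gpt-4"):
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    latency = time.perf_counter() - start
    usage = response.usage
    print(f"{model}: {latency:.1f}s, "
          f"{usage.prompt_tokens} prompt + {usage.completion_tokens} completion tokens")
    # ...then run response.choices[0].message.content through your quality metrics
    # before deciding whether the cheaper/faster model is good enough.
```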
One thing that I recently learned, this was just a podcast we published
in the middle of December, where I was on the road in Hungary.
I was speaking at a conference and then I met a guy from a bank and they're obviously
also integrating LLMs into their online banking.
And it was really interesting for me to learn that the Hungarian language is very limited
in terms of material that is available
because there's only, I think, so many, like 10 million or 12 million Hungarians in the
world producing that much content that can be used to train models and therefore the
language models that exist are actually not that great right now, and that's why there's an initiative going on from the individual banks and other industries, I think, to build their own models. So my question to you is, do you see, is it something you see also with your customers, that they're struggling just with the limits because they don't have enough material available to train the model, and therefore you're getting a lot of bad results, and therefore the decision to go into production with an LLM integration means you just need to feed it with more data?
That's a good question. I actually speak a bit of Hungarian.
It's a really difficult language, and I can see why GPT has difficulties understanding Hungarian, as opposed to Japanese or Dutch, which are languages that I was surprised to see it, you know, being able to
answer questions in these languages. The problem is that, you know, we were talking about how hard
it is and how difficult and how much we don't know about how to measure the quality of text
in English. So in other languages, we're in a much worse position. Probably there hasn't been much research done around how to score
Hungarian text, even if you just want to see if it's grammatically correct,
which is a really simple task.
I'm not sure how many researchers that are out there doing that.
I would have to say
this will become a
much bigger challenge and we'll see
a huge gap between
the tools that we have
in the English language as opposed to
other languages.
We just won't have
the same quality of tools
as we have even today with the English language.
Yeah, it's interesting.
There was a lot of talk about cultural bias and facial recognition and things like that.
And this is bringing this up, right?
Depending on what language you speak or, you know,
and how popular that language is,
are people and cultures going to be left behind on this?
Is the technology going to be reachable by them?
It just always fascinates me what doors or questions these things all open up.
Because we think about these things in terms of our own situation.
But then that Hungarian thing, or I'm thinking like, okay, what if it was like Swahili, right? What are you going to do for a Swahili language model? It's going to be
even tinier, right? But yeah, just interesting challenges. And one thing I think we can rely on,
hopefully, as opposed to with the facial recognition thing, but I think what a lot
of stuff goes on in IT and all this is there's such a great community of people sharing
that hopefully we see that spirit, you know, break through all that.
But the problem is cost.
You know, I thought about like, I can compare it to Wikipedia, for example,
like the English Wikipedia is the largest one for sure.
But today you have other languages that have
really big
Wikipedia communities and it's easy
to get started because you just need
people who want to
contribute and
speak that language that can start writing
articles. But when it comes
to training
an LLM, you first need a lot of money, because it's really expensive to train an LLM for a specific language.
And secondly, you need expertise.
And I'm not sure that every country would have the expertise of training its own LLM.
So you might have a lot of languages where no one knows how to even train an LLM for
that specific language.
Can you give me, besides the, let's say common use case,
I'm asking GPT to summarize the text for me or create something.
What are the use cases you have seen that your customers are implementing?
Why? Can you give me one or two examples? Because I just want to understand more than just, you know,
summarize some text for me.
What else is out there?
I think the most exciting use cases I'm starting to see today are the multimodal use cases.
You know, with Gemini from Google and now GPT-4 as well,
GPT-4 Vision,
you're starting to get models that don't just take text
and try to understand it.
You can also feed images to the model,
and then it can kind of communicate with that image,
like understand that image and answer questions about that image.
So you're starting to see, like, we have customers coming from a medical background, using it to try to understand patients and medical conditions.
And this is like, I think this will be an amazing,
a great advancement that we'll see in 2024,
because we'll soon have audio coming into these language models.
And this multidimensionality
will give us a huge amount of new opportunities
and applications we can build.
And this is like, I would say this is, I don't know.
Yeah, I would say this is the most interesting
and most exciting things I'm seeing users are doing today
in our platform.
And you call this again, multimodal LLMs or what did you call them?
Multidimensional or multimodal.
Multidimensional, yeah, cool.
So Brian, if this becomes a reality, then people
can say, there's a new podcast
for Brian and Andy, but it's an hour long
and I don't like all the jokes.
So please give me
a one minute summary
because then you know you can
understand audio. It's really awesome
actually. And then draw a picture so then
you don't need to take the screenshots anymore.
We can actually ask
the large language model to create a perfect picture
of our discussion
based on just the audio track.
I'll be drawn as a clown, I'm sure.
You know,
you brought up an interesting thing there.
Going back to the idea of having trade-offs between speed and accuracy, the different versions.
So that first thought that brings to mind is shopping around and picking which model you want to use for what purposes.
And that, I just think of Mario Kart, the old Nintendo game, where you can get more speed but that makes you slower to accelerate. There's all these trade-offs. But then I'm thinking, and I don't know if this is something you're seeing, is LLMs talking to other LLMs. So maybe one is optimized for grammar, another one is optimized for whatever. Like, let's say you had the picture, one is going to analyze the picture and do all this stuff.
And instead of training that model then
to also write a fantastic summary,
do you see this thing
where we're taking data from one,
sending it to another one
that's more specialized?
Or let me bring it down
to even a more specific question.
Are we, do you think we're going to be seeing
specialized models?
Or is everybody going to be trying to do everything all at once together in their own model?
This is a really good question because I've been thinking about it a lot.
And if you see how the industry evolved in the last year, everything was so fast.
But when OpenAI started,
they used to have a model specific for chat and they had another model for tasks
and they had another model for code understanding.
And now they don't.
Now they have one model with general purpose.
You can do everything and that's it.
And it's a question.
And I even read once, I think it was like a year ago, when they deprecated their code understanding model, the model that can understand code, they said that when they measured it, they saw that their newer model, like GPT-3.5 back then, is better. It's general purpose, so it wasn't trained specifically on code, but it was better at code understanding and text understanding than the model that was trained just on code and was supposedly better at just code understanding. And then they even said that training, because, let me go back.
They used to train one model only on code from GitHub and then another model only on text, like from Wikipedia and Reddit.
So this one model was fluent in text understanding and text generation
and another was understanding code generation.
And when they took the text generating model
and trained it on code,
they saw that it was actually becoming better
at text understanding, which is weird.
Like it wrote a lot of code
and then it suddenly became more proficient
and more fluid in like text understanding
and text generation. It's kind of weird.
Not sure anyone knows
why this happens, but this is
what they saw.
I would say that my guess is that
we will actually
have more generic
models that can do everything
rather than models
specific for specific
kind of tasks.
But on the other hand, we still see some models being better at some highly specific use cases
than others.
For example, I've seen a claim that Anthropic models are more creative than OpenAI's.
So if you want to write a poem,
maybe you should do it with Anthropic
and not with GPT.
Interesting.
So much to
see how it all plays out over time, right?
Yeah.
It's fascinating. I mean, as you said, you're right. We're only a year in.
Yeah, GPT-3.5 was a year ago.
Yeah, it's crazy.
And that has completely changed our world.
It obviously has.
I mean, we are, in our organization, at Dynatrace,
we're also using LLMs now to provide on the
one side natural language interface to the data that we have by asking human questions,
like regular questions, and then get the queries.
Also to help our users to do certain tasks in the tool or just like, you know, get help
on how to do certain things.
And one of the things we are also doing,
and I'm sure everybody's doing this,
but we are redesigning or optimizing also our documentation,
everything that is available publicly, to be more easily consumed by LLMs, right?
Because LLMs are obviously, you know, scratching or scraping websites
and then using this to train the model.
So I guess if you are optimizing the data that you put out there,
then it's going to be easier for large language models also to produce good quality answers and value out of it.
That's funny. That reminds me of like the whole, you know, optimizing for Google
and how everyone would try to cheat it by putting the metadata
and are people going to start trying to cheat on the...
I'm not saying we're cheating, obviously,
but just taking advantage of putting certain words in it
to trick the AI models.
One of the things that I'm always thinking about
and this reminds me of that,
is I think we haven't quite figured out what's the right interface
for interacting with these LLMs.
Right now, we have two types of interfaces.
One is the chat interface, that is widely common.
And the other one is the co-pilot interface.
And these are the two that are kind of succeeding.
I personally am more of a fan of the co-pilot interface. And these are the two that are kind of succeeding. I personally am more
of a fan of the co-pilot because it just integrates with my day-to-day work. I don't
need to do anything. It just works and it's kind of like magic. And it's interesting to see what
will happen there. What types of interfaces will we see in this domain in the next year or so?
Because this is still a novel question, in my opinion.
Nir, because we are kind of getting closer to the end of the recording here,
I wanted to kind of loop back to Traceloop,
which is the company that you founded.
I know you mentioned earlier OpenLLMetry. See, I got it right this time, hopefully.
So that's an open source framework. And what is TraceLoop then? Can you just quickly fill me in
and also the listeners? Yeah, so the open source OpenLLMetry is basically an SDK built on top of open telemetry
that allows you to log traces and soon metrics and logs using open telemetry,
which you can then connect to any observability platform that supports open telemetry.
And then TraceLoop is one of the potential consumers of this SDK.
So if you connect the SDK, if you route the SDK to Traceloop,
then you get metrics and alerts around specifically the quality
of your LLM outputs and generated content.
And this is the platform. Basically, the input is OpenTelemetry, which is coming from
our SDK.
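For readers wiring this up, where the data goes is typically decided at init time or via environment variables; a hedged sketch follows, and the exact variable and parameter names should be double-checked against the OpenLLMetry documentation, since they may differ between versions.

```python
# Sketch: where the traces go is decided at init time / via environment variables.
# Names below are taken from the OpenLLMetry docs but may differ between versions;
# double-check the current documentation.
import os
from traceloop.sdk import Traceloop

# Option 1: send traces to Traceloop's managed platform.
os.environ.setdefault("TRACELOOP_API_KEY", "<your-api-key>")  # placeholder
Traceloop.init(app_name="my-llm-app")

# Option 2: send to any other OpenTelemetry-compatible backend instead, e.g. by
# pointing the SDK at its OTLP endpoint (assumed variable name):
#   TRACELOOP_BASE_URL=https://collector.example.com:4318
# plus whatever authentication headers that backend expects.
```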
So if people listen to this and it's like, wow, okay, we also have a project where
I know we're trying to integrate LLMs
with our applications. You said the first thing that people typically do is download
OpenLLMetry to make sure that they get the telemetry data, then send it to your observability backend. This might be Traceloop or might be any other observability endpoint or backend. Do you know what are kind of like the two, three things, when people use your framework, that we want to give them,
that we want to tell them, hey, this is something you need to do,
this is something you will probably find, or these are some mistakes that you should not make,
just maybe as a good starting point for people
that get started now after this podcast?
I think that the most important thing
is to work methodically with LLMs. Most people,
before they start using our platform, do everything manually. They look at the outputs
that they're getting, and then they decide whether they like it or not. And when they make changes,
like when they upgrade a model, when they change a prompt, or when they make their pipeline more
complex, the way that they quantify how,
like if they've gotten better or not
is just by looking again at the outputs
and seeing, okay, this looks good.
Like this looks like a good summary, for example.
So I would say be like,
start measuring the quality from day one.
Like start, define your metrics,
define the metrics according to the application
that you're using.
And of course, potentially do it with Traceloop.
And then measure it all the time.
And when you make changes, measure the metrics and then see that those metrics that you've chosen have actually improved.
Don't just look at the text with your bare eyes and decide whether this
is what you want or not. You have to work more quantifiably from day one.
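A minimal sketch of what "measure from day one" could look like in practice: a fixed set of test prompts, a couple of chosen metrics, and a comparison against the previous baseline whenever the model or prompt changes. The metrics, prompts, and baseline file here are all illustrative assumptions.

```python
# Sketch: a tiny quality regression check to run whenever the model or prompt changes.
# The metrics, test prompts, and baseline file are illustrative assumptions.
import json
from pathlib import Path

TEST_PROMPTS = [
    "Summarize: OpenLLMetry instruments LLM calls with OpenTelemetry.",
    "Summarize: Traces show prompts, responses, latency, and token usage.",
]
BASELINE_FILE = Path("quality_baseline.json")

def run_pipeline(prompt: str) -> str:
    # Replace with your real LLM pipeline (model + prompt template + post-processing).
    return "placeholder output for " + prompt

def quality_metrics(text: str) -> dict:
    words = text.lower().split()
    return {
        "length_words": len(words),
        "distinct_word_ratio": len(set(words)) / len(words) if words else 1.0,
    }

def evaluate() -> dict:
    scores = [quality_metrics(run_pipeline(p)) for p in TEST_PROMPTS]
    # Average each metric over the test set.
    return {k: sum(s[k] for s in scores) / len(scores) for k in scores[0]}

if __name__ == "__main__":
    current = evaluate()
    if BASELINE_FILE.exists():
        baseline = json.loads(BASELINE_FILE.read_text())
        for metric, value in current.items():
            delta = value - baseline.get(metric, value)
            print(f"{metric}: {value:.3f} ({delta:+.3f} vs baseline)")
    BASELINE_FILE.write_text(json.dumps(current, indent=2))
```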
It's funny, that's the same CD model, right? I mean, like always testing, write
your code, check it. My database query went from one to five queries.
Do I want that, right?
Now you're just putting the quality in there.
Yeah, yeah, yeah.
Yeah, sorry, I said delivery.
I'm an idiot.
It's off today.
Something's wrong with me today, Andy.
No, no, no.
I think I'm dying.
Maybe your model is wrong.
Maybe you have some issues with your model.
Maybe this is the...
Who knows?
Maybe we're not talking with Brian Wilson.
Maybe we're talking with a model that was trained on only a subset of all of the podcasts that you've ever created.
And that's why you're off.
But the jokes are spot on bad.
So I at least got that right.
Exactly.
And you know, Nir, I don't know if this is, we don't have a time for this topic today,
but I wanted to bring it up in case either you know about this stuff, it might be a topic
for a future conversation, or if you know somebody who might be, or also for our listeners.
It got me thinking, obviously there's a rush to get all this stuff out to the market, right?
Faster you are on this.
And we know with speed comes sacrifices.
And one of the things that you said earlier about, you know, training on code and different things, I just started thinking of security in this stuff, right? Like, could people feed in a prompt to the chat which would break it, right? Or, just like SQL injection in a search, are there any different kinds of considerations when it comes to security in these things?
Is the industry around these places integrating security?
We don't see a widespread integration of security, but it's starting to catch on.
Are they picking up one of the special considerations, or are they just wild, wild west and going in hoping nobody starts to exploit?
So I think that's a whole other topic, but I don't know if you know people
or if that's something that you're aware of.
Yeah, it's definitely a huge topic.
And I think this is, again, something we need to deal with in the coming months.
It kind of relates to privacy, like privacy and security.
How do you make sure not to leak your own data to someone by accident or even someone trying to hack into your prompts and trying to extract
internal data that you didn't want to be exposed?
And yeah, I think there's a lot to be discovered there. And if you take traditional cybersecurity,
then we'll probably see hacks first and then solutions
and not the other way around, unfortunately.
Awesome.
Nir, thank you so much for enlightening us,
for telling us about what LLMs really are and do,
and giving us a little more use cases other than just give me a summary of a podcast,
which I sometimes try to do with ChatGPT.
I'm really also, like you, excited about the multi-modal or multi-dimensional
where audio, image, and text can be analyzed,
and then opening up new help for the
medical field, for instance, which you mentioned. I also like the quality metrics, like how you can
measure just multiple dimensions to measure the quality, accuracy, repetitiveness, also obviously
performance and costs, and then your tips in the end about starting to measure from day one and then continuously validating the quality metrics, because every change that you do to your model or to your prompts may have an impact on cost, on performance, or on the output, on the correctness of what's coming out. So that's really nice.
Yeah. And folks, obviously, Traceloop and OpenLLMetry is what you should check out.
We will link to these.
You sent me a couple links.
Any final thoughts?
Anything else that I missed?
I would like to hear Nir say OpenLLMetry, because I heard it a different way. So let's hear from the source, how do you say it?
I say OpenLLMetry.
I'm like, yeah, you hear the M because it sounds like OpenTelemetry, but it sounds like you have three L's in there. OpenLLMetry, yeah.
OpenLLMetry, yeah. LLMetry. LLMetry.
Yeah, it's awesome.
I like that.
Yeah, the other thing, too, is, you know, oftentimes we have new topics on and they quickly become my favorite. This is quickly becoming one of my favorite topics now. The aspect that we started with, which was testing for quality of subjective data, right? But doing that programmatically, it's mind-boggling on its own. And then pulling in the trade-off of quality versus performance, at least for now, right? That might change in the future, obviously, with things getting better. It's just opening all sorts of, you know, thoughts and ideas in my mind, and again, this is why we love doing these podcasts, we love having guests on like you. So can't thank you enough. If you have it on top of your mind, if not, what's the biggest thing you expect to see coming out of this field in this new year? Or are hoping to see come out of this field in this new year?
The Gemini models.
Like we've seen Gemini Pro, and then there's Gemini Ultra, which should be out sometime in the future.
I think it's going to be an exciting year looking at how these models evolve.
In just a year, we now have GPT-4 and Gemini.
Who knows what we'll have in December 2024. Awesome.
All right.
Thank you.
All right, everybody.
Thank you for listening.
We hope you all found this as fascinating as we did.
And thank you once again, Nir, for coming on, and happy new year, everybody.
Thanks.
Thank you so much.