Software Huddle - Practical AI for LLMs with Emanuel Lacić

Episode Date: May 22, 2024

Today we have Emanuel Lacić on the show. He was in academia for a while. Now he's been working at Infobip for the last couple of years, building some of this AI stuff and putting it into production. We picked his brain about the best practices when it comes to AI and what we can expect to see over the next couple of years.

Transcript
Starting point is 00:00:00 Hey, folks, this is Alex. And today we have Emanuel Lacić on the show. Emanuel has a super interesting background in AI, LLMs, all that sort of stuff. He was in academia for a while, so went really deep on this stuff. Now he's been in industry. He's been working at InfoBip for the last couple of years, building some of this AI stuff and putting it into production. So Sean and I were able to catch him live at InfoBip Shift. And he just told us, you know, what are some of the best practices, what we can expect to see over the next couple of years and things like that. Really interesting conversation. As always,
Starting point is 00:00:28 if you have any comments, suggestions, guests, anything like that, feel free to reach out to me or Sean. With that, let's get to the show. Emmanuel, welcome to the show. Thanks for having me. Yeah, I'm super excited to have you here because we're at InfoBip Shift here. You're talking about chatbots and LLMs and all that stuff, which is a very hot topic right now, but you have not only some practical experience, but a lot of deep research background on that, so I'm excited about that. Could you maybe introduce yourself and tell us a little bit about your background?
Starting point is 00:00:55 Sure, sure, yeah. I started with an engineer, I was at college working already, and then my first job was also like a normal engineering developer role And then at some point I said like okay Really, I want to like maybe try out a bit of science and research because it's really interesting especially kind of some research topics, so I went to Austria for Work at the University of Graz so there basically started my PhD then moved to a company where basically what it is and this is what really excites me like this applied
Starting point is 00:01:29 research because a lot of people like who just go like research work on science stuff yeah they're like okay I mean this like theoretical thing like I don't have I don't see the application where it ends up and this was a really I would say a nice mix-up because I had the opportunity on the one hand with my team, okay, let's really do state-of-the-art research, publish at top-tier conferences, work together with the research community, and then also at the same time working with different companies in the industry. Hey, you're the experts in that kind of field in AI. Can you work with us? Let's build something cool and deploy it and really see how does
Starting point is 00:02:05 it perform, is it helping their users and how can we improve on that. And usually when you combine those kind of things, then you see a bunch of different problems. Yeah, that's where the magic is though, making it real for a lot of folks. That's awesome. Can you tell us about what your research area is sort of focused on? We're a specialist in basically recommender systems. So these are basically the algorithms behind how Amazon decides what product to show you,
YouTube, how does it decide what kind of content to show you there, Spotify, how does it generate the whole playlists. The applications are really far and wide, but usually whenever you have some kind of user which interacts with some kind of content, that's basically it. And this is basically a subfield of information retrieval, where you're basically searching for stuff and trying to show it to the user before he even explicitly requests it. Basically helping him to have less burden of this information overload, because I don't know how many videos are on Instagram, like Reels.
Starting point is 00:03:14 No way I'm going to search through everything. Are one of those recommender systems based on essentially building some sort of model representation of the user so we can say, say like my model kind of looks like your model and we know what you did and essentially can predict what you know this user is gonna do based on that or is there a lot of times based on the similarity from the content where this content is related to this other content and we've seen people kind of bridge the gap of those things so we're gonna essentially recommend that next it's that and combined together and on top of that different
Starting point is 00:03:45 kind of hybrid specialized models so you have basically the recommended system research community is like i would say really uh uh it's highly uh collaborating with the industry so there is a lot of different outcomes out there so you have like specialized models how to like just model the user perspective its behavior from neural networks some other methods then what do you do when you have a lot of user data logged in users but what do you do if you have just like anonymous user sessions those kind of things then depending on what you're trying to achieve for instance if you think about the human resources domain, so if you're matching candidates with job opportunities, maybe you need to ensure fairness, that you're not discriminating
Starting point is 00:04:33 against some sensitive attributes, those kind of things. You also have to consider, let's say, the perspective of the content and also the user perspective and you have also seasonal factors. Usually it is a combination of multiple models in a huge pipeline and it changes over time. So what works today doesn't mean it's going to work in a few months. So there's constant innovation in that field.
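To make the item-to-item flavor of this concrete, here is a toy sketch of the kind of algorithm family he means: score unseen items by how similar their interaction patterns are to what a user already consumed. The data is made up, and production systems layer many specialized models on top of this, as he says.

```python
# Toy item-to-item collaborative filtering: recommend items whose
# interaction patterns are similar to what the user already consumed.
import numpy as np

# rows = users, cols = items; 1 = user interacted with the item
interactions = np.array([
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
], dtype=float)

def item_similarity(matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between item columns."""
    norms = np.linalg.norm(matrix, axis=0, keepdims=True)
    normed = matrix / np.clip(norms, 1e-9, None)
    return normed.T @ normed

def recommend(user_id: int, k: int = 2) -> list[int]:
    sims = item_similarity(interactions)
    seen = interactions[user_id] > 0
    # score unseen items by their similarity to the items the user has seen
    scores = sims[:, seen].sum(axis=1)
    scores[seen] = -np.inf  # never re-recommend seen items
    return list(np.argsort(-scores)[:k])

print(recommend(user_id=0))
```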
Starting point is 00:04:58 Yeah, and historically I feel like a lot of AI has been driven or AI innovation has been driven from research community but now especially with like LLMs and the cost of GPUs running this stuff at scale is a lot of the innovation now going to be driven by industry or is it sort of a combination of the two? Actually I would say I mean this is the thing that's most fascinating, it has become so much easier for like anybody to like try new stuff out. I would say this is exactly the case, a lot of, now more innovations come from the industry.
Starting point is 00:05:36 Usually it's, I think it's always a collaboration of both, but I would say the main way how to achieve innovation is like just to scale it up as many people as possible. It doesn't matter if you like have a scientific background. Basically, there is always, there's a, you don't know, like maybe there's somebody who really now has the motivation, has got an idea and he tried, I would say the best thing that you can do is now like open it up and try to scale it up as much as possible. Yeah, and someone with a different background might actually be able to draw the dots together in a new innovative way where someone who's maybe coming from more of a traditional sort of theoretical computer science background, they're kind of locked into their way of thinking
Starting point is 00:06:19 and they're not able to necessarily connect the dots. I can actually give you an example from, I mean, again, a bit of research, but it doesn't matter, it's maybe just a parallel so this cross collaboration with different people from different backgrounds I think it's the most like useful thing that you can do for instance we like in my previous research team we collaborated with psychologists so they know about some psychological models from human memory theory, I know, a lot of decades ago. So they just, by collaborating with them, you see, okay, maybe this is some kind of model that we can
Starting point is 00:06:52 test out as an algorithm, adapt it, and then out of that, for instance, like, yeah, we came up with some other, basically, like some new algorithm to recommend stuff based on recency and frequency, how the human brain basically forgets stuff, so those kind of things. Usually there are many other applications, so I would say the best way that you can do is really collaborate with different kinds of backgrounds, because you're in your own filter bubble and usually don't see far outside of it. You mentioned a few different companies that use that use this stuff like Netflix for videos or Spotify for music, Amazon or Etsy for products. Are they using pretty similar patterns or are they
Starting point is 00:07:32 gonna be pretty different and distinct depending on like what sort of area? So I would say every larger company that's like a huge pipeline of different stuff and you have specific teams who focus on optimizing let's say some kind of KPIs depending you can for instance have let's let's imagine a website so you have the home page I mean this is a really simple example so you have the home page and then you have some kind of details page of a specific product usually we have like different maybe algorithms or even like teams working on specific parts where you're gonna show that content and it's same thing holds depending on like
Starting point is 00:08:08 what's the scenario or the use case that you're trying to improve on there are then different things of course they're sharing the the family of algorithms but it's always domain specific there is no like silver bullet this one algorithm is gonna work on your example in your because you have different kind of data, different user behavior, especially I would say the user behavior is here the most important thing because how I'm like behaving on Netflix, it's not the same thing as I'm on YouTube or Instagram because like my attention span is much shorter. So just like give me now, now something.
Starting point is 00:08:49 So I would say the user behavior drives it mostly and also the content, which is on that. You mentioned how models sort of get out of date. You have to be always kind of updating your model. Is that because the baseline is shifting? It's like, it was this good, but now everyone's getting that good. We need to get better. Or is it because, hey, the model's even shifting user behavior and different things? or what accounts for
Starting point is 00:09:05 sort of one thing is really I would say fruitful community so a lot of progress is being made over the past years and still it's an ongoing thing so out of it people just like to try to do some stuff and then it also depends as I said like based on the user behavior. Something if you have, let's say, seasonal patterns, then something was going to work on in winter, during Christmas, is not the same kind of algorithm. So you have to maybe like you're specializing the algorithms based on some kind of an objective, let's say, in that timeframe.
Starting point is 00:09:42 Or another thing, you're changing something in your product, in the UI. So you're changing basically also the way how users behave and how they perceive the content. And even some small things, how in the UI you want to present something, may then require that in the end you're gonna change the algorithm because of that UI change, users perceive something. For instance, you have this position bias, this effect in human psychology, where people are more likely to click on the first thing. So when did you go last into the second Google page? Yeah, never. Yeah, exactly. That was a really, really needed answer. People then depend, if you're then showing on some list or a car carousel or if you see like a netflix has like this different like
Starting point is 00:10:26 uh carousels one in each other so if you look like at uh like visual representations of the uh from the eye tracking software so they usually like focus on this like first down right and that's that and then like if you go to this infinite scrolling again they're different kind of behaviors and usually if you have your own product, you're going to change things up. You want to stay relevant. You could want to be better than a competition even because of that. And then let's say one thing, it's not only, okay, what's the underlying algorithm, but you have like a specific field of like how to explain what is black box model.
Starting point is 00:11:01 Why did it show me that? And this again, like is like complementary to this baseline algorithm so now you're stacking on top of that other other things and then it's a combination of everything yep where are we at with that sort of thing like having this block or black box model or getting some explanatory power on that are we are we getting better understanding why the model is recommending what it is I would say this is one of the really it's getting much more traction, especially in Europe.
Starting point is 00:11:26 In Europe you have now this AI Act, where the idea is really to try to bring more transparency and fairness to those black boxes. So depending, let's say, especially if you are applying, it doesn't matter the recommendation, but any AI, let's say in some high risk domain, let's say health, or really impacting some kind of people, you may be needed that, okay, you need to explain those.
Starting point is 00:11:54 And a lot of things have been done, and they're gaining much more traction, and especially I would say now with large language models, like now this generative AI trend, it gets, at least as I saw in the recent trends in the research community, much more people are working on it, and not only. So this is, I would say, one of the things
Starting point is 00:12:16 that is also highly driven by the industry, because it has been shown that, by even if you don't need to, if you explain to your users why they are getting something, it correlates really with the acceptance rate and their engagement. So you're actually inclined to, hey, do this. The more transparent you are, the user actually trusts more your platform and is less likely
Starting point is 00:12:40 to churn or whatever. Yeah. So you brought up LLMs, and I know for me, I would say they really came on the scene with the release of ChatGPT. I'd sort of heard about them before, but I wasn't really paying attention. I would say for a lot of engineers, that was true, and definitely the general public.
For you, that was more steeped in this research. Did ChatGPT seem like a big shift change or step change for you? Or was that just more like, hey, I've seen GPT-1 and 2 and 3, it was sort of a progression? I was aware of the models at that time. But I think this is a really nice example. It's like collaboration with the industry and science community. Because if you're able really to open it up at that scale, then, I
would say, like with ChatGPT when it came out, it's so easy, like here, go to the website, use that API, you're able to bring so much new kind of research. So the thing is, let's put it in perspective: you're now some kind of scientist and you're working somewhere, you don't have the necessary resources, like how are we gonna pay the cost of the hardware to host your own LLM, to train it for how many hours. I mean, this is really remarkable how many man hours and how much money have gone into that, so that in the end there is some kind of interface and people can use it. So usually I would say the biggest problem
Starting point is 00:14:06 like usually in science and research is reproducibility and basically to be able to know to whatever somebody has said hey we've done this how can i now run it and use it and now basically somebody has said hey everybody use it so this is I would say one of the driving factors yeah and I think one of the things that like chat GPT did which was unique was it become essentially the user interface for AI whereas I think before and much you were you had some familiarity with AI like it was kind of this thing that sounded like science fiction I think to non like technical people that were involved with it.
Starting point is 00:14:46 And it was hard to sort of explain what are the things that you can do with it. And then suddenly you created this super useful tool that you can point people to. It's like, here's an application of AI. Yeah, I mean, this is, I would say, the combination. You can have the best AI model, whatever it does, but if you don't like apply it in the right way, like it's not going to maybe be useful to anybody. So from their side, it's a combination of also like from the whole engineering department, how to make it user friendly, how actually to be usable.
Starting point is 00:15:21 And usually it's collaboration between between as you said like hey you don't go out of your filter bubble work with multiple i think that's the perfect example and i mean uh for instance like one of the really good benefits which came out of that so now you have like multiplying projects and github frameworks with train your own lm fine-tune it on uh if you have like like some kind of user feedback, which where people, so you want to do some reinforcement learning, or there are now some new human-aware loss
optimization functions where, if you have implicit feedback, then here you can tune your model even better. So now you have many more things which weren't available before that. Yeah, very cool. Okay, so this is a good time I think to shift into your talk, which is about whether prompting is enough. Can you just tell us maybe about your talk? Yeah, so what I'm going to talk about in general, so now with also this, I would say, rise of generative AI, making co-pilots, so the most
prominent kind of co-pilot, like GitHub Copilot here, it's gonna generate the code. Midjourney is also an example of a co-pilot, it generates the image. Microsoft now with Windows and Office also started to put those features in, but the trend is basically many software technologies are gonna want to have a feature where the user just writes natural language and that natural language should now trigger some kind of action or do some kind of task. Does ChatGPT fall into Copilot or is that just like a different thing?
Starting point is 00:16:57 So Copilot is, let's say, the general term of those kind of behavior. User writes something, text, whatever, I want to do something, and now your domain specific software, it doesn't need to be on, let's say GitHub, but your own software which usually does something now translates this into some kind of meaningful action. But the input is like natural language from a user. I want to tell you what you want to do. And chat, let's say, OpenAI,
Starting point is 00:17:28 its services can be used for that. You can just make some kind of, let's say, prompts. And what I'm going to go into my talk is, okay, sure, you would start with prompting, try to see can you maybe that natural language transfer in some kind of code block or element that you can use in your own domain specific solution but at least like in our case we had to even fine tune or we wanted to like with prompting you can achieve I know that much so in my talk I'm gonna go into direction okay how you even gonna measure how good is it performing or not so like is it hallucinating again for your domain what does actually hallucinations mean how accurate are you with your predictions and then if you like want to go over
Starting point is 00:18:13 that you can either fine-tune commercially available so from GPT 3.5 or open-source models so this is like what we also did at InfoBip and where I'm gonna show, okay here is a process if you would now like this is this was like our process but usually it's similar there are already some interesting research papers and blog posts and they go over that how what you need to do but usually it's like domain specific data you have to set up again the whole pipeline to know in your domain in your business scenario what you want to achieve yeah I want to get into specifics there but
maybe before that, do you think, because I feel like there's this growing trend where people are building more and more co-pilots, or essentially some sort of free form text input powered by LLMs, do you think there's, from a user experience perspective, like is there danger with us stuffing co-pilots into every piece of tooling? It's kind of like in the era of social media when that first came out, like suddenly everything has to have a social component whether it makes sense or not. And if you give people free form entry, sometimes that's too much choice and it can actually lead to a bad user experience. I think the last part is you nailed it perfectly. I think the one thing that it looks like it's
Starting point is 00:19:37 going to happen, like everybody, like users are going to expect there is some kind of functionality, but it's now your job to say, hey, look, here, it doesn't make sense. Because you're right, it's not at every step of the process you need to have those kind of, some things just don't make sense. It's easier, I know, to do some kind of different maybe UI solution or some backend, something doing. So I would say everybody, just because of the hype, will want to have some kind of features. But in the end, it's going to over time go down and say,
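The co-pilot pattern he describes, free-form text in and a domain-specific action out, roughly looks like the sketch below. The allowed actions and the call_llm helper are illustrative placeholders, not Infobip's actual interface.

```python
# Sketch of the co-pilot pattern: free-form text in, a structured,
# domain-specific action out. The intents and helper are illustrative.
import json

SYSTEM_PROMPT = """You translate user requests into one JSON action.
Allowed actions: send_message, schedule_campaign, none.
Reply with JSON only, e.g. {"action": "send_message", "args": {"to": "...", "text": "..."}}."""

def call_llm(system: str, user: str) -> str:
    """Placeholder for whatever model you use (hosted API or self-hosted)."""
    raise NotImplementedError

def natural_language_to_action(request: str) -> dict:
    raw = call_llm(SYSTEM_PROMPT, request)
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return {"action": "none", "args": {}}  # fall back instead of guessing
    if action.get("action") not in {"send_message", "schedule_campaign", "none"}:
        return {"action": "none", "args": {}}  # guard against hallucinated actions
    return action
```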
Yeah, cool. So based on your sort of experience with building these chatbots, I want to walk through some of the questions I hear come up a lot and just hear maybe your experience with it. So maybe let's start with just choosing a model, right? You know, I think OpenAI and the GPTs sort of kicked this off, but we've got all the Anthropic models. We've got a bunch of open source models, we've got Llama 3 that just came out. In choosing a model, how much should I just be going with the standard OpenAI type stuff? How much should I be looking at other ones? There is this, I like this principle, fail fast, fail often. So whatever helps you to fail as fast as possible. And this is, I would say, one of the benefits with, I mean it's not
Starting point is 00:21:06 that I'm doing a commercial like I mean you if you have the necessary engineering expertise and you have like I just deploy some kind of model and really even like with with Lama 3 and now like this is really easily available if you have the necessary hardwares or structure just run it or even you have like smaller models we can whatever basically helps you to make a proof of concept as fast as possible because i would say this is uh this is the most important thing because you don't want to spend i don't know several months into whatever openai makes it and makes it really easily accessible but uh the same thing like if you have the necessary
engineering expertise, you can run your own model, or if you already have some kind of GPU or a hosted instance, I mean, it's not hard to get, so it's possible. But just don't, I would say, over-engineer in the beginning. Just try to see if you can come up with something useful, and then based on the insights think about, okay, good, now how can I improve, if it brings me value. It can be that, okay, look, as you said, maybe it doesn't bring me value. Maybe this AI-assisted solution isn't that kind of useful. You have to test it out.
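A minimal sketch of that fail-fast setup: prototype against a hosted model, but keep the call behind one function so it can later be pointed at a self-hosted, OpenAI-compatible server. The local URL and model names here are assumptions for illustration.

```python
# Prototype against a hosted model first, but keep the call behind one
# function so you can later point it at a self-hosted, OpenAI-compatible
# server (e.g. vLLM or Ollama) without rewriting the app.
from openai import OpenAI

HOSTED = OpenAI()  # uses OPENAI_API_KEY from the environment
# Hypothetical self-hosted endpoint; vLLM and Ollama can expose this API shape.
LOCAL = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def complete(prompt: str, client: OpenAI = HOSTED, model: str = "gpt-3.5-turbo") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

# Fail fast with the hosted model; swap the client/model once the idea proves out.
print(complete("Summarize why offline evaluation matters, in one sentence."))
# print(complete("...", client=LOCAL, model="llama-3-8b-instruct"))
```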
Starting point is 00:22:18 You have to test out. One of the trends that I've heard recently and I I'm seeing from like customers that I work with and stuff like that is that a lot of people will start probably in that mindset of fail fast. Let's start with like open AI or some sort of public model where it has an API interface and I can get up and running quickly, but then they reach a point in that journey where now they need to take this production and they need more control. So that's, they end up sort of going more of a sort of private LLM route where they can really sort of adjust the knobs as needed. Is that something that you think is like necessary for the era that we're in, in terms of the technology right now that at some point you do need to get in there and kind of like tweak
Starting point is 00:23:01 things in order to create a really good experience? That's it. On the first thing, a lot of times it's also money thing. So once you have traction, that you have let's say a larger amount of requests, then it makes sense where you would like want to save up costs. But yeah, I would say then at this point it really helps you like improve on your specific use case. Because then you have the power also to tune the model, to adapt it, you have more control over it. Because at the beginning I would say maybe it's not that much important just to have something where you can gain traction.
Starting point is 00:23:37 But once the traction is there and you're running and you're thinking about, okay, do I need a higher throughput, do I need to maybe like to want to improve that model to boost user engagement or really to, I don't know, add some specific steps of my user journey to help them achieve better results, then it makes sense. And I would say it's more like of a process. You start with whatever helps you to be as fast as possible and once you like reach a certain threshold, then you move to a privately hosted. I mean you have like different let's say frameworks which already help you with that so it's a community which really grows and already
Starting point is 00:24:13 now there is a lot of things there. Yeah one thing I saw someone talking about recently is whether you should maybe start with a more powerful LLM and then move to a weaker one to save on costs or go the other way? Do you have any thoughts on that? Like, hey, should I prove it out first and then try and go cheaper? How do you think about that? We went with this, like, go with a more powerful LLM and then cut the cost because it's like with, I don't know, it's sort of like investment. You want to see as fast as possible, yeah, well, like, what's possible? Sure, the costs are like, okay, we're gonna see in a given timeframe, let's say a month, something,
Starting point is 00:24:49 is it useful? Because it could be, if you go with a small, not that sophisticated model, that you think, okay, there is no way this solution is not gonna achieve any traction, but maybe the only reason is because you went with a smaller model. I would say the beginning costs are justified. You just have to take it. And if you're starting with a more powerful model and you get something that you're happy with and then you go to sort of like a
Starting point is 00:25:15 cheaper smaller model, then does it also help from a testing perspective? Because you can essentially compare from an input to output perspective, like this is what we get with the higher cost, more powerful model. Are we able to essentially replicate that for our specific use case using the lower power, less expensive model? Exactly. It's a trade-off. So you have to see, okay, now, am I satisfied maybe with a little bit less performance,
but let's say on how accurate the model is, but in the end, if it's enough, save on the cost and the runtime and the throughput. But at the same time, one of the recent trends is using really larger models to fine-tune your own smaller models. It's like what I'm also gonna talk about in my presentation, there are already some smaller models which, let's say, used Llama 2, they pruned the bits of the model which were not necessary, but in the end it's still possible to achieve good performance with those kind of models. So in the end there are already new strategies coming up,
Starting point is 00:26:25 how to start with a larger model and then use that knowledge to, let's say, find your specialized. I think the trend will go more into the direction of a lot of domain-specific specialized models, which are smaller, more performant, like expert users or expert models,
Starting point is 00:26:43 you could say. Yeah, I've heard that with a company that I talked to that is working on deploying models to essentially like phones and other edge devices. So they have to compress the model. And then when they compress the model, presumably you're going to lose some level of accuracy, but they can then fine tune it against the original model to get a compressed model that actually has some more performance. Exactly. I mean, if you have, let's say, I don't know, the amount of, let's say, content that you're
Starting point is 00:27:12 generating is like, if you only usually in your use case, like, need 20% of it, then you don't care about 80% of the rest if it doesn't, like, correctly generate or predict. It's more depending on yours and like we have phones are really I would say a perfect example of that because I think this is one of the like one of the areas which are now also gaining traction how can I now make my own language model for an Android iPhone whatever some kind of like a smartphone which it's now running on there because I don't need to then waste time or network to send data. And I also then have basically improved accuracy, or improved privacy, because privacy aspect
Starting point is 00:27:55 is also something which now people will, I would say, start to think more and more as time goes on. Yeah. And there's certain applications where if you're doing it you have to do it on device in order to support the use case like real something like real-time translation you're not going to be able to do like a network call in there to do translation and then come back like it would never ever work. Yeah no so this is I would say one of the one of the next I would say big construction sites that are going to happen.
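One classic version of the "compress it, then fine-tune it against the original model" idea mentioned above is knowledge distillation, where a small student is trained to match the larger teacher's output distribution. A sketch with stand-in tensors rather than a full training loop:

```python
# Knowledge-distillation sketch: a small "student" is trained to match the
# softened output distribution of a larger "teacher" (plus the normal label loss).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # soft targets from the teacher, temperature-scaled
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)  # ordinary supervised loss
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Stand-in tensors; in practice these come from the two models on real batches.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```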
Starting point is 00:28:26 Because yeah, that's a problem. That's something which really helps. It improves a lot of user experience and especially if you're traveling somewhere, you don't have internet at that point in time, you still want to have some kind of functionality and these are real cases. Yeah. You talked about moving from the platforms to private LLMs for control and different things like that. What about just purely on cost?
Starting point is 00:28:48 If I'm maybe using a GPT 3.5 or I'm using an equivalent model that's private and I'm running myself, am I going to be able to save a lot of money doing that or is it not going to net out? Let's have a look. One thing, maybe just to relate to that, to this previous question. One of the like also like this cost saving strategy is like to mitigate the cost to somebody else. So let's say if you're running that model of somebody's mobile phone, so you're not paying for that. Right. So those are like one of those like strategies that you can do.
Starting point is 00:29:20 But usually, yeah. So once you go like bigger in the scale, you're gonna have, I would say, better cost control with your own models. And I think this is also like if you like when you perceive like somebody writes in Reddit or somewhere, chat GPT works now worse or better. I mean, usually companies also like experimenting background, see, okay okay if they can maybe like save some kind of cost maybe improve or like where is this trade of like how good like this is something that i think every company with every ai model uh is doing regardless if it's a large language model or any
Starting point is 00:29:57 kind of different ai and then does um that take into account also the fact that you might have to hire new people that have the like domain expertise plus there's an infrastructure costable like it's not just like hey can i get this thing running on a server i now need to i'm now responsible for scale uh reliability i mean there's a lot of value that is why people go to like public cloud services like i'm not gonna i don't want to run my own data center. People still make sometimes the choice to run their own data center, but it's now like I would say if you're in cloud, it's a choice that you're making at some certain level of
Starting point is 00:30:34 scale, but there's always going to be these trade-offs from the maintenance perspective. I think it's the same progress as you had like with DevOps. Now you have like MLOps. This is now even growing and if you compare these charts which show all the tools related to machine learning operations a few years ago to now, it just grows. This is something I would say also, a new field of expertise that roles that the company will need to acquire because there is a long way, I mean it doesn't need to be that long but there there's like several steps if you go like from just a purely experimenting
Starting point is 00:31:14 with some kind of model until it really runs in production and you're I know even thinking about having multiple of them switching between them measuring online performance and incrementally doing some kind of updates. So that's a burden which it's not like only one person or one role can manage. You have to have specialized roles for that. Yeah, you're in charge now, not just of deployment of the model,
Starting point is 00:31:39 but all the updates and upgrades as well. Exactly. Because just if you think the amount of time, how like let's say now if you're just specializing in data science in this models now I want to like spend my time into thinking about okay what kind of data maybe brings me a better improvement? But if I only I mean it makes sense to focus on that but if I also need to focus on how I'm gonna deploy it and something and you're just like the day has 24 hours so you will have to cut somewhere off so I would say if you as you like grow like in scale
Starting point is 00:32:15 the usage of your own elements you're gonna have like specialized roles in those kind of things you will need to have. Yeah you mentioned fail fast or even like deploying improvements to your models, things like that. How are you even measuring improvements or measuring failure, things like that with this non-deterministic system? What are some patterns that are in there? So usually you would, let's say in our case, so we know, I would say let's see like this like this like 80% of the any work on some kind of let's say AI project is make preparing the data that you actually are able
Starting point is 00:32:55 to make an offline experiment which simulates as good as possible the online behavior this is what you're trying to achieve to have let's say in the talk about LLMs, so you have the input is something written by the user. I mean you have also open source benchmarks which you can then use and then on your test you're trying to see, okay this was the input, this was the generated output and then you have different accuracy metrics or even some like one of the let's say recent trends is use a large language models as a judge like because to grade the accuracy yes or to grade even like to grade to grade like the from the psychology perspective the user acceptance like
because, for instance, this is one of the recent papers that I read, I mean, there are large language models which are trained on Reddit data, so Reddit is a really, I would say, interesting community, people write really interesting stuff. So what they did, let's say, in that paper is to see, okay, how would a Reddit user upvote or downvote, given that input. This is the rule: let's say this is the generated output, if that would be on Reddit, how many upvotes or downvotes would you get, or even like
Starting point is 00:34:16 how likely is it that your let's say output would generate and this like going as a like a tree like structure to go like downwards so how many more collaboration or how interesting is it response so one of the trends is gonna be like hey let's use chat gpt4 to tell me what a user how likely a user would be accepted is it interesting for them uh is it correct those kind of thing because the hypothesis is because i mean the large language models have the name also large in them so a large amount of data has been used to train them so the inherent knowledge is somewhere in there so the the task is how can I now get that knowledge back basically to see hey given whatever you have been trained on
Starting point is 00:35:01 and there's a high probability basically you can maybe tell me if now given whatever my generated the responses if it's correct or not or if it's interesting or not so those kind of like new like evaluation measures are going to like I would say pop up even more yeah one thing I've when I've talked to people and I talk about hey what are the ways I can improve the results I'm getting from my co-pilot type system, things like that, I think I could make my prompts better and work on that. I could choose a better model that would maybe fit better. I could fine tune what you brought up a few times.
I could do RAG, retrieval augmented generation. How do you think about those four? Are there certain ones you say, hey, try and do these first before you go into these, or I'm not seeing much effectiveness with these? How do you think about those different approaches? It will depend on the domain, because okay, let's say retrieval augmented generation is one of those popular architectures now which everybody is using, but if, besides just retrieving stuff, you want to do some kind of action, so you have to have some
kind of way to understand, am I doing a RAG scenario, do I need to do something else. And there are now also other strategies which try, before you even do that, to say, okay, for this one you actually have to call something, for that one do something else. I think retrieval augmented generation was just the most popular because it's really also easy to understand and it falls into a lot of common scenarios and use cases which people can apply it to. And maybe easier for like a non-trained, like a normal engineer to do compared to fine-tuning. We have different vector stores which are available. There are already frameworks for how to convert the text into some kind of embedding. Basically coming back to that, like fail fast, fail often.
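A bare-bones version of that retrieve-then-prompt loop, with a deterministic stand-in for the embedding model so it stays self-contained; a real pipeline would plug in an actual embedding model and vector store.

```python
# Bare-bones RAG retrieval loop: embed chunks, find the nearest ones to the
# question, stuff them into the prompt. The hashing "embedding" is a stand-in
# for a real embedding model; swap in your model and vector store of choice.
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Deterministic toy embedding (NOT semantically meaningful)."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

documents = [
    "Refunds are processed within 5 business days.",
    "Premium accounts include phone support.",
    "Passwords can be reset from the login page.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(question: str, k: int = 2) -> list[str]:
    scores = doc_vectors @ embed(question)  # cosine similarity (unit vectors)
    return [documents[i] for i in np.argsort(-scores)[:k]]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How long do refunds take?"))
```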
Starting point is 00:36:50 It's really easy to fastly develop something. And now you have to see how to continue with it. Because we also saw it depends on really your use case. If it's just like question answering, then Rag is something. But usually then once you start with that, you see like, oh, okay, we have to continue with something else. Yeah. And then with RAG, like, I think it's easy to kind of like, it's easy conceptually, and then it's easy to probably get something out of a prototype standpoint, but like to go beyond that, at least from what I hear, like that, that's like where it takes a certain level
Starting point is 00:37:22 of expertise and there's a lot of tweaking and figuring out how do I actually get the correct context, and not too much, in order to get a result that I need. That's where I think the hard part is. Completely correct. And then you have some other effects that may happen that you're not even aware of or that you want to control. Let's say you have some content that you're in your vector store that you're
Starting point is 00:37:51 retrieving and this comes then in the generated response. So one of the like one of the things that can happen is that then your model is like biased towards popularity that depending how you uh how you implemented your retrieval stage that some parts of the content will just in the end have more likely we will be more likely to be seen so and then the question is okay in your rug pipeline do you have some kind of mechanism to track how often something gets retrieved like do you even know if this is a problem or not if yes okay can you maybe as i said like maybe have adapt this retrieval phase to get a bigger context or try to like say maybe say hey listen this this paragraph has uh this kind of type of content retrieved a lot maybe you want
to give some more serendipity or something, depending on what you want to do. Yeah, do you need to, when you're doing RAG, it probably depends on the use case, but do you need to add some level of randomness or variance in there so that you're not always getting, maybe you don't always want the perfect context based on probability, similar to how an LLM, when you're generating the next token,
Starting point is 00:39:03 you don't always want the most high probability token because when you do that you actually get language that isn't that great, like a great response, super repetitive and stuff like that. So they have to introduce some level of randomness. We'll now discover a really interesting research project to really find out when is the right time to... When to hallucinate or when not to. I mean, if you go back in machine learning, you have different algorithms like multi-armed bandits, those kind of things. Usually there's always this question of when I'm doing exploration versus exploitation.
Starting point is 00:39:39 This is usually a task that you need to solve. In the beginning, you're not thinking about it. Sure, like, why would you? But then once, like, okay, you have something that works, then you start to think, okay, when should I, like, maybe focus more, like, exploring different stuff? Or maybe sometimes you really need to be factual. You have to see, okay, now this is, let's say, the legal domain.
So if you're, like, I don't know, retrieving stuff which may help your attorney win some kind of case, do you want to explore a bit differently? It depends where you're applying that model. And also, especially let's say with some kind of randomness factor, do you want to maybe sample the tokens to be, like, more engaging? But then in the legal domain you're really factual and you have to be explicit in what you're saying.
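The randomness knob being discussed is usually temperature on next-token sampling: near zero it stays greedy and factual, higher it spreads probability mass across more tokens. A toy sketch:

```python
# Temperature sampling over next-token logits: low temperature ≈ factual/greedy,
# higher temperature spreads probability mass and makes output more varied.
import numpy as np

def sample_token(logits: np.ndarray, temperature: float, rng=np.random.default_rng(0)) -> int:
    if temperature <= 0:  # degenerate case: plain argmax
        return int(np.argmax(logits))
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # softmax, numerically stable
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.5, 0.2, -1.0])  # toy scores for 4 candidate tokens
print([sample_token(logits, 0.0) for _ in range(5)])  # always token 0
print([sample_token(logits, 1.5) for _ in range(5)])  # mixes in other tokens
```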
Starting point is 00:40:36 There are certain times when I want a memorized result. If I'm asking for what is the quote that Mark Twain made about San Francisco or something like that, I want the exact quote. I don't want a proxy to that quote or something like that. But if I'm saying like, tell me Alex's social security number or something like that, then I don't want it to give me a memorized result there. So there's some times when memorization is okay.
Starting point is 00:41:00 And then there's other times where you absolutely want to guardrail against it. And I think that's one of the triggers. It's hard to know, like differentiate at the LLM. Exactly. How can you also safeguard if, okay, you have whatever you don't care let's say in the data that you're using for retrieving, but maybe you have like some sensitive data like personal identifiable stuff. So how can you ensure that you're anonymizing the stuff at the right amount of time. Yeah. Not in the beginning, maybe later. So what I think it's gonna be is, or it looks like that, that people are gonna just build
Starting point is 00:41:33 pipelines and adding new, let's say, features which, you know, like anonymize this stuff or now make sure, hey here want to double check make an additional prompt to to get exactly that kind of content which is said or some kind of before even prompting to see okay now i have specialized let's say prompting strategies or even models depending on the task and first it's going to be like a router here go left and then he can point you to the right direction it's going to be i would say here, go left and then he can point you in the right direction. It's going to be, I would say, a more complex pipeline as we go in time. Yeah, I think you have to move it to the pipeline level where basically the governance around the data is better understood than trying to do it at the LLM level.
Starting point is 00:42:19 Because essentially what you put in the LLM, it's a little bit like you have a soup at that point. It's like a broth. So if I have a broth, I can't essentially... You can't pull out the potato. Yeah, I can't pull out the potato with my spoon. But that's essentially what you're trying to do sometimes where people are trying to essentially shove that responsibility onto the soup to try to differentiate between potatoes and carrots, but it's much easier to do that at the pipeline level.
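One small piece of that pipeline-level governance is scrubbing obvious personally identifiable data before text ever reaches the vector store or the prompt. A deliberately simple sketch; real systems use NER-based redaction rather than a couple of regexes.

```python
# Pipeline step sketch: redact obvious PII before text reaches the vector store
# or the prompt. Real deployments use NER-based redaction; these regexes are
# deliberately simple and will miss things like bare names.
import re

PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def anonymize(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

raw = "Contact Ana at ana.kovac@example.com or +385 91 234 5678 about her invoice."
print(anonymize(raw))
```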
Starting point is 00:42:44 Exactly. I mean, even like if you talk about like finger pointing, like if you like generate something like bad, like something unexpected, you're going to get a finger point. Hey, why did this AI now generate this? So yeah, it goes back to the explanatory stuff. So I would say it's a nice like of different LLM related features or action plans which are going to go. Do you think RAG is here to stay or do you think it's one of the better tools that we have at the moment? I would say information retrieval for a long, has been relevant and is still relevant. So it's going to evolve into something.
So one of the really recent trends that we now saw is generative information retrieval. So I think this is now the next step where people are experimenting, where you want to train your LLM to already generate what should be retrieved. So the current most popular case is, okay, I have some kind of index and then I'm doing some embedding comparisons, or some kind of strategies with maybe re-ranking, whatever, and then I'm using that with OpenAI or LLAMA or whatever kind of LLM. But now the trend is going to already
train the model to give me basically the IDs, let's say, of the documents which should be retrieved. So there was, I think in April, the European Conference on Information Retrieval, where they also had a nice tutorial on that. And if you also look at the SIGIR community, those are now, I would say, the trends, what people are trying to achieve. I think one reason RAG has been so popular has been, you know, the smaller context windows, but now we've really seen those grow.
Starting point is 00:44:43 Do you think in five years it'll be the case where context is cheap and plentiful, to where you can stuff millions and millions of tokens and it won't cost that much? Or like, now that we're seeing Gemini with a million token context windows. I think it depends heavily on the hardware, because in the end, let's say this context, in the end it's a matrix, so you're doing matrix multiplications. So if you're able on a hardware perspective really to do it efficiently so this will be able to say the main driver from I know Nvidia and the who's gonna be the first one who really makes
Starting point is 00:45:16 it possible this is in the end like okay who can and how are you or maybe there will come up some new solution to even like I don't I mean the whole point of like the attention mechanism to see like what part of context is maybe not really important so either those kind of like new strategy will come up or really the hardware which I would say if we get a really huge boost again like a new jump in from the GPU perspective, then probably the next thing that's going to happen is even bigger contexts. Do you have thoughts on vector databases in terms of the specialized vector databases that are like, that's all they do versus the vector databases that have essentially taken,
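Rough arithmetic for why context length is largely a hardware question: naive self-attention materializes an n-by-n score matrix per head, so that matrix alone grows with the square of the context. The head count and precision below are assumptions for illustration, and attention kernels that avoid materializing the full matrix change the picture considerably.

```python
# Back-of-the-envelope: naive self-attention materializes an n x n score matrix
# per head, so memory for that matrix alone grows quadratically with context.
def attention_matrix_gb(seq_len: int, heads: int = 32, bytes_per_val: int = 2) -> float:
    return seq_len * seq_len * heads * bytes_per_val / 1e9  # per layer, fp16

for n in (8_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_matrix_gb(n):,.1f} GB per layer")
```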
Starting point is 00:46:01 they've gone from relational relational databases or even even like document stores like MongoDB now supports like vectors. Like where do you see the world of essentially structured data and vector databases going? So I mean this is also interesting. I would say the cool thing is now a lot of like NoSQL, SQL solutions like I know Postgre has also like now has a PG vector support. So depending like in your engineering team where the expertise lies, if you already have vector support, use it. This falls into this fail fast, fail often principle just as fast as possible to test it out. There was also one, I mean sure for
it's like if you go for performance, if you already know that you're gonna have, I don't know, billions of transactions, then those kind of specialized vector databases definitely make sense. The thing is, it really depends on your own team and, let's say, the company organization. For instance, there was one research paper which basically stated, okay, is Lucene enough? So you have Elasticsearch, Apache Solr, which are built on Lucene. And the thing is, those are basically really optimized for fast retrieval. You don't only have vectors, because they were focused more on text search.
Starting point is 00:47:25 So you have multiple stuff which you can use in addition to embeddings and like even like in Elasticsearch you besides dense embeddings you have like sparse embeddings those kind of things which are also you can maybe try out to see if this helps or you can like make some kind of hybrid combinations. But to get back at the papers, the statement was if your company, if your business is already let's say established, they have a dedicated team who already supports Elasticsearch, go with that just from a business perspective, from a business value because you will much faster develop something.
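The "hybrid combinations" he mentions, in their simplest form, blend a dense-embedding similarity with a keyword score. Real Elasticsearch or Lucene setups use BM25 and tuned weights; this only shows the shape of the idea.

```python
# Simplest shape of hybrid retrieval scoring: blend a dense-embedding similarity
# with a keyword-overlap score. Real setups use BM25 and tuned weights.
import numpy as np

def keyword_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def hybrid_score(query_vec: np.ndarray, doc_vec: np.ndarray,
                 query: str, doc: str, alpha: float = 0.7) -> float:
    dense = float(query_vec @ doc_vec /
                  (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)))
    return alpha * dense + (1 - alpha) * keyword_score(query, doc)

# Toy vectors stand in for embeddings of the query and one document.
q_vec, d_vec = np.array([0.2, 0.9, 0.1]), np.array([0.25, 0.8, 0.0])
print(hybrid_score(q_vec, d_vec, "reset my password",
                   "Passwords can be reset from the login page."))
```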
And then if you really get to the point where, I don't know, we have to have millions of transactions or really billions of documents stored in it, then think about a really specialized one. Or if you're at the start, I think go with whatever you feel will help you to achieve something good. And then, I mean, the whole Lucene stack, I would say, is really well known, well maintained. You have a lot of features. I would say you won't go wrong with that. Once you then see, okay, I have some performance requirements, then you can think about going for them, but I mean it depends highly on your own team, your own preferences
Starting point is 00:48:49 and proficiency well we're coming up on time here so I want to close off with some quickfire questions sure we said people with so if you could master one skill that you don't have right now what would it be to try to be much more on time I think I was like what one minute before I came here, one minute before, so I should have maybe like planned. That's great. What about what wastes the most time in your day? I mean it's communication. I mean it's still important. It's hard sometimes to say hey okay I have to focus more on my stuff like it depends it's I mean I mean it falls to the main of everybody like I mean if you collaborate with a lot of people you always want to help somebody then and it
Starting point is 00:49:32 will take time but maybe it's not a waste it's like it and kind of investment maybe like short term you think like ah I don't have time but like I think about it yeah yeah that's good yeah cool if you can invest in one company that's not not Infobip, not who you currently work for, which company would that be? There is no other company I would invest in. Great answer. That's the first time I've heard that. Yeah, they're monitoring your answer right now. Yeah, cool. What tool or technology could you not live without?
Starting point is 00:49:58 Tool or technology? I mean, I don't think that there's a lot of things. You mean more like software development? Yeah, you can take it anywhere you want. You could say airplanes, I don't know. I would never fall into the middle of really being mainstream, but I'm hooked on mobile. Yeah, for sure. What about what person, what one person influenced you the most in your career? My basically
I would say I had the luck that at every company where I worked, I really had good mentors, and usually this really defines how you're gonna grow, like it really can help and bootstrap you in your career. I think I really had the luck that wherever I came, and especially now at Infobip, you're working with people who are great, and this really makes or breaks you a lot of times. Yeah, absolutely.
Starting point is 00:50:53 All right, last one. It's a good one for you. Five years from now, will there be more people writing code or less? Oh, I would say more people, but in different ways. I think it's going into the direction of opening up. I hope so, at least. I'm a fan of it. Yeah, well, Emmanuel, this was a great talk.
Starting point is 00:51:11 Thanks for doing this. I really appreciate you coming on. And best of luck on your talk later today. Sure, thanks. Thanks for coming on. Cheers.
