Software Huddle - Navigating Large Language Models with Vino Duraisamy from Snowflake
Episode Date: December 12, 2023. In this episode, we spoke with Vino Duraisamy, Developer Advocate at Snowflake. Vino has been working as a data and AI engineer for her entire career across companies like Apple, Treeverse, and now Snowflake. And in this episode, we dive into her thoughts on what's happening in AI right now and what a practical LLM strategy for a company should look like. We discussed the hard, unsolved problems in the space like privacy, hallucinations, transparency, testing, and bias. There's a lot of problems. We're very much in the Wild West days of AI, and it still takes a ton of work to move beyond prototype to production with any AI application. There's lots of hype, but not necessarily that many enterprises actually launching products that take advantage of these generative AI systems yet. We thought Vino had a lot of real world perspective to share, and we think you're going to enjoy the conversation. Follow Vino: https://twitter.com/vinodhini_sd Follow Sean: https://twitter.com/seanfalconer
Transcript
So as an NLP engineer, it feels so good to finally have the attention back on the language
models that everybody's excited about working on. BERT and GPT going mainstream and almost
everyone talking about what GPT is, it gives you goosebumps, and in a very good way.
So are we kind of expecting too much from every company to suddenly become an AI company
overnight? Yeah, I think that's a very good
point brought up, because certainly, because of this wave, everybody is wanting to do AI. And some
people are not even, like, if your team does not have AI researchers, not even researchers, like
AI builders, for example, why are you even talking? Like, you need to have some foundation on which you want to now start
experimenting and whatnot, right? With the current state of gen AI and in particular LLMs, who
do you think, or what use cases are these, like, well positioned to be most useful for at this
time? Any application that has a human in the loop will be super ready to get started, and I don't
even think they need to, like, think twice about what are the right ways about this whole, you know, bias and
fairness and all of that, as long as there is a human involved. I know the human also comes with
their own bias and everything. But then it is, like, super powerful to have a human in the loop
versus just, you know, automating a bunch of stuff and hoping it all works out.
Hey, everyone, welcome. Sean Falconer here, one of the hosts of Software Huddle. And today
on the show, Vino Duraisamy from Snowflake. Vino has been working as a data and AI engineer for
her entire career across companies like Apple, Treeverse, and now Snowflake. And in this episode,
we dive into her thoughts on what's happening in AI right now and what a practical LLM strategy
for a company should look like.
We discussed the hard, unsolved problems in the space like privacy, hallucinations,
transparency, testing, and bias. There's a lot of problems. And we're very much in the wild west days of AI, and it still takes a ton of work to move beyond prototype to production with any AI
application. There's lots of hype, but not necessarily that many enterprises actually
launching products that take advantage of these generative AI systems yet. I thought Vino had a
lot of real-world perspective to share, and I think you're going to enjoy the conversation.
And if you do, please remember to subscribe to the show and leave a positive rating and review.
All right, let's kick it over to the interview with Vino.
Vino, welcome to Software Huddle.
Thank you. Thanks for having me today.
Excited to talk to you. Yeah, I'm always excited to talk to you. I think we met a couple years ago
now, but let's start by having you introduce yourself. I know you've been in the data
engineering and AI space for most of your career. How did that start and what led you to the position
that you're in today at Snowflake? Oh, okay.
I guess when I started out, I was working as a data engineer and I thought looking and taking inspiration
from the software engineering world,
at that time, full-stack software developers were quite the thing.
And I was like, oh my God,
data is lagging behind software engineering by a decade.
You know, probably five, six years from now,
full-stack data engineering is going to become the next big thing. So I'm going to have to build that
experience end to end. So I consciously took opportunities as data engineer, and then ML
engineer, worked with big, you know, companies like Nike and iProtech and, you know, Apple, and
built that end-to-end ML experience. And then, only for me to realize that the data world has decided to go in a
completely different direction now. We have, you know, like, so many different roles with almost overlapping
responsibilities, and it's kind of impossible to combine them all together into just one full stack
role. But it's been a very interesting journey, you know, even as a data engineer with that ML
experience, or as an ML engineer with the data experience,
it helps you understand the different components
in your entire pipeline
and kind of worked out well for me in the end anyway.
Currently, I work at Snowflake as a developer advocate
focusing on data engineering
and the language model workloads.
So it helps to have that wider breadth of knowledge
about the two different fields
having worked, you know, in those two experiences.
Yeah, I think that's kind of like a natural maturity curve for any discipline is, you know, as an area gets, you know, essentially matures, you get more specialization.
Like if you were, you know, a doctor, like a medical doctor from 100 years ago, you basically
did everything, you know, you took splinters out of people's toes and you performed surgery and delivered babies.
But now, like, the level of specialization that happens in the medical field is, like, really, really, like, specialized.
And we're seeing that more and more in engineering, software engineering and data engineering and so forth.
Like, there's just too much to know at this point for any one person to kind of, like, cover it all. And you have to essentially reach some level of specialization as, you know, areas scale and companies grow.
And in particular in the data space, there's just so much data to deal with.
You end up having these like specialized roles to kind of just like manage it.
For sure. Yeah. I mean, I guess more than, like, the industry maturing, looking at the number
of tools we have in the data and AI space, it's kind of almost impossible for just any one person
to pick up tools end to end to be able to do that. So I guess that kind of contributed to the
story as well. But I see the roles getting very specialized and niche, and it's all good,
I guess. They have AI engineering roles and prompt engineer roles coming up.
It's going to be fun to watch. Yeah, new things all the time. So you're a developer
advocate at Snowflake and you mentioned essentially having some general knowledge
helps because you can probably flex across different areas, talk to different types of people. But are there particular areas of focus at Snowflake that you have or
particular product lines that you are more focused on? Okay, so I'm not focused on a specific product
line per se, but then broadly the data engineering workloads and, you know, a little bit of LLM
workloads as they come on. But yeah, that's pretty much all I do.
There is also, like, another workload focused on
app developers, where Snowflake Native Apps are the next big thing, but I'm not so much,
you know, involved in the app development side of things. Yeah, that's actually the area that I'm
probably most involved in. So you also, you know, you mentioned,
you know, where you started your career,
you kind of bounced around from some big tech companies
like Apple and stuff like that.
But you've also worked for startups like LakeFS.
How would you compare and contrast those experiences?
Oh, wow.
Okay.
I guess when I worked with bigger companies
like Nike and Apple, for example,
there was a lot of process and everything in place.
So it's easy for you to kind of hop on board.
And on week two, you're already ready to understand what's going on.
And you have something easy to kind of start contributing to because the process and the method, everything is in place for you.
But then, like, startup was, like, a true wild journey for me, because I never worked with
a startup before. And then to work for a Series A startup trying to, you know, identify a product-market
fit and everything, it's almost like you become a Swiss Army knife of sorts. You end up
doing everything. Like, sometimes you're doing community, sometimes you're doing product and
a bit of engineering, updating the docs and whatnot.
I think it helped me broaden or like widen the horizons of what's going on in the different niche areas within the field and all of that.
But I guess the prime, like the main difference, like at least I see with Snowflake and with LakeFS, for example, is LakeFS was all about almost survival, you know, trying to identify do we even have a product market fit in this specific niche of a product.
But with Snowflake, it's like, oh, it's an already established behemoth. And all we're
trying to do is kind of think about scaling every single thing. So you think of even the smallest
of initiatives and programs, like the first thing on your mind is, oh my God, how am I going to scale it? Because Snowflake
scale is an entirely different game compared to LakeFS. So I guess working with problems at scale
is probably the biggest difference for me, and getting used to, you know, doing that.
Yeah, that's a good point. I think there's also a lot of, of course, variants in the world of software.
It's very different working for, say, Series A versus Series D,
where presumably a Series D
has hopefully found product market fit,
and they're actually working on some of those scale problems
versus being very early.
It's a lot of unknowns that you have to try to navigate.
It's both exciting, but it also can be a little bit,
I don't know, jarring,
because you're constantly context switching and maybe you know owning different types of roles and kind of
stepping in, trying to solve a variety of different problems. Yep, for sure. And it's fun also, you know,
with Snowflake, for example, right? Like, with the scale also comes that there are so many people that
you can, like, fall back upon. There are, like, tons of resources that you can leverage and, like,
you know, make it work for your programs and whatnot.
Yeah. You're not the first person to ever like have this particular problem in the history
of the company. There's probably been someone who's like laid some of the groundwork or
thought through it, uh, for you. Like, you can be at a startup and, I don't know, like,
need some healthcare coverage thing, and no one knows the answer to the question or something like that, because no
one's ever experienced it before.
Yeah, for sure. It was super helpful.
Yeah. So I want to talk about real LLM strategy with you.
So there's a ton, of course,
going on in general right now, and it's kind of hard to grok sometimes what's
real and truly impactful when it's buried
sort of under 50 feet of, like, overhyped AI marketing fluff. So first of all, as someone
that actually has real NLP and AI experience, what are your thoughts on the hype? Is this just a hype
cycle or is this something new? What's it all mean from your standpoint? No, I mean, this clearly is a hype cycle for sure.
And I feel it is in a good way because two years ago when I was working with language
models, NLP was not the coolest area to work on.
At that time, the world was, you know, all the rage was all about computer vision models
and object recognition and object detection.
Like that was the coolest thing everybody wanted to work on.
So as an NLP engineer,
it feels so good to finally have the attention back
on the language models
that everybody's excited about working on.
BERT and GPT going mainstream
and almost everyone talking about what GPT is,
it gives you goosebumps, and in a very good way.
But I feel, I guess, you know, part of having all the attention and limelight come to you is also that not everything is being fully
understood, because suddenly ChatGPT was everywhere and everybody was using ChatGPT, and nobody fully,
truly understood what exactly it was for and what are you supposed to do with it?
What does it mean to you?
And like, you know, there is a lot more to it.
I mean, if within the researcher community, we are debating as to, oh, can ChatGPT do reasoning and, you know, all sorts of math-related tasks,
then what would you think about a common man trying to understand what ChatGPT can do and cannot, right?
It just, it literally, like, just hit the world and we were like, okay, what is this?
What are we supposed to do with it?
And it was like almost a shock.
In a way, I feel like that probably is an aspect that needs a lot of eyes and involves,
I mean, it should involve some sort of an understanding and regulation of what's going on over there.
I see. So you think that there's essentially the excitement and hype around something like ChatGPT,
but not necessarily true understanding from maybe the general public or even people who are in the engineering space?
Yeah, for sure, right?
I guess, like, you know, you summarized before, like, it's great to have the hype, because
you know how difficult it was to probably get these, you know, data people
and the company's executives to focus on and invest in data and AI a couple of years ago.
And now, thanks to the hype,
you don't need to sell these AI tools.
Like, you know, it's not that hard
to get any investment or any buy-in
to build these AI projects within these companies.
The best thing that could have happened to AI teams.
But then it also means that probably
it's on the AI researchers and developers
to kind of build that sort of understanding
and expectation within the executives as well.
I think a downside of this as well could be that essentially because there's such an appetite for investing in AI,
either within an organization or even within the venture capital market for startups, there's also room for a lot of like, I don't
know, you know, sneaky little salesmen, you know, selling, selling a dream that maybe
is not quite close to reality.
You know, would you say we're sort of in the wild west days of AI to some degree?
I think for sure.
I mean, wild west in the sense, I feel like everybody's like, oh my god, AI is going to be the next big
thing. But how exactly is it going to impact your specific business, or your line of business, or your
domain, your industry? I don't think any of us have, you know, gotten that narrowed down. We've all been
using AI and ML and the classical, you know, machine learning models for different use cases and whatnot.
But then looking at a powerful model like this, we have still not figured out what to do with it. And I remember we were, I mean, I was there at a bunch of, you know, hackathons in SF last week and like in the last few months as well.
There were, like, two separate tracks, one for all kinds of chatbots, one for, is there anything other than chatbots anybody else
out there is doing with these LLMs? Because every other production, you know, workload or use case
that I hear of is some version of some copilot, or some assistant, or something that's reading the
documentation and answering questions for people. That's essentially, I would say, like, maybe a
customer assistant, right? But that's literally all that we're thinking of at this point. And we're still yet to explore what other, you know,
use cases it can be applied for and whatnot. And of course, you know, multimodal stuff's coming in,
with all these great, beautiful images that you can create with AI. You know, yesterday, you know,
OpenAI also announced the text-to-speech model and then Whisper.
And it's like the whole multimodal stuff has not been explored as well.
Primarily, the focus has been on LLMs, and we've not been able to get past this whole chatbot-type application. Yeah, I think to some degree, people's first introduction in a lot of ways, or a lot of people's first introduction to what you can do
with Gen AI is ChatGPT, which naturally makes you think about
or it could be GitHub Copilot, but it naturally makes you think about chat
and these types of use cases. It's kind of like the lowest hanging fruit
entry into using some of these systems. But I think
the real, like, impactful work that I'm
excited about is some of the things that are going on in, like, the world of biotech around,
you know, using generative AI for, you know, drug discovery and sort of really transforming
an entire industry from something that historically has been less of a design,
less of an engineering discipline,
and a little bit sort of, like, accidental discovery, and very expensive.
And it could really lower essentially the cost and also
decrease the time to market.
And then also, you know, things around like real-time translation is something that now
people are talking about actually being a reality within the next few years where, you know, I could be speaking to you and you could have a headphone in and like,
it's real time and you're just, you know, hearing it in whatever language that you need to hear it,
which would be like incredible thing to see as well. So I think we're not that far away from
going beyond the chatbot. And, but I think that it's kind of like the natural entry point for a
lot of people to think about.
Yeah, I think for sure. And yeah, I guess personally excited about using it on the creative aspects of it, for example, like writing.
Like it's literally going to help every person on earth to write better.
And not everybody has that, you know, like quirky, funny, humorous way of writing a blog, say, for example.
And it's like
super helpful when it comes to those aspects of it, and in general, like, any sort of creative
way of using it. I guess one of the most interesting ways I've, you know, had people tell me how they use
it is, I met a couple, and I guess they are from two different countries, so they come from two
different backgrounds and cultures. So they have their own version of, you know, heroes, or superheroes, call them, in their own culture and religion. And to teach their kids
about them, they are using ChatGPT to kind of combine these two worlds together to write
stories. And I was like, wow, that's, like, it almost sounds magical to me. I'm like, that's beautiful.
And that probably is the, you know, power of ChatGPT, to be as
creative as it can be. And, I mean, I don't know, you tell me, how difficult is it to come up with
new stories every day to tell your kids to put them to bed? Maybe come up with a fancy new
world that would involve Superman and Yoda and whatnot. Yeah, I mean, I use ChatGPT to pick my son's weekly
show-and-tell items, so
when he gets to the point where he wants to hear a custom story from me,
I'll definitely be relying on AI to help me generate that.
One of the other things that we mentioned was there's such an appetite
for businesses to try to either invest in these technologies, give more resources to their,
you know, AI and data teams, which is great.
But is it realistic to some degree?
Like, essentially, you know, based on my experience, I think most companies struggle to even do
analytics well.
So are we kind of expecting too much from every company to suddenly become an AI company overnight?
Yeah, I think that's a very good point brought up because certainly because of this wave, everybody is wanting to do AI and some people are not even like if your team does not have AI researchers, not even researchers like AI builders, for example, why are you even talking?
Like you need to have some foundation on which you want to now start experimenting and whatnot,
right?
But I feel like that may not be that big a deal because now we have a slew of tools
that come in.
I mean, again, like, I guess a line of self-promo here, but then I saw the Snow Day announcements
and Cortex trying to enable
analysts with all these ChatGPT-powered functions, and I'm like, oh wow. So you can be a
data person who has nothing to do with any AI stuff, but why do you have to worry about it, right?
Like, let's say I've been using SQL all these years, and I use the average, the min, and the max
functions just, you know, just like I know how
a calculator would do them. I don't have to worry about how it is being run internally. So why can't
I do the same thing to create, like, general product descriptions, or to create something? Just, like,
taking the power of these ChatGPT, AI stuff to all the different use cases by building these
different products has been a very interesting
thing too. So I feel, you know, there's going to be a bimodal distribution of companies when it comes
to adoption, right? So one group who actually have these ML engineers and builders who want to build
solutions for themselves, probably going to be very highly custom based on their domain data
set and whatnot. And there's going to be another group of companies who are probably going to be using solutions
from all the different companies
to kind of take care of their AI needs and requirements.
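To make that calculator analogy concrete, here is a minimal, hypothetical sketch of what calling LLM-powered functions from SQL could look like, in the spirit of the Cortex functions mentioned above. The connection details, the table, and the SNOWFLAKE.CORTEX.SUMMARIZE / SNOWFLAKE.CORTEX.TRANSLATE function names are assumptions for illustration, not a verified API.

```python
# Hypothetical sketch: using LLM-backed SQL functions the way you would use AVG() or MIN(),
# without worrying about how the model runs internally. Credentials, the table, and the
# SNOWFLAKE.CORTEX.* function names are illustrative assumptions, not a verified API.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",    # placeholder connection details
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)

sql = """
SELECT
    review_id,
    -- assumed LLM functions, called like any other scalar SQL function
    SNOWFLAKE.CORTEX.SUMMARIZE(review_text)              AS review_summary,
    SNOWFLAKE.CORTEX.TRANSLATE(review_text, 'en', 'de')  AS review_de
FROM product_reviews
LIMIT 10
"""

cur = conn.cursor()
try:
    cur.execute(sql)
    for review_id, summary, translated in cur.fetchall():
        print(review_id, summary, translated)
finally:
    cur.close()
    conn.close()
```

The point of the sketch is the shape of the call, not the specific service: the LLM shows up as just another function in the query, the same way a data person already uses built-in aggregates.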
Yeah, and I think that we'll get better
at building abstraction layers
that make it easier to essentially
even customize some of the kind of
out-of-the-box functionality.
Like you mentioned Cortex,
like there's functions for, you know,
summarization for, you know,
a question and answering for translation and so forth.
But then I think when you start getting into
trying to customize it for a specific domain,
then you need to, like today,
it's still like a fair amount of effort
to build like a whole like embedding model
and kind of use information retrieval methods to
pull the right, uh, context to feed into the LLM, and then, you know, make sure that you're not giving too
many tokens. And there's quite a bit of work that's involved in that process today. But I think that's
going to get easier with time. It's kind of like the nature of any, like, early, you know, technology.
Um, and it's going to get simpler for people to do that kind of work without deep expertise in machine learning. Yeah, for sure.
Because I feel, like, think about Postgres having pgvector, right?
So now everyone's going to, like every existing data tool or AI
ML tool is going to kind of arm themselves to be able to serve their customers better
in terms of these AI aspects of it. And yesterday,
I was like, last
week, I literally spent all my time to build a retrieval augmented generation based, you know,
LLM assistant, and I'm trying to understand how I should chunk, and going to LlamaIndex and trying
to identify what I should do, all sorts of in-the-weeds stuff trying to build an LLM chatbot. And then yesterday,
OpenAI people tell me,
oh, you know what?
You don't need to do anything.
Why don't you write a small prompt?
And then we would create a custom GPT for you.
There is built-in retrieval.
There's built-in vector.
You don't need to worry about embedding,
creation of embedding, storage of embeddings.
Don't need to deal with vector databases.
And I'm like, oh, wow.
So this is only going to get better and better.
So I feel the long tail of companies who don't have those AI resources are going to probably benefit the most
because you don't need to do anything, but you still get everything handed
over to you on a platter, because you can access all these tools to do the same.
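For context, the chunk, embed, retrieve, and prompt loop Vino describes spending a week on looks roughly like the sketch below. It is a minimal sketch with placeholder embed() and call_llm() functions standing in for whatever embedding model, vector store (pgvector, a managed service, or a plain array), and LLM you actually use; it is not LlamaIndex's or OpenAI's API.

```python
# Minimal retrieval-augmented generation (RAG) sketch: chunk documents, embed them,
# retrieve the most similar chunks for a question, and build a prompt for an LLM.
# embed() and call_llm() are placeholders for whichever embedding model / LLM you use.
from typing import Callable, List
import numpy as np

def chunk(text: str, size: int = 500, overlap: int = 100) -> List[str]:
    """Split text into overlapping character windows (naive chunking)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def build_prompt(question: str, docs: List[str],
                 embed: Callable[[str], np.ndarray], top_k: int = 3) -> str:
    # 1. Chunk and embed the corpus (in practice you would store these in a vector database).
    chunks = [c for d in docs for c in chunk(d)]
    chunk_vecs = [embed(c) for c in chunks]
    # 2. Embed the question and rank chunks by similarity to it.
    q_vec = embed(question)
    ranked = sorted(zip(chunks, chunk_vecs),
                    key=lambda cv: cosine(q_vec, cv[1]), reverse=True)
    # 3. Keep only the top chunks so the prompt stays under the model's token limit.
    context = "\n---\n".join(c for c, _ in ranked[:top_k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical usage:
# prompt = build_prompt("How do I rotate my API key?", company_docs, embed=my_embedding_model)
# answer = call_llm(prompt)   # any chat/completions API
```

The hosted "built-in retrieval" she mentions essentially folds those three steps into the service, which is why so much of this glue code can disappear.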
Yeah. So even though there's this transformation that's happening, and I think the closest, like, analogy that I've experienced is, like, the
introduction of the internet, in terms of, like,
like a hard sort of line drawn in the sand of, like,
we had, like, the pre-, disconnected, you know,
Dewey Decimal System era of the world, to, like, the connected world where,
you know, I can talk to pretty much anybody that's, you know,
connected to the internet in the world. And, you know, we've come a long way in the last 30 years since it was
sort of introduced. But now we're sort of, I think, drawing another hard line in the sand of, like, the
pre sort of AI era to the post AI era. But do you think there's a danger when it comes to all the
hype that's going on where a lot of the investment dollars
are going to go into essentially companies that are doing something in that space,
or even on the research side, probably researchers that are getting the most funding right now,
or getting their talks, or, you know, conference talks, or journal submissions accepted and stuff
like that are going to be kind of riding the hype train to some degree.
Yeah, I mean, very much. This has been, like, I don't know, a personal pet peeve of mine for a while now. Well, we had classical ML models, right? Let's say you built a model for housing price
prediction. And you know, for a fact, that it cannot be a 100% accurate model. And it made mistakes on
some data points. And that was totally fine, and that's kind
of how the world operated, right? And ChatGPT is just a different type of model. It's a text
generation model and whatnot. Even this model is supposed to have some errors. Like, that's the whole
point of it. Like, we know that it cannot be 100% accurate. But then, for some reason, when ChatGPT is giving
you wrong messages or responses, like, we have somehow compared it and humanized it, saying, oh
my god, the models are hallucinating. Oh my god, is ChatGPT lying? Is it intentionally lying? What is
the intention? I'm like, oh, just because now you want to create a term, like, coin a whole new thing and then go do some research.
Again, like, you basically ask ChatGPT a couple of times and then be like, oh my god, so ChatGPT
is lying, and we built a framework to understand how it lies. And I understand that we're trying
to ride the wave and the hype cycle and all of that, but I do feel like, to some extent, we're, like,
stretching it too far. Because at the end
of the day, it is an ML model, just a large language model, but it is also still bound to, you know,
produce errors. It cannot always be wrong. I mean, it cannot always be right. But it's, like,
humanizing it and trying to, like, over... I feel it's probably just to get more papers out. But again, I wouldn't call this the right direction for us to go.
But I feel this is probably a cycle too, maybe.
Yeah.
Well, it's kind of more clickbaity.
It's going to get you more views to refer to these things as lying and hallucinations
than essentially error rates, precision, and recall from traditional machine learning.
When a traditional classification model misclassifies
my Fitbit exercise as a sit-up when I was doing a push-up,
that is what we would report as an error rate.
No one makes a big deal about that.
But when essentially a GPT model, or, you
know, another type of LLM, produces something that's, like, misinformation, then I
think it's because it feels more human, the output of it. Then, you know, there's a stronger reaction
to it to some degree. Yeah, I think I understand. Like you said, right, just because the responses
are more human-like,
we're maybe a little bit more scared and take it very seriously when something is wrong, like
the responses are wrong. But then again, if, as researchers, or the academia and the industry
together, are trying to hype it up unnecessarily, then, like, how does it, you know,
what are the ripple effects of that within
the common folks? And people are going to be like, oh my god, these AI models can lie, and these AI
models, I cannot trust these AI models. And it is important for us as the community to
maybe build some rules around models, so it becomes more and more trustworthy, and,
as a, you know, society as a whole, we get more used to having these models to help us, really.
But then if we ourselves are inventing these new constructs and kind of producing or creating
these scary scenarios, what would you expect the media to pick up?
And what would you expect common folks to understand from this? I feel like there is this whole responsible AI
and there's a whole field of it, but it's
just not for people who are researching in the responsible AI and ethical AI
field. It should be for literally every person who is working with AI
to some extent to have that ownership and accountability and be responsible
when you talk to or even,
you know, come up with these new terms.
Yeah, absolutely.
And then, you know, I think also, you know, in the sort of abstract academic sense of
an AI innovation, in many ways, like the goal might be to mimic humans or even outperform
humans in some respect.
But to be human also means to be flawed
and sometimes have, you know, bias or lie. But the bar for AI is much higher. We want, like, human
level or better performance, but we also want to eliminate all bias and all error. So I guess like,
what are your thoughts on sort of the balance between creating something that sort of like mimics or feels human, but also making sure that, you know, it isn't essentially, you know, performing errors or, you know, we understand that there's potential errors.
Or even on the flip side, when you mentioned responsible AI, you can get sort of get into the potential for toxicity or bias.
Yeah, I think, you know, so when you think about this whole humans are not fair and we have inherent biases and everything,
and then how is this AI model being trained?
It is being trained on the data on the internet.
And all the data that we created as humans is going to be biased inherently. So
when you trained your model on the data on the web, what did you expect it to be? Like, isn't it
obvious that it is going to be biased, that it's not going to be fair, and everything that a normal
human would have is going to be there with ChatGPT or any other, you know, AI model too? And I feel like we're trying to almost even demonize
it, saying, oh my god, ChatGPT responded something so sexist. And I'm like, hey, I mean, I guess
the better analogy for that would be a child just growing up in the world, right? And, you know, the
world is an unfair place, but then when you bring up your child, you are trying to
kind of set boundaries and make, you know, some sort of a protective mechanism. Like, you as a parent make sure the child has this protective environment to grow up. So it doesn't mean
that just because the world is unfair, you let the child out in the world and be like, you know what, you
deal with it, you're going to have to, because the world's unfair. We do create whatever protective environment we
can, to make sure, as much as we can, again, that the child grows up, you know, in an ideal way,
with ideal behaviors, ideal modeled behaviors, I guess. Very similarly, even with this model, right?
Like, looking at the size of these language models, it's probably impossible to create
synthetic data that is, you know, purely fair and out of bias and everything, to train these models.
And then I don't even think they will respond human-like with all of that.
It's kind of impossible to create this ideal data set to train these models.
You've got to work with what we've got, which is this biased data set.
But then though, like, take this biased data set,
train the model, the model is doing good, but then you need that protective environment or
some sort of a post-processing layer for your specific applications to make sure, you know,
the bias and the fairness and the toxicity and all of that are being kept under control.
That's almost a straightforward approach. But I feel, again, it's garbage in, garbage out. So
the problem is in the data. Where does the data come from? So the problem is in the world. Like,
why are we demonizing AI models for that? Like, that's pretty much all it can do. So I feel that
is also something, I don't know how the narrative was built out, but then this entire
demonizing AI, and, like, the world's gonna end, and this whole, like... I also read up about this
Future of Life Institute. So they've created this entire, almost like a research field now, to make
sure, how are we going to make sure we don't go extinct, or, like, AIs are not going to become a
threat to us, and, like, a whole different
thing. And I'm like, I don't know what's going on behind the scenes, of, like, what is even the
thought process behind such a thing. But I feel like we should not demonize AI just for the sake
of our own short-term benefits, whatever it may be, like publishing a paper and getting grants
and funds and whatnot. Yeah. I mean,
there's enough kind of fear around technology innovation in particular things that are like AI, robotics, I think quantum computing is another one that we've
kind of touched on.
Those are like the three like hot button subjects that like make people scared.
And I think part of it is just like they're complicated.
Not a lot of people understand everything that's involved. It feels a little black boxy and
there's potential for major impact on
people's, maybe their jobs or whatever it is.
We probably, as people working in technology
or researchers, we're sort of doing ourselves a disservice by
coming up with terms that sort of fuel that fear even more.
Right, exactly. I mean, the responsibility is literally on us to make sure we help the world
understand what these tools and models are, so we can all benefit from it together,
but not the other way around. But hey, I guess probably this is how it starts, and it takes some time to kind of get
there, and the world can also be okay with the AI models. So one of the things we touched on
when you were talking there, in terms of, you know, like, garbage in, garbage out. And as much as we're
in this, like, AI revolution, it's also, it's very much, like, a data revolution. We couldn't have
the models that
we have today, 15 years ago, because we didn't have enough like digitized data to even do the
training. Can you talk a little bit about the value of data when it comes to AI training?
Okay. Yeah, for sure. I guess, I'm not sure where I heard this from, but it's like
all your AI problems, or at least most of your AI problems
or ML problems are going to be data problems, right? If you have this right ideal data,
like training data to train your model, your model, of course, is going to turn out well.
But then when it comes to the data, there is like a bunch of things, of course,
like the standard data quality issues and making sure you have the right kind of data,
you're using the right kind of features.
And there was this classical ML problems with feature engineering and everything,
which kind of went away thanks to these LLMs.
Now, we don't even have control over what features do we want to use and everything.
So it's all taken care of by itself.
But then coming up with this data and identifying it, and to some extent,
although we don't have control over what data exists out there in the world, taking that data and doing some sort of pre-processing and massaging and making sure it's not too, you know, off. Not a lot of companies today are using their own data to fine-tune the
models, or using their own data to build a model or train a model from scratch. That's probably
why we've not talked about the data quality issues or how does one even prepare the data
to train a model like GPT, for example. So we've not even like, as a community, I feel like we've not even gone there
and touched upon all of that a lot,
but I feel more and more,
as the field matures and a lot of tools get introduced
to make it easy for us,
there is gonna be, again,
like the foundational and the fundamental problems
of data quality, I guess it's probably a cycle,
it's probably going to become the next big thing in the next couple of years.
Well, even I've heard that a lot of the innovation that's happened in particular in the GPT models
from moving from 3 to 3.5 to 4, a lot of it's not necessarily major changes
in how they're doing deep learning.
They're actually changes around how they're, like, preparing the data and ensuring, like, higher data quality, and reducing,
you know, uh, the propensity for, like, hallucinations, and, um, so forth. So it's really about, like, how
do we innovate at the data level, rather than necessarily, like, big changes to what's happening
at the deep learning level. For sure, yeah, right. Because there are, like, two big components to it. One is, like, the pre-processing
layer, and the other one is the post-processing. Pre-processing is literally all data problems.
And, I'm not sure what happened to the initiative, but if you remember Andrew Ng,
he came up with this term called data-centric AI and data-centric ML,
I think probably a couple of years ago, before ChatGPT was even a thing. And his ideology
was that, you know what, we may not be able to do a lot of improvement to the models by just
playing around with the hyperparameters as much, because there is only so much you can do by,
you know, playing around with hyperparameters. But then if you fix the hyperparameters and work on your data, we will be able to further reach the higher metrics in terms of accuracy or F1 or
recall or whatever. And that was one of the, I guess, successful ways in which industries
and companies were also doing that, even when I was working as a language engineer.
So it's going to come back again, but probably a little later when we're
ready for that, I guess, adoption. In traditional machine learning, how do people usually go about
measuring and improving data quality? Okay, so I guess in traditional machine learning models,
they measure the models using certain metrics, and then data quality is also measured in terms of,
okay, if I do this pre-processing method, or if I do a standardization and a normalization,
or even lemmatization, and all sorts of these pre-processing features, it was almost like
an experiment, right? It was not a test of data quality, but it was more of a test of,
if I do this specific
pre-processing, how does it affect the efficiency of my model? And that's kind of how we worked with it.
And, like, you know, using a specific feature over the other one was not necessarily a right or
a wrong thing, but for this use case, for this model, this is the feature that, you know, gives
me that. So it's all good.
But when it comes to the not-so-traditional large language models,
you cannot really do any of these handcrafted feature engineering methods to improve your accuracy.
Or there's not even accuracy.
But then, you know, the traditional methods again, right?
Like if you do a lemmatization and standardization and normalization,
you're not trying to create, I guess, discrete output here.
It's not a simple prediction.
So when you do remove all the extra context from the text that you're feeding in for your training,
it's going to affect all the, you know, the way in which it creates your output responses as well, which is probably
why it has become a challenge for us to understand, okay, so for my large language model, I don't have
an accuracy as a metric or F1 score as a metric, but then how am I going to make sure, what am I
supposed to do with this data to get this ideal, you know, positive, non-toxic kind of responses.
I don't think we still have that, you know, solved today to understand what kind of data
quality metrics do I even need to use that data to train my large language models.
It's almost still experimental and we are figuring it out again.
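As a rough illustration of the classical, data-centric loop Vino describes, where the model and hyperparameters stay fixed and only the data preparation changes, here is a sketch using scikit-learn on a stand-in dataset. The dataset and the choice of F1 as the comparison metric are assumptions made just for the example.

```python
# Data-centric experiment sketch: hold the estimator (and its hyperparameters) fixed and vary
# only the preprocessing step, then compare a model metric (here F1) to decide which data
# treatment wins. The dataset is a stand-in; any labeled tabular set works the same way.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Same estimator everywhere; only the data preparation changes between variants.
variants = {
    "no_scaling": Pipeline([("clf", LogisticRegression(max_iter=5000))]),
    "standardized": Pipeline([("prep", StandardScaler()),
                              ("clf", LogisticRegression(max_iter=5000))]),
    "min_max": Pipeline([("prep", MinMaxScaler()),
                         ("clf", LogisticRegression(max_iter=5000))]),
}

for name, pipe in variants.items():
    f1 = cross_val_score(pipe, X, y, cv=5, scoring="f1").mean()
    print(f"{name:>12}: mean F1 = {f1:.3f}")
```

The same loop breaks down for LLMs precisely because, as she says, there is no single score like F1 to compare the data-preparation variants against.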
Yeah.
So just to kind of make sort of parse what you were saying there,
in sort of traditional machine learning,
usually we have some idea of what we want to produce
from the model as output.
Like maybe it's a classification model
or a predictive model or something like that.
So we can essentially have some test set
where we know what the prediction should be, and then
essentially run the model against that test set to figure out how precise it was. And then we can
make tweaks to the data to try to improve the quality and then run the test again and see if
that improves it. But in the world of large language models, it's not so clear, because
we're essentially generating text. And it's not clear, like, you're not going to
necessarily generate the same text every single time, because there's some, you know, randomness
factor that's involved with the generation of text. And then it's also not necessarily clear, like,
what is, like, a good response or not. I know if I tell it to write a blog post or something like
that, uh, well, you know, the, uh, one version of that might be better
than another based on my subjective understanding, but it's hard to necessarily empirically measure
what is better. Yeah, I guess that's pretty much it. Like you clearly summarized it,
which is basically, we don't have a great way of measuring how good an LLM model is.
And when you don't have a metric to evaluate,
there is nothing you can, like, you're not knowingly, consciously, with the right intent,
tuning to get that, you know, metric. So it's almost a throw-everything-and-see-what-sticks
kind of approach that we're doing with, at least, you know, training the LLM models
currently. And, you know, I guess evaluating an LLM model could
be a whole different discussion in itself. And as a research problem, I don't think it's fully
solved. We are still working on, you know, coming up with different frameworks and benchmarks and
whatnot. And currently, it's very interesting, because we do have a bunch of frameworks that
people use as benchmarks to see, okay, how good is my model
performing on this specific, you know, Q&A data set and whatnot. But again, you would have to
create this data set for every different use case with every different industry. So it cannot be
generalized. Like, you cannot just say, you know, what is the accuracy of the model and be done with
it. Because everything is kind of non-deterministic here
with large language models and text.
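To illustrate the contrast being drawn here, the following is a small, hypothetical sketch: for a classifier, predictions and labels are discrete, so accuracy is well defined, whereas for generated text an exact-match score is nearly meaningless, because two differently worded answers can both be acceptable. The labels and strings are made up for the example.

```python
# Why classifier evaluation does not carry over to generated text.
# For a classifier, predictions and labels are discrete, so accuracy is well defined.
from sklearn.metrics import accuracy_score

y_true = ["pushup", "situp", "pushup", "squat"]
y_pred = ["pushup", "pushup", "pushup", "squat"]
print("classifier accuracy:", accuracy_score(y_true, y_pred))  # 0.75 -- one clear error

# For an LLM, two samplings of the same prompt can both be fine yet match nothing exactly,
# so an exact-match "accuracy" against a single reference answer tells you very little.
reference = "Rotate your API key from the account settings page."
generation_a = "You can rotate the API key under Account > Settings."
generation_b = "Go to the settings page for your account and rotate the key there."
for name, text in [("a", generation_a), ("b", generation_b)]:
    # Both print False, even though both answers are acceptable paraphrases of the reference.
    print(f"generation {name} exact match:", text == reference)
```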
Yeah, and generally, I think the breadth of potential use cases
is much wider than what we've seen from traditional ML.
Because a lot of times where we've been successful historically with ML
is on fairly narrow problems, whereas the LLM approach is more of a general solution to a wide range of problems that you're going to apply AI.
I could use it for question answering, for reasoning, for writing.
I could do a whole bunch of different things, and then you can break it down by domain and stuff.
What is good in one domain or one use case might be, you know, drastically different somewhere else. Yeah, for sure, right? Like, the same model could be doing well for a specific
use case, it may not be for another one. So there is, like, no one standard metric to kind of
uniformly say what's better, what's not. So we touched on, clearly, like, evaluation, testing,
ensuring data quality, measuring data quality. Those are all big problems.
What are some of the other real problems in the LLM space that you think need to be solved?
Ah, okay.
I guess the biggest one being this whole governance, privacy, and security of it.
It was funny because, again, going back to the OpenAI's Dev Day keynote yesterday,
they had a very interesting announcement called Copyright Shield, where OpenAI will help their customers with the legal fees and everything on the legal front if there is a copyright infringement from using their APIs. I'm like, wow, okay. And you do know that when you go to data conferences
and you are watching the keynote,
you have a bunch of data governance tools.
How do you set up your RBAC and access control?
Who gets access to what?
And what are the PII governance mechanism?
And what are we using to make sure our PII data is not exposed?
And that is a whole different kind of a discussion, right?
Like we have tools,
we know what to do. And we're just trying to get the, you know, stakeholders to start doing that.
Whereas, when you come to these LLM ones, it was mind-boggling to actually hear someone say,
oh, you know what, if you ever run into a problem, here is your insurance, like,
for your legal support. And I'm like, oh, wow. So there was nothing on the security or
the governance of the privacy, you know, aspect to it. And I was, of course, you know, shocked,
but then it's also the wild west, I guess, the industry's like very much in the early stages.
So it's kind of understandable. Yeah, I think this is, I think this is one of the major problems with the space right now that needs to be solved.
And I think for a lot of companies to really move beyond sort of, like, demos and proof of concepts to, like, production-grade products that they can feel confident aren't, you know, a risk of, like, a data leak or some sort of privacy infringement down the road, they need to be able to figure out this problem. We kind of know, at least we understand, the challenges with regular data
management around, like you said, you know, governance, RBAC, like, all these types of things.
We don't always do it, or do it well, but we kind of understand, you know, if someone
comes to your business and says, I want my data deleted, it might take a bunch of work to track down.
It's not optimal to track down all those places that you need to delete it.
You might not necessarily do a good job of it if you don't essentially have a lot of these tools and infrastructure in place to deal with these types of things like right to be forgotten.
But we understand: delete the row from the database, and it's gone. There's no delete-a-row
from an LLM. So if, you know, sensitive data is part of that training process, or even part of
an embedding model or something like that, there's no way to really delete that information. And I
think that's a place that is one of the central challenges right now that not necessarily sort of
everybody's like, really understanding the nuance there. And on top of that, it really comes down to not only
not sharing the information, but how do we essentially train or augment models in a way
that can still leverage essentially like copyright or private information without essentially leaking
the information and then controlling access to it afterwards.
Yeah, I think this, like, you know,
the whole privacy and security aspect of it will not be fully solved
until the field of explainable AI
kind of comes into the mainstream
because today, right,
we have literally zero control
over how this, you know, encoding and decoding
happens within GPT models and how
it actually understands and makes sense of the input training data, right? We are literally
just feeding the data, and then it's just working on it by itself, and it's done. Which is probably
why we don't have any control over, how am I going to delete the specific data from a specific person,
or how am I going to make sure this specific, you know, private set of data is not being exposed.
We can't do any of that, only because we don't fully understand how these models work. But I feel these
are, like, two parallel things. And maybe, you know, I'm not sure if you're on board with the same
idea too, but I feel like unless the field of explainable AI moves further,
it's going to be difficult for the security and the privacy components
to kind of keep up with that.
What is, this is the first time I've heard that term explainable AI.
So I'm assuming that's essentially a way to kind of,
for the AI to explain why it makes decisions.
Is that right?
Yeah, exactly.
Like if you think about the classical ML model, for example, right? So when you do have a certain prediction, there is always a way of
going back and saying, okay, so I predicted the house prices to be this, but what were the factors,
you know, based on which you arrived at this price? Like, you know, the number of rooms,
like, there are, like, input features you can always map back to. When it comes to these, you know, I guess, fuzzy
neural networks, or anything after neural nets, really, there is no way of trying to explain how
they do what they do. Which kind of makes it, like, you know, even fuzzier for us to understand,
like, how are you actually going to identify how your data, how specific PII data, is getting encoded as part of
the training process? And how are you going to make sure that specific part is not getting exposed?
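As a concrete version of the "map the prediction back to input features" point, here is a sketch for a classical house-price model: a linear regression's prediction decomposes exactly into per-feature contributions, which is the kind of explanation that has no straightforward equivalent for a large neural network. The feature names and numbers are invented purely for illustration.

```python
# Explainability for a classical model: a linear house-price model's prediction decomposes
# exactly into per-feature contributions (coefficient * feature value) plus an intercept.
# Feature names and numbers are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

feature_names = ["rooms", "square_feet", "distance_to_city_km"]
X = np.array([[3, 1200, 10.0],
              [4, 1800,  5.0],
              [2,  800, 20.0],
              [5, 2400,  3.0]], dtype=float)
y = np.array([300_000, 520_000, 180_000, 700_000], dtype=float)

model = LinearRegression().fit(X, y)

house = np.array([3, 1500, 8.0])
prediction = model.predict(house.reshape(1, -1))[0]

# Each feature's contribution; together with the intercept they sum to the prediction,
# so you can always point to which inputs drove the price up or down.
for name, coef, value in zip(feature_names, model.coef_, house):
    print(f"{name:>20}: {coef * value:+,.0f}")
print(f"{'intercept':>20}: {model.intercept_:+,.0f}")
print(f"{'predicted price':>20}: {prediction:,.0f}")
```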
Yeah, that's a good, that's a good point. I don't know, I know there's some, you know,
recent research we actually touched on in a recent episode around, like, trying to apply different
methods for, like,
deleting certain information from LLMs. It doesn't really solve, I think, the underlying problem.
Most of them rely on fine-tuning, which is expensive, like, they're not that
realistic, a lot of the approaches that I've seen. Or there's just sort of the, uh, like, complete, like,
you know, um, sealed-off approach, where it's just like, we're not going to let any
of that information in. But then you lose out on a lot of the value of the model, of using potentially,
like, your own, you know, company documents, for example, to try to train an LLM or augment it.
Oh yeah, I mean, I have a whole different story with this whole fine-tuning of models. But I'm not sure
if that's probably the right way to do it. I guess I don't know if I can
take the lessons from my previous, you know, language model training experience to large language models.
But broadly, like, one of the biggest challenges when it comes to fine-tuning, for example, is
when you have proprietary data and if you train your model only on that proprietary
data, like your company's documentation, for example, it's not even proprietary, but if
it's just your company's documentation, and the model only knows about your company's
product because it's only trained on the documentation.
So when I'm going to ask questions, it's not going to talk about something political or religious
or anything scandalous. So you can be 100% sure it can hallucinate as to, how do I use, what is the syntax of this function,
and whatnot, but it's not going to give you any scandalous, you know, topics. But when you
take a generally available foundation model that was trained on this, you know, world data, and then
you try to fine-tune it with your own documentation and whatnot, how are you still going to be sure that it's not going to respond to any random, irrelevant
religious or political question, right? You can never guarantee that. And again, I guess,
even historically, transfer learning hasn't been as successful, or, like, not so popular a method when
it comes to, you know, fine-tuning your existing model to a whole different task. Because it also loses meaning if the foundation model is
good at ABC tasks and you're fine-tuning it with a new task, you cannot be very sure that it is still
able to do the ABC tasks it was originally trained for. So then what is the point of it again, right?
So you might as well have trained it on your own data.
You can at least be sure that it's not going to talk any irrelevant things
or, like, any scandalous things to your customers.
So again, I'm not big on fine-tuning,
but I guess that's kind of how the world is going to evolve
from prompt engineering to RAG to fine-tuning.
And then eventually we'll hopefully get to custom training of proprietary
LLMs, but it's probably going to be a long way there.
With the current state of Gen AI and in particular LLMs,
who do you think, or what use cases
are these well-positioned to be most useful for at this time?
I feel like all sorts of productivity tools are like super helpful because
when it comes to any of these, you know, developer productivity or marketing productivity or
like creative, any of these tools, there is still a human in the loop. It's just making our work
easier or like making the getting started
easier. So these kinds of use cases, I feel, are, like, very well primed for, you know, taking off. Like, I don't
have to worry about hallucination, I don't have to worry about all the hundred problems that come
with GPT, because I know there is always going to be a human in the loop to kind of, you know, oversee
what they're going to, like, what the tools are going to do. So at least for the moment, where, you
know, as we're positioned today, any application that has a human in the loop will be, like, super
ready to get started. And I don't even think they need to, like, think twice about what are the right
ways about this whole, you know, bias and fairness and all of that, as long as there is a human
involved. I know the human also comes with their own bias and everything,
but then it is super powerful to have a human in the loop
versus just automating a bunch of stuff and hoping it all works out.
Yeah, I mean, I think it's hard to argue that things like GitHub Copilot
and the other coding assistants haven't been valuable to people.
I think people are seeing massive efficiency gains from that,
but
we're still a long way from just, you know, copying, pasting code, submitting it to production
and hoping it works. But probably, yeah, it's a long way to go for sure.
And then what are some of your recommendations, I guess, for, you know, developers interested in
the space or businesses, you know, do you think, you know, in your opinion, are people like asking the right questions when they're kind of like, you know,
charging into the world of AI? Okay. I think this is probably the greatest time to be a developer
or anyone like a builder working in tech, really. Again, I'm sorry to keep going back to the
OpenAI Dev Day. I feel like at this point, it's probably like a recap of the Dev Day keynote.
But they had announced this chat GPTs and a chat GPT,
I'm sorry, GPTs and a GPT store
where you can build your own custom GPT for any use case
and then put it out in the store and people can use that.
And almost like a one-to-one equivalent for me,
like, I was instantly
able to connect it to Snowflake Marketplace. So you put your, you know, data apps out there, or, like, GPTs
out there, and you can have your users use it and be done with it. And I'm like, probably even as a
builder, right, you can be a one-person founder and then go on to build a million-dollar or even, like, a billion... you never know. So the
opportunities as a developer with these kinds of tools are, like, super cool and endless. And if you're
not a builder too, again, right, you create a custom GPT by writing an input prompt.
Like, you don't even need to know, like, forget Python, forget this, forget that. I'm like, how cool is that? And if you do have a great use case, you may be able to make some money off
of that. That's, like, putting so much power in the hands of the builders and developers, and
that's, like, super cool. But when it comes to the organizations, though, it makes me really wonder
as to what should the organizations really do?
What should their approach be at this point?
Because yesterday after the announcement, I personally felt like a bunch of tools that I've worked with might go obsolete because now, oh, no.
OK, done.
But if you're an organization who probably are already paying those companies to experiment and build with,
what should you be even thinking?
So as an org, should I only stick to these big players,
hoping that they will eventually catch up and there will be a feature parity?
Or am I supposed to take the bets and try the new tool that's probably going to be a lot more useful to my team in the short run?
Because again, everything is... The competition is within the timeline, right? You want to be the first within your
industry, within your, like, you know, domain to kind of go win and capture that market. So are
you, like, supposed to experiment, experiment, experiment, be ready to, you know, bet on the
small startups? Or should you be like, I never know if the startup is even going to be there for a while.
So do I wait for these big players to catch up?
That's going to be, I don't think I have an answer for that.
But that's kind of what I was thinking in my headspace was after, you know, looking
at all the announcements.
Yeah, I think, I mean, I think that's always a challenge for all companies, you know, investing
in technology. Even with the sort of, the migration to the cloud,
I think, you know,
people faced similar challenges. It's like, do I, you know,
stick with my on-prem solution,
or do I start to modernize and move to the cloud now?
And it's always a delicate balance.
I think it really comes down to,
I guess, like, having confidence
in what competitive advantage it gives your business,
or, you know, the ROI that you're going to get
for your business.
And I guess your level of confidence
that hopefully the company that you're investing in,
like buying their licensing model,
is actually going to be here in a couple of years.
It's always, you know, a certain amount of leap of faith with that.
Gotcha. But yeah, I think that's going to be very interesting for me, at least, to watch, because,
yeah. Awesome. Well, as we start to wrap up, is there anything else you'd like to share? Any other,
you know, sort of nuggets of wisdom that or things that you've been thinking about?
I mean, I guess I probably have like one last pointer,
which I feel is, again, coming off from my personal experience. I run into a bunch of folks
at meetups and, you know, hackathons and whatnot. And I see a lot of data folks who don't have a lot
to do with AI are kind of isolating themselves and being like, oh yeah, there's just a lot happening in the AI world.
It has nothing to do with me.
But then as the data engineer or data analyst person,
you know that it's going to come into your zone.
You know, if someone is building an AI model,
you will probably be building the data pipeline
to take in the data to train it.
So it's always going to come into your zone for sure.
So it's probably wise to keep up with,
I know it's impossible to keep up
with everything that's going on,
but then it doesn't hurt to probably
give some tools a try and just keep abreast
of what's going on around.
So that's going to be very helpful.
Yeah, I mean, I think it's similar
to any of these digital transformation shifts
that happen, like moving to the internet, moving to mobile,
moving to cloud,
like you're kind of doing yourself a disservice by like writing it off as
like, you know, nothing to do with me.
I think this is something that's probably going to touch essentially every
like facet of not only engineering,
but most people's jobs in some capacity over the next decade or so.
So it's important, I think, to like at least educate yourself on what is happening and like how you might be able to leverage these tools to do your job better and more efficiently
or, you know, how you might even be able to contribute in an interesting way.
Yeah, for sure.
It's going to be very interesting for everyone, especially for
students. Like, I'm doing a course, like, on my weekends, doing this executive MBA thing, and I
was kind of shocked that people were not even using ChatGPT for anything. And I'm like, oh wow,
okay, but you all work in tech. But then I'm like, no, but we don't work with AI. And I was quite taken
aback, because I've met a lot of people who have nothing to do with tech and have been using ChatGPT too.
Like, I go to a parlor and the lady, she has a book club with her friends, and then she uses ChatGPT to get the summary of the books.
So she doesn't have to read the entire book, but then she can still go with her friends and sound smart and everything. So it's kind of very interesting to
see the ones you think would be catching up with all the tech developments or not. And then it's
like ChatGPT, regardless of one's background, is kind of impacting everyone.
Yeah, ChatGPT is like the modern Cliff Notes version.
Right.
It gives everyone an opportunity to sound smart. Well, Vino, I want to thank you so much for being here.
Hopefully we'll see each other in person again,
I'm sure at some sort of event coming up.
But this was very enjoyable.
We'll have to have you back down the road.
Yep.
Thank you so much for having me.
And yeah, I will hopefully run into you at a meetup or something in SF soon.
All right.
Thanks.
Cheers.