Software Huddle - Navigating Large Language Models with Vino Duraisamy from Snowflake
Episode Date: December 12, 2023. In this episode, we spoke with Vino Duraisamy, Developer Advocate at Snowflake. Vino has been working as a data and AI engineer for her entire career across companies like Apple, Treeverse, and now Snowflake. And in this episode, we dive into her thoughts on what's happening in AI right now and what a practical LLM strategy for a company should look like. We discussed the hard, unsolved problems in the space like privacy, hallucinations, transparency, testing, and bias. There's a lot of problems. We're very much in the Wild West days of AI, and it still takes a ton of work to move beyond prototype to production with any AI application. There's lots of hype, but not necessarily that many enterprises actually launching products that take advantage of these generative AI systems yet. We thought Vino had a lot of real world perspective to share, and we think you're going to enjoy the conversation. Follow Vino: https://twitter.com/vinodhini_sd Follow Sean: https://twitter.com/seanfalconer
Transcript
So as an NLP engineer, it feels so good to finally have the attention back on the language
models that everybody's excited about working on. BERT and GPT going mainstream and almost
everyone talking about what GPT is, it gives you goosebumps, and in a very good way.
So are we kind of expecting too much from every company to suddenly become an AI company
overnight? Yeah, I think that's a very good
point brought up, because certainly, because of this wave, everybody is wanting to do AI. And some
people are not even, like, if your team does not have AI researchers, not even researchers, like
AI builders, for example, why are you even talking? Like, you need to have some foundation on which you want to now start
experimenting and whatnot, right? With the current state of gen AI and in particular LLMs, who
do you think, or what use cases are these, like, well positioned to be most useful for at this
time? Any application that has a human in the loop will be super ready to get started, and I don't
even think they need to, like, think twice about what are the right ways about this whole, you know, bias and
fairness and all of that, as long as there is a human involved. I know the human also comes with
their own bias and everything. But then it is, like, super powerful to have a human in the loop
versus just, you know, automating a bunch of stuff and hoping it all works out.
Hey, everyone, welcome. Sean Falconer here, one of the hosts of Software Huddle. And today
on the show, Vino Duraisamy from Snowflake. Vino has been working as a data and AI engineer for
her entire career across companies like Apple, Treeverse, and now Snowflake. And in this episode,
we dive into her thoughts on what's happening in AI right now and what a practical LLM strategy
for a company should look like.
We discussed the hard, unsolved problems in the space like privacy, hallucinations,
transparency, testing, and bias. There's a lot of problems. And we're very much in the wild west days of AI, and it still takes a ton of work to move beyond prototype to production with any AI
application. There's lots of hype, but not necessarily that many enterprises actually
launching products that take advantage of these generative AI systems yet. I thought Vino had a
lot of real-world perspective to share, and I think you're going to enjoy the conversation.
And if you do, please remember to subscribe to the show and leave a positive rating and review.
All right, let's kick it over to the interview with Vino.
Vino, welcome to Software Huddle.
Thank you. Thanks for having me today.
Excited to talk to you. Yeah, I'm always excited to talk to you. I think we met a couple years ago
now, but let's start by having you introduce yourself. I know you've been in the data
engineering and AI space for most of your career. How did that start and what led you to the position
that you're in today at Snowflake? Oh, okay.
I guess when I started out, I was working as a data engineer and I thought looking and taking inspiration
from the software engineering world,
at that time, full-stack software developers were quite the thing.
And I was like, oh my God,
data is lagging behind software engineering by a decade.
You know, probably five, six years from now,
full-stack data engineering is going to become the next big thing. So I'm going to have to build that
experience end to end. So I consciously took opportunities as data engineer, and then ML
engineer, worked with big, you know, companies like Nike and iProtech and, you know, Apple, and
built that end-to-end ML experience. And then, only for me to realize that the data world has decided to go in a
completely different direction now. We have, you know, like, so many different roles with almost overlapping
responsibilities, and it's kind of impossible to combine them all together into just one full stack
role. But it's been a very interesting journey, you know, even as a data engineer with that ML
experience, or as an ML engineer with the data experience,
it helps you understand the different components
in your entire pipeline
and kind of worked out well for me in the end anyway.
Currently, I work at Snowflake as a developer advocate
focusing on data engineering
and the language model workloads.
So it helps to have that wider breadth of knowledge
about the two different fields
having worked, you know, in those two experiences.
Yeah, I think that's kind of like a natural maturity curve for any discipline is, you know, as an area gets, you know, essentially matures, you get more specialization.
Like if you were, you know, a doctor, like a medical doctor from 100 years ago, you basically
did everything, you know, you took splinters out of people's toes and you performed surgery and delivered babies.
But now, like, the level of specialization that happens in the medical field is, like, really, really, like, specialized.
And we're seeing that more and more in engineering, software engineering and data engineering and so forth.
Like, there's just too much to know at this point for any one person to kind of, like, cover it all. And you have to essentially reach some level of specialization as, you know, areas scale and companies grow.
And in particular in the data space, there's just so much data to deal with.
You end up having these like specialized roles to kind of just like manage it.
For sure. Yeah. I mean, I guess more than, like, the industry maturing, looking at the number
of tools we have in the data and AI space, it's kind of almost impossible for just any one person
to pick up tools end to end to be able to do that. So I guess that kind of contributed to the
story as well. But I see the roles getting very specialized and niche, and it's all good,
I guess. They have AI engineering roles and prompt engineer roles coming up.
It's going to be fun to watch. Yeah, new things all the time. So you're a developer
advocate at Snowflake and you mentioned essentially having some general knowledge
helps because you can probably flex across different areas, talk to different types of people. But are there particular areas of focus at Snowflake that you have or
particular product lines that you are more focused on? Okay, so I'm not focused on a specific product
line per se, but then broadly the data engineering workloads and, you know, a little bit of LLM
workloads as they come on. But yeah, that's pretty much all I do.
There is also, like, another workload focused on
app developers, where Snowflake Native Apps are the next big thing, but I'm not so much,
you know, involved in the app development side of things. Yeah, that's actually the area that I'm
probably most involved in. So you also, you know, you mentioned,
you know, where you started your career,
you kind of bounced around from some big tech companies
like Apple and stuff like that.
But you've also worked for startups like LakeFS.
How would you compare and contrast those experiences?
Oh, wow.
Okay.
I guess when I worked with bigger companies
like Nike and Apple, for example,
there was a lot of process and everything in place.
So it's easy for you to kind of hop on board.
And on week two, you're already ready to understand what's going on.
And you have something easy to kind of start contributing to because the process and the method, everything is in place for you.
But then, like, startup was, like, a true wild journey for me, because I never worked with
a startup before. And then to work for a Series A startup trying to, you know, identify a product-market
fit and everything, it's almost like you become a Swiss Army knife of sorts. You end up
doing everything. Like, sometimes you're doing community, sometimes you're doing product and
a bit of engineering, updating the docs and whatnot.
I think it helped me broaden or like widen the horizons of what's going on in the different niche areas within the field and all of that.
But I guess the prime, like the main difference, like at least I see with Snowflake and with LakeFS, for example, is LakeFS was all about almost survival, you know, trying to identify do we even have a product market fit in this specific niche of a product.
But with Snowflake, it's like, oh, it's an already established behemoth. And all we're
trying to do is kind of think about scaling every single thing. So you think of even the smallest
of initiatives and programs, like the first thing on your mind is, oh my God, how am I going to scale it? Because Snowflake
scale is an entirely different game compared to LakeFS. So I guess working with problems at scale
is probably the biggest difference for me, and getting used to, you know, doing that.
Yeah, that's a good point. I think there's also a lot of, of course, variants in the world of software.
It's very different working for, say, Series A versus Series D,
where presumably a Series D
has hopefully found product market fit,
and they're actually working on some of those scale problems
versus being very early.
It's a lot of unknowns that you have to try to navigate.
It's both exciting, but it also can be a little bit,
I don't know, jarring,
because you're constantly context switching and maybe you know owning different types of roles and kind of
stepping in, trying to solve a variety of different problems. Yep, for sure. And it's fun also, you know,
with Snowflake, for example, right? Like, with the scale also comes that there are so many people that
you can, like, fall back upon. There are, like, tons of resources that you can leverage and, like,
you know, make it work for your programs and whatnot.
Yeah. You're not the first person to ever like have this particular problem in the history
of the company. There's probably been someone who's like laid some of the groundwork or
thought through it, uh, for you. Like, you can be at a startup and, I don't know, like,
need some healthcare coverage thing, and no one knows the answer to the question or something like that, because no
one's ever experienced it before.
Yeah, for sure. It was super helpful.
Yeah. So I want to talk about real LLM strategy with you.
So there's a ton, of course,
going on in general right now, and it's kind of hard to grok sometimes what's
real and truly impactful when it's buried
sort of under 50 feet of, like, overhyped AI marketing fluff. So first of all, as someone
that actually has real NLP and AI experience, what are your thoughts on the hype? Is this just a hype
cycle or is this something new? What's it all mean from your standpoint? No, I mean, this clearly is a hype cycle for sure.
And I feel it is in a good way because two years ago when I was working with language
models, NLP was not the coolest area to work on.
At that time, the world was, you know, all the rage was all about computer vision models
and object recognition and object detection.
Like that was the coolest thing everybody wanted to work on.
So as an NLP engineer,
it feels so good to finally have the attention back
on the language models
that everybody's excited about working on.
BERT and GPT going mainstream
and almost everyone talking about what GPT is,
it gives you goosebumps, and in a very good way.
But I feel, I guess, you know, part of having all the attention and limelight come to you is also that not everything is being fully
understood, because suddenly ChatGPT was everywhere and everybody was using ChatGPT, and nobody fully,
truly understood what exactly it was for and what are you supposed to do with it?
What does it mean to you?
And like, you know, there is a lot more to it.
I mean, if within the researcher community, we are debating as to, oh, can ChatGPT do reasoning and, you know, all sorts of math-related tasks,
then what would you think about a common man trying to understand what ChatGPT can do and cannot, right?
It just, it literally, like, just hit the world and we were like, okay, what is this?
What are we supposed to do with it?
And it was like almost a shock.
In a way, I feel like that probably is an aspect that needs a lot of eyes and involves,
I mean, it should involve some sort of an understanding and regulation of what's going on over there.
I see. So you think that there's essentially the excitement and hype around something like ChatGPT,
but not necessarily true understanding from maybe the general public or even people who are in the engineering space?
Yeah, for sure, right?
I guess, like, you know, you summarized before, like, it's great to have the hype, because
you know how difficult it was to probably get these, you know, data people
and the company's executives to focus on and invest in data and AI a couple of years ago.
And now, thanks to the hype,
you don't need to sell these AI tools.
Like, you know, it's not that hard
to get any investment or any buy-in
to build these AI projects within these companies.
The best thing that could have happened to AI teams.
But then it also means that probably
it's on the AI researchers and developers
to kind of build that sort of understanding
and expectation within the executives as well.
I think a downside of this as well could be that essentially because there's such an appetite for investing in AI,
either within an organization or even within the venture capital market for startups, there's also room for a lot of like, I don't
know, you know, sneaky little salesmen, you know, selling, selling a dream that maybe
is not quite close to reality.
You know, would you say we're sort of in the wild west days of AI to some degree?
I think for sure.
I mean, wild west in the sense, I feel like everybody's like, oh my god, AI is going to be the next big
thing. But how exactly is it going to impact your specific business, or your line of business, or your
domain, your industry? I don't think any of us have, you know, gotten that narrowed down. We've all been
using AI and ML and the classical, you know, machine learning models for different use cases and whatnot.
But then looking at a powerful model like this, we have still not figured out what to do with it. And I remember we were, I mean, I was there at a bunch of, you know, hackathons in SF last week and like in the last few months as well.
There were, like, two separate tracks, one for all kinds of chatbots, one for, is there anything other than chatbots anybody else
out there is doing with these LLMs? Because every other production, you know, workload or use case
that I hear of is some version of some copilot, or some assistant, or something that's reading the
documentation and answering questions for people. That's essentially, I would say, like, maybe a
customer assistant, right? But that's literally all that we're thinking of at this point. And we're still yet to explore what other, you know,
use cases it can be applied for and whatnot. And of course, you know, multimodal stuff's coming in,
with all these great, beautiful images that you can create with AI. You know, yesterday, you know,
OpenAI also announced the text-to-speech model and then Whisper.
And it's like the whole multimodal stuff has not been explored as well.
Primarily, the focus has been on LLMs, and we've not been able to get past this whole chatbot-type application. Yeah, I think to some degree, people's first introduction in a lot of ways, or a lot of people's first introduction to what you can do
with Gen AI is ChatGPT, which naturally makes you think about
or it could be GitHub Copilot, but it naturally makes you think about chat
and these types of use cases. It's kind of like the lowest hanging fruit
entry into using some of these systems. But I think
the real, like, impactful work that I'm
excited about is some of the things that are going on in, like, the world of biotech around,
you know, using generative AI for, you know, drug discovery and sort of really transforming
an entire industry from something that historically has been less of a design,
less of an engineering discipline,
and a little bit sort of, like, accidental discovery, and very expensive.
And it could really lower essentially the cost and also
decrease the time to market.
And then also, you know, things around like real-time translation is something that now
people are talking about actually being a reality within the next few years where, you know, I could be speaking to you and you could have a headphone in and like,
it's real time and you're just, you know, hearing it in whatever language that you need to hear it,
which would be like incredible thing to see as well. So I think we're not that far away from
going beyond the chatbot. And, but I think that it's kind of like the natural entry point for a
lot of people to think about.
Yeah, I think for sure. And yeah, I guess personally excited about using it on the creative aspects of it, for example, like writing.
Like it's literally going to help every person on earth to write better.
And not everybody has that, you know, like quirky, funny, humorous way of writing a blog, say, for example.
And it's like
super helpful when it comes to those aspects of it, and in general, like, any sort of creative
way of using it. I guess one of the most interesting ways I've, you know, had people tell me how they use
it is, I met a couple, and I guess they are from two different countries, so they come from two
different backgrounds and cultures. So they have their own version of, you know, heroes, or superheroes, call them, in their own culture and religion. And to teach their kids
about them, they are using ChatGPT to kind of combine these two worlds together to write
stories. And I was like, wow, that's, like, it almost sounds magical to me. I'm like, that's beautiful.
And that probably is the, you know, power of ChatGPT, to be as
creative as it can be. And, I mean, I don't know, you tell me, how difficult is it to come up with
new stories every day to tell your kids to put them to bed? Maybe come up with a fancy new
world that would involve Superman and Yoda and whatnot. Yeah, I mean, I use ChatGPT to pick my son's weekly
show-and-tell items, so
when he gets to the point where he wants to hear a custom story from me,
I'll definitely be relying on AI to help me generate that.
One of the other things that we mentioned was there's such an appetite
for businesses to try to either invest in these technologies, give more resources to their,
you know, AI and data teams, which is great.
But is it realistic to some degree?
Like, essentially, you know, based on my experience, I think most companies struggle to even do
analytics well.
So are we kind of expecting too much from every company to suddenly become an AI company overnight?
Yeah, I think that's a very good point brought up because certainly because of this wave, everybody is wanting to do AI and some people are not even like if your team does not have AI researchers, not even researchers like AI builders, for example, why are you even talking?
Like you need to have some foundation on which you want to now start experimenting and whatnot,
right?
But I feel like that may not be that big a deal because now we have a slew of tools
that come in.
I mean, again, like, I guess a line of self-promo here, but then I saw the Snow Day announcements
and Cortex trying to enable
analysts with all these ChatGPT-powered functions, and I'm like, oh wow. So you can be a
data person who has nothing to do with any AI stuff, but why do you have to worry about it, right?
Like, let's say I've been using SQL all these years, and I use the average, the min, and the max
functions just, you know, just like I know how
a calculator would do them. I don't have to worry about how it is being run internally. So why can't
I do the same thing to create, like, general product descriptions, or to create something? Just, like,
taking the power of these ChatGPT, AI stuff to all the different use cases by building these
different products has been a very interesting
thing too. So I feel, you know, there's going to be a bimodal distribution of companies when it comes
to adoption, right? So one group who actually have these ML engineers and builders who want to build
solutions for themselves, probably going to be very highly custom based on their domain data
set and whatnot. And there's going to be another group of companies who are probably going to be using solutions
from all the different companies
to kind of take care of their AI needs and requirements.
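To make that calculator analogy concrete, here is a minimal, hypothetical sketch of what calling LLM-powered functions from SQL could look like, in the spirit of the Cortex functions mentioned above. The connection details, the table, and the SNOWFLAKE.CORTEX.SUMMARIZE / SNOWFLAKE.CORTEX.TRANSLATE function names are assumptions for illustration, not a verified API.

```python
# Hypothetical sketch: using LLM-backed SQL functions the way you would use AVG() or MIN(),
# without worrying about how the model runs internally. Credentials, the table, and the
# SNOWFLAKE.CORTEX.* function names are illustrative assumptions, not a verified API.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",    # placeholder connection details
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)

sql = """
SELECT
    review_id,
    -- assumed LLM functions, called like any other scalar SQL function
    SNOWFLAKE.CORTEX.SUMMARIZE(review_text)              AS review_summary,
    SNOWFLAKE.CORTEX.TRANSLATE(review_text, 'en', 'de')  AS review_de
FROM product_reviews
LIMIT 10
"""

cur = conn.cursor()
try:
    cur.execute(sql)
    for review_id, summary, translated in cur.fetchall():
        print(review_id, summary, translated)
finally:
    cur.close()
    conn.close()
```

The point of the sketch is the shape of the call, not the specific service: the LLM shows up as just another function in the query, the same way a data person already uses built-in aggregates.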
Yeah, and I think that we'll get better
at building abstraction layers
that make it easier to essentially
even customize some of the kind of
out-of-the-box functionality.
Like you mentioned Cortex,
like there's functions for, you know,
summarization for, you know,
a question and answering for translation and so forth.
But then I think when you start getting into
trying to customize it for a specific domain,
then you need to, like today,
it's still like a fair amount of effort
to build like a whole like embedding model
and kind of use information retrieval methods to
pull the right, uh, context to feed into the LLM, and then, you know, make sure that you're not giving too
many tokens. And there's quite a bit of work that's involved in that process today. But I think that's
going to get easier with time. It's kind of like the nature of any, like, early, you know, technology.
Um, and it's going to get simpler for people to do that kind of work without deep expertise in machine learning. Yeah, for sure.
Because I feel, like, think about Postgres having pgvector, right?
So now everyone's going to, like every existing data tool or AI
ML tool is going to kind of arm themselves to be able to serve their customers better
in terms of these AI aspects of it. And yesterday,
I was like, last
week, I literally spent all my time to build a retrieval augmented generation based, you know,
LLM assistant, and I'm trying to understand how I should chunk, and going to LlamaIndex and trying
to identify what I should do, all sorts of in-the-weeds stuff trying to build an LLM chatbot. And then yesterday,
OpenAI people tell me,
oh, you know what?
You don't need to do anything.
Why don't you write a small prompt?
And then we would create a custom GPT for you.
There is built-in retrieval.
There's built-in vector.
You don't need to worry about embedding,
creation of embedding, storage of embeddings.
Don't need to deal with vector databases.
And I'm like, oh, wow.
So this is only going to get better and better.
So I feel the long tail of companies who don't have those AI resources are going to probably benefit the most
because you don't need to do anything, but you still get everything handed
over to you on a platter, because you can access all these tools to do the same.
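For context, the chunk, embed, retrieve, and prompt loop Vino describes spending a week on looks roughly like the sketch below. It is a minimal sketch with placeholder embed() and call_llm() functions standing in for whatever embedding model, vector store (pgvector, a managed service, or a plain array), and LLM you actually use; it is not LlamaIndex's or OpenAI's API.

```python
# Minimal retrieval-augmented generation (RAG) sketch: chunk documents, embed them,
# retrieve the most similar chunks for a question, and build a prompt for an LLM.
# embed() and call_llm() are placeholders for whichever embedding model / LLM you use.
from typing import Callable, List
import numpy as np

def chunk(text: str, size: int = 500, overlap: int = 100) -> List[str]:
    """Split text into overlapping character windows (naive chunking)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def build_prompt(question: str, docs: List[str],
                 embed: Callable[[str], np.ndarray], top_k: int = 3) -> str:
    # 1. Chunk and embed the corpus (in practice you would store these in a vector database).
    chunks = [c for d in docs for c in chunk(d)]
    chunk_vecs = [embed(c) for c in chunks]
    # 2. Embed the question and rank chunks by similarity to it.
    q_vec = embed(question)
    ranked = sorted(zip(chunks, chunk_vecs),
                    key=lambda cv: cosine(q_vec, cv[1]), reverse=True)
    # 3. Keep only the top chunks so the prompt stays under the model's token limit.
    context = "\n---\n".join(c for c, _ in ranked[:top_k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical usage:
# prompt = build_prompt("How do I rotate my API key?", company_docs, embed=my_embedding_model)
# answer = call_llm(prompt)   # any chat/completions API
```

The hosted "built-in retrieval" she mentions essentially folds those three steps into the service, which is why so much of this glue code can disappear.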
Yeah. So even though there's this transformation that's happening, and I think the closest, like, analogy that I've experienced is, like, the
introduction of the internet, in terms of, like,
like a hard sort of line drawn in the sand of, like,
we had, like, the pre-, disconnected, you know,
Dewey Decimal System era of the world, to, like, the connected world where,
you know, I can talk to pretty much anybody that's, you know,
connected to the internet in the world. And, you know, we've come a long way in the last 30 years since it was
sort of introduced. But now we're sort of, I think, drawing another hard line in the sand of, like, the
pre sort of AI era to the post AI era. But do you think there's a danger when it comes to all the
hype that's going on where a lot of the investment dollars
are going to go into essentially companies that are doing something in that space,
or even on the research side, probably researchers that are getting the most funding right now,
or getting their talks, or, you know, conference talks, or journal submissions accepted and stuff
like that are going to be kind of riding the hype train to some degree.
Yeah, I mean, very much. This has been, like, I don't know, a personal pet peeve of mine for a while now. Well, we had classical ML models, right? Let's say you built a model for housing price
prediction. And you know, for a fact, that it cannot be a 100% accurate model. And it made mistakes on
some data points. And that was totally fine, and that's kind
of how the world operated, right? And ChatGPT is just a different type of model. It's a text
generation model and whatnot. Even this model is supposed to have some errors. Like, that's the whole
point of it. Like, we know that it cannot be 100% accurate. But then, for some reason, when ChatGPT is giving
you wrong messages or responses, like, we have somehow compared it and humanized it, saying, oh
my god, the models are hallucinating. Oh my god, is ChatGPT lying? Is it intentionally lying? What is
the intention? I'm like, oh, just because now you want to create a term, like, coin a whole new thing and then go do some research.
Again, like, you basically ask ChatGPT a couple of times and then be like, oh my god, so ChatGPT
is lying, and we built a framework to understand how it lies. And I understand that we're trying
to ride the wave and the hype cycle and all of that, but I do feel like, to some extent, we're, like,
stretching it too far. Because at the end
of the day, it is an ML model, just a large language model, but it is also still bound to, you know,
produce errors. It cannot always be wrong. I mean, it cannot always be right. But it's, like,
humanizing it and trying to, like, over... I feel it's probably just to get more papers out. But again, I wouldn't call this the right direction for us to go.
But I feel this is probably a cycle too, maybe.
Yeah.
Well, it's kind of more clickbaity.
It's going to get you more views to refer to these things as lying and hallucinations
than essentially error rates, precision, and recall from traditional machine learning.
When a traditional classification model misclassifies
my Fitbit exercise as a sit-up when I was doing a push-up,
that is what we would report as an error rate.
No one makes a big deal about that.
But when essentially a GPT model, or, you
know, another type of LLM, produces something that's, like, misinformation, then I
think it's because it feels more human, the output of it. Then, you know, there's a stronger reaction
to it to some degree. Yeah, I think I understand. Like you said, right, just because the responses
are more human-like,
we're maybe a little bit more scared and take it very seriously when something is wrong, like
the responses are wrong. But then again, if, as researchers, or the academia and the industry
together, are trying to hype it up unnecessarily, then, like, how does it, you know,
what are the ripple effects of that within
the common folks? And people are going to be like, oh my god, these AI models can lie, and these AI
models, I cannot trust these AI models. And it is important for us as the community to
maybe build some rules around models, so it becomes more and more trustworthy, and,
as a, you know, society as a whole, we get more used to having these models to help us, really.
But then if we ourselves are inventing these new constructs and kind of producing or creating
these scary scenarios, what would you expect the media to pick up?
And what would you expect common folks to understand from this? I feel like there is this whole responsible AI
and there's a whole field of it, but it's
just not for people who are researching in the responsible AI and ethical AI
field. It should be for literally every person who is working with AI
to some extent to have that ownership and accountability and be responsible
when you talk to or even,
you know, come up with these new terms.
Yeah, absolutely.
And then, you know, I think also, you know, in the sort of abstract academic sense of
an AI innovation, in many ways, like the goal might be to mimic humans or even outperform
humans in some respect.
But to be human also means to be flawed
and sometimes have, you know, bias or lie. But the bar for AI is much higher. We want, like, human
level or better performance, but we also want to eliminate all bias and all error. So I guess like,
what are your thoughts on sort of the balance between creating something that sort of like mimics or feels human, but also making sure that, you know, it isn't essentially, you know, performing errors or, you know, we understand that there's potential errors.
Or even on the flip side, when you mentioned responsible AI, you can get sort of get into the potential for toxicity or bias.
Yeah, I think, you know, so when you think about this whole humans are not fair and we have inherent biases and everything,
and then how is this AI model being trained?
It is being trained on the data on the internet.
And all the data that we created as humans is going to be biased inherently. So
when you trained your model on the data on the web, what did you expect it to be? Like, isn't it
obvious that it is going to be biased, that it's not going to be fair, and everything that a normal
human would have is going to be there with ChatGPT or any other, you know, AI model too? And I feel like we're trying to almost even demonize
it, saying, oh my god, ChatGPT responded something so sexist. And I'm like, hey, I mean, I guess
the better analogy for that would be a child just growing up in the world, right? And, you know, the
world is an unfair place, but then when you bring up your child, you are trying to
kind of set boundaries and make, you know, some sort of a protective mechanism. Like, you as a parent make sure the child has this protective environment to grow up. So it doesn't mean
that just because the world is unfair, you let the child out in the world and be like, you know what, you
deal with it, you're going to have to, because the world's unfair. We do create whatever protective environment we
can, to make sure, as much as we can, again, that the child grows up, you know, in an ideal way,
with ideal behaviors, ideal modeled behaviors, I guess. Very similarly, even with this model, right?
Like, looking at the size of these language models, it's probably impossible to create
synthetic data that is, you know, purely fair and out of bias and everything, to train these models.
And then I don't even think they will respond human-like with all of that.
It's kind of impossible to create this ideal data set to train these models.
You've got to work with what we've got, which is this biased data set.
But then though, like, take this biased data set,
train the model, the model is doing good, but then you need that protective environment or
some sort of a post-processing layer for your specific applications to make sure, you know,
the bias and the fairness and the toxicity and all of that are being kept under control.
That's almost a straightforward approach. But I feel, again, it's garbage in, garbage out. So
the problem is in the data. Where does the data come from? So the problem is in the world. Like,
why are we demonizing AI models for that? Like, that's pretty much all it can do. So I feel that
is also something, I don't know how the narrative was built out, but then this entire
demonizing AI, and, like, the world's gonna end, and this whole, like... I also read up about this
Future of Life Institute. So they've created this entire, almost like a research field now, to make
sure, how are we going to make sure we don't go extinct, or, like, AIs are not going to become a
threat to us, and, like, a whole different
thing. And I'm like, I don't know what's going on behind the scenes, of, like, what is even the
thought process behind such a thing. But I feel like we should not demonize AI just for the sake
of our own short-term benefits, whatever it may be, like publishing a paper and getting grants
and funds and whatnot. Yeah. I mean,
there's enough kind of fear around technology innovation in particular things that are like AI, robotics, I think quantum computing is another one that we've
kind of touched on.
Those are like the three like hot button subjects that like make people scared.
And I think part of it is just like they're complicated.
Not a lot of people understand everything that's involved. It feels a little black boxy and
there's potential for major impact on
people's, maybe their jobs or whatever it is.
We probably, as people working in technology
or researchers, we're sort of doing ourselves a disservice by
coming up with terms that sort of fuel that fear even more.
Right, exactly. I mean, the responsibility is literally on us to make sure we help the world
understand what these tools and models are, so we can all benefit from it together,
but not the other way around. But hey, I guess probably this is how it starts, and it takes some time to kind of get
there, and the world can also be okay with the AI models. So one of the things we touched on
when you were talking there, in terms of, you know, like, garbage in, garbage out. And as much as we're
in this, like, AI revolution, it's also, it's very much, like, a data revolution. We couldn't have
the models that
we have today, 15 years ago, because we didn't have enough like digitized data to even do the
training. Can you talk a little bit about the value of data when it comes to AI training?
Okay. Yeah, for sure. I guess, I'm not sure where I heard this from, but it's like
all your AI problems, or at least most of your AI problems
or ML problems are going to be data problems, right? If you have this right ideal data,
like training data to train your model, your model, of course, is going to turn out well.
But then when it comes to the data, there is like a bunch of things, of course,
like the standard data quality issues and making sure you have the right kind of data,
you're using the right kind of features.
And there was this classical ML problems with feature engineering and everything,
which kind of went away thanks to these LLMs.
Now, we don't even have control over what features do we want to use and everything.
So it's all taken care of by itself.
But then coming up with this data and identifying it, and to some extent,
although we don't have control over what data exists out there in the world, taking that data and doing some sort of pre-processing and massaging and making sure it's not too, you know, off. Not a lot of companies today are using their own data to fine-tune the
models, or using their own data to build a model or train a model from scratch. That's probably
why we've not talked about the data quality issues or how does one even prepare the data
to train a model like GPT, for example. So we've not even like, as a community, I feel like we've not even gone there
and touched upon all of that a lot,
but I feel more and more,
as the field matures and a lot of tools get introduced
to make it easy for us,
there is gonna be, again,
like the foundational and the fundamental problems
of data quality, I guess it's probably a cycle,
it's probably going to become the next big thing in the next couple of years.
Well, even I've heard that a lot of the innovation that's happened in particular in the GPT models
from moving from 3 to 3.5 to 4, a lot of it's not necessarily major changes
in how they're doing deep learning.
They're actually changes around how they're, like, preparing the data and ensuring, like, higher data quality, and reducing,
you know, uh, the propensity for, like, hallucinations, and, um, so forth. So it's really about, like, how
do we innovate at the data level, rather than necessarily, like, big changes to what's happening
at the deep learning level. For sure, yeah, right. Because there are, like, two big components to it. One is, like, the pre-processing
layer, and the other one is the post-processing. Pre-processing is literally all data problems.
And, I'm not sure what happened to the initiative, but if you remember Andrew Ng,
he came up with this term called data-centric AI and data-centric ML,
I think probably a couple of years ago, before ChatGPT was even a thing. And his ideology
was that, you know what, we may not be able to do a lot of improvement to the models by just
playing around with the hyperparameters as much, because there is only so much you can do by,
you know, playing around with hyperparameters. But then if you fix the hyperparameters and work on your data, we will be able to further reach the higher metrics in terms of accuracy or F1 or
recall or whatever. And that was one of the, I guess, successful ways in which industries
and companies were also doing that, even when I was working as a language engineer.
So it's going to come back again, but probably a little later when we're
ready for that, I guess, adoption. In traditional machine learning, how do people usually go about
measuring and improving data quality? Okay, so I guess in traditional machine learning models,
they measure the models using certain metrics, and then data quality is also measured in terms of,
okay, if I do this pre-processing method, or if I do a standardization and a normalization,
or even lemmatization, and all sorts of these pre-processing features, it was almost like
an experiment, right? It was not a test of data quality, but it was more of a test of,
if I do this specific
pre-processing, how does it affect the efficiency of my model? And that's kind of how we worked with it.
And, like, you know, using a specific feature over the other one was not necessarily a right or
a wrong thing, but for this use case, for this model, this is the feature that, you know, gives
me that. So it's all good.
But when it comes to the not-so-traditional large language models,
you cannot really do any of these handcrafted feature engineering methods to improve your accuracy.
Or there's not even accuracy.
But then, you know, the traditional methods again, right?
Like if you do a lemmatization and standardization and normalization,
you're not trying to create, I guess, discrete output here.
It's not a simple prediction.
So when you do remove all the extra context from the text that you're feeding in for your training,
it's going to affect all the, you know, the way in which it creates your output responses as well, which is probably
why it has become a challenge for us to understand, okay, so for my large language model, I don't have
an accuracy as a metric or F1 score as a metric, but then how am I going to make sure, what am I
supposed to do with this data to get this ideal, you know, positive, non-toxic kind of responses.
I don't think we still have that, you know, solved today to understand what kind of data
quality metrics do I even need to use that data to train my large language models.
It's almost still experimental and we are figuring it out again.
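As a rough illustration of the classical, data-centric loop Vino describes, where the model and hyperparameters stay fixed and only the data preparation changes, here is a sketch using scikit-learn on a stand-in dataset. The dataset and the choice of F1 as the comparison metric are assumptions made just for the example.

```python
# Data-centric experiment sketch: hold the estimator (and its hyperparameters) fixed and vary
# only the preprocessing step, then compare a model metric (here F1) to decide which data
# treatment wins. The dataset is a stand-in; any labeled tabular set works the same way.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Same estimator everywhere; only the data preparation changes between variants.
variants = {
    "no_scaling": Pipeline([("clf", LogisticRegression(max_iter=5000))]),
    "standardized": Pipeline([("prep", StandardScaler()),
                              ("clf", LogisticRegression(max_iter=5000))]),
    "min_max": Pipeline([("prep", MinMaxScaler()),
                         ("clf", LogisticRegression(max_iter=5000))]),
}

for name, pipe in variants.items():
    f1 = cross_val_score(pipe, X, y, cv=5, scoring="f1").mean()
    print(f"{name:>12}: mean F1 = {f1:.3f}")
```

The same loop breaks down for LLMs precisely because, as she says, there is no single score like F1 to compare the data-preparation variants against.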
Yeah.
So just to kind of make sort of parse what you were saying there,
in sort of traditional machine learning,
usually we have some idea of what we want to produce
from the model as output.
Like maybe it's a classification model
or a predictive model or something like that.
So we can essentially have some test set
where we know what the prediction should be, and then
essentially run the model against that test set to figure out how precise it was. And then we can
make tweaks to the data to try to improve the quality and then run the test again and see if
that improves it. But in the world of large language models, it's not so clear, because
we're essentially generating text. And it's not clear, like, you're not going to
necessarily generate the same text every single time, because there's some, you know, randomness
factor that's involved with the generation of text. And then it's also not necessarily clear, like,
what is, like, a good response or not. I know if I tell it to write a blog post or something like
that, uh, well, you know, the, uh, one version of that might be better
than another based on my subjective understanding, but it's hard to necessarily empirically measure
what is better. Yeah, I guess that's pretty much it. Like you clearly summarized it,
which is basically, we don't have a great way of measuring how good an LLM model is.
And when you don't have a metric to evaluate,
there is nothing you can, like, you're not knowingly, consciously, with the right intent,
tuning to get that, you know, metric. So it's almost a throw-everything-and-see-what-sticks
kind of approach that we're doing with, at least, you know, training the LLM models
currently. And, you know, I guess evaluating an LLM model could
be a whole different discussion in itself. And as a research problem, I don't think it's fully
solved. We are still working on, you know, coming up with different frameworks and benchmarks and
whatnot. And currently, it's very interesting, because we do have a bunch of frameworks that
people use as benchmarks to see, okay, how good is my model
performing on this specific, you know, Q&A data set and whatnot. But again, you would have to
create this data set for every different use case with every different industry. So it cannot be
generalized. Like, you cannot just say, you know, what is the accuracy of the model and be done with
it. Because everything is kind of non-deterministic here
with large language models and text.
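To illustrate the contrast being drawn here, the following is a small, hypothetical sketch: for a classifier, predictions and labels are discrete, so accuracy is well defined, whereas for generated text an exact-match score is nearly meaningless, because two differently worded answers can both be acceptable. The labels and strings are made up for the example.

```python
# Why classifier evaluation does not carry over to generated text.
# For a classifier, predictions and labels are discrete, so accuracy is well defined.
from sklearn.metrics import accuracy_score

y_true = ["pushup", "situp", "pushup", "squat"]
y_pred = ["pushup", "pushup", "pushup", "squat"]
print("classifier accuracy:", accuracy_score(y_true, y_pred))  # 0.75 -- one clear error

# For an LLM, two samplings of the same prompt can both be fine yet match nothing exactly,
# so an exact-match "accuracy" against a single reference answer tells you very little.
reference = "Rotate your API key from the account settings page."
generation_a = "You can rotate the API key under Account > Settings."
generation_b = "Go to the settings page for your account and rotate the key there."
for name, text in [("a", generation_a), ("b", generation_b)]:
    # Both print False, even though both answers are acceptable paraphrases of the reference.
    print(f"generation {name} exact match:", text == reference)
```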
Yeah, and generally, I think the breadth of potential use cases
is much wider than what we've seen from traditional ML.
Because a lot of times where we've been successful historically with ML
is on fairly narrow problems, whereas the LLM approach is more of a general solution to a wide range of problems that you're going to apply AI.
I could use it for question answering, for reasoning, for writing.
I could do a whole bunch of different things, and then you can break it down by domain and stuff.
What is good in one domain or one use case might be, you know, drastically different somewhere else. Yeah, for sure, right? Like, the same model could be doing well for a specific
use case, it may not be for another one. So there is, like, no one standard metric to kind of
uniformly say what's better, what's not. So we touched on, clearly, like, evaluation, testing,
ensuring data quality, measuring data quality. Those are all big problems.
What are some of the other real problems in the LLM space that you think need to be solved?
Ah, okay.
I guess the biggest one being this whole governance, privacy, and security of it.
It was funny because, again, going back to the OpenAI's Dev Day keynote yesterday,
they had a very interesting announcement called Copyright Shield, where OpenAI will help their customers with the legal fees and everything on the legal front if there is a copyright infringement from using their APIs. I'm like, wow, okay. And you do know that when you go to data conferences
and you are watching the keynote,
you have a bunch of data governance tools.
How do you set up your RBAC and access control?
Who gets access to what?
And what are the PII governance mechanism?
And what are we using to make sure our PII data is not exposed?
And that is a whole different kind of a discussion, right?
Like we have tools,
we know what to do. And we're just trying to get the, you know, stakeholders to start doing that.
Whereas, when you come to these LLM ones, it was mind-boggling to actually hear someone say,
oh, you know what, if you ever run into a problem, here is your insurance, like,
for your legal support. And I'm like, oh, wow. So there was nothing on the security or
the governance of the privacy, you know, aspect to it. And I was, of course, you know, shocked,
but then it's also the wild west, I guess, the industry's like very much in the early stages.
So it's kind of understandable. Yeah, I think this is, I think this is one of the major problems with the space right now that needs to be solved.
And I think for a lot of companies to really move beyond sort of, like, demos and proof of concepts to, like, production-grade products that they can feel confident aren't, you know, a risk of, like, a data leak or some sort of privacy infringement down the road, they need to be able to figure out this problem. We kind of know, at least we understand, the challenges with regular data
management around, like you said, you know, governance, RBAC, like, all these types of things.
We don't always do it, or do it well, but we kind of understand, you know, if someone
comes to your business and says, I want my data deleted, it might take a bunch of work to track down.
It's not optimal to track down all those places that you need to delete it.
You might not necessarily do a good job of it if you don't essentially have a lot of these tools and infrastructure in place to deal with these types of things like right to be forgotten.
But we understand: delete the row from the database, and it's gone. There's no delete-a-row
from an LLM. So if, you know, sensitive data is part of that training process, or even part of
an embedding model or something like that, there's no way to really delete that information. And I
think that's a place that is one of the central challenges right now that not necessarily sort of
everybody's like, really understanding the nuance there. And on top of that, it really comes down to not only
not sharing the information, but how do we essentially train or augment models in a way
that can still leverage essentially like copyright or private information without essentially leaking
the information and then controlling access to it afterwards.
Yeah, I think this, like, you know,
the whole privacy and security aspect of it will not be fully solved
until the field of explainable AI
kind of comes into the mainstream
because today, right,
we have literally zero control
over how this, you know, encoding and decoding
happens within GPT models and how
it actually understands and makes sense of the input training data, right? We are literally
just feeding the data, and then it's just working on it by itself, and it's done. Which is probably
why we don't have any control over, how am I going to delete the specific data from a specific person,
or how am I going to make sure this specific, you know, private set of data is not being exposed.
We can't do any of that, only because we don't fully understand how these models work. But I feel these
are, like, two parallel things. And maybe, you know, I'm not sure if you're on board with the same
idea too, but I feel like unless the field of explainable AI moves further,
it's going to be difficult for the security and the privacy components
to kind of keep up with that.
What is, this is the first time I've heard that term explainable AI.
So I'm assuming that's essentially a way to kind of,
for the AI to explain why it makes decisions.
Is that right?
Yeah, exactly.
Like if you think about the classical ML model, for example, right? So when you do have a certain prediction, there is always a way of
going back and saying, okay, so I predicted the house prices to be this, but what were the factors,
you know, based on which you arrived at this price? Like, you know, the number of rooms,
like, there are, like, input features you can always map back to. When it comes to these, you know, I guess, fuzzy
neural networks, or anything after neural nets, really, there is no way of trying to explain how
they do what they do. Which kind of makes it, like, you know, even fuzzier for us to understand,
like, how are you actually going to identify how your data, how specific PII data, is getting encoded as part of
the training process? And how are you going to make sure that specific part is not getting exposed?
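As a concrete version of the "map the prediction back to input features" point, here is a sketch for a classical house-price model: a linear regression's prediction decomposes exactly into per-feature contributions, which is the kind of explanation that has no straightforward equivalent for a large neural network. The feature names and numbers are invented purely for illustration.

```python
# Explainability for a classical model: a linear house-price model's prediction decomposes
# exactly into per-feature contributions (coefficient * feature value) plus an intercept.
# Feature names and numbers are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

feature_names = ["rooms", "square_feet", "distance_to_city_km"]
X = np.array([[3, 1200, 10.0],
              [4, 1800,  5.0],
              [2,  800, 20.0],
              [5, 2400,  3.0]], dtype=float)
y = np.array([300_000, 520_000, 180_000, 700_000], dtype=float)

model = LinearRegression().fit(X, y)

house = np.array([3, 1500, 8.0])
prediction = model.predict(house.reshape(1, -1))[0]

# Each feature's contribution; together with the intercept they sum to the prediction,
# so you can always point to which inputs drove the price up or down.
for name, coef, value in zip(feature_names, model.coef_, house):
    print(f"{name:>20}: {coef * value:+,.0f}")
print(f"{'intercept':>20}: {model.intercept_:+,.0f}")
print(f"{'predicted price':>20}: {prediction:,.0f}")
```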
Yeah, that's a good, that's a good point. I don't know, I know there's some, you know,
recent research we actually touched on in a recent episode around, like, trying to apply different
methods for, like,
deleting certain information from LLMs. It doesn't really solve, I think, the underlying problem.
Most of them rely on fine-tuning, which is expensive, like, they're not that
realistic, a lot of the approaches that I've seen. Or there's just sort of the, uh, like, complete, like,
you know, um, sealed-off approach, where it's just like, we're not going to let any
of that information in. But then you lose out on a lot of the value of the model, of using potentially,
like, your own, you know, company documents, for example, to try to train an LLM or augment it.
Oh yeah, I mean, I have a whole different story with this whole fine-tuning of models. But I'm not sure
if that's probably the right way to do it. I guess I don't know if I can
take the lessons from my previous, you know, language model training experience to large language models.
But broadly, like, one of the biggest challenges when it comes to fine-tuning, for example, is
when you have proprietary data and if you train your model only on that proprietary
data, like your company's documentation, for example, it's not even proprietary, but if
it's just your company's documentation, and the model only knows about your company's
product because it's only trained on the documentation.
So when I'm going to ask questions, it's not going to talk about something political or religious
or anything scandalous. So you can be 100% sure it can hallucinate as to, how do I use, what is the syntax of this function,
and whatnot, but it's not going to give you any scandalous, you know, topics. But when you
take a generally available foundation model that was trained on this, you know, world data, and then
you try to fine-tune it with your own documentation and whatnot, how are you still going to be sure that it's not going to respond to any random, irrelevant
religious or political question, right? You can never guarantee that. And again, I guess,
even historically, transfer learning hasn't been as successful, or, like, not so popular a method when
it comes to, you know, fine-tuning your existing model to a whole different task. Because it also loses meaning if the foundation model is
good at ABC tasks and you're fine-tuning it with a new task, you cannot be very sure that it is still
able to do the ABC tasks it was originally trained for. So then what is the point of it again, right?
So you might as well have trained it on your own data.
You can at least be sure that it's not going to talk any irrelevant things
or, like, any scandalous things to your customers.
So again, I'm not big on fine-tuning,
but I guess that's kind of how the world is going to evolve
from prompt engineering to RAG to fine-tuning.
And then eventually we'll hopefully get to custom training of proprietary
LLMs, but it's probably going to be a long way there.
With the current state of Gen AI and in particular LLMs,
who do you think, or what use cases
are these well-positioned to be most useful for at this time?
I feel like all sorts of productivity tools are like super helpful because
when it comes to any of these, you know, developer productivity or marketing productivity or
like creative, any of these tools, there is still a human in the loop. It's just making our work
easier or like making the getting started
easier. So these kinds of use cases, I feel, are, like, very well primed for, you know, taking off. Like, I don't
have to worry about hallucination, I don't have to worry about all the hundred problems that come
with GPT, because I know there is always going to be a human in the loop to kind of, you know, oversee
what they're going to, like, what the tools are going to do. So at least for the moment, where, you
know, as we're positioned today, any application that has a human in the loop will be, like, super
ready to get started. And I don't even think they need to, like, think twice about what are the right
ways about this whole, you know, bias and fairness and all of that, as long as there is a human
involved. I know the human also comes with their own bias and everything,
but then it is super powerful to have a human in the loop
versus just automating a bunch of stuff and hoping it all works out.
Yeah, I mean, I think it's hard to argue that things like GitHub Copilot
and the other coding assistants haven't been valuable to people.
I think people are seeing massive efficiency gains from that,
but
we're still a long way from just, you know, copying, pasting code, submitting it to production
and hoping it works. But probably, yeah, it's a long way to go for sure.
And then what are some of your recommendations, I guess, for, you know, developers interested in
the space or businesses, you know, do you think, you know, in your opinion, are people like asking the right questions when they're kind of like, you know,
charging into the world of AI? Okay. I think this is probably the greatest time to be a developer
or anyone like a builder working in tech, really. Again, I'm sorry to keep going back to the
OpenAI Dev Day. I feel like at this point, it's probably like a recap of the Dev Day keynote.
But they had announced this chat GPTs and a chat GPT,
I'm sorry, GPTs and a GPT store
where you can build your own custom GPT for any use case
and then put it out in the store and people can use that.
And almost like a one-to-one equivalent for me,
like, I was instantly
able to connect it to Snowflake Marketplace. So you put your, you know, data apps out there, or, like, GPTs
out there, and you can have your users use it and be done with it. And I'm like, probably even as a
builder, right, you can be a one-person founder and then go on to build a million-dollar or even, like, a billion... you never know. So the
opportunities as a developer with these kinds of tools are, like, super cool and endless. And if you're
not a builder too, again, right, you create a custom GPT by writing an input prompt.
Like, you don't even need to know, like, forget Python, forget this, forget that. I'm like, how cool is that? And if you do have a great use case, you may be able to make some money off
of that. That's, like, putting so much power in the hands of the builders and developers, and
that's, like, super cool. But when it comes to the organizations, though, it makes me really wonder
as to what should the organizations really do?
What should their approach be at this point?
Because yesterday after the announcement, I personally felt like a bunch of tools that I've worked with might go obsolete because now, oh, no.
OK, done.
But if you're an organization who probably are already paying those companies to experiment and build with,
what should you be even thinking?
So as an org, should I only stick to these big players,
hoping that they will eventually catch up and there will be a feature parity?
Or am I supposed to take the bets and try the new tool that's probably going to be a lot more useful to my team in the short run?
Because again, everything is... The competition is within the timeline, right? You want to be the first within your
industry, within your, like, you know, domain to kind of go win and capture that market. So are
you, like, supposed to experiment, experiment, experiment, be ready to, you know, bet on the
small startups? Or should you be like, I never know if the startup is even going to be there for a while.
So do I wait for these big players to catch up?
That's going to be, I don't think I have an answer for that.
But that's kind of what I was thinking in my headspace was after, you know, looking
at all the announcements.
Yeah, I think, I mean, I think that's always a challenge for all companies, you know, investing
in technology. Even with the sort of, the migration to the cloud,
I think, you know,
people faced similar challenges. It's like, do I, you know,
stick with my on-prem solution,
or do I start to modernize and move to the cloud now?
And it's always a delicate balance.
I think it really comes down to,
I guess, like, having confidence
in what competitive advantage it gives your business,
or, you know, the ROI that you're going to get
for your business.
And I guess your level of confidence
that hopefully the company that you're investing in,
like buying their licensing model,
is actually going to be here in a couple of years.
It's always, you know, a certain amount of leap of faith with that.
Gotcha. But yeah, I think that's going to be very interesting for me, at least, to watch, because,
yeah. Awesome. Well, as we start to wrap up, is there anything else you'd like to share? Any other,
you know, sort of nuggets of wisdom that or things that you've been thinking about?
I mean, I guess I probably have like one last pointer,
which I feel is, again, coming off from my personal experience. I run into a bunch of folks
at meetups and, you know, hackathons and whatnot. And I see a lot of data folks who don't have a lot
to do with AI are kind of isolating themselves and being like, oh yeah, there's just a lot happening in the AI world.
It has nothing to do with me.
But then as the data engineer or data analyst person,
you know that it's going to come into your zone.
You know, if someone is building an AI model,
you will probably be building the data pipeline
to take in the data to train it.
So it's always going to come into your zone for sure.
So it's probably wise to keep up with,
I know it's impossible to keep up
with everything that's going on,
but then it doesn't hurt to probably
give some tools a try and just keep abreast
of what's going on around.
So that's going to be very helpful.
Yeah, I mean, I think it's similar
to any of these digital transformation shifts
that happen, like moving to the internet, moving to mobile,
moving to cloud,
like you're kind of doing yourself a disservice by like writing it off as
like, you know, nothing to do with me.
I think this is something that's probably going to touch essentially every
like facet of not only engineering,
but most people's jobs in some capacity over the next decade or so.
So it's important, I think, to like at least educate yourself on what is happening and like how you might be able to leverage these tools to do your job better and more efficiently
or, you know, how you might even be able to contribute in an interesting way.
Yeah, for sure.
It's going to be very interesting for everyone, especially for
students. Like, I'm doing a course, like, on my weekends, doing this executive MBA thing, and I
was kind of shocked that people were not even using ChatGPT for anything. And I'm like, oh wow,
okay, but you all work in tech. But then I'm like, no, but we don't work with AI. And I was quite taken
aback, because I've met a lot of people who have nothing to do with tech and have been using ChatGPT too.
Like, I go to a parlor and the lady, she has a book club with her friends, and then she uses ChatGPT to get the summary of the books.
So she doesn't have to read the entire book, but then she can still go with her friends and sound smart and everything. So it's kind of very interesting to
see the ones you think would be catching up with all the tech developments or not. And then it's
like ChatGPT, regardless of one's background, is kind of impacting everyone.
Yeah, ChatGPT is like the modern Cliff Notes version.
Right.
It gives everyone an opportunity to sound smart. Well, Vino, I want to thank you so much for being here.
Hopefully we'll see each other in person again,
I'm sure at some sort of event coming up.
But this was very enjoyable.
We'll have to have you back down the road.
Yep.
Thank you so much for having me.
And yeah, I will hopefully run into you at a meetup or something in SF soon.
All right.
Thanks.
Cheers.