Programming Throwdown - 161: Leveraging Generative AI Models with Hagay Lupesko
Episode Date: July 10, 2023

MosaicML's VP of Engineering, Hagay Lupesko, joins us today to discuss generative AI! We talk about how to use existing models as well as ways to finetune these models to a particular task or domain.

00:01:28 Introductions
00:02:09 Hagay's circuitous career journey
00:08:25 Building software for large factories
00:17:30 The reality of new technologies
00:28:10 AWS
00:29:33 Pytorch's leapfrog advantage
00:37:24 MosaicML's mission
00:39:29 Generative AI
00:44:39 Giant data models
00:57:00 Data access tips
01:10:31 MPT-7B
01:27:01 Careers in Mosaic
01:31:46 Farewells

Resources mentioned in this episode:
Join the Programming Throwdown Patreon community today: https://www.patreon.com/programmingthrowdown?ty=h
Subscribe to the podcast on Youtube: https://www.youtube.com/@programmingthrowdown4793

Links:
Hagay Lupesko:
Linkedin: https://www.linkedin.com/in/hagaylupesko/
Twitter: https://twitter.com/hagay_lupesko
Github: https://github.com/lupesko
MosaicML:
Website: https://www.mosaicml.com/
Careers: https://www.mosaicml.com/careers
Twitter: https://twitter.com/MosaicML
Linkedin: https://www.linkedin.com/company/mosaicml/
Others:
Amp It Up (Amazon): https://www.amazon.com/Amp-Unlocking-Hypergrowth-Expectations-Intensity/dp/1119836115
Hugging Face Hub: https://huggingface.co/

If you've enjoyed this episode, you can listen to more on Programming Throwdown's website: https://www.programmingthrowdown.com/
Reach out to us via email: programmingthrowdown@gmail.com
You can also follow Programming Throwdown on Facebook | Apple Podcasts | Spotify | Player.FM | Youtube
Join the discussion on our Discord
Help support Programming Throwdown through our Patreon
★ Support this podcast on Patreon ★
Transcript
Hey everybody.
So we have seen so many AI hype cycles around so many different areas, right?
We saw self-driving cars become a big deal around 2009, if people remember that.
And of course Ray Kurzweil has been talking about the singularity forever.
Oh, there was even beyond AI, there was Bitcoin and Web3 and all of that.
And Patrick and I, we've had folks on, but I've personally kept the latest shiny object syndrome at a bit of an arm's length. But I think generative AI is amazing. I'll just
put it out there. I don't think it's singularity AGI type stuff, but I do think that there's a tremendous opportunity
to create value with generative AI.
I've been really excited about it.
I've been diving deep into the literature and also applications.
I know a lot of other folks have too.
It's a really exciting area.
It's an area that I'm pretty excited about as well.
And I'm super excited to have Hagay Lupesko on the show.
He's the VP of Engineering at MosaicML.
So thanks for coming on the show, Hagay.
Hey, Jason.
Hey, Patrick.
Thanks for having me.
Cool.
So we'll definitely dive into generative AI and how folks can use it at home or at their business.
But let's start off by talking a little bit about you.
What's your background? What was the path that you took that brought you to Mosaic?
Yeah. So I'm currently the VP of Engineering at Mosaic ML.
And I guess we'll probably touch on Mosaic ML a bit later on.
But I really started my career a while back now.
If there were video, you could see all my gray hair.
So I started my career as an engineer, you know, back in Israel where I was born and raised.
And really earlier in my career, I did a bunch of things around computer vision, medical imaging, vision for factory automation.
I even spent a couple of years living in China, working on a startup there.
Wow. So wait, let's dive into that a little bit. So you were in Israel and then the US and then
China or straight from Israel to China? No, yeah. So straight from Israel to China.
And so what was that like?
That was, it must be a huge culture shock.
It was definitely initially a shock
and then really a fantastic experience
because, you know, as we all know,
China is even today, actually,
you know, kind of growing rapidly.
Back then it was really superb,
you know, moving super, super quickly.
So just the story is, you know, I was a young engineer back then,
had some experience, expertise in computer vision.
And, you know, this was actually, so just to put things on kind of on the timeline,
this was pre the deep learning revolution.
So I'm talking about 2007, 2008. You know, neural networks were not working well, so computer vision was actually completely different, like the way you apply computer vision to a problem.
Yeah, just to put some context, I think there were a bunch of hand-coded things, right? I remember, Patrick probably knows this way better than I do, but there was a whole bunch of filters, right?
Like Sobel filter and these like directional filters.
And you would basically try to build your own deep learning system by just stacking all of these filters as an expert.
And then at the end, you would have some shallow model that, you know, is stacked on top of all these other things.
Exactly. That was exactly the way you'd apply, you know, define different filters,
you would hand tune them. I mean, today from, you know, computer vision neural networks, the
convolution kernels are kind of, you know, figured out during the training process.
Back then, we would use convolution quite a bit, and you would hand tune the convolution
to work for your problem.
That was actually a lot of fun.
It was a really interesting process.
Of course, it made kind of the solutions
not super scalable where for different customers,
different problems,
you'd have to sit down and tweak things.
You know, field engineers,
that's a lot of what they would do.
They would sit down with these systems
and tweak the parameters, including the convolution kernels, by hand.
Oh wow, that's wild.
Yeah, because, you know, the convolution kernel doesn't know anything about your objective. It's trying to find edges, but that's not your objective. Your objective is to say, is there a face in this picture? And edges just happen to be kind of tangentially interesting to that objective. So then it's like, can you come up with a filter? It's even more interesting. And then, yeah, to your point, deep learning now just does everything for us, which is pretty wild.
Yeah, and it's even more than that. Typically, in a computer vision pipeline, you know, you'd start by taking the input image and then preparing it to be kind of ready for the convolution operator. So you'd have to do different tricks. It was like
a whole toolbox of tricks you do to like clean up the image, you know, normalize it manually,
and then start scrubbing the image with different
morphological operators.
Yeah, so it was quite a ride.
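To give listeners a concrete feel for the hand-tuned pipelines being described, here is a minimal Python sketch, assuming NumPy and SciPy are available; the kernel, threshold, and normalization are illustrative placeholders, not anything from the actual product discussed.

```python
import numpy as np
from scipy.signal import convolve2d

def classic_edge_pipeline(image: np.ndarray) -> np.ndarray:
    """Toy version of a pre-deep-learning vision pipeline:
    normalize, convolve with a hand-chosen kernel, threshold."""
    # Manual normalization to [0, 1] (the "clean up the image" step).
    img = image.astype(np.float32)
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)

    # Hand-tuned directional filter: a classic Sobel kernel for vertical edges.
    sobel_x = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]], dtype=np.float32)
    edges = convolve2d(img, sobel_x, mode="same", boundary="symm")

    # A field engineer would tweak this threshold per customer and per problem.
    return (np.abs(edges) > 0.5).astype(np.uint8)

if __name__ == "__main__":
    fake_image = np.random.rand(64, 64)  # stand-in for a camera frame
    mask = classic_edge_pipeline(fake_image)
    print(mask.shape, mask.sum())
```

In a modern deep learning system, the kernel values above would be learned from data instead of being typed in by hand.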
But going back to kind of the experience in China, so, you know, I was just married back
then.
I asked my wife, hey, do you want to go on an adventure in China?
And she initially said no.
And then I was able to convince her that it's going to be kind of an experience of a lifetime.
And we just hopped on the plane.
It was a small company, startup,
like maybe five people.
They brought me in as sort of the computer vision expert,
although definitely, you know, I wasn't that much of an expert, but I said, what the heck?
And we built a whole system, including hardware.
Of course, the differentiator of the system was the software, but it was hardware too: robotics, you know, from conveyor belts controlled by different actuators, through the imaging system, lighting, and cameras, through integration with the automation in a typical factory. And that product was for the PCB industry, the printed circuit board industry.
Oh, cool.
Yeah, you know, we built a product from scratch.
We're able to, you know, sell it to a few companies.
I spent a lot of time on factory floors in China, which by itself is an experience.
Oh, I bet.
I heard they're massive.
They're massive.
It's like a city.
Now, you know what's funny?
I mean, I'm from Israel.
So when I was growing up as a kid, Israel was about like 6 million people.
And by the way, just for context, I think Israel is often in the news, but many people don't realize that it's tiny,
both in terms of population and geographical size. It's actually smaller than the state of New Jersey in terms of the size.
Oh, I never knew that. Yeah. So, I mean, you know, here I am coming from Israel, you know, six million people country,
moving to China, going to a suburb of Shanghai.
And that suburb was, you know, a small suburb, six million people.
So, yeah, just the size of China is massive.
And yeah, so, you know, it was fun.
Are there tours?
Like, let's say, I've never been to China, I'll be honest.
I would love to go.
I just never had the opportunity.
But if I went, could I tour a factory?
Like, is that a thing that tourists do or not really?
I think it's a great idea for a startup, Jason.
But no, I don't think it's an option.
But these factories are really interesting because they're like little cities, literally little cities.
One of our first customers was actually a factory owned by a Taiwanese company and wasn't considered a very big factory, but it had 50,000 workers.
Wow.
Oh, my goodness.
So, you know, five times 10 to the power of four.
And what you realize after you go there is, first of all, most of the workers are fairly young, meaning, you know, 18-ish.
And they actually live in the factory.
Oh.
And they have dorms there, they have everything they need, like, you know, food, social activities, places to work out. It's literally almost like a student dorm, only you work, you don't study. So I found that really interesting.
Yeah, one of the things that blew my mind was, this is a long time ago, but they interviewed Tim Cook and they were talking about manufacturing, what it's like, because I think this was around the time they were making the Mac Pro in America. And they were talking about the difference there, and what he said was, in China, if you need a million people, literally, like you need 1 million people to show up to, you know, boost iPhone production, you can get a million people. And when he said that, he wasn't being hyperbolic. I mean, he literally meant it. That really hit home on the kind of scale we're talking about.
Yeah. And China, by the way, is not done with that process. The majority of the Chinese population is still in villages, looking to go to the city where they can find work, get proper wages, and start their lives. And this is part of what I think many people don't understand about China and the Chinese government: they are under immense pressure to sustain growth so that their masses actually kind of have a path to a better life.
And that's part of why they're so aggressive on growth.
They just have to grow very quickly to kind of, you know, serve that need of their population.
That makes sense.
So what happened to the startup?
Did the startup grow very quickly or no?
So it started well.
And then, you know, I don't know how much,
how many of the listeners know,
but 2008 slash nine,
there was a pretty massive financial crisis.
And we were hit,
the startup was hit very significantly by that crisis.
It started as the mortgage crisis in the US and then very quickly kind of expanded globally.
As often is the case, right, when there is a crisis,
people start cutting back on their purchases
and then the PCB industry was hit significantly.
So did the chip industry,
just because the demand for devices went significantly down.
So the startup didn't shut down, but it had definitely been on a good trajectory, and then, you know, most of the orders were cut back, budgets were cut back.
Yeah, so, but we were able to still work through things.
However, at some point I had some family issues.
I had to go back to Israel.
That was about, you know, two years later. So I went back to Israel and, you know,
for a while I was flying to China every month, but it's really unsustainable, especially when
you have a young family that needs, you know, needs you to be there for them. And I had my
first son who was born. So at some point I just parted
ways with the startup.
Yep, that makes sense. So at that point you were back in Israel, and at some point you were in the US. What happened there?
Yeah, so I went back to Israel, and then I went back to working in an area that I had some experience in before, medical imaging. So, yeah, I actually went to work for GE Healthcare in Israel
and we built a cardiovascular imaging system,
which was, you know, really a lot of fun.
And I think for those that have worked in healthcare, you know,
there are definitely some downsides.
Like it's a very slow moving field because of a lot of regulation.
And in general, the customer base is very conservative.
But then you really feel, on the plus side, you really feel that you're changing the world for the better.
Because if you can develop systems that give better care, help detect diseases earlier, help treat diseases, it's really something that, you know, you feel really good about doing, right?
So I did that for a while.
And then, you know, Amazon reached out
and they didn't have an R&D office in Israel back then.
I mean, now they have tons of R&D offices in Israel,
but back then they did not.
And they interviewed me
and then asked to relocate me to the US.
Ah, so you went to Seattle.
No, so they wanted to relocate me to Seattle.
But again, I mentioned my wife earlier
and how she had to approve moving to China.
So again, my wife is the decision maker on these things.
And after thinking about that together,
Seattle was not the right place for us
in terms of just the weather, family.
So we moved to the Bay Area.
Ah, okay, got it.
So I know Amazon has this Lab 126, that's where the Kindle came out of and some of these things. Is that where you went, or was there a separate office?
Um, no. So the opportunity Amazon offered me back then was actually to join Amazon Music in SF. Amazon Music back then was a relatively small team. It was a very basic product back then.
They were kind of following the iTunes model.
And again, for folks that are a bit older like me,
you'd know that digital music actually started by selling songs.
So you would buy a song, you would buy an album, you would pay for it,
you would have the rights to a digital copy and you know you can deploy it on your uh you know whatever players you had audio streaming was
not hardly a thing back then i mean now it's how all of us consume you know music and more than
music but back then the technology was not there the business terms were not there so it was a very
different world but i joined amazon
music at a really amazing time where streaming just started picking up so i actually helped ship
amazon music starting in the us and then we expanded globally and that was really super
cool experience because it was part of you know we're participating in that revolution of kind of taking music from being
you know download digital content to streaming digital content and that was that was you know
a huge revolution for the the entire industry yeah i feel like uh this is you know obviously
out of out of my depth but i do feel like just thinking about it economically, it better aligns the
incentives, right? Because I remember, I definitely remember, you know, I was huge into, you know,
bands and going to concerts, you know, in high school and even in middle school. I think there
was one year in high school where I went to something like a hundred shows in one year,
and I still had all the tickets and everything. I mean, not big bands, because that would break the bank, but a bunch of local shows and everything.
And there were times where I saw an album and the cover.
This is back when we were buying CDs.
The cover looked awesome.
And I'd never heard the band before,
but the artwork looked really great.
So I bought it and then the songs were terrible.
And so it's like, okay, well, I lost 10 bucks. And now, because it's streaming, the songs that you enjoy, that you listen to again and again, that time is logged and then that credit is assigned to the appropriate musician. And so now, I mean, the sad part of it is no one cares about album art anymore, but the good part of it is that people are just laser focused on the music and the message.
Yeah, yeah.
I think it definitely really revolutionized the entire music industry.
It also increased the pie.
And that's, I think, it's a good lesson, by the way, because I think whenever there's
a new technology that comes by,
there's always the kind of the pushback, right?
Especially when it's a fundamental technology that changes how people,
you know, interact with content, for example, or interact with technology.
There's always a pushback because, you know, very naturally people are concerned.
And we'll get to AI, I guess, later.
But I think we all can see similar patterns.
But in reality, first of all, these technology changes are something, you know, usually you cannot block.
You can, you know, slow them down a little bit if you really try hard, but you can't block. But second of all, they're usually opening up really new opportunities,
business opportunities, consumption opportunities,
education opportunities, and whatnot.
They typically tend to be for the best or at least have a path
that is for the best.
And in the music industry, yeah, there was a lot of pushback
from the big labels, the companies that control the rights to most of the content, at least on the Western world.
But eventually they kind of, you know, went along with it.
And now if you look at the revenue of the music industry from streaming, obviously it's much bigger than, you know, CD sales.
But also, if you look at the entire streaming revenue versus what CD sales were at their peak, streaming is now a bigger business. And it's not surprising, right? We all now have phones in our pockets, which are also audio streaming devices, and the reach of content is much broader today.
Yeah, that's right. And actually, kind of a little foreshadowing here,
but one of the most popular trending folks on Spotify
was AI Drake,
which is an AI version of Drake.
And I was, I listened to some of the tracks
and I was blown away.
I think they eventually got banned from Spotify
because they were using drake's face
as their face, and you can't do that. But they were, I want to say, in the top 10 of trending for Spotify, which got my attention. And I listened to it, and it sounded amazing. I actually was really shocked, even with everything we've seen so far with generative AI, I was really shocked at the quality of it. Come to find out that actually a person wrote the lyrics. So I kind of, you know, I thought that it was all-the-way AI, where someone just pressed a button, but no, a person did write the lyrics. But the text-to-speech, you know, and getting the music and getting it all to match the rhythm and everything, it was just flawless. I mean, if you haven't heard it yet, I don't think it's on Spotify anymore.
I'm not sure what happened there.
But you can definitely go on YouTube and look up AI Drake and listen to these songs.
It's pretty wild.
Amazing.
Yeah.
Anyway, so going back to kind of my story.
So I spent some time in Amazon Music.
It was a lot of fun, but it was also very new to me. I mean, it wasn't about computer vision, obviously. It was also about machine learning, which, anyway, I hadn't done a lot of things on outside of my graduate studies and what applies to computer vision. I was focusing over there more on algorithms for audio streaming, web applications, you know, scaling this from, you know,
millions of customers in the US to tens of millions
and later even more globally.
And also, I think for me, I had just relocated from Israel to the Bay Area. The culture was very different. The way technology is developed, like the culture within the companies, was different. And I was, you know, to a large degree, really adapting to that.
Why don't you double click on that? Like what is, you know, because Patrick and I have basically
been in the US our whole lives. I lived in Italy for two months, other than that I've lived in the
US or Canada my whole life. And so what really struck you about like,
maybe, you know, culture and then corporate culture over here?
Yeah, so, wow, I don't even know where to start, because the changes are, you know,
the differences are pretty significant. So, you know, I started by, you know, in Israel,
Israel's culture is, you know, very casual and also very direct.
For better or worse, you know, people would often not beat around the bush when they have something to say.
You know, they'll tell it to you in your face, even if it may be a bit offending.
And in Israel, it's not considered offending.
It's just, you know, people tell it for what it is.
I think in the US, you know, people tend to be, I don't know if respectful is the word, they just kind of have more tact around saying things.
So when they have something, you know, difficult to say or have
some significant feedback, you know, they would share it in a way that is very processed. So for
me initially, I had to, you know, really adjust, right, my noise cancellation. So, you know, to
really learn that, you know, if people say something, even if they say it in a really,
you know, nice way,
I have to read a little bit more into it just because I'm used to, hey, if someone has something important to say
and if it's critical of something that is going on or something that I have done,
it would come from my experience in Israel and the way I grew up, it would come very directly.
In the US, I had to learn to understand the nuances a little bit better.
That definitely resonates with me. You know, like Patrick and I grew up on the East Coast,
or I guess maybe you'd call it the South, I mean, Southeast. But, you know, in moving to California,
you know, I think the way I kind of expressed it, I didn't really tell people this because it also
lacks tact. But just to like explain it, I kind of felt like the people around me were passive aggressive
and I was actively aggressive. But yeah, I felt like similar to what you were saying. I would
just say stuff, and then with other people that I would meet, especially in the corporate world, I would realize a day later, oh, this person was actually, you know, really happy with this or really upset with that.
It's like there's intentionally, you know, a bit of noise in the signal to try to, yeah,
I don't know what it is.
Maybe it's like there's always plausible deniability of everything.
You know, it's just, it's like a politeness thing.
But yeah, even though I've grown up here the whole time, I had to go through the same experience.
Yeah, and I think to your point, you know, the US is a very big place, so I guess my experience has been based mostly on the culture in California, and other areas in the US, like you said, are probably, you know, somewhat different.
But, yeah, you know, this is one of the differences.
I think the other thing, which I actually think that is kind of aligned, actually,
just done a bit differently, you know, Israel versus California is, you know,
taking initiative and thinking out of the box. You know, Israel is a small country, and Israel kind of developed in an area with a lot of security problems, so Israelis tend to be very creative, out-of-the-box thinkers, and also, you know, don't have too much respect for the way things are done, right?
It's like you always think about ways you can do something better.
And, you know, not surprisingly, per capita, Israel is the number one country in terms of startup, right?
Founding startups.
A lot of it comes from that Israeli mindset and culture.
I do think this is, you know, California is actually kind of very similar,
maybe for different reasons. But in California, also kind of independent thinking, thinking out
of the box, taking initiative, not conforming with status quo is, I feel is kind of encouraged
and even maybe, you know, something that is highlighted.
Yeah, that makes sense. Yeah, it's interesting. I do feel like there's a real independent spirit. You know, if you visit places, like, I've never been to Israel, but if you visit India, for example, it felt like a libertarian paradise because there's so many small companies. If a policeman arrests you, you just give him money one-to-one.
You know, you don't have to go to court.
And so it just kind of felt like,
yeah, like if the libertarian folks,
if you kind of like take it to the limit,
that's what you would get.
I do feel like in the US,
there is just, and healthcare particularly,
there's just so much structure.
And there's pros and cons to that, but it is different for sure.
You were at Amazon and then at some point you got into...
So that kind of was sort of like an intro to AI, your sort of introduction to recommender systems and some of these kind of large-scale AI and kind of augmenting what you did with computer vision.
And then what's the path from that to being kind of like all in on AI?
What happened next?
Yeah, so I spent a few years working in Amazon Music and then kind of decided, hey, I need a change.
And then I moved within
Amazon to AWS, Amazon Web Services. And back then, AWS was already kind of a rapidly growing
business that was already fairly large. So today, AWS is a business in Amazon that generates about $85 billion in revenue every year, which is just massive, right?
Like if it were its own company, it would be one of the five biggest software companies out there in terms of software revenue.
But back then, it was not that big, but still fairly big.
But their machine learning offering was very limited back then.
And then they doubled down on it.
And I thought that was a really interesting area to be part of.
And, you know, I was fortunate enough that they accepted to take me in.
During my master's degree at Tel Aviv University, I studied a little bit about, you know, machine learning, among other things. Just like for many other folks who did CS back then, machine learning was not what it is today in terms of dominance.
Back then, we were building AWS SageMaker.
That's today a very successful
machine learning platform offered by AWS. It's a big business.
From what I hear, it's the fastest-growing service in AWS's history.
So I joined that team and then contributed to SageMaker,
worked on deep learning frameworks.
Back then, AWS was trying to double down on a framework called MXNet, which is kind of similar to TensorFlow or PyTorch, only it wasn't as successful as either of those frameworks.
It's tough.
I mean, I'm amazed PyTorch was able to take the lead.
Yeah.
And I had, yeah, I think that was really interesting.
I definitely took a lot of lessons from that because, you know, I was on the team that
lost.
I was on the MXNet team.
And I think, you know, you learn a lot from things
that don't work according to plan. Typically, you actually learn more from the things that don't
work according to plan or fail than from your success. Because success, you tend to attribute
it to yourself and, you know, yourself and your team and that's it. But failure is you're kind
of forced to think harder, right, about why things didn't work
out.
Yeah, I would love your take on this, because I don't know how that all played out.
It's a little bit, I'm definitely a user of TensorFlow and PyTorch.
But sort of how did PyTorch kind of take the lead and leapfrog over everybody?
You know, and I guess, what were maybe some of the mistakes MXNet made, or some of the gaps that PyTorch was able to fill, that allowed them to do that?
Yeah, I think some of the things I observed, and again, I think there are definitely many more angles to it, but the first thing, I think, is that usability is the number one thing, right?
And I think, especially for us as engineers, we tend to sometimes underrate usability thinking,
oh no, usability, you know, it's similar, right?
Like people can achieve the same goal in different ways.
One way may be more complex than the other, but it's fine.
Performance matters more.
That's like a very common pitfall. And I think definitely, on the MXNet side, we fell into that pitfall where we optimized for performance rather than optimizing for usability.
So I think that is one key learning. And I think every tool developer, platform developer,
framework developer out there, I recommend always put
usability as the most important thing. Performance, you can catch up later on. And actually for,
you know, for people to get started, they usually don't look too much into the performance. They look more at the usability: how easy it is to onboard, how easy it is to learn, how easy it is to extend, how easy it is to apply it to core problems.
Because, you know, at the end of the day,
usability is what allows, first of all,
people to move quickly solving a problem.
And your tool exists to solve a problem.
It doesn't exist for the sake of, you know, existing as a tool.
And moving quickly actually saves tons of time and money.
So I definitely say usability over performance.
That's one key learning to keep in mind.
Yeah, I think I actually, you know,
if we follow that trail, follow that breadcrumb trail,
and one of the things that Facebook did really well
was having a lot of different roles inside the company.
You know, it wasn't just everyone was software engineer.
And I think that that, although it seems esoteric, if you think about it, that really plays into
this where if everyone's a software engineer and a software engineer is building things
for other software engineers, then of course, why can't you use this really convoluted API?
I did, and I'm a software engineer, right? But if you have, you know,
research scientists,
machine learning engineer,
and then embedded engineer
and software engineer,
then, you know,
it's more clear that like
the machine learning engineer,
the research scientist
is the customer
of something like PyTorch.
And, even though they have engineer in the title, you can't really expect them
to figure out some weird C++ error.
And so I think setting up that distinction early on
kind of caused all these sort of downstream effects.
Because I think if Amazon had treated
the folks using MXNet as true customers
instead of engineers,
Amazon is amazing at customer satisfaction, right?
So it's almost like maybe that's where the issue kind of started, right?
Yeah, yeah.
That was definitely part of it.
I think the other thing which you also alluded to is the importance of building
a community. And actually building a community is definitely not trivial, requires deep thought.
I'd say at the equivalent level to thinking through software design, for example, you want
to think about how do you design your community? How do you design it so people, wherever they are,
they want to use the tool, they're well-supported,
they have resources, they have people to follow.
That also requires a lot of deep thought.
When I look at the PyTorch, I think they definitely,
I'm not sure if they did that from the beginning,
but at some point they started investing a lot in that and I think they did it fantastically well.
I think there is a real PyTorch community
and I actually consider myself now part of the PyTorch community. You know, I spoke at their last, you know,
PyTorch developer conference. I met with a lot of people at MosaicML. You know, we are part of
that community. And that community is what helps PyTorch be successful and more importantly,
really be used by so many people in a really
productive and constructive way.
Yeah, that's a really good call out.
Yeah, I agree 100%.
I feel like I'm also part of that community.
I think they did an amazing job with SEO.
When you search Google for PyTorch issues, it'll take you to the PyTorch forum.
I don't even know if there's a PyTorch stack overflow.
I mean, I'm sure there is, but I don't know if there's a significant one. But they've done an amazing job of being the place where you go for issues and solutions. So you were at AWS working on their ML platform, and now you're at Mosaic, which is a startup, kind of full circle.
It's a startup here in the US, but a startup nonetheless.
And can you kind of describe that?
I mean, it feels a little scary.
We talked about Amazon, the enormity of the business.
AWS is one of the biggest businesses in the world.
And so how do you kind of take that leap to Mosaic knowing kind of what you're up against? Yeah, so I think we missed another
kind of station along the way, which is after AWS ML, I joined Facebook. Back then it was Facebook,
today it's Meta. And I joined the Meta's AI team. And then I did a bunch of things over there that were a lot of fun,
starting with the recommendation platform called Deeper back then,
and then expanding also into foundation model services there
for language understanding, image understanding, video understanding.
I still remember, you know, Hagay and I worked together, and I still remember when Hagay first joined.
And I remember thinking, wow, this person has long hair.
The guy had this really long hair.
But total genius, a ton of respect.
You've helped me a ton along the way.
So I really do appreciate it.
It's been a pleasure.
It was a pleasure working with you.
See, it was awesome.
We did a bunch of cool AI stuff together.
And I think you left after I did, I think, or maybe it was before.
It was around the same time, though.
A little bit after you did.
Yeah.
Okay.
Yeah.
I really wanted to, after so many years in big tech companies, and there's definitely nothing wrong with big tech companies, I think you learn a lot, you do a lot, your impact propagates through these immense customer bases that these companies have. But in a smaller company, you feel like your impact is more direct, and you have more bandwidth and time to do a zero-to-one thing, right? Like build something from scratch, solve a core problem, where it's easier for you to have full ownership or work with others, right? So I was really itching for that.
And then I, you know, so the opportunity with Mosaic ML,
really loved the team, the folks there,
really kind of felt good about the business problem.
And I can tell you about that.
And then just kind of, you know,
decided to make the leap and join Mosaic ML.
Got it. Cool.
And that, is that the first time you started really diving into generative AI
or did you do some of that at Facebook and Amazon?
No, it was more at MosaicML.
I think even when I joined MosaicML,
I don't even know, like the term generative AI
was probably used back then, but not as often as it is used today.
Right. Yep.
So yeah, at MosaicML, the mission of MosaicML was really to, back then when
I joined, it was, let's make machine learning more efficient.
The reason for making it more efficient is that, you know, anybody can see the pace at
which the complexity of training deep learning models is increasing.
And I think, by the way, that trend is actually toning down now, we can get to it in a minute. But, you know, if you look at even the last four years, going from BERT, I think it was 2018, to GPT-3 with 175 billion parameters a couple of years later, there has been a growth in the number of model parameters of an order of magnitude every year, which is just insane, right? And obviously, it requires much more compute, and, you know, with the transformer architecture, because of the attention in the transformer blocks, the compute also grows quadratically with the sequence length. So, you know, that growth just kind of limited the number of companies and organizations out there that can actually leverage these advanced models
just because it became much more expensive to train these models and, of course, also to deploy them.
So Mosaic initially tried to just make this more efficient so it's more accessible. As we built our product, which is the MosaicML platform, a platform for training and deploying these models, I kind of realized that the problem space is more than just efficiency. I would even say efficiency is a feature, but then there's a lot of other things that make these models less accessible. You know, it's the complexity of setting up the infrastructure, it's the complexity of getting started with some baseline model.
Again, going back to ease of use, right?
How can this be made as easy as possible
so as many companies as possible can leverage this technology?
And this is our focus now at Mosaic.
It's just making state-of-the-art AI with a focus on generative AI
accessible to any company out there,
you know, not just kind of the usual suspects of the big technology companies or big labs like OpenAI or Google Research.
Yeah, that makes sense.
So, you know, I think generative AI might be at that point where, you know,
an average person has heard the word but has no idea what it is, like, it's not defined
for them. And so it's, it just occupies this sort of space, this soup of different things that they
have seen and read about. And so this is a great time for us to really define it. Like, what is
what is generative AI? And, you know, what's kind of the brief history there? Yeah, so I would say generative
AI refers to, you know, AI technology and more specifically deep learning models that do a really
incredible job generating media such as text or images or videos or audio through very simple prompts.
And I think typically what we see today in something like, you know,
a model like ChatGPT is, you know, you put in text,
phrasing a request or a question,
and the model does a really incredible job following through on your request.
And then, of course, there's also another poster child in Stable Diffusion, which is a text-to-image or text-plus-image-to-image model that just takes a simple natural language text prompt with a request to generate some visual and does an incredible job of generating that visual. So that's what generative AI is at its core. And I think we're
just seeing the beginning of it, meaning these models will be much better at following through
on your requests. Plus, they'll be able to generate very impressive additional, you know,
mediums, right? And I think video is one such example where we're still early on in video
generation, and we'll see much more
impressive things come along. But you can even expand this further, right? I recently did a
keynote at a conference in Boston. I spent a lot of time creating the slides. Like I had the idea
of what I want to talk about, but then a lot of time was spent creating the slides. I can
definitely see generative AI sooner than later actually generating the slides for me, doing
a pretty good job at it.
Yeah, that totally makes sense.
One of the things that always really inspired me, but I didn't know where it was going,
was unsupervised and self-supervised learning.
I thought, and this goes way, way back, I had this idea, and feel free, anyone in the audience, to steal this idea, I'd love for someone to actually build it. It was an idea where you would have sort of a zombie game. So, you know, it's very typical, you fight the zombies, there's an infestation, you need to get the medical supplies, whatever, but you'd start in your own house. The idea is I would plug into Google Maps or one of these map services, and I would somehow render the game across the entire planet, so whenever someone played the game, it would get their location from their phone. Actually, I guess Pokemon Go is kind of like this, isn't it?
But you would play in your own house.
The thing that I ran into was, you know, how could I figure out which buildings should have which supplies?
And so I thought, well, I could scrape Wikipedia and scrape the Internet. What I ultimately wanted was not for me, but for the computer, to do the work to figure out, oh, if I sneak into this hospital, which is a real hospital on Google Maps, I would find medical supplies there. And if I sneak into a car dealership, I wouldn't find medical supplies, I would find gasoline or something. Right. And so,
you know, rather than having some content, you know, human in the loop there, I wanted it all
to just get rendered, right. And then that kind of led to learning about these embeddings where,
where somebody has, you know, scanned all of the internet through Common Crawl or Wikipedia or these things. And they figured
out the similarity between words. So you could actually see what's the similarity between hospital
and medical kit. And that would be more similar than gas station and medical kit. And so you
would use that to sort of generate your game here. And I found that to be, I mean, I never finished the project,
but I found it to be just really inspiring
how I created like the entire planet
worth of supplies,
like in these buildings.
And it just was one of these
kind of really satisfying moments.
And so ever since then,
I've always been into that.
And there's just something magical about that.
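As a rough illustration of the embedding similarity idea Jason describes, here is a minimal sketch; the tiny hand-made vectors are placeholders, whereas in practice you would load pretrained embeddings (word2vec, GloVe, or similar) trained on Common Crawl or Wikipedia.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder 4-dimensional "embeddings"; real vectors would come from a
# pretrained model and have hundreds of dimensions.
vectors = {
    "hospital":    np.array([0.9, 0.1, 0.0, 0.2]),
    "medical kit": np.array([0.8, 0.2, 0.1, 0.1]),
    "gas station": np.array([0.1, 0.9, 0.3, 0.0]),
}

print(cosine_similarity(vectors["hospital"], vectors["medical kit"]))     # high
print(cosine_similarity(vectors["gas station"], vectors["medical kit"]))  # lower
```

A game like the one described could rank candidate supplies for each building by this similarity score instead of relying on a human to label every location.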
Maybe you could talk a little bit about, you know, how does that actually work?
So, you know, when someone's on Mosaic or on, you know, SageMaker, which you also worked on, how do they build these giant models?
Yeah.
So Mosaic ML offers a platform, right, which is a platform for training and then deploying these models.
I can start by maybe quickly describing, you know, when you are, you know, when someone wants to train such a model, let's say a large language model, LLM, what they need to do and then, you know, how a platform can actually help them achieve that.
So typically, it always starts with defining the task you're trying to solve, right? And, you know, there are definitely a lot of general purpose LLMs out there whose task, on the business level, is basically to be able to follow through on instructions, requests, questions, and do a good job responding to what a human is asking or prompting.
And then when you look at the machine learning task,
it's basically completion of the next word.
So when you get an input sequence of words, which is a human sentence,
complete the next word, and then complete the next word after that,
and the next word after that.
And when you do this a bunch of times, you get a coherent response.
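To make the next-word-completion loop concrete, here is a minimal sketch using the Hugging Face transformers library, with GPT-2 purely as a stand-in model; it greedily appends the most likely next token, which is the basic loop being described (production systems use smarter sampling and far larger models).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The factory floor in Shanghai was"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Greedy autoregressive loop: predict the next token, append it, repeat.
with torch.no_grad():
    for _ in range(20):
        logits = model(input_ids).logits        # shape: (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()        # most likely next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```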
Now, the first thing is, of course, you need to figure out your training data set, and I'm kind of breezing through some things here, of course they're pretty complex. Then there's the model architecture, which covers things like, you know, both the architecture of the neural network, and
we always use neural networks for these things today, as well as the scale of the model.
Because with the same architecture, you can scale it, meaning the number of parameters
across the different layers can be larger or smaller.
It has implications on the compute you'll need and the amount of data you'll need, and
we'll get to that in a
minute. Then you want to set up your training regime, meaning the hyperparameters for training,
as well as your evaluation. How are you going to evaluate your model versus the original task that
you had? The next step after that is going to be deploying the model once you have a good model,
and that's almost like a related, but almost separate, problem. Now, what's important to note about all of these things that I've said is that the scale of these models tends to be, you know, very large. And what I think the community, the industry, has found is that when you scale these transformer-based architectures up, you get what's called emergent behavior, meaning it's like a step function change where the model is suddenly able to handle new problems that, I mean, it wasn't explicitly
optimized for, and they just emerge with a bigger model size and more training data. One example for that is the ability to solve math problems.
I think both OpenAI in their GPT work and paper, and Google with their LaMDA paper, called
out some of these emergent behaviors, including solving math problems, but other things as
well.
Yeah, it's important to mention, to double click on the scale. You know, when I was getting interested in large language models, I thought, well, I have a pretty decent GPU. I mean, it's, I don't know, three or four years old, and it has on the order of one or two gigabytes of GPU memory. And I thought, oh, I could just download the data set and train a model myself. And the answer is, I'll save everyone in the audience some time, you can't do this. The data set is enormous, the models are enormous. Even if you want to fine tune the model, you still have to load it into memory, and I think they said you basically need a GPU that costs two thousand dollars if you want to do this yourself, which is out of my budget. So yeah, you kind of have to use a service. I think maybe some of the computer vision models were like this, but for me, at least, this is the first time where you just can't try this at home. You can do it yourself, but you can't do it in your own house.
Yeah, exactly. So, I mean,
just to give a few examples, I mean, you know, Meta published results of training a model called OPT-175B, which is a 175 billion parameter
model.
They trained, they haven't published the weights, but they did publish a log book and other
details.
It was trained for over a month on thousands of GPUs, and budget-wise, that kind of operation can be in the millions of dollars.
And I'm not even talking about preparation before,
deploying after, just the training.
Right. Is a parameter and a weight the same thing?
When people say there's 7 billion parameters,
is that 7 billion weights?
Yeah, usually that's how people refer to it.
So it's definitely immense.
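For a rough sense of why models at this scale do not fit on a hobbyist GPU, here is some back-of-the-envelope arithmetic, assuming 16-bit weights (2 bytes per parameter) and ignoring gradients, optimizer state, and activations, which make training far more demanding still.

```python
# Rough memory needed just to hold the weights in fp16 (2 bytes per parameter).
BYTES_PER_PARAM_FP16 = 2

for name, params in [("7B model", 7e9), ("OPT-175B", 175e9)]:
    gigabytes = params * BYTES_PER_PARAM_FP16 / 1e9
    print(f"{name}: ~{gigabytes:.0f} GB of GPU memory just for the weights")

# 7B model:  ~14 GB  -> already more than most consumer GPUs
# OPT-175B: ~350 GB  -> has to be sharded across many GPUs
```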
Although I think what we're learning as an industry is that, you know, there have been a few things. So I think, first of all, a model like OPT or even GPT-3 was actually under-trained for its size, meaning you can take a smaller model with fewer parameters, train it on more data, and it will actually
perform just as well or even better than a bigger model trained on less data.
Now, how do they know that?
How do they measure that?
Yeah, so there's a paper published by DeepMind. I think people tend to refer to it as the Chinchilla paper.
I just don't remember the exact name of the paper. I think our community has a really good sense of humor when choosing model names and paper names.
Right, it's all animals, right?
There's llama, alpaca, koala.
Yeah, so the Chinchilla paper basically talks about the scaling laws, meaning, for a very similar, transformer-based architecture at different model sizes, what's the amount of data, typically counted as the number of tokens, that is required to train it to its full capacity. Now, there is no fancy math kind of analysis there. Unfortunately, I think machine learning still feels somewhat more like alchemy than science. So what they did is just run a bunch of experiments: they took the same architecture, trained it on different data set sizes, measured the various evaluation metrics, and then came up with their analysis. And they were able to train, I think it was a 60 billion parameter model, on, I don't remember how many tokens, and it outperformed the evaluation metrics of GPT-3 with its 175 billion parameters.
So although the number of parameters was a third
of OpenAI's big model,
the smaller model actually outperformed the bigger one.
Oh, interesting.
Yeah.
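The Chinchilla result is often boiled down to a rule of thumb of roughly 20 training tokens per parameter for a compute-optimal model; here is a tiny sketch of that arithmetic. Treat the ratio as a heuristic rather than the paper's precise finding.

```python
# Chinchilla-style rule of thumb: ~20 tokens per parameter for compute-optimal training.
TOKENS_PER_PARAM = 20

for name, params in [("7B", 7e9), ("70B", 70e9), ("175B (GPT-3 size)", 175e9)]:
    tokens = params * TOKENS_PER_PARAM
    print(f"{name}: roughly {tokens / 1e12:.2f} trillion training tokens")

# By this heuristic, GPT-3 at 175B parameters trained on roughly 300B tokens
# was significantly under-trained for its size.
```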
So I think there are new kinds of things being discovered by the day. I can give the example of two of our customers at MosaicML. One is Replit. Replit is a very popular online IDE. I'm sure some of the listeners, at least, are familiar with it. And for those that aren't, definitely check it out.
Replit is a fantastic tool for software developers. They built their code assistant, right, called Ghostwriter. And,
you know, it does things that are pretty cool, making developers much more productive, like code completion, it can create functions from comments, it can explain your code for you, etc. So it's really a nice tool. The model behind it was trained on the MosaicML platform. It's a three billion parameter model, you know, quote unquote "only" three billion. It's funny, today three billion is considered a small model; just a couple of years ago, it was considered huge. But it's a 3 billion parameter model trained on, I think, about 500 billion tokens of just code,
open source code. Is there a common place where I could get a scrape of all of GitHub or something?
How do you get that many tokens of code?
Yeah, there are a lot of datasets. I think it's called The Stack, that's an open source dataset that you can access. Replit itself, obviously, because people are using them and they store a lot of code, has access to some of that, of course only when the writers of that code allowed Replit to use it. So, yeah.
So there's definitely a lot of kind of specific data sets.
Plus, you also tend to mix, right?
So usually, and that's where the alchemy part comes in, right?
Usually you want to mix your training data set
so it's a bit balanced.
So, you know, you want to mix in a little bit of natural language from, you know, Wikipedia or other websites, because even code comments, for example, are written in plain English, they're not written in C++ or whatever other programming language. So people tend to do mixing. By the way, Replit published a fantastic blog post talking about how they built that model.
And there's a lot of details there, including both the modeling side,
data set management side, as well as the infrastructure.
But just going back to kind of the TLDR, so it was a relatively small model,
specialized for being a code assistant. And it actually outperformed the, I think,
two or three times larger OpenAI Codex.
And that's the model, OpenAI Codex is the model
behind GitHub Copilot.
So I think what's interesting there is, first of all,
you know, a relatively small company, I mean, Replit is a startup, it's a big startup, but still a startup, was able to train a model smaller than another model while outperforming it on quite a few evaluation metrics. And they were able to do it with actually quite a small team.
Yeah, that's amazing.
I think you touched on so many different things there. So one thing is, you know, folks should definitely get familiar with doing things on the cloud. And we've talked
about this for many shows, we've had folks, we literally just had a show on Kubernetes
a few episodes ago. And so you'll definitely have a lot of tools at your disposal, which will abstract away layer upon layer of this. But it's good to get familiar with running things on the cloud, because storing a 500 billion token data set on your desktop is probably out of the question. And definitely the models, the capital cost would just get out of control. And for a lot of students, you know, if you're in college or high school, often there's a whole bunch of different Amazon credits that you can get, and all sorts of services there.
Yeah, totally.
Yeah, and then it sounds like the process, if you say, you know, I'm a musician and I want to train a model on, actually, music we talked about, here's an even different one:
I'm really into theater and I want all of the English plays around Shakespeare's era
all ingested into some large language model. Step one is to find that data set.
And so it sounds like what I usually do,
and Hagay, I'd love to get your advice on this too,
but I usually just type what I want
and then add the word data set at the end into Google
and try to see if someone's already done this.
Do you have any tips for getting access to data?
I think a great place to start would be Hugging Face Hub.
They have dataset repositories there, and a lot of them, actually.
Actually, the problem is choosing the right one out of so many available.
But Hugging Face Hub is a great place to start.
Similarly, by the way, for starting with the model architecture.
So the nice thing is, at this point,
you don't have to do anything from scratch.
There are data sets available,
there are models available,
and then there's a lot of training recipes available.
And the best way to get started
is just to start with something that is working
and then hacking it, right,
to fit your specific needs, etc.
You know, tweaking the dataset mix, tweaking the task you're training your model for.
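For example, pulling a dataset off the Hugging Face Hub typically looks roughly like this; wikitext is used here only as a well-known stand-in, and for the Shakespeare-era idea you would search the Hub for a suitable drama corpus instead.

```python
from datasets import load_dataset

# Browse https://huggingface.co/datasets to find a corpus that fits your task
# (e.g. search for "shakespeare" or "drama"); wikitext is just a familiar example
# of the loading pattern.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

print(dataset)                    # number of rows and column names
print(dataset[0]["text"][:200])   # peek at the first record
```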
That actually kind of brings me to a point where, you know, even taking a step back,
you know, how can people leverage generative AI or even being more specific, you know,
large language models, for example.
We've been talking about kind of training your own model a little bit now, and I think
it's definitely, it's gotten much easier today, but it's still, you know, even when I look
at, you know, Replit, which we just discussed, right, training that model, Replit's model
took about 500 GPUs running for about 10 days.
Oh, wow.
Yeah. So, you know, for those of us that are familiar with efforts at Google and Facebook, it sounds, you know, like
something relatively small and fast, and definitely it is compared to the bigger things that have been happening. But then if you approach it from the perspective of, you know, maybe a much smaller company
or even just someone who just wants to do a cool project, a student or just someone
doing a cool project on the side, that's definitely still big and requires a lot of, right, monetary
investment.
Yeah.
But there are other ways to actually get started with LLMs that are much faster and cheaper.
And we can maybe talk about those as well.
Yeah, I think we'll definitely dive into that. Going back to something you said earlier, I do feel like it's very alchemic at the moment. And I think the reason for that, if you think about what actually, I want to say standardized, or what took us to the next level from chemical alchemy, was just the reproducibility
and the affordability of experiments. So people could run just thousands of experiments,
do them in parallel in very sanitized environments. If we went to that factory in China, we'd have to wear those suits where you can't get any dust anywhere. And so everything has been extremely sanitized and, as a result, just very reproducible. And so that's ultimately what turned alchemy from
alchemy into chemistry. And so here, you're totally right. It's not only that it took those
500 machines for 10 days, but it's that it's probably their 20th or 30th model. So it's their 20th time dropping, you know, $200,000 to train this model.
And they're constantly altering the data and mixing with it and hyperparameter tuning and all of that.
So totally agree.
I think, you know, very, very hard.
You know, it's a big investment to train one of these from scratch.
And so that's, I guess, where fine tuning and other things come in.
And what have you seen kind of on that front?
Yeah.
So, you know, I think there are two other alternatives, approaches people are taking,
that have a lower barrier of entry, you know, or are easier to get started with.
One is just using a model behind an API.
And OpenAI is, I think,
one service that is very broadly used already today
where basically it's very easy to get started.
You just sign up with the service, get an API key, and then you have access to a really
powerful general purpose model, you know, and what's nice is to have access to that
capacity.
All you need to do is kind of just write an API call in, you know, whatever is your favorite
programming language, but it's fairly simple
and you don't need to know anything about machine learning.
But still, you have that power, that capacity.
That's one good way to get started,
especially to create prototypes, right?
Or to play around with the technology
and understand what it's capable of.
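For concreteness, here is roughly what "a model behind an API" looks like in code, a minimal sketch using the OpenAI Python client. The exact method names depend on the client version you install, and the model name is just an example.

```python
# Minimal sketch of calling a hosted model behind an API (OpenAI-style).
# Assumes the openai v1.x Python client and an OPENAI_API_KEY in your environment.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # example model name
    messages=[{"role": "user", "content": "Suggest three plot twists for a play set in Verona."}],
)
print(response.choices[0].message.content)
```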
Yeah, to that point,
there's a lot you can do with engineering the prompt.
I have a project that I'm working on with OpenAI and, you know, it was giving me answers that were not unreasonable, but didn't fit the product that I was trying to build.
And I kind of found that by, you know, playing around with the prompt.
One of the tricks I found is that if you know the beginning of the answer, so if you know, for example, it should start with 'Answer:',
or a person's name and a colon, or 'the answer is',
if you actually write that, it makes a huge, huge difference
because it massively narrows the scope.
So for example, I would ask a question,
and this is actually, I wasn't using OpenAI at this point,
I was using Llama,
which is Facebook's open source LLM. So I asked a question, and then it generated another
question, and then another question, and another question. I was like, no, I want you to answer
the question. And so I found that something as simple as putting, you know, 'Answer:' at the end of my question sentence told it
that it's expecting an answer.
And so to your point, even before you try anything with gradients and loss functions
and all of that, just playing around with something like OpenAI's model or any model
as a service can teach you about the problem you're trying to solve.
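A toy illustration of the trick described here: ending the prompt with the start of the answer, so a plain completion model continues it rather than rambling. The strings are made up for the example.

```python
# Two prompts for the same question. With a plain next-word-completion model,
# the second one strongly constrains the continuation to be an answer instead
# of, say, another question.
prompt_plain = "What year was Hamlet first performed?"
prompt_hinted = "Question: What year was Hamlet first performed?\nAnswer:"

# You would feed either string to your completion endpoint or local model;
# only the prompt text changes, not the model or any training.
for p in (prompt_plain, prompt_hinted):
    print(repr(p))
```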
Exactly. Yeah. And I think the prompt engineering is definitely another kind of field of alchemy,
if you like, but today it does have a really massive impact on the quality of responses you get
from text completion. And I think an important thing to remember, I do think that folks who, you know,
understand how the sausage is made
are the best prompt engineers out there.
Although there's definitely, you know,
I think if you just Google prompt engineering today,
you already see a lot of interesting kind of examples
in Google to get started with.
But one important thing to remember
when you are creating your prompt is,
remember these models are typically trained with next word completion, right?
So it's the autoregressive transformer models.
They just try to predict the next word.
So if you are giving them the beginning of the answer, for example,
you already really made the problem much simpler for them, right?
Because they don't have to guess your intent and get it right.
You are indicating
your intent to them by giving them the first few words of their answer. So that's a great way to
squeeze better results out of them. I would say that personally, I expect this thing to matter
less and less because models will be just much better at understanding your intent,
maybe even better than we are at some point.
Yep.
It's definitely getting there.
And the other interesting trend
that is also pushing things in that way
is what's called instruction fine-tuning.
So my guess would be that maybe with the Llama model
you played with, it was the base Llama
and not an instruction fine-tuned version of Llama.
Right, it was just the naive stock vanilla.
Yeah, and with instruction fine-tuning, what people do is they take a base
model. You know, Llama is pretty good. They have multiple sizes, but it's a pretty good model
overall. Yeah, I think I could only fit the 7 billion on my computer. Yeah, that would have been my guess.
But then they fine-tuned that model to follow instructions.
And then this means this model, you know,
yeah, just has seen a lot of example of an instruction
and a response to that instruction.
And it's now that it can do a better job following instructions.
And then, you know, assuming...
How does that work?
Like, how does the system know that the question is finished?
Like, how do they actually do that fine-tuning?
Yeah, so typically, again,
there's the art of how you format your data set.
So typically, if you look at
most instruction fine-tuning data sets,
they'll have sort of a structure of 'Instruction:', then some instruction text, and then 'Response:', and some text around that.
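Here is a small sketch of what one of those formatted training examples might look like. The exact template varies by dataset; this mimics a common hash-delimited Instruction/Response layout as an illustration, not MosaicML's exact recipe.

```python
# Format a single instruction-tuning example the way many public datasets do:
# labeled sections, optionally delimited with hashes, so the model learns where
# the instruction ends and the expected response begins.
def format_example(instruction: str, response: str) -> str:
    return (
        "### Instruction:\n"
        f"{instruction}\n\n"
        "### Response:\n"
        f"{response}"
    )

print(format_example(
    "Summarize the plot of Macbeth in one sentence.",
    "A Scottish general, driven by prophecy and ambition, seizes the throne by murder "
    "and is undone by guilt and rebellion.",
))
```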
Sometimes people also use hashes to kind of delimit those sections, so the model kind of has an easier time, right, to follow your instruction. Does this make sense?
Yeah, this makes sense. Actually, I don't want to take this on too much of a tangent, but how do you deal with it if most of the data is just crawled off the internet? How do people deal with all the HTML and the markup?
I mean, if you're reading the New York Times and they italicize something, how does that
get into the model?
Yeah, so there's different approaches.
I mean, some models, you actually want them to be able to generate HTML, right?
I'll put that aside for a minute.
Let's assume for a minute you want your model only to be able to write text.
So when you curate your training data set, you filter out things like HTML tags, markdown formats, and stuff like that.
So your model only gets the text data and doesn't see anything else.
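A minimal sketch of that curation step, assuming you are starting from raw crawled HTML. Real pipelines use heavier-duty extraction and filtering, but the idea is the same.

```python
# Strip markup so the training data is plain text only.
# Requires: pip install beautifulsoup4
from bs4 import BeautifulSoup

def html_to_text(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-content blocks entirely
        tag.decompose()
    # Keep only the visible text, collapsing whitespace.
    return " ".join(soup.get_text(separator=" ").split())

print(html_to_text("<p>The <i>New York Times</i> italicized <b>this</b> word.</p>"))
# -> The New York Times italicized this word.
```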
That makes sense.
Yeah.
For some models, you do want them to create HTML.
In that case, you do want to preserve that.
Right.
But again, your model should not only understand HTML, but also understand kind of the context
of, you know, now you're asked to generate HTML, or now you want to generate Python
code.
And then instruction fine tuning is really helpful at explaining to the model that,
hey, for a given response, it's expected to generate the distributions that are more,
you know, text distributions or Python code or whatnot.
Got it. And so I've seen this thing called LoRA. That seems pretty pivotal,
like the low rank stuff seems pretty pivotal to the fine-tuning.
What's the sort of connection there?
Is instruction fine-tuning?
How does that actually work?
Yeah, so I'm definitely not an expert in LoRA,
and I think it's also still pretty early days.
But with LoRA, the idea is that you can do fine-tuning
much more efficiently by decomposing matrices,
and then your fine-tuning is more efficient. But then
you can also take a base model and apply the fine-tuning by just, you know, applying
the factorized matrices you got from LoRA.
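A hedged sketch of what LoRA-style fine-tuning looks like in practice, using Hugging Face's peft library as one common implementation. This is not MosaicML's recipe, and the base model and target modules below are just placeholders.

```python
# LoRA: freeze the base model and train small low-rank factor matrices that are
# added to selected weight matrices. Uses the peft library as one common option.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

config = LoraConfig(
    r=8,                        # rank of the low-rank update
    lora_alpha=16,              # scaling applied to the update
    target_modules=["c_attn"],  # which weights get adapters (GPT-2's attention here)
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the factorized matrices are trainable
# From here you train as usual; to deploy, you apply (or merge) the learned
# low-rank factors back onto the base weights.
```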
But is that the common thing? Let's say someone out there wants to fine-tune a model. Let's continue with the screenwriting example. So someone takes
Llama off the internet and they want to adapt it to screenwriting. And let's say they've found the
screenwriting data set and somehow they've converted it to Markdown or they've stripped
out all the HTML. So they have the screenwriting data, they have the Facebook model. How would
they, you know, either using Mosaic or using
something else, like how would they actually fine tune that? Is there a module that everyone uses
or something? Yeah. So what you would typically do, first of all, you know, curate that data set
with, you know, basically just text. So in this case, let's say it's screenwriting. So what
people would typically do, they'll curate a data set that includes a lot of examples
of screenplays, text, and then they would take a base model that was pre-trained on
general purpose language.
So that model should be pretty good at English, grammar, syntax, and understanding various
concepts and all of that. But then that pre-trained model,
they will just continue a training regime
with that data set that they have.
So they would fine tune it on that data.
Now we're not even getting into LoRA.
LoRA is more like a way to do this in a more optimal manner,
both for the fine tuning and applying that fine tune.
I'll put that aside for a minute.
There's a much more simple thing to do is just to take that data set you created and
then just continue training the pre-trained model with that data.
And what it will do is kind of force the model's parameters to be better tuned for that kind of text,
that kind of language.
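A minimal sketch of that "just keep training the pretrained model on your text" approach, using Hugging Face's Trainer as one common way to do it. The base model and the screenplays.txt file are placeholders, and this is not a specific MosaicML recipe.

```python
# Continue training (fine-tune) a pretrained causal LM on your curated text.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # stand-in for a base model like Llama or MPT
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# screenplays.txt: your markup-stripped screenplay corpus, one passage per line.
raw = load_dataset("text", data_files={"train": "screenplays.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="screenplay-ft",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized["train"],
    # mlm=False means the standard next-token (causal LM) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```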
At Mosaic, we recently...
The other thing I would say is that it's also much cheaper and faster than the pre-training,
because for the pre-training, you need to train it right on billions or even trillions of tokens.
At Mosaic ML, we recently open sourced a model called MPT7B, so 7 billion parameters.
It was trained on 1 trillion tokens of text, of language, which is huge.
And this, you know, it cost us about $200,000 to train this model, this size on this number of tokens. But then to fine tune it
we did an instruction fine-tuned version, a chat fine-tuned version, as well as a model that
is able to actually write books or write stories, write fiction. That was much, much cheaper and much faster.
Like just to give you some data around that,
it's all published in our blog post,
but the base model took us about 10 days on 440 A100 GPUs.
They're almost the best GPUs out there, except for the H100s that are just coming out. So it cost us about $200,000.
So those 400 GPUs for four days...
For 10 days.
Oh, 10 days, okay. So 400, so that's about 4,000 GPU-days, cost $200,000. Yeah, it's not cheap.
Yeah, yeah, it's definitely not cheap. But luckily, we've open sourced it with the weights. So anyone
can build on top of it. Now, how does that work? Do they need your PyTorch code? They would, right?
Yeah. So the PyTorch code is defining the model architecture. So that code has been open sourced,
obviously. But there's, you know, and it is,
there are a bunch of optimizations we'd put in there,
but, you know, it's PyTorch code,
a little bit of C++ for some of the optimized operators,
but that's it.
And then there's the weights itself,
which is typically stored in a separate file,
but then, you know, it's just PyTorch weight.
So we have example code, but basically once you instantiate
the class for your model, you just use the standard
PyTorch interface to load the parameters into the model.
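For reference, here is roughly what loading those open-sourced weights looks like, assuming the checkpoint published on the Hugging Face Hub as mosaicml/mpt-7b. trust_remote_code is needed because the architecture code ships with the checkpoint; the details may have changed since release, so check the model card.

```python
# Load the open-sourced MPT-7B weights and generate a few tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mosaicml/mpt-7b"  # as published at release time; check the Hub model card
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.bfloat16,   # ~2 bytes per parameter, roughly 14 GB of weights
    trust_remote_code=True,       # the MPT architecture code ships with the checkpoint
)

prompt = "Write the opening line of a play set in Verona:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```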
Yeah, kind of going full circle, you know, in computer vision we've been
doing this for a while, where, you know, you have a trunk
model and then you have a bunch of heads
for that model. One head detects, you know, traffic lights, another head detects pedestrians,
stuff like that. And so it's well-traveled ground there. I wonder how data efficient it is.
I guess there's no way to really know, right? You try to amp up the learning rate, but there's not really a
scientific way to say, okay, this is how many playwright scripts you need to have a model
that's reasonable. It's like one of these things that's really hard to calculate.
It's really hard, yes. It's still more of empirical trial and error. But what's interesting is, you know, so the version of MPT-7B, that model we open sourced, that
version that is instruction fine-tuned, we took the MPT-7b, we took, you know, a data
set, or I think we combined a couple of data sets that are just out there for, you know,
instruction.
I think it was the Dolly dataset from Databricks. So about 10 million tokens of
data for instruction fine tuning. And basically within like a couple of hours on one node with
eight GPUs, we fine tuned that model. So just to put things in perspective, the base model
took us 10 days of hundreds of GPUs costing $200,000 to train.
But to take that model and then instruction fine-tune it
for following instructions like we discussed earlier,
that took us two hours with just eight GPUs
costing us about 40 bucks.
That's it.
So this is definitely within reach for anyone out there.
It's like the difference between buying a tractor or buying the seeds to plant, right?
Yeah, it's a huge difference.
And that actually kind of is a segue to, you know, we spoke about the first way to leverage LLMs, just call an API. The second way is take an open source model
and either use it as is or fine tune it for your needs. Either way, you know, it's fairly
accessible today, fairly cheap and available. And so what about serving? I want to use the
time we have left to talk about that. Let's say you try Llama on Hugging Face. And, you know, a lot of these Hugging Face pages have a web UI where you can ask questions. You know, it's not as sophisticated
as OpenAI's site, but it's good enough, you can type in your question, it'll generate an answer.
And you say, yep, this is good enough. You fine-tune a model. And now you want to build a
website or some service for people, but the model can't even fit on your GPU.
So how do you even serve the model? Do people use CPUs to serve the model? Is that a thing? What's the story there?
So if you take, you know, for example, MPT-7B or the Llama-7B, it's a 7
billion parameter model, right? Every parameter takes, you know, two to four bytes, depending
if you're using FP16 or FP32. Typically serving today is done with FP16, or more specifically
BF16. You know, so 7 billion parameters times two bytes, that's, you know, 14 gigabytes. That
actually does fit on kind of good GPUs today, like the NVIDIA A100 with 40 gigs,
or even the A10s, with 24 gigs or 32 gigs of memory. So one of these production grade GPUs can hold such a model.
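A quick back-of-the-envelope check of that memory math.

```python
# Rough serving-memory estimate for the model weights alone (ignores the KV
# cache, activations, and any batching overhead).
params = 7e9           # 7 billion parameters
bytes_per_param = 2    # FP16 / BF16
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~14 GB, which fits on a 24-40 GB GPU
```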
But then there's other complexity there. I mean, you know, first
of all, when you're talking about text generation,
you want it to be fast and efficient.
Now, remember, the way these models work
is they generate one word after the other,
or one token, actually, after the other.
So actually, the latency of inference matters a lot,
especially for interactive applications,
because, you know, a typical response to a model
is definitely not just one word.
It typically has, I'd say, tens, some questions even hundreds of tokens.
So we want the inference to be as optimal as possible.
And that's definitely one, I think, area of development.
Even if you play today with models like ChatGPT,
it's streaming the output word by word, but you can still see that it takes a while.
So setting up optimized inference is one area where there's definitely more and more tooling.
And I think there's more room for the machine learning community to invest in. Now, one thing about that, you know, with regular deep learning models,
like predicting the probability of an event,
you would want to serve on the CPU because you don't have a batch
versus in training, you know, you have a batch of data.
Is that true here or is even just generating one word better on a GPU?
Yeah, so GPUs definitely can. I think where they become really cost
efficient there is, like you said, handling a batch. Now, the tricky thing is when you want to
generate output for an input sequence, and let's say you want to generate 50 tokens, you have to first calculate, you know,
response token number zero, and then you feed it in to generate response token number one,
right?
And then, and so on and so forth.
So there is a sequential angle to this.
Where you can do batch even for inference is when you have, you know, you have a service
and you have multiple requests,
different requests at the same time,
then you can batch.
Oh, right.
Yeah.
But then you need scale to be able to handle
something like that, or it's an offline process,
like, you know, batch inference,
which is typically offline,
and then you can do those things.
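A toy sketch of why generation is inherently sequential for a single request: each new token is produced from all the tokens before it, so batching only helps across different requests. The small model here is just for illustration.

```python
# Greedy, token-by-token decoding: every step feeds the previous output back in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tokenizer("To be, or not to be", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                                          # generate 20 tokens
        logits = model(ids).logits                               # forward pass over the prefix
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # pick the next token
        ids = torch.cat([ids, next_id], dim=1)                   # append and repeat
print(tokenizer.decode(ids[0]))
```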
Going back to the question of a CPU,
so I think the main advantage of a CPU is just cost, right? Because GPUs are very
expensive. I know Intel has been doing a lot of work to get their new CPU generation to be pretty
good at handling transformer architecture so people can use it. You know, I have yet to see kind of inference of these kind of models work well on CPU.
But I know it is an area actually Intel is working on.
I even saw a demo.
They did something which looked pretty promising.
But then when you look at the details, it was their newest generation of CPUs.
And actually the cost of that CPU, at least on AWS,
was actually the same as the cost of a low-end GPU. So performance was good, but then on cost, there was no difference. So yeah,
I think, you know, if we look at the trend of, you know, computing and processors, it's that
the cost of running complex workloads, you know, always goes down, right? And I expect this to happen.
So there's been a lot of interesting work by the community of folks kind of allowing
you to run, you know, these models on commodity hardware.
There's something called llama.cpp, I think, that someone hacked together, where it's, you
know, a super efficient, low-level implementation of inference for Llama on a commodity CPU.
So I think it will definitely get there, although we're not there now.
It actually brings me to, there's another important angle of inference.
And again, that's like a differentiating factor, I think, between using an API, a model behind
an API versus using either your own model or an open source model.
And that issue is a huge issue, actually, of data privacy.
You know, when you are leveraging a model behind an API,
you have to send your data outside of a premise
into another service.
Yeah, this was in the news.
I think I hesitate to get the company
wrong here, but I'm pretty sure it was Samsung. The employees were using OpenAI, and then, yeah,
somehow, I don't know what actually happened there, but somehow OpenAI got their data or their
schematics or something.
Yeah, so there was that. That was a big story in the news where,
I think, engineers at Samsung, that was
the report, they were using ChatGPT to kind of write down some of their plans.
And then that data somehow leaked.
It's not clear if it was leaked because OpenAI is using data people send to their service for
inference, and they're using it to retrain the model.
And then the model memorizes some of what it's seeing.
And then it leaked in a response somewhere else.
Yeah, I think somebody else searched
for like the model number.
You know, a competitor was like,
tell me more about the Samsung S4000.
And OpenAI was like, sure, here's what I know.
Yeah, so I don't know if OpenAI,
I don't know if they changed it yet,
but if you're using the free version of ChatGPT,
the default is opt-in, meaning many people are not aware of it,
but you're by default opted in to share your data with OpenAI,
and then they use it for training their model and whatnot.
And that's really something to pay attention to.
And I think the industry needs to mature a little bit.
And I also, my personal take
is that I think
there should be legislation
that governs how these models are used
and the privacy of data
and all that stuff.
But that's just something
for everyone to remember.
There are a lot of advantages
of using a model behind an API.
And we went through that,
those advantages.
But one drawback
is definitely, if you care about data privacy, if you're, like, in finance or healthcare,
or similar industries, you probably don't want to send your data over the wire somewhere. Or,
you know, you want to get very strong, right, guarantees from your service provider about how
this data is going to be used or is not going to be used.
Yeah, I mean, I think the old adage, you get what you pay for, applies here. You know, if
you're using a free API, and OpenAI is spending, you know, we talked about hundreds of thousands
of dollars, you know, on keeping these GPU machines up even to do inference, you know,
you're giving something back, right? And so, you know, using Hugging Face or
Mosaic or one of these services where you're paying for the service, you know, you could
probably get much better privacy guarantees. Yeah. The other thing, by the way, is also cost.
So what people are finding out is when they use models behind APIs, then I think at small scale and prototyping,
it's very cheap, cost-effective.
But if and when this becomes a core use case
for your application,
it's becoming very expensive,
especially if you have a large-scale operation.
So that's also something, you know,
I think people are realizing sometimes a bit too late,
and that's also something to you know, I think people are realizing sometimes a bit too late. And that's also something to factor in because you can get much better cost efficiency if
you are serving your own model or either an open source model or a model you trained.
If you serve it on your own infrastructure, of course, you need to set this up.
And there are some services that help you do that, including Mosaic ML, but it's much more cost efficient than actually using an external service that has a margin and whatnot.
So cost is becoming a thing at scale.
Yeah, I think if everyone uses OpenAI, then it's driving towards a monopoly, which just putting my economist hat on results in infinite profitability for OpenAI.
You know, conversely, if everyone's using their own model, and it's just a matter of who can host
your model, that's driving to zero profitability, or like infinite competition, which is good for
you as a person who wants to use the model. So it sounds like, you know, it doesn't take a lot of
money or time. It really takes you kind of out there building those skills to, you know, grab
the right data set, grab the right model, try a bunch of different fine tuning and learn how that
system works. And in the end, end up with a model that can create some unique value for you or for an
addressable market that you have.
So one thing about, I want to dive into Mosaic, the company here. So there's a ton of folks out
there, listeners who are really interested in this technology, just like there are people
across the world in all disciplines interested, and they would love to get their foot in the door,
work more with AI and machine learning and generative AI.
And so talk a little bit about what's it like at Mosaic and what kind of folks you're trying to
hire for and just general kind of job seeking advice. Yeah. So let's start with Mosaic. So we're
still pretty early stage startup. We're now about 60 employees.
Most of us are in SF, but we also have a couple of other offices,
including in New York and even Copenhagen.
And we're really, you know, kind of relatively small team,
just trying to do a good thing by making state-of-the-art AI
with a focus on generative AI, just more accessible.
So any organization out there, that's what we're out to do.
Any organization out there should be able to, you know,
leverage these models in whatever way works for them,
you know, a model behind an API, which we offer,
or open-source models that we open source,
make available to the community,
or pre-training and fine-tuning your own model.
And we think there is, you know,
great business opportunity with that.
And it also, it's going to kind of really help
kind of the next generation of startups,
as well as big enterprises to use AI.
Yeah, so, and then we're hiring actually.
So, you know, the business has
been going well, you know, we're seeing good traction with customers. We're seeing good
traction with the community and we are growing the team and we're hiring across, you know,
software engineers, both for our cloud platform, as well as for machine learning runtimes. We hire
researchers for our fantastic research team that is using our platform to,
you know, build these amazing models like MPT that we've open sourced. So there's researchers,
we are hiring interns across these both teams, both the research team and the engineering team.
And then we hire for other functions. You know, we hire across product, technical program management, recruiting.
So it's really kind of, you know,
I feel that the team is kind of hitting on all cylinders.
And then as part of that,
we're also continuing our growth.
Yeah, it's really exciting.
And, you know, I guess I'm biased,
but I'm really excited about both the mission of what
we're trying to do, as well as kind of the culture and team at Mosaic.
Cool, that makes sense. And so, you know, as we talked about, for a relatively small sum, you can take the MPT model
and you can fine-tune it to write plays. If someone does a Shakespeare-style playwright model, let me know.
I would love to just add me on Twitter.
I would love to see that.
But I think the best way to get noticed
at a company like Mosaic is to use the product, right?
And to build something and have a portfolio
of accomplishments that you could build at relatively low cost, kind of adjacent to that.
So if someone's a student, you know, a college student, even a high school student, is
Mosaic a tool for them? Is it something that they should know about for when they go into industry?
Is there sort of a free tier? Like, what is the story?
Yeah, great question. So at Mosaic, we do have a few open source components
that anyone can use.
So there is the models I mentioned earlier,
the MPT series of models.
But there's also a training library called Composer.
It's a PyTorch training library, which
just helps train PyTorch models faster and better.
And there's also a streaming dataset library,
which is really useful for training models when you need to stream all the training data from cloud buckets.
However, the product itself,
so far it's been really geared towards enterprises,
meaning there's no free tier or community tier
where people can just easily get started with the platform.
And the reason it was designed this way is just kind of how the company evolved, right?
You know, at the end of the day, it is a business.
And we were going after enterprises initially to establish the business.
And that has gone really well.
And the next thing on our plate is offering some sort of a community tier where, you know,
a broader set of practitioners out there can get started using what we have to offer. And this will
come soon. And I think at that point, definitely, it's going to be very easy for anyone to just get
started, try us out, either use our models as APIs or fine tune either our models or any model out there that is available on the Hugging Face Hub or GitHub or anywhere else.
As well as, of course, kind of pre-training your own model.
Although this tends to usually cater more to the enterprises that have enough data and have the budgets to pre-train these models.
So stay tuned.
It will come.
And at that point, it's going to be amazing.
I'm really looking forward to that moment where we kind of open the floodgates and allow
the community to really engage fully with us.
Cool.
That makes sense.
I mean, in the meantime, folks can get the MPT model.
They can get all of the
weights, the PyTorch code, so that they can continue training on their own data set. And
there's a whole myriad of different services out there. So if this sounds cool to
you, you should, you know, put in the sort of sweat equity here to build something
neat. Definitely email us, you know, tag us on social media with anything you build.
We've actually, inside Baseball here,
we've been really good at placing people.
I've gotten emails lately from people who have been on the show
representing a variety of different companies saying,
oh, we have our first intern who found out about us from the show.
So I think that's a real testament to the audience out there.
You folks are super motivated, highly technical,
which is really great to see
that we're able to sort of,
Patrick and I can kind of connect
to interested parties here, which is awesome.
So we'll put the links to Mosaic ML
and their careers page, all of that on the site,
if that's something that interests you folks out there. Hagay, thank you so much for coming on the show. You know, I think we did an
awesome job kind of covering, you know, in the audio format, how this whole system is
evolving, how it works technically. There'll be tons of resources in the show notes for people
to follow up. And I just want to say I really appreciate you,
you know, spending time with us today.
Thank you, Jason. Thanks for having me. And I really enjoyed this chat.
Cool. Thanks, everyone out there. Have a good one.
Music by Eric Barndollar.
Programming Throwdown is distributed under a Creative Commons Attribution-ShareAlike 2.0 license. You're free to share, copy, distribute, and transmit the work, and to remix and adapt the
work, but you must provide an attribution to Patrick and I and share alike
in kind.