Programming Throwdown - 161: Leveraging Generative AI Models with Hagay Lupesko

Episode Date: July 10, 2023

MosaicML’s VP of Engineering, Hagay Lupesko, joins us today to discuss generative AI! We talk about how to use existing models as well as ways to finetune these models to a particular task or domain.

00:01:28 Introductions
00:02:09 Hagay’s circuitous career journey
00:08:25 Building software for large factories
00:17:30 The reality of new technologies
00:28:10 AWS
00:29:33 Pytorch’s leapfrog advantage
00:37:24 MosaicML’s mission
00:39:29 Generative AI
00:44:39 Giant data models
00:57:00 Data access tips
01:10:31 MPT-7B
01:27:01 Careers in Mosaic
01:31:46 Farewells

Resources mentioned in this episode:
Join the Programming Throwdown Patreon community today: https://www.patreon.com/programmingthrowdown?ty=h
Subscribe to the podcast on Youtube: https://www.youtube.com/@programmingthrowdown4793

Links:
Hagay Lupesko:
Linkedin: https://www.linkedin.com/in/hagaylupesko/
Twitter: https://twitter.com/hagay_lupesko
Github: https://github.com/lupesko
MosaicML:
Website: https://www.mosaicml.com/
Careers: https://www.mosaicml.com/careers
Twitter: https://twitter.com/MosaicML
Linkedin: https://www.linkedin.com/company/mosaicml/
Others:
Amp It Up (Amazon): https://www.amazon.com/Amp-Unlocking-Hypergrowth-Expectations-Intensity/dp/1119836115
Hugging Face Hub: https://huggingface.co/

If you’ve enjoyed this episode, you can listen to more on Programming Throwdown’s website: https://www.programmingthrowdown.com/
Reach out to us via email: programmingthrowdown@gmail.com
You can also follow Programming Throwdown on Facebook | Apple Podcasts | Spotify | Player.FM | Youtube
Join the discussion on our Discord
Help support Programming Throwdown through our Patreon
★ Support this podcast on Patreon ★

Transcript
Starting point is 00:00:00 Hey everybody. So we have seen so many AI hype cycles around so many different areas, right? Self-driving cars were a big deal in 2009, if people remember that. Ray Kurzweil has been talking about the singularity forever. Oh, and even beyond AI, there was Bitcoin and Web3 and all of that. And Patrick and I, we've had folks on, but I've personally kept the latest shiny objects at a bit of an arm's length. But I think generative AI is amazing. I'll just put it out there. I don't think it's singularity AGI type stuff, but I do think that there's a tremendous opportunity
Starting point is 00:01:06 to create value with generative AI. I've been really excited about it. I've been diving deep into the literature and also applications. I know a lot of other folks have too. It's a really exciting area. It's an area that I'm pretty excited about as well. And I'm super excited to have Hagay Lupesko on the show. He's the VP of Engineering at MosaicML.
Starting point is 00:01:33 So thanks for coming on the show, Hagai. Hey, Jason. Hey, Patrick. Thanks for having me. Cool. So we'll definitely dive into generative AI and how folks can use it at home or at their business. But let's start off by talking a little bit about you. What's your background? What was the path that you took that brought you to Mosaic?
Starting point is 00:01:55 Yeah. So I'm currently the VP of Engineering at Mosaic ML. And I guess we'll probably touch on Mosaic ML a bit later on. But I really started my career a while back now. If there's been video, you could have seen all my gray hair. So I started my career as an engineer, you know, back in Israel where I was born and raised. And really earlier in my career, I did a bunch of things around computer vision, medical imaging, vision for factory automation. I even spent a couple of years living in China, working on a startup there. Wow. So wait, let's dive into that a little bit. So you were in Israel and then the US and then
Starting point is 00:02:40 China or straight from Israel to China? No, yeah. So straight from Israel to China. And so what was that like? That was, it must be a huge culture shock. It was definitely initially a shock and then really a fantastic experience because, you know, as we all know, China is even today, actually, you know, kind of growing rapidly.
Starting point is 00:03:02 Back then it was really superb, you know, moving super, super quickly. So just the story is, you know, I was a young engineer back then, had some experience, expertise in computer vision. And, you know, this was actually, so just to put things on kind of on the timeline, this was pre the deep learning revolution. So I'm talking about 2007 80s you know neural networks were not working well so computer vision actually was completely different like the way you apply
Starting point is 00:03:31 computer vision to a to a problem yeah just just to put put context i think you know there's a bunch of hand-coded things right like there was these i remember patrick probably knows this way better than i do but there was um a whole bunch of filters, right? Like Sobel filter and these like directional filters. And you would basically try to build your own deep learning system by just stacking all of these filters as an expert. And then at the end, you would have some shallow model that, you know, is stacked on top of all these other things. Exactly. That was exactly the way you'd apply, you know, define different filters, you would hand tune them. I mean, today from, you know, computer vision neural networks, the
Starting point is 00:04:15 convolution kernels are kind of, you know, figured out during the training process. Back then, we would use convolution quite a bit bit and you would hand tune the convolution to work for your problem. That was actually a lot of fun. It was really interesting process. Of course, it made kind of the solutions not super scalable where for different customers, different problems,
Starting point is 00:04:37 you'd have to sit down and tweak things. You know, field engineers, that's a lot of what they would do. They would sit down with these systems and tweak the the parameters including the convolution kernels by hand oh wow that's that's wild yeah because you know when you're when the convolution kernel is not doesn't know anything about your objective like it's trying to like find edges but that's not your objective your objective is to say like is there a face in this picture
Starting point is 00:05:05 and edges just happen to be kind of like tangentially and you know interesting to that that objective and so then it's like can you come up with a filter it's even more interesting and then yeah to your point deep learning now just does everything for us which is pretty wild yeah and it's even more than that like Like you had to, usually typically in a typical computer vision pipeline, you know, you'd start by taking the input image and then preparing it to be kind of ready for the convolution operator. So you'd have to do different tricks. It was like a whole toolbox of tricks you do to like clean up the image, you know, normalize it manually, and then start scrubbing the image with different
Starting point is 00:05:46 morphological operators. Yeah, so it was quite a ride. But going back to kind of the experience in China, so, you know, I was just married back then. I asked my wife, hey, do you want to go to an adventure in China? And she initially said no. And then I was able to convince her that it's going to be kind of an experience of a lifetime. And we just hopped on the plane.
Starting point is 00:06:12 It was a small company, startup, like maybe five people. They brought me in as sort of the computer vision expert, although definitely there was tons of, you know, I wasn't that of an expert, but, you know, I said, what the heck? And we built a whole system, including hardware. Of course, the differentiator of the system was the software, but it was hardware with,
Starting point is 00:06:36 you know, robotics to, you know, from conveyor belts controlled by, you know by by different actuators through imaging system lighting cameras through integration with the automation in a typical factory and that product was for the pcb industry the printed circuit board industry oh cool yeah yeah, you know, we built a product from scratch. We're able to, you know, sell it to a few companies. I spent a lot of time on factory floors in China, which is by itself is an experience. Oh, I bet. I heard they're massive. They're massive.
Starting point is 00:07:18 It's like a city. Now, you know what's funny? I mean, I'm from Israel. So when I was growing up as a kid, Israel was about like 6 million people. And by the way, just for context, I think Israel is often in the news, but many people don't realize that it's tiny, both in terms of population and geographical size. It's actually smaller than the state of New Jersey in terms of the size. Oh, I never knew that. Yeah. So, I mean, you know, here I am coming from Israel, you know, six million people country, moving to China, going to a suburb of Shanghai.
Starting point is 00:07:55 And that suburb was, you know, a small suburb, six million people. So, yeah, just the size of China is massive. And yeah, so, you know, it was fun. Are there tours? Like, let's say, I've never been to China, I'll be honest. I would love to go. I just never had the opportunity. But if I went, could I tour a factory?
Starting point is 00:08:15 Like, is that a thing that tourists do or not really? I think it's a great idea for a startup, Jason. But no, I don't think it's an option. But these factories are really interesting because they're like little cities, literally little cities. One of our first customers was actually a factory owned by a Taiwanese company and wasn't considered a very big factory, but it had 50,000 workers. Wow. Oh,000 workers. Wow. Oh, my goodness. So, you know, five times 10 to the power of four.
Starting point is 00:08:52 And what you realize after you go there is, first of all, most of the workers are fairly young, meaning, you know, 18-ish. And they actually live in the factory. Oh. 18-ish and they actually live in the factory and they have dorms there they have everything they need like you know food social activities you know places to work out it's it's literally almost like a student dorm only you work you don't study so I found that really interesting yeah one of the things that that blew my mind was this is a long time ago um but they interviewed Tim Cook and they were talking about manufacturing like what it's like because I think this is the time where they're making the Mac Pro in America and they were talking about the difference there and what he he said was, in China, if you need a million people,
Starting point is 00:09:46 literally, like you need 1 million people to show up to like, you know, boost the iPhone, you know, production, you can get a million people. And when he said that, and he wasn't, wasn't hyperbolic. I mean, he literally meant it that it really that hit home on this kind of scale we're talking about. Yeah. And China, by the way, is not done with that process. There are still, the majority of the Chinese population is still in villages and looking to go to the city where they can find, you know, have work, get proper, you know, wages, you know, and start their lives. And this is part of what I think many people don't understand about the Chinese, you know, China and the Chinese government is that they are under immense pressure to sustain growth so that their masses actually kind of have a path to a better life. And that's part of why they're so aggressive on growth.
Starting point is 00:10:37 They just have to grow very quickly to kind of, you know, serve that need of their population. That makes sense. So what happened to the startup? Did the startup grow very quickly or no? you know, serve that need of their population. That makes sense. So what happened to the startup? Did the startup grow very quickly or no? So it started well. And then, you know, I don't know how much, how many of the listeners know,
Starting point is 00:10:57 but 2008 slash nine, there was a pretty massive financial crisis. And we were hit, the startup was hit very significantly by that crisis. It started as the mortgage crisis in the US and then very quickly kind of expanded globally. As often is the case, right, when there is a crisis, people start cutting back on their purchases and then the PCB industry was hit significantly.
Starting point is 00:11:22 So did the chip industry, just because the demand for devices went significantly down. So the startup didn't shut down, but it definitely kind of were on a good trajectory. And then it just kind of, you know, most of the orders were cut back, budgets were cut back. Yeah, so, but we were able to still work through things. However, at some point I had some family issues.
Starting point is 00:11:43 I had to go back to Israel. That was about, you know, two years later. So I went back to Israel and, you know, for a while I was flying to China every month, but it's really unsustainable, especially when you have a young family that needs, you know, needs you to be there for them. And I had my first son who was born. So at some point I just parted ways with the startup. Yep that makes sense and so then after that you were oh so at that point you were back in Israel at some point you were at the US what happened there? Yeah so went back to Israel and then kind of I went back to working in an area that I had some experience on before, medical imaging. So, yeah, I actually went to work for GE Healthcare in Israel
Starting point is 00:12:29 and we built a cardiovascular imaging system, which was, you know, really a lot of fun. And I think for those that have worked in healthcare, you know, there are definitely some downsides. Like it's a very slow moving field because of a lot of regulation. And in general, the audience is very, customer base is very conservative. But then you really feel, on the plus side, you really feel that you're changing the world for the better. Because if you can develop systems that give better care, help detect diseases earlier, help treat diseases. It's really something that, you know,
Starting point is 00:13:08 you feel really good about, right, doing. So I did that for a while. And then, you know, Amazon reached out and they didn't have an R&D office in Israel back then. I mean, now they have tons of R&D offices in Israel, but back then they did not. And they interviewed me and then asked to relocate me to the US.
Starting point is 00:13:32 Ah, so you went to Seattle. No, so they wanted to relocate me to Seattle. But again, I mentioned my wife earlier and how she had to approve moving to China. So again, my wife is the decision maker on these things. And after thinking about that together, Seattle was not the right place for us in terms of just the weather, family.
Starting point is 00:13:58 So we moved to the Bay Area. Ah, okay, got it. So I know Amazon has this like Lab 126 or that that's where the kindle came out of and some of these things is that where you went or was there a separate office um no so the opportunity at least you know amazon kind of offered me back then was actually to join amazon music in sf amazon music back then was a relatively small team. It was a very basic product back then. They were kind of following the iTunes model. And again, for folks that are a bit older like me,
Starting point is 00:14:35 you'd know that digital music actually started by selling songs. So you would buy a song, you would buy an album, you would pay for it, you would have the rights to a digital copy and you know you can deploy it on your uh you know whatever players you had audio streaming was not hardly a thing back then i mean now it's how all of us consume you know music and more than music but back then the technology was not there the business terms were not there so it was a very different world but i joined amazon music at a really amazing time where streaming just started picking up so i actually helped ship amazon music starting in the us and then we expanded globally and that was really super
Starting point is 00:15:18 cool experience because it was part of you know we're participating in that revolution of kind of taking music from being you know download digital content to streaming digital content and that was that was you know a huge revolution for the the entire industry yeah i feel like uh this is you know obviously out of out of my depth but i do feel like just thinking about it economically, it better aligns the incentives, right? Because I remember, I definitely remember, you know, I was huge into, you know, bands and going to concerts, you know, in high school and even in middle school. I think there was one year in high school where I went to something like a hundred shows in one year, and I still had all the tickets and everything. I mean, not big bands, because that would, you know,
Starting point is 00:16:04 that would be, that would break the bank, but, you know, a bunch of the tickets and everything. I mean, not big bands, because that would break the bank, but a bunch of local shows and everything. And there were times where I saw an album and the cover. This is back when we were buying CDs. The cover looked awesome. And I'd never heard the band before, but the artwork looked really great. So I bought it and then the songs were terrible.
Starting point is 00:16:23 And so it's like, okay, well, I lost 10 bucks and so now you know because it's streaming you know the songs that you enjoy that you listen to again and again if that that time is logged and then that credit is assigned you know to the appropriate musician and so and so now i mean the sad part of it is no one cares about album art anymore but the good part of it is that people are just laser focused on the music and the message. Yeah, yeah. I think it definitely really revolutionized the entire music industry. It also increased the pie. And that's, I think, it's a good lesson, by the way, because I think whenever there's
Starting point is 00:17:04 a new technology that comes by, there's always the kind of the pushback, right? Especially when it's a fundamental technology that changes how people, you know, interact with content, for example, or interact with technology. There's always a pushback because, you know, very naturally people are concerned. And we'll get to AI, I guess, later. But I think we all can see similar patterns. But in reality, first of all, these technology changes are something, you know, usually you cannot block.
Starting point is 00:17:36 You can, you know, slow them down a little bit if you really try hard, but you can't block. But second of all, they're usually opening up really new opportunities, business opportunities, consumption opportunities, education opportunities, and whatnot. They typically tend to be for the best or at least have a path that is for the best. And in the music industry, yeah, there was a lot of pushback from the big labels, the companies that control the rights to most of the content, at least on the Western world. But eventually they kind of, you know, went along with it.
Starting point is 00:18:13 And now if you look at the revenue of the music industry from streaming, obviously it's much bigger than, you know, CD sales. But it's also in just if you look at the entire streaming revenue versus what cd sales were its peak uh streaming is now a bigger business so uh and it's not surprising right we all have now phones in our pockets which is also audio streaming devices and just the reach of content is much more much more broad today yeah that's right. And actually, kind of a little foreshadowing here, but one of the most popular trending folks on Spotify was AI Drake, which is an AI version of Drake.
Starting point is 00:18:56 And I was, I listened to some of the tracks and I was blown away. I think they eventually got banned from Spotify because they were using drake's face as their face and so that you can't do that um but they were i want to say in the top 10 of trending for spotify which got my attention it and i listened to it it sounded amazing um i actually was really shocked even with everything we've seen so far with gender i was really shocked at the quality of it come to find out that actually a person wrote the lyrics so i i kind of you know i thought that it was all the way ai where someone just pressed a button but uh but no a person did write
Starting point is 00:19:36 the lyrics but the the text is speech you know and getting the music and getting it all to match the rhythm and everything it was just flawless i mean it, if you haven't yet, I don't think it's on Spotify anymore. I'm not sure what happened there. But you can definitely go on YouTube and look up AI Drake and listen to these songs. It's pretty wild. Amazing. Yeah. Anyway, so going back to kind of my story.
Starting point is 00:20:00 So I spent some time in Amazon Music. It was a lot of fun, but it also very new to me. I mean, it wasn't about computer vision, obviously. It was also about machine learning, which I was anyway, you know, didn't do a lot of things on, you know, outside of my graduate studies and, you know, what applies to computer vision. I was focusing over there more on algorithms for audio streaming, web applications, you know, scaling this from, you know, millions of customers in the US to tens of millions and later even more globally. And also, I think for me, I would just, you know,
Starting point is 00:20:37 relocated from Israel to the Bay Area. The culture was very different. The way technology is developed, like the culture within the companies were different. The way technology is developed, like the culture within the companies were different. And I was, you know, to a large degree, really adapting to that. Why don't you double click on that? Like what is, you know, because Patrick and I have basically been in the US our whole lives. I lived in Italy for two months, other than that I've lived in the US or Canada my whole life. And so what really struck you about like,
Starting point is 00:21:06 maybe, you know, culture and then corporate culture over here? Yeah, so, wow, I don't even know where to start, because the changes are, you know, the differences are pretty significant. So, you know, I started by, you know, in Israel, Israel's culture is, you know, very casual and also very direct. For better or worse, you know, people, you know, would often, you know, not beat around the bushes when they have something to say. You know, they'll tell it to you in your face, even if it may be a bit offending. And in Israel, it's not considered offending. It's just, you know, people tell it for what it is.
Starting point is 00:21:48 I think in the US, you know, people tend to be, I don't have to use the word respectful. It just kind of be more, you know, have more tact around saying things. So when they have something, you know, difficult to say or have some significant feedback, you know, they would share it in a way that is very processed. So for me initially, I had to, you know, really adjust, right, my noise cancellation. So, you know, to really learn that, you know, if people say something, even if they say it in a really,
Starting point is 00:22:24 you know, nice way, I have to read a little bit more into it just because I'm used to, hey, if someone has something important to say and if it's critical of something that is going on or something that I have done, it would come from my experience in Israel and the way I grew up, it would come very directly. In the US, I had to learn to understand the nuances a little bit better. That definitely resonates with me. You know, like Patrick and I grew up on the East Coast, or I guess maybe you'd call it the South, I mean, Southeast. But, you know, in moving to California, you know, I think the way I kind of expressed it, I didn't really tell people this because it also
Starting point is 00:23:00 lacks tact. But just to like explain it, I kind of felt like the people around me were passive aggressive and I was actively aggressive. But yeah, I felt like similar to what you were saying. I would just say stuff and then other people that would meet, especially in the corporate world, I would realize, you're saying a day later, oh, this person actually, they were actually, you know, really happy with this or really upset with that. It's like there's intentionally, you know, a bit of noise in the signal to try to, yeah, I don't know what it is. Maybe it's like there's always plausible deniability of everything.
Starting point is 00:23:40 You know, it's just, it's like a politeness thing. But yeah, even though I've grown up here the whole time, I had to go through the same experience. everything you know it's just it's like a politeness thing but uh but yeah i i even though i've grown up here the whole time i i had to go through the same experiences yeah and i think to your point it's you know the u.s is a very big place so i guess my experience has been based mostly on the you know the culture in california and other areas in the u.s right like you said are probably you, you know, somewhat different. But, yeah, you know, this is one of the differences. I think the other thing, which I actually think that is kind of aligned, actually,
Starting point is 00:24:20 just done a bit differently, you know, Israel versus California is, you know, taking initiative and thinking out of the box you know israel is a you know small country grew up you know israel kind of was developed in an area with a lot of security you know problems so israelis tend to be very you know creative out of the box thinkers and also you know don't have too many too much respect for the way things are done, right? It's like you always think about ways you can do something better. And, you know, not surprisingly, per capita, Israel is the number one country in terms of startup, right? Founding startups. A lot of it comes from that Israeli mindset and culture.
Starting point is 00:25:01 I do think this is, you know, California is actually kind of very similar, maybe for different reasons. But in California, also kind of independent thinking, thinking out of the box, taking initiative, not conforming with status quo is, I feel is kind of encouraged and even maybe, you know, something that is highlighted's highlighted yeah that makes sense yeah it's interesting i think i do feel like there's there's like a real independent spirit you know if you visit places like i've never been to israel but if you visit india for example it felt like like a libertarian paradise because there's so many small companies. If a policeman arrests you, you just give him money one-to-one. You know, you don't have to go to court.
Starting point is 00:25:49 And so it just kind of felt like, yeah, like if the libertarian folks, if you kind of like take it to the limit, that's what you would get. I do feel like in the US, there is just, and healthcare particularly, there's just so much structure. And there's pros and cons to that, but it is different for sure.
Starting point is 00:26:08 You were at Amazon and then at some point you got into... So that kind of was sort of like an intro to AI, your sort of introduction to recommender systems and some of these kind of large-scale AI and kind of augmenting what you did with computer vision. And then what's the path from that to being kind of like all in on AI? What happened next? Yeah, so I spent a few years working in Amazon Music and then kind of decided, hey, I need a change. And then I moved within Amazon to AWS, Amazon Web Services. And back then, AWS was already kind of a rapidly growing business that was already fairly large. So today, AWS is a business in Amazon that generates about $85 billion in revenue every year, which is just massive, right?
Starting point is 00:27:10 Like if it would be its own company, it would have been one of the five biggest software companies out there in terms of software revenue. But back then, it was not that big, but still fairly big. But their machine learning offering was very limited back then. And then they doubled down on it. And I thought that was a really interesting area to be part of. And, you know, I was fortunate enough that they accepted to take me in. In my master's degree in Tel Aviv University, I studied a little bit about, you know, machine learning, among other things. Just like many other folks, you folks who did the CS back then, machine learning was not what it is
Starting point is 00:27:48 today in terms of the dominance. Back then, we were building AWS SageMaker. That's today a very successful machine learning platform offered by AWS. It's a big business. From what I hear, it's the fastest service in AWS's history in terms of growth. So I joined that team and then contributed to SageMaker, worked on deep learning frameworks.
Starting point is 00:28:17 Back then, AWS tries to double down on a framework called MXNet, which is kind of similar to TensorFlow or PyTorch, only it wasn't successful as both of these frameworks. It's tough. I mean, I'm amazed PyTorch was able to take the lead. Yeah. And I had, yeah, I think that was really interesting. I definitely took a lot of lessons from that because, you know, I was on the team that
Starting point is 00:28:40 lost. I was on the MXNet team. And I think, you know, you learn a lot from things that don't work according to plan. Typically, you actually learn more from the things that don't work according to plan or fail than from your success. Because success, you tend to attribute it to yourself and, you know, yourself and your team and that's it. But failure is you're kind of forced to think harder, right, about why things didn't work out.
Starting point is 00:29:06 Yeah, I would love your take on this, because I don't know how that all played out. It's a little bit, I'm definitely a user of TensorFlow and PyTorch. But sort of how did PyTorch kind of take the lead and leapfrog over everybody? You know, and I guess like, what were maybe some of the mistakes MXNet did or some of the gaps that PyTorch was able to fill that allowed them to do that yeah I think some of the things I observed and again I think there's definitely many more angles to it but the first thing I think is usability is the number one thing, right? And I think, especially for us as engineers, we tend to sometimes underrate usability thinking,
Starting point is 00:29:53 oh no, usability, you know, it's similar, right? Like people can achieve the same goal in different way. One way may be more complex than the other, but it's fine. Performance matters more. That's like a very common pitfall. And I think definitely, I think on MXNet side, we definitely fell into that pitfall where we optimize for performance rather than optimizing for usability. So I think that is one key learning. And I think every tool developer, platform developer, framework developer out there, I recommend always put
Starting point is 00:30:25 usability as the most important thing. Performance, you can catch up later on. And actually for, you know, for people to get started, they actually don't look, usually they don't look too much into the performance. They would look more about the usability, how easy it is to onboard, how easy it is to learn, how easy is it to extend? How easy is it to apply it to kind of core problems? Because, you know, at the end of the day, usability is what allows, first of all, people to move quickly solving a problem. And your tool exists to solve a problem.
Starting point is 00:30:55 It doesn't exist for the soul of, you know, existing as a tool. And moving quickly actually saves tons of time and money. So I definitely say usability over performance. That's one key learning to keep in mind. Yeah, I think I actually, you know, if we follow that trail, follow that breadcrumb trail, and one of the things that Facebook did really well was having a lot of different roles inside the company.
Starting point is 00:31:22 You know, it wasn't just everyone was software engineer. And I think that that, although it seems esoteric, if you think about it, that really plays into this where if everyone's a software engineer and a software engineer is building things for other software engineers, then of course, why can't you use this really convoluted API? I did, and I'm a software engineer, right? But if you have, you know, research scientists, machine learning engineer, and then embedded engineer
Starting point is 00:31:50 and software engineer, then, you know, it's more clear that like the machine learning engineer, the research scientist is the customer of something like PyTorch. And that you can't really expect,
Starting point is 00:32:03 even though they have engineer in the title, you can't really expect them even though they have engineer in the title, you can't really expect them to figure out some weird C++ error. And so I think setting up that distinction early on kind of caused all these sort of downstream effects. Because I think if Amazon had treated the folks using MXNet as true customers instead of engineers,
Starting point is 00:32:26 Amazon is amazing at customer satisfaction, right? So it's almost like maybe that's where the issue kind of started, right? Yeah, yeah. That was definitely part of it. I think the other thing which you also alluded to is the importance of building a community. And actually building a community is definitely not trivial, requires deep thought. I'd say at the equivalent level to thinking through software design, for example, you want to think about how do you design your community? How do you design it so people, wherever they are,
Starting point is 00:33:03 they want to use the tool, they're well-supported, they have resources, they have people to follow. That also requires a lot of deep thought. When I look at the PyTorch, I think they definitely, I'm not sure if they did that from the beginning, but at some point they started investing a lot in that and I think they did it fantastically well. I think there is a real PyTorch community and I actually consider myself now part of the PyTorch community. You know, I spoke at their last, you know,
Starting point is 00:33:29 PyTorch developer conference. I met with a lot of people at MosaicML. You know, we are part of that community. And that community is what helps PyTorch be successful and more importantly, really be used by so many people in a really productive and constructive way. Yeah, that's a really good call out. Yeah, I agree 100%. I feel like I'm also part of that community. I think they did an amazing job with SEO.
Starting point is 00:33:57 When you search Google for PyTorch issues, it'll take you to the PyTorch forum. I don't even know if there's a PyTorch stack overflow. I mean, I'm sure there is, but I don't know if there's a significant one. But they've done an amazing job of being the place where you go for issues and solutions. on their ML platform. And now you're at Mosaic, which is a startup kind of full circle. It's a startup here in the US, but a startup nonetheless. And can you kind of describe that? I mean, it feels a little scary. We talked about Amazon, the enormity of the business. AWS is one of the biggest businesses in the world.
Starting point is 00:34:46 And so how do you kind of take that leap to Mosaic knowing kind of what you're up against? Yeah, so I think we missed another kind of station along the way, which is after AWS ML, I joined Facebook. Back then it was Facebook, today it's Meta. And I joined the Meta's AI team. And then I did a bunch of things over there that were a lot of fun, starting with the recommendation platform called Deeper back then, and then expanding also into foundation model services there for language understanding, image understanding, video understanding. I still remember, you know, Haggai and I worked together, and I still remember when Hag know, Hagai and I worked together and I still remember when Hagai first joined.
Starting point is 00:35:27 And I remember thinking, wow, this person has long hair. The guy had this really long hair. But total genius, a ton of respect. You've helped me a ton along the way. So I really do appreciate it. It's been a pleasure. It was a pleasure working with you. See, it was awesome.
Starting point is 00:35:47 We did a bunch of cool AI stuff together. And I think you left after I did, I think, or maybe it was before. It was around the same time, though. A little bit after you did. Yeah. Okay. Yeah. I really kind of really wanted to, after so many years in big tech companies and there's definitely
Starting point is 00:36:06 nothing wrong with big tech companies i think you learn a lot you do a lot you have really kind of your impact just you know propagates through you know these immense customer bases right that these companies have you know in a smaller company have more there, you feel like your impact is more direct and you do have more bandwidth and time to kind of do a zero to one thing, right? Like build something from scratch, solve a core problem with very kind of where it's more easy for you to see kind of full ownership or work with others, right? So I was really tingling for that. And then I, you know, so the opportunity with Mosaic ML, really loved the team, the folks there,
Starting point is 00:36:51 really kind of felt good about the business problem. And I can tell you about that. And then just kind of, you know, decided to make the leap and join Mosaic ML. Got it. Cool. And that, is that the first time you started really diving into generative AI or did you do some of that at Facebook and Amazon? No, it was more at MosaicML.
Starting point is 00:37:14 I think even when I joined MosaicML, I don't even know, like the term generative AI was probably used back then, but not as often as it is used today. Right. Yep. So yeah, at MosaicML, the mission of MosaicML was really to, back then when I joined, it was, let's make machine learning more efficient. The reason for making it more efficient is that, you know, anybody can see the pace at which the complexity of training deep learning models is increasing.
Starting point is 00:37:44 And I think, by by the way that trend is actually toning down now we can get to it in a minute but you know if you look at even the last four years you know going from birth i think it was 2018 to gpt3 175 million parameters a couple of years later there's actually an there has been a growth in the number of model parameters of an order of magnitude every year, which is, it's just insane, right? And obviously, it requires much more compute and, you know, transformer architecture, because of the transformer blocks, it's quadratic in terms of the number of parameters. So, you know, that growth just kind of limited the number of companies, organizations out there that can actually leverage these advanced models just because it became much more expensive to train these models and, of course, also to deploy them.
Starting point is 00:38:37 So Mosaic tried to initially to just make this more efficient so it's more accessible. As we built our product, which is the mosaic ml platform it's a platform for training and deploying these models i kind of realized that the problem space is more than just efficiency i would even say efficiency right is a feature but then there's a lot of other things that make these models less accessible you know it's the complexity of uh you know setting up the infrastructure it's the complexity of getting started with some baseline model. Again, going back to ease of use, right? How can this be made as easy as possible
Starting point is 00:39:12 so as many companies as possible can leverage this technology? And this is our focus now at Mosaic. It's just making state-of-the-art AI with a focus on generative AI accessible to any company out there, you know, not just kind of the usual suspects of the big technology companies or big labs like OpenAI or Google Research. Yeah, that makes sense. So, you know, I think generative AI might be at that point where, you know, an average person has heard the word but has no idea what it is, like, it's not defined
Starting point is 00:39:46 for them. And so it's, it just occupies this sort of space, this soup of different things that they have seen and read about. And so this is a great time for us to really define it. Like, what is what is generative AI? And, you know, what's kind of the brief history there? Yeah, so I would say generative AI refers to, you know, AI technology and more specifically deep learning models that do a really incredible job generating media such as text or images or videos or audio through very simple prompts. And I think typically what we see today in something like, you know, a model like ChatGPT is, you know, you put in text, phrasing a request or a question,
Starting point is 00:40:35 and the model does a really incredible job following through on your request. And then, of course, there's also the kind of another poster child in stable diffusion, where it's a text-to-image or text-plus-image-to-image model that just takes a simple natural language text prompt with a request to generate some visual and does an incredible job of generating that visual. So those are, you know, that's what generative AI is at its core. And I think we're just seeing the beginning of it, meaning these models will be much better at following through on your requests. Plus, they'll be able to generate very impressive additional, you know, mediums, right? And I think video is one such example where we're still early on in video generation, and we'll see much more impressive things come along. But you can even expand this further, right? I recently did a
Starting point is 00:41:32 keynote at a conference in Boston. I spent a lot of time creating the slides. Like I had the idea of what I want to talk about, but then a lot of time was spent creating the slides. I can definitely see generative AI sooner than later actually generating the slides for me, doing a pretty good job at it. Yeah, that totally makes sense. One of the things that always really inspired me, but I didn't know where it was going, was unsupervised and self-supervised learning. I thought, and this this goes way way back
Starting point is 00:42:06 i had this idea where and feel free to in the audience steal this idea i'd love for someone to actually build this but it was an idea where you would have sort of like a zombie game so you'd be you know it's very typical you fight the zombies there's an infestation you need to get the medical supplies whatever but but you would be you'd start on your in your own house so the idea is you know i would plug into google maps or one of these map services and so i would somehow render the game across the entire planet and so whenever someone played the game it would get their location from their phone. Actually, I guess Pokemon Go is kind of like this, isn't it? But you would play in your own house.
Starting point is 00:42:50 The thing that I ran into was, you know, how could I figure out which buildings should have which supplies? And so I thought, well, I could, you know, I could scrape Wikipedia and scrape the Internet and I could try to figure out what I wanted ultimately was for not me for like just the computer to do the work to figure out, oh, if I sneak into this hospital, which is like a real hospital on Google Maps, that I would find medical supplies there. And if I sneak into a car dealership, I wouldn't find medical supplies. I would find gasoline or something. Right. dealership, I wouldn't find medical supplies, I would find gasoline or something, right. And so,
Starting point is 00:43:25 you know, rather than having some content, you know, human in the loop there, I wanted it all to just get rendered, right. And then that kind of led to learning about these embeddings where, where somebody has, you know, scanned all of the internet through Common Crawl or Wikipedia or these things. And they figured out the similarity between words. So you could actually see what's the similarity between hospital and medical kit. And that would be more similar than gas station and medical kit. And so you would use that to sort of generate your game here. And I found that to be, I mean, I never finished the project, but I found it to be just really inspiring how I created like the entire planet
Starting point is 00:44:11 worth of supplies, like in these buildings. And it just was one of these kind of really satisfying moments. And so ever since then, I've always been into that. And there's just something magical about that. Maybe you could talk a little bit about, you know, how does that actually work?
Starting point is 00:44:28 So, you know, when someone's on Mosaic or on, you know, SageMaker, which you also worked on, how do they build these giant models? Yeah. So Mosaic ML offers a platform, right, which is a platform for training and then deploying these models. I can start by maybe quickly describing, you know, when you are, you know, when someone wants to train such a model, let's say a large language model, LLM, what they need to do and then, you know, how a platform can actually help them achieve that. So typically, the first thing is it always starts with defining the task, right? You're trying to solve. And, you know, there's definitely a lot of general purpose LLMs out there, right? That just are, you know, their task is to basically be able to, on the business level, kind of be able to follow through on instructions, requests, questions, and do a good job responding to what a human is asking or prompting.
Starting point is 00:45:29 And then when you look at the machine learning task, it's basically completion of the next word. So when you get an input sequence of words, which is a human sentence, complete the next word, and then complete the next word after that, and the next word after that. And when you do this a bunch of times you get a coherent response the first thing is of course you know you need to figure out your data set training data set and i kind of breeze through some things of course they're pretty complex but the data set then there's the model architecture which covers things like uh you know both the architecture of the neural network, and
Starting point is 00:46:06 we always use neural networks for these things today, as well as the scale of the model. Because with the same architecture, you can scale it, meaning the number of parameters across the different layers can be larger or smaller. It has implications on the compute you'll need and the amount of data you'll need, and we'll get to that in a minute. Then you want to set up your training regime, meaning the hyperparameters for training, as well as your evaluation. How are you going to evaluate your model versus the original task that you had? The next step after that is going to be deploying the model once you have a good model,
Starting point is 00:46:42 and that's almost like a related problem, but but almost separate now what's important to note about all of these things that i've said is the the scale of these models tend to be uh you know very large and uh what we've what what i think the community the industry has found is that when you scale these transformer-based architecture up, you get what's called emergent behavior, meaning the model is suddenly, it's like a step function chain where the model is suddenly able to handle new problems that, I mean, it wasn't explicitly optimized for, and they just emerge with a bigger model size and more training data. One example for that is the ability to solve math problems. I think both OpenAI and their GPT work and paper, Google with their Lambda paper called out some of these emergent behaviors, including solving math problems, but other things as
Starting point is 00:47:43 well. Yeah, it's important to mention to double click on the scale yeah i you know when when i was getting interested in the large language models i thought well you have a pretty decent gpu i mean it's i don't know three or four years old it has i don't know one gigabyte of gpu memory or something i don't know actually how much it has on the order of one or two gigabytes and uh i thought oh i could just download the the data set and train a train a model myself and uh the answer is i'll save everyone in the audience some time you can't do this so the data set is enormous the models are enormous even if you want to fine tune the model
Starting point is 00:48:21 you still have to load it into memory and i think they said you needed like 60 basically you need a gpu that costs two thousand dollars if you want to do this yourself which is uh out of my budget so um um so yeah you kind of have to use a service i think this might be the uh you know maybe some of the computer vision models were like this, but for me, at least this is the first time where you just, you can't try this at home. You can do it yourself, but you can't do it in your own house. Yeah, exactly. So, I mean, just to give a few examples, I mean, you know, Meta published results of training a model called OPT175, which is a 175 billion parameter model.
Starting point is 00:49:11 They trained, they haven't published the weights, but they did publish a log book and other details. It was trained for over a month on thousands of GPUs and budget, that kind of operation can be in the millions of dollars. And I'm not even talking about preparation before, deploying after, just the training. Right. Is a parameter and a weight the same thing? When people say there's 7 billion parameters, is that 7 billion weights?
Starting point is 00:49:38 Yeah, usually that's how people refer to it. So it's definitely immense. Although I think what we're learning in an industry is that, you know, there has been a few things. So I think first of all, a model like OPT or even GPT-3, it was actually under-trained for its size, meaning you can take an actually a smaller model with less parameters, train it on more data, and it will actually perform just as well or even better than a bigger model trained on less data. Now, how do they know that? How do they measure that?
Starting point is 00:50:14 Yeah, so there's a paper published by Open Mind. I think people tend to refer to it as a chinchilla paper. I just don't remember the exact name of the paper. I think our community is having a really good sense of humor when choosing model names and paper names. Right, it's all animals, right? There's llama, alpaca, koala. Yeah, but yeah, so the chinchilla paper basically talks about the scaling laws, meaning for a given, it's all about, you know, all referring to very similar architecture.
Starting point is 00:50:47 So transformer-based architecture and then different model sizes, what's the amount of data? And typically that's counted as number of tokens that are required to train it to its full capacity. Now the way they, there is no fancy math kind of analysis. Unfortunately, I think machine learning is somewhat still feels more like alchemy than science. So what they do is just train a bunch of experiments
Starting point is 00:51:17 and just took the same architecture, train it in different data sizes, data set sizes, and then measured the various evaluation metrics and then kind of came up with their analysis. And they were able to train, I think it was a 60 billion parameter model on, I don't remember how many tokens, and it outperformed the valuation metrics of GPT-3 with 175 billion parameter. So although the number of parameters was a third of OpenAI's big model,
Starting point is 00:51:56 the smaller model actually outperformed the bigger one. Oh, interesting. Yeah. So I think there's new kind of things being discovered by the day. I can tell you that I give example of two of our customers at Mosaic ML. One is Replit. So Replit is kind of very popular online IDE. I'm sure the listeners, some of them at least are familiar.
Starting point is 00:52:21 And for those that don't, definitely check it out. Replit is a fantastic tool for software developers. They built their code assistant, right, called Ghostwriter. And, you know, it does things that are pretty cool, making developers much more productive, like, you know, code completion, you know, it can create functions from comments it can explain your code for you etc so it's really a nice tool the model behind it was trained on mosaic ml platform it's a three billion parameter model uh you know only you quote quote unquote only three billion it's funny today three billion considered a small model just a couple of years ago, it was considered huge. But 3 billion parameter model trained on, I think, about 500 billion tokens of just code, open source code. Is there a common place where I could get a scrape of all of GitHub or something?
Starting point is 00:53:17 How do you get that many tokens of code? Yeah, there are a lot of datasets. I think it's called the stack. That's an open source dataset that you can access. Replit itself, obviously, because people have, you know, using them for, you know, they store a lot of code. They have access to some of that. Of course, when the writers of that code allowed Replit to use it. So, yeah. So there's definitely a lot of kind of specific data sets. Plus, you also tend to mix, right? So usually, and that's where the alchemy part comes in, right? Usually you want to mix your training data set
Starting point is 00:53:58 so it's a bit balanced. So, you know, you want to mix a little bit of kind of natural natural language from you know wikipedia or other websites because you know even code comments for example they're you know they're written in plain english they're not written in c plus plus or whatever other programming language so people tend to do mixing by the way replete published a fantastic blog post talking about how they built that model. And there's a lot of details there, including both the modeling side,
Starting point is 00:54:32 data set management side, as well as the infrastructure. But just going back to kind of the TLDR, so it was a relatively small model, specialized for being a code assistant. And it actually outperformed the, I think, two or three times larger OpenAI Codex. And that's the model, OpenAI Codex is the model behind GitHub Copilot. So I think what's interesting there is, first of all, you know, a relatively small company, I mean, Replit is a startup.
Starting point is 00:55:01 It's a big startup. It's still a startup. It's a big startup. It's still a startup. Was able to train a model smaller than another model, as well as outperforming it on quite a few evaluation metrics. And we're able to do it with actually quite a small team. Yeah, that's amazing. I think you touched on so many different things there. So one thing is, you know, folks should definitely get familiar with doing things on the cloud. And we've talked about this for many shows, we've had folks, we literally just had a show on Kubernetes a few episodes ago. And so you kind of, you'll definitely have a lot of tools at your disposal,
Starting point is 00:55:42 which will abstract away, you know away layer upon layer of this. But it's good to kind of get familiar with running things on the cloud because storing a 500 billion token data set on your desktop, probably out of the question. And definitely the models, it would just the capital cost would get out of control and a lot of students you know if you're in college or high school you know often there's a whole bunch of different amazon credits that you can get and all sorts of services there yeah totally yeah and then it sounds like the the process for you know if you say you know i'm a musician and i want to uh train a model on you know all of these uh actually music we talked about here's an even different one. I'm really into theater and I want all of the English plays around Shakespeare's era all ingested into some large language model. Step one is to find that data set.
Starting point is 00:56:43 And so it sounds like what I usually do, and Hagar, I'd love to get your advice on this too, but I usually just type what I want and then add the word data set at the end into Google and try to see if someone's already done this. Do you have any tips for getting access to data? I think a great place to start would be Hugging Face Hub. They have a data set, repositories there, and a lot of them, actually.
Starting point is 00:57:09 Actually, the problem is choosing the right one out of so many available. But Hugging Face Hub is a great place to start. Similarly, by the way, for starting with the model architecture. So the nice thing is is at this point, you don't have to do anything from scratch. There are data sets available, there are models available, and then there's a lot of training recipes available.
Starting point is 00:57:33 And the best way to get started is just to start with something that is working and then hacking it, right, to fit your specific needs, etc. You know, tweaking the dataset mix, tweaking the task you're training your model for. That actually kind of brings me to a point where, you know, even taking a step back, you know, how can people leverage generative AI or even being more specific, you know, large language models, for example.
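To make the "don't start from scratch" idea concrete, a minimal sketch of pulling an existing dataset and a small base model off the Hugging Face Hub with the datasets and transformers libraries; the wikitext and gpt2 names are just stand-ins for whatever you actually pick.

```python
# Sketch: start from existing pieces on the Hugging Face Hub rather than from scratch.
# The dataset and model names below are illustrative placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download (and cache) a public text dataset from the Hub.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
print(dataset[0]["text"][:200])

# Download a small pretrained causal language model plus its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(model.num_parameters())
```

From here, "hacking it" mostly means swapping in your own dataset mix and training recipe on top of these same objects.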
Starting point is 00:58:04 We've been talking about kind of training your own model a little bit now, and I think it's definitely, it's gotten much easier today, but it's still, you know, even when I look at, you know, Replit, which we just discussed, right, training that model, Replit's model took about 500 GPUs running for about 10 days. Oh, wow. Yeah. So, you know, for those of us that are familiar with efforts at Google and Facebook, it sounds, you know, like something relatively small and fast, and definitely it is compared to the bigger things that have been happening. But then if you approach it from the perspective of, you know, maybe a much smaller company or even just someone who just wants to do a cool project, a student or just someone
Starting point is 00:58:53 doing a cool project on the side, that's definitely still big and requires a lot of, right, monetary investment. Yeah. But there are other ways to actually get started with LLMs that are much faster and cheaper. And we can maybe talk about those as well. Yeah, I think, uh, we'll definitely dive into that. Going back to something you said earlier, I do feel like it's very alchemic at the moment. And I think the reason for that, if you think about what actually, I want to say, standardized, or what took us to the next level in chemical alchemy, was just the reproducibility
Starting point is 00:59:26 and the affordability of experiments. So people could run just thousands of experiments, do them in parallel in very sanitized environments. When we'd go to that factory in China, we'd have to wear those suits where we can't get any dust anywhere. And so everything has been extremely sanitized and, as a result, just very reproducible. And so that's ultimately what turned alchemy into chemistry. And so here, you're totally right. It's not only that it took those 500 machines for 10 days, but it's that it's probably their 20th or 30th model. So it's their 20th time dropping, you know, $2,000 to train this model. And they're constantly altering the data and mixing with it and hyperparameter tuning and all of that. So totally agree. I think, you know, it's very, very hard.
Starting point is 01:00:21 You know, it's a big investment to train one of these from scratch. And so that's, I guess, where fine tuning and other things come in. And what have you seen kind of on that front? Yeah. So, you know, I think there are two other approaches people are taking that are simpler, you know, have a lower barrier of entry or are easier to get started with. One is just using a model behind an API. And OpenAI is, I think,
Starting point is 01:00:56 one service that is very broadly used already today where basically it's very easy to get started. You just sign up with the service, get an API key, and then you have access to a really powerful general purpose model. And what's nice is, to have access to that capacity, all you need to do is kind of just write an API call in, you know, whatever is your favorite programming language. It's fairly simple and you don't need to know anything about machine learning.
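As a rough illustration of the "model behind an API" path, this is roughly what such a call looked like with the pre-1.0 openai Python package that was current around this time; the model name and prompt are placeholders, and newer versions of the package use a different client object.

```python
# Rough sketch of calling a hosted model behind an API (OpenAI-style).
# Assumes the pre-1.0 `openai` Python package; the model name is just an example.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain fine-tuning in two sentences."},
    ],
)
print(response["choices"][0]["message"]["content"])
```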
Starting point is 01:01:29 But still, you have that power, that capacity. That's one good way to get started, especially to create prototypes, right? Or to play around with the technology and understand what it's capable of. Yeah, to that point, there's a lot you can do with engineering the prompt. I have a project that I'm working on with OpenAI and, you know, it was giving me answers that were not unreasonable, but didn't fit the product that I was trying to build.
Starting point is 01:01:52 And I kind of found that by, you know, playing around with the prompt. One of the tricks I found is, if you know the beginning of the answer, so if you know, for example, it should start with "Answer:" or a person's name and a colon or "The answer is", if you actually write that, it makes a huge, huge difference because it massively narrows the scope. So for example, I would ask a question, and this is actually, I wasn't using OpenAI at this point, I was using Llama,
Starting point is 01:02:26 which is Facebook's open source LLM. So I asked a question, and then it generated another question, and then another question, and another question. I was like, no, I want you to answer the question. And so I found it as simple as putting "Answer:" at the end of my question sentence, which told it that an answer was expected. And so to your point, even before you try anything with gradients and loss functions and all of that, just playing around with something like OpenAI's model or any model as a service can teach you about the problem you're trying to solve. Exactly. Yeah. And I think the prompt engineering is definitely another kind of field of alchemy,
Starting point is 01:03:12 if you like, but today it does have a really massive impact on the quality of responses you get from text completion. And I think an important thing to remember, I do think that folks who, you know, understand how the sausage is made are the best prompt engineers out there. Although there's definitely, you know, I think if you just Google prompt engineering today, you already see a lot of interesting kind of examples to get started with.
Starting point is 01:03:40 But one important thing to remember when you are creating your prompt is, remember these models are typically trained with next word completion, right? So it's the autoregressive transformer models. They just try to predict the next word. So if you are giving them the beginning of the answer, for example, you already really made the problem much simpler for them, right? Because they don't have to guess your intent and get it right.
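A small sketch of the prefix trick being described here, using a tiny stand-in checkpoint; a larger base model shows the effect far more clearly, and the exact outputs will vary from run to run.

```python
# Sketch: nudging a base (non-instruction-tuned) model by handing it the start
# of the answer. "gpt2" is just a small placeholder checkpoint.
from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")

# Bare question: a base model may happily continue with more questions.
print(generate("What is the capital of France?", max_new_tokens=30)[0]["generated_text"])

# Prefixing "Answer:" tells the model what kind of continuation is expected.
prompt = "Question: What is the capital of France?\nAnswer:"
print(generate(prompt, max_new_tokens=30)[0]["generated_text"])
```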
Starting point is 01:04:03 You are indicating your intent to them by giving them the first few words of their answer. So that's a great way to squeeze better results out of them. I would say that personally, I expect this thing to matter less and less because models will be just much better at understanding your intent, maybe even better than we are at some point. Yep. It's definitely getting there. And the other interesting trend
Starting point is 01:04:29 that is also pushing things in that way is what's called instruction fine-tuning. So my guess would be that maybe with the Llama model you played with, it was the base Llama and not an instruction fine-tuned version of Llama. Right, it was just the naive stock vanilla. Yeah, and with instruction fine-tuning, what people do is they take a base model. You know, Llama is pretty good. They have multiple sizes, but it's a pretty good model
Starting point is 01:04:59 overall. Yeah, I think I could only fit the 7 billion on my computer. Yeah, that would have been my guess. But then they fine-tuned that model to follow instructions. And this means this model, you know, has just seen a lot of examples of an instruction and a response to that instruction. And now it can do a better job following instructions. And then, you know, assuming... How does that work?
Starting point is 01:05:24 Like, how does the system know that the question is finished? Like, how do they actually do that fine-tuning? Yeah, so typically, again, there's the art of how you format your data set. So typically, if you look at most instruction fine-tuning data sets, they'll have sort of a structure of "instruction", colon, some instruction text, and then "response", colon, and some text around that. Sometimes people also use things like hashes to mark those sections, so the model kind of has an easier time, right, following your instruction. Does this make sense?
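As a rough sketch of that formatting, here is one plausible way to render instruction and response pairs into flat training text; the "###" delimiters are just a common convention, not the exact template behind any particular model.

```python
# Sketch: one plausible way to render (instruction, response) pairs into the
# flat text a model is trained on. Delimiters vary across datasets.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def format_example(instruction: str, response: str) -> str:
    """Turn one instruction-tuning record into a single training string."""
    return TEMPLATE.format(instruction=instruction, response=response)

print(format_example(
    "Summarize what a tokenizer does.",
    "A tokenizer splits text into the integer tokens a language model consumes.",
))
```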
Starting point is 01:06:14 Yeah, this makes sense. Actually, you know, I don't want to take this on too much of a tangent, but how do you deal with it if most of the data is just crawled off the internet? How do people deal with all the HTML and the markup? I mean, if you're reading the New York Times and they italicize something, how does that get into the model? Yeah, so there's different approaches. I mean, some models, you actually want them to be able to generate HTML, right? I'll put that aside for a minute. Let's assume for a minute you want your model only to be able to write text. So when you curate your training data set, you filter out things like HTML tags, markdown formats, and stuff like that.
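A toy sketch of that kind of filtering, assuming BeautifulSoup is available; real curation pipelines do far more than this (deduplication, quality and language filters, and so on).

```python
# Sketch: strip markup from crawled pages so the model only sees plain text.
from bs4 import BeautifulSoup

def html_to_text(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    # Drop script/style blocks entirely; their contents aren't prose.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # get_text collapses the remaining tags (including <i>, <b>, etc.) to text.
    return " ".join(soup.get_text(separator=" ").split())

print(html_to_text("<p>The <i>New York Times</i> italicizes <b>titles</b>.</p>"))
# -> The New York Times italicizes titles.
```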
Starting point is 01:06:55 So your model only gets the text data and doesn't see anything else. That makes sense. Yeah. For some models, you do want them to create HTML. In that case, you do want to preserve that. Right. But again, your model should not only understand HTML, but also understand kind of the context of, you know, now you're asked to generate HTML, or now you want to generate Python
Starting point is 01:07:20 code. And then instruction fine tuning is really helpful at explaining to the model that, hey, for a given response, it's expected to generate the distributions that are more, you know, text distributions or Python code or whatnot. Got it. And so I've seen this thing called LoRA. That seems pretty pivotal, like the low rank stuff seems pretty pivotal to the fine-tuning. What's the sort of connection there? Is it instruction fine-tuning?
Starting point is 01:07:49 How does that actually work? Yeah, so I'm definitely not an expert in LoRA, and I think it's also still pretty early days. But with LoRA, the idea is that you can do fine-tuning much more efficiently by decomposing matrices, so your fine-tuning is more efficient, but then you can also take a base model and apply the fine tuning by just, uh, you know, applying the factorized matrices you got from LoRA. But is that the common thing? So if someone
Starting point is 01:08:18 let's say someone out there wants to fine tune a model, let's continue with the screenwriting example. So someone takes Llama off the internet and they want to adapt it to screenwriting. And let's say they've found the screenwriting data set and somehow they've converted it to Markdown or they've stripped out all the HTML. So they have the screenwriting data, they have the Facebook model. How would they, you know, either using Mosaic or using something else, like how would they actually fine tune that? Is there a module that everyone uses or something? Yeah. So what you would typically do, first of all, you know, is curate that data set with, you know, basically just text. So in this case, let's say it's screenwriting. So what
Starting point is 01:09:03 people would typically do, they'll curate a data set that includes a lot of examples of screenplays, text, and then they would take a base model that was pre-trained on general purpose language. So that model should be pretty good at English, grammar, syntax, and understanding various concepts and all of that. But then that pre-trained model, they will just continue a training regime with that data set that they have. So they would fine tune it on that data.
Starting point is 01:09:34 Now we're not even getting into LoRA. LoRA is more like a way to do this in a more optimal manner, both for the fine tuning and for applying that fine-tune. I'll put that aside for a minute. A much simpler thing to do is just to take that data set you created and then just continue training the pre-trained model with that data. And it will kind of force the model's parameters to be better tuned for that kind of text, that kind of language.
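A minimal sketch of that "just continue training on your domain text" recipe using the Hugging Face Trainer; the base model, the screenplays.txt file, and the hyperparameters are all placeholders, and LoRA is deliberately left out.

```python
# Sketch: continue training a pretrained causal LM on domain text (e.g. screenplays).
# Model name, data file, and hyperparameters are placeholders, not a recipe.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # stand-in for whatever base model you can actually fit
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# One hypothetical text file of screenplays, one example per line.
data = load_dataset("text", data_files={"train": "screenplays.txt"})["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="screenplay-model", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```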
Starting point is 01:10:10 At Mosaic, we recently... The other thing I would say is that it's also much cheaper and faster than the pre-training, because for the pre-training, you need to train it, right, on billions or even trillions of tokens. At Mosaic ML, we recently open sourced a model called MPT-7B, so 7 billion parameters. It was trained on 1 trillion tokens of text, of language, which is huge. And this, you know, it cost us about $200,000 to train this model, this size on this number of tokens. But then to fine tune it, we did an instruction fine-tuned version, a chat fine-tuned version, as well as a model that is able to actually write books or write stories, write fiction, and that was much, much cheaper and much faster.
Starting point is 01:11:07 Like just to give you some data around that, it's all published in our blog post, but the base model took us about 10 days on 440 A100 GPUs. It's almost the best GPUs out there, except for the H100s just coming up. So it cost us about two hundred thousand dollars. So those 440 GPUs for four days... for 10 days, oh, 10 days, okay. So that's about 4,000 GPU-days, costing $200,000. Yeah, it's not cheap. Yeah, yeah, it's definitely not cheap. But luckily, we've open sourced it with the weights. So anyone can build on top of it. Now, how does that work? Do they need your PyTorch code? They would, right? Yeah. So the PyTorch code is defining the model architecture. So that code has been open sourced,
Starting point is 01:12:03 obviously. But there are, you know, a bunch of optimizations we put in there, but, you know, it's PyTorch code, a little bit of C++ for some of the optimized operators, but that's it. And then there's the weights themselves, which are typically stored in a separate file, but then, you know, they're just PyTorch weights.
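A toy sketch of what "just PyTorch weights" means in practice: you instantiate the architecture from the released code, then load the released parameter file into it with the standard state dict interface. TinyLM below is a stand-in for the real model class, and the file name is a placeholder.

```python
# Sketch: the standard PyTorch interface for loading released weights.
# TinyLM stands in for the real (open-sourced) model class.
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int = 100, d_model: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        return self.head(self.embed(tokens))

model = TinyLM()
torch.save(model.state_dict(), "weights.pt")        # pretend this is the released file

restored = TinyLM()                                  # instantiate the architecture code...
restored.load_state_dict(torch.load("weights.pt"))   # ...then load the parameters into it
restored.eval()
```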
Starting point is 01:12:24 So we have example code, but basically once you instantiate the class for your model, you just use the standard PyTorch interface to load the parameters into the model. Yeah, kind of going full circle, you know, in computer vision, we've been doing this for a while, where, you know, you have a trunk model and then you have a bunch of heads for that model. One head detects, um, you know, traffic lights, another head detects pedestrians, stuff like that. And so, um, it's well-traveled ground there. I wonder how data efficient it is.
Starting point is 01:12:59 I guess there's no way to really know, right? You try to amp up the learning rate, but there's not really a scientific way to say, okay, this is how many playwright scripts you need to have a model that's reasonable. It's like one of these things, it's like really hard to calculate. It's really hard, yes. It's still more of empirical trial and error. But what's interesting is, you know, so the version of MPT-7B, that model we open sourced, that version that is instruction fine-tuned, we took the MPT-7B, we took, you know, a data set, or I think we combined a couple of data sets that are just out there for, you know, instruction. I think it was the Dolly dataset from Databricks. So about 10 million tokens of
Starting point is 01:13:46 data for instruction fine tuning. And basically within like a couple of hours on one node with eight GPUs, we fine tuned that model. So just to put things in perspective, the base model took us 10 days on hundreds of GPUs, costing $200,000 to train. But to take that model and then instruction fine-tune it for following instructions like we discussed earlier, that took us two hours with just eight GPUs, costing us about 40 bucks. That's it.
Starting point is 01:14:20 So this is definitely within reach for anyone out there. It's like the difference between buying a tractor or buying the seeds to plant, right? Yeah, it's a huge difference. And that actually kind of is a segue to, you know, we spoke about the first way to leverage LLMs, just call an API. The second way is take an open source model and either use it as is or fine tune it for your needs. Either way, you know, it's fairly accessible today, fairly cheap and available. And so what about serving? I want to use the time we have left to talk about that. Let's say you try Llama on Hugging Face. And, you know, a lot of these Hugging Face sites have a web UI where you can ask questions, you know, it's not as sophisticated as OpenAI's site, but it's good enough, you can type in your question, it'll generate an answer.
Starting point is 01:15:15 And you say, yep, this is good enough, and you fine tune a model. And now you want to build a website or some service for people. But the model can't even fit on your GPU, like, how do you even serve the model? Do people use CPUs to serve the model? Is that a thing? What's the story there? So if you take, um, you know, for example, MPT-7B or the Llama 7B, so it's a 7 billion parameter model, right? Every parameter, let's say, you know, two to four bytes, depending if you're using FP16 or FP32. Typically serving today is done with FP16, or more specifically BF16, you know, so 7 billion parameters times two bytes, that's, you know, 14 gigabytes. That actually does fit on kind of good GPUs today, like the NVIDIA A100, 40 gigs,
Starting point is 01:16:08 or even the A10s, 24 gigs or 32 gigs of memory. So these production grade GPUs can, you know, one GPU can hold such a model. But then there's other complexity there. I mean, you know, first of all, when you're talking about text generation, you want it to be fast and efficient. Now, remember, the way these models work is they generate one word after the other, or one token, actually, after the other. So actually, the latency of inference matters a lot,
Starting point is 01:16:37 especially for interactive applications, because, you know, a typical response from a model is definitely not just one word. It typically has, I'd say, tens, for some questions even hundreds, of tokens. So we want the inference to be as optimal as possible. And that's definitely one, I think, area of development. Even if you play today with models like ChatGPT, it's streaming the output word by word, but you can still see that it takes a while.
Starting point is 01:17:11 So setting up optimized inference is one area where there's definitely more and more tooling. And I think there's more room for the machine learning community to invest in. Now, one thing about that, you know, with regular deep learning models, like predicting the probability of an event, you would want to serve on the CPU because you don't have a batch versus in training, you know, you have a batch of data. Is that true here or is even just generating one word better on a GPU? Yeah, so GPUs definitely can. I think where they become really cost efficient there is, like you said, handling a batch. Now, the tricky thing is when you want to
Starting point is 01:17:55 generate output for an input sequence, and let's say you want to generate 50 tokens, you have to first calculate a token number, you know, response token number zero, and then you feed it in to generate token response number one, right? And then, and so on and so forth. So there is a sequential angle to this. Where you can do batch even for inference is when you have, you know, you have a service and you have multiple requests, different requests at the same time,
Starting point is 01:18:26 then you can batch. Oh, right. Yeah. But then you need scale to be able to handle something like that, or it's an offline process, like, you know, batch inference, which is typically offline, and then you can do those things.
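A toy sketch of both points at once: generation is a loop that appends one token per step, but several independent requests can share each forward pass. The checkpoint is a small stand-in, and this skips the KV caching a real serving stack would use.

```python
# Toy sketch: autoregressive generation is sequential (one token per step),
# but multiple independent requests can be batched through each forward pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"          # pad on the left so new tokens append cleanly
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

requests = ["The capital of France is", "My favorite programming language is"]  # two concurrent users
batch = tokenizer(requests, return_tensors="pt", padding=True)
ids, mask = batch["input_ids"], batch["attention_mask"]

with torch.no_grad():
    for _ in range(20):                                              # 20 new tokens per request
        logits = model(input_ids=ids, attention_mask=mask).logits[:, -1, :]  # one pass for the batch
        next_ids = logits.argmax(dim=-1, keepdim=True)               # greedy pick of the next token
        ids = torch.cat([ids, next_ids], dim=1)                      # feed it back in and repeat
        mask = torch.cat([mask, torch.ones_like(next_ids)], dim=1)

for text in tokenizer.batch_decode(ids, skip_special_tokens=True):
    print(text)
```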
Starting point is 01:18:39 Going back to the question of a CPU, so I think the main advantage of a CPU is just cost, right? Because GPUs are very expensive. I know Intel has been doing a lot of work to get their new CPU generation to be pretty good at handling transformer architecture so people can use it. You know, I have yet to see kind of inference of these kind of models work well on CPU. But I know it is an area actually Intel is working on. I even saw a demo. They did something which looked pretty promising. But then when you look at the details, it was their newest generation of CPUs.
Starting point is 01:19:21 And actually the cost of that CPU, at least on AWS, was actually the same as the cost of a low end GPU. So performance was good, but then on cost, there was no difference. So yeah, I think, you know, if we look at the trend of, you know, computing and processors, it's that the cost of running complex workloads, you know, always goes down, right? And I expect this to happen here. So there's been a lot of interesting work by the community, of folks kind of allowing you to run, you know, these models on commodity hardware. There's something called llama.cpp, I think, that someone hacked together, where it's, you know, a super efficient, you know, low-level implementation of inference for Llama on a commodity CPU.
Starting point is 01:20:08 So I think it will definitely get there, although we're not there now. It actually brings me to, there's another important angle of inference. And again, that's like a differentiating factor, I think, between using an API, a model behind an API, versus using either your own model or an open source model. And that issue is a huge issue, actually, of data privacy. You know, when you are leveraging a model behind an API, you have to send your data outside of your premises into another service.
Starting point is 01:20:41 Yeah, this was in the news. I think, I hesitate to get the company wrong here, but I'm pretty sure it was Samsung. The employees were using OpenAI, and then, yeah, somehow, I don't know what actually happened there, but somehow, uh, OpenAI got their data or their schematics or something. Yeah, so yeah, that was a big story in the news, where, um, I think engineers at Samsung, that was the report, they were using ChatGPT to kind of write down some of their plans. And then that data somehow leaked.
Starting point is 01:21:13 It's not clear if it was leaked because OpenAI is using the data people send to their service for inference, they're using it to retrain the model. And then the model memorizes some of what it's seeing. And then it leaked in a response somewhere else. Yeah, I think somebody else searched for like the model number. You know, a competitor was like, tell me more about the Samsung S4000.
Starting point is 01:21:36 And OpenAI was like, sure, here's what I know. Yeah, so I don't know if OpenAI, I don't know if they changed it yet, but if you're using the free version of ChatGPT, the default is opt-in, meaning many people are not aware of it, but you're by default opted in to share your data with OpenAI, and then it's used for training their model and whatnot. And that's really something to pay attention to.
Starting point is 01:22:01 And I think the industry needs to mature a little bit. And I also, my personal take is that I think there should be legislation that governs how these models are used and the privacy of data and all that stuff. But that's just something
Starting point is 01:22:15 for everyone to remember. There are a lot of advantages of using a model behind an API. And we went through those advantages. But one drawback is definitely, if you care about data privacy, if you're, like, in finance or healthcare or similar industries, you probably don't want to send your data over the wire somewhere. Or,
Starting point is 01:22:36 you know, you want to get very strong, right, guarantees from your service provider about how this data is going to be used or is not going to be used. Yeah, I mean, I think the old adage, you get what you pay for, applies here. You know, if you're using a free API, and OpenAI is spending, you know, we talked about hundreds of thousands of dollars, you know, on keeping these GPU machines up even to do inference, um, you know, you're giving something back, right? And so, you know, using Hugging Face or Mosaic or one of these services where you're paying for the service, you know, you could probably get much better privacy guarantees. Yeah. The other thing, by the way, is also cost. So what people are finding out is when they use models behind APIs, then I think at small scale and prototyping,
Starting point is 01:23:26 it's very cheap, cost-effective. But if and when this becomes a core use case for your application, it's becoming very expensive, especially if you have a large-scale operation. So that's also something, you know, I think people are realizing sometimes a bit too late. And that's also something to factor in, because you can get much better cost efficiency if
Starting point is 01:23:50 you are serving your own model, either an open source model or a model you trained. If you serve it on your own infrastructure, of course, you need to set this up. And there are some services that help you do that, including Mosaic ML, but it's much more cost efficient than actually using an external service that has a margin and whatnot. So cost is becoming a thing at scale. Yeah, I think if everyone uses OpenAI, then it's driving towards a monopoly, which, just putting my economist hat on, results in infinite profitability for OpenAI. You know, conversely, if everyone's using their own model, and it's just a matter of who can host your model, that's driving to zero profitability, or like infinite competition, which is good for you as a person who wants to use the model. So it sounds like, you know, it doesn't take a lot of
Starting point is 01:24:45 money or time. It really takes you kind of getting out there and building those skills to, you know, grab the right data set, grab the right model, try a bunch of different fine tuning and learn how that system works. And in the end, end up with a model that can create some unique value for you or for an addressable market that you have. So one thing about, I want to dive into Mosaic, the company here. So there's a ton of folks out there, listeners who are really interested in this technology, just like there are people across the world in all disciplines interested, and they would love to get their foot in the door, work more with AI and machine learning and generative AI.
Starting point is 01:25:26 And so talk a little bit about what's it like at Mosaic and what kind of folks you're trying to hire for and just general kind of job seeking advice. Yeah. So let's start with Mosaic. So we're still pretty early stage startup. We're now about 60 employees. Most of us are in SF, but we also have a couple of other offices, including in New York and even Copenhagen. And we're really, you know, kind of relatively small team, just trying to do a good thing by making state-of-the-art AI with the focus of generative AI just more accessible.
Starting point is 01:26:09 So any organization out there, that's what we're out to do. Any organization out there should be able to, you know, leverage these models in whatever way works for them, you know, a model behind an API, which we offer, or open-source models that we open source, make available to the community, or pre-training and fine-tuning your own model. And we think there is, you know,
Starting point is 01:26:32 great business opportunity with that. And it also, it's going to kind of really help kind of the next generation of startups, as well as big enterprises to use AI. Yeah, so, and then we're hiring actually. So, you know, the business has been going well, you know, we're seeing good traction with customers. We're seeing good traction with the community and we are growing the team and we're hiring across, you know,
Starting point is 01:26:55 software engineers, both for our cloud platform, as well as for machine learning runtimes. We hire researchers for our fantastic research team that is using our platform to, you know, build these amazing models like MPT that we've open sourced. So there's researchers, and we are hiring interns across both of these teams, both the research team and the engineering team. And then we hire for other functions. You know, we hire across product, technical program management, recruiting. So it's really kind of, you know, I feel that the team is kind of hitting on all cylinders. And then as part of that,
Starting point is 01:27:35 we're also continuing our growth. Yeah, it's really exciting. And, you know, I guess I'm biased, but I'm really excited about both the mission of what we're trying to do as well as kind of the culture and team at Mosaic. Cool, it makes sense. And so, you know, as we talked about, for, you know, a relatively small sum, you can take the MPT model and you can fine-tune it to do playwriting. If someone does plays in the style of Shakespeare, let me know. I would love that. Just add me on Twitter.
Starting point is 01:28:07 I would love to see that. But I think the best way to get noticed at a company like Mosaic is to use the product, right? And to build something and have a portfolio of accomplishments that you could do at relatively low cost, kind of adjacent to that. So if someone's a student, you know, a college student, even a high school student, you know, is Mosaic a tool for them? Is it something that they should know about for when they go into industry? Is there sort of a free tier? Like, what is the story? Yeah, great question. So at Mosaic, we do have a few open source components
Starting point is 01:28:48 that anyone can use. So there are the models I mentioned earlier, the MPT series of models. But there's also a training library called Composer. It's a PyTorch training library, which just helps train PyTorch models faster and better. And there's also a streaming dataset library, which is really useful for training models when you need to stream all the training data from cloud buckets.
Starting point is 01:29:14 However, the product itself, so far it's been really geared towards enterprises, meaning there's no free tier or community tier where people can just easily get started with the platform. And the reason it was designed this way is just kind of how the company evolved, right? You know, at the end of the day, it is a business. And we were going after enterprises initially to establish the business. And that has gone really well.
Starting point is 01:29:41 And the next thing on our plate is offering some sort of a community tier where, you know, a broader set of practitioners out there can get started using what we have to offer. And this will come soon. And I think at that point, definitely, it's going to be very easy for anyone to just get started, try us out, either use our models as APIs or fine tune either our models or any model out there that is available on the Hugging Face Hub or GitHub or anywhere else. As well as, of course, kind of pre-training your own model. Although this tends to usually cater more to the enterprises that have enough data and have the budgets to pre-train these models. So stay tuned. It will come.
Starting point is 01:30:29 And at that point, it's going to be amazing. I'm really looking forward to that moment where we kind of open the floodgates and allow the community to really engage fully with us. Cool. That makes sense. I mean, in the meantime, folks can get the MPT model. They can get all of the weights, the PyTorch code, so that they can continue training on their own data set, and
Starting point is 01:30:51 there's a whole myriad of different services out there. So if this sounds cool to you, you should, uh, you know, put in the sort of sweat equity here to, uh, build something neat. Uh, definitely email us, uh, you know, tag us on social media with anything you build. We've actually, inside baseball here, we've been really good at placing people. I've gotten emails lately from people who have been on the show representing a variety of different companies saying, oh, we have our first intern who found out about us from the show.
Starting point is 01:31:21 So I think that's a real testament to the audience out there. You folks are super motivated, highly technical, which is really great to see that we're able to sort of, Patrick and I can kind of connect to interested parties here, which is awesome. So we'll put the links to Mosaic ML and their careers page, all of that on the site,
Starting point is 01:31:43 if that's something that interests you folks out there. Hagay, thank you so much for coming on the show. You know, I think we did an awesome job kind of covering, you know, in the audio format, um, you know, how this whole system is evolving, how it works technically. Uh, there'll be tons of resources in the show notes for people to follow up, and I want to just really appreciate you, uh, you know, spending time with us today. Thank you, Jason. Thanks for having me, and, uh, I really enjoyed this chat. Cool, thanks everyone out there. Have a good one! Music by Eric Barndollar. Programming Throwdown is distributed under a Creative Commons Attribution-ShareAlike 2.0 license. You're free to share, copy, distribute, transmit the work, to remix, adapt the
Starting point is 01:32:40 work, but you must provide an attribution to Patrick and me, and share alike in kind.
