Computer Architecture Podcast - Ep 1: Systems for ML with Dr. Kim Hazelwood, Facebook
Episode Date: May 28, 2020
Dr. Kim Hazelwood is the West Coast head of engineering at Facebook AI Research (FAIR). Prior to Facebook, Kim has donned several hats, from being a tenured professor at the University of Virginia, to director of systems research at Yahoo Labs, and a software engineer at Google. Today, she joins us to discuss systems for Machine Learning (ML) and share her insights on having an agile career.
Transcript
Hi, and welcome to the Computer Architecture Podcast, a show that brings you closer to
cutting-edge work in computer architecture and the remarkable people behind it.
We are your hosts.
I'm Suvinay Subramanian.
And I'm Lisa Hsu.
Today we have with us Dr. Kim Hazelwood, who's the West Coast Head of Engineering at Facebook
AI Research, also known as FAIR.
And prior to Facebook, Kim has donned several hats from being a tenured
associate professor at the University of Virginia, to being a software engineer at Google,
and director of systems research at Yahoo Labs. But today, she's here to talk to us about systems
for ML in particular. Before we begin, a quick disclaimer that all views shared on the show
are the opinions of individuals and do not reflect the views of the organizations they work for.
Kim, thank you so much for joining us today.
We're excited to have you here.
Thanks for having me.
This has been a lot of fun.
So to kick it off, let's just start in broad strokes.
Tell us what you do, especially in your new role, and what gets you up in the morning.
So I just pivoted into a new role within Facebook.
I've been at Facebook for four and a half years.
And more recently, I shifted into the research organization.
Prior to this, I was in infrastructure, which is more of the product data center side.
And my recent role is, you know, two parts. It is
a leadership role in FAIR engineering, as well as the fact that we are spinning up a SysML research
effort within FAIR. And that's being seeded by a team that we had in Boston, combined with some
other researchers that are kind of spread throughout the country
in New York and Menlo Park. So what gets me up in the morning? I have four daughters,
but there's one in particular who likes to wake us all up around seven. There's a rule in the
house that I'm not allowed to be woken up before that point. That's a great rule.
I warn them that Monster Mommy exists before 7
a.m. They don't ask any questions about Monster Mommy. Don't want to find out what Monster Mommy's
like. Yeah, I love that. So in this new role, it sounds like it's been a little bit of a shift from
deployment of real systems to more research. Before we get into sort of what went
on as the new role, maybe you could talk a little about some of the challenges in deploying
real systems at scale for ML. Sure. So I actually, this was something that was near and dear to my
heart, so much so that in 2018, we had a paper in the industry session at HPCA, where we kind of dove into, here are the considerations that come
up when you're deploying at scale that I don't think that a lot of the researchers were thinking
too much about. I mean, there were some things that I didn't even realize were a practical
challenge. So for instance, at Facebook, we have to think about things like disaster recovery.
So what happens if a hurricane were to hit one of our data centers?
You know, would we be able to easily, like, handle that, the loss of many thousands of machines, without Facebook going down? So those are, you know, things like that. In general, just the scale that we deal with is a much bigger scale. When I was an academic, I would talk about scaling things out, and I was talking on the order of a few racks. And things change very, very dramatically
when you're potentially, you know, dealing with a global infrastructure setup where you have
machines around the world, and you're potentially shipping data around the globe.
Yeah, so you're talking a lot about like deploying things at scale. So,
but let's talk about some unique attributes of designing systems for ML. And you know,
how they're maybe similar to the classic computer architecture research paradigm where,
you know, we are used to benchmarks like SPEC and so on. And how are they similar in that respect?
And maybe how are they different as well? Yeah, I mean, so there were a bunch of us who started with a background in systems or
architecture, and came into the SysML space, which is now its own official research area,
but came with more of a systems background. So we had to get caught up on many, many years worth of ML research and work. But in many ways,
the machine learning workloads are workloads. And workloads we're used to. We're used to
abstracting away what's actually happening and figuring out, well, how does that actually just
hit the underlying layers? How does it hit the hardware? What are the computational challenges?
What are the networking challenges?
What are the storage challenges that surface when you're dealing with this new class of workloads?
At first, we essentially just had to get some understanding of what are those workloads like?
And this was before the days of MLPerf. Today we have MLPerf, where we've started to agree upon what are the workloads that we're going to focus on. And what's nice about that is it's a very diverse set of workloads. Before this, it was the Wild
West. And I think for the most part, people over-optimized and over-pivoted on specific
subsets of machine learning. So for instance, computer vision. So for the longest time,
pretty much all of the GPUs were being designed for computer vision workloads. All of the assumptions that
were being made throughout the stack were for one small subset of the workload. So that was another
thing that we tried to educate people about in the HPCA 2018 paper was just the broad diversity of workloads that are actually in play
when you're talking about machine learning and how they differ, particularly when it comes to
the lower layers in the stack. You're talking about the diversity in these workloads. So let's
drill a little bit down into that. So diversity can mean a lot of different things. It can mean
diversity in the compute requirements. It can mean diversity in what the end application cares about, like latency
versus throughput, which is energy efficiency or something else. So can you talk a little more
about that? I think your papers generally delve into a lot of details, but give us a little more
of a glimpse into that diversity and what are the different dimensions to that diversity
and how they eventually affect the systems that you design. Sure. So early on, people used to approach me and say,
what kind of workloads does Facebook care about?
Do they care about CNNs or RNNs?
And so the hidden implication there is that those are the only two options
and that's what diversity looks like, is that it's either a CNN or an RNN.
You know, I went around the company just to understand what is the diversity of the workloads?
What does it look like? And then that's where I realized there's a huge emphasis on deep learning,
but there's actually a whole lot of practical applications of ML that don't need deep learning. In fact, it's complete overkill. There are a bunch
of workloads that are moving very, very quickly, meaning that pretty much every day you're tweaking
the ads algorithm or the ranking algorithms for content. But something like face recognition is a pretty stable workload where it's kind of solved.
You know, we kind of know how to recognize a given person's face.
So there's no need to really innovate very, very swiftly there.
Right. So this is both like model research as well as how frequently you train, I'm guessing.
Exactly.
So frequency at which new models come out.
And also like for a given model, how frequently you need to train it. And this is important to keep in mind because if we're thinking about things like, okay,
machine learning training, right? How often does that need to happen? And what are the
computational requirements? And what are the main considerations that come into play? You know,
one of the big things that required a little bit of a mental shift for me was when I realized just how long running
some of these workloads were when it came to training like a new language model, for instance.
You know, this can be weeks. And when it's weeks, performance is still important, but what
suddenly starts becoming even more important is things like reliability. And I don't mean reliability from
the typical computer architecture lens. I mean, does this thing run for seven days and then fail?
And then you have to start over from scratch and run for another 14 days. So basically,
like what are the chances of actually successfully training a particular
model in a given amount of time?
That's what I mean by reliability.
And then that's the first instance where we started to run into terminology collision
with ML researchers.
And, you know, when we started, when I heard people talking about reliability at first,
I was thinking the computer architecture definition of reliability, where we're really thinking about circuits. No one in the ML community really is thinking too
much about hardware reliability. Stuck-at faults.
So I wanted to go back a little bit to something you were saying before about over-indexing on computer vision, and on particular pieces of the ML space in general. Would you say, for people who are more classical computer architects, it's a little bit like the way we revolve around SPEC in the general compute space?
Yeah, I mean, so SPEC was an interesting example, because you see, the goal was to come up with a diversity of workloads. There's 26 benchmarks, right? In our mind, this was a very diverse set of workloads. But then immediately we started to kind of over-pivot and over-optimize for those SPEC benchmarks; the processors were designed optimizing for SPEC. And where this
was really, really interesting was in the case, so you have like one particular benchmark, like GCC. This one is much bigger
than all the others and behaves differently. And because of that, in many cases, you would
dismiss it. You'd say, oh, it's an anomaly. This worked well on everything except GCC.
And it wasn't until later, when I started to analyze production workloads, that I found they're nothing like SPEC. And in fact, the closest thing we had was something like GCC, which had a much bigger instruction footprint, but that had been ignored or dismissed as an anomaly prior to that. Even still, GCC was like
orders of magnitude smaller in instruction footprint from a real production workload. I mean, we were dealing
with 100 megabyte binaries, hundreds of megabytes for like doing search or anything like an actual
practical problem that needs to be solved at scale. These are massive, massive binaries.
So touching upon the issue of workloads, what can architects do to sort of keep their pulse on what the workloads are, what is important, and so on?
Yeah.
So because there's always a natural tendency towards kind of dogpiling, right?
Where everybody piles up on, you know, whatever the flavor of the week was, computer vision,
you know.
Carole-Jean Wu has this fantastic pie chart of here's all of the research investments into particular domains.
So like, here's all the papers that are focusing on computer vision. Here's all the papers that
are focusing on language translation, right? Kind of running through all of the different ML spaces.
And she has this beautiful pie chart. And then, you know, she did this exercise internally within
Facebook, where we looked at all of the workloads that were running out in production.
And we built a similar pie chart with all of the same categories.
And the pie charts were so dramatically different, right?
The big discrepancy being recommendation systems, for whatever reason, have gotten very, very little love in the research community. And it's
genuinely what's powering all of Google. It's what's powering Facebook, all of the, you know,
newsfeed algorithm, the ads algorithms, all of the search results within Google.
This is all recommendation. Is this why I can never find anything to watch on Netflix? Perhaps.
Or someone else is messing up your recommendation model by watching all the things that you don't like.
Yeah.
Okay.
So when you talk about the pie chart internally at Facebook, is that more based in terms of, like, number of models being run or computational hours being run?
Because there's kind of a difference where, you know, like you were saying before,
some of these new language models might take weeks and weeks and weeks, but that's one model.
So how are you counting it? Yeah, so we essentially want to count it in all of the different ways,
right? Right. And in fact, also keep in mind that these are always moving. So those pie charts are
changing essentially daily. So a lot of times what we'll do is just build dashboards and track how things are progressing day over day so that
we can notice whenever there's a shift. You know, we look at lots of trends that we've seen
internally. So I have some great stats on, you know, what the growth of machine learning looks like within Facebook, for instance. So we started to track this. We have a shared set of infrastructure: if you're going to train a machine learning model at Facebook, you use a system called FBLearner.
And so this is like a single entry point where we can see how many people are training models,
how many models are they training, and how complex are those models? Therefore, what kind of computational resources are needed to power all of that?
And I took a look at the numbers, and it turns out we're doubling the number of engineers who
are doing machine learning training every 18 months. We are tripling the number of models that they're training.
So each person is doing more work.
And then the computational complexity is increasing as well, to where it requires 8x more computational resources every 18 months to be able to power all of that training.
So that's a pretty dramatic growth.
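Those per-18-month multipliers compound quickly. Here is a quick back-of-envelope sketch in Python that uses only the three figures quoted above; the annualized and three-year numbers are derived arithmetic, not additional Facebook data.

# Back-of-envelope on the growth figures quoted above (per 18-month window):
# 2x engineers training models, 3x models trained, 8x compute to power it.
# Purely illustrative arithmetic; only the three multipliers come from the episode.

def annualized(multiplier_per_18_months: float) -> float:
    # Convert an 18-month growth multiplier into an equivalent per-year multiplier.
    return multiplier_per_18_months ** (12 / 18)

for name, m in [("engineers", 2.0), ("models", 3.0), ("compute", 8.0)]:
    print(f"{name:9s}: {m:.0f}x per 18 months "
          f"= {annualized(m):.2f}x per year, {m ** 2:.0f}x over 3 years")

Run as written, this prints roughly 1.6x, 2.1x, and 4x per year, and 4x, 9x, and 64x over three years.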
So people, you know, we've got more people doing it. I even gathered one fun fact: the average age of people training models at Facebook is dropping.
This means it's actually extremely important that this is easy to use because you can't assume that
you're going to have
expert ninja programmers who have been doing this
for years and years.
You literally have to optimize for a fresh college grad
who is 22 years old coming out of Berkeley
and needs to be able to come in and use their systems.
This needs to be very easy to use.
Right, so touching upon that point,
this talks about the importance of tools, frameworks,
and the overall entire ecosystem for both getting your models trained and served and deployed and so on.
What do you think about the state of the ecosystem?
You know, where are we doing well? What is missing?
And for example, how could systems researchers and architects build those tools for the community
or leverage insights from the tools that we build?
Yep. So this is something that is near and dear to my heart because I worry sometimes that as
computer architects, we're really, really excited about what happens under the covers, but we have
to recognize that we're one of the only people who care about what happens under the covers.
And the worst thing we can do is expose all of that complexity to people who just don't have the background to be able to do a good job at making key decisions if we expose too fine-grained a level of detail to them under the covers. I think that moving up the stack, thinking very, very deeply about what experience you want a data scientist to have, and having the proper expectations for how much expertise they can have is super important for us to keep in mind, because for all of the different
accelerators that we're proposing, for all of the different optimizations or changes that we might
be proposing, we need to keep in mind that we need to make this invisible and easy to use and as automated as is humanly possible.
Because otherwise, like if you put a human in the loop where they actually have to make an informed decision, this is not a good strategy.
Right. This is where things like putting in the right guard bands so that people can't shoot themselves in the foot are so, so, so critical. Designing the right abstractions that help people use this easily, but providing sufficient levers so that if they wanted to optimize for performance, they have the ability to do that. Yeah. And there, I think what's super helpful is to just kind of shadow the customer, right? So just kind of watch somebody, you know, try to use certain tools. We had this performance optimization tool for GPUs, where we were giving feedback to
the users. And we would say, hey, your SM occupancy is low. And they were like, oh, okay.
What does that mean?
What is SM? You know, is that a system memory? Like a good guess, but no.
And then they also just were like, I don't know what to do about that.
Or I don't know if that's normal.
Right.
They have no expectation.
They have no idea.
Like, what is SM occupancy supposed to be?
And, oh, this is low.
Well, what do I do?
I don't know.
You know, so giving actionable feedback is so critical, and not just pointing out, you know, something's wrong. First of all, they're not even going to have the perspective to know if that's bad or good. Right. So giving that perspective, and then also making things much more actionable to them. You know, you might consider increasing your batch size. That they can act on: okay, I know how to do that. But just telling them something unactionable isn't helpful. This is a lesson that we had to learn the hard way.
So it sounds like
then in terms of building systems, you're having to not only consider the actual system itself,
the construction of the architecture, but also the user interface and the software stack in between
the machine and the end users. Exactly, exactly. Because we want to, as I said, hide all of the computer architecture that is going into
this to the extent that we can.
And so actually, I wanted to also touch on something you were saying before about all
your data dashboards and all that sort of thing, because it sounds like you not only
have to pay attention to how your end users are using your equipment and tools and software
stacks and all that. But at the same time,
you have this other layer of data that seems like it's changing very rapidly. So of course,
you want to make data-driven decisions. But if data is changing so rapidly, how do you sort of
accommodate for the rate of change? And in addition, I think that rate of changes also can
be overwhelming to people who are wanting to maybe make the transition into machine learning.
So, you know, given that a lot of our field has transitioned from classical stuff to ML stuff at a relatively rapid rate, maybe you can speak a little bit about the rate of change and how to grok it.
Yeah, I mean, I think for us, like we're used to a setup where everything's extremely stable, right?
You have the SPEC benchmarks.
They were changing at a 10,
15, 20 year cadence. We didn't really have to worry about all of that change.
The ISA was stable.
The ISA was stable. I mean, there were occasional people making a run for, hey,
let's do a new ISA, but it wasn't that frequent. It wasn't anything that we needed to optimize for.
And so, you know, in this new world, you basically, when you're trying to make a decision, right, you not only have to make a decision based on like, I'm going to do my best
with the information that I have, but I also need to recognize that all of this is going to change
tomorrow. And I need to know, like, have a mental model for how do I evaluate when it has changed
enough that I need to go revisit and modify some earlier decisions?
Because you can.
You can change your mind.
You can tweak certain things.
It just gets harder and harder if you're trying to make those changes in the hardware, right?
And so that's a tricky lesson because I think that we all need to recognize we are not yet
into stability when it comes to the ML field. I think we can
rise to the challenge, but we can't use the tricks that we used to use where we assumed,
you know, you can, I'm going to design for language translation and forget the fact that
a new breakthrough model is happening at the cadence of about every six months, that it just is completely different than the old language model. Significantly better,
but it's hitting the resources completely differently. So being able to be flexible
and pivot starts to become very, very important. Right. And for hardware architects in particular,
I think this is a useful lesson, which you also mentioned in your recent blog post that programmability is critical for the accelerators
that you're designing.
Yes, accelerators need to be performant.
Because of just the rapid change in the field and new models and new characteristics in
the models, programmability is probably really important.
Yeah.
In fact, in many cases, it's more important than raw performance.
And you'll see a bunch of examples out in the community where people have spoken with their feet.
So within Facebook, we used to have two ML frameworks.
We had Caffe2 and we had PyTorch.
PyTorch was much easier to use.
Caffe2 was much more performant.
You know which one won?
Not the performant one, right? At some point, having this fast iteration cycle, being able to easily use something, ends up trumping raw performance to, you know, a certain extent. And, you know, we were definitely within the guard bands of that. Okay, if I have to wait an hour, if I have to wait two hours, is that really that big a deal?
If I have to program it for 10 hours, right?
Versus one.
Sure, it matters if it's one week versus two weeks to train a model. But being able to launch the training when I get to work and have it be finished by the time I'm about to leave for the day, that's one threshold.
Like I don't care if it finishes at 2 a.m. versus 5 a.m.
I'm asleep, right?
So this weird notion of performance where it's like it matters
but doesn't matter kind of depends on whether you're falling off
someone's cliff in terms of what is their
working style like.
Right.
What do they need?
Right.
And if it takes you three days to get it working versus one, then actually that should factor
into the equation as well.
Yeah.
Because all of the innovation comes from just, it's trial and error.
If you actually watch how the data scientists and ML
engineers are working, like how are you tweaking the Facebook state-of-the-art ads model,
it's trial and error. It's let me try, let me twist this knob, twist this knob, try that. Is
that better or worse? Better or worse, right? And so this iteration cycle, it's a feedback loop.
And so the faster you can go through one iteration,
the better. But it's not as clean as, like, you know, 10% better means I'm going to get 10% more models run through, because sometimes this is happening and finishing in the middle of the night, right? So it doesn't matter.
So then, given that the advice you're giving
is that you want to be able to hide the abstractions from the end users, and you at Facebook are paying a lot of attention to that user interface. But for pure hardware architects underneath, who are trying to provide a system that's robust and agile for all these potential future situations that we don't even know are going to happen yet, what advice would you give?
So architects need to team up with people working in the software stack, be thinking about tools from the very beginning, and really do collaborative projects together of like, hey, here's what we want to be able to do down under the covers. What implications is that going to have in the higher layers of the stack? And is that okay? And, you know, forming teams and coalitions where we can work together. If you work on an accelerator in isolation, it's just really not a good strategy, because you will have blinders to what implications that might have above the covers. And you don't want that, because that can be a
deal breaker at the end of the day.
Agreed, agreed. I think that's one of the things that we in the hardware industry should always bear in mind: what is above you and what is below you in all the layers of the
stack. Given that you are sort of within our community a little bit the face of ML at Facebook,
I want to ask too about, there's a little bit of an
equivalence now between ML and Facebook, but that's clearly not the case. There's more going
on there. There's larger, greater technology trends at play, you know, the end of Moore's
law, Dennard scaling, all that sort of stuff. So what kind of challenges do you guys have
at Facebook with respect to technology that sort of abstracts away ML and is not necessarily ML
oriented? So I wouldn't say that it abstracts it away or that it's completely separate, but it is something that isn't traditional ML and that is an extremely important problem. And that's the
problem of just dealing with data. So one of the things is that, you know, all of ML is driven by oodles and oodles of data, but in that whole space there's a ton of engineering challenges and interesting research explorations we can do. So a lot of the time, it really just comes down to shuttling data around the globe, going through and taking unstructured data and figuring out how much of it is relevant, how much of it can I safely ignore.
A lot of times you're dealing with just tons and tons of zeros that you're shipping around the globe. You know, so for instance, with a typical ranking and recommendation
challenge, you might have, you know, perhaps 50,000 inputs going into your model, right? And
so this is like 50,000 different bits of information that drive a single decision.
And these decisions are being made trillions of times per day.
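To get a rough feel for the scale those two numbers imply, here is a hypothetical back-of-envelope in Python; the bytes-per-feature figure and the resulting total are illustrative assumptions for the sketch, not numbers from Facebook.

# Illustrative scale check, combining the two figures from the conversation
# (roughly 50,000 inputs per decision, decisions made about a trillion times a day)
# with an assumed 4 bytes per input feature.
inputs_per_decision = 50_000            # from the conversation
decisions_per_day = 1_000_000_000_000   # "trillions of times per day" (lower bound)
bytes_per_input = 4                     # assumption: one 32-bit value per feature

total_petabytes = inputs_per_decision * decisions_per_day * bytes_per_input / 1e15
print(f"~{total_petabytes:,.0f} PB/day if every feature were fully materialized")
# Roughly 200 PB/day under these assumptions, which is one way to see why not
# shipping (and not multiplying) all those zeros matters so much.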
So there's a really, really interesting space there that I think not enough people are paying
attention to. We have to spend a lot of time and effort thinking about that internally.
That's why we have so many data and storage divisions thinking long and hard about this. You know, it's part networking problem, part storage problem. You know, as computer architects, we're going to immediately go to disks and, you know, oh, let's use Flash, right? But, you know, all of that also has to be shuttled over the network and around the world, between servers in a rack and between racks. And there's
just a ton of data movement. And there, I think it's just, it's an interesting engineering
challenge that ends up being a space that we have to spend a lot of time and resources in.
It's not traditional ML though.
Right.
So are you saying computer architects
should move en masse and dogpile in the networking world?
Somebody needs to.
I'm not really seeing...
So the networking folks that I chat with,
they seem to be continuing to do their normal mode of execution on how they're thinking
about networking and not thinking deeply about like, what are the ML implications on that?
So we've run into a bunch of interesting situations where, you know, we designed our
entire server infrastructure with certain assumptions in mind. Like we might assume
that you've just
got a bunch of servers that are serving web traffic. These are independent work streams.
They're not communicating with each other. So you don't have like a ton of intra-rack
communication normally. And we optimized for that. And what happens with the ML spaces is that got
put on its head because now suddenly you were doing distributed training and you had lots of communication happening between the servers, between the racks.
And this was just a networking pattern that people weren't used to.
People were used to the whole system having been optimized for servers talking to the top-of-rack switch, and, you know, it goes up from there. And data is sort of moving in one direction. So this sort of sideways traffic is just something that I think has caught the networking engineers by surprise, to where in some cases, I'm not even sure it's fully landed on them that something is different. Yeah.
And that now
they have bottlenecks that they didn't realize were there. So do you think some of those problems
could be solved by sort of adopting the way HPC does networking? Because they definitely have a
lot of inter-node communication. Yeah. So that's another space where you need to get all the right
people talking the same language and really understanding that you can't go about ML the way you went about
things in the past. There are fundamental differences. And so that needs to fully,
fully land on the HPC community because there are differences. It needs to land on, you know,
it's landing on us in some form. It needs to land on the networking community. It needs to land on the database
community. We need to be willing to rethink some of the assumptions and expectations that we had
for years and years and years of experience because this is a different world. It's a
different set of challenges.
And we need to be open to that
because that opens up the doors
for many, many opportunities.
So speaking of getting a lot of different people
from different backgrounds, different expertise,
like computer architects, networking folks,
storage folks, and so on,
because all of these have to get assembled
to build a performance system
that works for the end users.
And as someone who leads a team who manages a lot of different people, tell us a little
bit about how you think about assembling people with diverse backgrounds, different set of
expertise.
How do you sort of get them communicating in the same language?
And importantly, I guess, how do you have sustained interactions where these people
see the value of collaborating with each other and, you know, coming up with ideas and talking
consistently so
that we improve the systems as a whole?
Yeah.
So you essentially answered the question in some way, which is that we need to get everybody
speaking the same language, which also sort of means taking them out of their norm and
their expected way of doing things and saying, okay, we're going to be solving
a different challenge now. We all need to kind of get together, have some common language. But I
think the, you know, one trick that we tend to use is that we focus on let's solve one core problem
together, one concrete named problem together. This is a good forcing function for us to very,
very swiftly get up to speed on what is the common terminology that we need to think about?
Oh, what are the curveballs that the other types of engineers are going to throw into the equation?
So that's kind of how we tend to go about it as we solve a concrete engineering challenge together.
Okay.
That's driven by, you know, there's a concrete need.
And that's much, much easier than just dancing around with no real deadline or, you know, no forcing function of, we have to get all on the same page or else. That's the tactic that we tend to take.
Well, that's a great way to think about it.
Have like a concrete problem, like a concrete deliverable
that gets everyone on the same page.
And I'm sure there'll be multiple such instances, so over the long run, you have a system that, you know...
Yeah, and, you know, we all usually have some sort of concrete deliverable with a deadline.
But also just like taking notes along the way of like, if I had, you know, wearing my research hat, if I had had, you know, 10 more hours, I would have explored this different avenue and figured out was this the optimal choice.
But sometimes when you're kind of under a deadline, you don't have time for that to truly explore the entire space.
And so I just try to keep note of those.
Like, OK, later, let's go back; there's a whole bunch of different research questions that we can answer.
We can do that later after we've solved the engineering challenge.
Because if we find, hey, we didn't have enough time to truly explore the space, we did our best and we picked one, we can change that later, right? So that's what's really, really cool about that, you know, research-to-production kind of feedback loop, right? You never have to call it done. You can change it tomorrow.
Right, and in general academics like to sort of step back and take a look at, you know,
what are the things that we missed and how can we sort of think about this more systematically?
I'm guessing like this process of sort of taking down notes is also helpful in sort of seeding collaborations with academics or if they're visiting and so on.
Exactly, exactly. In fact, you know, now that we've spun up this SysML research group, we find various engineering teams that are staffed with PhDs who, you know, are like, we had to solve this problem. There was an interesting research experiment, but we don't have time to do it. You guys want to take a look?
And so we'll get like this influx of really awesome research ideas that had a very, very practical foundation. And, you know, it's like
free ideas coming in from around the company. That's awesome. So one of the things that you
were saying before is that you have this kind of, okay, let's solve the engineering problem,
and then we'll go back and re-explore some of these things later. That seems like something that's very viable for things that can have rapid turnaround time like software stacks or a user
interface and that sort of thing. But in the computer architecture field where we all sort of
were born and raised, a lot of these types of things, you know, particularly if you're thinking
about ASICs, very long lead times, where, you know, "we'll do it next time around" basically means "we'll do it in years," which may be too late. So how do you kind of balance that? I don't know to what extent Facebook is doing custom silicon, if you look at some of the job ads they have out, maybe it's some, but to what extent do you sort of balance "we'll do it later" versus lead times in terms of being able to turn around results?
Yeah, I mean, I think the concrete examples that I had in mind were a bit more granular than, you know, an entire chip tape out, for instance, right?
To where the next iteration isn't necessarily years down the line, but it's, you know, more on the scale of, you know, a month or so down the
line. You know, there's a bunch of fundamental decisions that you're making daily and weekly on
any long-term project, right? It's not really just a singular outcome. So there's a bunch of
different design decisions where, like, it could be as simple as, what's the right ratio of, you know, A to B? And what's a quick and dirty way to answer that question?
What's the thoughtful, very, very detailed way that we don't have time to do right now?
But that we can potentially pivot, right? Because even in, say, a chip tape out, right, you can make changes.
It just gets more and more painful the further along you are in the process. So you don't actually ever really have to wait until the next rev. You can choose to, and, you know, that's a good exercise to do as well: how much of this is forgiving, and how much of this would literally have to wait for a complete redesign before we can actually start to think more deeply about it?
But really, it's not so much about concrete decisions; I look for opportunities to really improve the intuition and understanding of a space, right? So a simple example is,
there was a point in time where we were trying
out, you know, would this particular type of model run better on a CPU or a GPU? And the way we would
go about that is we would implement it on CPU and we would implement it on GPU and we'd compare,
right? It would be way easier if we just had some insights, like, based on, you know, what I understand about this workload and what I don't understand about this workload, here's why this might be better than the other mechanism.
So there is a concrete decision that we needed to make there, which is, where do we want to run it? But I also saw that as a research opportunity, to help anybody going forward with a brand new workload that shows up on their plate be able to eyeball it and say, you know, because of the sparsity and because of the computational requirements, this is probably much better suited for one or the other, without having to actually implement it.
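That kind of eyeball judgment could be captured as a simple rule of thumb over a few workload properties. The Python sketch below is hypothetical: the thresholds, field names, and the whole interface are invented for illustration, not any real Facebook or FBLearner tooling.

# Hypothetical sketch of an "eyeball it" placement heuristic in the spirit of the
# discussion above. All thresholds and field names are invented for illustration.
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    arithmetic_intensity: float  # rough FLOPs per byte of data touched
    nonzero_fraction: float      # how dense the inputs are (1.0 = fully dense)
    batchable: bool              # can work be batched into large dense matrices?

def suggest_device(w: WorkloadProfile) -> str:
    # Dense, compute-heavy, batchable work tends to favor GPUs; very sparse or
    # memory-bound work often does fine (or better) on CPUs.
    if w.nonzero_fraction < 0.1:
        return "CPU: inputs are mostly zeros, so GPU compute would mostly sit idle"
    if w.arithmetic_intensity > 50 and w.batchable:
        return "GPU: dense, compute-bound, and batchable"
    return "CPU: likely memory- or latency-bound; measure before committing"

print(suggest_device(WorkloadProfile(arithmetic_intensity=120,
                                     nonzero_fraction=0.9,
                                     batchable=True)))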
Right, so then for architects, there's a layer maybe at the system level
where you can have this kind of like higher level,
more rapid iteration time to be able to decide things.
But, of course, those who are thinking about how to build a better register file with read-write port ratios, or branch predictors or something, they're just going to have to potentially wait, unless they can simulate, I suppose.
I guess so, perhaps. I mean, I imagine their process is a bit more forgiving as well, but I don't tend to operate down at that level, so, yeah, I can't really speak intelligently there.
Yeah, I only say that just because, you know, for computer architects,
we want to make some aspect of it accessible and not necessarily about UX.
Right.
I mean, this is actually a great point, in terms of the tool sets and the way we analyze these kinds of systems. Normally we're used to building a simulator, having workloads, doing very detailed cycle-level or cycle-accurate simulations and things like that.
And it looks like the tool sets that we need
for designing systems for ML are,
they're at different levels of granularity.
Yes, you could build cycle-accurate simulators
if you're looking at very specific kernels,
like solutions and things like that.
But in terms of designing the end system,
it looks like you need a different set of tools.
It need not be a simulator, but the kinds of tools that you need are very different.
Cycle-accurate, when you're talking about something that runs for weeks, makes no sense.
Right?
I remember there was an eye-opening moment for me when I was working at Google, and I ran into Luiz Barroso. And I said, you know, hey, do we have simulators here at Google that we're using? And he sort of chuckled and he was like, why would we do that? He said, I don't care about cycles. I care about seconds. Our network latencies, or, you know, where our bottlenecks are, are at so much of a grander scale than a cycle-accurate simulator.
It's complete overkill, right?
And to where we would only be able to understand
such a tiny snippet of the end-to-end view
that it's of limited utility.
Right.
So this is making me think actually,
what I was trying to poorly crystallize before,
is it seems like the natural human cadence
for wanting to be able to iterate on answers
is something like on the order of a day or a week, maybe.
So no matter where you're looking,
whether you're talking about a branch predictor,
a register file, or whether to run things on a CPU or a GPU, or how to run a network,
the key is that you've got to build a framework for doing evaluation where you can get a useful
answer in approximately the order of days. Right.
Regardless of where you're at. Okay, that's my law.
Right? That's a great law right there. Yeah, there's basically a ton of utility in being able to say, hey, this is going to take nine days to run versus seven days to run. Being able to predict that, there's utility there, right? And you don't need cycle accuracy to get there.
No, you need something that can give you an answer in less than the time that you're talking about, seven versus nine.
Yeah.
Yeah, because if it takes nine days for me to get an answer on does it take nine days to run, that was the most inefficient way to get there.
You've doubled it.
In general, like, what are your opinions on sparsity?
Because a lot of academics also
focus on sparsity. We see a lot of papers looking at sparse workloads and things like that.
The interesting thing about sparsity is it's a very, very overloaded term.
The facet of sparsity that has been most on my mind is really just the optimizations that we can do
due to the fact that a lot of the data we're dealing with is zeros.
Right. And so
multiplying zeros by zeros, this is very very easy.
Right. So being able to leverage that, being able to leverage the fact that
certain things aren't going to be that computationally intensive at the end of the day.
Because in theory they would be because you're doing a lot of multiplications,
but in practice it's zeros and zeros and more zeros.
And so that, number one, I think there is room for providing clarity in the research community,
in the academic community, on
exactly what people mean when they say sparsity. Yann LeCun is going to mean something different than Kim Hazelwood is going to mean, than, you know, someone else who is using that term, because it is very, very overloaded. It's almost like we need a noun following "sparse": sparse what, right? Sparse data? What exactly is sparse? So there, I think the higher bit is we need some clarity on terminology,
but from a computer architecture lens, I'd say we have a pretty big opportunity to leverage the fact
that a lot of the data that we're dealing with is zeros.
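As a tiny, concrete illustration of that point, a multiply that skips the zeros only does work proportional to the non-zero entries. This is a minimal Python sketch of the idea, not any particular production kernel.

# Minimal sketch: a dot product that only touches the non-zero entries of a
# sparse vector. With inputs that are overwhelmingly zeros, the work done is
# proportional to the number of non-zeros, not the nominal vector length.
def sparse_dot(indices, values, dense_vector):
    # indices/values hold only the non-zero entries of the sparse vector
    return sum(v * dense_vector[i] for i, v in zip(indices, values))

# A "50,000-input" vector with only three non-zeros: 3 multiplies instead of 50,000.
dense = [0.5] * 50_000
print(sparse_dot([7, 1_023, 42_001], [1.0, 2.0, -1.0], dense))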
So one of the stats that you were saying earlier that's really striking is this notion of this 8x
increase every 18 months of computational complexity that's required to do all the training that's being done by the
more engineers and the more models that they're running. So at a certain point, this has sort of
got to level off. So in your mind, where's the end game? What's the end point here? You know, how should we be accommodating this kind of
computation? What should we be doing to try and maybe force it to level off so that we don't
require the energy of the sun to do some of this stuff? Exactly. So yeah, I'm not naive enough to
think that these trends will continue. And also, if I pick a different 18-month window, the numbers may be pretty volatile.
So I think that we're still in that growth phase.
I'm not naive enough to think that we'll continue to grow at that rate.
But what I hope we'll start to do, number one, is less egregious use of resources.
Just because people are using 8x the resources doesn't necessarily mean that we're getting the ROI on all of that.
So I think one thing we can do as a community is rein that in, or come up with the right mental model for how to think about whether this is actually a useful use of limited resources and computation.
So I think that there will be some natural like tapering of the growth that
will happen. I think people will start to get more responsible about how they're
using things and then at some point,
you just, there's only so many data centers we can build. And then you just start enforcing, like, priority mechanisms that will, you know, artificially cap the amount of computation that you can do. But I think that, you know, right now it's still the Wild West, where everything's free, and why don't we just train models for fun and, you know, entertainment, we'll train models for learning. There are people training models without really thinking too much about the back-end implications of that. I'm starting to see some
really fascinating research spin up in that space. There was a tweet where one of the researchers at UMass Amherst mentioned that training a language
model, one of the state-of-the-art language models, produced the same amount of CO2 emissions as three car lifetimes.
I read that tweet and I stayed up all weekend.
I was like, this was so impactful on my state of the world
and the state of mind where I thought, oh no, like what are we doing?
We are like a cigarette company in the 60s.
Like we just, we're being so naive.
Like when or how do we think about being environmentally
and mentally responsible here?
Am I contributing to like global harm, right?
And so what's made me happy is starting to see some activities, both in the research community and within companies, to actually think more deeply about that, and quantify, here's how much this experiment cost, you know, the Earth. You know, we already tie things back to money, and we started tying it back to power, because people stopped caring about money, so then you tie it to power. But then people at some point stop caring about power. So like, oh, we'll just use that money to buy more data centers.
Problem solved.
But if you start to put it in terms of like, oh, this many human lives, this is your future generation.
Then suddenly this will be a bit of a wake up call where we can start to think more deeply about responsible AI.
And what's made me happy is that now that has spun up as an official research area, as an official
charter for various teams at lots of different companies, is this notion of responsible AI.
So this is a little bit of a trite connection, but as you were talking, I was reminded of how when you were at Google, if I recall, you led a performance team that
basically was making sure that the people who were using the clusters at Google were
not being too lackadaisical about it, saying like, oh, you took over 50 machines and you
asked for this many gigabytes of memory, but you only used a fraction of it.
So you need to go back and think about how you're using these resources and here's how you can improve them.
Do you think that that's that kind of approach would work in the ML world, too, where people are like, oh, well, you know, I remember as a grad student, I'd be like, I think I might have a bug, but I'm not sure.
I'm going to run these thousand jobs and I'll find out and then I'll fix it and then I'll run the thousand jobs again do you think
there's a space for monitoring that kind of lackadaisicalness?
Yeah, so that's actually something we've already gotten in the business of solving. There are some, you know, efforts within Facebook around demand control, right? So figuring out how much of this demand for compute resources is useful, and how do we rein in egregious use of resources.
And there are a bunch of different avenues, but
one of the most effective ones is really just visibility.
So when people try to go use resources and there's none available and it's
like, okay, your wait time is going to be like seven days. They want to know who's in front of
me in line right now, right? And then they start to self-police. They're like, what are you doing?
Why are you using all of the resources, right? And this all kind of came to us by accident. There
was one time where just people were complaining like, hey, my job's not getting scheduled.
It's not getting scheduled.
What's going on?
We looked into it.
And there was one user at Facebook who was using 75% of all Facebook experimental resources.
That one person was an intern.
So I reached out.
I pinged the intern,
I'm like, hey, what are you doing?
It appears you're using 75% of the resources.
And they were like, what?
I am?
So then I was like, that's totally on us.
If any particular user can abuse the system and not even realize it, that means we didn't give the right feedback mechanism of, like, are you sure? You know, this is the equivalent of, you know, this many millions of dollars and this many resources. So, you know, just raising awareness of, like,
how much are you consuming actually solves a lot of the problems.
A lot of people just don't even realize it.
Shine a light.
So there's that.
And then there's the people.
Then you get to the point where there's the people who realize it but don't care.
And what's helpful there is you basically tell, you know, inform the person behind them in line that the person in front of you is being wasteful.
They will self-police.
And so they'll kind of tap, tap, tap.
Like, why are you at the register for so long?
What's going on here?
Right?
And then from there,
then you can provide feedback mechanisms of like once they've run a...
There's actually a few things we can catch.
We can catch, are you sure you need to train this?
This other user
trained the same model on the same data two weeks ago. Do you just want the answer? Because we already have the answer. So keeping track of what has already been done company-wide,
a lot of times you'll catch duplication. There were cases where people were launching two
identical jobs with identical data because they were worried about
the reliability. Oh dear. Where they were like, well, one of these will finish. So let me just
launch two copies because then I'll be sure to get my answer. You know, this is like super,
super wasteful and not fair. So these are the kinds of things that like once you shine a light
on it, it solves a lot of the problems.
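The duplicate-job case in particular lends itself to a very simple check: fingerprint what a submission would actually compute and look it up before scheduling. The Python sketch below is hypothetical; the field names and the in-memory store are invented for illustration and are not how FBLearner actually works.

# Hypothetical sketch: catch duplicate training submissions by fingerprinting the
# model configuration plus the dataset snapshot. Everything here is illustrative.
import hashlib
import json

completed_runs = {}  # fingerprint -> where the existing results live

def fingerprint(model_config: dict, dataset_version: str) -> str:
    payload = json.dumps({"config": model_config, "data": dataset_version},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def submit(model_config: dict, dataset_version: str) -> str:
    key = fingerprint(model_config, dataset_version)
    if key in completed_runs:
        return f"Identical job already trained; results at {completed_runs[key]}"
    completed_runs[key] = f"/results/{key[:8]}"
    return f"Scheduled new run; results will land at {completed_runs[key]}"

print(submit({"model": "ranker", "lr": 0.01}, "ads_2020_05_01"))
print(submit({"model": "ranker", "lr": 0.01}, "ads_2020_05_01"))  # caught as a duplicate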
And then it starts to get into the harder problem of like,
what does it mean to be fair?
You start giving people quotas.
You start saying, I'm not gonna submit your job
because you've used too many resources.
And so that's something like you very, very quickly
come to the situation where you have to start thinking
about that and solving that problem.
So every company is already thinking about how do you solve the problem of shared compute resources amongst
you know potentially thousands of researchers. Right. So I think maybe this is a good time to
kind of shift gears and talk a little bit about your career. So the one kind of tenuous thread
is that you know you've been talking about agility and reacting
to changing data and your career itself has been quite agile. You've donned a lot of hats. You've
played a lot of roles. So maybe you can talk a little bit about your career past and how you
think about when to make a change, because sometimes that can be very daunting to decide. Yeah. So I've always been the kind of person who, I'm not really a planner. I'm not the person who,
like when I was five years old, I was like, I'm going to get a PhD in computer science
and become a professor. Like if you'd asked me up until about like 10th grade, I would have said,
I'm going to be a dentist or something, right? So I tended to
kind of do a late binding on a lot of my decisions that I think was beneficial because it allowed me
to pivot and respond to opportunities as they arose, even if it would have been in conflict
with my well-laid plans. Throughout my life, at each point when I look back,
I'd be like, if I had told myself five years ago
or 10 years ago that I'd be here,
old self would be like, what, really?
Like, how did that even happen?
It's like, oh, a whole series of events happened.
So, you know, I think because of that,
I've sort of pivoted from academia into industry.
I've always had a bit of a hybrid role.
So like even when I was a faculty member, I was one day a week with Intel.
You know, before that, I was a postdoc, so full time at Intel.
So I've always kind of pivoted back and forth between there's a bit of a slider of like how much academic research do you want to be doing?
How much practical impact do you want to have?
At what speed do you want that to happen?
And I'm always sort of playing around and moving those dials and finding opportunities
to be able to do that.
And so because of that, I've kind of bounced a bit between like pure production roles, which was one of my first roles at Facebook,
to my current role purely in research. You know, so there, I think the other big lesson that I
learned was, you know, sometimes there can be a lot of emphasis on wanting what other people want. So everybody has, you know, like this all started
when I was in grad school. It was like everybody was like, hey, I really, really want to be a
professor. And I remember thinking like, I want to feel that strongly about it. I don't know if I do,
but everybody else seems to want it. So let me try and want it too. And, you know, so I went
through the motions. I was like, you know, I asked my advisor, I said, hey, should I do academia or
industry? And he's like, you're asking your professor. What do you think I'm going to say?
And he said, well, I'll tell you what, just give it two years, give it a shot.
If you hate it, you can leave. People do that all the time. And I thought, oh, that was actually a very freeing concept: I'm not actually making a decision for the rest of my life, right, like, is this where I want to be when I'm 80 years old? Because I'll never be able to make a decision if I think that way. But if I think of everything as a set of decisions that are so forgivable and so changeable, then I don't even really worry about it. Even my transition
within Facebook into research was like, if I go into research and I hate it, I'll just go back.
Like, it's not that big a deal. People do this all the time. So I think that, you know, my biggest
advice is, number one, be true to yourself on what you want. Everybody wants something different and
for different reasons. And everybody has a different setup. So, you know, asking somebody else what they want and thinking
that that's in any way going to apply to you and your own unique situations is just like a
recipe for being unhappy, right? So figure out what you want to do and own that. And, you know,
even in the face of people who were like, what, you left a tenured position?
Who does that?
I was like, me, that's who does that.
Because I could and I also,
like in the back of my mind,
thought if I decided it was a huge disaster,
I think I could go back, right?
So I'm just not too worried about it. I just think, you know, being true to yourself, being able to pivot, has worked out well for me. And just try not to overthink it.
Yeah, yeah. That's great advice. And actually, I think that kind of relates to
something you talked about at your ISCA keynote a couple years ago,
which, for those of us who don't know, is Hazelwood's Law. Do you want to explain that?
So Hazelwood's Law is really about figuring out where the opportunities exist.
So what ends up happening in research communities or in fields
is that there's this dogpile effect, like there'll be some idea that makes sense. And then everybody kind of dogpiles on. And you only want to dogpile on it as much as that problem is worth in the grand scheme of things.
And you want to make sure that you're not leaving giant gaps in between.
And so, you know, at some point maybe there were a ton
of people who were like, oh, let's all focus on quantization, right? Like, yes, quantization is
important, but not everybody needs to be working on quantization. We need some people working on
that and we need some people working on some of the other gaps, you know? So there were a bunch
of opportunities where I realized like nobody's really looking at like the network implications
of ML, right? Maybe I'm the wrong person to do it, I don't have a networking background, but I can definitely identify that that's a gap. And so I feel like recognizing those opportunities, coming up with your own path, like, I don't have to work on quantization because everybody else is. I should find what I'm passionate about and in particular where there's not enough people.
Where are the spaces that need me and need my skills in particular?
Right.
So don't become a professor because everybody else wants to become a professor.
And don't work on quantization because everybody else wants to work on quantization.
Yeah.
I mean, not to say anything bad about quantization. People should be working on quantization.
Absolutely. But just not everybody. Right. Right. And only do it if you genuinely have an affinity
for it, want to do it, like it, and feel like you can make a genuine contribution that's unique.
Exactly. Exactly. Right. So on that note, so what excites you about, you know, the path forward,
the future? Like, do you have a vision for, you know, how systems for ML will evolve?
And more broadly, like, you know, how architecture and, you know, our work will evolve as well?
Yeah, I mean, I think that throughout history, within the computer architecture community,
we have these various trends that come along.
And, you know, you have a choice.
You're like, you know, okay, here's the
energy efficient computing. Here's that trend. Do you want to jump on this? Do you want to not jump
on it? You want to let that one go by? And, you know, I've sort of jumped on some of them. I've
been selective. And the ones I jump on, I go all in. I'm like, okay, I'm going to, you know, I'm picking dynamic optimization until nobody was doing that anymore. And then I picked, you know, I kind of went through this: these things interest me, these things somebody else can work on. Right now we're in
a little bit of a peak in terms of like the hype curve. I think this will stabilize a little bit
and it will open up a ton of new opportunities where we realize, OK, once the dust has settled and once we've stabilized in this particular field, you know, where do we want to go from here? But in those peaks, that's where all the fun happens. So this is why I love where we are right now and kind of where we're heading in the short term.
I make no claims about long term, right? Long term, we'll get bored after some amount of time, and then we'll go move
on to something else. But for now, it is a lot of fun because we get to throw a bunch of things
up in the air and rethink them. And those opportunities don't come along that often.
And so that's why I really, really like what's happening now.
And I also have always been one who likes to straddle that hardware software divide.
And this is like one opportunity where I realized that space is so critical right now for this particular type of workload.
So those are the kinds of things that I'm super excited about, because people always surprise me with the ideas that they come up with during these peak hype periods. And we're in one. And so I just think, you know, don't be jaded about it. Be excited about this. This is a fantastic opportunity to try out all sorts of crazy new ideas.
Awesome.
Well, there you have it, folks.
We've been super, super excited
to have you with us today, Kim.
Thank you for joining us.
It's been an absolute delight.
It's been a really fun conversation.
Absolutely.
And to our listeners,
thank you for being with us
on the Computer Architecture Podcast.
Till next time,
it's goodbye from us.