Computer Architecture Podcast - Ep 1: Systems for ML with Dr. Kim Hazelwood, Facebook
Episode Date: May 28, 2020
Dr. Kim Hazelwood is the West Coast head of engineering at Facebook AI Research (FAIR). Prior to Facebook, Kim has donned several hats, from being a tenured professor at the University of Virginia, to director of systems research at Yahoo Labs, and a software engineer at Google. Today, she joins us to discuss systems for Machine Learning (ML) and share her insights on having an agile career.
Transcript
Hi, and welcome to the Computer Architecture Podcast, a show that brings you closer to
cutting-edge work in computer architecture and the remarkable people behind it.
We are your hosts.
I'm Suvinay Subramanian.
And I'm Lisa Hsu.
Today we have with us Dr. Kim Hazelwood, who's the West Coast Head of Engineering at Facebook
AI Research, also known as FAIR.
And prior to Facebook, Kim has donned several hats from being a tenured
associate professor at the University of Virginia, to being a software engineer at Google,
and director of systems research at Yahoo Labs. But today, she's here to talk to us about systems
for ML in particular. Before we begin, a quick disclaimer that all views shared on the show
are the opinions of individuals and do not reflect the views of the organizations they work for.
Kim, thank you so much for joining us today.
We're excited to have you here.
Thanks for having me.
This has been a lot of fun.
So to kick it off, let's just start in broad strokes.
Tell us what you do, especially in your new role, and what gets you up in the morning.
So I just pivoted into a new role within Facebook.
I've been at Facebook for four and a half years.
And more recently, I shifted into the research organization.
Prior to this, I was in infrastructure, which is more of the product data center side.
And my recent role is, you know, two parts. It is
a leadership role in FAIR engineering, as well as the fact that we are spinning up a SysML research
effort within FAIR. And that's being seeded by a team that we had in Boston, combined with some
other researchers that are kind of spread throughout the country
in New York and Menlo Park. So what gets me up in the morning? I have four daughters,
but there's one in particular who likes to wake us all up around seven. There's a rule in the
house that I'm not allowed to be woken up before that point. That's a great rule.
I warn them that Monster Mommy exists before 7
a.m. They don't ask any questions about Monster Mommy. Don't want to find out what Monster Mommy's
like. Yeah, I love that. So in this new role, it sounds like it's been a little bit of a shift from
deployment of real systems to more research. Before we get into sort of what went
on as the new role, maybe you could talk a little about some of the challenges in deploying
real systems at scale for ML. Sure. So I actually, this was something that was near and dear to my
heart, so much so that in 2018, we had a paper in the industry session at HPCA, where we kind of dove into, here are the considerations that come
up when you're deploying at scale that I don't think that a lot of the researchers were thinking
too much about. I mean, there were some things that I didn't even realize were a practical
challenge. So for instance, at Facebook, we have to think about things like disaster recovery.
So what happens if a hurricane were to hit one of our data centers?
You know, would we be able to easily, like, handle that, the loss of many thousands of machines, without Facebook going down? So those are, you know, things like that. In general, just the scale that we deal with is a much bigger scale. When I was an academic, I would talk about scaling things out, and I was talking on the order of a few racks. And things change very, very dramatically
when you're potentially, you know, dealing with a global infrastructure setup where you have
machines around the world, and you're potentially shipping data around the globe.
Yeah, so you're talking a lot about like deploying things at scale. So,
but let's talk about some unique attributes of designing systems for ML. And you know,
how they're maybe similar to the classic computer architecture research paradigm where,
you know, we are used to benchmarks like SPEC and so on. And how are they similar in that respect?
And maybe how are they different as well? Yeah, I mean, so there were a bunch of us who started with a background in systems or
architecture, and came into the SysML space, which is now its own official research area,
but came with more of a systems background. So we had to get caught up on many, many years worth of ML research and work. But in many ways,
the machine learning workloads are workloads. And workloads we're used to. We're used to
abstracting away what's actually happening and figuring out, well, how does that actually just
hit the underlying layers? How does it hit the hardware? What are the computational challenges?
What are the networking challenges?
What are the storage challenges that surface when you're dealing with this new class of workloads?
At first, we essentially just had to get some understanding of what are those workloads like?
And this was before the days of MLPerf. Today we have MLPerf, where we've started to agree upon what are the workloads that we're going to focus on. And what's nice about that is it's a very diverse set of workloads. Before this, it was the Wild
West. And I think for the most part, people over-optimized and over-pivoted on specific
subsets of machine learning. So for instance, computer vision. So for the longest time,
pretty much all of the GPUs were being designed for computer vision workloads. All of the assumptions that
were being made throughout the stack were for one small subset of the workload. So that was another
thing that we tried to educate people about in the HPCA 2018 paper was just the broad diversity of workloads that are actually in play
when you're talking about machine learning and how they differ, particularly when it comes to
the lower layers in the stack. You're talking about the diversity in these workloads. So let's
drill a little bit down into that. So diversity can mean a lot of different things. It can mean
diversity in the compute requirements. It can mean diversity in what the end application cares about, like latency
versus throughput, which is energy efficiency or something else. So can you talk a little more
about that? I think your papers generally delve into a lot of details, but give us a little more
of a glimpse into that diversity and what are the different dimensions to that diversity
and how they eventually affect the systems that you design. Sure. So early on, people used to approach me and say,
what kind of workloads does Facebook care about?
Do they care about CNNs or RNNs?
And so the hidden implication there is that those are the only two options
and that's what diversity looks like, is that it's either a CNN or an RNN.
You know, I went around the company just to understand what is the diversity of the workloads?
What does it look like? And then that's where I realized there's a huge emphasis on deep learning,
but there's actually a whole lot of practical applications of ML that don't need deep learning. In fact, it's complete overkill. There are a bunch
of workloads that are moving very, very quickly, meaning that pretty much every day you're tweaking
the ads algorithm or the ranking algorithms for content. But something like face recognition is a pretty stable workload where it's kind of solved.
You know, we kind of know how to recognize a given person's face.
So there's no need to really innovate very, very swiftly there.
Right. So this is both like model research as well as how frequently you train, I'm guessing.
Exactly.
So frequency at which new models come out.
And also like for a given model, how frequently you need to train it. And this is important to keep in mind because if we're thinking about things like, okay,
machine learning training, right? How often does that need to happen? And what are the
computational requirements? And what are the main considerations that come into play? You know,
one of the big things that required a little bit of a mental shift for me was when I realized just how long running
some of these workloads were when it came to training like a new language model, for instance.
You know, this can be weeks. And when it's weeks, performance is still important, but what
suddenly starts becoming even more important is things like reliability. And I don't mean reliability from
the typical computer architecture lens. I mean, does this thing run for seven days and then fail?
And then you have to start over from scratch and run for another 14 days. So basically,
like what are the chances of actually successfully training a particular
model in a given amount of time?
That's what I mean by reliability.
And then that's the first instance where we started to run into terminology collision
with ML researchers.
And, you know, when we started, when I heard people talking about reliability at first,
I was thinking the computer architecture definition of reliability, where we're really thinking about circuits. No one in the ML community really is thinking too
much about hardware reliability. Stuck-at faults.
So I wanted to go back a little bit to something you were saying before about over-indexing on computer vision, and on particular pieces of the ML space in general. Would you say, for people who are more classical computer architects, it's a little bit like the way we revolve around SPEC in the general compute space?
Yeah, I mean, so SPEC was an interesting example, because you see, the goal was to come up with a diversity of workloads. There's 26 benchmarks, right? In our mind, this was a very diverse set of workloads. But then immediately we started to kind of over-pivot and over-optimize for those SPEC benchmarks; the processors were designed optimizing for SPEC. And where this
was really, really interesting was in the case, so you have like one particular benchmark, like GCC. This one is much bigger
than all the others and behaves differently. And because of that, in many cases, you would
dismiss it. You'd say, oh, it's an anomaly. This worked well on everything except GCC.
And it wasn't until later, when I started to analyze production workloads, that I found they're nothing like SPEC. And in fact, the closest thing we had was something like GCC, which had a much bigger instruction footprint, but that had been ignored or dismissed as an anomaly prior to that. Even still, GCC was like
orders of magnitude smaller in instruction footprint from a real production workload. I mean, we were dealing
with 100 megabyte binaries, hundreds of megabytes for like doing search or anything like an actual
practical problem that needs to be solved at scale. These are massive, massive binaries.
So touching upon the issue of workloads, what can architects do to sort of keep their pulse on what the workloads are, what is important, and so on?
Yeah.
So because there's always a natural tendency towards kind of dogpiling, right?
Where everybody piles up on, you know, whatever the flavor of the week was, computer vision,
you know.
Carole-Jean Wu has this fantastic pie chart of here's all of the research investments into particular domains.
So like, here's all the papers that are focusing on computer vision. Here's all the papers that
are focusing on language translation, right? Kind of running through all of the different ML spaces.
And she has this beautiful pie chart. And then, you know, she did this exercise internally within
Facebook, where we looked at all of the workloads that were running out in production.
And we built a similar pie chart with all of the same categories.
And the pie charts were so dramatically different, right?
The big discrepancy being recommendation systems, for whatever reason, have gotten very, very little love in the research community. And it's
genuinely what's powering all of Google. It's what's powering Facebook, all of the, you know,
newsfeed algorithm, the ads algorithms, all of the search results within Google.
This is all recommendation. Is this why I can never find anything to watch on Netflix? Perhaps.
Or someone else is messing up your recommendation model by watching all the things that you don't like.
Yeah.
Okay.
So when you talk about the pie chart internally at Facebook, is that more based in terms of, like, number of models being run or computational hours being run?
Because there's kind of a difference where, you know, like you were saying before,
some of these new language models might take weeks and weeks and weeks, but that's one model.
So how are you counting it? Yeah, so we essentially want to count it in all of the different ways,
right? Right. And in fact, also keep in mind that these are always moving. So those pie charts are
changing essentially daily. So a lot of times what we'll do is just build dashboards and track how things are progressing day over day so that
we can notice whenever there's a shift. You know, we look at lots of trends that we've seen
internally. So I have some great stats on, you know, what the growth of machine learning looks like within Facebook, for instance. So we started to track this. We have a shared set of infrastructure: if you're going to train a machine learning model at Facebook, you use a system called FBLearner.
And so this is like a single entry point where we can see how many people are training models,
how many models are they training, and how complex are those models? Therefore, what kind of computational resources are needed to power all of that?
And I took a look at the numbers, and it turns out we're doubling the number of engineers who
are doing machine learning training every 18 months. We are tripling the number of models that they're training.
So each person is doing more work.
And then the computational complexity is increasing as well, to where it requires 8x more computational resources every 18 months to be able to power all of that training.
So that's a pretty dramatic growth.
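Those per-18-month multipliers compound quickly. Here is a quick back-of-envelope sketch in Python that uses only the three figures quoted above; the annualized and three-year numbers are derived arithmetic, not additional Facebook data.

# Back-of-envelope on the growth figures quoted above (per 18-month window):
# 2x engineers training models, 3x models trained, 8x compute to power it.
# Purely illustrative arithmetic; only the three multipliers come from the episode.

def annualized(multiplier_per_18_months: float) -> float:
    # Convert an 18-month growth multiplier into an equivalent per-year multiplier.
    return multiplier_per_18_months ** (12 / 18)

for name, m in [("engineers", 2.0), ("models", 3.0), ("compute", 8.0)]:
    print(f"{name:9s}: {m:.0f}x per 18 months "
          f"= {annualized(m):.2f}x per year, {m ** 2:.0f}x over 3 years")

Run as written, this prints roughly 1.6x, 2.1x, and 4x per year, and 4x, 9x, and 64x over three years.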
So people, you know, we've got more people doing it. I even gathered one fun fact: the average age of people training models at Facebook is dropping.
This means it's actually extremely important that this is easy to use because you can't assume that
you're going to have
expert ninja programmers who have been doing this
for years and years.
You literally have to optimize for a fresh college grad
who is 22 years old coming out of Berkeley
and needs to be able to come in and use their systems.
This needs to be very easy to use.
Right, so touching upon that point,
this talks about the importance of tools, frameworks,
and the overall entire ecosystem for both getting your models trained and served and deployed and so on.
What do you think about the state of the ecosystem?
You know, where are we doing well? What is missing?
And for example, how could systems researchers and architects build those tools for the community
or leverage insights from the tools that we build?
Yep. So this is something that is near and dear to my heart because I worry sometimes that as
computer architects, we're really, really excited about what happens under the covers, but we have
to recognize that we're one of the only people who care about what happens under the covers.
And the worst thing we can do is expose all of that complexity to people who just don't have the background to be able to do a good job at making key decisions if we expose too fine-grained a level of detail to them under the covers. I think that moving up the stack, thinking very, very deeply about what experience you want a data scientist to have, and having the proper expectations for how much expertise they can have is super important for us to keep in mind, because for all of the different
accelerators that we're proposing, for all of the different optimizations or changes that we might
be proposing, we need to keep in mind that we need to make this invisible and easy to use and as automated as is humanly possible.
Because otherwise, like if you put a human in the loop where they actually have to make an informed decision, this is not a good strategy.
Right. This is where things like putting in the right guard bands so that people can't shoot themselves in the foot are so, so, so critical. Designing the right abstractions that help people use this easily, but providing sufficient levers so that if they wanted to optimize for performance, they have the ability to do that. Yeah. And there, I think what's super helpful is to just kind of shadow the customer, right? So just kind of watch somebody, you know, try to use certain tools. We had this performance optimization tool for GPUs, where we were giving feedback to
the users. And we would say, hey, your SM occupancy is low. And they were like, oh, okay.
What does that mean?
What is SM? You know, is that a system memory? Like a good guess, but no.
And then they also just were like, I don't know what to do about that.
Or I don't know if that's normal.
Right.
They have no expectation.
They have no idea.
Like, what is SM occupancy supposed to be?
And, oh, this is low.
Well, what do I do?
I don't know.
You know, so giving actionable feedback is so critical, and not just pointing out, you know, something's wrong. First of all, they're not even going to have the perspective to know if that's bad or good. Right. So giving that perspective, and then also making things much more actionable to them. You know, you might consider increasing your batch size. That they can act on: okay, I know how to do that. But just telling them something unactionable isn't helpful. This is a lesson that we had to learn the hard way.
So it sounds like
then in terms of building systems, you're having to not only consider the actual system itself,
the construction of the architecture, but also the user interface and the software stack in between
the machine and the end users. Exactly, exactly. Because we want to, as I said, hide all of the computer architecture that is going into
this to the extent that we can.
And so actually, I wanted to also touch on something you were saying before about all
your data dashboards and all that sort of thing, because it sounds like you not only
have to pay attention to how your end users are using your equipment and tools and software
stacks and all that. But at the same time,
you have this other layer of data that seems like it's changing very rapidly. So of course,
you want to make data-driven decisions. But if data is changing so rapidly, how do you sort of
accommodate for the rate of change? And in addition, I think that rate of changes also can
be overwhelming to people who are wanting to maybe make the transition into machine learning.
So, you know, given that a lot of our field has transitioned from classical stuff to ML stuff at a relatively rapid rate, maybe you can speak a little bit about the rate of change and how to grok it.
Yeah, I mean, I think for us, like we're used to a setup where everything's extremely stable, right?
You have the SPEC benchmarks.
They were changing at a 10,
15, 20 year cadence. We didn't really have to worry about all of that change.
The ISA was stable.
The ISA was stable. I mean, there were occasional people making a run for, hey,
let's do a new ISA, but it wasn't that frequent. It wasn't anything that we needed to optimize for.
And so, you know, in this new world, you basically, when you're trying to make a decision, right, you not only have to make a decision based on like, I'm going to do my best
with the information that I have, but I also need to recognize that all of this is going to change
tomorrow. And I need to know, like, have a mental model for how do I evaluate when it has changed
enough that I need to go revisit and modify some earlier decisions?
Because you can.
You can change your mind.
You can tweak certain things.
It just gets harder and harder if you're trying to make those changes in the hardware, right?
And so that's a tricky lesson because I think that we all need to recognize we are not yet
into stability when it comes to the ML field. I think we can
rise to the challenge, but we can't use the tricks that we used to use where we assumed,
you know, you can, I'm going to design for language translation and forget the fact that
a new breakthrough model is happening at the cadence of about every six months, that it just is completely different than the old language model. Significantly better,
but it's hitting the resources completely differently. So being able to be flexible
and pivot starts to become very, very important. Right. And for hardware architects in particular,
I think this is a useful lesson, which you also mentioned in your recent blog post that programmability is critical for the accelerators
that you're designing.
Yes, accelerators need to be performant.
Because of just the rapid change in the field and new models and new characteristics in
the models, programmability is probably really important.
Yeah.
In fact, in many cases, it's more important than raw performance.
And you'll see a bunch of examples out in the community where people have spoken with their feet.
So within Facebook, we used to have two ML frameworks.
We had Caffe2 and we had PyTorch.
PyTorch was much easier to use.
Caffe2 was much more performant.
You know which one won?
Not the performant one, right? At some point, having this fast iteration cycle, being able to easily use something, ends up trumping raw performance to, you know, a certain extent. And, you know, we were definitely within the guard bands of that. Okay, if I have to wait an hour, if I have to wait two hours, is that really that big a deal?
If I have to program it for 10 hours, right?
Versus one.
Sure, it matters if it's one week versus two weeks to train a model. But being able to launch the training when I get to work and have it be finished by the time I'm about to leave for the day, that's one threshold.
Like I don't care if it finishes at 2 a.m. versus 5 a.m.
I'm asleep, right?
So this weird notion of performance where it's like it matters
but doesn't matter kind of depends on whether you're falling off
someone's cliff in terms of what is their
working style like.
Right.
What do they need?
Right.
And if it takes you three days to get it working versus one, then actually that should factor
into the equation as well.
Yeah.
Because all of the innovation comes from just, it's trial and error.
If you actually watch how the data scientists and ML
engineers are working, like how are you tweaking the Facebook state-of-the-art ads model,
it's trial and error. It's let me try, let me twist this knob, twist this knob, try that. Is
that better or worse? Better or worse, right? And so this iteration cycle, it's a feedback loop.
And so the faster you can go through one iteration,
the better. But it's not as clean as, like, you know, 10% better means I'm going to get 10% more models run through, because sometimes this is happening and finishing in the middle of the night, right? So it doesn't matter.
So then, given that the advice you're giving
is that you want to be able to hide the abstractions from the end users, and you at Facebook are paying a lot of attention to that user interface. But for pure hardware architects underneath, who are trying to provide a system that's robust and agile for all these potential future situations that we don't even know are going to happen yet, what advice would you give?
So architects need to team up with people working in the software stack, be thinking about tools from the very beginning, and really do collaborative projects together of like, hey, here's what we want to be able to do down under the covers. What implications is that going to have in the higher layers of the stack? And is that okay? And, you know, forming teams and coalitions where we can work together. If you work on an accelerator in isolation, it's just really not a good strategy, because you will have blinders to what implications that might have above the covers. And you don't want that, because that can be a
deal breaker at the end of the day.
Agreed, agreed. I think that's one of the things that we in the hardware industry should always bear in mind: what is above you and what is below you in all the layers of the
stack. Given that you are sort of within our community a little bit the face of ML at Facebook,
I want to ask too about, there's a little bit of an
equivalence now between ML and Facebook, but that's clearly not the case. There's more going
on there. There's larger, greater technology trends at play, you know, the end of Moore's
law, Dennard scaling, all that sort of stuff. So what kind of challenges do you guys have
at Facebook with respect to technology that sort of abstracts away ML and is not necessarily ML
oriented? So I wouldn't say that it abstracts it away or that it's completely separate, but it is something that isn't traditional ML and that is an extremely important problem. And that's the
problem of just dealing with data. So one of the things is that, you know, all of ML is driven by oodles and oodles of data, but in that whole space there's a ton of engineering challenges and interesting research explorations we can do. So a lot of the time, it really just comes down to shuttling data around the globe, going through and taking unstructured data and figuring out how much of it is relevant, how much of it can I safely ignore.
A lot of times you're dealing with just tons and tons of zeros that you're shipping around the globe. You know, so for instance, with a typical ranking and recommendation
challenge, you might have, you know, perhaps 50,000 inputs going into your model, right? And
so this is like 50,000 different bits of information that drive a single decision.
And these decisions are being made trillions of times per day.
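To get a rough feel for the scale those two numbers imply, here is a hypothetical back-of-envelope in Python; the bytes-per-feature figure and the resulting total are illustrative assumptions for the sketch, not numbers from Facebook.

# Illustrative scale check, combining the two figures from the conversation
# (roughly 50,000 inputs per decision, decisions made about a trillion times a day)
# with an assumed 4 bytes per input feature.
inputs_per_decision = 50_000            # from the conversation
decisions_per_day = 1_000_000_000_000   # "trillions of times per day" (lower bound)
bytes_per_input = 4                     # assumption: one 32-bit value per feature

total_petabytes = inputs_per_decision * decisions_per_day * bytes_per_input / 1e15
print(f"~{total_petabytes:,.0f} PB/day if every feature were fully materialized")
# Roughly 200 PB/day under these assumptions, which is one way to see why not
# shipping (and not multiplying) all those zeros matters so much.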
So there's a really, really interesting space there that I think not enough people are paying
attention to. We have to spend a lot of time and effort thinking about that internally.
That's why we have so many data and storage divisions thinking long and hard about this. You know, it's part networking problem, part storage problem. You know, as computer architects, we're going to immediately go to disks and, you know, oh, let's use Flash, right? But, you know, all of that also has to be shuttled over the network and around the world, between servers in a rack and between racks. And there's
just a ton of data movement. And there, I think it's just, it's an interesting engineering
challenge that ends up being a space that we have to spend a lot of time and resources in.
It's not traditional ML though.
Right.
So are you saying computer architects
should move en masse and dogpile in the networking world?
Somebody needs to.
I'm not really seeing...
So the networking folks that I chat with,
they seem to be continuing to do their normal mode of execution on how they're thinking
about networking and not thinking deeply about like, what are the ML implications on that?
So we've run into a bunch of interesting situations where, you know, we designed our
entire server infrastructure with certain assumptions in mind. Like we might assume
that you've just
got a bunch of servers that are serving web traffic. These are independent work streams.
They're not communicating with each other. So you don't have like a ton of intra-rack
communication normally. And we optimized for that. And what happens with the ML spaces is that got
put on its head because now suddenly you were doing distributed training and you had lots of communication happening between the servers, between the racks.
And this was just a networking pattern that people weren't used to.
People were used to the whole system having been optimized for servers talking to the top-of-rack switch, and, you know, it goes up from there. And data is sort of moving in one direction. So this sort of sideways traffic is just something that I think has caught the networking engineers by surprise, to where in some cases, I'm not even sure it's fully landed on them that something is different. Yeah.
And that now
they have bottlenecks that they didn't realize were there. So do you think some of those problems
could be solved by sort of adopting the way HPC does networking? Because they definitely have a
lot of inter-node communication. Yeah. So that's another space where you need to get all the right
people talking the same language and really understanding that you can't go about ML the way you went about
things in the past. There are fundamental differences. And so that needs to fully,
fully land on the HPC community because there are differences. It needs to land on, you know,
it's landing on us in some form. It needs to land on the networking community. It needs to land on the database
community. We need to be willing to rethink some of the assumptions and expectations that we had
for years and years and years of experience because this is a different world. It's a
different set of challenges.
And we need to be open to that
because that opens up the doors
for many, many opportunities.
So speaking of getting a lot of different people
from different backgrounds, different expertise,
like computer architects, networking folks,
storage folks, and so on,
because all of these have to get assembled
to build a performance system
that works for the end users.
And as someone who leads a team who manages a lot of different people, tell us a little
bit about how you think about assembling people with diverse backgrounds, different set of
expertise.
How do you sort of get them communicating in the same language?
And importantly, I guess, how do you have sustained interactions where these people
see the value of collaborating with each other and, you know, coming up with ideas and talking
consistently so
that we improve the systems as a whole?
Yeah.
So you essentially answered the question in some way, which is that we need to get everybody
speaking the same language, which also sort of means taking them out of their norm and
their expected way of doing things and saying, okay, we're going to be solving
a different challenge now. We all need to kind of get together, have some common language. But I
think the, you know, one trick that we tend to use is that we focus on let's solve one core problem
together, one concrete named problem together. This is a good forcing function for us to very,
very swiftly get up to speed on what is the common terminology that we need to think about?
Oh, what are the curveballs that the other types of engineers are going to throw into the equation?
So that's kind of how we tend to go about it as we solve a concrete engineering challenge together.
Okay.
That's driven by, you know, there's a concrete need.
And that's much, much easier than just dancing around with no real deadline or, you know, no forcing function of, we have to get all on the same page or else. That's the tactic that we tend to take.
Well, that's a great way to think about it.
Have like a concrete problem, like a concrete deliverable
that gets everyone on the same page.
And I'm sure there'll be multiple such instances, so over the long run, you have a system that, you know...
Yeah, and, you know, we all usually have some sort of concrete deliverable with a deadline.
But also just like taking notes along the way of like, if I had, you know, wearing my research hat, if I had had, you know, 10 more hours, I would have explored this different avenue and figured out was this the optimal choice.
But sometimes when you're kind of under a deadline, you don't have time for that to truly explore the entire space.
And so I just try to keep note of those.
Like, OK, later, let's go back; there's a whole bunch of different research questions that we can answer.
We can do that later after we've solved the engineering challenge.
Because if we find, hey, we didn't have enough time to truly explore the space, we did our best and we picked one, we can change that later, right? So that's what's really, really cool about that, you know, research-to-production kind of feedback loop, right? You never have to call it done. You can change it tomorrow.
Right, and in general academics like to sort of step back and take a look at, you know,
what are the things that we missed and how can we sort of think about this more systematically?
I'm guessing like this process of sort of taking down notes is also helpful in sort of seeding collaborations with academics or if they're visiting and so on.
Exactly, exactly. In fact, you know, now that we've spun up this SysML research group, we find various engineering teams that are staffed with PhDs who, you know, are like, we had to solve this problem. There was an interesting research experiment, but we don't have time to do it. You guys want to take a look?
And so we'll get like this influx of really awesome research ideas that had a very, very practical foundation. And, you know, it's like
free ideas coming in from around the company. That's awesome. So one of the things that you
were saying before is that you have this kind of, okay, let's solve the engineering problem,
and then we'll go back and re-explore some of these things later. That seems like something that's very viable for things that can have rapid turnaround time like software stacks or a user
interface and that sort of thing. But in the computer architecture field where we all sort of
were born and raised, a lot of these types of things, you know, particularly if you're thinking
about ASICs, very long lead times, where, you know, "we'll do it next time around" basically means "we'll do it in years," which may be too late. So how do you kind of balance that? I don't know to what extent Facebook is doing custom silicon, if you look at some of the job ads they have out, maybe it's some, but to what extent do you sort of balance "we'll do it later" versus lead times in terms of being able to turn around results?
Yeah, I mean, I think the concrete examples that I had in mind were a bit more granular than, you know, an entire chip tape out, for instance, right?
To where the next iteration isn't necessarily years down the line, but it's, you know, more on the scale of, you know, a month or so down the
line. You know, there's a bunch of fundamental decisions that you're making daily and weekly on
any long-term project, right? It's not really just a singular outcome. So there's a bunch of
different design decisions where, like, it could be as simple as, what's the right ratio of, you know, A to B? And what's a quick and dirty way to answer that question?
What's the thoughtful, very, very detailed way that we don't have time to do right now?
But that we can potentially pivot, right? Because even in, say, a chip tape out, right, you can make changes.
It just gets more and more painful the further along you are in the process. So you don't actually ever really have to wait until the next rev. You can choose to, and, you know, that's a good exercise to do as well: how much of this is forgiving, and how much of this would literally have to wait for a complete redesign before we can actually start to think more deeply about it?
But really, it's not so much about concrete decisions; I look for opportunities to really improve the intuition and understanding of a space, right? So a simple example is,
there was a point in time where we were trying
out, you know, would this particular type of model run better on a CPU or a GPU? And the way we would
go about that is we would implement it on CPU and we would implement it on GPU and we'd compare,
right? It would be way easier if we just had some insights, like, based on, you know, what I understand about this workload and what I don't understand about this workload, here's why this might be better than the other mechanism.
So there is a concrete decision that we needed to make there, which is, where do we want to run it? But I also saw that as a research opportunity, to help anybody going forward with a brand new workload that shows up on their plate be able to eyeball it and say, you know, because of the sparsity and because of the computational requirements, this is probably much better suited for one or the other, without having to actually implement it.
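That kind of eyeball judgment could be captured as a simple rule of thumb over a few workload properties. The Python sketch below is hypothetical: the thresholds, field names, and the whole interface are invented for illustration, not any real Facebook or FBLearner tooling.

# Hypothetical sketch of an "eyeball it" placement heuristic in the spirit of the
# discussion above. All thresholds and field names are invented for illustration.
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    arithmetic_intensity: float  # rough FLOPs per byte of data touched
    nonzero_fraction: float      # how dense the inputs are (1.0 = fully dense)
    batchable: bool              # can work be batched into large dense matrices?

def suggest_device(w: WorkloadProfile) -> str:
    # Dense, compute-heavy, batchable work tends to favor GPUs; very sparse or
    # memory-bound work often does fine (or better) on CPUs.
    if w.nonzero_fraction < 0.1:
        return "CPU: inputs are mostly zeros, so GPU compute would mostly sit idle"
    if w.arithmetic_intensity > 50 and w.batchable:
        return "GPU: dense, compute-bound, and batchable"
    return "CPU: likely memory- or latency-bound; measure before committing"

print(suggest_device(WorkloadProfile(arithmetic_intensity=120,
                                     nonzero_fraction=0.9,
                                     batchable=True)))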
Right, so then for architects, there's a layer maybe at the system level
where you can have this kind of like higher level,
more rapid iteration time to be able to decide things.
But, of course, those who are thinking about how to build a better register file with read-write port ratios, or branch predictors or something, they're just going to have to potentially wait, unless they can simulate, I suppose.
I guess so, perhaps. I mean, I imagine their process is a bit more forgiving as well, but I don't tend to operate down at that level, so, yeah, I can't really speak intelligently there.
Yeah, I only say that just because, you know, for computer architects,
we want to make some aspect of it accessible and not necessarily about UX.
Right.
I mean, this is actually a great point, in terms of the tool sets and the way we analyze these kinds of systems. Normally we're used to building a simulator, having workloads, doing very detailed cycle-level or cycle-accurate simulations and things like that.
And it looks like the tool sets that we need
for designing systems for ML are,
they're at different levels of granularity.
Yes, you could build cycle-accurate simulators
if you're looking at very specific kernels,
like solutions and things like that.
But in terms of designing the end system,
it looks like you need a different set of tools.
It need not be a simulator, but the kinds of tools that you need are very different.
Cycle-accurate, when you're talking about something that runs for weeks, makes no sense.
Right?
I remember there was an eye-opening moment for me when I was working at Google, and I ran into Luiz Barroso. And I said, you know, hey, do we have simulators here at Google that we're using? And he sort of chuckled and he was like, why would we do that? He said, I don't care about cycles. I care about seconds. Our network latencies, or, you know, where our bottlenecks are, are at so much of a grander scale than a cycle-accurate simulator.
It's complete overkill, right?
And to where we would only be able to understand
such a tiny snippet of the end-to-end view
that it's of limited utility.
Right.
So this is making me think actually,
what I was trying to poorly crystallize before,
is it seems like the natural human cadence
for wanting to be able to iterate on answers
is something like on the order of a day or a week, maybe.
So no matter where you're looking,
whether you're talking about a branch predictor,
a register file, or whether to run things on a CPU or a GPU, or how to run a network,
the key is that you've got to build a framework for doing evaluation where you can get a useful
answer in approximately the order of days. Right.
Regardless of where you're at. Okay, that's my law.
Right? That's a great law right there. Yeah, there's basically a ton of utility in being able to say, hey, this is going to take nine days to run versus seven days to run. Being able to predict that, there's utility there, right? And you don't need cycle accuracy to get there.
No, you need something that can give you an answer in less than the time that you're talking about, seven versus nine.
Yeah.
Yeah, because if it takes nine days for me to get an answer on does it take nine days to run, that was the most inefficient way to get there.
You've doubled it.
In general, like, what are your opinions on sparsity?
Because a lot of academics also
focus on sparsity. We see a lot of papers looking at sparse workloads and things like that.
The interesting thing about sparsity is it's a very, very overloaded term.
The facet of sparsity that has been most on my mind is really just the optimizations that we can do
due to the fact that a lot of the data we're dealing with is zeros.
Right. And so
multiplying zeros by zeros, this is very very easy.
Right. So being able to leverage that, being able to leverage the fact that
certain things aren't going to be that computationally intensive at the end of the day.
Because in theory they would be because you're doing a lot of multiplications,
but in practice it's zeros and zeros and more zeros.
And so that, number one, I think there is room for providing clarity in the research community,
in the academic community, on
exactly what people mean when they say sparsity. Yann LeCun is going to mean something different than Kim Hazelwood is going to mean, than, you know, someone else who is using that term, because it is very, very overloaded. It's almost like we need a noun following "sparse": sparse what, right? Sparse data? What exactly is sparse? So there, I think the higher bit is we need some clarity on terminology,
but from a computer architecture lens, I'd say we have a pretty big opportunity to leverage the fact
that a lot of the data that we're dealing with is zeros.
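As a tiny, concrete illustration of that point, a multiply that skips the zeros only does work proportional to the non-zero entries. This is a minimal Python sketch of the idea, not any particular production kernel.

# Minimal sketch: a dot product that only touches the non-zero entries of a
# sparse vector. With inputs that are overwhelmingly zeros, the work done is
# proportional to the number of non-zeros, not the nominal vector length.
def sparse_dot(indices, values, dense_vector):
    # indices/values hold only the non-zero entries of the sparse vector
    return sum(v * dense_vector[i] for i, v in zip(indices, values))

# A "50,000-input" vector with only three non-zeros: 3 multiplies instead of 50,000.
dense = [0.5] * 50_000
print(sparse_dot([7, 1_023, 42_001], [1.0, 2.0, -1.0], dense))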
So one of the stats that you were saying earlier that's really striking is this notion of this 8x
increase every 18 months of computational complexity that's required to do all the training that's being done by the
more engineers and the more models that they're running. So at a certain point, this has sort of
got to level off. So in your mind, where's the end game? What's the end point here? You know, how should we be accommodating this kind of
computation? What should we be doing to try and maybe force it to level off so that we don't
require the energy of the sun to do some of this stuff? Exactly. So yeah, I'm not naive enough to
think that these trends will continue. And also, if I pick a different 18-month window, the numbers may be pretty volatile.
So I think that we're still in that growth phase.
I'm not naive enough to think that we'll continue to grow at that rate.
But what I hope we'll start to do, number one, is less egregious use of resources.
Just because people are using 8x the resources doesn't necessarily mean that we're getting the ROI on all of that.
So I think one thing we can do as a community is rein that in, or come up with the right mental model for how to think about whether this is actually a useful use of limited resources and computation.
So I think that there will be some natural like tapering of the growth that
will happen. I think people will start to get more responsible about how they're
using things and then at some point,
you just, there's only so many data centers we can build. And then you just start enforcing, like, priority mechanisms that will, you know, artificially cap the amount of computation that you can do. But I think that, you know, right now it's still the Wild West, where everything's free, and why don't we just train models for fun and, you know, entertainment, we'll train models for learning. There are people training models without really thinking too much about the back-end implications of that. I'm starting to see some
really fascinating research spin up in that space. There was a tweet where one of the researchers at UMass Amherst mentioned that training a language
model, one of the state-of-the-art language models, produced the same amount of CO2 emissions as three car lifetimes.
I read that tweet and I stayed up all weekend.
I was like, this was so impactful on my state of the world
and the state of mind where I thought, oh no, like what are we doing?
We are like a cigarette company in the 60s.
Like we just, we're being so naive.
Like when or how do we think about being environmentally
and mentally responsible here?
Am I contributing to like global harm, right?
And so what's made me happy is starting to see some activities, both in the research community and within companies, to actually think more deeply about that, and quantify, here's how much this experiment cost, you know, the Earth. You know, we already tie things back to money, and we started tying it back to power, because people stopped caring about money, so then you tie it to power. But then people at some point stop caring about power. So like, oh, we'll just use that money to buy more data centers.
Problem solved.
But if you start to put it in terms of like, oh, this many human lives, this is your future generation.
Then suddenly this will be a bit of a wake up call where we can start to think more deeply about responsible AI.
And what's made me happy is that now that has spun up as an official research area, as an official
charter for various teams at lots of different companies, is this notion of responsible AI.
So this is a little bit of a trite connection, but as you were talking, I was reminded of how when you were at Google, if I recall, you led a performance team that
basically was making sure that the people who were using the clusters at Google were
not being too lackadaisical about it, saying like, oh, you took over 50 machines and you
asked for this many gigabytes of memory, but you only used a fraction of it.
So you need to go back and think about how you're using these resources and here's how you can improve them.
Do you think that that's that kind of approach would work in the ML world, too, where people are like, oh, well, you know, I remember as a grad student, I'd be like, I think I might have a bug, but I'm not sure.
I'm going to run these thousand jobs and I'll find out and then I'll fix it and then I'll run the thousand jobs again do you think
there's a space for monitoring that kind of lackadaisicalness?
Yeah, so that's actually something we've already gotten in the business of solving. There are some, you know, efforts within Facebook around demand control, right? So figuring out how much of this demand for compute resources is useful, and how do we rein in egregious use of resources.
And there are a bunch of different avenues, but
one of the most effective ones is really just visibility.
So when people try to go use resources and there's none available and it's
like, okay, your wait time is going to be like seven days. They want to know who's in front of
me in line right now, right? And then they start to self-police. They're like, what are you doing?
Why are you using all of the resources, right? And this all kind of came to us by accident. There
was one time where just people were complaining like, hey, my job's not getting scheduled.
It's not getting scheduled.
What's going on?
We looked into it.
And there was one user at Facebook who was using 75% of all Facebook experimental resources.
That one person was an intern.
So I reached out.
I pinged the intern,
I'm like, hey, what are you doing?
It appears you're using 75% of the resources.
And they were like, what?
I am?
So then I was like, that's totally on us.
If any particular user can abuse the system and not even realize it, that means we didn't give the right feedback mechanism of, like, are you sure? You know, this is the equivalent of, you know, this many millions of dollars and this many resources. So, you know, just raising awareness of, like,
how much are you consuming actually solves a lot of the problems.
A lot of people just don't even realize it.
Shine a light.
So there's that.
And then there's the people.
Then you get to the point where there's the people who realize it but don't care.
And what's helpful there is you basically tell, you know, inform the person behind them in line that the person in front of you is being wasteful.
They will self-police.
And so they'll kind of tap, tap, tap.
Like, why are you at the register for so long?
What's going on here?
Right?
And then from there,
then you can provide feedback mechanisms of like once they've run a...
There's actually a few things we can catch.
We can catch, are you sure you need to train this?
This other user
trained the same model on the same data two weeks ago. Do you just want the answer? Because we already have the answer. So keeping track of what has already been done company-wide,
a lot of times you'll catch duplication. There were cases where people were launching two
identical jobs with identical data because they were worried about
the reliability. Oh dear. Where they were like, well, one of these will finish. So let me just
launch two copies because then I'll be sure to get my answer. You know, this is like super,
super wasteful and not fair. So these are the kinds of things that like once you shine a light
on it, it solves a lot of the problems.
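The duplicate-job case in particular lends itself to a very simple check: fingerprint what a submission would actually compute and look it up before scheduling. The Python sketch below is hypothetical; the field names and the in-memory store are invented for illustration and are not how FBLearner actually works.

# Hypothetical sketch: catch duplicate training submissions by fingerprinting the
# model configuration plus the dataset snapshot. Everything here is illustrative.
import hashlib
import json

completed_runs = {}  # fingerprint -> where the existing results live

def fingerprint(model_config: dict, dataset_version: str) -> str:
    payload = json.dumps({"config": model_config, "data": dataset_version},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def submit(model_config: dict, dataset_version: str) -> str:
    key = fingerprint(model_config, dataset_version)
    if key in completed_runs:
        return f"Identical job already trained; results at {completed_runs[key]}"
    completed_runs[key] = f"/results/{key[:8]}"
    return f"Scheduled new run; results will land at {completed_runs[key]}"

print(submit({"model": "ranker", "lr": 0.01}, "ads_2020_05_01"))
print(submit({"model": "ranker", "lr": 0.01}, "ads_2020_05_01"))  # caught as a duplicate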
And then it starts to get into the harder problem of like,
what does it mean to be fair?
You start giving people quotas.
You start saying, I'm not gonna submit your job
because you've used too many resources.
And so that's something like you very, very quickly
come to the situation where you have to start thinking
about that and solving that problem.
So every company is already thinking about how do you solve the problem of shared compute resources amongst
you know potentially thousands of researchers. Right. So I think maybe this is a good time to
kind of shift gears and talk a little bit about your career. So the one kind of tenuous thread
is that you know you've been talking about agility and reacting
to changing data and your career itself has been quite agile. You've donned a lot of hats. You've
played a lot of roles. So maybe you can talk a little bit about your career past and how you
think about when to make a change, because sometimes that can be very daunting to decide. Yeah. So I've always been the kind of person who, I'm not really a planner. I'm not the person who,
like when I was five years old, I was like, I'm going to get a PhD in computer science
and become a professor. Like if you'd asked me up until about like 10th grade, I would have said,
I'm going to be a dentist or something, right? So I tended to
kind of do a late binding on a lot of my decisions that I think was beneficial because it allowed me
to pivot and respond to opportunities as they arose, even if it would have been in conflict
with my well-laid plans. Throughout my life, at each point when I look back,
I'd be like, if I had told myself five years ago
or 10 years ago that I'd be here,
old self would be like, what, really?
Like, how did that even happen?
It's like, oh, a whole series of events happened.
So, you know, I think because of that,
I've sort of pivoted from academia into industry.
I've always had a bit of a hybrid role.
So like even when I was a faculty member, I was one day a week with Intel.
You know, before that, I was a postdoc, so full time at Intel.
So I've always kind of pivoted back and forth between there's a bit of a slider of like how much academic research do you want to be doing?
How much practical impact do you want to have?
At what speed do you want that to happen?
And I'm always sort of playing around and moving those dials and finding opportunities
to be able to do that.
And so because of that, I've kind of bounced a bit between like pure production roles, which was one of my first roles at Facebook,
to my current role purely in research. You know, so there, I think the other big lesson that I
learned was, you know, sometimes there can be a lot of emphasis on wanting what other people want. So everybody has, you know, like this all started
when I was in grad school. It was like everybody was like, hey, I really, really want to be a
professor. And I remember thinking like, I want to feel that strongly about it. I don't know if I do,
but everybody else seems to want it. So let me try and want it too. And, you know, so I went
through the motions. I was like, you know, I asked my advisor, I said, hey, should I do academia or
industry? And he's like, you're asking your professor. What do you think I'm going to say?
And he said, well, I'll tell you what, just give it two years, give it a shot.
If you hate it, you can leave. People do that all the time. And I thought, oh, that was actually a very freeing concept: I'm not actually making a decision for the rest of my life, right, like, is this where I want to be when I'm 80 years old? Because I'll never be able to make a decision if I think that way. But if I think of everything as a set of decisions that are so forgivable and so changeable, then I don't even really worry about it. Even my transition
within Facebook into research was like, if I go into research and I hate it, I'll just go back.
Like, it's not that big a deal. People do this all the time. So I think that, you know, my biggest
advice is, number one, be true to yourself on what you want. Everybody wants something different and
for different reasons. And everybody has a different setup. So, you know, asking somebody else what they want and thinking
that that's in any way going to apply to you and your own unique situations is just like a
recipe for being unhappy, right? So figure out what you want to do and own that. And, you know,
even in the face of people who were like, what, you left a tenured position?
Who does that?
I was like, me, that's who does that.
Because I could and I also,
like in the back of my mind,
thought if I decided it was a huge disaster,
I think I could go back, right?
So I'm just not too worried about it. I just think, you know, being true to yourself, being able to pivot, has worked out well for me. And just try not to overthink it.
Yeah, yeah. That's great advice. And actually, I think that kind of relates to
something you talked about at your ISCA keynote a couple years ago,
which, for those of us who don't know, is Hazelwood's Law. Do you want to explain that?
So Hazelwood's Law is really about figuring out where the opportunities exist.
So what ends up happening in research communities or in fields
is that there's this dogpile effect, like there'll be some idea that makes sense. And then everybody kind of dogpiles on. And you only want to dogpile on it as much as that problem is worth in the grand scheme of things.
And you want to make sure that you're not leaving giant gaps in between.
And so, you know, at some point maybe there were a ton
of people who were like, oh, let's all focus on quantization, right? Like, yes, quantization is
important, but not everybody needs to be working on quantization. We need some people working on
that and we need some people working on some of the other gaps, you know? So there were a bunch
of opportunities where I realized like nobody's really looking at like the network implications
of ML, right? Maybe I'm the wrong person to do it, I don't have a networking background, but I can definitely identify that that's a gap. And so I feel like recognizing those opportunities, coming up with your own path, like, I don't have to work on quantization because everybody else is. I should find what I'm passionate about and in particular where there's not enough people.
Where are the spaces that need me and need my skills in particular?
Right.
So don't become a professor because everybody else wants to become a professor.
And don't work on quantization because everybody else wants to work on quantization.
Yeah.
I mean, not to say anything bad about quantization. People should be working on quantization.
Absolutely. But just not everybody. Right. Right. And only do it if you genuinely have an affinity
for it, want to do it, like it, and feel like you can make a genuine contribution that's unique.
Exactly. Exactly. Right. So on that note, so what excites you about, you know, the path forward,
the future? Like, do you have a vision for, you know, how systems for ML will evolve?
And more broadly, like, you know, how architecture and, you know, our work will evolve as well?
Yeah, I mean, I think that throughout history, within the computer architecture community,
we have these various trends that come along.
And, you know, you have a choice.
You're like, you know, okay, here's the
energy efficient computing. Here's that trend. Do you want to jump on this? Do you want to not jump
on it? You want to let that one go by? And, you know, I've sort of jumped on some of them. I've
been selective. And the ones I jump on, I go all in. I'm like, okay, I'm going to, you know, I'm picking dynamic optimization until nobody was doing that anymore. And then I picked, you know, I kind of went through this: these things interest me, these things somebody else can work on. Right now we're in
a little bit of a peak in terms of like the hype curve. I think this will stabilize a little bit
and it will open up a ton of new opportunities where we realize, OK, once the dust has settled and once we've stabilized in this particular field, you know, where do we want to go from here? But in those peaks, that's where all the fun happens. So this is why I love where we are right now and kind of where we're heading in the short term.
I make no claims about long term, right? Long term, we'll get bored after some amount of time, and then we'll go move
on to something else. But for now, it is a lot of fun because we get to throw a bunch of things
up in the air and rethink them. And those opportunities don't come along that often.
And so that's why I really, really like what's happening now.
And I also have always been one who likes to straddle that hardware software divide.
And this is like one opportunity where I realized that space is so critical right now for this particular type of workload.
So those are the kinds of things that I'm super excited about, because people always surprise me with the ideas that they come up with during these peak hype periods. And we're in one. And so I just think, you know, don't be jaded about it. Be excited about this. This is a fantastic opportunity to try out all sorts of crazy new ideas.
Awesome.
Well, there you have it, folks.
We've been super, super excited
to have you with us today, Kim.
Thank you for joining us.
It's been an absolute delight.
It's been a really fun conversation.
Absolutely.
And to our listeners,
thank you for being with us
on the Computer Architecture Podcast.
Till next time,
it's goodbye from us.