Computer Architecture Podcast - Ep 22: Measuring Datacenter Efficiency and Visioning the Future of Computer Architecture with Dr. Babak Falsafi, EPFL
Episode Date: December 10, 2025
Dr. Babak Falsafi is a Professor at EPFL, the founding president of the Swiss Data Center Efficiency Association, and the founder of EcoCloud, an academic consortium focused on sustainable IT. His contributions to computer architecture include the invention of spatial and temporal memory streaming (SMS prefetchers) found in Arm cores and laying the groundwork for fence speculation by defining memory ordering requirements in modern CPUs. He is a key figure in cloud-native server design, with his work forming the foundation for the first-generation Cavium ARM server CPUs. He is a former chair of the SIGARCH Executive Committee, a recipient of the Alfred P. Sloan Research Fellowship, and a Fellow of both the ACM and IEEE.
Transcript
Hi, and welcome to the Computer Architecture podcast, a show that brings you closer to cutting-edge work in computer architecture and the remarkable people behind it.
We are your hosts. I'm Suvinay Subramanian.
And I'm Lisa Hsu.
On this episode, we welcomed as our guest Dr. Babak Falsafi, who is a professor at EPFL and the founding president of the Swiss Data Center Efficiency Association,
which promotes best practices in sustainable data center operation with quantifiable energy efficiency and emissions,
with online calculators and a label.
His contributions to computer architecture
includes spatial memory streaming
or SMS prefetchers,
which have been documented to be in Arm cores
since Cortex A53.
He has shown that consistency models
are neither necessary nor sufficient
to eliminate memory ordering related stalls.
These results laid the foundation
for fence speculation in modern cores.
For the past two decades,
he has been working on cloud-native server design.
He is a recipient of an Alfred P. Sloan Research Fellowship and a fellow of the ACM and the IEEE.
Babak joined us to discuss the art of understanding electricity efficiency in the data center by measuring where the energy flows
and how this measurement is hampered by a lack of standardization.
We also discussed broader topics like what the vision should be for our field in the coming years
after such dramatic developments in computing have happened since the last visioning workshop.
We want to particularly note that Babak was the chair of the SIGARCH EC when we started this podcast and was an early supporter of this project. And we want to thank him for his continued support. And with that, let's get to the interview.
A quick disclaimer that all views shared on this show are the opinions of individuals and do not reflect the views of the organizations they work for.
So Babak, welcome to the podcast.
We're really happy to have you here.
And people know our first question is always,
what is getting you up in the mornings these days?
Thank you for having me.
I get up in the morning,
I have my espresso with artisanal coffee,
and usually play duolingo.
I'm learning Italian,
and it's the fastest language so far I've learned.
Wow.
That's cool. So first, I guess you're in Europe, so having your espresso is like a necessary thing.
Okay, and then the Italian piece. So you said it's the fastest language you've ever learned. What other
languages have you tried, and what's with Italian? Like, what's striking about it?
So I speak Farsi. I can't say my Farsi is as good as it used to be; my native language
is probably English now. I speak Greek. My kids speak Greek, my wife speaks Greek.
I speak French.
That's the language of teaching here.
It's also the language here in Lausanne.
German, I spent a few years in Germany before coming to the U.S.
And Italian, because we're right next to the border, and it's a great place to hang out.
It turns out that if you speak Italian, it completely changes the world when you go to a restaurant and order food and wine.
I see, I see.
Okay, so you've gone full euro.
I mean, I remember going on a business trip.
when I was at Qualcomm, we had to go to Europe. And so our counterparts there mostly were
European from various countries. And we had a few Europeans on our side, the Qualcomm side,
plus like a few regular old Americans. And we just had, it was the dinner, you know,
post-meeting day dinner. And we were discussing how many languages do you speak? How many languages
do you speak? Every European spoke at least three languages. And most of the Americans, with
exception of myself, spoke one, maybe one and a half. And it was just absurd. Why can't we speak
more languages here in the States? How many languages do you speak, Suvinay? Probably five or six.
Oh my goodness. Are they mostly dialects of one language, or? Yeah, various Indian languages
because I grew up in multiple states in India. So I picked up four languages there and then
English. Amazing. I speak English, Mandarin, and
Spanish, all reasonably passable, I would say, and I'm trying to learn Portuguese, which is
actually hard because Portuguese is so similar to Spanish that there are some verbs that are the
same and some verbs that are completely different. You just have to remember which ones are which.
But yeah, well, cool. So that's getting you up in the mornings. And then what is happening after
that after you get to EPFL? Are you working in the office? Are you working from home?
I think after COVID, we're spending a little bit of time at home, but I do go in.
I go in usually three times a week.
I travel a lot, too.
So when I'm here, I go in a few times a week, and otherwise I travel.
I spend a lot of my time at industry conferences in the data center sector.
I go to academic conferences just the way I used to.
I went to MICRO, but I also spend a lot of time at industry conferences these days.
Gotcha, gotcha.
And so a lot of your focus these days is on these sort of data
center industry conferences. So tell us what's going on there. What are you focusing on?
Well, we're trying to bring a little bit of clarity about this electricity that comes in.
Where does it actually go? And we'd like to find ways in which we can help operators measure them
as precisely as possible, and then find the bottlenecks in the energy flows. DC energy
flows. This is the part where you have your cooling, your UPS system, everything that houses
and supports the servers. That's more of a well-established area in terms of metrics and
methodologies. And then IT, when you bring the 20 kilowatts into the rack, where does that
electricity go? That's up in the air right now. We're trying to find ways to measure and report
that. Got it. Yeah. I mean, I think energy in data centers is one of the centerpieces
today. Just in the US, I think we're expected to double the data center energy footprint
by 2030, to like 70 gigawatts or something in that order. I'm sure there are similar growth
trajectories in Europe and other parts of the world too. So one of the statements that you've said
is when you talk about data center energy and you're talking about measuring the energy flows,
we should not be guessing what is actually happening, but we should measure them very carefully
and understand what the metrics are and get clarity behind where are we spending our
energy, our dollars, and also carbon dioxide emissions and so on.
Can you double click on that?
Tell us a little bit about what are we guessing right now, where is there a lack of
clarity, and what can we do to make the state of the ecosystem better?
Well, first of all, I'd like to also follow up a little bit on the background that you gave
because there is a huge shift in energy consumption in data centers.
It has started growing partially because there's just sort of normal, exponential growth in the space, but also because of AI.
This is having an impact on the energy market.
So it's super important for everyone to find a proper way of measuring and provisioning for energy in the data center market.
Now, what are the challenges there?
There are metrics that have been around for a couple of decades now for DC energy flows. That's
power usage effectiveness, PUE. It basically accounts for how much of the electricity is going into
the infrastructure that houses the IT equipment. And there are, of course, changes that are coming,
because we also recycle heat from data centers, there are on-premise renewables. We need
to account for these as well, and that impacts the metrics used in DC energy flows. And then,
Well, the good news is that DC energy flows are getting better.
So now we're getting PUEs from some of our clients at SDEA,
the Swiss Datacenter Efficiency Association.
They're getting less than 1.2.
That means most of the electricity is going to IT.
And there, we don't have proper counters in the servers to figure out,
when you bring the electricity down, is it actually used properly?
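As a rough sketch of the PUE arithmetic being described, here is a minimal Python example with made-up numbers for a hypothetical facility; none of these figures come from the episode.

    # Hypothetical facility, illustrative numbers only.
    total_facility_power_kw = 1200.0   # everything entering the building
    it_power_kw = 1000.0               # power actually delivered to the racks

    pue = total_facility_power_kw / it_power_kw            # power usage effectiveness
    overhead = 1.0 - it_power_kw / total_facility_power_kw

    print(f"PUE = {pue:.2f}")                               # 1.20
    print(f"Cooling/UPS/other overhead = {overhead:.0%}")   # ~17%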
I see.
So your focus is... So when I was at Microsoft, we thought about data centers quite a bit, and as a whole.
So we did talk about things like the heating and cooling and that sort of infrastructure
as a sort of first class component to the data center architecture.
And I think what I'm hearing you saying is that specifically within this, there is the compute part.
And of course, you want as much of the energy as possible going to the thing that you built
the data center for the compute. And then within that, if we were to double click on that
presumably dominant source of the consumption, we don't have good details on what's inside. And that's
perfect for computer architects to really dive into. I see. And so then to go back to Suvinay's
question, like, what specifically are we really guessing on in there and what will we like
to get much more clarity about? Well, first of all, there are no standards today in industry for
IT. So just creating standards is already a good thing. Some of the
emerging standards are based on utilization. Utilization is super important. The reason is that there's idle power,
and if you can utilize your servers more, you use the electricity that's coming in a lot more
efficiently. And there are ways to measure that today, because the load and the server's power
consumption are correlated.
So you can measure the load, and based on that, then figure out to first order
how well you're using electricity.
So that's one thing.
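As a rough illustration of the load-power correlation being described, here is a minimal sketch using a simple linear server power model; the idle and peak wattages are assumed placeholders, not figures from the episode.

    # Load and server power consumption are correlated; a common first-order
    # model is linear between idle and peak power.
    P_IDLE_W = 200.0   # assumed idle draw of one server
    P_PEAK_W = 500.0   # assumed draw at 100% utilization

    def estimated_server_power(utilization: float) -> float:
        """Estimate wall power (watts) from utilization in [0, 1]."""
        return P_IDLE_W + (P_PEAK_W - P_IDLE_W) * utilization

    for u in (0.1, 0.5, 0.7):
        # At low utilization, most of the draw is idle power doing no useful work.
        print(f"utilization {u:.0%}: about {estimated_server_power(u):.0f} W")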
There are other sources of electricity loss in the rack itself.
In the server itself, electricity goes into the power supply units.
Power supply units have industry ratings already.
So if you're Titanium-rated, you're only losing about 3 to 5%.
So that's a little loss.
You can measure that again based on the load.
Then you have the fan power.
Fan power is a problem because, well, first of all,
a lot of operators want to operate at higher temperatures
so the fan power is higher.
So it's not to be ignored.
That could also account for 10, 15% in some extreme cases,
20% electricity loss.
And then with liquid cooling, liquid cooling is slowly coming in.
today with our PUE metrics,
fan power is accounted for
as IT power. With liquid cooling, there's no fan power.
So that makes PUE not an apples-to-apples metric
when you're hosting both liquid-cooled and air-cooled servers.
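To make that accounting concrete, here is a small sketch of the rack-level breakdown; the 20 kW rack is from the conversation, while the PSU and fan fractions simply use the illustrative ranges quoted above, not measured data.

    RACK_POWER_KW = 20.0   # electricity brought into the rack
    PSU_LOSS = 0.04        # roughly 3-5% loss for a Titanium-class supply
    FAN_FRACTION = 0.10    # roughly 10-15% (20% in extreme cases) for air cooling

    psu_loss_kw = RACK_POWER_KW * PSU_LOSS
    fan_kw = RACK_POWER_KW * FAN_FRACTION
    silicon_kw = RACK_POWER_KW - psu_loss_kw - fan_kw

    print(f"PSU loss {psu_loss_kw:.1f} kW, fans {fan_kw:.1f} kW, "
          f"compute/memory/network {silicon_kw:.1f} kW")

    # Note: with today's PUE accounting, the fan power above still counts as
    # "IT" power, which is why an air-cooled rack and a liquid-cooled rack
    # are not an apples-to-apples comparison on PUE alone.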
So that's the first. And then beyond that, we can double click into,
let's look at when you're actually using the server and a
program is running. How efficiently are we computing? I see. Okay. So then to get a more precise
picture, and just to be clear for all of our listeners, too, there's one way to look at it. If you look
at an entire data center, all the electricity going into the data center is for the data center.
So you could say in some, like at a very coarse lens, 100% you're using it. Then you double
click in and say, like, well, actually, let's look inside the data center. Some percentage of it is
going to cooling, some percentage is going into whatever. You need to have lights for the
bathrooms, for the technicians or whatever. So now you have some amount of cost. So that's not directly
going into compute. So now you're going into the racks. And you're saying in the racks,
you could say all electricity going into a rack is being used for compute, except now you actually
have to double click and say, like, no, we're using some for fans, we're losing
some just in the transformers. And so now we have to actually kind of figure out how much
electricity is going to compute. And then once you go beyond this level of peripherals,
what I'm really curious about is how you might think about electricity with respect to
like compute inside of what a computer architecture student might think of with respect to
like Patterson and Hennessy book or something like that. Like is it that you needed to be going
through an ALU to count or like what if it's just being used in a driver to push data along a long
wire or read something from memory or DRAM refresh.
Like, how are you thinking about how much electricity counts, I suppose, as like actually
being used for what you wanted to do?
So this is a very good question.
And to the first order again, as you double click, you go in, it's important to look at how
much electricity is going to compute, how much electricity is going into your storage and
memory and how much electricity is going into your network.
And this with AI actually gets a lot more interesting, because interconnects in
data centers didn't use to account for a big chunk of electricity consumption.
But now with AI, interconnects are a lot more interesting; a lot of electricity is going
into the data movement.
So that would be the next level of accounting, which is how much are you computing,
how much are you spending in your memory and storage and how much of it is going into movement?
For cloud racks, not for AI training racks, there are already back-of-the-envelope
sorts of breakdowns of how much of your electricity is going to the CPU versus memory
versus the network. We can't measure that today, but we can already start with a set of
breakdowns that are aggregate averages over what operators report.
That's the state of the art today, basically.
That's where we are.
So you look at utilization in your compute, your CPUs, your GPUs.
In the first couple of generations of data centers, the first 10 years,
I think, since the famous Google paper around 2000,
utilization used to be low because of SLOs.
But then in the next decade, we started consolidating using software technologies, because building was a lot more expensive than trying to consolidate.
And so we developed software consolidation technologies to allow utilization to go up from 15 to 20% to 60%, maybe in some cases 70%.
There are no real numbers available for this.
So we have these numbers from the famous data center book by Barroso and Hölzle.
And so that's where we are with hyperscalers today.
There is another market for data centers, which is outside of the U.S. and China,
and that's a co-locator market.
The co-locator market is real estate companies building data center campuses,
and they rent space to IT customers.
The IT customers at co-locators have really low utilization.
Their utilizations are still 10%, or even below 10%.
So I also spend time, I'm going to IT conferences and talking to IT operators about
utilization.
Gotcha.
Interesting.
And so what would you say to someone who, you basically just sort of classified into
co-locators, for people who are renting, and hyperscalers,
like two totally different ways of using a lot
of racks. And so if you were, say, a hyperscaler, like, what is the one sort of way of
designing a data center that would be, like, very bad for this sort of electricity utilization
that might be sort of hidden right now because we don't really measure anything? Like, is there
something that jumps out as like, this is actually a no-go? I think for hypers, it's sort of
a somewhat solved problem, because whatever number of servers they have to buy
and the electricity they're spending, that's already factored into the economic model.
So they try to operate so that, for a given service level objective,
they can deliver basically maximum throughput. They want to reduce the number of servers,
they want to minimize the investment in infrastructure,
and they also want to reduce the amount of electricity that's used.
Electricity today in the cloud world is still a small fraction, for the CPUs and your regular
typical servers, a small fraction of the overall budget.
But in the case of AI, it's a bigger thing.
But due to the economic model, they try to reduce
that.
But if you look at your typical IT customers, there are co-locators in the U.S.
as well.
There are multinational co-locators like Digital Realty, Equinix, Stack Infrastructure.
Their IT customers, they're just basically,
if somebody writes a check and says, look, go buy servers and bring this thing up,
they're not looking at, can I do this with a third of the servers?
I would save money on the servers.
I would save money on the electricity that goes into them.
I would also save embodied emissions on the servers themselves.
Got it.
Yeah, I want to expand on this distinction between hyperscalers
and co-located clouds or neoclouds and so on.
And you specifically touched upon for standard data centers,
like CPUs and classic data center workloads.
We have about a few decades of understanding
and we can sort of annotate these details.
We have napkin math and we can work towards a system
where you get more clarity on the metrics.
If you switch gears to AI,
we are relatively in the early innings
of build-out of infrastructure for these services.
And, of course, hypers have been,
at the forefront of this in the last few years,
but increasingly over the last year or so,
we've been seeing a lot of neoclouds emerge as well, right?
So we have CoreWeave, you have a bunch of others.
So given your experience in standard data center, server racks, and so on,
what would be our best-case guidance as we go out building out the AI infrastructure?
Like what kind of metrics should people look at if you could sort of advise
like the neoclouds or the co-located clouds that are building out the AI infrastructure
what sort of visibility would you like to bring in?
And you touched upon utilization, right?
Like, yes, in the CPU world, we started off at very low utilization,
and then gradually as your software packing schemes got better,
you could co-locate workloads, variety of policies,
the utilization gradually went up.
But LLMs, we certainly know that,
especially for serving inference,
depending on the breakdown of pre-fill decode,
we can have very, very low utilization that could make your PUE numbers look good,
but maybe the utilization is still pretty bad.
So how would you go about, given your experience and learnings in the CPU side of the world,
as we move into the AI infrastructure side, where you don't just have CPUs, you have interconnects,
you have accelerators and more, what kind of visibility, clarity metrics would you like to see,
especially given that we're spending huge amounts of money, huge amounts of power into this new space?
That's a very good question.
First of all, in the cloud, not the GPUs but the CPUs, even though hyperscalers
run their first-party workloads at high utilization,
whatever containers they rent, if they rent them out to customers, those customers are not using
all the resources all the time. Of course, the cloud operator would like to
increase that utilization as much as possible, because it would factor into the economic model
if they can do more with less resources. But today we don't have the technologies to allow, for
example, if you rent out a 4-gigabyte container, and the customer is not using all the 4
gigabytes, how do you give them the illusion that they're using the 4 gigabytes but not actually
allocate 4 gigabytes? This is a great opportunity for computer architecture students to look into
technologies that allow you, with hardware-software co-design and emerging fabrics, to make that
available. When you look at AI, GPUs have the same problem. So there's a paper from
Microsoft Asia, at a conference last year,
2024, where they look at deployed AI models, and they say that the GPUs
have less than 50% utilization. And there are a lot
of reasons why this is the case. You have the
Python stack. There are many places where
there are bottlenecks, for example, CPU-GPU coordination,
movement of data. They do a great job of breaking this down.
So, yes, we see this also with GPUs.
With GPUs, of course,
GPUs are more expensive.
They also use a lot more power,
so utilization is more of an issue.
But finding ways to properly quantify utilization,
and then finally,
solutions that would increase utilization is a really good way to move forward.
Yeah.
Plus one on that, I think there was a recent paper from Alibaba as well at SOSP this year,
where they talked about, you know,
their pooling system to improve GPU
utilizations, and they had some pretty eye-popping
numbers in their paper. But, yeah, this is
certainly an avenue that's under-explored in the
AI space. So I can also, sort of, this is going towards
research now, but we have a project here at
EPFL. We're starting on redefining
the future of a cloud rack,
and there are emerging fabrics.
So this is a great opportunity for
computer architecture students, perhaps working
with other people, like OS people,
maybe PL people, to redefine
the future of a rack.
And in a future rack
with these emerging fabrics,
NVLink from Nvidia,
NeuronLink,
CXL to some extent,
we have the raw hardware capabilities.
You have these 42 units inside a rack.
Right now,
these 42 units are
almost exact copies of each other.
The silicon is completely fragmented
because you have workloads,
and the workloads have completely diverse
requirements.
So instead of
creating this copy of the same server multiple times in the rack, you could disaggregate
the rack and connect it with a high-end fabric, one of these emerging fabrics.
And this is, I call it Rack Duo, this is hardware-software co-design and a great research opportunity
for computer architecture students.
The disaggregation increases utilization.
So it's an obvious way of addressing the problem.
Yes, this is making me think of something that you had said earlier, and they're kind of like
tangentially related. So earlier you were saying something about bottlenecks on
electricity utilization. And then for computer architecture students and people, we tend to think
of bottlenecks in terms of, okay, we have this data that we're trying to, or this computation
that we're trying to get through, and there's something that's holding up the whole process.
And in this case, if you had really, really effective fabric and you could disaggregate all
resources, then you could compose and decompose and all that stuff. And you could stop this
computational bottleneck, which conceivably could help with electrical utilization. Is this what you
meant before by electricity bottlenecks? Or is there something else? Because when you first
said it back then, I was like, in my mind, I pictured like electricity going down a wire and somehow
getting jammed up. And I was like, I don't understand what that means. So maybe you could talk about
that. So in the DC energy flow, and these are people, these are building operators. These are
like electrical and mechanical people, right? They're the ones who basically measure, they run and
measure their entire DC energy flow. DC energy flow is from the time electricity comes into
the building until it ends up in the racks. These guys are in charge of that. Once it goes into
the rack, it's not theirs, right? They worry about temperature, and that's about it, right? But all
of this electricity is coming in.
And so their job is basically to, I can give you an example of a hyperscaler.
Hyperscalers try to minimize the electricity that's lost as it comes in,
to basically get a PUE value which is as close as one can get to one, right?
So PUEs used to be at around two; then, some years ago, hyperscalers started announcing:
our PUEs are 1.05, where really we lose very little
and the electricity gets delivered to the rack itself.
That's what I mean by the energy bottleneck.
Like, taking that electricity coming in from the outside of the campus
until it's delivered to the rack, you want to maximize
the electricity that you deliver to the rack from what's coming in and minimize
the loss. Those are the energy bottlenecks for the DC operators.
Once it goes into the IT, that's IT people, right?
The IT people, they have to basically figure out.
Now, the hyperscalers, first-party workloads,
and even third-party workloads,
they control the temperature.
They control everything, right?
With co-locators, it's up to the contract.
The IT customer says, look, we want to keep the temperature
between 22 and 26 degrees from the bottom to the top of the rack.
Others have already shown you can operate at higher temperatures,
but some IT customers have this sort of industry-standard mentality that we need to keep the servers cool.
We don't want them to fail.
Many are running at much higher temperatures.
So that's the energy loss I'm talking about: figuring out how you minimize it and find the bottleneck in your energy flow,
so you maximize the electricity you're delivering to the racks.
I see.
Okay.
Yeah.
So mentally, the word bottleneck, it's like not exactly
the same from, you know, computer architecture benchmarking to this.
It's almost like redirecting it and shunting it off to be used for something else.
Yeah, I pictured it as a slowdown.
And I was like, I don't know how you slow down electricity, but now it makes more sense.
It's a redirect to do it for something else to keep the programs running, but it ultimately is being diverted for some other purpose besides compute.
It's a loss. It's not a bottleneck.
Gotcha.
It's how much electricity you drop
on the way to the rack.
Yeah, makes sense.
Makes sense.
Okay.
Okay.
That helps because, yeah,
initially I was like,
huh, a bottleneck?
What's an energy bottleneck?
Well, we had customers who were running at a PUE of 1.6,
and we have an online calculator for DC energy flows.
They can go and enter their numbers,
whatever they measure.
And with the calculator, it becomes obvious where they're losing electricity.
And they use the online calculator to actually eliminate some of those losses,
and they can achieve much better PUEs.
Got it.
And is this calculator part of your Swiss Datacenter Efficiency
Association tool, basically?
So, yeah, the Swiss Datacenter Efficiency Association is
a non-profit; it's partially subsidized by the Federal Office of Energy in Switzerland.
We have three online calculators.
One of them is for DC energy flows.
There are already six pre-configured energy flows,
and customers can also have their own custom energy flows,
and they can submit them and we can then help them calculate their efficiency.
And there is an IT calculator.
The IT calculator is mostly a set of KPIs that we've come up with,
which we believe are the first step to quantify the efficiency of
the IT stack.
You're not measuring electricity here,
you're measuring utilization,
you're reporting your energy loss in the PSUs through industry standards,
you're also reporting how well you've consolidated your workload.
So we're mostly looking at a set of KPIs we've come up with,
but it generates an index on the same scale as PUE.
So the IT customer can basically report their own energy efficiency for IT,
or they can combine their energy efficiency with the DC operator's
and create one geometric mean index
on the same scale.
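A minimal sketch of combining a DC-side index and an IT-side index into one geometric-mean index on the same scale, as described; the exact KPI definitions are SDEA's, so the function and numbers here are purely illustrative.

    from math import sqrt

    def combined_index(dc_index: float, it_index: float) -> float:
        """Geometric mean of a DC-side and an IT-side efficiency index,
        both on a PUE-like scale where 1.0 is ideal."""
        return sqrt(dc_index * it_index)

    print(round(combined_index(1.2, 1.5), 2))   # 1.34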
And then we have an emissions calculator,
which basically looks at the source of emissions,
the source of electricity,
how much of it is based on coal,
how much of it is based on hydro.
And based on that, again,
we report an index similar to CUE,
the carbon usage effectiveness.
And depending on how much CO2 emissions you have per kilowatt-hour,
you can sort of identify your pollution problem.
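A hedged sketch of the carbon usage effectiveness (CUE) idea behind the emissions calculator: grid carbon intensity scaled by how much facility energy is consumed per unit of IT energy. The intensity figure is a placeholder, not Swiss grid data.

    GRID_INTENSITY_KG_PER_KWH = 0.10   # assumed CO2 intensity of the electricity mix
    pue = 1.2                          # facility energy per unit of IT energy

    cue = GRID_INTENSITY_KG_PER_KWH * pue
    print(f"CUE = {cue:.2f} kg CO2 per kWh of IT energy")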
Yeah, this sounds like a great resource. I think it'll add a lot, just delineating
that these are the factors that go into various efficiency metrics and losses. It'll be super
helpful to people who are not familiar with these things. And as you said, coming up with a set
of KPIs that they can track, these are the higher-order bits that you need to sort of pay attention
to. I think it would be incredibly valuable. Yeah. These KPIs are also KPIs that you could
already instrument in software.
So we have a partner company in Germany that instruments servers, and they create a dashboard
for our KPIs.
So these are not things that are difficult to measure.
You can measure them with open source tools or you can use products that are available
in the market.
Wonderful.
And so you get like a layered approach.
Like you get the PUE at the data center level.
You get the IT energy efficiency metrics as well.
And then you also have the carbon dioxide emissions.
So you get all three of them visible to you,
and these are all available by instrumenting the software there.
I wonder, this is a little bit out of left field,
but when I was at Microsoft, we thought about some of these sorts of things.
And one thing, at every layer, you have all sorts of margins baked in to the design,
because every layer is responsible for its own layer,
and you want to be careful.
And so then when you actually add up all the margins across,
you have like a whole lot of extra margin for everything with respect to the design that could
potentially, if you did some sort of cross-layer optimization, might be lower. And one thing that
jumps out to me is redundancy, right? So with respect to these sorts of data centers,
nobody ever has only a single data center, all their stuff in a single data center. It's just
a different scale than before. I wonder if there's any thought to sort of maybe almost like
RAID-ifying data centers,
where back in the day, you had one drive and you were worried about losing it, so then you had more.
But then that didn't seem efficient, so you do something like RAID.
Now, I think a lot of times we have three data centers, maybe more, with a lot of redundancy.
And that is a lot of margin, like right off the top.
Are there thoughts on improving at that level, or are we diving straight into the computer architecture, mesh fabrics, like,
composability-type things?
It depends on who you're talking to,
but you're absolutely right.
The IT operators today at co-locators,
the customers that are co-locating,
they do care about
the reliability of a service,
the way RAID did, but they don't think about the cost.
RAID was a combination of reliability and cost.
And IT customers,
absolutely, if they
thought about how do I provide the same service
with same level of reliability with a lot less resources,
then you can get away with fewer servers,
you can get away with also less electricity.
So that's a good point.
I think we can all help at various levels of the stack.
A lot of the waste is at the higher levels of the stack, right now.
And I think, like I said,
hyperscalers are really good at squeezing most of that.
out. Once you look at what the hyperscalers are doing, I think you're getting to the point
where basically you're squeezing everything out of the server today. And the biggest problem
beyond that is that that server is just not designed correctly to run your services. You basically
took the personal computer of the 90s, which is exactly what I had at my desk as a master's student in
Wisconsin, with the operating system that runs on it. And now you're saying, okay, I'm going to take
this thing and make it a building block.
This is super cheap, right?
So in 2024, the worldwide investment in data centers was $300 billion.
The cloud revenue was $700 billion.
So there's a lot of money in that. Now, AI is different because there's so much money
behind AI, and you can directly go to much more expensive technology.
I think this is going to come for the cloud. We're at a point where we've really
squeezed everything out of the server. The
hardware is running at a nanosecond timescale, and it's a personal computer; storage and the operating system are at the millisecond timescale, and everything in the data center is at the microsecond timescale.
So there's a complete mismatch with the building block of what we're building, not just the hardware, but also the software, the operating system, and what we're doing in the data center.
So I think for computer architecture students, for our community, there's a great opportunity to look at, okay, now if we were going to do this
from scratch, how would we build this for the microsecond timescale?
Gotcha. And do you still spend a lot of time advising students these days? Do you have
students working on these sorts of problems? Or are you mostly advising large entities and corporations
and associations, I suppose? So I spent a reasonable amount of my time talking to industry
mostly for best practices.
And then I do have.
I have a team back at EPFL.
We still do research.
We're doing fundamental research.
We're trying to look at some of these questions about how do you build cloud native servers.
How do you do it at the CPU level?
How do you do it at the memory level?
How do you look at beyond a single node?
What are the things we need to do to bring the resources together,
to use them more efficiently, so that's the disaggregated racks.
And the team is quite interesting because we used to do computer architecture,
so everybody was sort of a hardware person.
But now we're sort of bringing circuits, architecture, operating systems, databases, networking together, right?
So we have people that are working together on these topics to go forward.
Because if you just, if we do what we used to do in computer science, which is everybody was
working at their layer of the stack, we're not going to be able to solve this efficiency problem.
I think you've made a few pertinent points, which is that the prior design paradigm has sort of run
its course, especially if you look at the hypers, we have sort of squeezed every last inch of
efficiency, but we are moving towards a new paradigm. It looks like we need a clean slate approach
if we have to reap additional benefits. So there's a need for a new way of designing our computer
systems, and it needs to bring together multiple layers of the stack. It can't be like you
focus on just the architecture layer or just the operating system layer. You need to once again
have people talking to each other. Could you sketch, like, what's your vision? What are the design
principles that you would like to see embodied in this next iteration of computer systems
designed for the new data center? What are the design principles or pillars that you have,
that your research group is exploring towards this particular frontier? So I think we touched
upon looking at the new rack form factor.
So that's at the sort of larger scale of resources.
A lot of that is related to, like I said,
co-design.
Co-design is related to proper contracts
between software and hardware.
And some of these contracts are completely new.
For example, at the rack level,
if you want to have a memory access
that goes from one node over a fast fabric
to another node, that would have
latencies that are orders of magnitude
larger than what CPUs can handle today.
So you need to find a way
to be able to
have an agreement with the operating system and say,
look, if this thing is taking longer
than a certain amount of time, we need to be able to take
this context and put it away
and eventually bring it back,
because the hardware is not going to be waiting there.
Otherwise the hardware is basically blocked completely.
Some of these were already revisited many, many years ago.
I remember, for example, the MIT Alewife back in the CC-NUMA days.
They were doing cache coherence in software.
So in the common case, some of it was done in hardware,
but if something took longer, they had to, for example,
search a shared list to figure out where the block is.
It would trap to software.
So there were already hardware software contracts that were defined to do this.
So at the rack level, I think we need to look at what the proper hardware-software co-design is,
what these new contracts are.
These are not contracts we dealt with before.
And then use those contracts, right, to properly support them in hardware.
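A toy sketch of the kind of contract being described: if a remote access over the rack fabric is expected to exceed some latency budget, the context is put aside and resumed later instead of blocking the hardware. The names, threshold, and coroutine framing are hypothetical illustrations, not the actual mechanism.

    import asyncio

    LATENCY_BUDGET_US = 2.0   # assumed time the hardware is willing to wait in place

    async def remote_load(addr: int, expected_latency_us: float) -> bytes:
        """Hypothetical remote memory access over a rack-scale fabric."""
        if expected_latency_us <= LATENCY_BUDGET_US:
            return b"data"   # short enough: handle it like a normal (slow) load
        # Too long: yield this context so other work runs, resume on completion.
        await asyncio.sleep(expected_latency_us / 1_000_000)
        return b"data"

    async def worker():
        data = await remote_load(0x1000, expected_latency_us=10.0)
        print(f"got {len(data)} bytes after the context was put aside and brought back")

    asyncio.run(worker())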
Today, I think Google has already reported
you spend 10 to 20% of your cycles just context switching,
a lot of OS calls; they referred to it as the datacenter tax.
There's a lot of this: in the search for maximizing utilization of a socket,
we're running tens of thousands of threads.
We're consolidating on a socket.
Now, what happens there is that we're basically spending a lot of time talking to the operating system.
There are a lot of opportunities for identifying operating system services that should be
millisecond-timescale services, but there are a lot of things we can
do in hardware and software at the user level that do not need to be at the millisecond timescale.
We have an example of this here, where we show that you could run single-address-space
function-as-a-service.
We actually perform memory isolation properly, in hardware and software,
with a user-level library, without talking to the operating system, right?
As you go down to the node, when you look at efficiency, we need proper metrics to
establish how efficiently our silicon is designed and how it's operating.
And we're looking at CPUs today.
CPUs historically, again, they've been desktop cores.
And these desktop cores were basically put together on a piece of silicon.
And with that, we used a lot of SRAM just to keep it cool.
We want to stay within the power density of air cooling.
And these desktop cores are not the proper cores for the workloads that are running
in the data center, but that was the cheapest way to build them.
The core-to-SRAM ratio has gone from what used to be 2 to 1 to now 1 to 3 in the latest
chips.
Chips are mostly SRAM, because the cores are basically sucking all the power.
You increase your silicon area so that you can cool it down with air.
And we need to revisit this and
say, are we building the right cores?
Are we running them at the right frequencies?
Maybe we can properly size these cores
and figure out what frequency we do want them at,
so that we can flip this core-to-cache ratio
back to mostly logic,
so mostly electricity is going into running the program.
So there is a lot of opportunity here
to sort of properly figure out how to design
and operate even CPUs.
And you can do the other components in the server as well.
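A back-of-the-envelope sketch of the sizing argument: in a toy model where per-core power scales roughly with the cube of frequency (voltage scaling with frequency), lower-frequency cores let many more of them fit in the same socket power budget, which is how the core-to-cache ratio could flip back toward logic. All constants are invented for illustration.

    BASE_FREQ_GHZ = 3.5
    BASE_CORE_POWER_W = 10.0   # assumed per-core power at the base frequency
    SOCKET_BUDGET_W = 200.0    # assumed socket power budget

    def core_power(freq_ghz: float) -> float:
        # Toy model: voltage tracks frequency, so power ~ frequency cubed.
        s = freq_ghz / BASE_FREQ_GHZ
        return BASE_CORE_POWER_W * s ** 3

    base_cores = SOCKET_BUDGET_W // core_power(BASE_FREQ_GHZ)
    base_throughput = base_cores * BASE_FREQ_GHZ
    for f in (3.5, 2.5, 2.0):
        cores = SOCKET_BUDGET_W // core_power(f)
        print(f"{f} GHz: {int(cores)} cores, "
              f"relative throughput {cores * f / base_throughput:.2f}x")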
That's very interesting.
I'd never heard about this core-to-cache ratio,
particularly in the context of just having area for having the cores be the hotspots
and the caches sort of enable some area for cooling.
I wonder, so what you had said earlier is that there's potential for,
there has been some data to show that they're spending 10 to
20% of time in context switching.
That almost makes you think that you could potentially,
like, there's got to be some sort of algebraic line that you could just draw to say,
well, if I slowed things down so that I don't have to wait so much that I don't have to
context switch, it actually ends up being faster to accomplish my full task by not having
this overhead.
And so is this the kind of question that you're looking at right now with your students?
You need, because you have
multi-tenancy, you need isolation.
So you need protection,
but you don't want to get
the operating system involved, right?
Because the operating system is the operating system
of the 90s.
And in some sense,
it's like having an abstraction
which we can bring
from the networking community, right?
You have, in the data center,
you need a control plane and a data plane.
The control plane is your CPU and your OS
basically establishing who can talk to who, making sure that it's isolated.
And the data plane is basically where the data moves and the work gets done.
And so, yeah, there are examples of basically what are the services you could do
without having to get the operating system involved, without even doing the context switch.
But I think we need to look at this basically in a more fundamental way,
not just look at specific use cases and bring in a little bit of hardware-software
co-design.
We need to really look at what is it that the services are doing with hardware, and how do we
properly support that in our hardware design?
Yeah.
I guess what I was asking was, like, one of the reasons why you might context switch
is maybe this mismatch. It's almost like an impedance mismatch that you were talking about,
where the cores are operating very fast, and then memory and storage are slower and
slower, and then the operating system, all this stuff is slow. So once you get to a point where
you run up against the roadblock, as of now, or the current paradigm is, the best thing to do
is just switch. And that context switch, as you're saying, is coming with a potentially 10 to 20
percent cost. So then you can imagine that if you did reduce the impedance mismatch, maybe by slowing
down the cores, as you suggested. I think I heard you say that, like maybe,
we slow down the cores. So you reduce the impedance mismatch, and maybe
if you reduce that, then you don't have to pay this 10 to 20 percent tax, and at some point
it actually ends up being faster. So I think that's what I thought I heard you say, and then I was just
confirming: is that the kind of thing that you're looking at?
So it's important to think about where it is. So this goes
back to 15 years ago, when we did
what ended up being the Cavium
ThunderX. We actually looked at
what the workloads were doing, and
we weren't looking at SLO back then.
There was no SLO. It was just the workloads,
right? And we were saying,
look, we have this area budget, we have
this power budget. If we can properly design
the cores, we can reclaim a lot
of this dead silicon. And we
did. We looked at building the first
64-bit out-of-order Arm. We took a
Cortex-A15, which was 32-bit, together
with Arm, turned it into a 64-bit
Arm core, and showed that with the area and power budget of that Arm core, we can convert most
of the silicon to cores. And the reason is that the chip only needs the instruction working set
to support the cores. The data is off-chip anyway. And the reason is that back then you had
tens of gigabytes of data off chip. Now you have hundreds of gigabytes of data. The disparity between
on-chip memory and off-chip memory is several orders of magnitude. So you might as well properly
provision your cores' power and area, reclaim all the SRAM, and make sure you just keep
your instruction working set on chip. And by doing this, you will basically deliver, with the
same amount of silicon and electricity, an order-of-magnitude increase in throughput. That's basically
what Cavium did with the ThunderX. Back then, Arm servers didn't take off,
because the entire Linux ecosystem for Arm was missing.
You had HP, Qualcomm, Samsung, AMD,
a long list of companies that dived into Arm
and then quickly left it.
Cavium eventually pivoted to HPC,
because they couldn't sell servers to the cloud
with the Linux ecosystem back then.
And then Amazon and Huawei eventually built
an entire Linux ecosystem around Arm,
and now, this year, Arm servers that are shipping
are over 25% of the market.
So this is reclaiming that silicon,
doing something better with that silicon.
And I think there's a lot that can be done.
And so now with SLO,
we don't really have a proper way of establishing
how well we're doing.
So there has been a lot of back and forth
about wimpy versus brawny cores.
I know Urs Hölzle also wrote a blog about this.
If you read the blog, it basically is an endorsement for better single-thread performance.
It's not really a quantification of the fact that single-thread performance is needed.
And the reason he's basically making that case, back 15 years ago, is saying, look, we're building these data centers.
A lot of our expenditure is in supporting software developers to debug,
performance-debug, our software stacks.
And that's expensive.
We would much rather go to hardware, which is cheaper for us, and go with single-thread performance.
That means that if you have these high single-thread-performance cores, you're going to need more servers, and they're willing to spend that.
But if you actually sit down and quantify the SLO and figure out how much you need, you can operate the chips with much lower core complexity and lower frequency as well.
And that's something that we need to do.
And that's ongoing work in my group right now.
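A hedged sketch of what quantifying the SLO could look like in practice: pick the lowest operating frequency whose measured tail latency still meets the target. The latency table and numbers are invented placeholders, not results from the group.

    SLO_P99_MS = 10.0   # assumed tail-latency target for the service

    # Hypothetical measured p99 latency (ms) at each core frequency (GHz).
    measured_p99 = {3.5: 4.0, 3.0: 5.5, 2.5: 7.5, 2.0: 9.5, 1.5: 13.0}

    def lowest_frequency_meeting_slo(p99_by_freq: dict, slo_ms: float) -> float:
        ok = [f for f, latency in p99_by_freq.items() if latency <= slo_ms]
        return min(ok) if ok else max(p99_by_freq)   # fall back to the fastest setting

    print(lowest_frequency_meeting_slo(measured_p99, SLO_P99_MS))   # 2.0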
Right.
Yeah, I think this has been a fascinating debate over the last 15 years or so,
at least, on Wimpy v. Brony Coors, like, how do you size these various components in the system and so on?
I think, pardon the pun, it's a fascinating threat to chase down.
But in the interest of time, I want to maybe switch gears a little bit.
You've been an invaluable steward for the computer architecture community,
you've served of the chair of Cigarch and so on.
maybe I'll just pick up on one, since we're talking about canonical debates in the computer
architecture community. Looking forward, you've seen the community evolve, tackle like various
challenges, go through various phase transitions as well. Given the current era that we are in,
what do you think will be the canonical debate for this era of computer architecture?
Well, in my specific area, data centers are out of energy. So we need to figure out how we're going
to address this, because it's already impacting residential electricity costs, it's impacting
the grids. The load on the grids is so high that they're getting these third
harmonics and people's light bulbs are flickering at home, right? So we need to
really tackle this energy problem in the data center space.
I think in the mobile space, it's been the same.
Mobile has always been behind desktops.
Desktop was the volume sort of product for decades, thanks to Moore's Law.
And mobile was just sort of bringing technology from the desktop and reusing it.
The similar sort of workload-hardware mismatch we're seeing in the server space today,
we have in the mobile space.
I just went to a mobile technology conference, and
the designers were telling me that we have apps that have three-megabyte instruction working sets, right?
This is on a mobile SOC.
And so we need to, unfortunately, look at the mobile use case separately from the data center use case,
because these are both important and they don't actually necessarily have the same problem.
Some of the problems, like the instruction working sets, could be similar.
But others, for example, the future is going to be in sensory immersion,
with 6G and later generations of Wi-Fi.
And we need to look at the killer apps and figure out how we can support that.
Most of the mobile problem would be the form factor because we can't put batteries
on our head.
The batteries have to be on our body somewhere so we can carry it.
But whatever is coming from around our head has to be able to filter, compress, sense,
and eventually communicate information to somebody,
some other node that's doing all the computation
and eventually connect to the back end.
I think we are going to,
we're entering a bit of a divided world in computer architecture.
And these are both important sort of use cases of computing
and we need to basically have proper debates about them and support them.
Got it.
In your past stint as chair of SIGARCH,
and as part of other forums, we've had visioning workshops and the Architecture 2030 white paper.
We've shepherded multiple groups and worked with other leaders, professors in the field, and so on.
As we sort of head into the 2030s, do you think this is an opportune moment to sort of take stock since the last white paper, since the last decade,
which has been a flurry, a Cambrian explosion, of different kinds of ideas and new problems that have come to the fore as well?
Do you think it's a good time for the community to sort of reflect and think about where we want to be in the 2030s?
What are the most pressing problems that we should try and tackle?
And how do we set a North Star for the community as a whole on the important problems and challenges facing our times?
Yeah, this is absolutely an important question.
So we did this in 2016.
We had an Architecture 2030 workshop, and I think it's now about time we look at this again.
Actually, for our visioning workshops
at SIGARCH, we ask
organizers to write a report
about what came out of it, which is
important. So we can also reflect
on that report and see how much
of that actually happened, how much of it hasn't happened
yet. We still have a few more years before 2030, but we definitely
need to redo this. Two challenges
here for me: first of all, maybe the first
challenge is an opportunity. We need to reach out and
work with other communities, the other layers of the stack in computing.
We cannot just have a computer architecture workshop on the future of computer architecture,
because it's the future of computing.
Moore's law is sort of slowing down and the future is in co-design.
So that's one challenge, which I think would be an opportunity.
The second one is that we are quite broad in the technologies that we're
covering. There's a lot happening. And the challenge there is to find a way to harness all
this. The world is not the way it was when I was a PhD student in computer architecture. So that's a
challenge that we need to basically address in any visioning for the future of computing.
Yeah, so maybe we can also take a little time now to talk about how you became that PhD student
all those years ago.
We also like to ask
our guests about their
computational origin story, I suppose.
So maybe you can share
with us your origin story
as we, as Suvinay always says,
wind the clocks back.
Well, I come from a background
in the 70s where
I grew up. Everybody
was either a doctor or an engineer.
So electrical engineering was super hot.
So I wanted to be
in electrical engineering. And
And when I came to the U.S., my brother-in-law, who wasn't like community engineering, was also
getting a degree in computer science.
Computer science was hot back then.
It was this new field.
It wasn't as rigorous as computer engineering, but it has had its own sort of closer community
feeling because it was just growing.
It was becoming a field.
And so I decided to finish a degree in a community engineer and eventually also got a degree in
computer science. And beyond that, it was basically, I wanted to do computer engineering. And so
that's how I ended up going to Madison. I had offers from a few schools. At that time,
Madison had a really interesting energetic group of computer architects. Some of them had joined
recently at that time. And it looked like a good group of people could go work with.
Yeah. I mean, you came up in an era that was really exciting
for computer architecture.
It felt like there were some,
I was telling Suvinay,
it wasn't until I got maybe midway
through my career and looking back
is when you're a student,
when I was a student,
you're just trying to learn things as fast as you can
and read everything and just like,
I was still in that kind of like
read and absorb phase
rather than read
and really analyze and think
and whatever. So it got to me
a few years later when I was looking at it.
I was like, wow, there was some serious
really interesting fundamental debates that were going on in that era. And now we've gone through
a lot of churn, maybe like slightly winding back to what we were just talking about. We've gone
through a lot of churn. There's a lot of new stuff that has happened that has exploded on
the scene in the last decade. And there's been a lot of shifts in our community.
What do you, what do you imagine now as being one of, like, the fundamental debates that is going
on with us? I mean, there were definitely camps and team this and team that. These days,
it's, I guess when I was growing up, there was Team Edward and Team Jacob. And now maybe there's
team, what's that show that the kids are watching? I feel so old saying this. The Summer I Turned
Pretty, I think. I hear all these kids talking about team this or that. I forget the guys'
names. So for us now, like Team RISC, Team CISC, what are the teams now in your mind?
I think a lot of these things are circular and are coming back, right?
So I was part of this group of people, Stanford, Wisconsin, and MIT were building
multiprocessors, because at that time Stanford basically had announced that the R4000,
which is an in-order pipeline, was the fastest processor you would ever need.
You would never build anything else.
The wimpy-brawny argument goes back to the mid-80s.
They started building shared-memory multiprocessors.
We jumped on that wagon as well.
And in 2010, Horowitz wrote a paper with a bunch of collaborators at Stanford.
He said, look, if you run the SPEC benchmarks in whatever the technology node was at the time,
a two-way in-order core is by far optimal.
There's a Pareto-optimal frontier on core design,
the frequency and complexity, also sizing the components.
The two-way in-order core is always superior to out-of-order cores.
And back in the 80s, Mark was saying that nobody would ever build an out-of-order core.
And out-of-order happened because of Moore's law, and we scaled so much.
But if you go back and revisit that today, if you care about efficiency,
and it's about performance per watt, performance per square millimeter of silicon,
again, in-order may be okay, it may actually win.
And so some of these questions, some of these debates never go away.
And we're looking at that right now at EPFL.
Many of these ideas, sys versus risk, they come back.
We, sysp could do whatever risk could do internally and it could make advantage of those risk
ideas. Now we're going to post more. We have to build accelerators. We're back to
some of those CISC ideas. Yeah, and around and around they go, I guess. That's true.
I mean, I remember in high school, I said something about history repeating itself to a history
teacher. And he was like, tell me more about what you mean. And things do circle around.
There's a big, it's a big pendulum. So, so, yeah.
Actually, I would like to also say that recently I went back and looked at the first
couple of ASPLOS proceedings, and those were super interesting, because there was this huge
RISC versus CISC debate at that time.
And you could see and read what the CISC people were thinking and what the RISC
people wanted to do.
And a lot of those ideas are relevant today.
I completely agree with that.
And I think the systems community is rife with examples of these kinds of things:
RISC versus CISC complexity, disaggregation versus aggregation at various layers of the stack, right?
So these are constant debates. And sometimes it's not like what's the right answer for eternity.
It's like at this point in time, this set of tradeoffs makes a lot of sense. And then something changes,
either the technology changes, the application changes, or something else happens. And so you need to
revisit the assumptions or the conclusions of a particular era. And in a different era, you might have a
different conclusion, right, or a different set of ideas that sort of play well together.
Yeah. And I think, going back
to working together,
for a lot of these hardware-software contracts,
we need to work with software people.
I think it was a little bit embarrassing
to find out that one of the best paper awards
at one of our conferences came from the OS community.
There's a group of amazing researchers at Cambridge who
basically came out and said,
look, the contracts between memory consistency models
and interrupts, when they interact,
are not defined at all.
And they actually showcase a bunch of products,
both within the same ISA and across ISAs,
that show that they're not supported properly,
and they get different behavior.
And this was the best paper.
And I think that was a great best paper.
But if you look at accessed and dirty bits in virtual memory,
they're not well-defined.
x86 does a much better job
of defining them than Arm does.
So I think now, moving forward,
because this co-design is more important,
the contracts between software and hardware are super important.
They have to be spelled out properly.
Yeah, completely agree with that.
I think the opportunities are there.
It's certainly a very interesting time.
And I think it's up to us and the broader community
to reach out and form those connections
across multiple layers of the stack,
whether it's architecture and compilers,
architecture and technology,
or it's architecture operating systems
and all the way across the stack.
I think there are plenty of opportunities.
Very wise words indeed.
So on that note,
any other words of wisdom to our listeners
based on your long range of experience,
either technical or otherwise?
Well, this is the best time to be in computer architecture.
There are absolutely no silver bullets.
We talked about when I was a master's student
and it was multiprocessors versus out-of-order cores.
Those were the only two things you could do.
Nowadays, the sky is the limit in innovation,
and this is a great opportunity to be a computer architect.
I would have loved to be a grad student
in computer architecture right now.
Yeah, there's certainly a lot of problems out there to be, to be solved.
And I think that is great because that means that the students can,
everything's ripe for the picking.
They can decide what they're interested in and go for it.
Instead of picking, dog piling on the hot topic,
Kim Hazelwood talked about that on our very first episode.
Don't dog pile.
There's a lot.
There's a lot.
Go for what you like.
So.
And I think we do have a relatively young audience.
So I hope they all take this to heart.
Cool.
Well, this was really wonderful, Babak.
I think we had a really good conversation with you.
We talked about quite a lot of different things from maybe even like low-level silicon-type
ideas to very high-level data center, energy flow type ideas.
So, we really appreciate your time.
I think this is a really fun conversation.
Thank you for having me.
I had a great time talking to you.
Yeah, it was a fascinating conversation.
Thank you so much, Babak.
And for our listeners, thank you for being with us on the Computer Architecture podcast.
Till next time, it's goodbye from us.
