Catalyst with Shayle Kann - Will inference move to the edge?

Starting point is 00:00:00 Latitude Media, covering the new frontiers of the energy transition.

Starting point is 00:00:39 I'm Shayle Kann, and this is Catalyst. We could be getting 80% of our compute done locally and leaving 20% of the heavy lifting for the data center cloud. Of the 80%, I would say most of that will be on the edge. I think maybe on the order of 1% ends up being put on your consumer electronics. Coming up: will the age of edge inference blunt the big data center boom?

Starting point is 00:02:42 I'm Shayle Kann. I lead the early-stage venture strategy at Energy Impact Partners. Welcome. Okay, so here's an energy question disguised as an AI infrastructure question. What proportion of the world's AI compute in 2035 will be cloud, i.e., in large centralized data centers, versus edge, i.e., on device? It's an energy question because the answer today is effectively 100% in that first category, cloud.

Starting point is 00:03:20 And that's why we have this crazy dynamic in the electricity sector, and actually in the natural gas sector too, where hyperscalers and neoclouds and developers and real estate speculators and crypto miners turned AI companies and more are hunting for sites that can accommodate hundreds of megawatts or gigawatts of power. And the whole thing, as we know, is crashing through the electricity sector, affecting generation, transmission, distribution, prices, now politics, and so on. But there's a narrative that I've heard a number of times that, if borne out, would potentially present a very different future from the present. This is one where AI workloads, first of all, shift significantly from training to inference, and then where those inference workloads become highly latency-sensitive

Starting point is 00:04:06 and are also able to be executed in a more distributed fashion. And as a result, much of that compute, and thus the power demand, shifts from these big centralized data centers to the edge. That could mean it shifts to 10-megawatt data centers clustered around an urban core or an autonomous vehicle corridor, or, at the limit, it could mean inference compute happens on device, and centralized data centers fall back into a pure training position. Any version of this that takes significant share of the market would have profound implications for the energy question and for the grid. So it's worth exploring, which is what I'm doing today with my guest, Dr. Ben Lee.

Starting point is 00:04:46 Ben is a professor of electrical engineering and computer science at the University of Pennsylvania. He's also a visiting researcher at Google. By the way, this edge AI infrastructure world and the energy implications thereof is super interesting to me, as you will hear. So if you are building something in the space, please come get in touch. In the meantime, here's Ben. Ben, welcome.

Starting point is 00:05:08 Great to be here. Thanks so much. I'm very excited for this conversation because this is a topic that, in the energy circles I travel in, I've heard scuttlebutt about a bunch of times, but I've never actually spent the time to really try to understand. The topic basically being

Starting point is 00:05:24 how much of inference compute might move from central cloud infrastructure to the edge, and then how far to the edge, of course, being another question. I think we should start by actually defining those categories a little bit. How do you think about the categorization of where compute can occur? Then we'll talk about each of those categories individually. Right. So even before we talk about generative AI, for classical computing, cloud computing in general,

Starting point is 00:05:54 all the services we love that have changed the way we live and work today, there are three levels I generally think about for compute. The first is massive hyperscale data centers, the ones run by Microsoft and Google and Amazon, hundreds of thousands of machines, massive facilities. That's what most people think about when they think about cloud computing. At the other extreme would be personal devices, consumer electronics. So you think about your phone, you think about your tablet, your

Starting point is 00:06:27 laptop, plenty of compute can happen there as well. There is a perhaps less understood middle layer or intermediate layer called edge computing. And edge computing really means that there are times where you don't want to go all the way to this remote, massive facility and wait for the data to go out to that data center and then come back. You might want to access some compute that's a little bit closer to you. Maybe the same city, maybe the same geographic region, that's edge computing. So they're still going to supply really capable high-performance machines, these servers, but you don't suffer those longer communication times or latencies that you might if you

Starting point is 00:07:08 were to go to that remote massive data center. And my recollection is that there was, I think, okay, so the advent of cloud computing meant the build out of lots of big centralized data centers. There was a fair amount of conversation some number of years ago in the kind of first wave of excitement around autonomous vehicles, in particular, that you might see a fair amount of edge infrastructure get built because of the latency tolerance requirement for AVs. I mean, I'm on the outside, so tell me if I've got that kind of narrative wrong here.

Starting point is 00:07:41 But then it seems to me that, because AVs were generally delayed or maybe the need wasn't as high, if you just look at the infrastructure we've got today, it seems like the vast, vast majority of even classical compute, except for stuff that's sitting in, like, mainframes at companies, is in the cloud, in the big centralized data centers. Do I have that right? That's right.

Starting point is 00:08:06 And this is a decades-long trend. I mean, we've seen this progression, this adoption of cloud computing, over the last 15 to 20 years. And there are a couple of reasons we are seeing that shift, or we have seen that shift. The first is that computing in a massive data center run by the hyperscaler companies, these big tech companies, is much more energy efficient. They know how to deploy these facilities. They know how to cool them and build HVAC systems efficiently. So they're incurring very small overheads per watt of compute. There's this industry standard

Starting point is 00:08:43 metric called power usage effectiveness, or PUE, and that's the ratio of the total power you're using to the power that's going to compute. So Google's PUE is close to 1.1, which is to say for every watt going to compute, there's an additional 0.1 watts going to overheads such as power delivery or cooling or whatever. So that's really incredibly efficient. And most mom-and-pop data center operators, most enterprise data center operators, don't get the scale and efficiency that these hyperscalers do. The scale also gives a second key advantage, which is the ability to share hardware.
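
As a rough illustration of the PUE arithmetic described above, here is a minimal Python sketch. The function names are made up for this example, and the 1.5 comparison value is a hypothetical figure for a less efficient facility; only the 1.1 figure comes from the conversation.

```python
def pue(total_facility_watts: float, it_watts: float) -> float:
    """Power usage effectiveness: total facility power divided by IT (compute) power."""
    if it_watts <= 0:
        raise ValueError("it_watts must be positive")
    return total_facility_watts / it_watts


def overhead_per_compute_watt(pue_value: float) -> float:
    """Watts spent on cooling, power delivery, etc. for each watt of compute."""
    return pue_value - 1.0


# 1.1 is the figure quoted in the conversation; 1.5 is a hypothetical
# comparison value, not a measured number for any real operator.
for label, value in [("quoted hyperscale PUE", 1.1), ("hypothetical PUE", 1.5)]:
    overhead = overhead_per_compute_watt(value)
    print(f"{label} = {value}: {overhead:.1f} W of overhead per W of compute")
```

The closer PUE gets to 1.0, the smaller the share of facility power that goes to anything other than the machines themselves.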

Starting point is 00:09:22 So you buy the hardware once, and you have lots of users sharing the same physical hardware. That allows us, again, to drive the cost down, allows the hyperscaler operators to drive the costs down, and that essentially gets a massive increase in efficiency. So most compute now is being done in these large data centers in the cloud. Okay, so let's talk about the world of AI now, which is where all this growth in compute is happening. You know, AI workloads, of course, divided into two major categories, one being training of models and the other being inference.

Starting point is 00:09:58 I think we'll spend most of our time today talking about inference probably, but let's spend one minute on training. Is there any movement or argument that training should take place anywhere other than large centralized data centers? It seems very clear to me that the trend right now is just to build the largest possible data center to train the largest possible model. So is there anyone who thinks that might turn in the other direction? Some, but that really hasn't gotten much traction. I think the reason why we see most training

Starting point is 00:10:26 happening in massive data centers is because of the scale. You need to communicate large data sets. You need lots of GPUs, all closely coordinated, learning the model parameters. The only scenario that some people have explored for training away from the data center is if you've got private data and somehow you want to refine your model or somehow fine-tune your model with that private data.

Starting point is 00:10:46 You don't want to share it with the hyperscalers. That has been primarily a research question rather than a production system that people have deployed. Okay, so let's assume then that the vast majority of training compute is still going to happen in centralized data centers. As it stands today, I don't know if you know the numbers, but just high-level, of all AI,

Starting point is 00:11:11 workloads, how much is training versus inference? Because I think the other big point people have made is, like, over time, the proportion of workloads going toward inference is going to increase, and the proportion of workloads going toward training may decrease as we sort of asymptote the next model or something like that. But like today, it's mostly training still. I would agree with that. I think, to first order, the training costs are historically what people have cared about the most because the data sets are massive, and then you're talking about these massive 1,000-megawatt data centers for the training workloads.

Starting point is 00:11:48 There was a study we did when I was a visiting research scientist at Meta, where we found that energy costs for AI were roughly broken into three categories. There's a data pre-processing aspect as well, and that's about a third. The training is another third, and then the inference, or the use of the model, is the last third. But clearly those fractions are evolving rapidly. And I would agree with you when you're saying that the training costs are probably flatlining. They're reaching a plateau in how quickly they are growing, perhaps. And if the optimism around AI is to be justified, you're going to have to see inference costs go way up,

Starting point is 00:12:29 because that will be an indicator that adoption has gone up in a fairly significant way, both among individual users, but also among companies and enterprise users. So I think it's true to say that inference costs are large and potentially will grow very rapidly. Okay, so then we're getting to the sort of crux of our question today, which is: inference workloads, inference costs, increase over time. Usage of the models increases over time. That's the presumption of everything going on in the AI world. And then the question is, will that inference compute predominantly still take place in these big centralized cloud data centers, or will some or much of it potentially shift

Starting point is 00:13:09 either to one of the other two categories you described, sort of edge localized or fully localized on device? So let's talk about the edge version first, which is essentially smaller data centers, still data centers, but smaller and more local. What's the argument for why that might happen and what are the limitations? So the argument in favor of edge computing is mainly

Starting point is 00:13:35 the proximity to the end user. So we have been conditioned, in an era before generative AI, that when we access internet-based services like a search engine, we expect the answer to come back on the order of 100 milliseconds. That is the order of magnitude that we're talking about. And as a result, to get those 100-millisecond latencies, oftentimes you require computation closer to the user. So you don't have to travel across the internet.
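
To put the 100-millisecond expectation described above in perspective, the sketch below estimates the best-case round-trip time over optical fiber for a given distance. It assumes signals travel at roughly two thirds of the speed of light in vacuum, a common approximation for fiber. The two distances are rough, assumed figures chosen for illustration, and real networks add routing, queuing, and processing delays on top of this floor.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458
FIBER_FRACTION = 2 / 3  # approximate propagation speed in fiber relative to vacuum


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time from propagation delay alone, in milliseconds."""
    one_way_seconds = distance_km / (SPEED_OF_LIGHT_KM_PER_S * FIBER_FRACTION)
    return 2 * one_way_seconds * 1000


# Assumed, illustrative distances: a cross-country path and a same-metro path.
for label, km in [("cross-country (~4,000 km, assumed)", 4000),
                  ("same metro (~50 km, assumed)", 50)]:
    print(f"{label}: at least {min_round_trip_ms(km):.2f} ms round trip")
```

Under these assumptions, propagation alone uses a meaningful share of a 100-millisecond budget on a cross-country path, while a nearby facility leaves almost all of the budget for compute.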

Starting point is 00:14:04 You don't have to travel from the West Coast out to the East Coast and back again, the data, I mean, and get that answer back in a timely way. What is interesting with generative AI is that we are being reconditioned to tolerate much longer delays. So if you use something like GPT or you use something like Claude or your favorite chatbot, oftentimes it's just sitting there thinking for seconds and seconds, maybe tens of seconds

Starting point is 00:14:31 before it gets you the first token. So the question there is to what extent we care about that latency and need that really fast, responsive access to the answer. Yeah, and I think we've been especially trained even further in that direction with the introduction of things like deep research, where even in the name, you sort of think, well, of course that has to take time. It is deep research that they are doing. So it's an interesting point that maybe we are becoming reconditioned to allowing more

Starting point is 00:14:56 latency. The argument that I've heard for why latency really does matter, apart from just wanting search queries or chat queries to come back quicker, is the next wave of applications for AI, right? And so maybe we go back to the autonomous vehicle world and things like that, where, like, making decisions in near real time does become really important. Robotics being another category that could be a major user of AI compute, but needs really, really low latency.

Starting point is 00:15:26 Is that part of the argument for shifting some compute to the edge? Yes, absolutely. So the class of compute you mentioned, autonomous vehicles, robotics, fits into what we call cyber-physical AI. Cyber-physical systems are those that have a cyber component, a computational component, but also interact with the physical world. And once those interactions with the physical world arise, then we care about responsiveness, because that underpins safety guarantees and the ability to make sure that your robotic arm is able to respond quickly enough to hazards, and that your autonomous vehicles are able to do so. So I agree that there will be cases where we will need those really low latencies,

Starting point is 00:16:09 and that is going to require edge computing much closer to the user, so we have much shorter internet delays, network delays. I'm curious to understand the trade-offs here, right? I know with model training, there are technical reasons why you want your compute clustered together as closely as possible. You want every GPU as close to every other GPU

Starting point is 00:16:33 as you can make them, minimizing the copper between them or the optics or whatever it is that's communicating between them. And that, for some reason that you can explain to me, makes model training more effective. Is there a similar dynamic in inference? Is there a technical reason why you're paying a penalty

Starting point is 00:16:51 if you shift to smaller data centers at the edge? Or is there no technical reason why it's suboptimal. Right, yeah. Let's talk about the training piece first. The reason why we need a thousand megawatt data centers where we have hundreds of thousands of GPUs connected so closely together is because the data sets are massive and the models are massive.

Starting point is 00:17:16 We're trying to learn on the order of a trillion parameters for these machine learning models, these AI models, and we're trying to do it on the wealth of data we find on the Internet. There's no way that any single GPU can handle that much data. So what we end up doing is partitioning the data into smaller pieces and then handing each GPU a slice or partition of this data. And each GPU will churn on its own share, on its own partition of the data, and learn the

Starting point is 00:17:46 models that work best for its piece of the data. And all the other GPUs in the data center are doing the same thing on their partitions of the data. Periodically, what they will do is they will compare notes. They will share the weights that they've learned. And this sharing is really, really expensive. And some of the people in the energy space may know that there are massive energy fluctuations or power fluctuations we will see in data center usage when the GPUs go from this computational

Starting point is 00:18:17 intensive phase where you're learning the model weights to this communication intensive phase where they're comparing notes and sharing their intermediate results with each other. So as a result, that's why we're talking about these massive data centers for training. They all need to communicate frequently to share what they've learned from their own data sets. For inference, we don't see that effect. Just to add, the craziest thing to me about how model training data centers operate right now, the absolute craziest thing is, as you said, there are these, there are surprisingly large spikes in power demand as a result of how the models are trained.
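
The partition-then-compare-notes pattern described above can be sketched as a toy example of data-parallel training with periodic parameter averaging. This illustrates the general idea only and does not describe any particular company's training system: the single-weight linear model, the synthetic data, the four hypothetical workers, and all function names are assumptions chosen to keep the example small.

```python
def local_steps(w, shard, lr, steps):
    """Compute-intensive phase: one worker runs gradient descent on its own shard."""
    for _ in range(steps):
        # Gradient of mean squared error for the model y = w * x.
        grad = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
        w -= lr * grad
    return w


def train(shards, rounds=20, lr=0.01, steps=5):
    """Alternate between independent local training and a weight-sharing step."""
    w = 0.0
    for _ in range(rounds):
        # Every worker starts from the shared weight and trains on its partition.
        local_weights = [local_steps(w, shard, lr, steps) for shard in shards]
        # Communication-intensive phase: workers "compare notes" by averaging weights.
        w = sum(local_weights) / len(local_weights)
    return w


# Synthetic data generated from y = 3x, split across 4 hypothetical workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
print(f"learned weight: {train(shards):.4f}")
```

In a real cluster the averaging step is where every GPU has to exchange data with the others, which is why training favors tightly connected hardware in a single facility.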

Starting point is 00:18:56 Those spikes are actually problematic, not just to the grid, but to the equipment inside the data center as well. So what they do, at least sometimes, to manage that is they create dummy workloads. So they keep the power profile basically flat, but you are literally just wasting energy on absolutely nothing. Nothing is happening during those times. They're dummy workloads. At that scale, the fact that that is happening is wild to me.

Starting point is 00:19:22 Absolutely. And I think we've seen this in other contexts as well, but not for power at this scale. In electrical engineering, we'd call it the di/dt problem, the change in current divided by the change in time: large current swings over very short periods of time. You could imagine building batteries to sort of damp things out or decouple them, and certainly a lot of people are thinking about that. But the easiest thing to do might be to just modulate the software, as you say, because we have very precise control over what the software does. So that is an active and ongoing area of research that needs to further develop.
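
The trade-off discussed above, flattening the power profile at the cost of wasted energy, can be illustrated with a small sketch. All of the numbers here are hypothetical (a facility swinging between 100 MW in compute phases and 40 MW in communication phases, sampled once per second), and the function names are invented for this example. Step-to-step change in power is used as a simple stand-in for the di/dt concern.

```python
def max_ramp_mw(profile_mw):
    """Largest change in power between consecutive samples."""
    return max(abs(b - a) for a, b in zip(profile_mw, profile_mw[1:]))


def with_dummy_load(profile_mw, floor_mw):
    """Pad low-power intervals with filler work so draw never falls below floor_mw."""
    return [max(p, floor_mw) for p in profile_mw]


def wasted_energy_mwh(profile_mw, floor_mw, interval_hours):
    """Energy spent on filler work that produces no useful output."""
    return sum(max(floor_mw - p, 0) for p in profile_mw) * interval_hours


# Hypothetical profile alternating compute (100 MW) and communication (40 MW) phases.
profile = [100, 40, 100, 40, 100, 40]
flattened = with_dummy_load(profile, floor_mw=100)

print("max ramp without dummy load:", max_ramp_mw(profile), "MW")
print("max ramp with dummy load:", max_ramp_mw(flattened), "MW")
print("wasted energy:", wasted_energy_mwh(profile, 100, interval_hours=1 / 3600), "MWh")
```

A battery or a smarter software schedule would aim for the same flat profile without paying the wasted-energy term.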

Starting point is 00:21:56 Okay, so then on to inference. So you're saying inference does not contain that same challenge. So what is the downside to shifting inference workloads to the edge? To my knowledge, there isn't much of a downside. The reason inference is amenable to edge computing is that when you send a prompt for processing by a large language model, that prompt is probably handled by one GPU or maybe eight GPUs inside a single machine. And the reason for that is that the model sits in that machine, the data sits in that machine, and all of your prior conversations with that bot are sitting in that machine, and it's a very localized piece of compute that needs to be done.

Starting point is 00:23:00 And you don't need tens or hundreds of GPUs to be coordinating to give you an answer back. You've got that one GPU or a few tightly coupled GPUs giving you that answer back. And that is great for edge computing, and we can certainly supply that. So a thought experiment that I've given people recently in thinking about this is, let's just say that you need a gigawatt of inference compute five years from now or seven years from now, something like that, wherein the demand for that gigawatt is geographically centralized somewhere. Let's just say you think you're going to need a gigawatt

Starting point is 00:23:44 of inference compute to serve the Dallas metropolitan area, whatever it might be. At that point, a few years from now, and this is back to the power perspective, is it going to be easier for you to find and site one 1-gigawatt site or 100 10-megawatt sites within that geographic region? Today, I think it is still probably easier to find the gigawatt site, or at least for the past couple of years it has been, but there aren't that many gigawatt sites out there from a power availability perspective. So at some point, is that going to flip, and is it going to be easier to build 100 10-megawatt sites, which sounds really hard to do,

Starting point is 00:24:25 and indeed is, but these are all hard problems. So if that happens, do you think that we are going to see a significant portion of that inference workload move to that type of scale? Is that the right scale? Should we be looking at 10 megawatt sites, 100 megawatt sites, one megawatt sites? How far to the edge do we want to go?

Starting point is 00:24:44 Yeah, absolutely. And I agree with the premise of that question 100%. I think that there are two reasons to go to smaller, many smaller data centers. The first is the one you mentioned, power provisioning and connections to the grid. The second is the fact that you don't need

Starting point is 00:25:01 a massive GPU coordination for an inference workload. I guess the catch might be that if you are thinking about your existing edge data centers, maybe you've got data centers in downtown Los Angeles or something like that already serving workloads.

Starting point is 00:25:17 Those facilities may not be configured to handle GPU and AI compute. They may have power delivery infrastructure that was optimized for CPUs. They might have HVAC systems optimized for the much lower power density of CPUs. So it's not simply a matter of pulling out your CPUs and replacing them with GPUs. You may have to retrofit the facility itself to support that. But I agree.

Starting point is 00:25:46 I think finding capacity there may eventually become easier than finding the next 1,000 megawatts. Is there any limitation? I'm trying to think of why you wouldn't do that. You need to have a fair amount of memory, and you need to house all the model weights and so on in every individual data center,

Starting point is 00:26:09 if you're going to do that at the edge, right? So there's got to be some minimum viable scale, I assume. Right, and maybe to give you a sense of the type of data centers we were talking about in the past, again, in a study that we had done with Meta, we looked at 15 of their data centers before generative AI, and the scale of those facilities was somewhere between 15 and 50 megawatts, right? So less than 100 megawatts. And certainly, it was fairly conventional, uncontroversial, to build those sizes of data centers in the past. So that's the starting point, I think, in terms of the scale.

Starting point is 00:26:49 Now, as you scale down towards, for example, one megawatt, it's not clear at what point things start making less sense. I guess the other point here is the way that the data center build-out has gone historically. Just like the cloud data center build-out, it's been fairly clustered in these regions, right? And there's a reason why Northern Virginia is the data center hub of the world. And there are others as well, Chicago, Dallas, et cetera, Phoenix. And that, as I understand it, is largely because the cloud providers needed to offer a certain level of reliability to their customers. And so they could have redundancy within a given region, and that was helpful to them in terms of what they were offering. Do you think that this future world, wherein a bunch of inference compute moves to the edge, let's call it 15 to 50 megawatt data centers then, instead of hundreds or thousands of megawatt data centers,

Starting point is 00:27:50 does it look similar? Is it that you have a small number of regions that have, like, a really high concentration of those 15 to 50 megawatt data centers? Or could it be much more dispersed, because the whole point of this is really low latency and local, and you don't need them to be as clustered? I think there are lots of different aspects at play in terms of data center siting. I think the redundancy is definitely one of them. And I have trouble disentangling the role that some of these other factors play as well. Some people talk about tax breaks and incentives from local counties and local states. Some people talk about proximity to internet exchange points. So not only are you talking about congestion-free power movement, but you're also

Starting point is 00:28:38 talking about congestion-free data movement into and out of the data center. Northern Virginia has that. And then, of course, the availability of the power itself. I guess I would say that when you start talking about many of these smaller data centers, from a redundancy perspective, it might be okay that they're not all geographically clustered, as long as you have a strategy for rolling over the compute or rolling over the workload to spare capacity somewhere within that region that has a similar performance profile or some sort of similar latency or delay characteristic. And so that's really the concern, whether you have robust geographical redundancy and resilience there.

Starting point is 00:29:25 Is this happening? Like, it's interesting. I was thinking, okay, so it sounds like you're saying there's not a big downside. We already have significant inference workloads, so it's not like we're waiting on workloads to show up that could accommodate this. And yet, if you look at everybody, most everybody, building data centers, certainly the hyperscalers, and I think the colos and folks as well, you know, the focus continues to be on we got to find big sites for big data centers. Why don't we see more development of this smaller scale edge AI inference world? I think it really depends on the workload and the application, and we don't know.

Starting point is 00:30:07 I view AI as a more fundamental basic technology, and we don't necessarily know what application or capability will be layered on top of it. I'd say that we've been talking about edge data centers a lot. There are other words for this type of data center. A content distribution network, a CDN, is one of those examples, or a point of presence, a POP, as these facilities are sometimes called, and they exist in fairly significant numbers.

Starting point is 00:30:39 Content distribution networks ensure that when you want to access, for example, New York Times.com or WSJ.com, your webpage is not being served from the other end of the country. Those web pages are sitting close to you because the content distribution network took those updated webpages and moved them to facilities near you, data centers near you. Likewise, companies like Meta, when they have Instagram or when they have these social media applications,

Starting point is 00:31:11 they also have these points of presence that supply data locally rather than retrieving content for your feed from across the country. So we already see that, but these are application-level performance requirements, whether they be for social media or for other sorts of news content. Once it becomes clear what applications of AI really drive further inference deployments, then we'll know what sort of performance requirements we need to meet and what sort of what we call caching techniques or strategies might be useful, so that we can keep fresher data or more recent, more frequently used models closer to these

Starting point is 00:31:50 users and then serve them more quickly. I think it'll become clearer as we see which models really get traction, which applications really get traction. Right. So maybe the state of affairs today is, look, anybody who's developing data centers, we know we need the big centralized data centers because there is currently essentially endless demand to train models, at least relative to the availability of compute today. And so we know we need to build the big centralized ones. We might as well use those big centralized ones that we know we need right now for inference workloads, such as they are today.
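The caching strategy mentioned above, keeping the most frequently used models close to users, can be illustrated with a minimal sketch. It assumes a least-recently-used (LRU) eviction policy, which the speakers do not name; the class `EdgeModelCache`, its `capacity` parameter, and the model names are all hypothetical.

```python
from collections import OrderedDict


class EdgeModelCache:
    """Hypothetical edge site that can hold only `capacity` models at once."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._models = OrderedDict()  # model name -> placeholder, oldest first

    def serve(self, model_name: str) -> str:
        """Return where the request is served from: 'edge' or 'cloud'."""
        if model_name in self._models:
            # Cache hit: mark the model as most recently used.
            self._models.move_to_end(model_name)
            return "edge"
        # Cache miss: serve from the cloud, then pull the model to the edge.
        self._models[model_name] = True
        if len(self._models) > self.capacity:
            # Evict the least recently used model to make room.
            self._models.popitem(last=False)
        return "cloud"


cache = EdgeModelCache(capacity=2)
for name in ["summarize", "summarize", "photo-tag", "translate", "summarize"]:
    print(name, "->", cache.serve(name))
```

Under this assumed policy, a model that stops being requested eventually falls back to the centralized data center, which is one way an operator might decide what stays local.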

Starting point is 00:32:26 But we don't have enough certainty yet about what the inference workloads are going to be long term to invest that kind of capital and time expenditure that it would take to build out the network of 1-to-10-megawatt data centers in a particular geographic region, something like that. That's right. And I would say maybe that my crystal ball is as clear as anyone else's crystal ball, but I feel like there's a huge amount of GPU capacity being discussed in the pipeline in these large data centers. And if it turns out that maybe there are diminishing returns from training larger and larger models,

Starting point is 00:33:03 or maybe we run out of data because we've exhausted all the data that's available on the Internet, when those things happen, it may be that demand for these GPUs in these largest data centers will flatten out, and we're going to have spare capacity, at which point, as you say, they will be used or repurposed to serve inference. And then it will be hard to make the case for building yet more data centers, smaller ones with GPUs closer to the users. I think the catch there will be if one of these model providers or one of these application developers makes performance a distinguishing feature of their offering, right?

Starting point is 00:33:43 If they start competing on performance rather than on capability, then we're going to see, well, I may have a thousand GPUs in the middle of Nebraska that are already deployed, but if I really want to break into the San Francisco market, I've got to build my GPUs right there and have them available. All right, so speaking of performance, let's transition to the full extreme version of this, which is also, I think, theoretically the most disruptive from an energy perspective,

Starting point is 00:34:09 which is shifting any significant portion of these inference workloads all the way onto the device. Either skip the middle ground of edge, five megawatt data centers or 15 or 50, or include them, but shift workloads that would have gone to a big data center,

Starting point is 00:34:26 that requires a lot of power, straight onto your iPhone or your iPad or whatever it is. And we've heard some glimmers of this as well. Give me the similar sort of like pros and cons of shifting that workload straight onto the device. Right. Pros primarily two things. One is performance, right? You don't have to go across the Internet. The model is right there and the computer is right there.

Starting point is 00:34:50 Assuming that you get really capable hardware on your device as well, you get really quick, responsive answers from your AI. The second is also something we've mentioned earlier, which is the notion of privacy. You don't necessarily need to send your data out into this hyperscale data center where it gets blended with lots of other user data and you are given fewer guarantees about what happens to it. Localized compute is certainly more private than compute on shared systems. So those are the two key advantages. And then I guess third would be that it gets more tightly integrated with the capabilities

Starting point is 00:35:30 on a particular platform. So, for example, Apple's ecosystem. Right. And Apple seems like the obvious candidate to do this clearly. You mentioned privacy. Apple is particularly focused on privacy. They have the hardware, the device, right? Like Apple is notoriously, or at least reputationally behind in the AI race.

Starting point is 00:35:49 And so, like, it's not hard to picture that if somebody is going to move a lot of this inference on device, it's going to be Apple. Okay, but there is a real trade-off here, I assume. Yes, and the trade-off is primarily with respect to the capabilities of the device. So if we have a very large model, we're going to have to deploy that model

Starting point is 00:36:13 on a much more capable hardware platform than we've got today. This means having some number of gigabytes of memory to hold the model weights, and then also some additional gigabytes of memory to hold the context as you develop this conversation with the model. In addition to the memory, you're also going to need the compute. You're not going to have this high-performance GPU sitting inside your phone. So you're going to have to have specialized chips.

Starting point is 00:36:40 Those specialized chips in your hand are going to be less powerful or less capable than the ones in the data center. So all of this speaks to not getting exactly the same model that you would get in the data center. You would get a shrunk-down model. Maybe in the data center, you would have a trillion parameters, this massive GPT-5 model, for example. But on a personal consumer electronics device,

Starting point is 00:37:05 you might only have 7 billion parameters, so orders of magnitude smaller. And that smaller model will be less capable. It will give you less capable answers. It will be capable of doing fewer tasks. But maybe that's okay, because you've identified only a handful of tasks that you really care about on your personal device.
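The memory argument above (gigabytes for the weights, plus more for the conversation context) can be made concrete with a back-of-the-envelope sketch. The parameter counts of 7 billion and one trillion come from the conversation; the bytes-per-parameter choices and the transformer shape passed to `kv_cache_gb` are illustrative assumptions, not figures from the episode or from any specific model.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the model weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: float = 2.0) -> float:
    """Extra memory for the conversation context (the key/value cache), in GB.

    The leading factor of 2 accounts for storing both keys and values.
    """
    values = 2 * num_layers * num_kv_heads * head_dim * context_tokens
    return values * bytes_per_value / 1e9


# Weights: 16-bit (2 bytes) versus an assumed 4-bit quantization (0.5 bytes).
for label, params in [("7B on-device model", 7e9), ("1T data center model", 1e12)]:
    for bytes_per_param in (2.0, 0.5):
        gb = weight_memory_gb(params, bytes_per_param)
        print(f"{label}, {bytes_per_param} bytes/param: {gb:,.1f} GB")

# Context: a hypothetical 32-layer model with 8 KV heads of dimension 128.
print(f"KV cache for 8,192 tokens: {kv_cache_gb(32, 8, 128, 8192):.2f} GB")
```

Under these assumptions the 7-billion-parameter model needs 14 GB at 16-bit precision and 3.5 GB at 4-bit, while the trillion-parameter model needs 2,000 GB at 16-bit, which is why the smaller model is the one that plausibly fits on a phone.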

Starting point is 00:37:25 So that is really the trade-off. As you go towards the device, you're going to have to shrink the size of the model down. You're also going to get less and less capability out of your AI. The final thing, of course, is the power and energy profile. At data center scale, we care primarily about power because power influences infrastructure and power delivery and influences

Starting point is 00:37:49 thermal management and so on. For device-level compute, there are two considerations. We care about energy rather than power because that affects battery life. So even if you could deliver a really capable GPU chip onto your phone,

Starting point is 00:38:05 the question is, how long would your phone last if you were using that chip on a fairly consistent basis? So the energy aspect will continue to be challenging, and then the thermal aspect will also be challenging if you have a really powerful device

Starting point is 00:38:18 that's going to be a hot brick inside your pocket. And that's going to be a deal breaker as well. So when you say deal breaker, is there progress toward on-device inference? I mean, to your point on performance, that strikes me as like, okay, this is now, we're now again in the context of like specific workloads. Certain types of workloads,

Starting point is 00:38:42 like a 7-billion-parameter model might be fine. And others, it wouldn't be. And so maybe there will be an on-device chip and some inference that you could do on-device, but you know, you pull up your ChatGPT app or whatever, and of course it's going to send you back out to the cloud, or maybe to the edge. But, you know, these other challenges of thermal management,

Starting point is 00:39:05 things like that are hardware challenges. Where are we in the progression of on-device inference? Is it coming? Is it not coming? Do we not know? I think the assumption with on-device inference is that you'll be able to shrink the model without loss in performance for the tasks you care about. That is the primary strategy the computer scientists have been taking. On the hardware side, we have made strides in developing custom chips, custom silicon, for the specific types of tensor algebra that are required for machine learning models.

Starting point is 00:39:45 So we know how to build those chips, and that gives us energy-efficient compute, higher performance. We know how to build really capable memory systems or solid-state disks. So your phone now has hundreds of gigabytes of memory on it or hundreds of gigabytes of storage on it, and there's a question of, well, maybe you'll end up using less of it for your photos and more of it for your AI model, something like that. So I think there are fairly significant resource constraints,

Starting point is 00:40:16 but I don't think that they are insurmountable in the sense that more intelligent hardware design and more intelligent hardware management could go some ways in terms of making these AI models feasible on the device. Okay, so I'm going to put you on the spot, and we promise not to hold you to these numbers, but just to give a sense of where we think things are heading. If we're fast-forwarding 10 years, right, let's just say we're in 2035, and imagine there's a total volume of inference compute in the world or whatever that's, let's just say is 100 megawatts total,

Starting point is 00:40:51 what would be your guess of the ranges of how much of that compute is going to take place in large centralized data centers versus at the edge? Let's, we'll draw a line, let's say, you know,

Starting point is 00:41:03 100 megawatts and above is large centralized, sub hundred megawatts, but not on device, is edge, and then the third category, of course, being on device. Like, how much of it

Starting point is 00:41:14 can go anywhere but the centralized data centers? So I would go straight to this idea of having a 20-80 rule, because we see this all the time in computer systems, where you have 20% of your tasks being extremely popular. Maybe there are 20 things that you always want to do, and you spend 80% of your AI compute doing those things. That could be email processing.

Starting point is 00:41:37 That could be photo analysis. So we can identify what those really compelling applications and tasks are, and we're going to be spending most of our time doing that. And then for the remainder of the long, long, heavy tail of other tasks that people might want to do, there will always be backup capabilities residing in the cloud data center. So I would say that we could be getting 80% of our compute done locally and leaving 20% of the heavy lifting or the more esoteric, the more corner case compute for the data center cloud.
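The 20-80 intuition above can be checked against a simple popularity model. The sketch below assumes task popularity follows a Zipf-like distribution, where the task at rank r is requested in proportion to 1 / r^s. That distribution, the exponent `s`, and the count of 100 tasks are assumptions chosen for illustration; the conversation gives only the 20-80 rule of thumb.

```python
def top_share(num_tasks: int, top_fraction: float, exponent: float = 1.0) -> float:
    """Fraction of all requests that go to the most popular tasks.

    Popularity is assumed Zipf-like: weight of rank r is 1 / r**exponent.
    """
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_tasks + 1)]
    top_n = max(1, round(num_tasks * top_fraction))
    return sum(weights[:top_n]) / sum(weights)


# How much of the request volume do the top 20% of 100 tasks capture?
for s in (0.8, 1.0, 1.2, 1.5):
    print(f"exponent {s}: top 20% of tasks get {top_share(100, 0.2, s):.0%} of requests")
```

The more skewed the popularity (a larger exponent), the larger the share captured by the top 20% of tasks, so whether the split lands near 80% depends on how concentrated real inference demand turns out to be.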

Starting point is 00:42:13 That is, of course, excluding the training. The training will continue to all reside in the massive facilities. But in terms of the inference, I think there's huge potential. Right. But, yeah, that's actually a very significant shift. If 80% of the inference workload, appreciate that that doesn't include training, but still, if 80% of the inference workload could end up local, that's a significant shift.

Starting point is 00:42:34 And has pretty profound implications for the energy picture as well. Are you saying that 80% just to pin you down even a little bit more, is that local in the sense of being at the edge or is that local in the sense of being on device? Or like, what do you think the split ends up being there? Yeah, so I think of the 80%, I would say most of that will be on the edge. Like I suspect it is today, I think that if you look at what we talked about earlier, content delivery networks, points of presence, they've probably identified 20% of the content that 80% of the people will be looking at most of the time

Starting point is 00:43:15 and they're putting it at the edge. I think maybe on the order of 1% ends up being put on your consumer electronics. Actually, even for today's compute, when we set aside AI, there is a trend towards consumer electronics hiding that flow of data back and forth between the device and the edge for you.

Starting point is 00:43:40 So sometimes, like if you use a cloud storage service like Dropbox, or if you're using a photo storage service, they will let you pretend that you have access to all of your videos or all of your photos and all of your documents, and they will transparently, behind the scenes, move things back and forth between the data center and your local device. So you may think you have all of it, but maybe you've only got a tiny sliver, less than 1%, on your local device. Right. In my Box instance, certain things open up much faster than others when I try to open them.

Starting point is 00:44:15 And it's occurred to me that that is why. If I step back then, okay, so it sounds like what you're saying, in this scenario you're painting of the future, is that roughly 80% of the inference workloads are edge, very little of it actually on device, and then the other 20% or so is sitting in cloud, big cloud data centers. So when I think about the energy implications of that, there's, I think, a couple ways to think about it that are pretty interesting.

Starting point is 00:44:45 One is this, okay, so maybe a fair amount of the energy consumption of at least inference compute is going to shift to these 5 megawatt, 15 megawatt, 50 megawatt type local sites. That has big implications for the grid. In ways that are, I don't know, both good and bad, probably, harder to manage in some ways, easier to manage in other ways. But the overall energy consumption of inference compute, I would expect, and you can tell me if I'm wrong, would actually be higher in this scenario than it would be if it was all centralized because I assume the PUE that you get for these edge data centers isn't quite as good as it is for the large centralized data center. So like on balance, this probably means more overall AI energy consumption. Do you think that's right? Yes. Yes. I think you get economies of scale

Starting point is 00:45:37 when you go to a gigawatt or two gigawatts. You have a single facility, you're managing it in a highly optimized, coordinated way, and you've got hundreds of thousands of these machines all managed very precisely. I think as you shrink the system down, you will lose efficiency. You will be trying to build these 20 megawatt data centers

Starting point is 00:46:00 in maybe footprints or facilities that weren't designed initially for those workloads. So, yes, I think total energy costs may go up as a result. We're talking about inference workloads to some extent as a monolith. I'm sure they are not. So are there big distinctions in your mind in terms of the different types of inference workloads and how that influences where they should be housed?
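The PUE argument from a moment ago can be worked through numerically. PUE (power usage effectiveness) is total facility energy divided by IT equipment energy, so facility energy is IT energy times PUE. The 20% cloud, roughly 79% edge, and roughly 1% device split is the guess given in the conversation; every PUE value below is a hypothetical placeholder, and the sketch ignores any difference in chip efficiency between tiers.

```python
def total_energy(split: dict[str, float], pue: dict[str, float],
                 it_total: float = 100.0) -> float:
    """Total facility energy for a given split of IT load across tiers.

    Each tier's facility energy is its IT energy multiplied by its PUE.
    """
    return sum(it_total * share * pue[tier] for tier, share in split.items())


# Hypothetical PUE values: large sites assumed more efficient than edge sites.
# The device tier is given 1.0 because it has no facility cooling overhead.
assumed_pue = {"cloud": 1.1, "edge": 1.4, "device": 1.0}

all_centralized = {"cloud": 1.0, "edge": 0.0, "device": 0.0}
edge_heavy = {"cloud": 0.20, "edge": 0.79, "device": 0.01}

print("All centralized:", round(total_energy(all_centralized, assumed_pue), 1))
print("Edge-heavy split:", round(total_energy(edge_heavy, assumed_pue), 1))
```

With these placeholder values, 100 units of IT load cost 110 units when fully centralized and 133.6 units under the edge-heavy split. The size of the gap is entirely a function of the assumed PUE difference.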

Starting point is 00:46:25 Right. Yes. So that's a really great question, actually. I would say that there are fundamental limits on the number of inference queries a human user can actually produce because ultimately we're limited by the speed of our typing, the number of tokens we can actually produce to query the models. So there is some of that where humans will continue to send requests to agents.

Starting point is 00:46:50 But I think increasingly, most of the inference workload will come from other software agents. This could be a search engine retrieving web pages and then asking the large language model to summarize them into a coherent discussion for you. This could be your photo app learning something about your images, or this could be your mail app, doing something with your mail and helping you compose messages. So all of that is done behind the scenes, and those inference workloads are potentially much larger because, of course, software can generate those requests at much, much higher rates. From the perspective

Starting point is 00:47:36 of where that computation happens, to the extent that the data center already has servers running your mail workloads, or to the extent that your search engines are already running in the same data center, the communication with the model will be a bottleneck, right? So if you have a data center in Nebraska running your search engine for you or doing some of these other big, heavy software jobs, then potentially they could query and execute inference in these largest hyperscale data centers. All right, Ben, this was super interesting. Really appreciate your time. It was my pleasure. I really enjoyed the conversation. Thanks so much. Dr. Ben Lee is a professor of electrical engineering and computer science at the University of

Starting point is 00:48:28 Pennsylvania. He's also a visiting researcher at Google. This show is a production of Latitude Media. You can head over to latitudemedia.com for links to today's topics. Latitude is supported by Prelude Ventures. This episode was produced by Daniel Waldorf, mixing and theme song by Sean Marquand. Stephen Lacey is our executive editor. I'm Shayle Kann, and this is Catalyst.
