Catalyst with Shayle Kann - Will inference move to the edge?
Episode Date: December 18, 2025
Today virtually all AI compute takes place in centralized data centers, driving the demand for massive power infrastructure. But as workloads shift from training to inference, and AI applications become more latency-sensitive (autonomous vehicles, anyone?), there's another pathway: migrating a portion of inference from centralized computing to the edge. Instead of a gigawatt-scale data center in a remote location, we might see a fleet of smaller data centers clustered around an urban core. Some inference might even shift to our devices.
So how likely is a shift like this, and what would need to happen for it to substantially reshape AI power?
In this episode, Shayle talks to Dr. Ben Lee, a professor of electrical engineering and computer science at the University of Pennsylvania, as well as a visiting researcher at Google.
Shayle and Ben cover topics like:
- The three main categories of compute: hyperscale, edge, and on-device
- Why training is unlikely to move from hyperscale
- The low latency demands of new applications like autonomous vehicles
- How generative AI is training us to tolerate longer latencies
- Why distributed inference doesn't face the same technical challenges as distributed training
- Why consumer devices may limit model capability
Resources:
- ACM SIGMETRICS Performance Evaluation Review: A Case Study of Environmental Footprints for Generative AI Inference: Cloud versus Edge
- Internet of Things and Cyber-Physical Systems: Edge AI: A survey
Credits: Hosted by Shayle Kann. Produced and edited by Daniel Woldorff. Original music and engineering by Sean Marquand. Stephen Lacey is our executive editor.
Catalyst is brought to you by EnergyHub. EnergyHub helps utilities build next-generation virtual power plants that unlock reliable flexibility at every level of the grid. See how EnergyHub helps unlock the power of flexibility at scale, and deliver more value through cross-DER dispatch with their leading Edge DERMS platform, by visiting energyhub.com.
Catalyst is brought to you by Bloom Energy. AI data centers can’t wait years for grid power, and with Bloom Energy’s fuel cells, they don’t have to. Bloom Energy delivers affordable, always-on, ultra-reliable onsite power, built for chipmakers, hyperscalers, and data center leaders looking to power their operations at AI speed. Learn more by visiting BloomEnergy.com.
Catalyst is supported by Third Way. Third Way’s new PACE study surveyed over 200 clean energy professionals to pinpoint the non-cost barriers delaying clean energy deployment today and offers practical solutions to help get projects over the finish line. Read Third Way's full report, and learn more about their PACE initiative, at www.thirdway.org/pace.
Transcript
A very brief word before we start the show. We've got a survey for listeners of Catalyst and
Open Circuit, and we would be so grateful if you could take a few moments to fill it out.
As our audience continues to expand, it's an opportunity to understand how and why you listen
to our shows, and it helps us continue bringing relevant content on the tech and markets you
care about in clean energy. If you fill it out, you'll get a chance to win a $100 gift card
from Amazon, and you can find it at latitudemedia.com/survey, or just click the survey link in
the show notes. Thank you so much.
Latitude Media, covering the new frontiers of the energy transition.
I'm Shayle Kann, and this is Catalyst.
We could be getting 80% of our compute done locally and leaving 20% of the heavy lifting
for the data center cloud. Of the 80%, I would say most of that will be on the edge.
I think maybe on the order of 1% ends up being put on
your consumer electronics. Coming up: could the age of edge inference blunt the big data center boom?
When utilities need flexible capacity they can count on, they turn to EnergyHub. EnergyHub works
with more than 170 utilities, coordinating over 2.5 million devices to manage 3.4 gigawatts of
flexibility built for the moments when utilities can't afford uncertainty. EnergyHub builds and
operates virtual power plants that utilities actually stake their grid planning on, coordinating EVs,
batteries, thermostats, and more through a single platform built for utility scale.
Predictive, verifiable, and designed to perform when it counts.
Learn more at energyhub.com.
Trillions of dollars are flowing into clean and critical infrastructure, but those investments
aren't driven by technology alone. They're shaped by markets, by policy, by capital,
and by the institutions that connect them.
I'm Alfred Johnson, CEO of Crux, and host of a brand new podcast, Critical Capital.
Each episode, I talk with people deploying capital, shaping policy and building the clean economy.
Tune in as we unpack how progress is actually made.
Listen to Critical Capital on Spotify, Apple, or wherever you get your podcasts.
Catalyst is supported by FischTank PR, an award-winning PR firm focused on climate and energy tech, renewables, and sustainability.
FischTank is known for generating prominent and effective media coverage for the brands they work with.
If you want a PR partner that's thoughtful, shoots straight, and gets results, you'll like
FischTank PR.
To learn more about FischTank's approach, visit fischtankpr.com.
That's F-I-S-C-H-tankpr.com.
I'm Shayle Kann.
I lead the early-stage venture strategy at Energy Impact Partners.
Welcome.
Okay, so here's an energy question disguised as an AI infrastructure question.
What proportion of the world's AI compute in
2035 will be cloud, i.e. in large centralized data centers, versus edge, versus on device?
It's an energy question because the answer today is effectively 100% in that first category, cloud.
And that's why we have this crazy dynamic in the electricity sector and actually in the natural gas sector too,
where hyperscalers and neoclouds and developers and real estate speculators and crypto miners
turned AI companies, and more are hunting for sites that can accommodate hundreds of megawatts or
gigawatts of power. And the whole thing, as we know, is crashing through the electricity sector,
affecting generation, transmission, distribution, prices, now politics, and so on. But there's a narrative
that I've heard a number of times that, if borne out, would potentially present a very different
future from the present. This is one where AI workloads, first of all, shift significantly from
training to inference, and then where those inference workloads become highly latency-sensitive
and are also able to be executed in a more distributed fashion. And as a result, much of that compute,
and thus the power demand, shifts from these big centralized data centers to the edge.
That could mean it shifts to 10 megawatt data centers clustered around an urban core or an
autonomous vehicle corridor, or at the limit, it could mean inference compute happens on device
and centralized data centers fall back into a pure training position.
Any version of this that takes significant share of the market
would have profound implications for the energy question and for the grid.
So it's worth exploring, which is what I'm doing today with my guest, Dr. Ben Lee.
Ben is a professor of electrical engineering and computer science at the University of Pennsylvania.
He's also a visiting researcher at Google.
By the way, this edge AI infrastructure world and the energy implications thereof
is super interesting to me, as you will hear.
So if you are building something in the space,
please come get in touch.
In the meantime, here's Ben.
Ben, welcome.
Great to be here. Thanks so much.
I'm very excited for this conversation
because this is a topic that,
in the energy circles that I travel in,
I've heard scuttlebutt about a bunch of times,
but I've never actually spent the time
to really try to understand.
The topic basically being
how much of inference compute
might move from central cloud infrastructure to the edge,
and then how far to the edge, of course, being another question.
I think we should start by actually defining those categories a little bit.
How do you think about the categorization of where compute can occur?
Then we'll talk about each of those categories individually.
Right.
So even before we talk about generative AI, for classical computing, cloud computing in general,
all the services we love that have changed the way we live and work today,
there are three levels generally I think about for compute.
The first is massive hyperscale data centers,
the ones run by Microsoft and Google and Amazon,
hundreds of thousands of machines, massive facilities.
That's what most people think about when they think about cloud computing.
At the other end of the extreme would be personal devices,
consumer electronics. So you think about your phone, you think about your tablet, your
laptop, plenty of compute can happen there as well. There is a perhaps less understood
middle layer or intermediate layer called edge computing. And edge computing really means that
there are times where you don't want to go all the way to this remote, massive facility
and wait for the data to go out to that data center and then come back. You might want to access
some compute that's a little bit closer to you. Maybe
the same city, maybe the same geographic region, that's edge computing.
So they're still going to supply really capable high-performance machines, these servers,
but you don't suffer those longer communication times or latencies that you might if you
were to go to that remote massive data center.
And my recollection is that there was, I think, okay, so the advent of cloud computing
meant the build out of lots of big centralized data centers.
There was a fair amount of conversation some number of years ago
in the kind of first wave of excitement around autonomous vehicles,
in particular, that you might see a fair amount of edge infrastructure get built
because of the latency tolerance requirement for AVs.
I mean, I'm on the outside, so tell me if I've got that kind of narrative wrong here.
But then it seems to me that because AVs were generally delayed
or maybe the need wasn't as high, what we've got today,
if you just look at the infrastructure today,
is that the vast, vast majority of classical compute even,
except for stuff that's sitting in, like, mainframes that companies own,
is in the cloud and the big centralized data centers.
Do I have that right?
That's right.
And this is a decades-long trend.
I mean, we've seen this progression,
this adoption of cloud computing over the last 15 to 20 years.
And there are a couple of reasons we are seeing
that shift, or we have seen that shift. The first is that computing in a massive data center
run by the hyperscaler companies, these big tech companies, is much more energy efficient. They know
how to deploy these facilities. They know how to cool them and build HVAC systems efficiently.
So they're incurring very small overheads per watt of compute. There's this industry standard
metric called power usage effectiveness, or PUE, and that's the ratio of the total power you're using
to the power that's going to compute. So Google's PUE is close to 1.1, which is to say
for every watt going to compute, there's an additional 0.1 watts going to overheads such as power
delivery or cooling or whatever. So that's really incredibly efficient. And most mom-and-pop
data center operators, most enterprise data center operators, don't get the scale
and efficiency that these hyperscalers do.
The scale also gives a second key advantage,
which is the ability to share hardware.
So you buy the hardware once,
and you have lots of users sharing the same physical hardware.
That allows us, again, to drive the cost down,
allows the hyperscaler operators to drive the costs down,
and that essentially gets a massive increase in efficiency.
So most compute now is being done in these large data centers in the cloud.
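As a rough illustration of the PUE arithmetic described above, here is a minimal Python sketch. The 1.1 figure is the one quoted in the conversation; the 1.6 figure for a smaller operator and the 10 MW IT load are made-up values chosen only for comparison, and the function names are hypothetical.

```python
def pue(total_facility_power: float, it_power: float) -> float:
    """Power usage effectiveness: total facility power divided by compute (IT) power."""
    if it_power <= 0:
        raise ValueError("IT power must be positive")
    return total_facility_power / it_power


def overhead_per_compute_watt(pue_value: float) -> float:
    """Extra watts spent on cooling, power delivery, etc. for each watt of compute."""
    return pue_value - 1.0


IT_LOAD_MW = 10.0  # assumed compute load, for illustration only

# 1.1 is quoted in the conversation; 1.6 is a hypothetical less efficient operator.
for label, value in [("hyperscale (quoted)", 1.1), ("smaller operator (assumed)", 1.6)]:
    total_mw = IT_LOAD_MW * value
    overhead_mw = IT_LOAD_MW * overhead_per_compute_watt(value)
    print(f"{label}: PUE {value:.1f}, {total_mw:.1f} MW total, {overhead_mw:.1f} MW overhead")
```

At the same compute load, the gap between the two PUE values shows up directly as extra megawatts that have to be provisioned for cooling and power delivery rather than for GPUs or CPUs.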
Okay, so let's talk about the world of AI now, which is where all this growth in compute is happening.
You know, AI workloads, of course, divided into two major categories, one being training of models and the other being inference.
I think we'll spend most of our time today talking about inference probably, but let's spend one minute on training.
Is there any movement or argument that training should take place anywhere other than large centralized data centers?
It seems very clear to me that the trend right now is just build the largest
possible data center to train the largest possible model.
So is there anyone who thinks that
that might turn in the other direction?
Some, but that really hasn't gotten much traction.
I think the reason why we see most training
happening in massive data centers is because of the scale.
You need to communicate large data sets.
You need lots of GPUs all closely coordinated,
learning the model parameters.
The only scenario that some people have explored
for training away from the data center is if you've got private data
and somehow you want to refine your model
or somehow fine-tune your model with that private data.
You don't want to share it with the hyperscalers.
That has been primarily a research question
rather than a production system that people have deployed.
Okay, so let's assume then that the vast majority
of training compute is still going to happen
in centralized data centers.
As it stands today, I don't know if you know the numbers,
but just high-level, of all AI
workloads, how much is training versus inference? Because I think the other big point people have made
is, like, over time, the proportion of workloads going toward inference is going to increase,
the proportion of workloads going toward training may decrease as we sort of asymptote the next
model or something like that. But, like, today, it's mostly training still.
I would agree with that.
I think to first order, the training costs are historically what people have cared about the most
because the data sets are massive,
and then you're talking about these massive 1,000-megawatt data centers
for the training workloads.
There was a study we did when I was a visiting research scientist at Meta,
where we found that energy costs for AI were roughly broken into three categories.
There's a data pre-processing aspect as well, and that's about a third.
The training is another third, and then the inference or the use of the model is the last third.
But clearly those fractions are evolving
rapidly. And I would agree with you when you're saying that the training costs are probably
flatlining. They're reaching a plateau in how quickly they are growing, perhaps. And if the
optimism around AI is to be justified, you're going to have to see inference costs go way up,
because that will be an indicator that adoption has gone up in a fairly significant way, both among
individual users, but also among companies and enterprise users. So I think
it's true to say that inference costs are large and potentially will grow very rapidly.
Okay, so then we're getting to the sort of crux of our question today, which is inference workloads, inference costs increase over time.
Usage of the models increases over time.
That's the presumption of everything going on in AI world.
And then the question is, will that inference compute predominantly still take place in these big centralized cloud data centers?
or will some or much of it potentially shift
either to one of the other two categories you described,
sort of edge localized or fully localized on device?
So let's talk about the edge version first,
which is essentially smaller data centers,
still data centers, but smaller and more local.
What's the argument for why that might happen
and what are the limitations?
So the argument in favor of edge computing is mainly
the proximity to the end user.
So we have been conditioned in an era before generative AI
that when we access internet-based services like a search engine,
we expect the answer to come back on the order of 100 milliseconds.
That is the order of magnitude that we're talking about.
And as a result, to get those 100-millisecond latencies,
oftentimes you require computation closer to the user.
So you don't have to travel across the internet.
You don't have to travel from the West Coast
out to the East Coast and back again, the data, I mean,
and get that answer back in a timely way.
What is interesting with generative AI is that we are being
reconditioned to tolerate much longer delays.
So if you use something like GPT or you use something like
Claude or your favorite chatbot, oftentimes it's just sitting there
thinking for seconds and seconds, maybe tens of seconds
before it gets you the first token.
So the question there is to what extent we care
about that latency and need that really fast, responsive access to the answer.
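To put rough numbers on the proximity argument, the sketch below estimates the best-case round-trip propagation delay over fiber. The distances and the two-thirds-of-light-speed factor for fiber are assumptions for illustration; real paths add routing, queuing, and processing time on top of this.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum
FIBER_FRACTION = 2 / 3             # assumed: light in optical fiber travels at about 2/3 c


def round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip propagation delay over fiber, in milliseconds."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FRACTION)
    return 2 * one_way_s * 1000


# Distances are illustrative assumptions, not measured network paths.
for label, km in [("same metro area", 50), ("coast to coast", 4_000)]:
    print(f"{label} ({km} km): {round_trip_ms(km):.1f} ms round trip")
```

Under these assumptions a coast-to-coast round trip uses up roughly 40 milliseconds of a 100-millisecond budget on propagation alone, while a same-city round trip costs well under a millisecond. Against a chatbot that thinks for tens of seconds, either figure is negligible, which is the reconditioning point made above.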
Yeah, and I think we've been especially trained even further in that direction with the
introduction of things like deep research, where even in the name, you sort of think, well,
of course that has to take time.
It is deep research that they are doing.
So it's an interesting point that maybe we are becoming re-conditioned to allowing more
latency.
The argument that I've heard for why latency is really going to matter, apart from just wanting
search queries or chat queries to come back quicker,
is the next wave of applications for AI, right?
And so maybe we go back to the autonomous vehicle world and things like that,
where, like, latency, making decisions in near real time does become really important.
Robotics being another category that could be a major user of AI compute,
but needs really, really low latency.
Is that part of the argument for shifting some compute to the edge?
Yes, absolutely.
So the class of compute you mentioned, autonomous vehicles, robotics, fits into what we call cyber-physical AI.
So cyber-physical systems are those that have a cyber component, a computational component, but also interact with the physical world.
And once those interactions with the physical world arise, then we care about responsiveness, because that underpins safety guarantees and the ability to make sure that your robotic arm is able to respond quickly enough to hazards,
or your autonomous vehicles are able to do so.
So I agree that there will be cases
where we will need those really low latencies,
and that is going to require edge computing
much closer to the user,
so we have much shorter internet delays, network delays.
I'm curious to understand the trade-offs here, right?
I know with model training,
there are technical reasons why you want your compute
as clustered together as closely as possible.
You want every GPU as close to every other GPU
as you can make them, minimizing the copper between them
or the optics or whatever it is
that's communicating between them.
And that for some reason that you can explain to me
makes model training more effective.
Is there a similar dynamic in inference?
Is there a technical reason why
you're paying a penalty
if you shift to smaller data centers at the edge?
Or is there no technical reason
why it's suboptimal.
Right, yeah.
Let's talk about the training piece first.
The reason why we need a thousand megawatt data centers
where we have hundreds of thousands of GPUs connected so closely together
is because the data sets are massive and the models are massive.
We're trying to learn on the order of a trillion parameters for these machine learning
models, these AI models, and we're trying to do it on the wealth of data we find
on the Internet.
There's no way that any single
GPU can handle that much data.
So what we end up doing is partitioning the data into smaller pieces and then handing each
GPU a slice or partition of this data.
And each GPU will churn on its own share, on its own partition of the data and learn the
models that work best for its piece of the data.
And all the other GPUs in the data center are doing the same thing on their partitions
of the data.
Periodically, what they will do is they will compare notes.
They will share the weights that they've learned.
And this sharing is really, really expensive.
And some of the people in the energy space may know that there are massive energy fluctuations
or power fluctuations we will see in data center usage when the GPUs go from this computational
intensive phase where you're learning the model weights to this communication intensive
phase where they're comparing notes and sharing their intermediate results with each other.
So as a result, that's why we're talking about these massive data centers for training.
They all need to communicate frequently to share what they've learned from their own data sets.
For inference, we don't see that effect.
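The partition-then-compare-notes pattern described above is commonly called data-parallel training. Below is a toy, pure-Python sketch of it, assuming a one-parameter linear model, four simulated workers, and a synchronization on every step; those choices, the learning rate, and all function names are hypothetical. Real systems run this across many GPUs with collective communication and far larger models.

```python
import random


def local_gradient(w, shard):
    """Gradient of mean squared error for y = w * x, using one worker's shard only."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for the communication phase: every worker receives the average."""
    return sum(values) / len(values)


def train(shards, steps=100, lr=0.5):
    w = 0.0  # all workers start from the same weight
    for _ in range(steps):
        # Compute phase: each worker churns on its own partition of the data.
        grads = [local_gradient(w, shard) for shard in shards]
        # Communication phase: workers compare notes and agree on one update.
        w -= lr * all_reduce_mean(grads)
    return w


random.seed(0)
TRUE_W = 3.0  # the weight the toy model should recover
data = [(x, TRUE_W * x) for x in (random.uniform(-1, 1) for _ in range(400))]

NUM_WORKERS = 4  # assumed worker count
shards = [data[i::NUM_WORKERS] for i in range(NUM_WORKERS)]

learned_w = train(shards)
print(f"learned weight: {learned_w:.4f}")
```

The `all_reduce_mean` call is the step that forces every worker to wait on every other worker, which is why training rewards tightly coupled GPUs. Answering a single inference prompt has no equivalent step across the whole fleet.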
Just to add, the craziest thing to me about how model training data centers operate right now,
the absolute craziest thing is, as you said,
there are surprisingly large spikes in power demand as a result of how the models are trained.
And those spikes are actually problematic, not just to the grid,
but to the equipment inside the data center as well.
So what they do at least sometimes to manage that is they create dummy workloads.
So they keep the power profile basically flat, but you are literally just wasting energy
on absolutely nothing.
Nothing is happening during those times.
They're dummy workloads.
At that scale, the fact that that is happening is wild to me.
Absolutely.
And I think we've seen this in other contexts as well, but not
at this scale. In electrical engineering, we'll call it the di/dt problem,
the change in current divided by the change in time. If you have large current swings over very short periods
of time, you could imagine building batteries to sort of damp things out or decouple them,
and certainly a lot of people are thinking about that. But the easiest thing to do might be
to just modulate the software, as you say, because we have very precise control over what the
software does. So that is an active and ongoing area of research that needs to further develop.
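A minimal sketch of the dummy-workload idea, assuming a made-up power trace that alternates between compute phases and communication phases. The numbers, the 90 MW floor, and the function names are hypothetical; the point is only the trade between smaller swings and energy spent on nothing.

```python
def max_step(trace):
    """Largest change between consecutive samples, a crude stand-in for di/dt."""
    return max(abs(b - a) for a, b in zip(trace, trace[1:]))


def fill_to_floor(trace, floor_mw):
    """Add dummy load so that power never drops below floor_mw."""
    return [max(p, floor_mw) for p in trace]


# Hypothetical cluster power in MW, one sample per interval: high while GPUs
# compute, low while they wait on communication.
raw = [100, 100, 40, 100, 100, 40, 100, 100, 40]
flattened = fill_to_floor(raw, floor_mw=90)

# Dummy load added, in MW-intervals: energy that does no useful work.
dummy_load = sum(f - r for f, r in zip(flattened, raw))

print("largest swing before:", max_step(raw), "MW")
print("largest swing after: ", max_step(flattened), "MW")
print("dummy load added:    ", dummy_load, "MW-intervals")
```

In this toy trace the largest swing shrinks from 60 MW to 10 MW, at the cost of 150 MW-intervals of wasted energy. A battery or a smarter schedule would aim for the same flattening without that waste.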
Virtual power plants are becoming a reliable way for utilities to manage capacity,
but enrolling devices is just the start.
What really matters is confidence, knowing those resources will perform when dispatched
and being able to prove it from the control room to the living room.
EnergyHub's platform handles the full picture, from near real-time forecasting
to locational dispatch and the kind of rigorous verification that holds up when regulators,
grid operators, or leadership ask, did it deliver?
Easy enrollment creates momentum, proven performance builds trust.
That's why more than 170 utilities rely on EnergyHub to manage over 2.5 million devices
delivering 3.4 gigawatts of flexible capacity.
See what that looks like at energyhub.com.
We're living through a profound economic shift, and energy sits at the center of all of it.
Trillions of dollars are flowing into power plants, transmission lines, battery factories,
data centers, but the future of energy
isn't shaped by technology alone.
It's shaped by markets, by policy, by capital,
and by the institutions that connect them.
I'm Alfred Johnson, CEO of Crux,
the capital platform for the clean economy.
Join me for my brand new show, Critical Capital.
As I talk with people deploying capital,
shaping policy and building projects.
Together, we unpack how risk is priced,
how incentives are structured,
and how progress is actually made.
Listen to Critical Capital on Spotify, Apple,
or wherever you get your podcasts.
Are you tired of overpaying for big-name PR firms, but not really knowing what they're delivering?
Is your comms team wasting time reviewing lengthy messaging briefs and decks, instead of engaging journalists or producing content?
Are you wondering why your competitors are getting press and you aren't?
FischTank PR is an award-winning climate and energy tech, renewables, and sustainability-focused PR firm dedicated to elevating the work of both early-stage and established companies.
Whether you need to position yourself as a thought leader in between project announcements, or translate complex
ideas and technologies into tangible, compelling stories that resonate with the media,
FischTank can help. Check out fischtankpr.com. That's F-I-S-C-H-tankpr.com.
Okay, so then on to inference. So you're saying inference does not contain that same
challenge. So what is the downside to shifting inference workloads to the
edge?
To my knowledge, there isn't much of a downside. The reason why
inference is amenable to edge computing is that when you send a prompt for processing by a large language model, that prompt is probably handled by one GPU or maybe eight GPUs inside a single machine.
And the reason is that the model sits in that machine, the data sits in that machine, and all of your prior conversations with that bot are sitting in that machine,
and it's a very localized piece of compute that needs to be done.
And you don't need tens or hundreds of GPUs to be coordinating to give you an answer back.
You've got that one GPU or a few tightly coupled GPUs giving you that answer back.
And that is amenable.
That is great for edge computing, and we can certainly supply that.
So a thought experiment that I've given people recently in thinking about this is, let's just say that you
need a gigawatt of inference compute five years from now or seven years from now, something
like that. You think you need a gigawatt, wherein the demand for that gigawatt is geographically centralized
somewhere. Let's just say you think you're going to need a gigawatt
of inference compute to serve the Dallas metropolitan area, whatever it might be. At that point,
a few years from now, and this is back to the power perspective, is it going
to be easier for you to find and site a one-gigawatt site or a hundred 10-megawatt sites within that
geographic region. Today, I think it is still probably easier to find the gigawatt site,
or at least the past couple of years it has been, but there aren't that many gigawatt sites
out there from a power availability perspective. So at some point, is that going to flip
and is it going to be easier to build a hundred 10-megawatt sites,
which sounds really hard to do,
and indeed is, but these are all hard problems.
So if that happens,
do you think that we are going to see a significant portion
of that inference workload move to that type of scale?
Is that the right scale?
Should we be looking at 10 megawatt sites, 100 megawatt sites,
one megawatt sites?
How far to the edge do we want to go?
Yeah, absolutely.
And I agree with the premise of that question 100%.
I think that there are two reasons to go to
many smaller
data centers. The first is the one you mentioned,
power provisioning and
connections to the grid. The second is
the fact that you don't need
a massive GPU coordination
for an inference workload.
I guess
the catch might be that if
you are thinking about
your existing edge data centers, maybe you've
got data centers in downtown Los Angeles
or something like that already serving workloads.
Those facilities may not be
configured to handle GPU
and AI compute.
They may have power delivery infrastructure that was optimized for CPUs.
They might have HVAC systems optimized for the much lower power density of CPUs.
So it's not simply a matter of pulling out your CPUs and replacing them with GPUs.
You may have to retrofit the facility itself to support that.
But I agree.
I think finding capacity there may eventually become easier than finding
the next 1,000 megawatts.
Is there any limitation?
I can imagine, I'm trying to think of why you wouldn't do that.
You need to
have a fair amount of memory,
and you need to house all the model weights and so on
in every individual data center,
if you're going to do that at the edge, right?
So is there, there's got to be some minimum viable scale, I assume.
Right, and maybe to give you a sense of the type of data centers
we were talking about in the past, again, in a study that we had done with Meta, we looked at
15 of their data centers before generative AI, and the scale of those facilities was somewhere
between 15 and 50 megawatts, right? So less than 100 megawatts. And certainly, it was
fairly conventional, uncontroversial, to build those sizes of data centers in the past.
So that's the starting point, I think, in terms of the scale.
Now, as you scale down towards, for example, one megawatt, it's not clear at what point things start making less sense.
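For a sense of scale, the short sketch below counts how many sites of a given size it would take to supply the one gigawatt from the thought experiment. The site sizes are the ones mentioned in the conversation; the calculation assumes every site is fully used and ignores spare capacity for redundancy.

```python
import math


def sites_needed(total_mw: float, site_mw: float) -> int:
    """Number of equally sized sites needed to reach total_mw, rounded up."""
    return math.ceil(total_mw / site_mw)


TOTAL_MW = 1_000  # the one-gigawatt thought experiment

for site_mw in (1_000, 100, 50, 15, 10, 1):
    print(f"{site_mw:>5} MW sites: {sites_needed(TOTAL_MW, site_mw)}")
```

At the 15 to 50 megawatt range described as conventional before generative AI, a gigawatt works out to a few dozen sites. At one megawatt it becomes a thousand separate sites, each needing its own grid connection, cooling, and copy of the model weights.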
I guess the other point here, like the way that the data center build out has gone historically, just like the cloud data center build out, it's been fairly clustered in these regions, right?
And there's a reason why Northern Virginia is the data center hub of the world.
And there are others as well, Chicago, Dallas, et cetera, Phoenix.
And that, as I understand it, is largely because the cloud providers needed to offer a certain level of reliability to their customers.
And so they could have redundancy within a given region, and that was helpful to them in terms of what they were offering.
Do you think that this future world, wherein a bunch of inference compute moves to the edge, let's call it 15
to 50 megawatt data centers then, instead of hundreds or thousands of megawatt data centers,
does it look similar?
Is it that you have a small number of regions that have, like, a really high concentration of those 15 to 50 megawatt data centers?
Or could it be much more dispersed because the whole point of this is really low latency and local and you don't need them to be as clustered?
I think there are lots of different aspects at play in terms of
data center siting. I think the redundancy is definitely one of them. And I have trouble
disentangling the role that some of these other factors play as well. Some people talk about tax breaks and
incentives from local counties and states. Some people talk about proximity to internet
exchange points. So not only are you talking about congestion-free power movement, but you're also
talking about congestion-free data movement into and out of the data center. Northern Virginia has
that. And then, of course, the availability of the power itself. I guess I would say that
when you start talking about many of these smaller data centers, from a redundancy perspective,
it might be okay that they're not all geographically clustered, as long as you have a strategy for
rolling over the compute or rolling over the workload to spare capacity somewhere within that
region that has a similar performance profile or some sort of similar latency or delay characteristic.
And so that's really the concern, whether you have robust and geographical redundancy and resilience
there.
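The rollover strategy described above can be sketched as a simple site selector: send each request to the closest healthy site that still has room and meets the latency budget, and fall back to the next one when a site fails. The site names, latencies, capacities, and the `pick_site` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    latency_ms: float    # assumed round-trip latency from the user
    spare_capacity: int  # assumed number of free request slots
    healthy: bool = True


def pick_site(sites, latency_budget_ms):
    """Return the lowest-latency healthy site with spare capacity within budget, else None."""
    candidates = [
        s for s in sites
        if s.healthy and s.spare_capacity > 0 and s.latency_ms <= latency_budget_ms
    ]
    return min(candidates, key=lambda s: s.latency_ms, default=None)


sites = [
    Site("edge-a", latency_ms=4.0, spare_capacity=10),
    Site("edge-b", latency_ms=7.0, spare_capacity=10),
    Site("remote-hub", latency_ms=45.0, spare_capacity=1_000),
]

print("normal operation:", pick_site(sites, latency_budget_ms=20.0).name)

sites[0].healthy = False  # simulate an outage at the nearest site
print("after outage:    ", pick_site(sites, latency_budget_ms=20.0).name)
```

The sites do not need to sit next to each other for this to work. What matters is that some other site within the same latency budget has spare capacity when the nearest one goes down.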
Is this happening?
Like, it's interesting.
I was thinking, okay, so it sounds like you're saying there's not a big downside.
We already have significant inference workloads, so it's not like we're waiting on workloads
to show up that could accommodate this.
And yet, if you look at everybody, most everybody, building data centers, certainly the hyperscalers, and I think the colos and folks as well, you know, the focus continues to be on we got to find big sites for big data centers.
Why don't we see more development of this smaller scale edge AI inference world?
I think it really depends on the workload and the application, and we don't know.
I view AI as a more fundamental basic technology,
and we don't necessarily know what application or capability
will be layered on top of it.
I'd say that we've been talking about edge data centers a lot.
There are other words for this type of data center.
A content distribution network is one of those examples,
a CDN, or a point of presence, a PoP, as these facilities are sometimes called,
and they exist in fairly significant numbers.
Content distribution networks ensure that when you want to access,
for example, New York Times.com or WSJ.com,
your webpage is not being served from the other end of the country.
Those web pages are sitting close to you because the content distribution network
took those updated webpages and moved them to facilities near you,
data centers near you.
Likewise, companies like Meta,
when they have Instagram or when they have these social media applications,
they also have these points of presence that supply data from local points of presence
rather than retrieving content for your feed from across the country.
So we already see that, but these are application level performance requirements,
whether they be for social media or for other sort of news content.
Once it becomes clear what applications of AI
really drive further inference deployments, then we'll know what sort of performance requirements
we need, and what sort of what we call caching techniques or strategies might be useful,
so that we can keep fresher data or more recent, more frequently used models closer to these
users and then serve them more quickly. I think it will become clearer as we see which models
really get traction, which applications really get traction.
Right. So maybe the state of affairs
today is, look, anybody who's developing data centers, we know we need the big centralized
data centers because there is currently, essentially endless demand to train models, at least relative
to the availability of compute today.
And so we know we need to build the big centralized ones.
We might as well use those big centralized ones that we know we need right now for inference workloads,
such as they are today.
But we don't have enough certainty yet about what the inference workloads are going to be
long term to invest that kind of capital and time expenditure that it would take to build out the
network of 1-to-10-megawatt data centers in a particular geographic region, something like that.
That's right.
And I would say maybe that my crystal ball is as clear as anyone else's crystal ball,
but I feel like there's a huge amount of GPU capacity being discussed in the pipeline
in these large data centers.
And if it turns out that maybe there are diminishing returns from training larger and larger models,
or maybe we run out of data because we've exhausted all the data that's available on the Internet,
when those things happen, it may be that demand for these GPUs in these largest data centers will flatten out,
and we're going to have spare capacity, at which point, as you say,
they will be used or repurposed to serve inference.
And then it will be hard to make the case for building yet
more data centers, smaller ones with GPUs closer to the users.
I think the catch there will be if one of these model providers or one of these application
developers makes performance a distinguishing feature of their offering, right?
If they start competing on performance rather than on capability, then we're going to see,
well, I may have a thousand GPUs in the middle of Nebraska that are already deployed,
but if I really want to break into the San Francisco market, I've got to build my GPUs right there
and have them available.
All right, so speaking of performance,
let's transition to the full extreme version of this,
which is also, I think,
theoretically the most disruptive from an energy perspective,
which is shifting any significant portion
of these inference workloads
all the way onto the device.
Either skip the middle ground of edge,
five megawatt data centers or 15 or 50,
or include them,
but shift workloads that would have gone
to a big data center,
that requires a lot of power, straight onto your iPhone or your iPad or whatever it is.
And we've heard some glimmers of this as well.
Give me the similar sort of like pros and cons of shifting that workload straight onto the device.
Right.
Pros primarily two things.
One is performance, right?
You don't have to go across the Internet.
The model is right there and the compute is right there.
Assuming that you get really capable hardware on your device as well,
you get really quick responsive answers from your AI.
The second is also something we've mentioned earlier, which is the notion of privacy.
You don't necessarily need to send your data out into this hyperscale data center
where it gets blended with lots of other user data and you have fewer guarantees about what happens to it.
Localized compute is certainly more private than compute on shared systems.
So those are the two key advantages.
And then I guess third would be that it gets more tightly integrated with the capabilities
on a particular platform.
So, for example, Apple's ecosystem.
Right.
And Apple seems like the obvious candidate to do this clearly.
You mentioned privacy.
Apple is particularly focused on privacy.
They have the hardware, the device, right?
Like Apple is notoriously, or at least reputationally, behind in the AI race.
And so, like, it's not hard to picture that, like, if somebody
is going to move a lot of this inference on device,
It's going to be Apple.
Okay, but there is a real trade-off here, I assume.
Yes, and the trade-off is primarily,
it's primarily with respect to the capabilities of the device.
So if we have a very large model,
we're going to have to deploy that model
on a much more capable hardware platform than we've got today.
This means having some number of gigabytes of memory
to hold the model weights,
and then also some additional gigabytes of memory
to hold the context as you develop this conversation with the model.
In addition to the memory, you're also going to need the compute.
You're not going to have this high-performance GPU sitting inside your phone.
So you're going to have to have specialized chips.
Those specialized chips in your hand are going to be less powerful or less capable
than the ones in the data center.
So all of this speaks to not getting exactly the same model
that you would get in the data center.
You would get a shrunk-down model.
Maybe in the data center, you would have a trillion parameters,
this massive GPT-5 model, for example.
But on a personal consumer electronics device,
you might only have 7 billion parameters,
so orders of magnitude smaller.
And that smaller model will be less capable.
It will give you less capable answers.
It will be capable of doing fewer tasks.
But maybe that's okay,
because you've identified only a handful of tasks
that you really care about on your personal device.
So that is really the trade-off.
As you go towards the device,
you're going to have to shrink the size of the model down.
You're also going to get less and less capability out of your AI.
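To put rough numbers on the memory point above, here is a minimal sketch that estimates the memory needed just to hold model weights. It uses the two parameter counts mentioned in the conversation (a trillion and 7 billion). The 16-bit and 4-bit precisions are illustrative assumptions, not figures from the episode, and the extra memory for conversation context is left out.

```python
# Rough weight-memory estimate: parameters x bits per parameter.
# The bit widths below are illustrative assumptions, not figures from the episode.
# Memory for the conversation context would come on top of these numbers.

def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Return gigabytes (10**9 bytes) needed to hold the model weights alone."""
    return num_params * bits_per_param / 8 / 1e9

MODELS = {
    "data center model (1 trillion parameters)": 1_000_000_000_000,
    "on-device model (7 billion parameters)": 7_000_000_000,
}

if __name__ == "__main__":
    for name, params in MODELS.items():
        for bits in (16, 4):
            gb = weight_memory_gb(params, bits)
            print(f"{name} at {bits}-bit: {gb:,.1f} GB")
```

Under these assumptions the 7-billion-parameter model needs on the order of a few gigabytes to low tens of gigabytes, while the trillion-parameter model needs hundreds to thousands, which is the gap Ben is describing.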
The final thing, of course, is the power and energy profile.
At data center scale, we care primarily about power
because power influences infrastructure and power delivery
and influences thermal management and so on.
For device-level compute,
there are two considerations.
We care about energy rather than power
because that affects battery life.
So even if you could deliver
a really capable GPU chip
onto your phone,
the question is, how long would your phone last
if you were using that chip
on a fairly consistent basis?
So the energy aspect
will continue to be challenging,
and then the thermal aspect
will also be challenging
if you have a really powerful device
that's going to be a hot brick inside your pocket.
And that's going to be a deal breaker as well.
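The energy-versus-power distinction can be made concrete with a small sketch: battery life is stored energy divided by average power draw. Every number here (a 15 Wh battery, a 1 W baseline, an extra 4 W for an AI chip) is a hypothetical value chosen for illustration, not a measurement of any real phone or chip.

```python
# Battery life as stored energy divided by average power draw.
# All numeric values are hypothetical and chosen only for illustration.

def battery_hours(battery_wh: float, avg_power_w: float) -> float:
    """Hours of runtime for a battery of battery_wh watt-hours at avg_power_w watts."""
    if avg_power_w <= 0:
        raise ValueError("avg_power_w must be positive")
    return battery_wh / avg_power_w

BATTERY_WH = 15.0      # assumed phone battery capacity
BASELINE_W = 1.0       # assumed average draw without on-device inference
AI_CHIP_EXTRA_W = 4.0  # assumed extra draw while the AI chip is busy

if __name__ == "__main__":
    print(f"Without inference: {battery_hours(BATTERY_WH, BASELINE_W):.1f} hours")
    busy = battery_hours(BATTERY_WH, BASELINE_W + AI_CHIP_EXTRA_W)
    print(f"With sustained inference: {busy:.1f} hours")
```

The same extra watts that shorten battery life also have to leave the device as heat, which is the "hot brick" problem mentioned above.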
So when you say deal breaker,
is there progress toward on-device inference?
I mean, to your point on performance,
that strikes me as like, okay, this is now,
we're now again in the context of like specific workloads.
Certain types of workloads,
like a 7-billion-parameter model might be fine.
And others, it wouldn't be.
And so maybe there will be
an on-device chip and some inference that you could do on-device,
but you know, you pull up your ChatGPT app or whatever,
and of course it's going to send you back out to the cloud,
or maybe to the edge.
But, you know, these other challenges of thermal management,
things like that are hardware challenges.
Where are we in the progression of on-device inference?
Is it coming? Is it not coming? Do we not know?
I think the assumption with on-device inference
is that you'll be able to shrink the model without loss in performance for the tasks you care about.
That is the primary strategy that computer scientists have been taking.
On the hardware side, we have made strides in developing custom chips, custom silicon,
for the specific types of tensor algebra that are required for machine learning models.
So we know how to build those chips,
and that gives us energy-efficient compute, higher performance.
We know how to build really capable memory systems or solid-state disks.
So when your phone now has hundreds of gigabytes of memory on it
or hundreds of gigabytes of storage on it,
there's a question of, well, maybe you'll end up using less of it for your photos
and more of it for your AI model, something like that.
So I think there are fairly significant resource constraints,
but I don't think that they are insurmountable in the sense that more intelligent hardware design
and more intelligent hardware management could go some ways in terms of making these AI models feasible on the device.
Okay, so I'm going to put you on the spot, and we promise not to hold you to these numbers,
but just to give a sense of where we think things are heading.
If we're fast-forwarding 10 years, right, let's just say we're in 2035, and imagine there's a total volume of inference compute
in the world or whatever
that's, let's just say
is 100 megawatts total,
what would be your guess
of the ranges
of how much of that compute
is going to take place
in large centralized data centers
versus at the edge?
Let's draw a line,
let's say, you know,
100 megawatts and above
is large centralized,
sub-100 megawatts,
but not on device,
is edge,
and then the third category,
of course, being on device.
Like, how much of it
can go anywhere
but the centralized data centers?
So I would go straight to this idea of having an 80/20 rule,
because we see this all the time in computer systems,
where you have 20% of your tasks being extremely popular.
Maybe there are 20 things that you always want to do,
and you spend 80% of your AI compute doing those things.
That could be email processing.
That could be photo analysis.
So we can identify what those really compelling applications and tasks are,
and we're going to be spending most of our time doing that.
And then for the remainder of the long, long, heavy tail of other tasks that people might want to do,
there will always be backup capabilities residing in the cloud data center.
So I would say that we could be getting 80% of our compute done locally
and leaving 20% of the heavy lifting or the more esoteric,
the more corner case compute for the data center cloud.
That is, of course, excluding the training.
The training will continue to all reside in the massive facilities.
But in terms of the inference, I think there's huge potential.
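One way to see how a small set of popular tasks can dominate demand is a skewed popularity curve. The sketch below assumes task popularity follows a Zipf-like distribution (the task at rank r gets weight 1 / r**exponent). That distribution and its exponent are assumptions made for illustration and are not something stated in the conversation. It computes the share of requests that go to the most popular 20% of tasks.

```python
# Share of total demand captured by the most popular tasks, assuming a
# Zipf-like popularity curve. The distribution is an illustrative assumption.

def top_share(num_tasks: int, top_fraction: float, exponent: float = 1.0) -> float:
    """Fraction of all requests that go to the top `top_fraction` of tasks."""
    weights = [1 / rank ** exponent for rank in range(1, num_tasks + 1)]
    top_n = max(1, round(num_tasks * top_fraction))
    return sum(weights[:top_n]) / sum(weights)

if __name__ == "__main__":
    for exponent in (0.0, 1.0, 1.5):
        share = top_share(num_tasks=100, top_fraction=0.2, exponent=exponent)
        print(f"exponent={exponent}: top 20% of tasks get {share:.0%} of requests")
```

With an exponent of 0 every task is equally popular and the top 20% of tasks get exactly 20% of requests. As the exponent grows, the head of the distribution takes a larger share, which is the kind of skew the 80/20 intuition relies on.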
Right.
But, yeah, that's actually a very significant shift.
If 80% of the inference workload, appreciate that that doesn't include training,
but still, if 80% of the inference workload could end up local,
that's a significant shift.
And has pretty profound implications for the energy picture as well.
Are you saying that 80% just to pin you down even a little bit
more, is that local in the sense of being at the edge or is that local in the sense of being
on device? Or like, what do you think the split ends up being there?
Yeah, so I think of the 80%, I would say most of that will be on the edge.
Like I suspect it is today, I think that if you look at what we talked about earlier,
content delivery networks, points of presence, they've probably identified 20% of the content
that 80% of the people will be looking at most of the time
and they're putting it at the edge.
I think maybe on the order of 1%
ends up being put on your consumer electronics.
Actually, even for today's compute,
when we set aside AI,
there is a trend towards consumer electronics
hiding that flow of data back and forth
between the device and the edge for you.
So sometimes,
if you use a cloud storage service, like Dropbox, or if you're using a photo
storage service, they will let you pretend that you have access to all of your videos or all of
your photos and all of your documents, and they will transparently, behind the scenes,
move things back and forth between the data center and your local device. So you may think you
have all of it, but maybe you've only got a tiny sliver, less than 1% on your local
device.
Right. In my Box instance, certain things
open up much faster than others when I try to open them.
And it's occurred to me that that is why.
If I step back then, okay, so it sounds like what you're saying,
in this scenario you're painting of the future,
is that roughly 80% of the inference workloads are edge,
very little of it actually on device,
and then the other 20% or so sitting in cloud, big cloud data centers.
So when I think about the energy implications of that,
there's, I think, a couple ways to think about it that are pretty interesting.
One is this, okay, so maybe a fair amount of the energy consumption of at least inference compute
is going to shift to these 5 megawatt, 15 megawatt, 50 megawatt type local sites.
That has big implications for the grid.
In ways that are, I don't know, both good and bad, probably, harder to manage, in some ways, easier to manage in other ways.
But the overall energy consumption of inference compute, I would expect, and you can tell me if I'm wrong, would actually be higher in this scenario than it would be if it was all centralized because I assume the PUE that you get for these edge data centers isn't quite as good as it is for the large centralized data center.
So like on balance, this probably means more overall AI energy consumption.
Do you think that's right?
Yes. I think you get economies of scale
when you go to a gigawatt or two gigawatts.
You have a single facility,
you're managing it in a highly optimized coordinated way,
and you've got hundreds of thousands of these machines
all managed very precisely.
I think as you shrink the system down,
you will lose efficiency.
You will be trying to build these 20 megawatt data centers
in maybe footprints or facilities
that weren't designed initially for those workloads.
So, yes, I think total energy costs
may go up as a result.
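The PUE argument can be sketched with simple arithmetic. PUE (power usage effectiveness) is total facility energy divided by the energy used by the IT equipment, so facility energy is IT energy times PUE. The PUE values and the IT energy figure below are hypothetical numbers picked for illustration and do not come from the episode.

```python
# Facility energy = IT equipment energy x PUE.
# The PUE values and IT energy below are hypothetical, for illustration only.

def facility_energy_mwh(it_energy_mwh: float, pue: float) -> float:
    """Total facility energy for a given IT energy and PUE (PUE is at least 1.0)."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_energy_mwh * pue

IT_ENERGY_MWH = 1_000.0   # assumed inference IT load, the same in both cases
CENTRALIZED_PUE = 1.1     # assumed PUE for a large optimized facility
EDGE_PUE = 1.5            # assumed PUE for a smaller, less optimized site

if __name__ == "__main__":
    central = facility_energy_mwh(IT_ENERGY_MWH, CENTRALIZED_PUE)
    edge = facility_energy_mwh(IT_ENERGY_MWH, EDGE_PUE)
    print(f"Centralized: {central:,.0f} MWh")
    print(f"Edge:        {edge:,.0f} MWh")
    print(f"Extra energy at the edge: {edge - central:,.0f} MWh")
```

For the same inference work, a higher PUE at the edge translates directly into more total energy, which is the effect Shayle and Ben agree on above.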
We're talking about inference workloads to some extent as a monolith.
I'm sure they are not.
So are there big distinctions in your mind in terms of the different types of inference workloads
and how that influences where they should be housed?
Right.
Yes.
So that's a really great question, actually.
I would say that there are fundamental limits on the number of inference queries a human user
can actually produce because ultimately we're
limited by the speed of our typing, the number of tokens we can actually produce to query
the models.
So there is some of that where humans will continue to send requests to agents.
But I think increasingly, most of the inference workload will come from other software
agents.
This could be a search engine retrieving web pages and then asking the large language model
to summarize them into a coherent discussion
for you. This could be your photo app learning something about your images, or this could be your
mail app, doing something with the mails and helping you compose messages. So all of that is
done behind the scenes, and those inference workloads are potentially much larger because,
of course, software can generate those requests at much, much higher rates. From the perspective
of where that computation happens, to the extent that the data center already has servers
running your mail workloads, or to the extent that your search engines are already running
in the same data center, the communication with the model will be a bottleneck, right?
So if you have a data center in Nebraska running your search engine for you or doing some
of these other big heavy lifting, heavy software jobs, then potentially they could query and
execute inference in these largest hyperscale data centers.
All right, Ben, this was super interesting.
Really appreciate your time.
It was my pleasure. I really enjoyed the conversation. Thanks so
much.
Dr. Ben Lee is a professor of electrical engineering and computer science at the University of
Pennsylvania. He's also a visiting researcher at Google. This show is a production of Latitude Media.
You can head over to LatitudeMedia.com for links to today's topics.
Latitude is supported by Prelude Ventures.
This episode was produced by Daniel Woldorff, mixing and theme song by Sean Marquand.
Stephen Lacey is our executive editor.
I'm Shayle Kann, and this is Catalyst.
