No Priors: Artificial Intelligence | Technology | Startups - The marketplace for AI compute with Jared Quincy Davis from Foundry
Episode Date: August 22, 2024
In this episode of No Priors, hosts Sarah and Elad are joined by Jared Quincy Davis, former DeepMind researcher and the Founder and CEO of Foundry, a new AI cloud computing service provider. They discuss the research problems that led him to starting Foundry, the current state of GPU cloud utilization, and Foundry's approach to improving cloud economics for AI workloads. Jared also touches on his predictions for the GPU market and the thinking behind his recent paper on designing compound AI systems. Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @jaredq_ Show Notes: (00:00) Introduction (02:42) Foundry background (03:57) GPU utilization for large models (07:29) Systems to run a large model (09:54) Historical value proposition of the cloud (14:45) Sharing cloud compute to increase efficiency (19:17) Foundry's new releases (23:54) The current state of GPU capacity (29:50) GPU market dynamics (36:28) Compound systems design (40:27) Improving open-ended tasks
Transcript
Hi, listeners. Welcome to No Priors.
Today we're talking to Jared Quincy Davis, the founder and CEO of Foundry.
Jared worked at DeepMind and was doing his Ph.D. with Matei Zaharia at Stanford before he began his mission to orchestrate compute with Foundry.
We're excited to have him on to talk about GPUs and the future of the cloud. Welcome, Jared.
Thanks, Sarah, great to see you. And thanks, Elad, as well.
Yeah, great seeing you.
The mission at Foundry is directly related to some problems that you had seen in research and at DeepMind. Can you talk a little bit about the genesis?
A couple of the most inspiring events I've witnessed in my career so far were the release of AlphaFold 2 and also ChatGPT. I think that one of the things that was so remarkable to me about AlphaFold 2 is that initially it was a really small team, you know, three and then later 18 people or so.
And they solved what was kind of a 50-year grand challenge in biology, which is a pretty remarkable fact, you know, that every university, every pharma company hadn't solved. And similarly with ChatGPT, a pretty small team, OpenAI was 400 people at the time,
you know, released a system that really shook up the entire global business landscape.
You know, that's a pretty remarkable thing, and I think it's kind of intriguing to think about what would need to happen for those types of events to be a lot more common in the world. And, you know, although those events are really amazing because of the small numbers of people working on them, neither is quite the David and Goliath story that it appears to be when you double click. In OpenAI's case, you know, there were only 400 people, but they had $13 billion worth of compute, you know, which is quite a bit of computational scale there. And in DeepMind's case, it was a small team, you know, but obviously they were standing on the shoulders of giants in some sense with Google, right, and the leverage that they had via Google. And so one thing that we thought about is, you know, what can we do to make the type of computational leverage and tools that are currently exclusively the domain of OpenAI and DeepMind kind of available to a much broader class of people?
And so that's a lot of what we worked on with Foundry, saying, can we build a public cloud, you know, built specifically for AI workloads, where we reimagine a lot of the components that constitute the cloud end to end from first principles?
And in doing that, can we make things that currently cost a billion dollars cost $100 million, and then $10 million, over time?
And that'd be a pretty massive contribution.
I think it would increase the frequency of events like AlphaFold 2 by 10x, 100x, 1,000x, or maybe even more, superlinearly.
And we're already starting to see the early signs of that,
but quite a lot of room left to push this agenda.
So really exciting.
So that's kind of maybe an initial introduction preamble
to how we thought about it.
And I can trace that line of reasoning a bit more,
but that's kind of part of what we've done.
Jared, for anybody who hasn't heard of Foundry yet,
what is the product offering?
Yeah.
So Foundry, we're essentially a public cloud
built specifically for AI.
And what we've tried to do is really reimagine all of the systems undergirding what we call the cloud end to end from first principles for AI workloads.
And we've started to do this a bit of a new way.
I think the AI offerings from the existing major public clouds and kind of some new GPU clouds haven't really re-envisioned things.
And by thinking about a lot of these things a bit anew, we've been able to improve the economics by, in some cases, 12 to 20x over lower-tech GPU clouds and the existing public clouds. And, you know, partially based on some of these products that we'll talk about today that we're releasing, and a lot of new things that we're working on, we think we can push that quite a bit further as well.
And so, you know, our
primary products are essentially infrastructure as a service, so our customers
come to us for elastic and really
economically viable access to state-of-the-art systems.
And also a lot of tools to make leveraging those systems
really seamless and easy. And we've invested quite a bit in things like
reliability, security,
elasticity, and just the core
price performance. How underutilized
are most GPU clouds today? And I think
there's almost three versions of that. There's things on hyperscalers
like AWS or Azure. There's large clusters or clouds
that people who are doing large scale model training or
inference run for themselves. And then there's more just like
everything else. It could be a hobbyist. It could be a research lab.
It could be somebody with just, you know, some GPUs
that they're messing around with. I'm sort of curious, for each one of those types of or categories of users, what is the likely utilization rate, and how much more do you think it could be optimized?
Is it 10%? Is it 50%? Like, I'm just very curious.
One of the most, I'd say, positive cases is the one with the highest utilization, which is the case where you're running kind of an end-to-end pre-training job, right? And so that's the case where you've done a lot of work up front. You've designated a time that you're going to run this pre-training workload for, and you're really trying to get the most utilization out of it.
And for a lot of companies, utilization, even during this phase, you know, is sub-80%.
So why? One reason is actually that these GPUs, particularly the newer ones, actually do fail a lot, as practitioners would know.
And so one of the consequences of that is that it's very common now to hold aside 10 to 20% minimum of the GPUs that a team has as buffer, as healing buffer, in case of a failure, so you can slot something else in to keep the training workload running.
Right? And so even for a lot of the more sophisticated orgs running large pre-training at scale, the utilization is sub-80%, sometimes less than 50%, actually, depending on how bad of a batch they have and the frequency of failure in the cluster. And even in that case, there are often also large gaps and intermissions between training workloads, even if the GPUs are dedicated to a specific entity, you know. And so even in those most conservative cases, and we'll come back to the less conservative, extreme cases, utilization really can be quite a bit lower than people would imagine. So we can pull on that case a bit more because
I think it's actually quite counterintuitive and really interesting. I think there's a really
fundamental disconnect between people's mental image of what GPUs are today and what they actually are. I think that in most people's minds, you know, GPUs are chips, right? And we talk about them as just chips, but actually the H100 systems are truly systems. You know, they're 70 to 80 pounds, with 35,000-plus individual components. They're really kind of monstrosities in some sense. And the remarkable thing that I think Jensen and Nvidia have done, one of many, is they've basically taken an entire data center's worth of infrastructure and compressed it down into a single box. And so if you look at it from that perspective, the fact that it's 70 pounds isn't quite as alarming. But these are really
gnarly systems. And when you end up composing these individual systems, these DGXs or HGXs, into large
supercomputers, what you're often doing is you're interconnecting thousands, tens of thousands,
hundreds of thousands of them. And so the failure probability kind of multiplies. And because you have millions, perhaps, of eventual components in this supercomputer, the probability that it will run for weeks on end without a failure, and this is basically a verbatim quote from Jensen's keynote, is basically zero. That's a little bit of a challenge, and funnily enough, I think the AI infrastructure world is still somewhat immature, and so it doesn't have perfect tooling, broadly speaking, to deal with
these types of things. And so one of the compensatory measures I think people take is
reserving this healing buffer, for example. I think that disconnect maybe helps explain
why these things fail. And it's actually, funny enough, the newer, more advanced systems
anecdotally fail a lot more than historical systems that were worse in some ways.
And do you think that's just, like, a quality control issue for those systems, or do you think it's just some form of complexity, with some failure rate per component that compounds?
What do you think is the driver of that?
I think it's more of the complexity has grown, right?
And we're in a different regime now.
I think that it's fair to say, so maybe stepping back again to definitions, we throw the term large around a lot in the ecosystem. I guess one question is, what does large mean? And one useful definition of large that I think roughly corresponds to what people mean when they invoke the term is that, for a large language model, you enter the large regime when essentially the amount of memory necessary to contain even just the model weights starts to exceed the capacity of even a state-of-the-art single GPU or a single node. I think it's fair to say you're in the large regime when you need multiple state-of-the-art servers from Nvidia or from someone else to even just contain the model, you know, to basically run the training, or definitely even just to contain the model. That's definitely the large regime. And so the key characteristic of the large regime is that you have to somehow orchestrate a cluster of GPUs to perform a single synchronized calculation. Right? And so it becomes a bit of a distributed systems problem. I think that's one way of characterizing the large regime.
Now, a consequence of that is that you have many components
that are all kind of collaborating
to perform a single calculation.
And so any one of these components failing
can actually potentially, you know,
lead to some degradation or challenge downstream, or, I mean, stop the entire workload, right?
You know, you've probably heard people talk a lot about InfiniBand, and the fact that part of Nvidia's advantage comes from the fact that they do build systems that are also state-of-the-art from a networking perspective, right?
And their acquisition of Mellanox was one of the better acquisitions of all time, arguably,
from a market cap creation perspective.
And the reason that they did this is because they realized
that it would be really valuable to connect many, many machines
into a single, almost contiguous supercomputer
that almost acts as one unit.
Yeah, and the challenge of that, though, is that there now are many, many, many more components and many more points of failure.
And these things kind of, you know, the points of failure kind of multiply, so to speak.
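A minimal sketch of why the points of failure multiply, assuming a made-up per-node failure rate (not a figure from the episode) and that any single failure interrupts the synchronized job:

```python
# Sketch: if any one node failing interrupts a synchronized training job,
# the chance of an uninterrupted run collapses as the cluster grows.
# The per-node daily failure probability below is a hypothetical illustration.

P_NODE_FAIL_PER_DAY = 0.001   # hypothetical: 0.1% chance a given node fails on a given day

def p_uninterrupted(num_nodes: int, days: float) -> float:
    """Probability that no node fails over the whole run (independent failures assumed)."""
    return (1 - P_NODE_FAIL_PER_DAY) ** (num_nodes * days)

for nodes in (100, 1_000, 10_000):
    print(f"{nodes:>6} nodes, 14-day run: "
          f"{p_uninterrupted(nodes, 14):.6%} chance of zero failures")
```

This compounding is the intuition behind holding back the 10 to 20% healing buffer Jared mentions.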
You implied that, like, GPUs, well, you described GPUs as this unique asset that is more CAPEX than OPEX, and that the hyperscalers, like an Amazon, are making a certain assumption about what that depreciation cycle is.
Like, where do you think the assumptions for foundry versus those hyperscalers or versus, let's say, like, a core weave are different?
This opens up a pretty interesting conversation around, like, what is cloud?
I think we've kind of forgotten in this current moment what cloud was originally supposed to be, and what its value proposition was intended to be.
I think current AI cloud is not cloud in the originally intended sense by any means.
So we should pull on that thread.
But I'd say right now it's basically co-location.
Yeah, it's basically co-location, right?
It's not really cloud.
Yeah.
Maybe, yeah, it'll be, it's definitely worth pulling on that thread a little bit.
Yeah, do you want to break that down for sort of our listeners in terms of what you'd view as the differences?
First, I guess, there's a little bit of context for people.
The cloud, as we currently know it, is arguably one of the, you know, most important business categories in the world.
That's, I think, pretty clear.
The biggest companies in the world, they either are clouds, with Azure, AWS, GCP being core components of the biggest companies in the world, or they're Nvidia, who sells to the clouds, obviously.
So it's clearly an important category.
AWS is arguably a trillion dollars plus of market cap if you broke it out from Amazon.
At one point, relatively recently, it was all of Amazon's profit and more.
So it's an important category, to say the least.
Cloud as we know it today really started in 2003, which is when Amazon, you know, dedicated 50 people initially, I believe, to start working on AWS.
And they worked on this for three years within Amazon and launched in March 2006 with S3.
Later that year, in September, they launched EC2.
And that kind of was the beginning of the cloud.
It took quite a while for this model to catch on, and people were,
not quite clear on why this would be useful, even until quite recently.
And so in 2009, 2010, you know, Matei, Sarah, who you mentioned, wrote this paper called Above the Clouds with some collaborators at Berkeley, a Berkeley view of cloud computing.
And they talked about why cloud would be a big deal, and I think it was, to say the least, not appreciated that the points they were making were valid at the time.
I think it was kind of, you know, not clear to people, to say the least.
You know, fast forward a few years: Bezos in 2015 said AWS was, for all intents and purposes, unconstrained in market size.
And he was kind of laughed at by many.
That was seen as a really ludicrous statement to make.
You know, fast forward, no one's laughing.
In 2019, I remember people saying, like, it's not clear if cloud would be a big deal. Snowflake hadn't yet scaled, you know, Databricks hadn't yet scaled.
I wonder if it's worth making a distinction on these things because, you know,
I agree with parts of what you're saying, but, you know, I was building startups in that era.
And at the time, there was a pretty strong belief that at least for a startup,
these clouds were incredibly useful, because it used to be that you'd spend a lot of money and effort on setting up your server racks and wiring everything up and all the rest. And then, you know, there was the emergence of things like AWS, and nobody expected it to be Amazon, right? It just didn't fit what people thought of it: it was basically an e-commerce marketplace company, right? They were suddenly building infrastructure and providing it. It felt very natural for somebody like Google to provide that. But I think for startups, because I was doing a startup in 2007, you know, we thought that it was magic, right? Because suddenly, and not all the services were there yet and everything else, but suddenly you didn't have to deal with all the infra. And there were a number of companies that started before that, like Twitter and others, who kind of ended up having to continue to build and maintain their own clouds. And it was really brutal. And then, to your point, I think the big transition was the degree to which enterprises, particularly in regulated services or financial industries, kind of fought it initially. And then they started adopting it. So I agree with a lot of what you're saying. I wouldn't say that it was one of those things that nobody believed in, though. I actually felt there was a lot of buy-in and a lot of belief and, you know, adoption.
Definitely not nobody. I think, you know, prescient people and a lot of
builders really saw the value really early, particularly startups in the VC community. You know,
I think Berkeley and some of the researchers there really got it early, like 2009 and earlier, you know,
by 2015, I think it was already around $3 billion of run rate. So it was not small. It was very far
from what it is today, but it was not small. They didn't break it out yet, but it was, you know,
not small at all. It was a meaningful thing. I think though people didn't recognize it would
become anything like it is today. And it still wasn't clear.
to people what the value proposition was.
You may recall there was, you know, Dropbox, for example, which famously exited the cloud and went back on-prem.
And that story, they published why they did that and the economics of it,
and that caught a lot of attention.
And people were saying, I'm not sure if cloud actually makes sense anymore, et cetera.
And I think people kind of lost track of that.
Yeah, I think one of the insightful things that some of the early cloud systems companies like Databricks and Snowflake recognized was one of the value propositions of the cloud that was really special.
I'll actually use one of the early quotes from, well, Snowflake to illustrate this. They were trying to think about what was unique about the cloud, what you could do in the cloud that you couldn't do anywhere else. And the killer idea that they converged on was that, fundamentally, the cloud made fast free. Quote unquote, fast was free in the cloud, as they put it explicitly. And the idea was,
if you have a workload that was designed to run for 10 days on 10 machines, in the cloud,
you could theoretically run it on 100 machines for one day, you know, or 10,000 machines for 15
minutes. And that would cost the exact same. And so you could run it a thousand times faster for the
same cost theoretically. And it's like, well, that's kind of a big deal, right? You can run something
a thousand times faster for the same cost, you know, and then give the compute back. Now, what that
would require, though, is a number of things to make that actually work, one being that you'd have to be able to kind of reshape a workload that was designed to run on 10 machines to run on 10,000 machines, which is not trivial. And another is that you actually have to have the 10,000 machines' worth of capacity in the cloud and make sure that was utilized to make the economics kind of work, you know.
But yeah, if you could do those two things, it'd be a really, really big deal.
That's "fast is free" in the cloud.
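A tiny sketch of that arithmetic; the hourly rate is hypothetical, while the 10-machines-for-10-days workload is the example from the conversation:

```python
# "Fast is free": the same number of machine-hours costs the same,
# whether you spread the work thin or burst it wide. Rate is made up.

RATE_PER_MACHINE_HOUR = 3.0            # hypothetical $/machine-hour
TOTAL_MACHINE_HOURS = 10 * 10 * 24     # the 10-machines-for-10-days workload

for machines in (10, 100, 10_000):
    wall_clock_hours = TOTAL_MACHINE_HOURS / machines
    cost = TOTAL_MACHINE_HOURS * RATE_PER_MACHINE_HOUR
    print(f"{machines:>6} machines -> {wall_clock_hours:7.2f} hours wall clock, cost ${cost:,.0f}")
```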
And so one of the key ideas was this elasticity, right?
And I think that's one of the things that's really absent in AI Cloud today.
In AI Cloud today, you're kind of forced to get really long-term reservations,
often three years for a fixed amount of capacity.
No one really wants 64 GPUs for three years or 1,000 GPUs for three years.
You want 10,000 for, you know, a couple of months, maybe nothing for a little while.
Maybe you don't know how much you need because you're launching a new product and you're not sure how much demand there will be for it and how much inference capacity you'll need, et cetera. It's very, very challenging if you have to reserve the total
amount that you may need up front for a long duration. I want to go back to this idea of like
it's actually hard for a lot of engineering teams, especially younger ones, to picture like a pre-cloud
world, because their experience with it is, like, you know, I have virtual machines on Amazon, they're limitless, like maybe it's serverless. And, like, if you go all the way back to what Elad was talking about, you gave us a great historical view. But if you look at, like,
functionally, like I had, you know, I had my servers in my closet on prem, right? And then I had
co-location, which is like, I have my servers. There's still my servers. I control them and I
manage them. But I, they physically live in somebody's data center where they're offering me like
real estate, like cooling and power. Right. And then you had hosting, which is like,
like, I'm buying a machine in a data center, like reservations for a long time, essentially,
or for the life cycle of the machine.
And then you had virtualization and containerization and all of these services that came out of this,
like, you know, separation into cloud services where you have higher level functions
with scheduling orchestration.
And, you know, serverless is like, I'm just going to write the logic, and you deal with it and place that workload.
And, like, you know, we obviously haven't gotten to an endpoint in non-AI computing.
But I feel like the engineering world is so used to being over here,
whereas in the AI hardware resource world, we're, like, still somewhere between, you know, colo and hosting.
That's right.
Yeah, and that's challenging.
That means that you have to raise all the capital you may need up front, you know, and that's a challenging model. It also means that, you know, you can't quite, you know, grow the product elastically as demand and interest in it grows. You're kind of bottlenecked by the supply chains in a way that I think developers haven't experienced in quite a while. So it's a pretty challenging state of affairs, I think. And it's also
very challenging from a risk management perspective for these companies, because they're making these big commitments that are potentially, you know, if they don't work out, pretty catastrophic for them, on this hardware, paying all up front, paying for these long-duration contracts, etc. It's a very challenging thing. And there's no analog yet. The markets aren't mature enough that there's any analog to what we have in other domains, like in commodities markets, like wheat, oil, et cetera, where you can buy options and futures and hedge and sell back and things like that. It's still pretty immature in terms of where the market's at. And yeah, I think that's leading to a challenging state of affairs that's going to, you know,
continue to bring a lot of pain for people. And so, you know, there are several things I think
we've employed to do this better. Some are, you know, mixes of kind of business model innovations and technical innovations at the same time. You know, but I think we're making a pretty substantial dent in this, but also in a way that's really viable economically for us. And, you know, it doesn't involve buying all the GPUs and taking undue risks on them per se.
And so that's kind of a lot of what we've tried to do is, can we do something a lot more efficient?
You know, can we find some leverage, some points of leverage to address this problem?
You guys have some new releases as of, I think, a day or two ago.
Can you describe what's just come out from Foundry?
So I think right now, AI Cloud is, the AI cloud business is very much like a parking lot business.
And that sounds really funny because cloud is supposed to be high tech.
And you can hardly conceive of a less sophisticated business, at least on the surface, than parking lots.
And what do I mean by that?
Well, there's fundamentally there are two models in the parking lot business.
One is pay as you go.
For pay-as-you-go, the rates are usurious, and, you know, you may or may not find a space.
I'm sure many of us have had the experience of driving through SF and seeing a "lot full" sign, you know, for lot after lot as we drive around trying to park. And if you do get a spot, you might pay, you know, the $12-an-hour rate or something like that. I'm choosing that
rate because it's the rate of AWS, you know, for on-demand. On the other hand, if you want
to kind of guarantee that you'll have a spot and also have a better rate, you can basically
buy a spot reserved. And so you can have your own reserve parking spot in your building.
Maybe you pay, you know, $4 an hour. So you're getting a massive discount, but it's $3K a month,
effectively, right, which is actually pretty substantial.
And if you're only using it 40 hours a week when you're in the office as a typical worker,
it actually might be effectively $16 an hour as opposed to $12, so it's actually worse.
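Those parking-lot numbers roughly check out as a back-of-envelope; here is the arithmetic, using the rates he quotes and a typical 40-hour week:

```python
# Parking-lot arithmetic: the "cheap" reserved spot, used 40 hours a week,
# can cost more per hour of actual use than the pay-as-you-go rate.

ON_DEMAND_RATE = 12.0                  # $/hour, pay-as-you-go (the "AWS on-demand" rate in the analogy)
RESERVED_RATE = 4.0                    # $/hour, reserved
HOURS_PER_MONTH = 730                  # ~24 * 365 / 12
USED_HOURS_PER_MONTH = 40 * 52 / 12    # ~173 hours of actual use per month

monthly_reserved_cost = RESERVED_RATE * HOURS_PER_MONTH
effective_rate = monthly_reserved_cost / USED_HOURS_PER_MONTH

print(f"Reserved: ${monthly_reserved_cost:,.0f}/month, "
      f"~${effective_rate:.0f}/hour actually used vs ${ON_DEMAND_RATE:.0f}/hour on demand")
```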
And so I think one kind of funny analogy for one of the things that we want to do with a couple of these products is to kind of enable the equivalent of allowing people to park, pay-as-you-go, in someone else's reserved spot.
And that sounds kind of funny, but you can imagine that, okay, that'd actually be an interesting thing.
And if you can do that, then depending on the percentage of the spots that are typically reserved, you might have 10x the effective capacity in a lot.
You know, and then also it can kind of be a win-win.
Instead of the pay-as-you-go person paying 12, you know, they can pay something much lower.
In this case, I'll say seven; it's actually going to be a lot lower.
You know, the person who owns the spot, instead of paying four, can make five, equivalently.
And then the lot can also make a couple of bucks.
And so it's kind of a win-win for everyone.
And you're kind of double-booking the lot, and it's, you know, really, really efficient.
Now, that sounds great, but it wouldn't quite work if you showed up to your reserve spot and there was someone parked there.
That might be a little bit aggravating.
It also might not work if you were forced to call two hours in advance and say, hey, I'm coming.
And then the person who was parked in your spot had to leave their dinner reservation to move their car.
That wouldn't be a fun model.
And so I think one thing that we had to do was kind of create the analog of a system to make this all really convenient.
And so, you know, maybe the V1 of the system was, you came into the lot and a sensor was triggered saying you're here to go to your reserved spot,
and then some valet ran to the car
that was parked there and moved it,
and they maybe moved it
from the second floor to the 10th floor.
And then the person who had parked there previously now comes to the counter, asks the valet where their car is, and gets a ticket saying it's on the 10th floor, but there's no elevator, so they have to walk up the stairs to get it.
I'm stretching this analogy,
but you get the idea to be kind of inconvenient,
and so part of what we did was added more and more convenience features,
which we broadly call spot usability,
and these are a lot of things that we continue to add.
And so, you know,
The scenario that we're at now is basically, you know, you're letting someone else use your spot; you show up, the sensor kind of knows, okay, you're here, and then the car in your spot is automatically moved via conveyor to another spot, and we're managing the spaces to ensure we can move it somewhere.
And then when the person comes to pick up their car, it's kind of brought to them.
Now, yeah, so it's all really convenient.
It's kind of seamless for everyone involved.
It creates a ton more effective space, allows us to get much better economics out of the machines.
And it's really, really helpful for companies.
That's one really powerful, I think, thing that we've done.
and this is kind of offered in the context of a spot product.
I think people are somewhat familiar with spot usage
in the typical cloud context.
It's a lot more challenging to do with GPUs
for a few different reasons,
which is why there are not very many GPUs available on spot, definitely not at scale, and definitely not with interconnect.
So one of the things we want to do is enable this,
and it makes a lot of other things possible
that are pretty neat,
and so it's a mechanism we're employing in a few different ways.
So that's, I think, one thing,
and I'll give some more analogies to explain why that's powerful.
Well, we found that companies are using this type of mechanism quite a bit for everything from training, which has classically been seen as a workload that's difficult for spot, to things like inference, batch inference especially. And,
you know, that actually opens up another interesting conversation about the different
classes of workloads and what they, what each workload needs and cares about and how
this might evolve over time.
That, you know, actually that ties a little bit to the compound AI systems concept.
And also to the Llama 3.1 release, in a funny, tangential way.
But yeah, that's kind of one analogy for the product that we launched on the Foundry Cloud Platform around Spot.
Yeah, I think that, you know, spot usability that's increasingly deep and flexible and automated is, like, a really powerful primitive.
Want to change tack a little bit to just something I think the entire industry, like the tech industry, is very interested in.
We did a little bit of work together a while back just understanding like where, you know, where is the GPU capacity in the world today, right?
all of the different types, how much is it, or like, how consolidated is it? And obviously,
this is near and dear to your business. Can you just describe a little bit, like, where you think
we are? And then, like, what caused, what caused the shortage sort of last year?
One kind of funny bit of trivia that I've posed to a few people, that I think, you know, reveals how off-base a lot of our priors are, is: what percentage of the world's GPU petaflop capacity, or exaflop capacity, is kind of owned by the major public clouds? And I've asked many people, and I typically have gotten guesses, you know, in the high tens of percents. And the only time I got a lower guess was from Satya at Microsoft, who guessed basis points, which actually is correct. You know, it's a very small, small amount.
And maybe as one anecdote to illustrate how this looked at least a couple of years ago, this is an evolving thing, but how it's looked, you know, the example of GPT-3 and its training is, I think, an interesting one. It's a bit dated, but I'll use it just because the types of machines and the numbers there are public,
as is not the case for some of these other systems.
And so GPT-3 was trained on 10,000 V100 GPUs in an interconnected cluster in Azure for about 14.6 days.
To put that in perspective, it was a state-of-the-art system. You know, it was, by many estimates, eight figures for a single run at the time.
So it was a pretty substantial investment by OpenAI.
And that tells you that 10,000 V100s running continuously for 14.6 days is quite a bit of
compute. I think one interesting question, maybe to your question then, is: how many equivalent GPUs, normalized in terms of the number of flops, you know, they weren't fully interconnected, but just as an interesting processing measure anyway, were there in the Ethereum network at the peak of Ethereum? And so I kind of ask people this question, and
it's fun to see people guess. Can I solicit a guess, actually, Elad? Can you make a guess there?
You might know. So, um, because we've talked about this in a prior conversation, I'm not allowed to guess.
Yeah. So, Elad, do you know by chance?
For Ethereum?
Yeah, for Ethereum. And don't think too hard. Just make a guess based on priors.
When? So when it first launched?
The very peak, the tippy top of Ethereum: how many V100 equivalents were there, given there were 10,000 for two weeks for GPT-3? How many were there in Ethereum? Noting, by the way, that these were running 24/7 in Ethereum, so you can modulate your guess based on that.
I would guess a few hundred thousand, a few million. That's an aggressive guess. Yeah, that's an aggressive guess,
and you're actually very correct. It was about 10 to 20 million. Yeah. Which is quite a
substantial scale. And you can, by the way, sanity-check this really easily by looking at, you know, basically the hash power in Ethereum at the peak, which was around 900 terahashes per second, I believe, about a petahash per second, so quite a bit of peak hash power. And a typical V100 will give you between, I believe, 45 and 120 megahashes per second if you really know what you're doing. So that's kind of one estimate.
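That 10-to-20-million figure falls out of the hash-rate arithmetic he just described; a quick sanity-check sketch using the numbers quoted in the conversation:

```python
# Sanity check of the V100-equivalents estimate, using the figures quoted above
# (~900 TH/s of peak Ethereum hash power, ~45-120 MH/s per well-tuned V100).

PEAK_HASHRATE_MHS = 900e6          # 900 TH/s expressed in MH/s
V100_MHS = (45, 120)               # MH/s per V100, depending on tuning

for mhs in V100_MHS:
    print(f"At {mhs} MH/s per V100: ~{PEAK_HASHRATE_MHS / mhs / 1e6:.1f} million V100-equivalents")

# For scale, the GPT-3 run quoted earlier: 10,000 V100s for 14.6 days
print(f"GPT-3 run: {10_000 * 14.6:,.0f} V100-days")
```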
There are tens of millions of GPUs.
Yeah, because it's funny. I remember Bitcoin, even years ago,
all the CPU dedicated to it
at the time was
like larger than all of Google's data centers.
Yeah, Bitcoin, though, used a lot of ASICs in particular.
Ethereum actually had a higher relative ratio of GPUs.
And so the larger GPU-based Ethereum mining providers, like Hive, for example, had, you know, less than 1% of global hash power. You know, and so you can actually start to extrapolate, you know, and, you know, they had quite a few GPUs, like tens of thousands of Nvidia GPUs.
Yeah, it starts to give you a sense of the scale of capacity.
And then other, you know, mining outfits were at way less than 1% of the total hash power, about 0.1%.
So quite a bit of hash power in Ethereum.
And I think that's kind of one proxy, but to give you one more anecdote on that line, an iPhone 15 Pro now is actually stronger than a V100, as a funny example.
It has about 35 teraflops in FP16, I believe, where a V100 is around 30.
And so there's actually quite a bit of compute in the world, broadly speaking.
That's, I guess, the point I'm making.
Now, it's not all useful.
It's not all interconnected.
It's not all accessible.
It's not all secure.
but this is one point to make that there's a lot of compute in the world,
and even for the high-end GPUs, there's a lot more than people think.
By many measures, utilization of even these H100 systems, which are kind of state-of-the-art, the most valuable, the most precious, et cetera, is in many cases 20%, 25%, or lower, according to some pretty high-quality data I've seen from some great sources here.
Yeah, so quite low.
And as I mentioned, even during these pre-training rounds,
it's often 80% or lower because of the healing buffer partially.
That actually ties to another product that we've launched, which is a product that we built, actually, largely for ourselves, called MARS.
It's kind of a funny name, but it's monitoring,
alerting, resiliency, and security.
It's basically a suite of tools that we've invested a lot of IP in
to really boost and magnify the availability and uptime of GPUs
for our own platform.
It was actually something that we plan to make available
to other people just to use in their own clusters as well.
Actually, one of the reasons why we invested in spot is because we reserve healing buffer very aggressively for ourselves, so that if there's a GPU failure, we can automatically swap in another GPU,
and a user won't perceive a disruption.
And so we actually, we maintain buffer for that reason.
And so actually being able to pack that buffer with preemptible nodes is actually a really
useful thing.
But now we're allowing other people to do this, including third-party partners who want
to, for example, make their healing buffer available to others through Foundry.
That really offsets their economics and the cost of the cluster for them.
So it's a really, really powerful thing.
And so between Mars and Spot, you can kind of see how these things are really interconnected
and in a nice way.
But, yeah, the number of GPUs available, there's quite a few, particularly if you look at it more broadly, in terms of total AI compute capacity, the percentage that's accessible, useful, and used is a pretty de minimis fraction of the total.
How did the GPU market dynamics and your prediction of them factor into foundry strategy going forward, right?
Because there are, especially for anybody doing large-scale training jobs, there is definitely a, you know, a significant effort to be at the, um,
leading edge, right? Access to B-100 and beyond is at a premium and then access in, you know,
the largest possible interconnected cluster with sufficient power is also a fight now. It sounds
like you, you know, see the opportunity differently or you feel like there are resources that
can be used that don't require just building new data centers. I think it's a little bit of all
the above, to be clear. I think two things can be true at once, and that's definitely the case here.
I think there'll be many workloads and use cases for which having state-of-the-art, extremely large clusters is a really valuable thing.
You know, part of what we're noticing, though, and also trying to promulgate further, is basically paradigms that don't require this as well.
And so here's why I'd say kind of two things are true at once.
And so I think there's a massive shortage of power, space, and interconnect for the kind
of largest of clusters.
It's actually very hard to come by and to construct or to find a really large interconnected
cluster.
You know, it starts to vanish the larger the cluster gets.
Like, there are a lot more 1K clusters than 10K clusters and 20K clusters, and you can keep going.
Now, I think one thing is that it'll get harder and harder to keep the scaling going from there.
You know, I think there's one question, which is, how will we continue to push the scaling laws?
You know, one fact about the scaling law curves is that they're all plotted on logarithmic axes, and, you know, things get better predictably, but it requires a continued 2x-ing or 10x-ing to get that next bump in performance, and it's quite a bit harder to get to the next 2x each time.
And so it starts to become kind of intractable pretty quickly.
And so it's already prompted, I think, a lot of innovation.
So Google, for example, has been doing a lot with, you know, training across facilities, for example, across data centers, interconnecting them for these models like PaLM 2, something that previously would have seemed to be inconceivable, or things like DiPaCo or DiLoCo, these models they released that were trained across facilities.
And that's one innovation, but I think actually an even slightly more radical thing is we're starting to see a shift,
towards a pretty different paradigm.
I think myself and a number of my collaborators and Matei and others
have kind of termed this compound AI systems.
I think you actually see it with these most recent models
like AlphaGeometry, AlphaCode, and Llama 3.
And so I think this actually points the way towards what the AI infrastructure future might look like.
And I think it looks a lot less like everything requiring these big clusters.
And it's a little bit more interesting.
And so maybe I'll use Phi-3 as an example.
With Phi-3, they took a little bit of a different approach
where they trained a really high-quality small model
on high-quality data.
And this small model did not need
the kind of big interconnected cluster.
You can train it on a pretty small cluster.
However, it was still a non-trivial endeavor
because they had to curate and obtain this kind of high-quality data.
And so one of the things I think you're seeing
is for these models, like some of the Llama 3 8B and 70B variants,
Those models are really small, but they're extremely smart.
They're smarter than much, much larger systems,
like the prior generation from OpenAI.
And the way that they trained Llama 8B and 70B looks a little bit different.
So what they did was, they generated a ton of synthetic data, it seems, with Llama 3.1 405B.
And they distilled that larger model into the 70B and 8B variants.
Right?
And so they got a very, very high quality, small variant.
Another example is AlphaCode 2, which was able to achieve extremely high code proficiency and win competitions with really, really small language models.
But what they did was they called the model a million times for every question.
That's an embarrassingly parallelizable workload.
You can scale it horizontally and infinitely.
They called it a million times per query and then had a nice, pretty elegant regime to filter down to the top 10 responses, which they then tried one by one.
And so this is basically what they did: they generated a million candidate responses and basically filtered down to the best one as a way to solve coding, so to speak.
So that's pretty interesting.
And I think you're seeing that type of regime
a bit more and more.
You know, same with AlphaGeometry.
A really powerful system was just announced recently that, you know, won at, you know, the silver medal level in the IMO, and broadly, not just geometry, for a broader class of problems.
And, you know, these are kind of compound systems with major kind of synthetic data generation pieces.
And so I think you're seeing people kind of move computation around and interpolate between training and inference, for example, to make the best use of the infra they have. And this is actually kind of a funny, I think, reframing of the scaling laws that we hold near and dear as an ecosystem. I think that, you know, the Chinchilla scaling laws that DeepMind uncovered have fueled a lot of the scaling effort, but actually one kind of funny way of looking at those results is that they show that if you want to make a monolithic model as smart as possible, there is an ideal way to distribute parameters, compute, and, basically, training iterations. Now, the funny thing I think some people have done, like Mistral, is actually choose to maybe inefficiently train a small model to be smarter than it should be, wasting money, but then that small model is actually really cheap to inference because it's small, and it's way smarter than it should be for its size. And so I think people are getting more sophisticated at thinking about cost in more of a lifecycle way. And that's actually leading to the workload shifting from large pre-training more and more towards things like batch inference, which is actually a really, really horizontally scalable workload that you can parallelize. You don't need interconnect in the same way. You don't even need state-of-the-art systems in the same way.
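One way to see the lifecycle-cost point about deliberately over-training a small model is a simple total-cost comparison; every number below is a hypothetical illustration, not a figure from the episode:

```python
# Illustrative lifecycle-cost comparison: an "inefficiently" over-trained small
# model can win once serving volume is large, because inference cost scales
# with model size. All numbers are hypothetical.

def lifecycle_cost(train_cost: float, cost_per_m_tokens: float, m_tokens_served: float) -> float:
    """Total cost of ownership: one-time training plus per-token serving."""
    return train_cost + cost_per_m_tokens * m_tokens_served

BIG = dict(train_cost=50e6, cost_per_m_tokens=5.00)     # compute-optimal large model (assumed)
SMALL = dict(train_cost=20e6, cost_per_m_tokens=0.50)   # over-trained small model, similar quality (assumed)

for m_tokens in (1e3, 1e6, 1e8):   # millions of tokens served over the model's life
    big = lifecycle_cost(m_tokens_served=m_tokens, **BIG)
    small = lifecycle_cost(m_tokens_served=m_tokens, **SMALL)
    print(f"{m_tokens:>12,.0f}M tokens served -> big: ${big/1e6:,.1f}M, small: ${small/1e6:,.1f}M")
```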
I think you're seeing that type of workload maybe grow in prominence. And so just to unpack one more statement on that, one thing you can do that people are doing sometimes is you unroll the current state-of-the-art model many, many times, basically doing chain of thought on the current state-of-the-art model. And then you take what required six steps with the previous state of the art, and that becomes your training data for the next state of the art. Right, and that type of approach to generating data, then filtering it down to high-quality examples, and then training models on it looks very different than just throwing more poor-quality data into a massive supercomputer to get the next generation.
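A minimal sketch of that generate-filter-train loop; the model and the quality check below are hypothetical stand-ins, not any lab's actual pipeline:

```python
# Sketch of the synthetic-data bootstrap described above: unroll the current
# best model many times, keep only traces that pass a check, and use those as
# training data for the next (often smaller) model. The "model" and the check
# are hypothetical stand-ins so the sketch runs end to end.

import random

def current_model(prompt: str) -> str:
    """Stand-in for sampling one chain-of-thought rollout from today's best model."""
    return f"{prompt} -> reasoning trace #{random.randint(0, 9999)}"

def passes_check(trace: str) -> bool:
    """Stand-in for a verifier, unit tests, or a judge model scoring the trace."""
    return hash(trace) % 4 == 0   # pretend roughly a quarter of rollouts are good enough

def build_distillation_set(prompts: list[str], n_per_prompt: int = 64) -> list[tuple[str, str]]:
    dataset = []
    for prompt in prompts:
        rollouts = (current_model(prompt) for _ in range(n_per_prompt))
        dataset.extend((prompt, t) for t in rollouts if passes_check(t))
    return dataset   # this becomes fine-tuning data for the next model

examples = build_distillation_set(["prove X", "solve Y", "refactor Z"])
print(f"Kept {len(examples)} high-quality traces for distillation")
```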
Yeah, it seems like that bootstrapping is really sort of under-discussed, relative to how, once you hit a certain threshold of model quality, the rate at which you can improve the next model just kind of accelerates.
One other thing that would be great to cover, I know we only have a couple minutes left of your time, is the recent paper that you authored, which I thought was super interesting around compound AI system design and sort of related topics to that.
So would you mind telling us a little bit about that paper and sort of what you all showed?
So I think it's kind of in this regime that we were just talking about, where, more and more often, to go beyond the capability frontier accessible to today's state-of-the-art models and kind of get GPT-5 or GPT-6 early, practitioners are starting to do these things, oftentimes implicitly, where they'll call the current state-of-the-art model many, many times.
There are many scenarios where maybe you're willing to expend a bit of a higher budget, maybe it's code or something, and if I said that I can give you a 10% better model, you know, for code, many developers might pay 10x for access to that.
Instead of $20 a month, they might be very willing to pay $200 a month, right, for obvious
reasons.
And so there's a question of what do you do in that setting?
And so people are, you know, if you're willing to call the model many times, you
can compose those many calls into almost a network of network calls, right?
And, you know, I guess one of the questions is, how then should you compose these networks of networks, or what principles should guide their architecture?
We kind of know how to construct neural networks, but we haven't yet elucidated the principles for how to construct these networks of networks, so to speak, these compound AI systems, where you have many, many calls, maybe external components.
And so one principle that we started to explore was that maybe one thing you can probe, to figure out how to compose these calls or whether composing many calls will help you, is how verifiable the problem is. And if it's verifiable, you can actually bootstrap your way to really high performance. So what does this mean? Well, verifiable means that it's kind of easier to check an answer than it is to generate an answer. And there are a lot of cases where this is true.
Most software engineering and computing tasks kind of classically have this property. You know, we looked at things like prime factorization, or a lot of math tasks. Classically, it can take someone years of suffering to write a proof, and you can, like, read the proof in a couple of hours, right? I think we've all had that experience at some point in our training. So there are many examples like this. And so one thing you can do is you can have models, and you can, embarrassingly parallel, horizontally scale out and generate many, many candidate responses, and then relatively cheaply check those candidate responses and kind of do a best-of-K type of approach.
And it turns out the judge model or the verifier, choosing the best candidate response, might actually have a lot higher accuracy at selecting the best candidate response from the set. You can use this, kind of repeating it as a procedure, to actually bootstrap your way to really high performance in many cases.
And so, you know, we did kind of some preliminary investigations here
and we were able to, in one case, prime factorization, you know, kind of 10x the performance, going from 3.7 percent to 36.6 percent. That's on prime factorization, which is pretty hard: kind of taking a number that's a composite of two three-digit primes and factoring it into the constituent primes. It's kind of a classic problem that pops up a lot in cryptography.
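Prime factorization is a nice case because the verifier is trivial: multiplying the proposed factors back together is far cheaper than finding them. Here is a minimal, self-contained sketch of the generate-many-then-verify pattern, with a random guesser standing in for the language-model proposer (not the paper's actual setup):

```python
# Best-of-K with a cheap verifier, sketched on the prime-factorization example.
# A random "proposer" stands in for the language model; the point is the
# structure: generate many candidates, keep whichever one verifies.

import random

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n**0.5) + 1))

def propose_factors(n: int) -> tuple[int, int]:
    """Stand-in for one model call proposing a factorization of n."""
    p = random.randrange(100, 1000)   # guess a three-digit factor
    return p, n // p

def verify(n: int, p: int, q: int) -> bool:
    """Checking is cheap: multiply back and confirm both factors are prime."""
    return p * q == n and is_prime(p) and is_prime(q)

def best_of_k(n: int, k: int = 10_000):
    for _ in range(k):                # embarrassingly parallel in practice
        p, q = propose_factors(n)
        if verify(n, p, q):
            return p, q
    return None

print(best_of_k(613 * 797))           # factor a composite of two three-digit primes
```

When the check isn't exact, a judge model choosing among the K candidates plays the same role as the verifier here.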
And then also we looked at subjects in the MMLU and found that for kind of the subjects you
would expect, math, physics, electrical engineering, this type of approach is really helpful.
Now, we used language models, but it doesn't have to be a language model. It could be a simulator; it could be unit tests as your verifier, et cetera.
But we think this type of approach kind of points towards maybe a very different paradigm
for getting better performance than just kind of scaling the models
and doing a whole new pre-training from scratch.
MMLU performance bump was about 3%.
And to put that in perspective, the gap between some of the previous best models
is often less than 1%.
Between, for example, Gemini 1.5 and Llama 3.1 and things like that.
So actually, 2.8% or 3% is actually a pretty major gap on MMLU.
So pretty intriguing, and I think a lot of practitioners are hopefully going to explore this setting a lot more.
It's a super cool paper. Do you have any creative ideas about how you could apply some of the ideas here to improve performance on more open-ended tasks?
In many ways, we're not originating these. I think some of these are baked largely into systems like AlphaCode and AlphaGeometry already.
I was pretty inspired to see the AlphaGeometry results recently as well.
Yeah, I think that what we'll see people doing is kind of composing, and this sounds funny, but massive networks, where maybe each stage in the network will basically be some best-of-K component with many, many calls to different language models, you know, Claude, Gemini, GPT-4, each with their own, you know, spikes in terms of capabilities, and you'll kind of throw multiple of them at questions in many cases and then kind of choose the best response. You might, you know, also ensemble that with other components like classical,
heuristic-based systems and, you know, simulators, etc.
And kind of compose large networks that may make millions of calls to answer a question.
And I think that type of approach, it sounds kind of farcical right now, but I think it'll seem
common sense pretty soon.
You know, we think that's, we think it's a really interesting approach, and we've seen a lot
of interesting evidence that we'll speak more about pretty soon for things like code
generation and agentic tasks in the code regime, you know, for things like design, chip
design, you know, for things like actual neural network design, or network of network design
even, funny enough, in recursive ways. So it's actually really good: it turns out a lot of these problems that we care about have that property, are verifiable, and you can compose these systems and bootstrap your way to, you know, much higher performance than people might have imagined. So it seems pretty applicable downstream, but there are a lot of open questions, a lot
of work to do further. And I think, you know, part of our hope is that the community will
explore this more, and that these types of workloads that are a bit more parallelizable will become more and more common. There'll be a lot more batch inference, a lot more synthetic data generation. And you won't necessarily need the big interconnected cluster that maybe only OpenAI can afford in order to do kind of cutting-edge work in the future.
Yeah, a really cool set of ideas. And overall, a great conversation. Thanks so much for doing this, Jared.
No, thank you, Sarah. And thank you, Elad. Great to see you.
Yeah, great to see you too.
Find us on Twitter at NoPriorsPod.
Subscribe to our YouTube channel. If you want to see our faces, follow the show on Apple Podcasts,
Spotify, or wherever you listen. That way you get a new episode every
week. And sign up for emails or find transcripts for every episode at no-priors.com.