No Priors: Artificial Intelligence | Technology | Startups - The marketplace for AI compute with Jared Quincy Davis from Foundry
Episode Date: August 22, 2024
In this episode of No Priors, hosts Sarah and Elad are joined by Jared Quincy Davis, former DeepMind researcher and the Founder and CEO of Foundry, a new AI cloud computing service provider. They discuss the research problems that led him to starting Foundry, the current state of GPU cloud utilization, and Foundry's approach to improving cloud economics for AI workloads. Jared also touches on his predictions for the GPU market and the thinking behind his recent paper on designing compound AI systems. Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @jaredq_ Show Notes: (00:00) Introduction (02:42) Foundry background (03:57) GPU utilization for large models (07:29) Systems to run a large model (09:54) Historical value proposition of the cloud (14:45) Sharing cloud compute to increase efficiency (19:17) Foundry's new releases (23:54) The current state of GPU capacity (29:50) GPU market dynamics (36:28) Compound systems design (40:27) Improving open-ended tasks
Transcript
Hi, listeners. Welcome to No Priors.
Today we're talking to Jared Quincy Davis, the founder and CEO of Foundry.
Jared worked at DeepMind and was doing his Ph.D. with Matei Zaharia at Stanford before he began his mission to orchestrate compute with Foundry.
We're excited to have him on to talk about GPUs and the future of the cloud. Welcome, Jared.
Thanks, Sarah, great to see you. And thanks, Elad, as well.
Yeah, great seeing you.
The mission at Foundry is directly related to some problems that you had seen in research and at DeepMind. Can you talk a little bit about the genesis?
A couple of the most inspiring events I've witnessed in my career so far were the release of AlphaFold 2 and also ChatGPT. I think that one of the things that was so remarkable to me about AlphaFold 2 is that initially it was a really small team, you know, three and then later 18 people or so.
And they solved what was kind of a 50-year grand challenge in biology, which is a pretty remarkable fact, you know, that every university, every pharma company hadn't solved. And similarly with ChatGPT, a pretty small team, OpenAI was 400 people at the time,
you know, released a system that really shook up the entire global business landscape.
You know, that's a pretty remarkable thing, and I think it's kind of intriguing to think about what would need to happen for those types of events to be a lot more common in the world. And, you know, although those events are really amazing because of the small numbers of people working on them, neither is quite the David and Goliath story that it appears to be when you double click. In OpenAI's case, you know, there were only 400 people, but they had $13 billion worth of compute, you know, which is quite a bit of computational scale there. And in DeepMind's case, it was a small team, you know, but obviously they were standing on the shoulders of giants in some sense with Google, right, and the leverage that they had via Google. And so one thing that we thought about is, you know, what can we do to make the type of computational leverage and tools that are currently exclusively the domain of OpenAI and DeepMind kind of available to a much broader class of people?
And so that's a lot of what we worked on with Foundry, saying, can we build a public cloud, you know, built specifically for AI workloads, where we reimagine a lot of the components that constitute the cloud end to end from first principles?
And in doing that, can we make things that currently cost a billion dollars cost $100 million, and then $10 million, over time?
And that'd be a pretty massive contribution.
I think it would increase the frequency of events like AlphaFold 2 by 10x, 100x, 1,000x, or maybe even more, superlinearly.
And we're already starting to see the early signs of that,
but quite a lot of room left to push this agenda.
So really exciting.
So that's kind of maybe an initial introduction preamble
to how we thought about it.
And I can trace that line of reasoning a bit more,
but that's kind of part of what we've done.
Jared, for anybody who hasn't heard of Foundry yet,
what is the product offering?
Yeah.
So Foundry, we're essentially a public cloud
built specifically for AI.
And what we've tried to do is really reimagine all of the systems undergirding what we call the cloud end to end from first principles for AI workloads.
And we've started to do this a bit of a new way.
I think the AI offerings from the existing major public clouds and kind of some new GPU clouds haven't really re-envisioned things.
And by thinking about a lot of these things a bit anew, we've been able to improve the economics by, in some cases, 12 to 20x over lower-tech GPU clouds and the existing public clouds. And, you know, partially based on some of these products that we'll talk about today that we're releasing, and a lot of new things that we're working on, we think we can push that quite a bit further as well.
And so, you know, our
primary products are essentially infrastructure as a service, so our customers
come to us for elastic and really
economically viable access to state-of-the-art systems.
And also a lot of tools to make leveraging those systems
really seamless and easy. And we've invested quite a bit in things like
reliability, security,
elasticity, and just the core
price performance. How underutilized
are most GPU clouds today? And I think
there's almost three versions of that. There's things on hyperscalers
like AWS or Azure. There's large clusters or clouds
that people who are doing large scale model training or
inference run for themselves. And then there's more just like
everything else. It could be a hobbyist. It could be a research lab.
It could be somebody with just, you know, some GPUs
that they're messing around with. I'm sort of curious, for each one of those types of or categories of users, what is the likely utilization rate, and how much more do you think it could be optimized?
Is it 10%? Is it 50%? Like, I'm just very curious.
One of the most, I'd say, positive cases is the one with the highest utilization, which is the case where you're running kind of an end-to-end pre-training job, right? And so that's the case where you've done a lot of work up front. You've designated a time that you're going to run this pre-training workload for, and you're really trying to get the most utilization out of it.
And for a lot of companies, utilization, even during this phase, you know, is sub-80%.
So why? One reason is actually that these GPUs, particularly the newer ones, actually do fail a lot, as practitioners would know.
And so one of the consequences of that is that it's very common now to hold aside 10 to 20% minimum of the GPUs that a team has as buffer, as healing buffer, in case of a failure, so you can slot something else in to keep the training workload running.
Right? And so even for a lot of the more sophisticated orgs running large pre-training at scale, the utilization is sub-80%, sometimes less than 50%, actually, depending on how bad of a batch they have and the frequency of failure in the cluster. And even in that case, there are often also large gaps and intermissions between training workloads, even if the GPUs are dedicated to a specific entity, you know. And so even in those most conservative cases, and we'll come back to the less conservative, extreme cases, utilization really can be quite a bit lower than people would imagine. So we can pull on that case a bit more because
I think it's actually quite counterintuitive and really interesting. I think there's a really
fundamental disconnect between people's mental image of what GPUs are today and what they actually are. I think that in most people's minds, you know, GPUs are chips, right? And we talk about them as just chips, but actually the H100 systems are truly systems. You know, they're 70 to 80 pounds, with 35,000-plus individual components. They're really kind of monstrosities in some sense. And the remarkable thing that I think Jensen and Nvidia have done, one of many, is they've basically taken an entire data center's worth of infrastructure and compressed it down into a single box. And so if you look at it from that perspective, the fact that it's 70 pounds isn't quite as alarming. But these are really
gnarly systems. And when you end up composing these individual systems, these DGXs or HGXs, into large
supercomputers, what you're often doing is you're interconnecting thousands, tens of thousands,
hundreds of thousands of them. And so the failure probability kind of multiplies. And because you have millions, perhaps, of eventual components in this supercomputer, the probability that it will run for weeks on end without a failure, and this is basically a verbatim quote from Jensen's keynote, is basically zero. That's a little bit of a challenge, and funnily enough, I think the AI infrastructure world is still somewhat immature, and so it doesn't have perfect tooling, broadly speaking, to deal with
these types of things. And so one of the compensatory measures I think people take is
reserving this healing buffer, for example. I think that disconnect maybe helps explain
why these things fail. And it's actually, funny enough, the newer, more advanced systems
anecdotally fail a lot more than historical systems that were worse in some ways.
And do you think that's just, like, a quality control issue for those systems, or do you think it's just some form of complexity, with some failure rate per component that compounds?
What do you think is the driver of that?
I think it's more of the complexity has grown, right?
And we're in a different regime now.
I think that it's fair to say, so maybe stepping back again to definitions, we throw the term large around a lot in the ecosystem. I guess one question is, what does large mean? And one useful definition of large that I think roughly corresponds to what people mean when they invoke the term is that, for a large language model, you enter the large regime when essentially the amount of memory necessary to contain even just the model weights starts to exceed the capacity of even a state-of-the-art single GPU or a single node. I think it's fair to say you're in the large regime when you need multiple state-of-the-art servers from Nvidia or from someone else to even just contain the model, you know, to basically run the training, or definitely even just to contain the model. That's definitely the large regime. And so the key characteristic of the large regime is that you have to somehow orchestrate a cluster of GPUs to perform a single synchronized calculation. Right? And so it becomes a bit of a distributed systems problem. I think that's one way of characterizing the large regime.
Now, a consequence of that is that you have many components
that are all kind of collaborating
to perform a single calculation.
And so any one of these components failing
can actually potentially, you know,
lead to some degradation or challenge downstream, or, I mean, stop the entire workload, right?
You know, you've probably heard people talk a lot about InfiniBand, and the fact that part of Nvidia's advantage comes from the fact that they do build systems that are also state-of-the-art from a networking perspective, right?
And their acquisition of Mellanox was one of the better acquisitions of all time, arguably,
from a market cap creation perspective.
And the reason that they did this is because they realized
that it would be really valuable to connect many, many machines
into a single, almost contiguous supercomputer
that almost acts as one unit.
Yeah, and the challenge of that, though, is that there now are many, many, many more components and many more points of failure.
And these things kind of, you know, the points of failure kind of multiply, so to speak.
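A minimal sketch of why the points of failure multiply, assuming a made-up per-node failure rate (not a figure from the episode) and that any single failure interrupts the synchronized job:

```python
# Sketch: if any one node failing interrupts a synchronized training job,
# the chance of an uninterrupted run collapses as the cluster grows.
# The per-node daily failure probability below is a hypothetical illustration.

P_NODE_FAIL_PER_DAY = 0.001   # hypothetical: 0.1% chance a given node fails on a given day

def p_uninterrupted(num_nodes: int, days: float) -> float:
    """Probability that no node fails over the whole run (independent failures assumed)."""
    return (1 - P_NODE_FAIL_PER_DAY) ** (num_nodes * days)

for nodes in (100, 1_000, 10_000):
    print(f"{nodes:>6} nodes, 14-day run: "
          f"{p_uninterrupted(nodes, 14):.6%} chance of zero failures")
```

This compounding is the intuition behind holding back the 10 to 20% healing buffer Jared mentions.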
You implied that, like, GPUs, well, you described GPUs as this unique asset that is more CAPEX than OPEX, and that the hyperscalers, like an Amazon, are making a certain assumption about what that depreciation cycle is.
Like, where do you think the assumptions for foundry versus those hyperscalers or versus, let's say, like, a core weave are different?
This opens up a pretty interesting conversation around, like, what is cloud?
I think we've kind of forgotten in this current moment what cloud was originally supposed to be, and what its value proposition was intended to be.
I think current AI cloud is not cloud in the originally intended sense by any means.
So we should pull on that thread.
But I'd say right now it's basically co-location.
Yeah, it's basically co-location, right?
It's not really cloud.
Yeah.
Maybe, yeah, it'll be, it's definitely worth pulling on that thread a little bit.
Yeah, do you want to break that down for sort of our listeners in terms of what you'd view as the differences?
First, I guess, there's a little bit of context for people.
The cloud, as we currently know it, is arguably one of the, you know, most important business categories in the world.
That's, I think, pretty clear.
The biggest companies in the world, they either are clouds, with Azure, AWS, GCP being core components of the biggest companies in the world, or they're Nvidia, who sells to the clouds, obviously.
So it's clearly an important category.
AWS is arguably a trillion dollars plus of market cap if you broke it out from Amazon.
At one point, relatively recently, it was all of Amazon's profit and more.
So it's an important category, to say the least.
Cloud as we know it today really started in 2003, which is when Amazon, you know, dedicated 50 people initially, I believe, to start working on AWS.
And they worked on this for three years within Amazon and launched in March 2006 with S3.
Later that year, in September, they launched EC2.
And that kind of was the beginning of the cloud.
It took quite a while for this model to catch on, and people were,
not quite clear on why this would be useful, even until quite recently.
And so in 2009, 2010, you know, Matei, Sarah, who you mentioned, wrote this paper called Above the Clouds with some collaborators at Berkeley, a Berkeley view of cloud computing.
And they talked about why cloud would be a big deal, and I think it was, to say the least, not appreciated that the points they were making were valid at the time.
I think it was kind of, you know, not clear to people, to say the least.
You know, fast forward a few years: Bezos in 2015 said AWS was, for all intents and purposes, unconstrained in market size.
And he was kind of laughed at by many.
That was seen as a really ludicrous statement to make.
You know, fast forward, no one's laughing.
In 2019, I remember people saying, like, it's not clear if cloud would be a big deal. Snowflake hadn't yet scaled, you know, Databricks hadn't yet scaled.
I wonder if it's worth making a distinction on these things because, you know,
I agree with parts of what you're saying, but, you know, I was building startups in that era.
And at the time, there was a pretty strong belief that at least for a startup,
these clouds were incredibly useful, because it used to be that you'd spend a lot of money and effort on setting up your server racks and wiring everything up and all the rest. And then, you know, there was the emergence of things like AWS, and nobody expected it to be Amazon, right? It just didn't fit what people thought of it: it was basically an e-commerce marketplace company, right? They were suddenly building infrastructure and providing it. It felt very natural for somebody like Google to provide that. But I think for startups, because I was doing a startup in 2007, you know, we thought that it was magic, right? Because suddenly, and not all the services were there yet and everything else, but suddenly you didn't have to deal with all the infra. And there were a number of companies that started before that, like Twitter and others, who kind of ended up having to continue to build and maintain their own clouds. And it was really brutal. And then, to your point, I think the big transition was the degree to which enterprises, particularly in regulated services or financial industries, kind of fought it initially. And then they started adopting it. So I agree with a lot of what you're saying. I wouldn't say that it was one of those things that nobody believed in, though. I actually felt there was a lot of buy-in and a lot of belief and, you know, adoption.
Definitely not nobody. I think, you know, prescient people and a lot of
builders really saw the value really early, particularly startups in the VC community. You know,
I think Berkeley and some of the researchers there really got it early, like 2009 and earlier, you know,
by 2015, I think it was already around $3 billion of run rate. So it was not small. It was very far
from what it is today, but it was not small. They didn't break it out yet, but it was, you know,
not small at all. It was a meaningful thing. I think though people didn't recognize it would
become anything like it is today. And it still wasn't clear.
to people what the value proposition was.
You may recall there was, you know, Dropbox, for example, which famously exited the cloud and went back on-prem.
And that story, they published why they did that and the economics of it,
and that caught a lot of attention.
And people were saying, I'm not sure if cloud actually makes sense anymore, et cetera.
And I think people kind of lost track of that.
Yeah, I think one of the insightful things that some of the early cloud systems companies like Databricks and Snowflake recognized was one of the value propositions of the cloud that was really special.
I'll actually use one of the early quotes from, well, Snowflake to illustrate this. They were trying to think about what was unique about the cloud, what you could do in the cloud that you couldn't do anywhere else. And the killer idea that they converged on was that, fundamentally, the cloud made fast free. Quote unquote, fast was free in the cloud, as they put it explicitly. And the idea was,
if you have a workload that was designed to run for 10 days on 10 machines, in the cloud,
you could theoretically run it on 100 machines for one day, you know, or 10,000 machines for 15
minutes. And that would cost the exact same. And so you could run it a thousand times faster for the
same cost theoretically. And it's like, well, that's kind of a big deal, right? You can run something
a thousand times faster for the same cost, you know, and then give the compute back. Now, what that
would require, though, is a number of things to make that actually work, one being that you'd have to be able to kind of reshape a workload that was designed to run on 10 machines to run on 10,000 machines, which is not trivial. And another is that you actually have to have the 10,000 machines' worth of capacity in the cloud and make sure that was utilized to make the economics kind of work, you know.
But yeah, if you could do those two things, it'd be a really, really big deal.
That's "fast is free" in the cloud.
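A tiny sketch of that arithmetic; the hourly rate is hypothetical, while the 10-machines-for-10-days workload is the example from the conversation:

```python
# "Fast is free": the same number of machine-hours costs the same,
# whether you spread the work thin or burst it wide. Rate is made up.

RATE_PER_MACHINE_HOUR = 3.0            # hypothetical $/machine-hour
TOTAL_MACHINE_HOURS = 10 * 10 * 24     # the 10-machines-for-10-days workload

for machines in (10, 100, 10_000):
    wall_clock_hours = TOTAL_MACHINE_HOURS / machines
    cost = TOTAL_MACHINE_HOURS * RATE_PER_MACHINE_HOUR
    print(f"{machines:>6} machines -> {wall_clock_hours:7.2f} hours wall clock, cost ${cost:,.0f}")
```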
And so one of the key ideas was this elasticity, right?
And I think that's one of the things that's really absent in AI Cloud today.
In AI Cloud today, you're kind of forced to get really long-term reservations,
often three years for a fixed amount of capacity.
No one really wants 64 GPUs for three years or 1,000 GPUs for three years.
You want 10,000 for, you know, a couple of months, maybe nothing for a little while.
Maybe you don't know how much you need because you're launching a new product and you're not sure how much demand there will be for it and how much inference capacity you'll need, et cetera. It's very, very challenging if you have to reserve the total
amount that you may need up front for a long duration. I want to go back to this idea of like
it's actually hard for a lot of engineering teams, especially younger ones, to picture like a pre-cloud
world, because their experience with it is, like, you know, I have virtual machines on Amazon, they're limitless, like maybe it's serverless. And, like, if you go all the way back to what Elad was talking about, you gave us a great historical view. But if you look at, like,
functionally, like I had, you know, I had my servers in my closet on prem, right? And then I had
co-location, which is like, I have my servers. There's still my servers. I control them and I
manage them. But I, they physically live in somebody's data center where they're offering me like
real estate, like cooling and power. Right. And then you had hosting, which is like,
like, I'm buying a machine in a data center, like reservations for a long time, essentially,
or for the life cycle of the machine.
And then you had virtualization and containerization and all of these services that came out of this,
like, you know, separation into cloud services where you have higher level functions
with scheduling orchestration.
And, you know, serverless is like, I'm just going to write the logic, and you deal with it and place that workload.
And, like, you know, we obviously haven't gotten to an endpoint in non-AI computing.
But I feel like the engineering world is so used to being over here,
whereas in the AI hardware resource world, we're, like, still somewhere between, you know, colo and hosting.
That's right.
Yeah, and that's challenging.
That means that you have to raise all the capital you may need up front, you know, and that's a challenging model. It also means that, you know, you can't quite, you know, grow the product elastically as demand and interest in it grows. You're kind of bottlenecked by the supply chains in a way that I think developers haven't experienced in quite a while. So it's a pretty challenging state of affairs, I think. And it's also
very challenging from a risk management perspective for these companies, because they're making these big commitments that are potentially, you know, if they don't work out, pretty catastrophic for them, on this hardware, paying all up front, paying for these long-duration contracts, etc. It's a very challenging thing. And there's no analog yet. The markets aren't mature enough that there's any analog to what we have in other domains, like in commodities markets, like wheat, oil, et cetera, where you can buy options and futures and hedge and sell back and things like that. It's still pretty immature in terms of where the market's at. And yeah, I think that's leading to a challenging state of affairs that's going to, you know,
continue to bring a lot of pain for people. And so, you know, there are several things I think
we've employed to do this better. Some are, you know, mixes of kind of business model innovations and technical innovations at the same time. You know, but I think we're making a pretty substantial dent in this, but also in a way that's really viable economically for us. And, you know, it doesn't involve buying all the GPUs and taking undue risks on them per se.
And so that's kind of a lot of what we've tried to do is, can we do something a lot more efficient?
You know, can we find some leverage, some points of leverage to address this problem?
You guys have some new releases as of, I think, a day or two ago.
Can you describe what's just come out from Foundry?
So I think right now, AI Cloud is, the AI cloud business is very much like a parking lot business.
And that sounds really funny because cloud is supposed to be high tech.
And you can hardly conceive of a less sophisticated business, at least on the surface, than parking lots.
And what do I mean by that?
Well, there's fundamentally there are two models in the parking lot business.
One is pay as you go.
For pay-as-you-go, the rates are usurious, and, you know, you may or may not find a space.
I'm sure many of us have had the experience of driving through SF and seeing a "lot full" sign, you know, for lot after lot as we drive around trying to park. And if you do get a spot, you might pay, you know, the $12-an-hour rate or something like that. I'm choosing that
rate because it's the rate of AWS, you know, for on-demand. On the other hand, if you want
to kind of guarantee that you'll have a spot and also have a better rate, you can basically
buy a spot reserved. And so you can have your own reserve parking spot in your building.
Maybe you pay, you know, $4 an hour. So you're getting a massive discount, but it's $3K a month,
effectively, right, which is actually pretty substantial.
And if you're only using it 40 hours a week when you're in the office as a typical worker,
it actually might be effectively $16 an hour as opposed to $12, so it's actually worse.
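Those parking-lot numbers roughly check out as a back-of-envelope; here is the arithmetic, using the rates he quotes and a typical 40-hour week:

```python
# Parking-lot arithmetic: the "cheap" reserved spot, used 40 hours a week,
# can cost more per hour of actual use than the pay-as-you-go rate.

ON_DEMAND_RATE = 12.0                  # $/hour, pay-as-you-go (the "AWS on-demand" rate in the analogy)
RESERVED_RATE = 4.0                    # $/hour, reserved
HOURS_PER_MONTH = 730                  # ~24 * 365 / 12
USED_HOURS_PER_MONTH = 40 * 52 / 12    # ~173 hours of actual use per month

monthly_reserved_cost = RESERVED_RATE * HOURS_PER_MONTH
effective_rate = monthly_reserved_cost / USED_HOURS_PER_MONTH

print(f"Reserved: ${monthly_reserved_cost:,.0f}/month, "
      f"~${effective_rate:.0f}/hour actually used vs ${ON_DEMAND_RATE:.0f}/hour on demand")
```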
And so I think one kind of funny analogy for one of the things that we want to do with a couple of these products is to kind of enable the equivalent of allowing people to park, pay-as-you-go, in someone else's reserved spot.
And that sounds kind of funny, but you can imagine that, okay, that'd actually be an interesting thing.
And if you can do that, then depending on the percentage of the spots that are typically reserved, you might have 10x the effective capacity in a lot.
You know, and then also it can kind of be a win-win.
Instead of the pay-as-you-go person paying 12, you know, they can pay something much lower.
In this case, I'll say seven; it's actually going to be a lot lower.
You know, the person who owns the spot, instead of paying four, can make five, equivalently.
And then the lot can also make a couple of bucks.
And so it's kind of a win-win for everyone.
And you're kind of double-booking the lot, and it's, you know, really, really efficient.
Now, that sounds great, but it wouldn't quite work if you showed up to your reserve spot and there was someone parked there.
That might be a little bit aggravating.
It also might not work if you were forced to call two hours in advance and say, hey, I'm coming.
And then the person who was parked in your spot had to leave their dinner reservation to move their car.
That wouldn't be a fun model.
And so I think one thing that we had to do was kind of create the analog of a system to make this all really convenient.
And so, you know, maybe the V1 of the system was, you came into the lot and a sensor was triggered saying you're here to go to your reserved spot,
and then some valet ran to the car
that was parked there and moved it,
and they maybe moved it
from the second floor to the 10th floor.
And then the person who had parked there previously now comes to the counter, asks the valet where their car is, and gets a ticket saying it's on the 10th floor, but there's no elevator, so they have to walk up the stairs to get it.
I'm stretching this analogy,
but you get the idea to be kind of inconvenient,
and so part of what we did was added more and more convenience features,
which we broadly call spot usability,
and these are a lot of things that we continue to add.
And so, you know,
The scenario that we're at now is basically, you know, you're letting someone else use your spot; you show up, the sensor kind of knows, okay, you're here, and then the car in your spot is automatically moved via conveyor to another spot, and we're managing the spaces to ensure we can move it somewhere.
And then when the person comes to pick up their car, it's kind of brought to them.
Now, yeah, so it's all really convenient.
It's kind of seamless for everyone involved.
It creates a ton more effective space, allows us to get much better economics out of the machines.
And it's really, really helpful for companies.
That's one really powerful, I think, thing that we've done.
and this is kind of offered in the context of a spot product.
I think people are somewhat familiar with spot usage
in the typical cloud context.
It's a lot more challenging to do with GPUs
for a few different reasons,
which is why there are not very many GPUs available on spot, definitely not at scale, and definitely not with interconnect.
So one of the things we want to do is enable this,
and it makes a lot of other things possible
that are pretty neat,
and so it's a mechanism we're employing in a few different ways.
So that's, I think, one thing,
and I'll give some more analogies to explain why that's powerful.
Well, we found that companies are using this type of mechanism quite a bit for everything from training, which has classically been seen as a workload that's difficult for spot, to things like inference, batch inference especially. And,
you know, that actually opens up another interesting conversation about the different
classes of workloads and what they, what each workload needs and cares about and how
this might evolve over time.
That, you know, actually that ties a little bit to the compound AI systems concept.
And also to the Llama 3.1 release, in a funny, tangential way.
But yeah, that's kind of one analogy for the product that we launched on the Foundry Cloud Platform around Spot.
Yeah, I think that, you know, spot usability that's increasingly deep and flexible and automated is, like, a really powerful primitive.
Want to change tack a little bit to just something I think the entire industry, like the tech industry, is very interested in.
We did a little bit of work together a while back just understanding like where, you know, where is the GPU capacity in the world today, right?
all of the different types, how much is it, or like, how consolidated is it? And obviously,
this is near and dear to your business. Can you just describe a little bit, like, where you think
we are? And then, like, what caused, what caused the shortage sort of last year?
One kind of funny bit of trivia that I've posed to a few people, that I think, you know, reveals how off-base a lot of our priors are, is: what percentage of the world's GPU petaflop capacity, or exaflop capacity, is kind of owned by the major public clouds? And I've asked many people, and I typically have gotten guesses, you know, in the high tens of percents. And the only time I got a lower guess was from Satya at Microsoft, who guessed basis points, which actually is correct. You know, it's a very small, small amount.
And maybe as one anecdote to illustrate how this looked at least a couple of years ago, this is an evolving thing, but how it's looked, you know, the example of GPT-3 and its training is, I think, an interesting one. It's a bit dated, but I'll use it just because the types of machines and the numbers there are public,
as is not the case for some of these other systems.
And so GPT-3 was trained on 10,000 V100 GPUs in an interconnected cluster in Azure for about 14.6 days.
To put that in perspective, it was a state-of-the-art system. You know, it was, by many estimates, eight figures for a single run at the time.
So it was a pretty substantial investment by OpenAI.
And that tells you that 10,000 V100s running continuously for 14.6 days is quite a bit of
compute. I think one interesting question, maybe to your question then, is: how many equivalent GPUs, normalized in terms of the number of flops, you know, they weren't fully interconnected, but just as an interesting processing measure anyway, were there in the Ethereum network at the peak of Ethereum? And so I kind of ask people this question, and
it's fun to see people guess. Can I solicit a guess, actually, Elad? Can you make a guess there?
You might know. So, um, because we've talked about this in a prior conversation, I'm not allowed to guess.
Yeah. So, Elad, do you know by chance?
For Ethereum?
Yeah, for Ethereum. And don't think too hard. Just make a guess based on priors.
When? So when it first launched?
The very peak, the tippy top of Ethereum: how many V100 equivalents were there, given there were 10,000 for two weeks for GPT-3? How many were there in Ethereum? Noting, by the way, that these were running 24/7 in Ethereum, so you can modulate your guess based on that.
I would guess a few hundred thousand, a few million. That's an aggressive guess. Yeah, that's an aggressive guess,
and you're actually very correct. It was about 10 to 20 million. Yeah. Which is quite a
substantial scale. And you can, by the way, sanity-check this really easily by looking at, you know, basically the hash power in Ethereum at the peak, which was around 900 terahashes per second, I believe, about a petahash per second, so quite a bit of peak hash power. And a typical V100 will give you between, I believe, 45 and 120 megahashes per second if you really know what you're doing. So that's kind of one estimate.
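That 10-to-20-million figure falls out of the hash-rate arithmetic he just described; a quick sanity-check sketch using the numbers quoted in the conversation:

```python
# Sanity check of the V100-equivalents estimate, using the figures quoted above
# (~900 TH/s of peak Ethereum hash power, ~45-120 MH/s per well-tuned V100).

PEAK_HASHRATE_MHS = 900e6          # 900 TH/s expressed in MH/s
V100_MHS = (45, 120)               # MH/s per V100, depending on tuning

for mhs in V100_MHS:
    print(f"At {mhs} MH/s per V100: ~{PEAK_HASHRATE_MHS / mhs / 1e6:.1f} million V100-equivalents")

# For scale, the GPT-3 run quoted earlier: 10,000 V100s for 14.6 days
print(f"GPT-3 run: {10_000 * 14.6:,.0f} V100-days")
```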
There are tens of millions of GPUs.
Yeah, because it's funny. I remember Bitcoin, even years ago,
all the CPU dedicated to it
at the time was
like larger than all of Google's data centers.
Yeah, Bitcoin, though, used a lot of ASICs in particular.
Ethereum actually had a higher relative ratio of GPUs.
And so the larger GPU-based Ethereum mining providers, like Hive, for example, had, you know, less than 1% of global hash power. You know, and so you can actually start to extrapolate, you know, and, you know, they had quite a few GPUs, like tens of thousands of Nvidia GPUs.
Yeah, it starts to give you a sense of the scale of capacity.
And then other, you know, mining outfits were at way less than 1% of the total hash power, about 0.1%.
So quite a bit of hash power in Ethereum.
And I think that's kind of one proxy, but to give you one more anecdote on that line, an iPhone 15 Pro now is actually stronger than a V100, as a funny example.
It has about 35 teraflops in FP16, I believe, where a V100 is around 30.
And so there's actually quite a bit of compute in the world, broadly speaking.
That's, I guess, the point I'm making.
Now, it's not all useful.
It's not all interconnected.
It's not all accessible.
It's not all secure.
but this is one point to make that there's a lot of compute in the world,
and even for the high-end GPUs, there's a lot more than people think.
By many measures, utilization of even these H100 systems, which are kind of state-of-the-art, the most valuable, the most precious, et cetera, is in many cases 20%, 25%, or lower, according to some pretty high-quality data I've seen from some great sources here.
Yeah, so quite low.
And as I mentioned, even during these pre-training rounds,
it's often 80% or lower because of the healing buffer partially.
That actually ties to another product that we've launched, which is a product that we built, actually, largely for ourselves, called MARS.
It's kind of a funny name, but it's monitoring,
alerting, resiliency, and security.
It's basically a suite of tools that we've invested a lot of IP in
to really boost and magnify the availability and uptime of GPUs
for our own platform.
It was actually something that we plan to make available
to other people just to use in their own clusters as well.
Actually, one of the reasons why we invested in spot is because we reserve healing buffer very aggressively for ourselves, so that if there's a GPU failure, we can automatically swap in another GPU,
and a user won't perceive a disruption.
And so we actually, we maintain buffer for that reason.
And so actually being able to pack that buffer with preemptible nodes is actually a really
useful thing.
But now we're allowing other people to do this, including third-party partners who want
to, for example, make their healing buffer available to others through Foundry.
That really offsets their economics and the cost of the cluster for them.
So it's a really, really powerful thing.
And so between Mars and Spot, you can kind of see how these things are really interconnected
and in a nice way.
But, yeah, the number of GPUs available, there's quite a few, particularly if you look at it more broadly, in terms of total AI compute capacity, the percentage that's accessible, useful, and used is a pretty de minimis fraction of the total.
How did the GPU market dynamics and your prediction of them factor into foundry strategy going forward, right?
Because there are, especially for anybody doing large-scale training jobs, there is definitely a, you know, a significant effort to be at the, um,
leading edge, right? Access to B-100 and beyond is at a premium and then access in, you know,
the largest possible interconnected cluster with sufficient power is also a fight now. It sounds
like you, you know, see the opportunity differently or you feel like there are resources that
can be used that don't require just building new data centers. I think it's a little bit of all
the above, to be clear. I think two things can be true at once, and that's definitely the case here.
I think there'll be many workloads and use cases for which having state-of-the-art, extremely large clusters is a really valuable thing.
You know, part of what we're noticing, though, and also trying to promulgate further, is basically paradigms that don't require this as well.
And so here's why I'd say kind of two things are true at once.
And so I think there's a massive shortage of power, space, and interconnect for the kind
of largest of clusters.
It's actually very hard to come by and to construct or to find a really large interconnected
cluster.
You know, it starts to vanish the larger the cluster gets.
Like, there are a lot more 1K clusters than 10K clusters and 20K clusters, and you can keep going.
Now, I think one thing is that it'll get harder and harder to keep the scaling going from there.
You know, I think there's one question, which is, how will we continue to push the scaling laws?
You know, one fact about the scaling law curves is that they're all plotted on logarithmic axes, and, you know, things get better predictably, but it requires a continued 2x-ing or 10x-ing to get that next bump in performance, and it's quite a bit harder to get to the next 2x each time.
And so it starts to become kind of intractable pretty quickly.
And so it's already prompted, I think, a lot of innovation.
So Google, for example, has been doing a lot with, you know, training across facilities, for example, across data centers, interconnecting them for these models like PaLM 2, something that previously would have seemed to be inconceivable, or things like DiPaCo or DiLoCo, these models they released that were trained across facilities.
And that's one innovation, but I think actually an even slightly more radical thing is we're starting to see a shift,
towards a pretty different paradigm.
I think myself and a number of my collaborators and Matei and others
have kind of termed this compound AI systems.
I think you actually see it with these most recent models
like AlphaGeometry, AlphaCode, and Llama 3.
And so I think this actually points the way towards what the AI infrastructure future might look like.
And I think it looks a lot less like everything requiring these big clusters.
And it's a little bit more interesting.
And so maybe I'll use Phi-3 as an example.
With Phi-3, they took a little bit of a different approach
where they trained a really high-quality small model
on high-quality data.
And this small model did not need
the kind of big interconnected cluster.
You can train it on a pretty small cluster.
However, it was still a non-trivial endeavor
because they had to curate and obtain this kind of high-quality data.
And so one of the things I think you're seeing
is for these models, like some of the Llama 3 8B and 70B variants,
Those models are really small, but they're extremely smart.
They're smarter than much, much larger systems,
like the prior generation from OpenAI.
And the way that they trained Llama 8B and 70B looks a little bit different.
So what they did was, they generated a ton of synthetic data, it seems, with Llama 3.1 405B.
And they distilled that larger model into the 70B and 8B variants.
Right?
And so they got a very, very high quality, small variant.
Another example is AlphaCode 2, which was able to achieve extremely high code proficiency and win competitions with really, really small language models.
But what they did was they called the model a million times for every question.
That's an embarrassingly parallelizable workload.
You can scale it horizontally and infinitely.
They called it a million times per query and then had a nice, pretty elegant regime to filter down to the top 10 responses, which they then tried one by one.
And so this is basically what they did: they generated a million candidate responses and basically filtered down to the best one as a way to solve coding, so to speak.
So that's pretty interesting.
And I think you're seeing that type of regime
a bit more and more.
You know, same with AlphaGeometry.
A really powerful system was just announced recently that, you know, won at, you know, the silver medal level in the IMO, and broadly, not just geometry, for a broader class of problems.
And, you know, these are kind of compound systems with major kind of synthetic data generation pieces.
And so I think you're seeing people kind of move computation around and interpolate between training and inference, for example, to make the best use of the infra they have. And this is actually kind of a funny, I think, reframing of the scaling laws that we hold near and dear as an ecosystem. I think that, you know, the Chinchilla scaling laws that DeepMind uncovered have fueled a lot of the scaling effort, but actually one kind of funny way of looking at those results is that they show that if you want to make a monolithic model as smart as possible, there is an ideal way to distribute parameters, compute, and, basically, training iterations. Now, the funny thing I think some people have done, like Mistral, is actually choose to maybe inefficiently train a small model to be smarter than it should be, wasting money, but then that small model is actually really cheap to inference because it's small, and it's way smarter than it should be for its size. And so I think people are getting more sophisticated at thinking about cost in more of a lifecycle way. And that's actually leading to the workload shifting from large pre-training more and more towards things like batch inference, which is actually a really, really horizontally scalable workload that you can parallelize. You don't need interconnect in the same way. You don't even need state-of-the-art systems in the same way.
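One way to see the lifecycle-cost point about deliberately over-training a small model is a simple total-cost comparison; every number below is a hypothetical illustration, not a figure from the episode:

```python
# Illustrative lifecycle-cost comparison: an "inefficiently" over-trained small
# model can win once serving volume is large, because inference cost scales
# with model size. All numbers are hypothetical.

def lifecycle_cost(train_cost: float, cost_per_m_tokens: float, m_tokens_served: float) -> float:
    """Total cost of ownership: one-time training plus per-token serving."""
    return train_cost + cost_per_m_tokens * m_tokens_served

BIG = dict(train_cost=50e6, cost_per_m_tokens=5.00)     # compute-optimal large model (assumed)
SMALL = dict(train_cost=20e6, cost_per_m_tokens=0.50)   # over-trained small model, similar quality (assumed)

for m_tokens in (1e3, 1e6, 1e8):   # millions of tokens served over the model's life
    big = lifecycle_cost(m_tokens_served=m_tokens, **BIG)
    small = lifecycle_cost(m_tokens_served=m_tokens, **SMALL)
    print(f"{m_tokens:>12,.0f}M tokens served -> big: ${big/1e6:,.1f}M, small: ${small/1e6:,.1f}M")
```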
I think you're seeing that type of workload maybe grow in prominence. And so just to unpack one more statement on that, one thing you can do that people are doing sometimes is you unroll the current state-of-the-art model many, many times, basically doing chain of thought on the current state-of-the-art model. And then you take what required six steps with the previous state of the art, and that becomes your training data for the next state of the art. Right, and that type of approach to generating data, then filtering it down to high-quality examples, and then training models on it looks very different than just throwing more poor-quality data into a massive supercomputer to get the next generation.
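A minimal sketch of that generate-filter-train loop; the model and the quality check below are hypothetical stand-ins, not any lab's actual pipeline:

```python
# Sketch of the synthetic-data bootstrap described above: unroll the current
# best model many times, keep only traces that pass a check, and use those as
# training data for the next (often smaller) model. The "model" and the check
# are hypothetical stand-ins so the sketch runs end to end.

import random

def current_model(prompt: str) -> str:
    """Stand-in for sampling one chain-of-thought rollout from today's best model."""
    return f"{prompt} -> reasoning trace #{random.randint(0, 9999)}"

def passes_check(trace: str) -> bool:
    """Stand-in for a verifier, unit tests, or a judge model scoring the trace."""
    return hash(trace) % 4 == 0   # pretend roughly a quarter of rollouts are good enough

def build_distillation_set(prompts: list[str], n_per_prompt: int = 64) -> list[tuple[str, str]]:
    dataset = []
    for prompt in prompts:
        rollouts = (current_model(prompt) for _ in range(n_per_prompt))
        dataset.extend((prompt, t) for t in rollouts if passes_check(t))
    return dataset   # this becomes fine-tuning data for the next model

examples = build_distillation_set(["prove X", "solve Y", "refactor Z"])
print(f"Kept {len(examples)} high-quality traces for distillation")
```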
Yeah, it seems like that bootstrapping is really sort of under-discussed, relative to how, once you hit a certain threshold of model quality, the rate at which you can improve the next model just kind of accelerates.
One other thing that would be great to cover, I know we only have a couple minutes left of your time, is the recent paper that you authored, which I thought was super interesting around compound AI system design and sort of related topics to that.
So would you mind telling us a little bit about that paper and sort of what you all showed?
So I think it's kind of in this regime that we were just talking about, where, more and more often, to go beyond the capability frontier accessible to today's state-of-the-art models and kind of get GPT-5 or GPT-6 early, practitioners are starting to do these things, oftentimes implicitly, where they'll call the current state-of-the-art model many, many times.
There are many scenarios where maybe you're willing to expend a bit of a higher budget, maybe it's code or something, and if I said that I can give you a 10% better model, you know, for code, many developers might pay 10x for access to that.
Instead of $20 a month, they might be very willing to pay $200 a month, right, for obvious
reasons.
And so there's a question of what do you do in that setting?
And so people are, you know, if you're willing to call the model many times, you
can compose those many calls into almost a network of network calls, right?
And, you know, I guess one of the questions is, how then should you compose these networks of networks, or what principles should guide their architecture?
We kind of know how to construct neural networks, but we haven't yet elucidated the principles for how to construct these networks of networks, so to speak, these compound AI systems, where you have many, many calls, maybe external components.
And so one principle that we started to explore was that maybe one thing you can probe, to figure out how to compose these calls or whether composing many calls will help you, is how verifiable the problem is. And if it's verifiable, you can actually bootstrap your way to really high performance. So what does this mean? Well, verifiable means that it's kind of easier to check an answer than it is to generate an answer. And there are a lot of cases where this is true.
Most software engineering and computing tasks kind of classically have this property. You know, we looked at things like prime factorization, or a lot of math tasks. Classically, it can take someone years of suffering to write a proof, and you can, like, read the proof in a couple of hours, right? I think we've all had that experience at some point in our training. So there are many examples like this. And so one thing you can do is you can have models, and you can, embarrassingly parallel, horizontally scale out and generate many, many candidate responses, and then relatively cheaply check those candidate responses and kind of do a best-of-K type of approach.
And it turns out the judge model or the verifier, choosing the best candidate response, might actually have a lot higher accuracy at selecting the best candidate response from the set. You can use this, kind of repeating it as a procedure, to actually bootstrap your way to really high performance in many cases.
And so, you know, we did kind of some preliminary investigations here
and we were able to, in one case, prime factorization, you know, kind of 10x the performance, going from 3.7 percent to 36.6 percent. That's on prime factorization, which is pretty hard: kind of taking a number that's a composite of two three-digit primes and factoring it into the constituent primes. It's kind of a classic problem that pops up a lot in cryptography.
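Prime factorization is a nice case because the verifier is trivial: multiplying the proposed factors back together is far cheaper than finding them. Here is a minimal, self-contained sketch of the generate-many-then-verify pattern, with a random guesser standing in for the language-model proposer (not the paper's actual setup):

```python
# Best-of-K with a cheap verifier, sketched on the prime-factorization example.
# A random "proposer" stands in for the language model; the point is the
# structure: generate many candidates, keep whichever one verifies.

import random

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n**0.5) + 1))

def propose_factors(n: int) -> tuple[int, int]:
    """Stand-in for one model call proposing a factorization of n."""
    p = random.randrange(100, 1000)   # guess a three-digit factor
    return p, n // p

def verify(n: int, p: int, q: int) -> bool:
    """Checking is cheap: multiply back and confirm both factors are prime."""
    return p * q == n and is_prime(p) and is_prime(q)

def best_of_k(n: int, k: int = 10_000):
    for _ in range(k):                # embarrassingly parallel in practice
        p, q = propose_factors(n)
        if verify(n, p, q):
            return p, q
    return None

print(best_of_k(613 * 797))           # factor a composite of two three-digit primes
```

When the check isn't exact, a judge model choosing among the K candidates plays the same role as the verifier here.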
And then also we looked at subjects in the MMLU and found that for kind of the subjects you
would expect, math, physics, electrical engineering, this type of approach is really helpful.
Now, we used language models, but it doesn't have to be a language model. It could be a simulator; it could be unit tests as your verifier, et cetera.
But we think this type of approach kind of points towards maybe a very different paradigm
for getting better performance than just kind of scaling the models
and doing a whole new pre-training from scratch.
MMLU performance bump was about 3%.
And to put that in perspective, the gap between some of the previous best models
is often less than 1%.
Between, for example, Gemini 1.5 and Llama 3.1 and things like that.
So actually, 2.8% or 3% is actually a pretty major gap on MMLU.
So pretty intriguing, and I think a lot of practitioners are hopefully going to explore this setting a lot more.
It's a super cool paper. Do you have any creative ideas about how you could apply some of the ideas here to improve performance on more open-ended tasks?
In many ways, we're not originating these. I think some of these are baked largely into systems like AlphaCode and AlphaGeometry already.
I was pretty inspired to see the AlphaGeometry results recently as well.
Yeah, I think that what we'll see people doing is kind of composing, and this sounds funny, but massive networks, where maybe each stage in the network will basically be some best-of-K component with many, many calls to different language models, you know, Claude, Gemini, GPT-4, each with their own, you know, spikes in terms of capabilities, and you'll kind of throw multiple of them at questions in many cases and then kind of choose the best response. You might, you know, also ensemble that with other components like classical,
heuristic-based systems and, you know, simulators, etc.
And kind of compose large networks that may make millions of calls to answer a question.
And I think that type of approach, it sounds kind of farcical right now, but I think it'll seem
common sense pretty soon.
You know, we think that's, we think it's a really interesting approach, and we've seen a lot
of interesting evidence that we'll speak more about pretty soon for things like code
generation and agentic tasks in the code regime, you know, for things like design, chip
design, you know, for things like actual neural network design, or network of network design
even, funny enough, in recursive ways. So it's actually really good: it turns out a lot of these problems that we care about have that property, are verifiable, and you can compose these systems and bootstrap your way to, you know, much higher performance than people might have imagined. So it seems pretty applicable downstream, but there are a lot of open questions, a lot
of work to do further. And I think, you know, part of our hope is that the community will
explore this more, and that these types of workloads that are a bit more parallelizable will become more and more common. There'll be a lot more batch inference, a lot more synthetic data generation. And you won't necessarily need the big interconnected cluster that maybe only OpenAI can afford in order to do kind of cutting-edge work in the future.
Yeah, a really cool set of ideas. And overall, a great conversation. Thanks so much for doing this, Jared.
No, thank you, Sarah. And thank you, Elad. Great to see you.
Yeah, great to see you too.
Find us on Twitter at NoPriorsPod.
Subscribe to our YouTube channel. If you want to see our faces, follow the show on Apple Podcasts,
Spotify, or wherever you listen. That way you get a new episode every
week. And sign up for emails or find transcripts for every episode at no-priors.com.