No Priors: Artificial Intelligence | Technology | Startups - The World’s Largest AI Processor with Cerebras CEO Andrew Feldman
Episode Date: September 7, 2023
The GPU supply crunch is causing desperation amongst AI teams large and small. Cerebras Systems has an answer, and it's a chip the size of a dinner plate. Andrew Feldman, CEO and Co-founder of Cerebras and previously SeaMicro, joins Sarah Guo and Elad Gil this week on No Priors. They discuss why there might be an alternative to Nvidia, localized models and predictions for the accelerator market.
Show Links: Andrew Feldman - Cerebras CEO & Co-founder | LinkedIn | Cerebras
Sign up for new podcasts every week. Email feedback to show@no-priors.com
Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @andrewdfeldman
Show Notes:
(0:00:00) - Cerebras Systems CEO Discusses AI Supercomputers
(0:07:03) - AI Advancement in Architecture and Training
(0:16:58) - Future of AI Accelerators and Chip Specialization
(0:26:38) - Scaling Open Source Models and Fine-Tuning
Transcript
The shortage of compute for AI, also known as a GPU crunch, is increasingly impacting
AI companies big and small.
It is delaying training runs and launches for multiple players in the generative AI world.
One company coming to the rescue is Cerebras Systems, which is developing the largest
computer chip and one of the fastest purpose-built AI processors ever.
This week, Sarah and I are joined by Andrew Feldman, CEO of Cerebras Systems.
Andrew is one of the few entrepreneurial veterans in the semiconductor world.
He previously started SeaMicro, a pioneer of energy-efficient, high-bandwidth microservers.
SeaMicro was acquired by AMD in 2012.
Andrew, thank you so much for joining us today.
Elad, Sarah, thank you so much for having me.
I think you all recently announced that Cerebras has closed a $100 million deal with G42 to develop one of the largest AI supercomputers in the world.
We did announce a strategic partnership with a group called G42,
and we announced that we were building nine supercomputers.
Each supercomputer would be four exaflops of AI compute,
so in total 36 exaflops of AI compute.
That was extraordinarily exciting.
When you encounter a partner that shares your vision
and wants to build with you,
and you get to start building the biggest compute on Earth,
I mean, that doesn't get better than that.
And I think in general, you all were very forward-thinking and early to identifying AI as a really important market for custom semiconductors.
Could you tell us a little bit more about your thinking early on as you started Cerebras and why you focused on AI many years ago?
In late 2015, five of us started meeting regularly.
I think we were meeting in Sarah's offices, actually, at that point.
All of us had worked together at our previous company, and we began working on ideas.
And one day our CTO, Gary, leaned back and he said,
why would a machine built for pushing pixels to a monitor be ideal for AI?
Then he said, wouldn't it be serendipitous if 20 years of optimizing a part for one job
left it really well suited for another?
That got us excited.
We began looking at GPUs, looking at AI work.
And by early 2016, we decided that we could build a better part for this work.
And our strategy was not to build a little bit better, but it was to try and do something vastly better.
And we went to a technology that had never worked before called wafer scale.
And we built a chip that's sort of the size of a dinner plate, whereas most chips are the size of a postage stamp.
And we did that because we knew that this workload would go big.
and that the problems of memory bandwidth
and problems of breaking up work
and spreading it over lots of little machines
would be daunting.
This is my fifth startup
and this is the first time I was wrong
on the market size on the low side.
I had no idea it was going to be this big
and I think very few people saw
even those of us who were in it
how big this was going to be.
And so early 2016
we went out, we did eight pitches,
we got eight term sheets,
we raised money, and we're building multi-exaflop AI supercomputers for customers around the world.
It's really incredible thinking about the foresight.
I actually went back and looked at my notes from, like, our end of 2015 meeting.
And I'm sure you remember very well, but for our listeners, Andrew had a slide in there of how the top seven problems for deep learning were all long training times.
And like the whole industry was bottlenecked on compute.
this is even before the scale up of transformers that happened. It's kind of wild how
correct you were, at least on, you know, the depth of this problem. I think what you need to do
is save that snippet and send it to my wife. We got that right. I think also we got the fundamental
architecture right in a large sense. You know, we laid down the architecture before transformers
existed, and we're the fastest at transformers by a lot. So when your architecture encounters
things it has never seen before and it's still really good at them.
That's a sign in hardware architecture that you really got the architecture right.
What are some of the benchmarks that you use in order to assess the performance of your chip
versus others, and how have you all performed relative to GPUs and other, sort of, more standard accelerators?
We are not big believers in canned benchmarks.
When I was at AMD, we had a team of 30 whose job it was to game benchmarks. When our CTO was
at Sun, they had a team of 70 whose job it was to game benchmarks.
How long it takes your customer to train a model is the answer.
And usually a collection of models.
If you're trying to do a 13 billion parameter model, medium size, right?
Almost always you begin at 100 million.
And you do 100 million.
You do a series of sweeps, right?
You're trying to do your hyperparameter tuning.
So you do 100 million, then 300, then 600, maybe a billion.
And you're trying to find the right optimization points.
And I think how you do on all of those
is the answer. How long it takes you to move from customer model to distributed compute
to finished run. And if you have to spend months doing distributed compute,
doing tensor model parallel distributed compute, I mean, if you look at the back of some of
these papers, they're crediting 20 or 30 people, sometimes more, who helped on the distributed
compute. And if you don't need to do that, like you don't with our equipment, because we run
strictly data parallel, that's all time that you get back as a customer, right? That's all
time that you get to train models and do interesting work rather than try and think about how
to break up a large matrix multiply and distribute it over a collection of GPUs. Can you actually
talk a little bit about the architecture of your chip relative to others? All right. It's a data flow
architecture. It's comprised of about 850,000 identical tiles. Each tile is a processor and memory.
And so it's a fully distributed memory machine. And so you have huge amounts of memory bandwidth
because the memory is speaking to a processor that's one clock cycle away. So by data flow, it means that the
cores wait until they get a token or a flit that arrives.
It tells them what to do, do work on this, and where to send it, and then they send it
forward.
And this is a particularly good architecture for the type of work AI is.
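To make the dataflow idea concrete, here is a minimal Python sketch of the execution model being described. It is purely conceptual: the Tile and Fabric classes, the flit format, and the two opcodes are illustrative assumptions, not Cerebras hardware or software. The point is simply that a core does nothing until a flit arrives telling it what to do and where to send the result.

from collections import deque

class Tile:
    """Toy dataflow tile: a processor plus local memory, idle until a flit arrives."""
    def __init__(self, tile_id, fabric):
        self.tile_id = tile_id
        self.fabric = fabric
        self.inbox = deque()   # incoming flits
        self.memory = {}       # local memory, one clock cycle away

    def step(self):
        if not self.inbox:
            return             # no flit, no work
        op, name, value, dest = self.inbox.popleft()
        if op == "store":
            self.memory[name] = value
        elif op == "mac":      # multiply the incoming value by a locally stored weight
            self.fabric.send(dest, ("store", name, self.memory[name] * value, None))

class Fabric:
    """Toy on-wafer interconnect that routes flits between tiles."""
    def __init__(self, n_tiles):
        self.tiles = [Tile(i, self) for i in range(n_tiles)]

    def send(self, dest, flit):
        if dest is not None:
            self.tiles[dest].inbox.append(flit)

    def run(self, steps):
        for _ in range(steps):
            for tile in self.tiles:
                tile.step()

# Example: tile 0 holds a weight, multiplies an incoming activation, forwards to tile 1.
fabric = Fabric(2)
fabric.send(0, ("store", "w", 2.0, None))
fabric.send(0, ("mac", "w", 3.0, 1))
fabric.run(3)
print(fabric.tiles[1].memory)   # {'w': 6.0}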
AI is a flow, right?
It's a learning system.
It's interesting.
It was developed in the 80s at MIT by a guy named Arvind, and they didn't have a good workload
for it, so it just sat there.
And then we got a good workload for it.
It's phenomenal for this.
That's a really interesting part of computer architecture where a really cool architecture
can have no interesting workloads for it, and then you get a good workload, and now it's ready.
Now it's good at this particular problem.
We keep a huge amount of SRAM on the wafer, all right, and so there are no memory bandwidth
problems ever.
That also allows us to harvest sparsity, which is something
that others really struggle with.
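As a rough sketch of what harvesting sparsity means (a software analogy under my own naming, not how the wafer actually schedules work): when weights and activations contain lots of zeros, you can simply skip the multiply-accumulates whose operands are zero, whereas a dense approach pays for them anyway.

import numpy as np

def dense_matvec(W, x):
    # Dense approach: every multiply-accumulate is performed, zeros included.
    return W @ x

def sparse_matvec(W, x):
    # Sparsity harvesting, in spirit: only do work where both operands are nonzero.
    y = np.zeros(W.shape[0], dtype=W.dtype)
    for j in np.nonzero(x)[0]:          # skip zero activations entirely
        rows = np.nonzero(W[:, j])[0]   # skip zero weights in this column
        y[rows] += W[rows, j] * x[j]
    return y

# With, say, 90% zeros in both W and x, roughly 99% of the
# multiply-accumulates (and the power they burn) are skipped.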
The cluster we build keeps the parameters off-chip in a parameter store, and it streams
them in.
And the result of this architecture is that we run strictly data parallel, which means
even in a 64-node cluster, you run the exact same configuration on each machine.
Each machine sees the exact same weights, but they see a different portion of the dataset.
And it scales linearly, which is unheard of in compute.
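For listeners who want to see the contrast in code, here is a minimal sketch of strictly data-parallel training using ordinary PyTorch DistributedDataParallel. This is generic, illustrative PyTorch, not Cerebras software: every worker runs the identical model configuration, each sees a different shard of the data, and gradients are averaged, which is the scheme Andrew is describing.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train_data_parallel(model, dataset, rank, world_size):
    # Every worker runs the exact same model configuration and weights...
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = DDP(model.to(rank), device_ids=[rank])

    # ...but each worker sees a different portion of the dataset.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs.to(rank)), targets.to(rank))
        loss.backward()   # DDP averages gradients across all workers here
        optimizer.step()

    # Contrast: tensor/model parallelism would require hand-splitting each large
    # matrix multiply across devices, the work Andrew says teams of 20 or 30
    # people spend months on.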
Can you, for a broader audience, describe, like, how this addresses some of the challenges of using traditional GPUs for large-scale machine learning?
Sure.
One of the real challenges of a traditional GPU is they fix the amount of memory on the interposer, right?
So you can buy a GPU with either 40 or 80 gig.
What if you want 120?
What if you want 20?
What if you want a bigger parameter store?
Right, you've got to buy more GPUs, even though you don't want those.
And so what this strategy does is it allows you to support extraordinarily large models
by disaggregating the memory from the compute.
And in all GPUs, they're tightly coupled.
And that means if you want more compute without memory, you'll still have to pay for the memory.
If you want more memory without compute, you still have to pay for the compute.
And that was an idea that came from supercomputing that we knew really well:
we could organize this so you could run an arbitrarily large network, now a trillion parameter network,
on a single system.
Now, it would be slow.
That's a very big network.
But you could run it on one system.
You can identify layers.
You can debug.
You can do all this.
And you can never view that on the GPU, right?
Because there's nowhere to store the parameters.
And so by separating parameter storage from compute, we allowed you to set
the various ratios at your will rather than at the GPU vendor's will.
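Here is a toy sketch of the disaggregation idea in plain Python. The ParameterStore and ComputeElement names and the layer-by-layer streaming loop are illustrative assumptions on my part, not the actual weight-streaming implementation: parameters live in an external store sized independently of the compute, and each layer is streamed to the compute element in turn, so model size is limited by the store rather than by on-device memory.

import numpy as np

class ParameterStore:
    """External parameter memory, sized for the model, independent of the compute."""
    def __init__(self, layer_shapes):
        self.layers = [0.01 * np.random.randn(*shape) for shape in layer_shapes]

    def stream(self):
        # Hand out one layer's weights at a time; the compute element
        # never needs to hold the whole model.
        for weights in self.layers:
            yield weights

class ComputeElement:
    """Compute with only enough memory for one layer plus activations."""
    def forward(self, store, x):
        for weights in store.stream():
            x = np.maximum(weights @ x, 0.0)   # one ReLU layer per streamed weight block
        return x

# Making the model 10x bigger only grows the parameter store; the compute
# element is unchanged (it just takes longer, as noted above).
store = ParameterStore([(512, 512)] * 8)
out = ComputeElement().forward(store, np.random.randn(512))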
So I guess in terms of the relationship that you have with G42,
there's a variety of different applications that you've looked at across healthcare,
across a variety of different protocols.
What do you view as some of the first use cases or best applications for your architecture?
So the first thing that came out of it is a model called BTLM,
which is the number one three billion parameter model on Hugging Face right now with more than a billion
downloads. We have Fortune 100 companies running on it right now. We have them doing their own
internal work on it. On Wednesday we're announcing that we're open sourcing, with G42's
group called Inception and with MBZUAI, the largest Arabic LLM. So we're putting that into the
open source community. That's a collaborative effort with our strategic partners there. When you have a lot of
compute, the mind boggles at all the cool stuff you can do.
Why did you guys get into the training and open source model game?
We had so much compute that it was a way we could prove to the world how easy it was
to train on us. We felt it was evidence that we could build and our systems could train
the biggest and fastest models, the most accurate models in the space.
In March, we put seven GPT models in the open source community.
Everybody else was putting one.
Why?
Because it's really hard to redistribute work across a GPU cluster.
For us, it's one keystroke.
So we put seven.
People are coming to us with extraordinary ideas.
We had a customer who came, and they said, look, we'd like you to design for us a model
at 3 billion parameters that we could prune and quantize such that it could be served off a cell phone.
Really cool.
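For a sense of what "prune and quantize so it can be served off a cell phone" can look like, here is a generic PyTorch sketch using the standard pruning and dynamic-quantization utilities. The tiny model here is a stand-in; this is the general technique, not the customer's actual pipeline.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for a small language model; the real target was around 3B parameters.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# Prune: zero out the 50% smallest-magnitude weights in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")   # bake the pruning mask into the weights

# Quantize: store Linear weights as int8 instead of float32, roughly a 4x
# reduction in model size and memory traffic at serving time.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# The pruned, quantized model is what you would export for on-device serving.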
What are the biggest milestones that you're thinking about going forward?
There's been a lot of back and forth.
Everybody thought we'd keep getting bigger, but, you know, north of 175 billion parameters is a giant network.
It's expensive to train.
And maybe even more important, it's really expensive to serve, right?
A 175 billion parameter model is 17 and a half times more expensive to serve than a 10 billion parameter model, roughly in proportion to the parameter count.
And so I think a lot of the models in production are 13 billion and smaller
as an effort to keep the cost of production down.
We are training right now a whole collection in the 30 to 175 billion parameter category.
I think there's going to be a spike in multilingual models.
Stability just put out one in Japanese.
There was one in Spanish that came out.
Inception, MBZUAI, and Cerebras put out one in Arabic.
I think these are underrepresented languages.
And I think we have a challenge there in that most of the big
datasets are giant internet scrapes and most of the
internet's in English. And so it's really hard to build big
datasets in languages that aren't English. We don't have tokens.
And so techniques to carefully think about how you can do
that, I think will be very very, very
much in demand. I think every nation wants to feel like its language and culture is captured
fairly in an LLM. That's interesting because a lot of the training also seems to capture
social norms versus data. And one can imagine as well as you think about different regional
models or different language-specific models, people want their own set of cultural norms to be
part of the model in some sense versus something that's imposed. That's exactly right. Language
captures experience. I mean, there's words that
are different and have very unique, I mean, it's just like the New York experience includes
Yiddish, right? It will be important that we find ways to capture these cultural experiences,
and we do that in language. And if you overwhelm the minority language with English,
all that gets sort of sifted out. I think the Yiddish LLM would just complain all the time.
That would be hilarious. That would be hilarious.
We could get a show out of it.
And it's like, what, your brothers are doctors?
Every response is, why aren't you a doctor yet?
Are you going to finish your PhD, Andrew?
Oh my God, no, don't worry.
The Chinese LLM says that too.
Oh, it will.
That's right.
That's the exact same mothers.
I think it's actually like kind of a deeper insight in the joke, which is there's
cultural experience baked into both the actual tokens and pre-training curation
and what you're weighted toward, as well as in the RLHF layer, right?
Because whatever human annotators are saying, like, this is what I want,
somebody wants to kvetch with their chatbot, too.
That's exactly right.
One of the things we learned was that it's really important to take into account the languages
that have fewer tokens, especially, you know, Arabic, Hebrew, Hindi,
whose characters sometimes require two bytes rather than one.
And so the tokenization scheme has to be really thoughtfully done, or you further bias the model in favor of English.
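A quick way to see the imbalance being described is to count UTF-8 bytes, which is roughly what a byte-level tokenizer pays before any merges are learned. The sample words below are just illustrative; the point is that Latin script costs one byte per character while Arabic and Hebrew cost two and Devanagari costs three, so a naive tokenization scheme hands English a built-in advantage.

samples = {
    "English": "peace",
    "Arabic": "سلام",
    "Hebrew": "שלום",
    "Hindi": "शांति",
}

for language, word in samples.items():
    chars = len(word)
    utf8_bytes = len(word.encode("utf-8"))
    # A byte-level tokenizer starts from utf8_bytes, so non-Latin scripts
    # pay two to three times more per character before any training happens.
    print(f"{language:7s} chars={chars} bytes={utf8_bytes} bytes_per_char={utf8_bytes / chars:.1f}")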
I think, you know, you put your finger on exactly why our strategic partners are so excited by this: they want their culture and their language properly represented.
They want it reflected.
They want their culture reflected in an LLM and not sort of sanitized or washed out with 99% English tokens.
Can we zoom out and just talk a little bit about what it's like to start a chip company?
It's a very uncommon experience, especially in as far deep as you guys are now with the money you've raised.
What's the hardest part of bringing a new chip to market?
I think the structure of the chip market is challenging.
I think one of the things that VCs love about SaaS
and about software companies is for not very much money,
you can get a prototype and you can start seeing if customers like it.
Chips are the opposite.
We're going to spend two, two and a half, sometimes three years,
and $50 or $60 million building our first part.
And then we will learn if people like it or hate it.
It has a huge gaping chasm between when you start
and when you begin learning from customers.
And that's one of the things that the software doesn't have.
Now, it turns out that if you're building a SaaS company,
you end up often using as much money.
But you get insight into what your customers like
much, much earlier than in a chip company.
And I think that's perhaps the biggest structural difference.
The structure of engineering is different
when you've got a two or three-year project.
You can't do 150 one-week sprints.
the way you manage QA versus development is different. Often in chip companies, for each
developer you have three or four QA people, DV people. And that's because a bug can cost you
$20 or $30 million. It's not like, oops, we'll fix it in 4.1.1. You have to re-spin the chip
if you have a big bug. And so the commitment to tooling, to QA, to simulation is profoundly different
in the chip world than when you're building software. Now, we still have to build a huge
amount of software, and we're about 75% software. When you build a chip, the chip is the muscle and
the software is the brain. You still need a huge amount of software. You need low-level software,
software that runs right against the hardware. You need a compile stack that allows the user to
write in PyTorch and just run as if it were a GPU. And you end up
with a lot of different types of engineers in a system company.
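As a generic illustration of what "write in PyTorch and just run" means from the user's side, here is ordinary, device-agnostic PyTorch. Nothing below is Cerebras-specific; the device string is a placeholder, and the vendor's compile stack is what takes a graph like this unchanged and lowers it onto the hardware.

import torch
import torch.nn as nn

# The user writes a completely ordinary PyTorch model...
class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

    def forward(self, x):
        return self.net(x)

# ...and the accelerator vendor's job is to run it unchanged. On a GPU the
# device is "cuda"; another vendor plugs in its own backend instead (placeholder here).
device = "cuda" if torch.cuda.is_available() else "cpu"

model = TinyMLP().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(64, 784, device=device)
y = torch.randint(0, 10, (64,), device=device)

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()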
Andrew, you have such a unique expertise in depth
and a question that everybody is thinking about right now.
Can you give our listeners an overview of the AI accelerator market right now?
Like, who are the big buyers?
And then how might they decide to do anything but Nvidia?
Nvidia has made hay while the sun shines.
And they've done an extraordinary job,
and you have to take your hat off to them, right?
I think they're now in a situation where they're extorting their customers.
They're extremely expensive.
They're unable to ship.
And that has, among other things, opened the door for many of us who have alternatives.
Nobody likes being dependent on their vendor.
You can ask the guys at Google and Facebook who are dependent on Intel for years, how much they hated that.
They dislike that intensely.
And so I think there's a battle between people sort of dislike being dependent and the need just to keep running forward.
I think large enterprises continue to be a good part of our business.
I think often where you can provide a little consulting to help them accelerate their model deployment
is something that provides an opportunity for smaller companies to get in the door.
And when you show them how much easier it is, we had a situation where they were trying to train
on a GPU cluster, and they were at 60 days and it wasn't converging, and we stood it up, and three and a half days
later their model converged. And they're like, holy cow. We have an internal cloud, so customers
can just jump on our cloud. They can begin training right away. If they want to try before they
buy, they can begin with a little training run and go from there. We have customers in large
enterprises and generative AI startups. We have customers in the military and the intelligence
communities. We have customers in the supercomputer market and around the world. We're sort of up and down
organizations of different sizes.
How much specialization do you think there will be in the semiconductor layer going forward?
In other words, do you think there will be increasingly different chipsets for inference versus training,
for transformer-based architectures versus diffusion models, for multimodal versus not?
I'm just sort of curious how you view all that.
The fundamental question in computer architecture is what do you make special and what do you make
general.
And the second fundamental question is, what are you good at?
What are you not good at? What do you choose not to be good at, right? And those things are really important. You know, if you put in circuitry for transformers, that has a cost. There's a penalty every time you do a convolutional network. And so these tradeoffs aren't free. You put in circuitry for one thing. If you're not doing that thing, that circuitry just sits idle and you've wasted power and space. And so I think
It is a TBD question on whether we will continue to see more specialization at the chip level in training.
We do not do it.
We crack the problem in a different way.
All of the problems are linear algebra and are predominantly sparse linear algebra.
So we focused on what's underneath the transformer.
Now, whether you will have different silicon for training and for inference, I think you will.
So your view on this, Andrew, is less a strong point of view that there will be something beyond transformers that is very important in the near term, and more that you've already solved it in a more general way.
That's right. The trick in architecture is to solve hard problems in a general way.
So you don't have to rely on product management to sort of guess what's the next cool layer type, right?
If we go to a type of model that doesn't require very much data,
if we go to single shot learning, right?
Nvidia is totally out of luck, right?
Us too, everybody.
These machines are designed for extremely data-intensive, transformer-based models.
But whether you're doing convolutional neural networks or transformers or diffusion models
or certainly new things will be invented,
I mean, we haven't run out of ideas.
We haven't even run through the ideas that the guys put together in the 80s.
right? We still have plenty of work to do to improve our model making. And as the industry matures,
and we're still in the early innings, you might see more customization for a particular model type,
but that's not my preferred approach. Are you making bets internally on separating train and inference?
We make all sorts of bets, Sarah. We lose frequently, and we win sometimes.
That's the real fun of building companies in a dynamic environment, right?
I think inference, especially generative inference, right, is a very different problem than all
other forms of inference, right? Basic classification is the exact same as the first pass
in a training run. But when you're doing inference on generative AI, it's not, and you have to
hold a ton more information at the inference point.
And so I think there will be, we will discover, we will invent new and different ways
to attack that problem.
It's extremely expensive right now.
I mean, people are using eight H100s to do inference on a big model.
I mean, that's half a million dollars.
You know, there is this ongoing crunch in the supply of GPUs and other sorts of AI-centric
compute.
And that's causing different people to delay launching their companies or
training their models, and those are often the same thing for different types of AI companies.
What do you view as the main cause of this supply crunch, and when do you think it loosens up?
Or when do you think things loosen in the market?
The chip market has a profoundly inflexible supply chain.
And so take TSMC: a fab is a pyramid of modern society.
It is the absolute peak of manufacturing capability.
It's a $20 billion building in which nobody works.
And ideas come in and out come chips that cost $100.
I mean, these are unbelievably, unbelievably amazing things, but they can't turn on a dime.
And so when somebody misses their forecast by a lot and they come back to the fab and say,
I need twice as much or three times as much, often that takes six or eight months to sift through.
And that's the problem that's coming in right now.
I mean, it's not just that Wall Street missed what Nvidia would sell.
Nvidia missed it.
They missed the forecast.
And then you can go to TSMC and you can ask them for wafers, but many of their wafers are already allocated to AMD.
They're allocated to Qualcomm.
They're allocated to other giant companies, right?
And so you struggle forward.
And so in addition to taking a long time to get to first customer learning, there's a huge
premium placed on the ability to forecast accurately in the chip business.
Given that we're going through this big discontinuity from a technology perspective,
adoption perspective, etc., do you think we'll be accurate in forecasts going forward anytime
soon? Do you think there's likely to be a crunch for the next couple years?
We're going to continue to get it wrong. We're going to continue to get it wrong for a while.
Do you think we're largely going to underestimate it?
Oh, yeah. So basically the market is accelerating rapidly. Everybody's underestimating it.
And, you know, fundamentally, we all think that this is going to be many times bigger than anybody thinks two, three years from now.
Yeah, and the dynamics of our supply chain are when we get it wrong and you order too much, you still have to take it, right?
You don't get to say, guess what, guys? I really don't want those wafers anymore. And they're like, see this $10 million worth of wafers? Yours.
Yeah, that happened during crypto, right? So I think Nvidia had a really big miss during crypto.
It happened during crypto.
It happened to Broadcom.
I mean, Arista today is at 52-week lead times on switches.
Given that you've been involved with training some incredibly performant open-source models,
and you mentioned that as you scale up a model,
inference cost starts to really kick in as you get to bigger and bigger models from a parameter perspective,
what do you view as the limits to scaling these models?
Like at what point do they get too big, or do you think we just keep scaling them
until we have these sort of hypermodels?
It's a really interesting optimization problem.
They get big, they get harder to work with.
It's harder to retrain, right?
Even when GPT-4 came out,
It didn't have any insight beyond 2021.
They had to race.
That was because it was so big, right?
And so I think there are tradeoffs.
There are tradeoffs between accuracy and size.
There are tradeoffs between size and cost to do inference.
I mean, maybe I want a larger model for radiology files
or for my doctors; maybe I'll take a smaller model for my chat or my customer service bot, right?
We're going to have to think about these in terms of what they're delivering, what the cost of being wrong is, what the cost of serving is, and think about this as a business decision.
And at first, everyone's just running trying to say, I can make a bigger model, I can make a bigger model.
And then the guys who are trying to run businesses are like, well, I can't afford to give that away.
And so we're going to be down at 3 billion and at 6 billion and at 13 billion parameters, because there I can get pretty good, pretty darn good, and not break the bank with free inference.
And so I think that there are these tradeoffs that are really challenging and we're just beginning to grapple with now.
Now, I guess OpenAI, related to that, recently announced a partnership with Scale around fine-tuning.
How important do you think fine tuning will be going forward?
I think fine tuning will be really important.
We're still just learning about the capacity of these models to hold information and to hold insight.
And fine tuning is an extremely exciting approach.
I think we've seen again and again that human feedback makes these models vastly better.
And that's not surprising, right?
I mean, that's how you learn to dance.
That's how you learn to do everything in soccer, gymnastics, ballet:
somebody saying, no, that's wrong. Nope. Do it again. That correction as a mechanism for
improvement somehow doesn't seem very surprising to me. I think we don't yet know how far
fine-tuning is going to take us and who will benefit from a model trained from scratch
versus continuous training versus taking a model off the shelf and adding to it their unique data.
You know, there is this point of view from some of the large labs that very few people will do training and it will get quite concentrated.
But I've heard you speak about how you think that there are really interesting data sets in different large enterprises.
Tell us about that.
You know, at first, everybody scraped the Internet; everybody had that.
And now the companies that have extraordinary data sets: Reuters, Bloomberg, GlaxoSmithKline, AbbVie.
These companies that have years of research and exceptional data are going to, I think, step to the fore.
The AI has made progress to the point where people are looking around for exceptional datasets, unique data sets.
The rise of data as an asset, I think it's the new gold.
And I think that's something that isn't talked about enough and will be very much top of mind going forward.
Find us on Twitter at @NoPriorsPod.
Subscribe to our YouTube channel if you want to see our faces, follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.