In The Arena by TechArena - Exploring Cerebras' AI Innovations: Inference, Llama, and Efficiency
Episode Date: September 13, 2024
Sean Lie of Cerebras shares insights on cutting-edge AI hardware, including their game-changing wafer-scale chips, Llama model performance, and innovations in inference and efficiency.
Transcript
Welcome to the Tech Arena,
featuring authentic discussions between
tech's leading innovators and our host, Allyson Klein.
Now, let's step into the arena.
Welcome to the Arena. My name is Allyson Klein. We're coming to you live from
the AI Hardware Summit, and I'm so glad to be here with Sean Lie, CTO of Cerebras. Sean,
welcome to the program. Thank you so much for having me. I've been following Cerebras for a
number of years, and you guys have seemed like the grown-up kid in the AI hardware sandbox,
and you've been designing unique silicon solutions
for AI for a while. Can you provide some background on the company and how we got here?
Oh yeah, for sure. I guess it has been about eight years now since we started. We started in 2016,
really at the kind of beginning of the modern deep learning era. And what got us started was
really just the realization that there was this
tight dependency and connection between what ML can do and the hardware that is behind it.
And it almost feels like ancient history now, but this was like right after AlexNet came out and,
you know, the first models were trained on one or two GPUs. And we saw that there was this tight
connection. And then we believed that if we did something very different at the hardware level, we could enable something very different at the ML level.
And when you do deep tech innovation like what we've done, it takes three years to bring it to fruition.
And over time, everything that we've made as a company has very much been about going as big as possible.
I don't think even we could have imagined
how much Gen AI has exploded and the compute demand.
But everything about what we're doing
from the physical chip that we built,
which is 56 times larger than traditional chips,
to our clusters, to all of the software
was all about making sure that we were trying to go as big
and push the envelope as much as possible.
And here we are now, and I guess our bets have paid off.
That's awesome.
You know, I've been thinking about you guys
and thinking about you from a standpoint
of producing those dinner plate-sized processors,
and they're quite eye-catching.
I've seen them at conferences before.
But what's even more impressive
is the recent news about your AI inference performance.
Can you tell me what you've delivered?
Yeah, absolutely.
I think to
give a little bit of context, what we realized when we looked at the Gen AI inference space
was that all of the solutions today were definitely slow. And it ultimately comes
down to the fact that Gen AI inference is very unique in that it's memory bandwidth bound. And
whether you're using an AMD GPU or an NVIDIA GPU or a TPU, they're all using
HBM, and that ultimately limits the performance. And what we did, by using our wafer-scale,
dinner-plate-sized chip, was offer 7,000 times more memory bandwidth. And as
a result, we're able to get order-of-magnitude speedups in inference just through the hardware alone.
That's crazy.
And we just launched it two weeks ago.
And the responses were just more than we could have imagined.
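To make the memory-bandwidth point concrete, here is a rough back-of-the-envelope sketch. The numbers are illustrative assumptions, not vendor specs: during single-stream decoding, every model weight has to be read from memory once per generated token, so memory bandwidth divided by model size gives an upper bound on tokens per second.

```python
# Back-of-the-envelope bound on single-stream decode speed.
# Illustrative numbers only; not vendor specifications.

def max_tokens_per_sec(model_params: float, bytes_per_param: float,
                       mem_bandwidth_gb_s: float) -> float:
    """Upper bound: every parameter is read once per generated token,
    so decode speed is capped by memory bandwidth / model size."""
    model_bytes = model_params * bytes_per_param
    return (mem_bandwidth_gb_s * 1e9) / model_bytes

# A 70B-parameter model in 16-bit weights (~140 GB of weights).
params = 70e9
bytes_per_param = 2  # fp16 / bf16

# Hypothetical HBM-class accelerator vs. a wafer-scale part with on-chip SRAM.
for name, bw_gb_s in [("HBM-class accelerator, ~3 TB/s", 3_000),
                      ("wafer-scale on-chip SRAM, ~21 PB/s (illustrative)", 21_000_000)]:
    print(f"{name}: <= {max_tokens_per_sec(params, bytes_per_param, bw_gb_s):,.0f} tokens/s")
```

Under these assumed figures, the bandwidth ratio between the two cases is on the order of the 7,000x mentioned above, which is why the bound on tokens per second moves by a similar factor.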
Now let's unpack that news a little bit.
You decided to focus on Llama inference for this particular announcement.
Can you share why you're aiming at inference in particular?
Obviously, there's an entire AI pipeline, and much of that pipeline could be
addressed by Cerebras, not just inference.
And why did you choose Llama, too?
Focusing on inference, I guess the first part is that inference,
it's what touches the actual user.
And what we probably all see is that today, GenAI inference solutions
don't quite live up to, I think, the user experience that
we all would like to have.
It's not that responsive.
The user engagement, the interaction is very low.
And so we felt that there was an opportunity to fundamentally change that user interface.
We think that would unlock the next level of value.
So that's why inference.
Now, why Llama?
I think there's a couple of reasons here.
One is that in general at Cerebras,
we are really big believers in open source and in open models.
We believe the future is all about the community coming together
and supporting each other and innovating off of each other's innovation.
I think Llama embodies that philosophy very well.
And so it's very much aligned with what we believe.
Most of the ML work that we've done,
we've put into open source. So that's the first thing. And I think the other is just
really pragmatic: Llama right now is the de facto standard open model that everybody wants
to use. The quality of the results, the size of the models, they're all state-of-the-art. And it's
just remarkable what you can do now with a model you can download from Hugging Face.
Right. Now, in AI hardware,
there are a lot of silicon players in this space. A lot of them have great PowerPoints. You have
silicon. You also demonstrated in your recent release that you have customers. So tell me
about the customers that talked about this latest instantiation of the technology and how
market traction has been going. That's a really good point. We made a very conscious decision in our latest launch here with inference
that we're not just launching a hardware platform
with some slides showing the performance,
but instead also launching an actual inference service
that can be accessed through the cloud by the entire developer community
as well as our customers.
And now we're starting to see that our hypothesis there was absolutely right.
Just in the last two weeks, I guess, since we launched, we've had over 30,000
developers come and use our platform.
We've had customers and users build all sorts of interesting applications
literally within the first few days.
As an example, actually, one of the most impressive ones, which we weren't even expecting,
was a company called LiveKit,
which builds a voice pipeline.
They built what was initially a demo
around our super-fast inference,
where you can basically talk to an assistant-like
interface through voice.
We didn't even know they were doing this.
They built it just for fun.
But seeing what the faster inference could enable,
they were getting much lower latency and much better interaction with the user, exactly the user engagement
benefit that I mentioned earlier. They built that in, I think, a matter of hours, showed it to us,
and we were so impressed. We've now installed it in our service, so people can also use it
when they come to our service and chat with our chatbot through voice. And so we're
seeing so many customers and users like this
that have just popped up over the last couple of weeks
because we decided to launch it this way.
And we also now have engagements with a lot of really big customers,
which we will almost certainly be announcing publicly very soon.
That's awesome.
Now, why do you think that it's critical?
You talked about being dedicated to open source,
dedicated to community.
Why is it critical to have competition in this space?
Competition is really interesting.
On the one hand, obviously, with my Cerebras hat on,
I'd love it if there were no competition and we were just able to completely dominate the space.
But when I step back a little bit,
I think competition is ultimately a very healthy
and very important thing, as you mentioned.
The first is, I think, it's the ultimate validation of what you're doing.
Others are also trying to do it.
It's the ultimate validation that what you're doing actually makes a difference and that people actually care about it.
And I think the other one is just part of the reason we believe in open source so much and the community so much
is that we believe that it's better for the community to have options,
have the ability to move between options.
This space, in general, training and inference, each one is so large with so many different
use cases.
In general, we believe that there can be multiple different solutions for all the different
use cases and the competition will foster that kind of very diverse ecosystem.
I think that's what the entire community will ultimately benefit from.
That's awesome.
Now, we've talked about performance.
You guys are obviously delivering on that vector.
But the other thing that I wanted to talk to you about is performance efficiency.
There's been so much focus on the energy consumption around AI.
How are you addressing that in your products?
Yeah.
You know, the energy consumption,
the environmental impacts, the straight-up costs,
ultimately they can't be an afterthought,
especially at scale.
And at Cerebras, we recognize that just having performance
at all costs might serve as a really interesting demo,
but it's not actually going to be a sustainable product
or have the kind of impact on the industry and community that we want.
And so when we designed and architected the entire solution, obviously
performance was front and center, but we very much wanted to make sure that we
could also run it at high throughput, which ultimately reduces the cost and
reduces the average power per request.
And it really comes down to some of the fundamentals in our
architecture: with the higher memory bandwidth, we're able to offer the
really high performance and the higher throughput.
And in the end, that higher throughput means lower cost and lower power at the same time.
And we believe we have a unique architecture that can achieve that.
And now our users can also experience it because that lower cost and lower power ultimately
translates into a lower price for our solution.
And so right now, we've got the fastest inference service,
but it's also among the cheapest.
Right now, we're pricing our 70B Llama model
at 60 cents per million tokens.
Wow.
It's among the lowest in the industry right now.
That's incredible.
That's an incredible price point.
That's wonderful.
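As a quick sanity check on what per-token pricing means in practice, here is a tiny worked example using the rate quoted above. Treat the rate as the figure from this conversation and the workload numbers as arbitrary assumptions; check current pricing before relying on it.

```python
# What a per-million-token price works out to for a typical chat workload.
# The rate below is the figure quoted in the conversation; verify current pricing.

PRICE_PER_MILLION_TOKENS = 0.60  # USD, quoted for the 70B Llama model


def cost_usd(tokens: int, price_per_million: float = PRICE_PER_MILLION_TOKENS) -> float:
    """Cost of generating `tokens` output tokens at a flat per-million rate."""
    return tokens / 1_000_000 * price_per_million


# e.g. a chatbot producing ~500 tokens per reply, 10,000 replies per day (assumed workload)
daily_tokens = 500 * 10_000
print(f"{daily_tokens:,} tokens/day ~= ${cost_usd(daily_tokens):.2f}/day")  # ~= $3.00/day
```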
We've talked about Llama
and the fact that you're delivering
great performance on Llama,
but I know that one thing that's important to customers
is breadth of model support.
How are you addressing that?
That's absolutely true.
I think the ecosystem is so diverse and there's so many different models.
We recognized that early on, even before inference, and we invested really
heavily over the last several years in building a general compiler that
connects to industry-standard frameworks like PyTorch, allows
the programmer to program using those industry-standard frameworks, and ultimately
automatically maps those models onto our hardware.
That's fantastic.
And that's what we've been running in the background on the training side for the
last several years.
And it's ultimately the underpinning of our inference product as well.
Even though we're offering it as an API to start,
we envision that there will be many customers
for whom API-level access is good enough.
But then, as we grow the number of models we support,
or as customers and users want to use custom models,
they'll be able to use our compiler to map their custom models.
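As a loose illustration of what "write the model in a standard framework and let a compiler map it onto the hardware" looks like from the developer's side, here is a generic PyTorch sketch. The backend shown is PyTorch's default; the idea is that a vendor-supplied compiler backend would be swapped in at that point. None of this is Cerebras's actual API.

```python
import torch
import torch.nn as nn


# Standard PyTorch model, written with no hardware-specific code.
class TinyMLP(nn.Module):
    def __init__(self, d_in: int = 512, d_hidden: int = 2048, d_out: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


model = TinyMLP()

# With torch.compile, a compiler backend takes the captured graph and lowers it
# to a target. "inductor" is PyTorch's default; a hardware vendor would supply
# its own backend name here (placeholder idea, not a real Cerebras identifier).
compiled = torch.compile(model, backend="inductor")

out = compiled(torch.randn(8, 512))
print(out.shape)  # torch.Size([8, 512])
```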
Awesome.
Now, what's next for Cerebras on the innovation curve? You guys have been delivering
amazing results. What can we expect next from you?
It's very interesting. We're really proud of everything we've delivered so far and being
able to get to the fastest inference solution right now on the market. But we really believe
this is just the very beginning for us on the inference front. We have many additional
enhancements that we're working on,
some that improve performance, some that improve the cost and the power even further. Those will
be coming out in the coming weeks. And then in general, you know, at the bigger picture level,
it's in our DNA to be pushing the technology boundaries. We knew that tackling the problem
of wafer-scale integration, for example, was not going to be easy.
We saw a path, and we knew that the impact could be really huge if we
could do it. We are constantly searching for the next wafer scale step function improvement. And
we have many medium-term and long-term projects that are in the works. And I'm going
to be very happy and excited to share those once they're ready. That's so cool, Sean. Thank you
so much.
It was great meeting you, and it was great learning more about Cerebras and what the team is delivering.
Where can folks find out more information about the solutions
we talked about today and engage your team?
The best way is to go to inference.cerebras.ai, where you can
actually even try out our inference service yourself through
our chatbot or an API.
And then through that, you can actually engage with our team directly. Or I'd encourage you to go on to
the social media sites, Discord and Reddit, and you can engage with us in the community there.
Like I said, we are very much a believer in doing this with the community.
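For developers who want to try the API route mentioned above, a minimal sketch is shown below, assuming an OpenAI-compatible chat completions endpoint. The base URL, environment variable name, and model identifier are assumptions to verify against Cerebras's documentation.

```python
import os

from openai import OpenAI  # pip install openai

# Assumed OpenAI-compatible endpoint; confirm the base URL and model names in the docs.
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],  # assumed environment variable name
)

# Single chat completion request against an assumed Llama 70B model identifier.
response = client.chat.completions.create(
    model="llama3.1-70b",  # assumed model identifier
    messages=[
        {"role": "user",
         "content": "In one sentence, why is LLM inference memory-bandwidth bound?"},
    ],
)
print(response.choices[0].message.content)
```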
Sean, thanks for taking the time out of your busy schedule at the AI Hardware Summit to chat.
It was a real pleasure.
It was a pleasure. Thank you so much for having me. Thanks for joining the Tech Arena. Subscribe and engage at our website,
thetecharena.net. All content is copyright by the Tech Arena.