In The Arena by TechArena - Exploring Cerebras' AI Innovations: Inference, Llama, and Efficiency
Episode Date: September 13, 2024
Sean Lie of Cerebras shares insights on cutting-edge AI hardware, including their game-changing wafer-scale chips, Llama model performance, and innovations in inference and efficiency.
Transcript
Welcome to the Tech Arena,
featuring authentic discussions between
tech's leading innovators and our host, Allyson Klein.
Now, let's step into the arena.
Welcome to the Arena. My name is Allyson Klein. We're coming to you live from
the AI Hardware Summit, and I'm so glad to be here with Sean Lie, CTO of Cerebras. Sean,
welcome to the program. Thank you so much for having me. I've been following Cerebras for a
number of years, and you guys have seemed like the grown-up kid in the AI hardware sandbox,
and you've been designing unique silicon solutions
for AI for a while. Can you provide some background on the company and how we got here?
Oh yeah, for sure. I guess it has been about eight years now since we started. We started in 2016,
really at the kind of beginning of the modern deep learning era. And what got us started was
really just the realization that there was this
tight dependency and connection between what ML can do and the hardware that is behind it.
And it almost feels like ancient history now, but this was like right after AlexNet came out and,
you know, the first models were trained on one or two GPUs. And we saw that there was this tight
connection. And then we believed that if we did something very different at the hardware level, we could enable something very different at the ML level.
And when you do deep tech innovation like what we've done, it takes three years to bring it to fruition.
And over time, everything that we've made as a company has very much been about going as big as possible.
I don't think even we could have imagined
how much Gen AI has exploded and the compute demand.
But everything about what we're doing
from the physical chip that we built,
which is 56 times larger than traditional chips,
to our clusters, to all of the software
was all about making sure that we were trying to go as big
and push the envelope as much as possible.
And here we are now, and I guess our bets have paid off.
That's awesome.
You know, I've been thinking about you guys
and thinking about you from a standpoint
of producing those dinner plate-sized processors,
and they're quite eye-catching.
I've seen them at conferences before.
But what's even more impressive
is the recent news about your AI inference performance.
Can you tell me what you've delivered?
Yeah, absolutely.
I think to
give a little bit of context, what we realized when we looked at the Gen AI inference space
was that all of the solutions today were definitely slow. And it ultimately comes
down to the fact that Gen AI inference is very unique in that it's memory bandwidth bound. And
whether you're using an AMD GPU or an NVIDIA GPU or a TPU, they're all using
HBM, and that ultimately limits the performance. And what we did, by using our wafer-scale,
dinner-plate-sized chip, was offer 7,000 times more memory bandwidth. And as
a result, we're able to get order-of-magnitude speedups in inference just through the hardware alone.
That's crazy.
And we just launched it two weeks ago.
And the responses were just more than we could have imagined.
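To make the memory-bandwidth point concrete, here is a rough back-of-the-envelope sketch. The numbers are illustrative assumptions, not vendor specs: during single-stream decoding, every model weight has to be read from memory once per generated token, so memory bandwidth divided by model size gives an upper bound on tokens per second.

```python
# Back-of-the-envelope bound on single-stream decode speed.
# Illustrative numbers only; not vendor specifications.

def max_tokens_per_sec(model_params: float, bytes_per_param: float,
                       mem_bandwidth_gb_s: float) -> float:
    """Upper bound: every parameter is read once per generated token,
    so decode speed is capped by memory bandwidth / model size."""
    model_bytes = model_params * bytes_per_param
    return (mem_bandwidth_gb_s * 1e9) / model_bytes

# A 70B-parameter model in 16-bit weights (~140 GB of weights).
params = 70e9
bytes_per_param = 2  # fp16 / bf16

# Hypothetical HBM-class accelerator vs. a wafer-scale part with on-chip SRAM.
for name, bw_gb_s in [("HBM-class accelerator, ~3 TB/s", 3_000),
                      ("wafer-scale on-chip SRAM, ~21 PB/s (illustrative)", 21_000_000)]:
    print(f"{name}: <= {max_tokens_per_sec(params, bytes_per_param, bw_gb_s):,.0f} tokens/s")
```

Under these assumed figures, the bandwidth ratio between the two cases is on the order of the 7,000x mentioned above, which is why the bound on tokens per second moves by a similar factor.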
Now let's unpack that news a little bit.
You decided to focus on Llama inference for this particular announcement.
Can you share why you're aiming at inference in particular?
Obviously, there's an entire AI pipeline, and much of that pipeline could be
addressed by Cerebras, not just inference.
And why did you choose Llama, too?
Focusing on inference, I guess the first part is that inference,
it's what touches the actual user.
And what we probably all see is that today, GenAI inference solutions
don't quite live up to, I think, the user experience that
we all would like to have.
It's not that responsive.
The user engagement, the interaction is very low.
And so we felt that there was an opportunity to fundamentally change that user interface.
We think that would unlock the next level of value.
So that's why inference.
Now, why Llama?
I think there's a couple of reasons here.
One is that in general at Cerebras,
we are really big believers in open source and in open models.
We believe the future is all about the community coming together
and supporting each other and innovating off of each other's innovation.
I think Llama embodies that philosophy very well.
And so it's very much aligned with what we believe.
Most of the ML work that we've done,
we've put into open source. So that's the first thing. And I think the other is just
really pragmatic: Llama right now is the de facto standard open model that everybody wants
to use. The quality of the results, the size of the models, they're all state-of-the-art. And it's
just remarkable what you can do now with a model you can download from Hugging Face.
Right. Now, in AI hardware,
there are a lot of silicon players in this space. A lot of them have great PowerPoints. You have
silicon. You also demonstrated in your recent release that you have customers. So tell me
about the customers that talked about this latest instantiation of the technology and how
market traction has been going. That's a really good point. We made a very conscious decision in our latest launch here with inference
that we're not just launching a hardware platform
with some slides showing the performance,
but instead also launching an actual inference service
that can be accessed through the cloud by the entire developer community
as well as our customers.
And now we're starting to see that our hypothesis there was absolutely right.
Just in the last two weeks, I guess, since we launched, we've had over 30,000
developers come and use our platform.
We've had customers and users build all sorts of interesting applications
literally within the first few days.
As an example, actually, one of the most impressive ones, which we weren't even expecting,
was a company called LiveKit,
which builds a voice pipeline.
They built what was initially a demo
around our super-fast inference,
where you can basically talk to an assistant-like
interface through voice.
We didn't even know they were doing this.
They built it just for fun.
But seeing what the faster inference could enable,
they were getting much lower latency and much better interaction with the user, exactly the user engagement
benefit that I mentioned earlier. They built that in, I think, a matter of hours, showed it to us,
and we were so impressed. We've now installed it in our service, so people can also use it
when they come to our service and chat with our chatbot through voice. And so we're
seeing so many customers and users like this
that have just popped up over the last couple of weeks
because we decided to launch it this way.
And we also now have engagements with a lot of really big customers,
which we will almost certainly be announcing publicly very soon.
That's awesome.
Now, why do you think that it's critical?
You talked about being dedicated to open source,
dedicated to community.
Why is it critical to have competition in this space?
Competition is really interesting.
On the one hand, obviously, with my Cerebras hat on,
I'd love it if there were no competition and we were just able to completely dominate the space.
But when I step back a little bit,
I think competition is ultimately a very healthy
and very important thing, as you mentioned.
The first is, I think, it's the ultimate validation of what you're doing.
Others are also trying to do it.
It's the ultimate validation that what you're doing actually makes a difference and that people actually care about it.
And I think the other one is just part of the reason we believe in open source so much and the community so much
is that we believe that it's better for the community to have options,
have the ability to move between options.
This space, in general, training and inference, each one is so large with so many different
use cases.
In general, we believe that there can be multiple different solutions for all the different
use cases and the competition will foster that kind of very diverse ecosystem.
I think that's what the entire community will ultimately benefit from.
That's awesome.
Now, we've talked about performance.
You guys are obviously delivering on that vector.
But the other thing that I wanted to talk to you about is performance efficiency.
There's been so much focus on the energy consumption around AI.
How are you addressing that in your products?
Yeah.
You know, the energy consumption,
the environmental impacts, the straight-up costs,
ultimately they can't be an afterthought,
especially at scale.
And at Cerebras, we recognize that just having performance
at all costs might serve as a really interesting demo,
but it's not actually going to be a sustainable product
or have the kind of impact on the industry and community that we want.
And so when we designed and architected the entire solution, obviously
performance was front and center, but we very much wanted to make sure that we
could also run it at high throughput, which ultimately reduces the cost and
reduces the average power per request.
And it really comes down to some of the fundamentals in our
architecture: with the higher memory bandwidth, we're able to offer the
really high performance and the higher throughput.
And in the end, that higher throughput means lower cost and lower power at the same time.
And we believe we have a unique architecture that can achieve that.
And now our users can also experience it because that lower cost and lower power ultimately
translates into a lower price for our solution.
And so right now, we've got the fastest inference service,
but it's also among the cheapest.
Right now, we're pricing our 70B Llama model
at 60 cents per million tokens.
Wow.
It's among the lowest in the industry right now.
That's incredible.
That's an incredible price point.
That's wonderful.
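As a quick sanity check on what per-token pricing means in practice, here is a tiny worked example using the rate quoted above. Treat the rate as the figure from this conversation and the workload numbers as arbitrary assumptions; check current pricing before relying on it.

```python
# What a per-million-token price works out to for a typical chat workload.
# The rate below is the figure quoted in the conversation; verify current pricing.

PRICE_PER_MILLION_TOKENS = 0.60  # USD, quoted for the 70B Llama model


def cost_usd(tokens: int, price_per_million: float = PRICE_PER_MILLION_TOKENS) -> float:
    """Cost of generating `tokens` output tokens at a flat per-million rate."""
    return tokens / 1_000_000 * price_per_million


# e.g. a chatbot producing ~500 tokens per reply, 10,000 replies per day (assumed workload)
daily_tokens = 500 * 10_000
print(f"{daily_tokens:,} tokens/day ~= ${cost_usd(daily_tokens):.2f}/day")  # ~= $3.00/day
```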
We've talked about Llama
and the fact that you're delivering
great performance on Llama,
but I know that one thing that's important to customers
is breadth of model support.
How are you addressing that?
That's absolutely true.
I think the ecosystem is so diverse and there's so many different models.
We recognized that early on, even before inference, and we invested really
heavily over the last several years in building a general compiler that
connects to industry-standard frameworks like PyTorch, allows
the programmer to program using those industry-standard frameworks, and ultimately
automatically maps those models onto our hardware.
That's fantastic.
And that's what we've been running in the background on the training side for the
last several years.
And it's ultimately the underpinning of our inference product as well.
Even though we're offering it as an API to start,
we envision that there will be many customers
for whom API-level access is good enough.
But then, as we grow the number of models we support,
or as customers and users want to use custom models,
they'll be able to use our compiler to map their custom models.
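As a loose illustration of what "write the model in a standard framework and let a compiler map it onto the hardware" looks like from the developer's side, here is a generic PyTorch sketch. The backend shown is PyTorch's default; the idea is that a vendor-supplied compiler backend would be swapped in at that point. None of this is Cerebras's actual API.

```python
import torch
import torch.nn as nn


# Standard PyTorch model, written with no hardware-specific code.
class TinyMLP(nn.Module):
    def __init__(self, d_in: int = 512, d_hidden: int = 2048, d_out: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


model = TinyMLP()

# With torch.compile, a compiler backend takes the captured graph and lowers it
# to a target. "inductor" is PyTorch's default; a hardware vendor would supply
# its own backend name here (placeholder idea, not a real Cerebras identifier).
compiled = torch.compile(model, backend="inductor")

out = compiled(torch.randn(8, 512))
print(out.shape)  # torch.Size([8, 512])
```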
Awesome.
Now, what's next for Cerebras on the innovation curve? You guys have been delivering
amazing results. What can we expect next from you?
It's very interesting. We're really proud of everything we've delivered so far and being
able to get to the fastest inference solution right now on the market. But we really believe
this is just the very beginning for us on the inference front. We have many additional
enhancements that we're working on,
some that improve performance, some that improve the cost and the power even further. Those will
be coming out in the coming weeks. And then in general, you know, at the bigger picture level,
it's in our DNA to be pushing the technology boundaries. We knew that tackling the problem
of wafer-scale integration, for example, was not going to be easy.
We saw a path, and we knew that the impact could be really huge if we
could do it. We are constantly searching for the next wafer scale step function improvement. And
we have many medium-term and long-term projects that are in the works. And I'm going
to be very happy and excited to share those once they're ready. That's so cool, Sean. Thank you
so much.
It was great meeting you, and it was great learning more about Cerebras and what the team is delivering.
Where can folks find out more information about the solutions
we talked about today and engage your team?
The best way is to go to inference.cerebras.ai, where you can
actually even try out our inference service yourself through
our chatbot or an API.
And then through that, you can actually engage with our team directly. Or I'd encourage you to go on to
the social media sites, Discord and Reddit, and you can engage with us in the community there.
Like I said, we are very much a believer in doing this with the community.
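For developers who want to try the API route mentioned above, a minimal sketch is shown below, assuming an OpenAI-compatible chat completions endpoint. The base URL, environment variable name, and model identifier are assumptions to verify against Cerebras's documentation.

```python
import os

from openai import OpenAI  # pip install openai

# Assumed OpenAI-compatible endpoint; confirm the base URL and model names in the docs.
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],  # assumed environment variable name
)

# Single chat completion request against an assumed Llama 70B model identifier.
response = client.chat.completions.create(
    model="llama3.1-70b",  # assumed model identifier
    messages=[
        {"role": "user",
         "content": "In one sentence, why is LLM inference memory-bandwidth bound?"},
    ],
)
print(response.choices[0].message.content)
```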
Sean, thanks for taking the time out of your busy schedule at the AI Hardware Summit to chat.
It was a real pleasure.
It was a pleasure. Thank you so much for having me. Thanks for joining the Tech Arena. Subscribe and engage at our website,
thetecharena.net. All content is copyright by the Tech Arena.