Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 3X07: The Trillion-Parameter ML Model with Cerebras Systems
Episode Date: October 19, 2021

Demand for AI compute is growing faster than conventional systems architecture can match, so companies like Cerebras Systems are building massive special-purpose processing units. In this episode, Andy Hock, VP of Product for Cerebras Systems, joins Frederic Van Haren and Stephen Foskett to discuss this new class of hardware. The Cerebras Wafer-Scale Engine (WSE-2) has 850,000 processors on a single chip the size of a dinner plate, along with 40 GB of SRAM and supporting interconnects. But Cerebras also has a software stack that integrates with standard ML frameworks like PyTorch and TensorFlow. Although the trillion-parameter model is a real need for certain applications, platforms need to be flexible to support both massive-scale and more mainstream workloads, and this is a focus for Cerebras as well.

Three Questions

Frederic's Question: How small can ML get? Will we have ML-powered household appliances? Toys? Disposable devices?

Stephen's Question: Will we ever see a Hollywood-style "artificial mind" like Mr. Data or other characters?

Leon Adato, host of the Technically Religious Podcast: I'm curious, what responsibility do you think IT folks have to ensure the things that we build are ethical?

Guests and Hosts

Andy Hock, VP of Product at Cerebras Systems. Connect with Andy on LinkedIn. Follow Cerebras Systems on Twitter at @CerebrasSystems.

Frederic Van Haren, Founder at HighFens Inc., Consultancy & Services. Connect with Frederic on Highfens.com or on Twitter at @FredericVHaren.

Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.

Date: 10/19/2021

Tags: @CerebrasSystems, @SFoskett, @FredericVHaren
Transcript
I'm Stephen Foskett.
I'm Frederic Van Haren.
And this is the Utilizing AI podcast.
Welcome to another episode of Utilizing AI,
the podcast about enterprise applications for machine learning,
deep learning, and other artificial intelligence topics.
In the last few episodes,
we've been talking about all sorts of different aspects of artificial
intelligence, from ethics to practical applications in biology and medicine. But throughout the three
seasons of Utilizing AI, one of the persistent questions that we've brought to our guests,
and indeed, one of our famous three questions at the end of each episode is talking about sort of how big AI is getting.
We've already gotten to the point where billion parameter models don't make anybody blush anymore.
How big is this thing going to get?
And how big can the chips themselves get?
That's the question that we're tackling this time.
Frederic, this is a pretty technical question, and it comes down to a lot of,
well, architectural decisions, doesn't it? Right. I think there are a lot of challenges,
and one of the challenges is that demand for AI compute is growing much faster than the hardware
industry can provide it. So there is clearly a need for faster processing and at least more
efficient processing. And I think there is a lot happening in the markets
that is worthwhile talking about.
And I think this is a great topic
to kind of explain to people what are the options
and how do you build a trillion-parameter AI model?
Yeah, because of course,
there's all those computer science challenges here
in terms of locality of data,
bandwidth to memory, number of processing units, parallel processing, all that stuff.
And so to answer that question, we have somebody who actually knows a thing or two about this.
So I want to bring into the discussion now Andy Hock from Cerebras, which is,
well, basically, this is your thing, right?
Yeah, absolutely. And it's great to be on. Thanks, Stephen and Frederic.
So start off maybe by just telling us, who is Cerebras? What are you guys doing?
What's the thing that you guys are working on?
So Cerebras Systems is a computer systems company. We're based in Silicon Valley out in California.
We were founded in 2015 by an experienced and innovative group of founders across chip design, system engineering, deep learning, software, and frameworks. And what we saw when we looked out at the landscape, as you described it, was
similarly that the demand for AI compute, which is arguably one of the most important computing
workloads of the generation, those demands are growing far faster than traditional processors
and systems could keep up. And so we saw that also as an opportunity. So Cerebras Systems is building
a new type of processor and a new class of computer system that is really built to accelerate
this AI compute, not by a little bit, but by a lot, by orders of magnitude beyond what's possible with existing legacy
general purpose processors and clusters.
It's quite an interesting approach, right?
Looking at 2015, you already had a few major players that were kind of dominating the AI
market. So what gave you the idea in 2015 that you could do a lot better than the established market?
Is that because you had a better understanding of the AI market?
Or is it simply because architecturally you could come up with something that was completely different but more efficient?
Yeah, it's a great question and I think in some respects it's both about architecture
as well as scale and efficiency. So as you noted earlier, what we're facing today is modern
deep learning models are becoming larger and larger and more complex. And we've gotten to
the point where many of those state-of-the-art models, be they for natural language processing
or computer vision or other applications, often take days or weeks or sometimes even months
to train even on large clusters of these legacy processors. And so it's not that those processing solutions don't work.
They're suitable, but they're just not optimal for the specifics of the neural network compute
workload. Neural network computing for deep learning is unique. It is not, say, graphics processing or database management. And so it has some unique requirements that we saw at Cerebras.
So there are really two components, right? There's the hardware stack and then there is the software stack. So how do you approach the software stack? How do you kind of connect your customers in an efficient way so that they can take advantage of new innovation on the software side and take advantage of that on the hardware side?
Yeah, you're absolutely right. And just to
take a small sidestep to what we've built, the heart of our machine is a revolutionary
large processor for deep learning. We call it the wafer scale engine.
It's about 56 times larger than the largest chip built previously.
It's about eight and a half by eight and a half inches on a side.
So think on the order of the size of a dinner plate rather than a traditional chip, which might be on the order of the size of a postage stamp. The wafer scale engine has 850,000 individual and programmable cores that are architected at the core level
from the ground up to handle the sparse linear algebra operations that are common to deep
learning at high efficiency and high performance. So I'll put a pin in that for now, but we've
developed this very high performance, high efficiency engine.
And the question you asked about programming is exactly right. I often joke that most good hardware
companies are actually majority software engineers, because we know as researchers and engineers ourselves, that a really powerful hardware platform can only be
really fully realized and useful if it's easy to program with existing software and frameworks.
And so at Cerebras, we were founded in 2015. We introduced our first generation system in 2019, and we've recently announced a second generation system and a collection of technology innovations that go along with it. We've also built a software stack that connects users directly to this high performance engine and computer system
through standard ML frameworks like TensorFlow and PyTorch. And that's no small investment.
At Cerebras, we are actually majority software engineers, and across our entire team we have
built this compiler stack that is designed to meet data scientists
and ML researchers where they are, at ML frameworks like TensorFlow and PyTorch.
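To make that idea of meeting researchers at the framework level a bit more concrete, here is a minimal, generic sketch: an ordinary PyTorch model and training loop of the kind a vendor compiler stack would ingest and map onto its accelerator. This is plain PyTorch that runs on CPU as written; it is not Cerebras's actual API, and the model and data are toy stand-ins.

```python
# Ordinary PyTorch: the researcher writes the model and loop as usual, and an
# accelerator vendor's compiler stack (not shown) would take this same graph
# and map it onto its hardware. Runs on CPU as-is.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Toy stand-in data; a real workload would use a DataLoader.
inputs = torch.randn(64, 784)
labels = torch.randint(0, 10, (64,))

for step in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()   # the compute-heavy part an accelerator would absorb
    optimizer.step()
```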
I think the software is key in the sense that nowadays people choose the software and then
they pick the hardware. So I think the flow from software to hardware is an interesting one.
So I think that this is really the thing,
you know, every company has to have a hook and clearly the hook for Cerebras is, and we've seen
this a few times at a few different events, you know, recently at Hot Chips, holding that thing
up and saying, like, look at the size of this thing, right? Nobody else is anywhere near that. One question I had, though, is, I guess, which is the cart and which is the horse?
Like, was the idea to make a giant wafer scale chip and then you found ML as an appropriate application for that?
Or was the idea to come up with an ML processor and the clever idea was, oh, man, what if we made the whole chip,
you know, the whole wafer as a chip? Which direction did that go in?
Definitely the latter. It's a great question. We really started from first principles of the
neural network compute workload. We spent a lot of time in our early architecture and design
asking the question, what is deep learning training
and inference really asking of the underlying compute platform?
And let us find a way to build the right compute
solution for that work.
For example, we know that deep learning requires
massive computational resources, lots of sparse linear algebra tensor operations. So we need
a lot of cores. This is why the traditional approach is to throw a cluster of tens or
even hundreds or thousands of GPUs at
the problem. So we know we need lots of compute. But in addition to the compute resources,
neural networks also require a high degree of both fine and coarse-grained memory access
and communication. And so we need a processor that also delivers
alongside massive compute, high bandwidth, memory access,
and high bandwidth, low latency communication.
And we really built the wafer scale engine
around those first principles of the requirements
of neural network computing.
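As a small, generic illustration of what a sparse linear algebra tensor operation looks like at the framework level (this is ordinary CPU-side PyTorch, nothing specific to the wafer scale engine):

```python
# Most of the weight matrix is zero; a sparsity-aware kernel only spends
# compute and memory bandwidth on the non-zero entries.
import torch

dense_weights = torch.randn(1024, 1024)
mask = torch.rand(1024, 1024) > 0.9                  # keep roughly 10% of entries
sparse_weights = (dense_weights * mask).to_sparse()  # COO sparse tensor

activations = torch.randn(1024, 256)
out = torch.sparse.mm(sparse_weights, activations)   # work scales with non-zeros

print(f"non-zero fraction: {sparse_weights._nnz() / dense_weights.numel():.1%}")
```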
Candidly, one of our initial designs that we conceived of was actually an array type of
solution that would have involved multiple small chips. But of course, that type of design,
while appealing potentially from a manufacturing standpoint, has challenges in processor-to-processor communication bandwidth between, say, chiplets in an array design.
And so we were really sort of drawn to wafer-scale integration, despite its challenges, because it offered us the best potential for performance and efficiency in
its work.
Yeah.
And I think that that's important because everybody, well, a lot of people, when they
talk about the wafer scale engine, they focus on the number of processor cores.
And yet you're using that real estate, not just for processors, you're using it also
for SRAM and having RAM, you know, high bandwidth memory
right next to the core so that every core can be kept fed, right? That's exactly right. And that's
a really valuable attribute of our machine. It's not just the processors as you described,
it's that those processors have access to 40 gigabytes of on-chip SRAM in a single clock
cycle. And then as they're executing their work, each one of those individual cores can communicate
its results to adjacent cores over a two-dimensional network on chip. And the data flow traffic pattern between cores and across the entire wafer is fully configurable in software. And so it's not just the compute. It's also the high bandwidth memory access close to compute and the ability to communicate results, say between layers or across the entire wafer in arbitrary communication patterns to support different types of neural
network architectures. Now, this approach, though, is something that really is unique and unusual.
Why is it, like, what were the big challenges in delivering a wafer scale engine, and why is it
that it wasn't used for other applications that can benefit from massively parallel processing?
Because, I mean, it seems like you're breaking down, you know, one of the fundamental challenges
that face architecture today.
I mean, if you look at what Intel and NVIDIA and AMD and Apple and all these other companies
are doing when they're trying to design chips, it all comes down to sort of questions that
you have answered with the wafer scale engine.
Why isn't everyone doing it this way?
And is there something special about ML that makes that possible?
Yeah, that's a great question.
You know, we often think about the traditional approach to deep learning compute and traditional
chip manufacturing. Everybody
in chip manufacturing starts with a large primary circular wafer of silicon. And the traditional
approach is then to take that large circular wafer and make multiple small chips and then cut them out, package them up onto, say, standard motherboards,
connect them over PCIe, put them into many servers. And then when your deep learning job
gets big enough, we spend days, weeks, months of data center engineering time, reconnecting all of those small chips that we cut apart
using physical cables across racks and aisles of the data center.
And so that traditional approach, traditional clusters of CPUs and GPUs, offers
vast computational resources, but it's really challenged by memory bandwidth and communication bandwidth.
And so at a sort of fundamental level, we ask ourselves, well, if what we need is all of those compute resources together, but the traditional clustering approach with small chips is challenged by communication and memory bandwidth, why not just keep all of
the wafer together in the first place? And that leads directly to your question, that the challenges
with doing that are primarily yield and packaging. That is, once we land on this
large wafer scale integrated architecture as the right processing engine for deep learning,
how do we power that?
How do we cool that?
How do we connect it?
And how do we build it into a system
that's easy to deploy and operate and maintain?
And so we had to start with that sort of full systems view
of the solution for deep learning as soon as we landed on
this wafer scale engine architecture. And we have innovations at Cerebras throughout the stack
on how to optimize for yield at the wafer scale and also how to package and power and cool that in a system. And that system, of course,
is our Cerebras CS2 system. That's the chassis for our processing engine that fits into a standard
data center rack. Yeah, I think you said something very interesting earlier. I mean, what I see a lot
is workflows that are relying on different types of accelerators, you know, GPUs, FPGAs, to handle different workflows, and use the software to kind of reconfigure for an optimal calculation of all the parameters.
So you bring up a couple of really good points there, Frederic,
and they come back in some part to software also.
I think one way to think of our CS2 system and the wafer scale engine
is really as a network attached accelerator. Our software takes the onerous, computationally intensive portions of a user's end-to-end training or inference
workload and compiles that training or inference compute graph to our wafer scale engine.
But our software in the meantime is also taking the other components of the user's end-to-end
training or inference workload, such as input data loading and
pre-processing, and putting that onto host CPU resources that are perfectly capable of
handling those standard, say, input data processing or pre-processing tasks.
And then we use those external CPU host resources to pre-process and stream
data down to the CS2, which performs the massive compute workload of training or inference.
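Sketched in generic Python, the host/accelerator split described here looks roughly like the following; the "accelerator step" is just a local stand-in function, since the real compile and transport path is vendor-specific and not modeled.

```python
# Host CPUs load and preprocess data, then stream ready batches to the
# accelerator, which runs the heavy training compute. The accelerator side
# here is a plain local function standing in for the real device.
import torch

def host_side_batches(num_batches, batch_size=64):
    for _ in range(num_batches):
        x = torch.randn(batch_size, 784)           # "load" raw samples
        x = (x - x.mean()) / (x.std() + 1e-6)      # "preprocess" on the host CPU
        y = torch.randint(0, 10, (batch_size,))
        yield x, y                                 # stream to the device

def accelerator_train_step(model, optimizer, loss_fn, batch):
    # Stand-in for the compute-heavy step that would run on the accelerator.
    x, y = batch
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

model = torch.nn.Linear(784, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.CrossEntropyLoss()

for batch in host_side_batches(5):
    accelerator_train_step(model, optimizer, loss_fn, batch)
```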
So in a sense, our solution is already leveraging sort of best of breed compute resources where they're necessary
for a user's workload. And the more customers we talk to, I think that the more and more we hear
about a vision in the future of a very heterogeneous compute cluster to support
different types of AI and non-AI workloads.
And I think in that future world of heterogeneous large-scale compute clusters,
you can think of the CS2
as the sort of high-performance,
centralized, dense node
for large, sparse linear algebra graph computation. So that's a great question, and I
do think that our system in a sense can be a part of those heterogeneous clusters and also used for
both deep learning and non-deep learning work. It seems like others as well are kind of getting
on board with this whole idea of a network attached, you know,
co-processing environment. Certainly, we've seen that with some conventional, fairly conventional
systems. So, I mean, it looks like it would be a server, but it's actually used more as sort of a
network attached offload engine. And, of course, you know, we recently had NVIDIA
talking about their Grace modules,
which sounds like exactly the same idea,
except a totally different execution of that idea.
Like instead of going giant,
they're going little, right?
Since the wafer scale engine
is the thing you're known for,
can you see a situation
where you take this same approach
and go at it with something that's
not a wafer scale engine? That's a good question. So I think I often describe the wafer scale engine
as an 850,000-core sparse linear algebra processor. And one of the advantages, coming back to your question about
the challenges of wafer scale integration, one of the advantages of this architecture for our
processor is that it's really a uniform, homogeneous 2D grid of cores with a local bank of SRAM and a north-south-east-west network on-chip that
connects all of those cores. And so in some sense, that large wafer could logically be
scaled down to, say, a smaller form factor device.
And this uniform design is also one of the things
that helps us yield full wafer scale modules
into our production systems,
because we can tolerate small numbers
of individual point wise defects
that are natural to the manufacturing process.
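A rough back-of-the-envelope illustration of that yield argument, using a simple Poisson defect model; the defect density here is an illustrative assumption, not a real fab figure, and the die area is approximate.

```python
# If a single defect killed the whole wafer-scale die, yield would be
# essentially zero; tolerating and routing around defective cores is what
# makes the uniform grid manufacturable.
import math

die_area_mm2 = 46_000      # wafer scale engine, approximately
defect_density = 0.001     # defects per mm^2 -- illustrative assumption only

expected_defects = die_area_mm2 * defect_density
naive_yield = math.exp(-expected_defects)   # Poisson model: P(zero defects)

print(f"expected defects per wafer: {expected_defects:.0f}")
print(f"yield if any single defect is fatal: {naive_yield:.1e}")
print(f"fraction of 850,000 cores lost instead: {expected_defects / 850_000:.4%}")
```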
But to your question about other form factors
and other packaging solutions,
we have considered smaller form factor designs
for custom or lighter weight or edge deployments.
But right now, what we're really focused on
and where the market is really pulling us
is towards actually even larger computing resources,
clusters of wafer scale engines and clusters of CS2s to help them address this growing body of
very, very large neural networks for AI. That was exactly my next question.
Although it's a big chip, what if somebody outgrows the big chip, right?
How do you cluster?
How do you build those things, right?
Because doing everything within a single chip, it's easy to move things around.
Once you go outside of the box, it's a whole different world.
So how do you see that? Do you see customers buying multiple
CS2s and then you have some kind of a mechanism, you know, let's call it a bus for the lack of a
better word, where you kind of have the ability to exchange data and communication channels,
or do you think that an even bigger chip is the way to go? Yeah, if we had the opportunity to build a bigger chip,
if we were given larger primary silicon wafers, I think we would probably take that opportunity.
But that's actually not how we're approaching this emerging challenge of extraordinary scale
models from the hundreds of billions to trillions or even tens or hundreds
of trillions of parameters.
To do that, we actually recently developed and announced
a portfolio of technologies that we refer to as weight streaming.
And this collection of hardware and software allows us to support training or inference
of extraordinarily large models all on a single device. So with this technology, we can support
up to 120 trillion parameter models on a single CS2 system. But we've also developed the clustering technology directly to your question
so that we can combine the resources of multiple CS2s, multiple wafer scale engines, together and
achieve near linear performance scaling across multiple CS2s, up to 192 CS2s, in fact,
to support training of those very large models in very short periods of time. So imagine,
for example, training GPT-3 in just a day or training a trillion parameter model over, say, a long weekend or something
much larger still. With that host of technologies around weight streaming, we're really addressing
what is a real customer interest in being able to support training large models, say hundreds of
billions of parameters to trillions of parameters
in a reasonable amount of time. And so yeah, we are talking to multiple customers about deploying
clusters of CS2 systems or even next generation systems beyond that for exactly that purpose.
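To see why the weights have to be streamed rather than held on chip, a quick order-of-magnitude calculation helps (2 bytes per parameter assumes fp16 weights only; optimizer state would add several times more).

```python
# Weight memory vs. the 40 GB of on-chip SRAM, ignoring activations and
# optimizer state. Order-of-magnitude only.
BYTES_PER_PARAM = 2        # fp16 weights
ON_CHIP_SRAM_GB = 40

for params in (1e9, 100e9, 1e12, 120e12):
    weight_gb = params * BYTES_PER_PARAM / 1e9
    ratio = weight_gb / ON_CHIP_SRAM_GB
    print(f"{params:.0e} params -> {weight_gb:>11,.0f} GB of weights "
          f"({ratio:,.1f}x the on-chip SRAM)")
```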
So we're looking right now at 100 billion parameter models. And so the trillion parameter model, that's a thing, right? What we've seen in the field over the past dozen years or more is that models with more parameters
deliver greater representational capacity
and better accuracy for downstream tasks.
So bigger model, better accuracy
for increasingly complex set of tasks.
But I will say that it's not just bigger models. We also need more efficient compute,
more flexible compute for what I'll call smarter models, right? Models that
can incorporate sparsity or dynamic, very fine-grained computation so that as models
get bigger, they're not just increasing in capability by brute force scaling, but we're
also developing models that are more parameter efficient.
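As a small, generic example of weight sparsity (nothing Cerebras-specific), here is how magnitude pruning looks with PyTorch's standard pruning utility; a sparsity-aware accelerator can then skip the zeroed weights.

```python
# Zero out 80% of the smallest-magnitude weights in a layer; the remaining
# weights carry the accuracy, and the zeros are work a sparse engine can skip.
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(512, 512)
prune.l1_unstructured(layer, name="weight", amount=0.8)

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity after pruning: {sparsity:.0%}")
```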
So I do think that trillion parameter models are almost inevitable. But I also see a large number of our customers and
research partners very interested in exploring things like dynamic computation and both weight
and activation sparsity to build models that are not just larger, but smarter and more efficient
so that they can achieve the same accuracy on tasks, but with fewer parameters. It's really interesting for me to hear
you in particular talking about that, because the thing is, sometimes companies will sort of
spin lead into gold in a way in terms of marketing. And so if they can't build big,
they might say, oh, well, big is wrong, efficiency is more important. But if you can
build big, and you're still saying more efficient is important, then that to me is a pretty valuable
data point, if you know what I mean. So absolutely, I think that, you know, there is going to be
a point of diminishing return in terms of model size, just like, as you said, like brute force
model size. But at the same time, most of the researchers
that we've spoken with have said
that there probably are going to need to be models
that need to be that big.
Like there's no way around it:
we need to get to that trillion parameter plus level
for certain models, certain tasks.
And those things, well, they're going to need a new kind
of hardware to support them. And that's, I think, what you guys are trying to bring to the table,
right? Spot on. I couldn't have said it better myself. I think we certainly see a push towards
larger models, but we also see a push towards smarter and more efficient models that will
require a compute platform that's able to perform very dynamic
fine-grained computation, handle sparsity, and that's exactly the direction that we're going.
And to your point about large versus small models, I think that's correct as well. We see customers
that are pushing the envelope of parameter size, but we also have customers that are working with
small models. And for those customers, we can actually compile multiple copies of a model
on one of our wafer scale engines
and accelerate their work as well.
And so I think it's not that one size fits all
or that bigger is always better.
I think what we really need
and what we really built at Cerebras
is a flexible platform for not only large scale AI, but for the sort
of broader landscape of AI workloads that users are bringing to life.
Yeah, and I think what I hear a lot in the market is obviously you can get a better or
more accurate model if you use larger models, a larger number of parameters.
But what I also hear at the same time is the time to market, right?
The ability to deliver a model within a certain amount of time, right?
So it's basically saying, yes, we could do billions and trillions of parameters.
However, you know, we don't want to wait four weeks for that,
or we want to get this done in four days. So I see a lot of customers wanting to have that
little knob where they can turn it and say, all right, so that amount of parameters, you know,
how long is it going to take? Do you also see customers kind of asking you that? I guess
they ask you that in a different way. They might ask you for more efficient way of processing, which I guess is a research background myself in algorithms and basic science.
And I think customers come to us for performance, not only to be able to research and develop and deploy an accurate production model sooner, but also to be able to iterate more quickly on their ideas
in the research phase. And once a model is in production, to then be able to retrain
that model at a higher cadence so that the model that they've developed and is in production
is constantly a better reflection or better suited to the changing landscape of data that it's facing. And so I
think from my standpoint and from our partners and customers, what we hear is that performance
in deep learning compute is valuable in that it allows them to iterate more quickly,
not only to build the right model sooner, but to try out more ideas along the way.
And I think that power of iterative research and development and being able to retrain production
models once they've been deployed are really valuable and part of the end-to-end sort of
life cycle story of AI model development that we're powering.
Yeah, it certainly does sound like, you know,
Cerebras is not just a one-trick pony.
It's not just a gimmick.
It's not just look at the size of this, you know,
wafer scale engine.
It really is about empowering, you know,
the biggest and baddest, you know, ML models.
And I appreciate the effort that's gone into not just the chip, but everything around the chip
from the supporting system to the software, of course.
And I'm glad that you brought that up as well, Frederic.
So now's the time in the podcast
where we shift gears a little bit.
We started this tradition last year in season two,
and now we're putting a little spin on it here in season three.
Our guest has not been prepped on the questions we're going to ask them next, and so we're going to get some fun off-the-cuff answers right now.
This season, we're also going to be bringing in a question from a previous podcast guest to ask the third question, and I will, of course, ask Andy if he wants to contribute one
to a future podcast guest as well.
So Frederic, why don't you go first with the three questions?
Sure.
My question is, how small can ML get?
Will we have machine-learning-powered household appliances,
let's say toys or disposable devices?
Oh, wow. That's a great question.
I think the short answer to the latter part of your question
is certainly yes. I mean, we see edge AI applications today
in watches and cell phones and other small lightweight edge devices and applications.
I think there are certainly a set of household appliances and daily life
tools that we use that could benefit from AI-powered personalization and customization
to their uses. That's a good question.
How small can they get?
I'm not quite sure how small they can get,
but I definitely see the potential for proliferation
to many more small daily use consumer devices
in AI in the future.
Well, considering that your entire reason for existence
scooped the other question,
which is how big can ML models get?
We had to ask you that one. So let's shift gears again and get a little bit more abstract. So
a trillion parameter ML model. Well, that sounds a little bit like an artificial mind. Are we ever
going to get to the point where we have sort of a Hollywood style AI, a general purpose artificial person?
Boy, Stephen, we think about the applications of large models a lot, and I will say
I don't know the answer to that question. But what we have seen quite a bit of interest in
recently as models have been getting larger is models that perform multiple
tasks or models that are trained on multiple types of data so that they can become more flexible and
general purpose and so-called zero-shot learners. And so I definitely see a trend in the field towards models that are more capable of a broader set of tasks.
And once again, it's not necessarily just about bigger models.
It's about different data sets and conditional and dynamic computation for models that can understand what they're being presented with and understand which sort of attribute of the
model to leverage to make a decision or recommendation. But I do definitely see
models becoming more and more capable for a broader set of tasks over the next several years.
All right. Well, we're going to take it to another level now. A previous podcast guest was Leon Adato, host of the Technically Religious podcast.
Leon, go ahead with your question.
Hi, my name is Leon Adato.
And as one of the hosts of the podcast, Technically Religious, I thought I would ask something
that has something to do with that area.
I'm curious, what responsibility do you think IT folks have to ensure that the things we
build are ethical?
Leon, I have to say, I think that transparency and ethics
are becoming more and more important in AI.
And it's something that we're paying quite a bit
of attention to at Cerebras.
I think as we work with partners and customers, it's been candidly
inspiring to see what large enterprise organizations are doing as they're pushing
the boundaries of AI. In particular, at Cerebras, we've been working with customers that are using our systems for things like COVID-19 research,
accelerating drug and therapy development for human health problems, climate research,
energy and environment modeling, as well, of course, as a host of industrial applications
across web and social media, automotive and others.
But what we see in particular with our customers is a real majority push towards applications
of AI that are fairly clearly in the broader social and economic good and environmental good of the planet.
And I think we do have a responsibility, and one of our charters at Cerebras is to develop
these systems and do what we can to ensure that they're used in a way that is fair and
directed towards the social good.
I think it's something that as tool builders,
we have to be mindful of as we stay focused on building the next generation compute platforms for AI.
Wow, that was well said.
Thank you.
And Leon, thank you for your question.
And Andy, I'm glad we gave you the opportunity
to address that as well, since we did get
pretty technical in here about bits and bytes and chips.
So thank you so much for joining us here for this podcast and being such an affable guest,
even there with the three questions.
I do look forward to your questions if you want to pose one to a future guest.
And if you, our listeners, want to join in, you can as well.
Just send an email to host at utilizing-ai.com, and we'll record your question for a future
podcast guest.
So Andy, thanks for joining us.
Where can people connect with you and follow your thoughts on enterprise AI and other topics?
Stephen and Frederic, first of all, thanks so much for having me. This was a really enjoyable
discussion. I can imagine we could have carried on even longer, and I really appreciate the
listeners too. For all you listeners out there, you can find me on LinkedIn, Andy Hock with
Cerebras, and feel free to reach out if you're interested. We're always,
of course, hiring for our extraordinary team and also just happy to chat about the future of AI
and computing. What's new with you, Frederic? Yeah, you can find me on LinkedIn and Twitter
as Frederic Van Haren. And you can typically find me talking to customers
about HPC and AI from a consulting
and services perspective on highfens.com.
And as for me, you'll find me at S Foskett
on most social media networks, including on LinkedIn,
where I'm trying to get a little bit more involved.
And of course, you can find me at the Tech Field Day event series. We are going to be planning
another AI Field Day event for next year, but that's still a long way off. But in the meantime,
we're going to be doing a Cloud Field Day event here in another few weeks. So please do join me
for that. If you enjoyed listening to the Utilizing AI podcast, please do give us a subscription on your favorite podcast app.
Maybe a rating or review if you enjoyed it.
And please do share this episode with your friends.
This podcast is brought to you by gestaltit.com,
your home for IT coverage from across the enterprise.
For show notes and more episodes,
go to utilizing-ai.com
or find us on Twitter at utilizing underscore AI.
Thanks for listening,
and we'll see you next time.