In The Arena by TechArena - Where AI Inference Hits the Memory Wall
Episode Date: April 28, 2026
ZeroPoint's Nilesh Shah explores why data movement, compression, and memory bandwidth now shape AI inference performance, and where heterogeneous systems and quantum may fit next...
Transcript
Welcome to Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein.
Now, let's step into the arena.
Welcome to In the Arena.
My name is Allyson Klein.
This is a Data Insights episode, so that means Jeniece Wnorowski is with me.
Hey, Jeniece, how are you doing?
Hi, Allyson. I'm doing well.
How are you?
I'm doing great.
It's the first day of spring.
Very exciting for those of us living in the Pacific Northwest, because it means that at some point the rain will end.
This is an exciting episode,
and I just wanted to ask you who you brought with you
and what the topic is for today.
Yeah. So today I brought a friend of mine, Nilesh Shah, who is up to a lot of great things.
You've probably seen him all over if you're on LinkedIn.
Nilesh is the VP of Business Development at ZeroPoint.
So we're going to talk a little bit about what he's up to at ZeroPoint, what he's really working on in terms of engineering systems, and just get his take on AI and how we scale this stuff.
So, Nilesh, welcome to the program.
Thanks for having me, Allyson and Jeniece. This is so much fun; always fun to chat with you.
Now, Nilesh, I see you all the time at conferences, and you're always in the heart of where the industry is focusing its most acute innovation, and you're always on to the next thing. A couple of weeks ago I saw you immersed in quantum, and I know that's a passion you're going to talk about. I know that you work in the memory and storage space as well. But why don't we just start with ZeroPoint? Can you introduce what you're doing there?
ZeroPoint Technologies is a company
based out of Sweden. It's a startup and it's about compression. The company was founded on the
premise that the need for data and memory will explode. And one of the ways to tackle that
challenge is through lossless memory compression. So what I do at ZeroPoint is I drive business
development at ZeroPoint Technologies, bringing and matching the technology pretty much across the spectrum of hyperscaler and chip companies. And ZeroPoint Technologies is an IP licensing company.
So we develop IP that can be integrated into any piece of silicon. It could be an ASIC for AI inference, or it could be a GPU, or it could be a memory controller,
which could be, for example, a CXL controller.
But right now, our key focus is, of course,
AI inference systems where the need for memory and the pain is real and people are feeling it.
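To make lossless memory compression concrete, here is a minimal sketch in Python, assuming nothing about ZeroPoint's actual (proprietary, hardware-based) algorithms: a toy run-length encoder applied to a 64-byte cacheline, the granularity at which hardware compressors typically operate. Real designs use far more sophisticated pattern encodings, but the underlying observation is the same: typical memory content is full of zero bytes and small values, so it compresses losslessly.

```python
# Illustrative sketch only: ZeroPoint's real IP is proprietary hardware;
# this toy run-length encoder just shows why typical memory content
# (lots of zero bytes and small integers) compresses losslessly.

def rle_compress(line: bytes) -> bytes:
    """Naive byte-level run-length encoding: (run_length, value) pairs."""
    out = bytearray()
    i = 0
    while i < len(line):
        j = i
        while j < len(line) and line[j] == line[i] and j - i < 255:
            j += 1
        out += bytes([j - i, line[i]])
        i = j
    return bytes(out)

# A 64-byte cacheline holding two small integers padded with zeros,
# a very common pattern in real heaps and KV caches.
cacheline = (7).to_bytes(8, "little") + (42).to_bytes(8, "little") + bytes(48)
compressed = rle_compress(cacheline)
print(f"{len(cacheline)} B -> {len(compressed)} B "
      f"({len(cacheline) / len(compressed):.1f}x compression)")
```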
Now, Nilesh, given the information you just gave us, at a high level, why is data movement becoming so critical, specifically in modern computing systems?
What's the big deal in your opinion?
The big deal here is what it takes to move data all the way from storage out to, let's say, high-bandwidth memory or LPDDR memory, and then into the silicon itself, into the heart of the silicon, like an SRAM.
If you compare that movement of data to the actual computation that you perform on that data, it's ridiculous. It's almost 10x more expensive in terms of power to move that bit of data than it is to actually run a computation on it.
So from that perspective alone, if you look at all the inference chip innovators out there, look at Cerebras, look at Groq, look at SambaNova, Rebellions, I can rattle off tens of different names, the core innovation of all of these companies is not really about the compute as much as it is about data movement and memory hierarchy. For example, Groq is all about SRAM. With Cerebras, it's all about wafer-scale SRAM. Other companies, like SambaNova, are tiering the memory. So really what that screams out at me is
the pain point is clear. If you want to scale inference, it's all about the memory hierarchy.
That's where ZeroPoint comes in, by compressing that memory and enabling greater bandwidth and also capacity, which is what you need if you're trying to crank out more tokens per second per watt.
And you have to do it efficiently because, like I said, the data movement is the expensive piece compared to the actual compute.
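For a sense of scale on that claim, here is a hedged back-of-envelope calculation. Every energy figure below is an assumption chosen for illustration; published estimates vary widely by process node and memory technology, but the order-of-magnitude gap between moving data and computing on it holds.

```python
# Back-of-envelope: energy to move a byte vs. energy to compute on it.
# All figures are assumed ballparks, not measurements; real values vary
# by process node and memory technology.
ENERGY_PJ_PER_BYTE = {
    "on-chip SRAM": 1.0,   # assumption
    "HBM":          30.0,  # assumption
    "LPDDR":        60.0,  # assumption
}
FMA_PJ = 2.0        # assumed energy of one FP16 multiply-accumulate
BYTES_PER_FMA = 2   # assume one fresh FP16 operand streamed per FMA

for mem, pj_per_byte in ENERGY_PJ_PER_BYTE.items():
    ratio = (BYTES_PER_FMA * pj_per_byte) / FMA_PJ
    print(f"{mem:>12}: data movement costs ~{ratio:.0f}x the compute")
```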
Nilesh, when you think about the fact that data is distributed across data centers,
both in the cloud and on-prem at the edge, and you talk about that data movement,
do you think that we're constructing AI workflows correctly for that management of data?
And how do you see that changing as enterprises start scaling inference into different areas of their business?
I think one key thing is to look at the problem we're trying to solve.
It's really that agentic AI is entering the workflow.
So, Allyson, we were at the Synopsys conference just last week, and before that, I was at DVCon.
So when you think about EDA tools, which require a lot of memory and a lot of data to crank out new chips, what we are finding now, or at least what I heard at these conferences, is that chip designers now want to integrate five, six, maybe even ten different LLMs, all focused on a very specific task, for example, chip verification or error detection, or think of all kinds of RTL verification flows. So it's becoming obvious that you now have these domain-specific inference solutions that are required for performing, let's say, even EDA operations. So at that point, what changes is the way enterprises will need to think about data.
You're going to need the data to design the chip, but then you're going to need this extra intelligence there as well, in order to enable agentic AI to make that process even more efficient.
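As a sketch of the pattern Nilesh describes, the snippet below routes each EDA task to a small, task-specific model. The model names and the call_model helper are hypothetical stand-ins, not real products or APIs; the point is only the dispatch structure of an agentic, multi-model workflow.

```python
# Hypothetical sketch: route each EDA task to a domain-specific model
# instead of one giant general-purpose LLM. All names are made up.

TASK_TO_MODEL = {
    "rtl_verification": "rtl-verify-7b",   # hypothetical model names
    "error_detection":  "lint-triage-3b",
    "testbench_gen":    "tb-gen-7b",
}

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real inference call (e.g., an HTTP request to a
    # serving endpoint); returns a placeholder string here.
    return f"[{model}] response to: {prompt[:40]}..."

def dispatch(task: str, prompt: str) -> str:
    """Pick the specialized model for a task, with a general fallback."""
    model = TASK_TO_MODEL.get(task, "general-8b")
    return call_model(model, prompt)

print(dispatch("rtl_verification", "Check this always block for latches"))
```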
So as data continues to become a challenge, and certainly, as you said, the compute is important, but the data is becoming more important.
As you see new computing paradigms emerge, right, where do you see memory bandwidth becoming a bottleneck?
Or do you see that inherently becoming a bigger problem in the future, and will there be a new change to address it?
Yeah, I think with these agentic workflows, if you break down inference, really take any transformer model, there are two sections to it.
There's a prefill stage and there's a decode stage.
The prefill is pretty compute-intensive, but the decode phase is extremely memory-intensive.
So if you break it down further, it's really the decode that limits your tokens per second.
And if you need instant responsiveness at scale across the enterprise, and I'm not talking about one user interacting with one chat agent, but an entire company of, say, 100,000 people interacting at that scale, across the multiple streams of data that you might have in an enterprise, it's really that decode phase which becomes the bottleneck.
And at GTC, that was pretty evident. If you look at what the keynotes were talking about, it's really breaking down that decode into a heterogeneous architecture.
So it's not all about GPUs anymore.
You need these highly specialized processors that can do the decode phase extremely efficiently.
And that's where, with compression, you can actually get more efficiency.
If you can compress that data by, say, 1.5x or 2x, now we are moving less data.
And at the same time, that increases your bandwidth and the effective capacity that your decode engine actually sees on the silicon.
I think those are some of the trends shaping the industry.
It's going to be heterogeneous compute.
It's got to be pipelines.
Inference is no longer just a single kind of operation; it's got to be a pipeline going through heterogeneous processing nodes.
And all of that needs to be done efficiently.
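A rough roofline-style bound makes the compression argument concrete. During decode, every generated token has to stream the model weights (plus KV cache) from memory, so tokens per second per stream is capped by effective bandwidth divided by bytes read per token. The figures below are assumptions for illustration, and batching, which amortizes weight reads across users, is ignored.

```python
# Decode is typically memory-bandwidth bound: each token must stream the
# model weights (plus KV cache), so roughly
#   tokens/s per stream <= effective_bandwidth / bytes_read_per_token.
# Assumed figures, for illustration only; batching is ignored.
BANDWIDTH_GB_S = 3000   # assumed accelerator memory bandwidth (GB/s)
MODEL_GB = 70           # assumed: 70B-parameter model with 8-bit weights

for ratio in (1.0, 1.5, 2.0):  # 1.0 = uncompressed baseline
    effective_bw = BANDWIDTH_GB_S * ratio
    print(f"{ratio:.1f}x compression -> "
          f"up to ~{effective_bw / MODEL_GB:.0f} tokens/s per stream")
```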
Now, you talk about heterogeneity, and obviously that's been something that the industry is
looking at with all of these different accelerator technologies and how we deliver
brute-force compute to some very challenging workloads. I guess the other question that I have for you, because I do know that you are diving into the quantum space, is: when do conventional computing platforms run out of gas, so that we need to move to alternative architectures like quantum computing? Where do you see that line? Is there a line? And what types of workloads and challenges is quantum uniquely suited to address?
Yeah, I'd say maybe let's talk about where does quantum enter into the picture.
So really the big question is, first of all, is quantum computing required?
What is the ChatGPT moment for quantum computing?
That's the favorite question I like to ask.
Where does, or does, quantum computing intersect the AI data center?
Because the AI data center market is currently measured in the trillions of dollars.
If you look at quantum, it can be measured maybe in the billions of dollars.
So obviously, the immediate attach point that would make sense would be to attach quantum processing units to these AI data centers to efficiently offload some of the work.
So it may not replace the entire, let's say, inferencing, but there might be some efficient search-space algorithms, for example, that could be offloaded to quantum processors.
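One canonical example of such a search-space offload is Grover's algorithm: finding a marked item among N unstructured candidates takes about (pi/4)*sqrt(N) oracle queries on a quantum computer, versus about N/2 expected queries classically. The comparison below counts only queries and deliberately ignores real-world overheads such as error correction and data loading.

```python
# Grover's algorithm: ~(pi/4)*sqrt(N) oracle queries vs. ~N/2 classically.
# Query counts only; error-correction and I/O overheads are ignored.
import math

for n_bits in (20, 30, 40):
    N = 2 ** n_bits
    classical = N / 2
    grover = (math.pi / 4) * math.sqrt(N)
    print(f"N = 2^{n_bits}: classical ~{classical:.1e} queries, "
          f"Grover ~{grover:.1e} ({classical / grover:,.0f}x fewer)")
```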
Yeah, I agree. And I think we're in the early throes of it right now, right?
What is quantum computing? What does it mean? When does this shift take place? Is it next year? Is it five years? Is it 10 years? I'm just curious, from your perspective, and all the different companies you've been working with and learning from and consulting with:
What industries will quantum computing really impact?
So going back all the way to the Intel days, Jeniece, when we worked together, if you remember, we worked on a project with Optane memory technology.
And at that time, this was back several years ago, quantum computers were still at the toy level; they were like five, six qubits.
A qubit is the unit of quantum computing, the way classical computing has bits.
And at that time, the big focus was on simulating quantum computers; for example, a 50-qubit simulation required on the order of 2^50 values in memory.
So that's where we worked on Optane.
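The memory arithmetic behind that is straightforward: a brute-force state-vector simulation of n qubits stores 2^n complex amplitudes. Assuming 16 bytes per amplitude (a double-precision complex number), the requirement explodes with qubit count:

```python
# State-vector simulation of n qubits stores 2**n complex amplitudes.
# Assuming 16 bytes per amplitude (double-precision complex numbers):
def statevector_bytes(n_qubits: int, bytes_per_amplitude: int = 16) -> int:
    return (2 ** n_qubits) * bytes_per_amplitude

for n in (30, 40, 50):
    gib = statevector_bytes(n) / 2**30
    print(f"{n} qubits -> {gib:,.0f} GiB")
# 50 qubits works out to ~16 PiB, which is why dense memory mattered.
```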
And the very first use case actually was with Alán Aspuru-Guzik, who is a leading quantum chemist. At that time he was at Harvard, but now he has huge three-digit-million funding at the University of Toronto to investigate quantum chemistry.
So now, fast forward; even now, the use cases that I hear about are quantum chemistry.
That seems to be a natural fit, along with drug discovery, for the types of processes that quantum computing is designed for.
And because a lot of the mechanics in the quantum chemistry space are quantum in nature, they seem like a good fit.
Now, beyond that, of course, there are other applications which are non-data-center.
I think of space, those kinds of applications where you can use quantum devices. But of course, I'm biased towards the data center, so I keep coming back to what are those use cases where quantum can help in the data center.
I strongly believe the unlocked potential of quantum computing is primarily in the data center.
Of course, there are really interesting use cases for space, for example, sovereign types of capabilities, where quantum effects can help develop secure types of devices that go into space. But I'm looking primarily at the data center and its applications.
So right now, I would say quantum chemistry; then there are maybe some finance applications that I saw, at least some of the banks are deploying early quantum computers, primarily to test out what a flow would look like if you were to have serious quantum computing capabilities.
And then maybe a third use case might be encryption, making even more secure encryption protocols, because we know people are actually hoarding data on the internet today in the expectation that in five years they'll have a quantum computer. So your data might be encrypted today, but if they gather all this data, they can decrypt it five years later. So I think that becomes an interesting use case for quantum: to actually come up with stronger encryption.
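To put one number on that threat model: Shor's algorithm breaks RSA and elliptic-curve key exchange outright, while Grover's algorithm roughly halves the effective strength of a symmetric key, which is part of why post-quantum guidance pairs new key-exchange schemes (for example, NIST's ML-KEM) with larger symmetric keys. A quick illustration of the Grover effect:

```python
# Grover's algorithm turns a 2**n brute-force key search into roughly
# 2**(n/2) quantum queries, halving effective symmetric security bits.
for key_bits in (128, 192, 256):
    print(f"AES-{key_bits}: ~2^{key_bits} classical work -> "
          f"~2^{key_bits // 2} quantum queries (Grover)")
```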
Now, Nilesh, you actually said something really interesting earlier, that it's about the data. And I want to go back to conventional computing for a second. I know that you're working on the lunatic fringe of what can be done with memory and storage. When you think about the fact that
we've got leading-edge SSD technology, like what Solidigm is delivering to the market, we have standard DRAM, we have HBM, we have the SRAM that you talked about. Do you see anything on the horizon
that will break through some of the bottlenecks that we're seeing and introduce a new way to manage
memory and storage, and where in the industry are these things being discussed?
I think a couple of things. We're seeing a lot of innovation, or at least funding going into new innovation, simply because memory, and actually even storage, are sold out. So people are looking for denser solutions for storage and denser solutions for memory. I think that's where a lot of the investment is going: alternative technologies to DRAM. DRAM was invented several decades ago and hasn't fundamentally changed that much. So people are looking at new technologies, think of new persistent types of memory that may be denser, may have better power characteristics, and also may not be limited by some of the manufacturing limitations that are now showing up because of the heavy demands from AI. I think there are maybe three different vectors. One is looking at alternative memory technologies. The second vector is looking at alternative interfaces, so how you get those memories to communicate with the compute engines. And then I
would say entirely new paradigms in terms of networking, because now the conversation is not about a single chip, or even a server, or even a rack. I would argue now it's really about data center scale. So what I'm seeing companies do is say: I don't care what you put in your memory chip or your storage chip, because the whole thing moves as one unit, like one data center.
Or I would say, what is the new unit or the metric for compute? It's going to be megawatts.
So I give you one megawatt. You put whatever you want in it, storage, memory, compute, and then
you crank out these AI tokens.
So I think that's how a lot of memory and storage innovation will be driven.
And a lot of the things we know today, like form factors, say an SSD form factor on a PCIe card, or DRAM going over LPDDR or HBM, might give way to some completely new, dramatic technologies where you'll be thinking at data center scale rather than chip, server, or even rack scale.
I think those are the three vectors.
I see a lot of innovation or at least startups looking into these spaces to come out with a breakthrough.
And as these technologies evolve over time, what do you think are people's biggest misconceptions about them?
And what will it take to make next generation compute systems actually usable in practice?
Yeah, yeah.
I think a couple of things.
One is, like you mentioned, misconceptions.
I think one misconception is that in order to produce more tokens, you need more power. But we don't have an infinite supply of power.
So that's one misconception: oh, well, AI is going to grow, so let's just keep putting in gigawatts of power.
So I expect a surprising breakthrough: someone will come up with an entirely new kind of physics which will break that linear assumption that people have, that going from 100 LLMs to a million, or going from a million users to 100 million, means just multiplying the megawatts of power.
I think that may not be valid in the future.
Something will have to fundamentally change in the physics
in order to break that assumption.
Nilesh, final question for you.
When you look at Accelerated next week, you can look at the companies gathered there, from foundational component players to infrastructure providers to neoclouds and service providers and others, all prognosticating on what's happening next in AI. What do you expect to emerge from that conference? And how do you articulate your own vision for where AI will go by the end of 2026?
Sure. This is probably going to be a fun conference. Like you said, neoclouds, hyperscalers, new chip companies, even quantum computing companies, they'll all be in the same place and having the same conversation.
So I think a key thing that I hope will emerge from it is clarity, actually. And by the way, there will be enterprise end users at this conference, and the storage innovators like Solidigm and others will be there. So I'm expecting that people walk away with a better vision of what it takes, what are the components and ingredients that go into making an AI system, and how is that different from just building your traditional data centers?
We've been doing data centers for years.
So what's different now?
I don't think everyone clearly appreciates that.
Besides, yeah, you might have heard of GPUs.
But there's a lot more like we discussed.
There's memory, there's storage.
And these things need to work in synchrony.
Otherwise, you're not going to have an optimal system.
And then even the compute itself is getting disaggregated.
It's getting heterogeneous in nature.
So I think people will walk away with that understanding.
And the fact that not all tokens are created equal: a token for a text prompt is not the same as a token for, let's say, video generation. So what does it take to support each of these different types of tokens and different use cases? Not all models are created equal either.
Now we're seeing that for enterprise, you may not need these gigantic hundreds-of-billions-of-parameter models, but you might need tens of really small models that are highly efficient for the laser-focused tasks you need in the enterprise.
So I think all of these conversations will emerge.
I'm excited about the conversations.
A lot of people who are listening may not be going, though, Nilesh.
So hopefully those sessions will be recorded.
But where can folks learn more about you, and maybe reach out to you and get more information just in general?
I think one great place to start is, of course, you can reach out to me at ZeroPoint Technologies via email,
and then also via LinkedIn.
And then, of course, if you're at these conferences, I love chatting with people and learning from them.
So, yeah, that's always an option.
But yeah, I hope people reach out and exchange ideas.
Awesome.
Thank you so much.
It's always a pleasure, Nilesh, to have you on the show.
I'm going to be at Accelerated as well.
The Tech Arena team will be there.
I know, Jeniece, you're going to be there as well.
Should be a fantastic conference in New York.
But with that, I just want to say this is another episode of Data Insights,
and it was a pleasure to have both of you with me on this journey.
Thank you, Allyson.
Thank you, Nilesh.
Thanks, everyone.
This was fun.
Thanks for joining Tech Arena. Subscribe and engage at our website, thetecharena.net.
All content is copyright by Tech Arena.
