In The Arena by TechArena - Predictive Memory: How MEXT Turns Cheap Flash Into DRAM

Starting point is 00:00:00 Welcome to TechArena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein. Now, let's step into the arena. My name's Allyson Klein, and today I am delighted to be joined by a visionary in the tech industry, Gary Smerdon, CEO and founder of MEXT. Gary, why don't you just go ahead and introduce yourself and then a little bit about MEXT, just to kick us off? Great. First of all, thank you. Thank you for having me on. I really appreciate it. So, yeah, Gary Smerdon. I've been in the technology industry for a few years now. And MEXT is a company where we're focused on memory extension. You can think of the name: MEXT, Memory Extension. And I'll get into what led to the founding

Starting point is 00:00:49 of the company, but basically we've seen an opportunity around challenges in memory, and that's what got us started. Now, we met each other a few weeks ago at Synopsys Converge, which was a fantastic conference to meet people at. And everybody was talking about how we deliver more compute power to large-scale challenges. But you have really been focusing on a quieter crisis that's going on within data center and edge environments, which is memory and memory scale. What was the aha moment that made you realize that we were looking at the wrong bottleneck and that we needed to broaden our focus into the memory complex? Yeah. So if you had to pick one moment, it's when we realized, around 2023, three years ago, that DRAM memory had become half of the cost of the server.

Starting point is 00:01:41 Half. And then when we looked at it further, we also realized that not only was it half, but when you looked under the hood, much of this memory was actually being poorly utilized. We saw studies from Meta, from Google, and Azure, which all highlighted that while memory was an expensive resource, as much as half of it or more wasn't being utilized. So put another way, more money was being wasted on DRAM than was being spent on processors from Intel, AMD, and Arm combined. So that was like, okay, this is a problem.

Starting point is 00:02:21 Can we do something about it? And that's what led us to start studying this problem in more detail. That's interesting. One thing that I think about with memory, and I've spent some time in the memory industry, is that this is a really unique time for memory, and it's a time where, you know, a memory crunch could last for a decade across the industry. So can you unpack the demand for memory scale with AI, and why is that hitting data centers so hard? Yeah, so the overall problem with memory, it's kind of been a train wreck that has been taking place over a long period of time.

Starting point is 00:02:53 It didn't just happen in December of this year. It really goes back, I would even say the problem started 50 years ago. You look at DRAM: it was first brought out in 1970 by Intel, and we've seen it in the marketplace essentially unchanged for decades. And what I started to study and look at 15-plus years ago is that memory was coming to the end of its life relative to being able to scale. Moore's Law had been a critical part of that scaling, and that was coming to an end. And so basically, it wasn't going down in cost. We weren't getting devices with more memory on them.

Starting point is 00:03:31 And now, all of a sudden, when there's a new opportunity around AI that's taken a lot of capacity from the industry, the industry hasn't been able to respond, and it has become a big crisis. Now, you've mentioned that memory can now account for up to 90% of a server's cost, which is crazy. Now, I know that the memory industry has worked on things like HBM to help address the low-latency memory requirements for systems. Why hasn't that been enough to solve the capacity demand, or at least unlock this market so that things were a little bit more in balance? Yeah, so actually, HBM is just accelerating the problem. When you look at DRAM, HBM is actually just a specialized form of DRAM to give us high bandwidth, so more performance to really feed the pipelines of GPUs. But the technology is built in the same fabs as mainstream DDR5, the current generation

Starting point is 00:04:28 of memory that's used in PCs and servers. So they use the same fabs. So it's actually taking capacity away from mainstream DRAM. But it's even worse than that. The bits per wafer of HBM are on the order of one third, maybe even one fifth, of DDR. And so every wafer that you are using for HBM, you're taking away from the capacity for standard DRAM in the marketplace at that 3-to-1, 4-to-1, 5-to-1 ratio. Okay. Now, today came a big announcement from you guys. You're announcing a major milestone in what you call predictive memory.

Starting point is 00:05:06 Can you give us just the synopsis of what this software change is and what it represents in terms of capabilities for enterprises? Yeah. So let me hit on a little bit more about what we saw and what led to this. So again, we talked about the memory challenge that was taking place. And then we really set out with three objectives, and if we could achieve those, then we could have something special. The first one was that we needed to increase DRAM utilization.

Starting point is 00:05:34 I mentioned the studies from Meta and Azure and others that said we had poor utilization. So we wanted to have really improved memory utilization, number one. Number two is that we didn't want hardware or software changes. The reason is that anytime we have hardware or software changes, we are slowing adoption, we're increasing risk, but we're also increasing cost. And cost has been one of our biggest focuses from the very beginning. And then the third one is: what if we can do something really special? What if we could bring flash, which is a storage technology, and make it appear as memory to the processor, to the operating system and application?

Starting point is 00:06:14 Now, you might ask, why is that important? Well, it's 50 times cheaper per bit. It's 30 times lower power per bit. There's just one little problem, and that is it's also 500 times slower. So that's what the industry really got caught up on: this is a storage technology. It's not a memory technology. And that's what inspired us to really work the problem. We worked it, and that's where we came up with the foundation that the company is based on, which we call predictive memory. And it's a pretty simple concept. A processor wants to work out of its caches. That's where it's getting the most performance. If the data is not there, it wants to work out of DRAM, and then pages can get cold. And what we do when those pages get cold, we are able to

Starting point is 00:06:59 push those off into flash, into NVMe flash, literally with no impact on performance. Why is that? Because these pages aren't being touched. They're cold. They're not needed. So there's no impact on performance. That's the easy part. The hard part is you get no warning. A page goes from being cold to being hot in one instruction. So a fraction of a nanosecond. All of a sudden, you need those pages. And now the processor is stalled. Everything's come to a stop while you're waiting for those pages to be pulled back from flash. What we came up with is what we call predictive memory. It's an AI engine that is monitoring what's happening in the system and continually training and inferencing, or what we call predictions: it is predicting and

Starting point is 00:07:45 pushing back into DRAM the pages that it believes are going to be needed in the future. So this was a really interesting idea. It inspired us to start to research it. And we found, you know what, we could actually do this. We could predict accurately what was going to be needed in the future and push it back before the application or the operating system ever needed it. And that's what allows us to transparently bring flash into the memory tier. Now, this is super interesting because you're using AI to solve a problem that's created by AI. But let's go a little bit under the covers. How is it actually driving these predictions? And obviously, one of the thoughts that I have is how do you

Starting point is 00:08:24 train this algorithm to really understand what the application will need before it even asks for it? Yeah, so we looked at many different AI models in trying to see what would work and how we could achieve the performance we need. And what we ended up on is actually an ensemble of models, so multiple models. But the workhorse is a transformer-based model. So the same as ChatGPT and all the other fancy LLMs. This transformer-based model is able, in our case, to do something very different, though. It's making predictions in real time. It's also training in real time.

Starting point is 00:08:58 Because when we think about ChatGPT, you hear about this training phase that can take days, weeks, months, and then you do inferencing, typically on different hardware, after that. We are doing continual training. We don't pre-train. We start training right away. And then we do continual training. Why is that important? Because there are thousands of applications out there.

Starting point is 00:09:19 So doing a special training for each one is not practical. But the other thing is that the behavior of an application can change while it is running. So if you don't retrain, if you're pre-trained, you could get stuck in a non-performant way. That doesn't meet the end customer's goal. So we do continual training. And then we make continual inferencing. Again, we call those predictions. And we're doing that all on a single x86 or Arm core.
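To make the idea concrete, here is a minimal sketch of the general technique described above: cold pages are pushed to a slower tier, and a model that trains on every access pulls pages back before they are requested. It is not MEXT's implementation. The transformer-based ensemble is replaced by a deliberately simple first-order Markov predictor, prefetches are assumed to complete in time and cost nothing, and every name here (`OnlineNextPagePredictor`, `TieredMemory`, `count_stalls`) is hypothetical.

```python
from collections import Counter, OrderedDict, defaultdict


class OnlineNextPagePredictor:
    """Toy stand-in for the AI engine: a first-order Markov model.

    Assumption: the real engine is described as a transformer-based
    ensemble; this simple model only illustrates train-as-you-go prediction.
    """

    def __init__(self):
        self.followers = defaultdict(Counter)  # page -> counts of next pages
        self.previous = None

    def observe(self, page):
        # Continual training: every access updates the model right away.
        if self.previous is not None:
            self.followers[self.previous][page] += 1
        self.previous = page

    def predict(self, page):
        # Inference: the most frequent follower seen so far, or None.
        seen = self.followers.get(page)
        return seen.most_common(1)[0][0] if seen else None


class TieredMemory:
    """A small LRU 'DRAM' tier in front of a 'flash' tier holding the rest."""

    def __init__(self, dram_pages, predictor=None):
        self.capacity = dram_pages
        self.predictor = predictor
        self.dram = OrderedDict()  # resident pages, coldest first
        self.stalls = 0            # demand fetches that had to wait on flash

    def _make_resident(self, page):
        self.dram[page] = True
        self.dram.move_to_end(page)
        while len(self.dram) > self.capacity:
            self.dram.popitem(last=False)  # push the coldest page to flash

    def access(self, page):
        if page not in self.dram:
            self.stalls += 1  # a cold page turned hot with no warning
        self._make_resident(page)
        if self.predictor is not None:
            self.predictor.observe(page)
            guess = self.predictor.predict(page)
            if guess is not None:
                self._make_resident(guess)  # pull it back before it is needed


def count_stalls(trace, dram_pages, use_predictor):
    predictor = OnlineNextPagePredictor() if use_predictor else None
    memory = TieredMemory(dram_pages, predictor)
    for page in trace:
        memory.access(page)
    return memory.stalls


if __name__ == "__main__":
    looping_trace = list(range(8)) * 5  # 8 pages scanned in a loop, 5 times
    print("stalls without prediction:", count_stalls(looping_trace, 4, False))
    print("stalls with prediction:   ", count_stalls(looping_trace, 4, True))
```

On this looping trace, which is larger than the simulated DRAM, plain LRU stalls on every access, while the predictor version stalls only while it is still learning the pattern. Real workloads are far less regular, which is presumably why a much stronger model is needed in practice.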

Starting point is 00:09:46 So a lot of innovation there to be able to achieve it. Let me just touch on one other point. Regarding the training times, people go, how long does it take? And we actually begin making very good predictions in a matter of a few seconds. Now, one of the most striking claims in reviewing your technology is that you have OS transparency. How do you actually make the OS and application think that flash is behaving like expensive DRAM without noticing the difference? Yeah, you know, part of the goals that we set out, I talked about no hardware or software modifications. And so to require no software modifications, what that really meant is that we needed to have just a driver, a standard Linux driver. Linux is the operating system that we support today.

Starting point is 00:10:33 And in addition to that, in user space, that's where we run our AI engine. So that's where we're running the AI model, on a separate core. So what the application sees is the combined memory of DRAM plus however much of the flash we have provisioned as memory to the system. So take, for example, a one-terabyte DRAM system: we add flash to it and our predictive memory software, and our default would be to have a two-terabyte instance. But you can also make that smaller if that's what you choose, or you can make it larger. You can make it a three-terabyte or even a four-terabyte system, again, all off that base platform. Now, your data shows that you can actually quadruple memory capacity while halving the cost of the system.

Starting point is 00:11:23 Can you talk a little bit about what that means from an ROI perspective? What have early customer responses been to this? Yeah. So really our focus from the very beginning has been around performance per dollar. And so performance per dollar is just a fraction: performance divided by cost, basically the cost-effectiveness of it. So from a performance standpoint, and this, by the way, is the same exact model that drove VMware virtualization, where VMware set out to, of course, improve utilization of CPUs. We're wanting to increase utilization of memory. So again, what we do

Starting point is 00:11:59 is from a performance standpoint, what we typically see is 0.9, 0.95, 0.99, even 100% of the performance. And from a cost standpoint, we're able to cut that in half or reduce it even more, given the tremendous escalation that has taken place in memory and thus server costs. So the equation, what we typically see from a performance per dollar standpoint, when you do the math on it, is 2x, 2.5x, 3x as common numbers. Now, you've asked about customers and what customers are experiencing. This is what they're experiencing. It's been across a wide range of workloads and use cases, but what they're seeing is that they can install this in a matter of minutes.
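The performance-per-dollar arithmetic quoted above can be written out as a small sketch. The baseline is an all-DRAM server with performance and cost both normalized to 1.0. The function name and the specific cost ratios below 0.5 are illustrative assumptions, not figures from the episode.

```python
def performance_per_dollar_gain(relative_performance, relative_cost):
    """Gain versus a baseline whose performance and cost are both 1.0."""
    if relative_cost <= 0:
        raise ValueError("relative_cost must be positive")
    return relative_performance / relative_cost


if __name__ == "__main__":
    # (fraction of baseline performance, fraction of baseline cost)
    scenarios = [(0.90, 0.50), (0.95, 0.50), (0.99, 0.40), (1.00, 1 / 3)]
    for perf, cost in scenarios:
        gain = performance_per_dollar_gain(perf, cost)
        print(f"perf={perf:.2f} cost={cost:.2f} -> {gain:.2f}x per dollar")
```

With cost cut exactly in half, the gain lands just under 2x unless performance is fully preserved. The 2.5x and 3x figures only follow if cost falls by more than half, which matches the "or reduce it even more" qualifier.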

Starting point is 00:12:43 They can try it out. They can do side-by-side testing, and they're seeing the results that we're talking about. Now, there is another case in which you can actually increase your performance, and that's where you need more memory, because if you don't have enough memory, you're having to shard your data across multiple systems, and a lot of other complexity comes into play. What we are able to do is allow all of that data, everything, to be put on a single system, and thus actually increase

Starting point is 00:13:11 the performance in terms of the time that a job takes. Now, I know that there are going to be some skeptics in the room, so I'll just ask: if you're running a large LLM or a massive vector database, what's the actual performance tradeoff when moving from DRAM to flash in this model? Yeah, there's always "your performance can vary," right? With any type of virtualization technology or any of these other things, if anybody says "guaranteed," then that's just not intellectually honest. But what we're seeing customers experience is what I talked about,

Starting point is 00:13:42 where they're achieving essentially the same level of performance while reducing their costs by this significant amount. And often what that means, by the way, is that they're able to run two times or three times as many jobs simultaneously and thus actually improve their time to get their product to market, or whatever their particular business application is. Again, one of the things that we just encourage customers to do is try it. And with our technology, you can try it on premises.

Starting point is 00:14:13 You can try it in the cloud. We've worked very closely with AWS, with Google Cloud as well. And so we run there all the time, but we also work on the other cloud providers as well. Now, your co-founder, David Reed, fun fact, helped build the Internet. How does that deep systems perspective influence the way MEXT approaches modern data center problems? Yeah, David's a very unique individual. David is one of those people that is truly an expert in everything from deep in the Linux kernel all the way up to hardware systems and even into the applications,

Starting point is 00:14:46 having been an SVP at SAP working on HANA, etc. So he is a true savant across multiple things. And we've had a great privilege to work together in the founding of this company, but we also brought in a lot of other really talented people as well. Carl Waldspurger is our chief scientist. He was employee 60 at VMware. He led innovation there for a decade, has close to 250 patents to his name, and is a well-respected innovation expert. Let's see. Oh, Fred Weber. Fred Weber was our first advisor and has given us great advice through the years, pointing us in the right direction. And he was the CTO of AMD in creating the 64-bit x86, leading that innovation. And so he is another well-respected

Starting point is 00:15:31 expert, and we have a team like that. Everybody wants to say, listen, I have a team of rock stars. We do have a team of rock stars, many of whom are in the Hall of Fame. You know, one question that I have for you: you talked about how you're on-prem, you're in the cloud. We met at Synopsys, so I'm really curious, because the workloads around chip design and simulation tend to be incredibly memory intensive. How do you approach that space? Yeah, so EDA is an amazing use case for us. And let me touch on some of the things that are happening from an industry standpoint, which are driving it and have led to our success there. So number one is that we know everything happening with AI. We know there's this huge push in the industry to get more advanced devices out

Starting point is 00:16:13 at even faster cycles. And while doing that, these devices, of course, are getting more and more complex. And not only are we talking about chips, but we're talking about chiplets. And we're not only talking about chiplets. We're talking about whole systems that all have to be simulated and really optimized simultaneously. And that was a big part of that particular conference, talking about how this integration and testing is to take place across multiple ways. So EDA is historically a memory-intensive workload. But it's been taken to another level of late. And so what we're doing is basically allowing huge systems to be created, allowing for less sharding of the data. I talked about that before. So we don't have to break the device into as many parts,

Starting point is 00:16:58 which allows for a better device. It allows for a device that doesn't have to be optimized in little blocks. It can be optimized at larger levels. So we get better devices. We get them faster because we are removing steps from the sharding of the data standpoint and allowing more workloads to be tested simultaneously. Jensen actually, at that particular show, talked about how AI is causing us and giving us the ability to run more and more simulations, more and more place and route, all these other things that take place from an EDA standpoint, and to do that simultaneously or to do more of those in parallel. And of course, that's what we're doing: allowing these systems to be developed cost-effectively, because compute has become the bottleneck.

Starting point is 00:17:41 It's not the licenses. It's compute that is the bottleneck to running these workloads. And when I say compute, I'm talking about, of course, processors, memory, and the whole computing system. Now, if MEXT scales as planned, what is the ripple effect on the supply chain? And does this take the pressure and constraints off of the major memory fabs at this point? Yeah, so let me go back to a point I hit on before as I answer your question. So I talked about Moore's Law coming to a relative end 15 years ago. And what do I mean by that? This technology of one transistor, one capacitor, which gives us a bit of data, that cell size has not really changed in 15 years.

Starting point is 00:18:24 So go back again to 20 years ago. You'd spend a billion dollars, build a fab, and the fab capacity doubled after two years because a new process technology came out. Moore's Law allowed us to shrink it. We double the capacity. Two years later, we double again. Before you know it, you go 2, 4, 8, 16, even 32x the bit capacity from the same fab. That's what drove the economics from a memory standpoint. But that came to an end.
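The doubling sequence described above is simple compounding, sketched here under the stated assumption of one process shrink every two years. The function name and the `scaling` flag are illustrative only; the flag models the "came to an end" case where capacity stays flat.

```python
def bit_capacity_multiple(years, years_per_shrink=2, scaling=True):
    """Bit capacity of one fab relative to the day it opened."""
    if not scaling:
        return 1  # no shrinks: what you start with is what you end with
    return 2 ** (years // years_per_shrink)


if __name__ == "__main__":
    for years in (2, 4, 6, 8, 10):
        print(years, "years:",
              f"{bit_capacity_multiple(years)}x with scaling,",
              f"{bit_capacity_multiple(years, scaling=False)}x without")
```

Five shrinks over ten years give the 32x figure from the same fab, which is the economic engine that the flat case removes.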

Starting point is 00:18:51 So what happens today? You build a fab. You spend $30 or $50 or $100 billion on building your fab. The bit capacity you start with is what you end with. So the economics have just been shattered when it comes to DRAM. And of course, what's happening in the world of data? Of course, we have more and more data and everything else taking place. And then, of course, we talk about HBM.

Starting point is 00:19:14 And HBM is coming in and actually eating away at the capacity that we have for DRAM and standard DDR5. So all of this has created a problem that is going to be around for a long time. Do I believe that MEXT and our predictive memory software is going to ease the pain? Absolutely. But the industry, and this is going to be something that we are bringing great relief for, needs it badly. And I think we will help in a significant way, but the industry needs this, and desperately, as I've described. Now, looking five years into the future, do you think that this is going to be the predominant model across every server shipped? I do. And I think it'll go beyond servers as well. It's as simple as this, right? Every technology in a computer has gone through major revolutions. The microprocessor from Intel came out a year after DRAM. They came out

Starting point is 00:20:08 with the 4004, a 4-bit single-chip microprocessor. Today's processor looks nothing like that. But all the other parts of the system have changed dramatically. DRAM specifically hasn't. And all these other areas of a computer have tiering. They have different levels of performance. Of course, storage is an easy one to point to, where you can have different levels of performance

Starting point is 00:20:30 and transparent tiering and storage layers. We need transparent tiering of memory layers. This is not a new idea, by the way, to do tiering of memory. Tens of billions of dollars have been spent trying to do this. Famously, Intel, with their phase change memory technology, which they branded as Optane, tried to do this. And so people have tried before. It's just that the combination of hardware technology, in this case flash, and software technology, in the form of predictive memory and AI power, has come together to bring a solution that the industry needs. So yes, it's going to be

Starting point is 00:21:06 there. It's going to be widespread. It's the future of memory. Gary, it's been delightful to have this conversation with you. I'm so excited about what's coming at MEXT. As someone who worked on 3D XPoint memory back at Intel, I'm excited to see a new entrant of technology enter the fray that can help address a challenge that has been longstanding in compute systems. Where could folks find out more about MEXT and engage with your team? Yeah, so we have quite a bit of information on our website, as well as the ability to connect to us directly there. So it's MEXT, M-E-X-T dot A-I. mext.ai.

Starting point is 00:21:44 Thank you so much for being on today. I am sure that our audience is going to be acutely interested in what you and the team are doing. Well, Allyson, it's been a pleasure. Thank you for having me on. And we look forward to engaging with your audience. Thanks for joining TechArena. Subscribe and engage at our website, techarena.ai.

Starting point is 00:22:04 All content is copyright by TechArena.


In this episode of In the Arena, Allyson Klein sits down with Gary Smerdon, CEO and founder of MEXT (Memory Extension), to unpack a brewing crisis in data center memory and the AI-powered software that could fix it.
