Utilizing Tech - 4x15: Enabling the CXL Device Ecosystem with Marvell
Episode Date: February 13, 2023
Bringing CXL to market requires a wide variety of components, and Marvell is a key supplier to datacenter and cloud. This episode of Utilizing CXL features Shalesh Thusoo of Marvell, who discusses the many solutions they are creating to bring CXL to market. Current products are directly connecting memory to CPUs via CXL, but soon we will see products enabling the sharing of memory between hosts to enable new applications.
From Marvell: This podcast contains forward-looking statements within the meaning of the federal securities laws that involve risks and uncertainties. Forward-looking statements include, without limitation, any statement that may predict, forecast, indicate or imply future events or achievements. Actual events or results may differ materially from those contemplated in this podcast. Forward-looking statements speak only as of the date they are made. Listeners are cautioned not to put undue reliance on forward-looking statements, and no person assumes any obligation to update or revise any such forward-looking statements, whether as a result of new information, future events or otherwise.
Hosts:
Stephen Foskett: https://www.twitter.com/SFoskett
Craig Rodgers: https://www.twitter.com/CraigRodgersms
Guest:
Shalesh Thusoo, Vice President CXL Product Development, Marvell: https://www.linkedin.com/in/shaleshthusoo/
Follow Gestalt IT and Utilizing Tech:
Website: https://www.UtilizingTech.com/
Website: https://www.GestaltIT.com/
Twitter: https://www.twitter.com/GestaltIT
LinkedIn: https://www.linkedin.com/company/1789
Tags: #UtilizingCXL #CPU #CXL #Memory @Marvell @GestaltIT @UtilizingTech
Transcript
Welcome to Utilizing Tech, the podcast about emerging technology from Gestalt IT.
This season of Utilizing Tech focuses on CXL, a new technology that promises to revolutionize enterprise computing.
I'm your host, Stephen Foskett, organizer of Tech Field Day and publisher of Gestalt IT.
Joining me today as my co-host is Craig Rodgers.
Hi, Stephen. It's good to be here again.
How have you been taking all of the recent news around CXL?
Yeah, it's pretty exciting.
We now have AMD and Intel on the market
with host platforms that support CXL.
We were very excited to see Intel's Sapphire Rapids launch
on January 10th,
and we're definitely going to be diving into all of that.
But of course it takes more than a platform,
a host platform to deliver CXL.
It's gonna take a lot of other assorted devices,
right Craig?
It does.
Obviously we have a lot of other companies
that need to contribute to CXL for it to properly grow.
And now they're not going to be hindered
by manufacturer engineering sample availability
and they'll have access to the wider market.
We should see an increased pace here on product releases.
Yeah, I would expect so.
It's going to be very, very exciting.
I mean, we've been seeing CXL in demos up and running all last year on the Intel Sapphire Rapids platform.
And now that it's been launched, I imagine that we're going to see a lot more products coming out.
One of the companies that we talked to last year at FMS, at the CXL Forum, and at the OCP Summit was Marvell. And we actually talked to Shalesh Thusoo from Marvell
back in New York at the CXL Forum. But I wanted to invite him back to get a little bit of a deep
dive into what Marvell is doing in CXL. So Shalesh, welcome to the show.
Thank you. Happy to be here. And I'm the VP of engineering running the CXL product line with Marvell.
Yeah. And I know that Marvell is obviously one of those companies that produces, well, a component that's in almost everything.
And so I wasn't at all surprised to see that y'all are producing CXL components as well.
Maybe we can start off by sort of, I know that it's still new. Like I said,
we just have our first host platforms that support it. But what is Marvell going to be making in the
CXL world? Yeah, Marvell sees CXL as a very enabling technology for the next generation
cloud and data center market. And we are focused on creating solutions that will allow our customers
to create, sort of, optimized deployments of CXL memory, CXL pooling, CXL
expansion. So it's going to be a wide range of products coming out in the future from Marvell.
And the first products, then, that we would expect to hear would be hitting the market would be around memory expansion in servers, taking advantage of those platforms?
What we are doing is we are focused on providing solutions that are broader than just expansion, and the first of those products, in the order in which we'll be getting them out,
will be coming out in the next few months.
Very good. We've seen with other companies that there's a lot of development work
on these products being done using FPGA prior to committing to ASIC.
Is that a similar approach that you're taking?
Yes.
So we have actually developed an architecture that will allow us
to scale from a very small utilization of CXL to a very large utilization.
And then we have taken that same architecture and design that we are putting into silicon
into an FPGA platform.
And you'll see we have done multiple public demonstrations on some of the features that
our architecture allows.
And this allows our end customers also to try out the technology
As you know, it's a brand new technology, and the use cases are actually growing every year,
or every month; actually, every few weeks people are talking about a slightly different way of using it.
Our solution space with the FPGA, of course, allows them to experiment, but on the other hand,
even our silicon and ASICs in the
future will allow a bunch of flexibility so that as use cases evolve, we'll be able to adapt
to most of them. But of course, these aren't going to be products that end users are buying
in the market. You guys are going to be enabling other people's products, which is pretty much what
we've seen from Marvell all along. Is that right? That's right. We are focused on providing the silicon that will then be going into
various products that are developed by other users.
And the first products that we've seen in the market are memory expansion. I imagine that
we're going to see a lot of memory expansion. Maybe not what Marvell is going to be working on,
but what do you suspect is going to be coming down the road from CXL in 2023?
Now that we have the host platforms, you know,
what do you think is going to be coming to market this year overall in 2023?
Yeah, I think in the market, as you've seen,
there are people coming out with memory expansion and direct connecting memory
to the processors, which is a very critical technology.
We do need to get that out there in order to get better utilization of the cores that are being put out
by both Intel and AMD, and we're working with both of them to enable our technology for that.
But I see, even though the first products will start with memory expansion,
it will move into more advanced things like sharing memory between processors and processor cores,
better increased utilization of memory while you're also increasing the bandwidth of the
cores going forward.
So in the latter half of '23, as the technology matures, the scale at which CXL will be
used will keep growing.
But the starting point is going to be direct connect memory expansion
from a market perspective.
And what type of performance then are you seeing
from those initial memory expansion units?
Would you compare it to something similar
to accessing RAM through another NUMA node?
Right.
I think first generation of product,
everyone has been talking about it as basically equal to a NUMA node.
And I think that's the right way of starting to think about it.
But we have to get better.
So in order for it to really be deployed at scale,
the performance or latency, let's call it
latency, there's a difference between latency and bandwidth, of course.
From a latency point of view, the market, in our mind, is looking for even
better than a NUMA node.
So we are very focused on latency and performance.
We look at CXL, a lot of people are looking at CXL as saying it's a far memory. Sure,
there's a utilization of far memory around the technology. On the other hand, the closer you can
make the CXL memory to the processor memory, the better. I mean, it's always going to have slightly higher
latency than direct-connect memory. The question is, how can it be somewhere between a NUMA node and direct connect? So I see, as the next generation products start coming out, that the latencies will be getting closer, better than a NUMA node.
And we are focused towards that.
Wow, that's pretty exciting. We talked to Anders from Microsoft, and he was saying that
in testing they were seeing, you know, sort of remote NUMA node performance,
and they were pretty happy with that. But, you know, the idea that it could be even better than
regular system memory on another node is incredible. But of course, it makes sense because there are PCIe links with CXL support
that go directly into each processor complex,
depending on the architecture of the CPU.
And so I can see that that's the case,
that it could indeed be even better
than a remote NUMA node.
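For listeners who want to try this on a CXL-capable box: Linux typically exposes a directly attached CXL memory expander as a CPU-less NUMA node, so ordinary NUMA tooling can already target it. Here is a minimal sketch using libnuma, assuming the expander shows up as the highest-numbered node; check numactl --hardware on your system, since node numbering varies by platform.

```c
/* Minimal sketch: place a buffer on a CXL-attached NUMA node via libnuma.
 * Assumes the expander appears as the highest-numbered (CPU-less) node;
 * confirm with `numactl --hardware`. Build: cc demo.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }
    int cxl_node = numa_max_node();   /* assumption: CXL memory is the last node */
    size_t len = 1UL << 30;           /* 1 GiB */
    void *buf = numa_alloc_onnode(len, cxl_node);
    if (!buf) {
        fprintf(stderr, "allocation on node %d failed\n", cxl_node);
        return 1;
    }
    memset(buf, 0, len);              /* touch pages so they fault in on that node */
    printf("1 GiB resident on NUMA node %d\n", cxl_node);
    numa_free(buf, len);
    return 0;
}
```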
I imagine, of course, that
there's going to be added latency once you do have these memory expansion products. But it seems to
me that there's a software landscape emerging as well that's going to support that. And I imagine
that you all are working closely with that too, to make sure that your products are going to be
supported everywhere?
Yes, we are. So we are working with multiple partners to make sure that we are
sort of addressing the different use cases. There is going to be a use case for far memory that
maybe even a distance greater than NUMA node is still useful. There are applications like that
for memory capacity expansion, for example. I mean, there are definitely applications
that don't care as much about latency
that may be using other technologies today
that are higher latency than a NUMA node.
And CXL is suited for those places also
where you can provide a much larger capacity memory
to a single core.
So we are, as I said, in our mind,
the use cases are very broad and wide.
And so we want to go for applications
that are more general purpose,
that are better than a NUMA node,
and then also applications that are really more targeted
towards large memory deployments,
in which case they can be far memory
and associated with that.
Have you been exposed or heard about some of these applications that are emerging that are going to be able to make use of the sort of memory expansion we're talking about?
What are customers telling you?
So the first focus of customers is to be able to use their expensive processor cores, let's put it that way,
to be able to get performance out of them. There are two aspects of this thing.
As the core counts per socket have increased, the amount of memory that we attach to
that socket has effectively reduced per core, both in how much memory you can put in there and also in the
bandwidth. So yeah, the first deployment is, okay, how do we efficiently use the cores that are already
there from the various host providers?
But really, the other part of it is, how do you effectively use memory?
Because when you look at a rack, I mean, you look at multiple servers in a rack, in a cloud
data center, for example,
you might see that the amount of memory utilization at any given point in time across the rack
is low, relatively low, right?
The memory is not at the right spot.
The cores are in some other spot.
So better utilizing the overall resources at the rack level is where I see
the market starting to move.
And that's where CXL will shine even more and have a much better use case than just directly connecting to one processor.
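The arithmetic behind that utilization point is easy to sketch. Here is a toy model, with made-up numbers, comparing fixed per-server DIMM provisioning against a design with a small local floor per server plus a shared CXL pool sized to the aggregate overflow:

```c
/* Toy model of the rack-level utilization argument: memory stranded in
 * fixed per-server DIMMs vs. a shared CXL pool. All numbers illustrative. */
#include <stdio.h>

int main(void) {
    enum { SERVERS = 16 };
    /* hypothetical instantaneous memory demand per server, in GiB */
    const int demand[SERVERS] = {90, 40, 310, 55, 120, 70, 480, 30,
                                 65, 200, 45, 85, 150, 60, 95, 75};
    const int dimm_per_server = 512;  /* every box sized for its worst case */
    const int local_floor     = 128;  /* pooled design: small local DRAM,
                                         borrow the rest over CXL */
    int total_demand = 0, overflow = 0;
    for (int i = 0; i < SERVERS; i++) {
        total_demand += demand[i];
        if (demand[i] > local_floor)
            overflow += demand[i] - local_floor;  /* served from the pool */
    }
    int fixed_total  = SERVERS * dimm_per_server;
    int pooled_total = SERVERS * local_floor + overflow;

    printf("fixed  : %5d GiB installed, %4d GiB in use (%.0f%% utilized)\n",
           fixed_total, total_demand, 100.0 * total_demand / fixed_total);
    printf("pooled : %5d GiB installed, %4d GiB in use (%.0f%% utilized)\n",
           pooled_total, total_demand, 100.0 * total_demand / pooled_total);
    return 0;
}
```

With these illustrative figures, the fixed design sits around 24% utilized while the pooled design lands above 70%; the actual gain depends on how bursty and uncorrelated the per-server demand really is.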
Yeah, it's actually interesting to hear about that because I think so far much of the discussion has been around memory expansion to meet the needs of applications.
But what you're saying is really true. If we can share memory in a sort of a rack scale
architecture, then we can have a much better, much more efficient use of those resources. So
systems that need more memory now can have it now, and then they can give it up and another system can use it.
That's very, very difficult to do currently
with how servers are architected,
but that'll be something that will be made possible
once CXL extends outside the server
and into this shared memory space.
And I think that that has to be pretty exciting
for everybody in the
industry because it is, you know, it's something we've never been able to do before in terms of,
you know, building this kind of rack scale architecture. And I imagine that that's the
kind of space that you all are pretty excited about as well, because as I said, I mean,
already any kind of device in the rack is using a ton of different
Marvell chips. And you guys will be right there, right? I mean, you'll be in the server, in the
memory expansion, pretty much everywhere with these chips, right? That's correct. And if you
see, you know, the demos that we are showing with our FPGA are at the rack level, right? We have
been running at the rack level for the last few months publicly.
And so, yes, we do want to optimize for both cases,
basically running at the full rack level and how the resources are
optimized there, and then also directly connecting to a host,
because for some applications it's still very useful to be directly connected.
So the initial server platforms support CXL 1.1.
That's what we've heard from both AMD and Intel.
Should buyers be concerned about support
for different CXL versions,
since 1.1 doesn't technically support
any of the things that we're talking about?
Or should they be reassured
that there's going to be compatibility there
as new features and capabilities are brought to market?
Right. So they should be assured that there'll be compatibility.
Of course, as the CXL 2.0 host processors get deployed,
there'll be more features that we can enable around that.
On the other hand, even with the 1.1 CPUs,
you know,
our architecture, as it's modeled in an FPGA, is already doing some of the things that are really for 2.0, right?
Now, there are limitations in that.
I'm not suggesting that it goes without limitations, but it brings 1.1 hosts closer to 2.0 features, even
on their older processors, as the industry matures around that.
And then as 2.0 processors roll out, it'll get even better with the same devices.
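One concrete way to see what a device actually advertises: CXL components identify themselves in PCIe extended configuration space through DVSEC structures carrying the CXL consortium's vendor ID, 0x1E98. The following is a bare-bones sketch that scans a single device's config space via sysfs; reading the full 4 KiB generally requires root, and the device path is one you supply.

```c
/* Bare-bones sketch: scan one PCIe device's extended config space for CXL
 * DVSEC structures (capability ID 0x0023, DVSEC vendor ID 0x1E98).
 * Usage: sudo ./scan /sys/bus/pci/devices/0000:xx:yy.z/config */
#include <stdint.h>
#include <stdio.h>

static uint32_t rd32(const uint8_t *cfg, size_t off) {
    return (uint32_t)cfg[off] | (uint32_t)cfg[off + 1] << 8 |
           (uint32_t)cfg[off + 2] << 16 | (uint32_t)cfg[off + 3] << 24;
}

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pci-config-path>\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    uint8_t cfg[4096] = {0};
    size_t n = fread(cfg, 1, sizeof cfg, f);
    fclose(f);
    if (n <= 0x100) {
        fprintf(stderr, "no extended config space read (need root?)\n");
        return 1;
    }
    /* walk the extended capability list starting at offset 0x100 */
    for (size_t off = 0x100; off && off + 12 <= n;) {
        uint32_t hdr = rd32(cfg, off);
        if (hdr == 0) break;                 /* empty capability list */
        if ((hdr & 0xFFFF) == 0x0023) {      /* DVSEC capability */
            uint16_t vid = rd32(cfg, off + 4) & 0xFFFF;
            uint16_t id  = rd32(cfg, off + 8) & 0xFFFF;
            if (vid == 0x1E98)               /* CXL consortium vendor ID */
                printf("CXL DVSEC id 0x%04x at offset 0x%03zx\n", id, off);
        }
        off = hdr >> 20;                     /* next capability offset */
    }
    return 0;
}
```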
You mentioned that you were building these products to scale with the APIification of everything in this day and age.
What are you doing to help your customers actually scale, manage, grow, monitor, maintain and secure your products then?
Marvell are a huge company that's very well resourced, and I'm sure you're putting in considerable effort there too.
Right. So we, you know, there are two things that we think,
there are multiple things, but there are two important things
that we have to keep in mind as we are scaling.
Security is a given nowadays, and we have to, of course,
make it much, much more secure, especially as you start connecting multiple hosts together with the same device.
Right. And then there's RAS. So there's reliability.
So our devices need to be actually more reliable than connecting a device directly to one host.
Because if we go down, we take down multiple hosts. Right. Not just a single.
So our blast radius is much higher.
So we have done a lot of work within our device and architecture to ensure that our reliability is
equal to or better than, and trying to be much better than, a single-host device, so that we don't take down multiple hosts,
or the probability reduces significantly.
Apart from that, it's all about
how do you actually enable that?
How do you show that it actually works?
So there's, you know, CXL defines a fabric manager,
for example.
So we are going, we are providing a reference
fabric manager, right?
For our customers to play around with it, try it out.
We have, you know, you will see a new release
that we have done about a platform
around an FPGA solution that will also become the ASIC solution, as a reference platform,
so our customers and partners can quickly try it, so that they can create new
ideas on how they deploy it and what their infrastructure should actually look like
around that. There's no substitute for actually
seeing things running and working, and we are already seeing good traction and good ideas
from our end customers once they use our platforms to try different things.
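For a sense of what a fabric manager actually manages: in CXL 2.0, a pooled multi-logical device is carved into logical devices (LDs) that a fabric manager binds to, and unbinds from, host ports. The toy bookkeeping below illustrates only that concept; it is hypothetical, not Marvell's reference fabric manager, and a real one issues the spec-defined FM API commands over a management interface.

```c
/* Hypothetical toy model of fabric-manager bookkeeping: binding the logical
 * devices (LDs) of a pooled CXL memory device to host ports and back. */
#include <stdio.h>

enum { NUM_LDS = 8, UNBOUND = -1 };
static int ld_owner[NUM_LDS];        /* host port owning each LD, or UNBOUND */

static int fm_bind(int ld, int host_port) {
    if (ld < 0 || ld >= NUM_LDS || ld_owner[ld] != UNBOUND)
        return -1;                   /* invalid LD or already bound */
    ld_owner[ld] = host_port;
    return 0;
}

static int fm_unbind(int ld) {
    if (ld < 0 || ld >= NUM_LDS || ld_owner[ld] == UNBOUND)
        return -1;
    ld_owner[ld] = UNBOUND;          /* capacity returns to the pool */
    return 0;
}

int main(void) {
    for (int i = 0; i < NUM_LDS; i++) ld_owner[i] = UNBOUND;
    fm_bind(0, 7);                   /* host 7 gets a slice while demand is high */
    fm_bind(1, 7);
    fm_bind(2, 3);                   /* host 3 gets one slice */
    fm_unbind(1);                    /* host 7's demand drops; slice goes back */
    for (int i = 0; i < NUM_LDS; i++) {
        if (ld_owner[i] == UNBOUND) printf("LD %d -> pool (unbound)\n", i);
        else                        printf("LD %d -> host port %d\n", i, ld_owner[i]);
    }
    return 0;
}
```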
Yeah, that reflects what we've seen, I think, at some of these shows. As you mentioned, I mean,
a lot of the pre-production stuff is being demonstrated now.
And it seems to work pretty well.
I wonder if you can look into your crystal ball.
When do you think that this is going to happen?
When do you think that we're going to get rack scale
and so on?
I mean, you're more connected with this,
with the development process.
You've been in the industry a long time.
You know that it takes time
to bring these things to market. And we just got our first host platform literally in the last
month. When will we start seeing rack-scale memory sharing? And when will we see future versions
of CXL? Right. So the speed at which it will get deployed will partially depend
on the solutions that people are providing, right? Like ourselves, right? So
I think what will happen is that as the market and our customers
see what our solutions do and get comfortable that
they can be deployed at scale,
then it sort of will accelerate when it will show up in the market.
Now, looking at the crystal ball, I would think that the deployment at the mass level,
at a large scale, is probably still about two to three years away. So it's more in the 2025 range
when it gets to a level that is used at a larger scale. Yeah, that seems reasonable and realistic.
I mean, there's going to be quite a bit of time before rack scale comes into place. But that
being said, I would believe that memory expansion
internal to the chassis is probably gonna come this year,
2023. Would you agree with that?
I would say initial deployments
will definitely start showing up this year.
You're seeing demo, you're seeing some silicon in the market
and people are trying it out.
As far as the large-scale deployment, even for that connectivity, we'll
have to see, because people, again, are still sort of getting used to, getting confidence in,
this technology and how it works. And my personal opinion is it'll really be 2024 before you
start seeing wide deployment,
even for the expander starting use case and direct connect.
And then from there, it'll move relatively quickly to 2025, where we can go multi-host. And given the sensitivity, as you mentioned with reliability, the sensitivity of these systems,
my understanding, and I wonder if you've heard this as well, is that the system
vendors, the system integrators, everybody is being very, very cautious before they qualify
this for use. I know that Intel and AMD are doing a ton of testing to make sure that the systems,
the platforms are really ready and that this stuff is really, really going to work.
Is that what you're seeing as well, that people are being very cautious before approving and
qualifying these?
Yeah, that's what I'm seeing.
And as I said, it's not only the reliability and system testing, it's also security.
Now, for the first time, you're taking something that is usually inside a
caged box, right, directly connected, that you can't get access to, and putting it on a serial link, a PCIe
type of link effectively, that has the ability to get out of the box.
So you have to make sure that all your security aspects are covered around that.
So that's the other part of it, why I think it will probably take more than 2023 for real
deployments.
It'll really be 2024.
I think I would fall in line. Well, I've said a couple of times, I have a personal prediction
that we'll see an increase in cadence in terms of releases of PCI Express and now CXL versions,
and I think even by 2025 we could potentially be seeing CXL 2.0-enabled servers.
And that would coincide very well with the bedrock of testing
and adoption from the CXL 1.1 devices.
So it would be great to see if that was the case
when things really started picking up
and products like yours then across multiple hosts were coming into play.
That's about the time that the next generation of server platforms
will be in the market as well, Craig. So I think that's a good thought.
Yeah, agreed.
I guess given all of this, is it too early for people
to be excited about CXL,
or should they be starting to dream about what they can do with this technology when it does come to market?
No, I think it's not too early for people to be excited.
But on the other hand, there's a lot of caution along with the excitement, so it has to be monitored.
One of the things, also, is that there's a lot of new
terminology being thrown out there in the market, and, you know, even
understanding what is expansion versus pooling versus a NUMA node. As I see some of the demos, the
definitions are sort of, you would think, a little vague, right, as far as what people call expansion versus pooling.
And so there is a lot of excitement, but at some point, we got to make sure what are the real products and what are the real use cases that are deployable at scale.
Because, you know, that's what I think people are working through
in the next 12 to 24 months in the market.
Yeah, that's a real good point.
And I'm glad you brought up the question of security as well,
because it's gonna be very interesting to see
where that goes, especially with external memory.
Because as you said,
essentially this is to allow memory access
to escape from the system chassis for the first time ever.
And that could cause some problems
if things aren't properly nailed down.
Right.
So that's why it's been one of the big focuses for us
is also making sure that it's all secure
because we don't see a large-scale deployment
without ensuring the security that comes with it.
Yeah.
And on the security side, you know,
a lot of the larger companies
are going to want this integrated with their SIEM.
You know, they're going to want to keep an eye
and a track on this.
And as Stephen alluded to,
it'll be the first time RAM is sitting outside the chassis.
Yeah.
Okay.
So the way I look at it is that
a CXL type of enabling
technology comes every decade or so in the industry. It's not something that
comes out all the time. And it's really good to see that everyone has now
sort of consolidated around CXL. There's been a lot of
different technologies
over the last, I would say, five years
that have sort of demoed
or people have been going towards
that are similar in nature
or similar capabilities
or functionality of promise.
Now that there's a big consolidation around CXL
with both the host and the devices
and industry as a whole,
I see a lot of innovation
that's going to be enabled
in the coming
one to three years and beyond.
And so I think this is a technology
that for the next decade,
we should be able to see
a lot of innovations around that
coming through.
There'd be a lot to integrate.
If I put my engineer hat on, CXL is amazing.
We get memory expansion.
We get memory outside of hosts soon enough.
If I put my architect hat on,
we have a whole new way of designing solutions.
If I put my product manager hat on,
we need to start thinking soon,
is this going to be a viable product?
Is it gonna be adopted?
There's a whole lot of things to do in a company
before they even think about architecting
any kind of solution.
What's our MVP?
Have you been discussing with any customers
who have already maybe even started that process
who know they're going to adopt it?
They have a need, and CXL addresses that need for them, and they've started the process
with a view to maybe implementing in 2025. You know, early adopters can often
make leaps and bounds ahead of companies that wait.
Yes, we have, as Marvell. You know, we're not ready to announce the exact customers, but we definitely have design wins in order for us to really get our first early adopters.
So we know who our early adopters are, and we're working closely with them to ensure that our silicon meets their requirements at scale.
One of the other things that you have to keep in mind that I just thought about is that so far in the industry,
when new memory technologies show up, all the testing is usually done by the host processor companies, because that was the direct connection.
Now, with these devices we are coming out,
we actually have to do the same type of testing that the processor companies do to the memories
ourselves. And you need a certain scale, a certain capability in your company in order for
that to be enabled. So we are working both with the memory vendors from that perspective,
and then with the host providers, because they are familiar with it, and also, you know, there are slightly different capabilities within the hosts as CXL is evolving.
And then also our end customers who then use the various hosts and the various memories.
And we are sort of the, you know,
the silicon in the middle in some sense now.
And so we have been working with all, you know,
all three types of groups of companies
to ensure that our solution actually works.
So we see that you need to do that
to make a successful business out of it,
as opposed to, you know, something in the lab or something that just shows it working.
So, yes, we are taking the approach of ensuring that when we come out, we have all those things already thought through, and it doesn't take a long time from there to get to full production.
Yeah, absolutely.
And that's a really, really good point.
I hadn't really thought about that.
I really appreciate you bringing that up.
It's a new world for folks like yourselves
to be in that position with memory qualification
and so on.
Well, thank you so much for this conversation.
It's been really, really interesting.
I appreciate it.
As we wrap the episode,
where can people connect with you
and continue the conversation about CXL
and other topics, Shalesh?
Yeah, so we'll send a few links to be published as part of this podcast.
But you can also get to www.marvell.com.
And through that, you'll find a few videos around that.
But we'll have an explicit link being sent out.
Yeah, thanks a lot. And Craig, I think you and I are going to be headed to Tech Field Day
on March 8th. Maybe we'll see some of these CXL companies there.
What else is going on with you? Looking forward to
hopefully seeing some CXL companies there.
I am on LinkedIn as Craig Rodgers, and you can find me on
Twitter at CraigRodgersMS.
For me, you'll find me at SFoskett on most social media networks.
And if you'd like to learn more about that tech field day I mentioned, just look me up and drop me a line.
Thank you for listening to the Utilizing CXL podcast, part of the Utilizing Tech series.
If you enjoyed this discussion, please do subscribe.
You'll find us in your favorite podcast application.
And also, please do give us a rating or a review.
This podcast is brought to you by gestaltit.com,
your home for IT coverage from across the enterprise.
For show notes and more episodes, go to utilizingtech.com
or find us on Twitter at Utilizing Tech.
Thanks for listening, and we'll see you next time.