Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 4x4: Implementing CXL with George Apostol of Elastics.cloud
Episode Date: November 14, 2022

Experts call CXL the lifeblood of composable datacenter infrastructure, and for good reason. It has unlocked tremendous possibilities and is reshaping server architecture for good. Over 190 companies, including some of the biggest names in the industry, are involved, and as new versions of CXL roll out, the technology is clearly taking steps toward maturity. But the market needs more CXL-based technologies to kick-start its evolution. In this episode of Utilizing CXL, hosts Stephen Foskett and Craig Rodgers join George Apostol, Founder and CEO of Elastics.cloud, to talk about this transition, the need for CXL solutions, and what Elastics.cloud is bringing to market.

Hosts: Stephen Foskett: https://www.twitter.com/SFoskett Craig Rodgers: https://www.twitter.com/CraigRodgersms

Guest: George Apostol, Cofounder and CEO, Elastics.cloud. Connect on LinkedIn: https://www.linkedin.com/in/geapostol/

Follow Gestalt IT and Utilizing Tech Website: https://www.UtilizingTech.com/ Website: https://www.GestaltIT.com/ Twitter: https://www.twitter.com/GestaltIT LinkedIn: https://www.linkedin.com/company/1789
Transcript
Welcome to Utilizing Tech, the podcast about emerging technology from Gestalt IT.
This season, we're focusing on CXL, a new technology that promises to revolutionize server architecture.
I'm your host, Stephen Foskett, organizer of Tech Field Day and publisher of Gestalt IT.
On this episode, I'm joined by co-host Craig Rodgers. Welcome, Craig.
Thank you, Stephen. I'm Craig Rodgers. You can find me at @CraigRodgersMS on Twitter.
And we are here to talk about CXL and how it's going to transform and change the way we interface with components in the future.
So, Craig, you and I were both part of the CXL forum recently at OCP Summit.
And during that, we saw a really great explosion of technology,
a great amount of support from various different companies.
We also saw the progress that this technology has made
from the very basic initial products,
which are rolling out kind of now as you're hearing this,
but also where it's going in the future.
We heard from the CXL Consortium. We heard from the PCI SIG. We learned about the next versions of CXL.
And we saw what this all promises. But in order to get there, we're going to need some new
technology, right? Indeed. The equipment we need to interface
with these components in a different way
simply isn't on the market right now,
and that has created a lot of opportunities.
Yep, and that's why in this episode,
we decided to invite on one of the speakers
from the CXL forum, and frankly,
a technology pioneer in his own right,
George Apostol from elastics.cloud.
Welcome, George.
Thank you. It's nice to be here.
So tell us a little bit about yourself and your background and how you got into this technology.
Yeah, so back maybe 20 years ago, I was the vice president of engineering at a company
called PLX Technology. And at PLX, we pioneered PCI Express switches
that are pretty much in the market today.
So from a career standpoint,
I've spent a lot of time designing these systems
and designing systems connected to various kinds of components
that have sort of transformed throughout the years
in terms of not only their functionality,
but also performance.
And so as the performance of the compute elements and the IO elements has gone up, the need for greater performance at the system level has become apparent.
And now PCI Express is not serving the need anymore, which is the impetus for CXL and why you see such a
great adoption of it, because, you know, the need for better performance and better utilization, right, has become very, very key in the data center space today.
I actually caught your presentation at the CXL Forum. It was very interesting, the way you have taken a system-on-a-chip approach to
create your products. Is it fair to say that's going to be the bedrock, you know, the foundation
for your CXL switching products moving forward? Yeah, because as we looked at it, you know,
as Stephen said, we've started, and the CXL Consortium will also say, right, it's kind of a crawl, walk, run technology that they're looking at. A lot of people are doing the crawling, but we're looking ahead to see, okay, when we want to run, what is it going to take to do that, right? And we believe being able to not just
connect the devices together but to be able to control and manage, to reconfigure, to compose,
all of that is going to take more than just enabling the connectivity.
So we're building an SoC to be able to do all of that.
And our SoC just happens to have a 256-lane switch on it.
A happy coincidence.
It's interesting as well that, you know, CXL 1.1 will be coming out with Sapphire Rapids, and our initial gains in terms of operational efficiencies are all going to be around RAM and memory pooling. You know, it's the biggest gain we'll be able to make in terms of efficiency. More than 50% of a server's cost right now is RAM.
And if that RAM isn't being utilized completely, it's wasted money.
You know, it's negative on the TCO of your overall platform.
But I think you're working on stuff currently for CXL version 3, which is even going to allow future components to be integrated with GPUs,
storage, AI modules, et cetera.
Can you tell us any more about that?
Yeah.
So as we started to look at this technology, we started looking at what is the evolution
of composability, as I call it, right?
So you're correct. Right now everyone's focus is on memory and RAM because of the cost issues. But once you've developed a shared, pooled, disaggregated RAM
solution, right, that works, the next step is going to be: how do you do that in a tiered memory fashion? Because you can't put everything
in DRAM. And so we're looking at what are the different, you know, mid-tier and then longer
tier technologies around memory and storage that can be created, you know, in order to, again,
get more efficiency on the cost side.
So after you've got this pool of DRAM, you may have a pool of some mid-level type of
memory, depending upon what it is, and going all the way to SSD.
So now you have this ability to share all of this data to be able to get efficient on
where your large data sets get put within that tiered structure.
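As a rough illustration of that placement idea, here is a minimal sketch of a tier-selection policy. All tier names, capacities, latencies, and prices are hypothetical, and this is a conceptual example rather than anything Elastics.cloud has described:

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    capacity_gb: int     # hypothetical pool size
    latency_ns: int      # rough order-of-magnitude access latency
    cost_per_gb: float   # illustrative $/GB

# Illustrative tiers: local DRAM, CXL-pooled DRAM, and SSD.
TIERS = [
    Tier("local-dram", 512, 100, 4.00),
    Tier("cxl-dram", 4096, 300, 3.00),
    Tier("ssd", 65536, 100_000, 0.10),
]

def place(dataset_gb: int, max_latency_ns: int) -> Tier:
    """Pick the cheapest tier that fits the data set and meets the latency bound."""
    candidates = [t for t in TIERS
                  if t.capacity_gb >= dataset_gb and t.latency_ns <= max_latency_ns]
    if not candidates:
        raise ValueError("no tier satisfies the request")
    return min(candidates, key=lambda t: t.cost_per_gb)

print(place(1024, 1_000).name)  # a 1 TB working set with a 1 us bound -> "cxl-dram"
```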
Then after that, now you've got these heterogeneous computes that are going to be used in order to be able to
process those workloads more efficiently.
And then once you have that, then you've got to figure out how do you scale that?
How do you scale that in the box?
How do you scale that in the rack and rack to rack?
So we believe that CXL is going to be able to scale easily within the box, with challenges in the rack and challenges rack to rack, but that's going to be the scope of that.
When you start going rack to cluster and cluster to cluster, that's where Ethernet lives and it's
going to live forever. So we believe that CXL, just like PCI Express and Ethernet, are going to
coexist probably for the next 20 years.
So as these things evolve, what we're looking at is how do you create these pooled resources?
And then how do you execute on true composability?
Where a workload comes in, you can specify the compute, memory, storage, networking that's required for that
workload, compose a virtual server, execute that workload, put the resources back and do the next
one, right? And do that at a speed that is unheard of today, right? Literally on the order of
microseconds, we believe ultimately we'll be able to do that. So to do all these pieces, right, you're going to need to have intelligence where these devices
are being connected.
You can't have a single point of control because the resources are too big to do that.
You'll just create more bottlenecks.
And so this problem has to get sort of disaggregated in the same way that the components are disaggregated. How they are managed also has to be disaggregated, right? So these are the things that we're looking ahead at and trying to figure out: how are these
next generation composable infrastructures, composable architectures going to work?
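To make that compose/execute/release loop concrete, here is a minimal, hypothetical sketch of a pool allocator that carves out a virtual server per workload and returns the resources afterwards. None of these names or structures come from Elastics.cloud; it is only a conceptual illustration:

```python
from dataclasses import dataclass

@dataclass
class Resources:
    cpus: int
    memory_gb: int
    gpus: int

class Composer:
    """Toy composer over a shared, disaggregated resource pool."""

    def __init__(self, pool: Resources):
        self.pool = pool

    def compose(self, need: Resources) -> Resources:
        # Carve resources out of the shared pool for one workload.
        if (need.cpus > self.pool.cpus or need.memory_gb > self.pool.memory_gb
                or need.gpus > self.pool.gpus):
            raise RuntimeError("insufficient pooled resources")
        self.pool.cpus -= need.cpus
        self.pool.memory_gb -= need.memory_gb
        self.pool.gpus -= need.gpus
        return need

    def release(self, vs: Resources) -> None:
        # Put the resources back for the next workload.
        self.pool.cpus += vs.cpus
        self.pool.memory_gb += vs.memory_gb
        self.pool.gpus += vs.gpus

composer = Composer(Resources(cpus=256, memory_gb=8192, gpus=16))
vs = composer.compose(Resources(cpus=32, memory_gb=1024, gpus=2))
# ... execute the workload on the composed virtual server ...
composer.release(vs)
```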
Yeah, that's really the question, isn't it? Because it seems to me, especially after
attending the CXL forum and seeing what is currently being developed by the server and memory manufacturers,
that the concept of CXL-based tiered memory is not really in question.
I mean, these things are going to be delivered.
There's going to be multiple providers of the physical components.
There's going to also be support for it in software.
And that's all being worked on.
The question is, I guess, long term, where does this go? I want to zoom in there, George, on one of the things that you talked about, maybe we can talk about the rest too. But let's start with this
whole world of tiered memory and what those systems look like. So a system that has a CXL-based memory expander in the short term, in 2023, is going to run
some kind of software that allows the application to access that expanded memory that is off
the memory bus.
And then you'll have basically a server with more memory than you normally would, and then
that can be used for some kind of big memory application. But going forward, once we kind of get beyond that, this whole world of tiered memory, I think
this may be eye-opening to listeners, it may be eye-opening to others, but it really ought not to
be, because already processors have cache memory, they have DRAM. As I said, now they're going to have CXL memory expanders.
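One concrete detail worth adding here (not from the episode itself): Linux typically surfaces a CXL Type 3 memory expander as a CPU-less NUMA node, so a hedged sketch of application-level access could go through libnuma via ctypes. The node number is hypothetical for this machine:

```python
import ctypes

# Bind to the system's libnuma; assumes numactl/libnuma is installed.
libnuma = ctypes.CDLL("libnuma.so.1")
libnuma.numa_alloc_onnode.restype = ctypes.c_void_p
libnuma.numa_alloc_onnode.argtypes = [ctypes.c_size_t, ctypes.c_int]
libnuma.numa_free.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

CXL_NODE = 2      # hypothetical: the expander showing up as CPU-less node 2
SIZE = 1 << 30    # 1 GiB

if libnuma.numa_available() < 0:
    raise RuntimeError("NUMA is not supported on this system")

buf = libnuma.numa_alloc_onnode(SIZE, CXL_NODE)  # pages come from the CXL node
if not buf:
    raise MemoryError("allocation on the CXL node failed")
# ... use the buffer, then hand it back ...
libnuma.numa_free(buf, SIZE)
```

The same placement is possible without code changes by launching a process under `numactl --membind=2`.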
Many of the people listening may be familiar with Optane persistent memory modules that gave you tiered hierarchical memory in the third generation Ice Lake systems from Intel.
And those things are going to continue in the next generation and beyond that, thanks to CXL.
What do these systems look like?
I mean, what do you think that a big memory system is going to really look like short term?
Well, so, you know, one of the things that we see immediately: of course, we're developing along with the spec as things go along. So today, as an example, we have an
FPGA card that connects to a CXL slot in a server, and that FPGA card has memory on it. So that's
sort of the basic expanding memory within the server.
Right. And even just with that, we can see a significant amount of performance improvement for these workloads and databases that are much bigger than what can fit into the available memory space.
So, you know, we've run a Redis database that basically is very full as we start to do queries, and it starts to swap. Today it swaps into and out of SSD.
So, you know, we ran that test, and then we said, okay, instead of swapping out of SSD, let's swap directly to our FPGA memory connected through CXL. And mind you, it's an FPGA that's not as fast as what an
ASIC does. But even with just that, we saw a 20x improvement across the board in bandwidth and
latency and the number of operations per second, simply because the access time for that is orders
of magnitude faster than what an SSD is, right? And so this is why we see these applications that are coming up.
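As a hedged sketch of the kind of before/after measurement being described, the harness below (hypothetical keys and connection details; it does not reproduce the 20x FPGA result) could be run once with the overflow tier on SSD swap and again with it on CXL-attached memory:

```python
import time
import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379)

N = 100_000  # scale up until the data set exceeds available DRAM
for i in range(N):
    r.set(f"key:{i}", "x" * 1024)  # ~1 KB values

start = time.perf_counter()
reads = 0
for i in range(0, N, 10):  # sampled reads that push into the overflow tier
    r.get(f"key:{i}")
    reads += 1
elapsed = time.perf_counter() - start

print(f"{reads / elapsed:,.0f} ops/s, {elapsed / reads * 1e6:.1f} us/op")
```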
These applications have varying performance tiers required for them.
If you look at autonomous driving as an example, if you're driving and you're updating the
map while you're doing that, while you're driving, then that's sort of critical
data, right? But if you're parked at a light and not really doing anything, you don't need to have
the updated information as quickly. And when you look at these 12K cameras that are all
looking at this data, it becomes very important
how you're going to store that, how you're going to use that, and how those updates are going to
happen. Same thing with AR, VR, same thing with Web 3.0, right? There is a tiered structure, and this is
why, you know, Microsoft and the Linux community are all looking at how do they add semantics,
right, into the operating systems
so that you can use these tiered types of memory environments. And again, it's all about optimizing
cost for the system that you're creating and the workloads that you want to process.
I think Optane filled a very good gap there. You know, you mentioned multiple orders of magnitude between NVMe and system RAM.
Optane was slap bang in the middle. It was a perfect middle tier in terms of latency there.
The potential applications for increased memory in a single server are huge across the board.
And it's great that it's the first problem being addressed.
But moving on to the longer term, what do you think the next wave of devices coming onto the CXL bus will be?
Well, I mean, I think if you look at the manufacturers
of these components across the board,
they're all transforming to a CXL interface today. And I think this is going to depend on the systems designers as well as the end users as to what they're going to require from a performance and cost standpoint in order to dial the right solutions in.
And do you think the server manufacturers are going to be able to increase their cadence
in terms of how often they upgrade to the next PCI Express or CXL bus now?
There was a huge gap between PCI Express 3 and 4, and now 5 is around the corner.
In order to get to CXL 3, we need PCIe 6 and 7 to hit the market.
Right.
So again, not necessarily, right?
The way the spec is being created is you can have CXL 3.0 functionality at Gen 5 speeds,
right? So it just depends on where the systems guys want to intersect,
you know, the various parts of the spec.
And I believe that a lot of that functionality will be done
in a Gen 5 environment because going to Gen 6 speeds
is another sort of change architecturally as well as electrically, right?
And so there's some work that has to get done there.
But I think right now,
a lot of people are already comfortable
with Gen 5 speeds.
It's interesting then that you're saying,
you know, as long as we're willing to accept
the performance hit in effect
from not moving to six or seven quickly,
we can have that level of
functionality. And your solution then must have a software component, an orchestration component, that controls this allocation of resources. What challenges have you met there?
Well, so
one of the things that we want to make sure of, because we've seen this with other technologies like InfiniBand, for example, right, which is a great technology,
but it was a heavy lift in order to implement it, especially on the software side, right? And so
what we want to ensure is that, you know, the last few decades of work that's been done
on network orchestration, service
orchestration, resource orchestration, all the way up to Kubernetes and the management of
workloads, we want to make sure that we're not changing any of that paradigm, right? And so
what we have in our chip is what we call our resource manager, which is responsible for the locality of devices
that are connected to our device, right?
And then being able to provide that information to the upper layers of the software, depending
upon where they're needed.
And there are varying de facto standards.
I wouldn't say that there is a standard. I think there are things that people use that we are working with to be compatible with, so that they'll seamlessly be able to access, you know, what we're doing inside of our chip in order to manage those resources. There is a big software component, and the things that we're going to be able to do,
because we have all of the knowledge of what's connected to us and how the traffic is flowing,
I think is going to be eye-opening from a server architecture perspective as well.
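A toy sketch of the kind of locality inventory such a resource manager might publish up the stack. Every name and field here is hypothetical, not Elastics.cloud's actual interface:

```python
import json

# What a switch-resident resource manager might report about attached devices.
inventory = {
    "switch_id": "sw0",
    "devices": [
        {"port": 0, "type": "cpu-host", "host": "node-a"},
        {"port": 1, "type": "memory", "capacity_gb": 512, "media": "ddr5"},
        {"port": 2, "type": "memory", "capacity_gb": 1024, "media": "ddr4"},
        {"port": 3, "type": "accelerator", "kind": "gpu"},
    ],
}

def devices_of(kind: str) -> list[dict]:
    """What an orchestrator (say, a Kubernetes device plugin) might query."""
    return [d for d in inventory["devices"] if d["type"] == kind]

print(json.dumps(devices_of("memory"), indent=2))
```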
One of the things I tried to do when I was at Samsung, previous to starting this company,
was to be able to characterize
sort of where all the traffic was going in a server.
That seemingly simple task was not simple at all.
There's a really big challenge to do that
because the visibility wasn't there, right?
But now if I have this intelligent switch
to which all the components are connected,
I have to count the packets anyway
to make sure they're going in and out, and with that I can, you know, gather information on traffic flows, how things are going, right? This is the same way Ethernet did it, and this is how QoS was born, right, from an Ethernet standpoint. So, you know, we're taking those lessons learned and saying, how can we move that to managing the flow of data at the nanosecond level, right? Moving data around is a significant part of the performance bottleneck
that we have today. If we have a shared pool of memory where all the devices that need that data
are accessing that same device, then we're going to minimize the number of buffer transforms,
the number of copies, right? And that in itself is going to create significantly better overall
system performance. And so, you know, for us, it's really about optimizing that system from a hardware connectivity
perspective, but also on how the software manages those resources in order to be able
to get the data to where the compute can use it and operate on it.
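The packet-counting idea reduces to per-port flow counters that double as telemetry. A minimal sketch, with an entirely hypothetical structure rather than real switch firmware:

```python
from collections import defaultdict

# (source port, destination port) -> bytes forwarded
flow_bytes: defaultdict[tuple[int, int], int] = defaultdict(int)

def on_packet(src_port: int, dst_port: int, length: int) -> None:
    """Called per forwarded packet; the same counters feed traffic telemetry."""
    flow_bytes[(src_port, dst_port)] += length

# Simulated traffic between attached devices.
on_packet(0, 1, 256)
on_packet(0, 1, 256)
on_packet(3, 1, 4096)

hottest = max(flow_bytes, key=flow_bytes.__getitem__)
print(f"hottest flow {hottest}: {flow_bytes[hottest]} bytes")
```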
So it sounds as though we're going to gain a whole new layer of monitoring insight in terms of looking at performance, you know, areas that we previously haven't been able to see easily.
Absolutely. Yeah. And I think that's going to be, I mean, ultimately, I tell my team, right, we're going to be a five- or six-to-one software-to-hardware company when all is said and done.
But all of it is enabled by the performance and the connectivity that we get from the hardware.
But now that we have this visibility, the controllability, the granularity of control over these devices,
I believe there's a whole slew of applications and new kinds of things that we're going to be able to do architecturally
in order to improve overall performance at scale. And I think that's the key thing,
is to do it at scale. This is one thing that I think a lot of people aren't privy to,
is the details and the amount of work that Google has done in order to do things at
scale. And a lot of the things that we're talking about, they've already implemented in many, many
different things. And so what we're saying is how do we take that scale-up technology and bring it
into the rack and bring it into the boxes and the rack that can then be used at not just the cloud
layer, but at the fog layer
and at the edge layer where we're seeing, again, tremendous growth in equipment that needs to go
out there simply because you can't move data around. So let's talk about that, George. I think
that from a practical application standpoint, what will this allow? What will this enable that we can't do now? And as you mentioned,
I mean, I think the real difference is that current server architecture is really limited
basically by the box, by the constraints of things like memory channels, number of sockets,
number of PCI channels, that sort of thing. Once we're not
constrained by that, once we have a different kind of computing, a different shape of computing,
how will that change the applications that are being run on systems? Will we still be
sharding things? Will we still be developing microservices and tiny containers? Or will we
move back to a more monolithic compute environment?
Well, I don't think that you will get to a monolithic compute environment because I think
the heterogeneous compute environment is here to stay.
These specialized processors, which I first heard Dave Patterson and John Hennessy talk about, are becoming the reality of how these systems are working. The scale-up model, you know, because of Dennard scaling, has sort of hit a limit, so we need to scale out in terms of the kinds of processing elements that we use to process the data.
Right. But, you know, I think one of the things that we looked at was just in the memory configurations alone, right. We call it the $80 billion problem.
And it's because of, you know, the recent study that has come out that everyone's talking about in the industry, which said, you know, 25% of memory in the server is stranded.
That means it never even gets accessed because there's just not enough
resources to be able to get to that part of the memory.
And then 50% of a lot of the, you know,
the virtual machines and structures that are within there also don't get used
because there's a performance bottleneck in terms of being able to use all of those, right? And so, you know, if you look at the amount of dollars that are spent every day in putting memory into these servers that gets stranded, and that gets put into a rack, and you just count the dollars, right? It turns out, at an average of 400,000 racks per year being deployed, it's an $80 billion savings that we're going to see just by being able to pool memory.
This is why there's so much interest in doing this.
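Back-solving from the figures quoted here is straightforward arithmetic, though the study's actual methodology isn't given in the episode, so treat the implied per-rack number as illustrative only:

```python
claimed_savings = 80e9       # the "$80 billion problem"
racks_per_year = 400_000     # average deployment rate quoted above
stranded_fraction = 0.25     # "25% of memory in the server is stranded"

savings_per_rack = claimed_savings / racks_per_year           # $200,000
implied_memory_spend = savings_per_rack / stranded_fraction   # $800,000

print(f"${savings_per_rack:,.0f} recoverable per rack; implies "
      f"~${implied_memory_spend:,.0f} of memory per rack, which suggests the "
      f"savings accumulate over the installed base rather than one year's racks")
```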
And fundamentally, what that's going to do is, one, it's going to free the memory and be cheaper,
but you're also going to get better performance, because now the elements that have to access that memory can do it through greater bandwidth of connections that go into that memory, which you can't put into a server because of
the limitations of the power that can fit into the box, right? So if you've got a server appliance, you can have thousands of processing elements
accessing that pooled memory
and processing that pooled memory, right?
Where you really couldn't do that before
when it was compartmentalized in these boxes.
I think there's another interesting aspect there,
just flipping back to your tiered memory approach. There's nothing that says it has to be
DDR5 in the expansion. It could be DDR4, it could be DDR3, and now you're driving potentially
further significant savings. If a workload doesn't need high-speed RAM, it just needs a lot of it, that's a fantastic use case of older technology.
Yeah, and again, that's why I'm saying that the systems designers will be able to now have options in how these things are composed.
And it's not just the cookie-cutter granularity that you get today in the data center.
And this disaggregation is also going to help, as you talked about before, about how do you do upgrades.
Today, the server never gets touched once it gets deployed until it gets thrown away.
Now, if you've got these disaggregated resources, you can add more memory to it and don't have to disturb everything else that's happening.
You can add new memory, right, and new processing elements and, you know, different ways of
upgrading the various pieces of the server, right, compute, memory, storage, networking,
and software without having to bring down, right, entire aisles of racks and machines
in order to make that happen. I think that's going to make the management of these devices in the data center a lot easier.
More complex from the technical side, but easier from the physical side of having to go through and change things.
And that's going to lower overall costs again significantly.
Data center guys hate the truck roll, right?
And we'll be able to do things with much more efficiency.
So from your perspective, do you think that it would be better for that software to be standardized and live in the operating system or something like that?
Or do you think that it would be better to have more custom architecture to the software as well, or maybe even integrate it into the applications?
Yeah, so I mean, I think it's a tiered problem, just as creating the tiered access. Now you've
got to be able to create the tiered usability. And I think, again, this is why we need to really work together from an industry
standpoint to figure out where those control points need to be, right? So that they can be
optimized appropriately. In some cases, it may be just perfectly fine to let the hardware figure
out where the tier needs to be, right? But in other cases, you want to have maybe even more
granularity and more control on the application side that's going to make sure that your high-priority data is where it needs to be.
Yeah, that's great.
And I think that it reflects the reality of the challenge that a generic piece of operating system software is not really going to cut it.
We're going to need to have more integration higher in the stack
in order to make best use of these things,
and also in order to react as systems are recomposed and reconfigured.
So to wrap up this episode,
we're going to be asking a similar question to most of our guests,
at least here in this season.
And what I want to do is I want to kind of take a
step back and think about optimistically. We mentioned, for example, Craig mentioned just now
the strange concept, but a very realistic one, that CXL memory could actually open the door to
using lower performance memory modules because they're on the other side of the CXL bus.
Is there some other unexpected or surprising way that you think that CXL will change
technology and computing? Well, again, I think we go back to my early days working at Xerox and then
being fortunate to interact with a lot of the pioneers at Xerox PARC who were looking at different kinds of technologies.
And that's where I first heard the word composability.
This was back in the 80s.
And again, the concept has been around for a long time.
And the reason is because we don't want to have to continue to recreate
architectures over and over and over again as we start to improve, right? If we can have an
architecture that, you know, that can grow, you know, with the individual component technology
as it evolves, then we're better able to use that technology, right? And then ultimately
that will lead to the greater performance, right? And we're being pushed to do that now
because all the data that's being created now is a lot of it is machine generated data.
And people don't want to lose that data, right? We can't store it all because we can't create enough bits to store it,
but people don't want to lose the value of that data.
And so in order to extract that value,
we've got to be able to really optimize how we run these AI ML workloads on this data
and bring the cost to where it's reasonable to do it. And I think this is what CXL is enabling that we didn't have with PCI Express.
And that is the ability to share components, the ability to have peer-to-peer communications between those components,
and then ultimately leads to the ability to do what I call true composability.
So composability is not just connecting devices together.
It's how you control those devices and how you manage those devices all the way up the stack,
as we've been talking about, in order to ensure that we're optimizing the system level performance at the right points.
Well, thank you so much for this conversation, George.
It's been great catching up with you
and connecting with you here.
Before we go, where can people connect with you
and continue this conversation on CXL technology?
Yeah, so, I mean, we're on the web at elastics.cloud.
And I think the next show we'll be at is Supercomputing 22 in Texas.
And there we're actually looking at demonstrating a multi-server setup, meaning four or more servers, if we can get them, connected and sharing memory all together.
And so we're starting to scale this thing out so that we can start to demonstrate what the power of this is
as you start to move up the stack on the software side and managing the workloads that use those memories.
Well, that's great. I can't wait to see that.
One thing I'll mention is that all three of us were present at the CXL Forum, and the videos of those presentations
are going to be online shortly,
maybe by the time this episode is published.
So just use your favorite search engine,
look for CXL Forum and look for elastics.cloud
or the panel and so on.
And you'll find video recordings of that event as well.
Thank you for listening to the Utilizing CXL podcast,
part of the Utilizing Tech series.
You can find more episodes of this podcast
in your favorite podcast application.
Just search for Utilizing CXL or Utilizing Tech,
and you can find episodes of our previous iteration, Utilizing AI. For show
notes and more episodes, go to UtilizingTech.com. Thanks for listening, and we'll see you next week.