Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 4x16: Transparently Tiering Memory in CXL with Hasan Al Maruf

Episode Date: February 20, 2023

Tiered memory will have different performance, so operating systems will need to incorporate techniques to adapt to pages with different characteristics. This episode of Utilizing CXL features Hasan Al Maruf, part of a team that developed transparent page placement for Linux. He began his work enabling transparent page placement for InfiniBand-connected peers before applying the concept to NUMA nodes and now CXL memory. Hosts:   Stephen Foskett: https://www.twitter.com/SFoskett Craig Rodgers: https://www.twitter.com/CraigRodgersms Guest Host:   Hasan Al Maruf, Researcher at the University of Michigan: https://www.linkedin.com/in/hasanalmaruf/ Follow Gestalt IT and Utilizing Tech Website: https://www.UtilizingTech.com/ Website: https://www.GestaltIT.com/ Twitter: https://www.twitter.com/GestaltIT LinkedIn: https://www.linkedin.com/company/1789 Tags: #UtilizingCXL #TieringMemory #CXL #Linux @UtilizingTech @GestaltIT @UMich

Transcript
Starting point is 00:00:00 Welcome to Utilizing Tech, the podcast about emerging technology from Gestalt IT. This season of Utilizing Tech focuses on Compute Express Link, or CXL, a new technology that promises to revolutionize enterprise computing architecture. I'm your host, Stephen Foskett, organizer of Tech Field Day and publisher of Gestalt IT. Joining me today as my co-host is Craig Rodgers. Hey, Stephen. Good to be here. Looking forward to having a chat with Hasan around CXL and memory tiering, you know, that software layer in Linux. What about you? Yeah, that's really where we're going to go, Craig. Essentially, we've been hearing recently that now that there
Starting point is 00:00:47 are some CXL supporting platforms out there and some devices that can be tried out, we're starting to see some numbers. And the numbers look pretty good. The summary is that a CXL memory expansion card looks a little bit like memory on a remote NUMA node, or maybe even a little bit better than that. But that being said, there's going to be much greater latency once the memory moves onto a fabric and gets into a shared pool and maybe even goes remote, like we talked about within Teleprop. So we really need to start thinking about how to handle this idea that there are different kinds of memory. There are different tiers of memory with different performance.
Starting point is 00:01:33 And so that's why we received a tip that we should invite Hasan Al Maruf to join us here. Hasan is a researcher at the University of Michigan focused on transparent page placement, which is really a great and important way to enable the use of tiered memory. Welcome to the conversation, Hasan. Hi. Hi, Stephen and Craig. It's nice to meet you. So tell us a little bit more about your background. How'd you get into this? Yeah, sure. So I'm a PhD student at the University of Michigan. I'm at the end of my PhD. My research focus is mostly on data center resource management
Starting point is 00:02:13 and performance optimization. It's mostly on resource disaggregation, heterogeneity-aware memory management, and recently on CXL systems. So you can say, in a word, my whole PhD is about how to make memory disaggregation practical. I led comprehensive solutions to address the host-level, network-level, and end-to-end aspects of practical memory disaggregation: how we can hide the latency of memory disaggregation, how we make it fault tolerant, how we can actually make it more efficient and ubiquitous. And recently with CXL systems, there are tiers of memory, and how we can handle those things. And mostly we focus on making these solutions
Starting point is 00:02:54 transparent to the applications, so that existing or newer applications can benefit from this memory disaggregation. And you can say that whatever challenges come up in making memory disaggregation practical, our lab has focused on almost all of these aspects over the years. And I was very fortunate
Starting point is 00:03:18 to have the opportunity to work with Meta and get access to CXL devices and do something that could be open sourced and merged into Linux over time. And so this TPP you are talking about is transparent page placement for CXL systems. This is the new addition in our work. And we are thinking of many more new angles on CXL-enabled systems, how to approach them from the software side. And yeah, that's it in a very few words about me.
Starting point is 00:03:50 You mentioned a really interesting turn of phrase there around hiding latency. You know, one of the biggest concerns people will have with large scale adoption of CXL is not just the cost of adding in more RAM, but also the performance, and latency is one of those key metrics that people are looking at. Can you go into any more detail around how you're able to hide that latency? Yeah, yeah, yeah. So my very first PhD project was Leap.
Starting point is 00:04:19 At that time, when I started in 2017, there were RDMA-enabled InfiniBand networks. That was the fastest network available, which is microsecond scale. So in that case, with this InfiniBand network, you can connect two different servers, and one server can access the memory of another in microseconds.
Starting point is 00:04:40 So even if you remove all the software overheads from the OS, the network is still around three to four microseconds for a four-kilobyte page. So you cannot break the physics, right? In the normal case, CPU-attached memory has hundreds of nanoseconds of latency, and in the RDMA-enabled case it's microsecond scale. So you have to somehow hide this latency by doing some prefetching or some intelligent systems. So Leap enabled in-kernel prefetching mechanisms for remote memory, for disaggregated memory. And that could provide hundreds of nanoseconds of latency up to the 95th percentile, and for the rest of the percentiles we were at around a microsecond. With CXL this microsecond
Starting point is 00:05:33 comes down to around 200 or 300 nanoseconds, but that's still a hundred or two hundred nanoseconds higher than normal local CPU-attached memory's latency. So in TPP our main approach was that whatever is hot should be in the CPU-attached memory, and whatever is cold or warmer, where we can handle the extra access latency, could be put into the lower, slower tiers of memory that are attached over CXL. So that was one approach: whatever is hot, always try to move it to the hottest tier of memory. But this is also a
Starting point is 00:06:17 reactive approach, what we have in TPP. It's the state of the art, but it still has some scope for improvement. Some prefetching could be applied here, some other mechanisms could be applied here to make it even faster. So right now the best approach is: whichever pages are hot, always try to allocate or bring them into the topmost tier of memory, and that can reduce a lot of slow-tier memory accesses, and you can effectively hide some of this extra latency of CXL memory. Yeah. And we're not talking about caching here. We're talking about basically having the software intelligently move pages of memory based on the characteristics of that
Starting point is 00:07:00 page, the accesses to that page. So, you know, if it's something that's being used quite a lot, quite often, you put it in the memory that's closest to the CPU. And if it's not used very much, then you put it in a different tier. And it sounds like this would make sense. Like you said, you've got InfiniBand-connected RDMA peers. You've got other kinds of memory tiers as well. I mean, we've heard quite a lot about Optane memory, persistent memory from Intel, which Intel just announced the third generation along with the Sapphire Rapids platform. We've talked about CXL connected memory,
Starting point is 00:07:40 and even NUMA memory, I could imagine that TPP might be viable even in a conventional, you know, just a NUMA system where you've got some memory that's close and some memory that's a little bit further away. Yeah, yeah, you are absolutely right. So basically in TPP's world, although it's designed for CXL systems, it could be generic to any kind of NUMA system. When you attach a CXL memory device, it appears to the OS as a new NUMA node. So if we can support page placement for different types of NUMA nodes, that will obviously be applicable to any CXL system. And that's what we did in TPP. When you have to allocate a page, you try to allocate the hottest pages in the very fastest tiers of memory. And over time, you cannot always do this, because your CPU-attached memory has some limited capacity.
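To make that concrete: on a Linux machine, a CXL memory expander generally shows up as a memory-only NUMA node with no CPUs attached. The following is a minimal sketch, not from the TPP patches themselves, that only assumes the standard /sys/devices/system/node sysfs layout and flags the CPU-less nodes:

    #!/usr/bin/env python3
    """List NUMA nodes and flag CPU-less (memory-only) ones, the way a CXL
    memory expander typically appears to the OS, as described above."""
    import glob
    import os
    import re

    def read(path):
        try:
            with open(path) as f:
                return f.read().strip()
        except OSError:
            return ""

    node_dirs = sorted(glob.glob("/sys/devices/system/node/node[0-9]*"),
                       key=lambda p: int(re.search(r"\d+$", p).group()))
    for node_dir in node_dirs:
        node = os.path.basename(node_dir)
        cpulist = read(os.path.join(node_dir, "cpulist"))
        mem = re.search(r"MemTotal:\s+(\d+)\s*kB", read(os.path.join(node_dir, "meminfo")))
        gib = int(mem.group(1)) / (1 << 20) if mem else 0.0
        kind = "CPU + memory" if cpulist else "memory-only (candidate far-memory/CXL node)"
        print(f"{node}: {gib:7.1f} GiB  cpus=[{cpulist or '-'}]  {kind}")

On an ordinary two-socket server without CXL you would only see CPU + memory nodes; a memory-only node is what an expander or other far-memory device typically produces.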
Starting point is 00:08:45 So at some point, if everything is hot, you have to move some things to some other layer, right? So TPP is a solution that guarantees that whatever matters the most will always remain in the topmost layer, and the warmer or colder portions will be efficiently migrated or moved to the lower part, so that this reclamation does not impact your allocation. Because in today's OS, whenever you need to allocate, you have to find whether there's enough memory in the topmost tier or not; if it's not there, by default the allocation will go to the next available memory tier. And when you allocate in the next available memory tier,
Starting point is 00:09:29 even in a two-socket system, it will have an extra 60 to 80 nanoseconds just for accessing memory on the second socket, the remote socket of the CPUs. So that's why we had a faster demotion path and one very optimized promotion path. In the demotion path, we demote so fast
Starting point is 00:09:56 that there is always headroom in the topmost tier, and new allocations can happen there. And in the promotion path, whenever something becomes hot again within the colder part, it can be brought back into the top tier in an optimized way. And not every brought-back page would actually be utilized there; some pages could be cold and accessed only once in an hour, and if you mistake them for hot pages and bring them back into the topmost tier, eventually they will become cold again and you have to
Starting point is 00:10:30 demote them back. So that will be some wastage of bandwidth, wastage of CPU cycles, so we need to be more accurate about what we are bringing from the colder tier to the hot tier. All these things are covered by TPP: it considers at which point which pages could be a promotion candidate and what will be the benefit of moving them from the coldest tier to the topmost tier. So, yeah. And you guys submitted, or I think you submitted a kernel patch to Linux a couple of years ago in 2021. Is that in the kernel now? Is that something that's out there?
Starting point is 00:11:11 There are basically two patch sets. One is all the basic things of TPP. There's some performance monitoring: for the different page types, anon pages or cache pages, how much has been promoted or demoted, what's the rate of this movement, what's the failure rate of page migrations, of promotions and demotions. That's one thing. Another is that we decoupled things by adding a new watermark system; usually there were the high and min watermarks, and we added another extra watermark there. Then we made this demotion path, how fast we can do this. There's another patch over there.
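For readers who want to poke at the merged pieces on their own machine, here is a rough sketch of reading the promotion/demotion statistics and related knobs. The names used (pgpromote_* and pgdemote_* counters in /proc/vmstat, /sys/kernel/mm/numa/demotion_enabled, and the numa_balancing sysctl) are assumptions based on what recent tiering-capable kernels expose, not part of the patch description above, so check them against your kernel version:

    #!/usr/bin/env python3
    """Snapshot memory-tiering knobs and promotion/demotion counters.
    Knob and counter names are assumptions based on recent kernels; some
    may be absent on older or differently configured kernels."""

    KNOBS = {
        "demotion_enabled": "/sys/kernel/mm/numa/demotion_enabled",
        "numa_balancing": "/proc/sys/kernel/numa_balancing",  # tiering-aware promotion mode on recent kernels
    }

    def read(path):
        try:
            with open(path) as f:
                return f.read().strip()
        except OSError:
            return "<not available>"

    def tiering_counters(vmstat="/proc/vmstat"):
        counters = {}
        for line in read(vmstat).splitlines():
            name, _, value = line.partition(" ")
            # Promotion/demotion statistics added alongside memory-tiering support.
            if name.startswith(("pgpromote", "pgdemote")):
                counters[name] = int(value)
        return counters

    if __name__ == "__main__":
        for knob, path in KNOBS.items():
            print(f"{knob:18s} = {read(path)}")
        for name, value in sorted(tiering_counters().items()):
            print(f"{name:24s} {value}")

On kernels without tiering support some of these files simply will not exist, which the sketch reports as "<not available>".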
Starting point is 00:11:53 And also for this promotion, we modified AutoNUMA. So right now AutoNUMA has different modes: one is for vanilla NUMA balancing and another is for tiered memory systems. For the tiered memory mode, TPP will work and do these migration or promotion steps. So that's one thing, the basic TPP approach, which is generic. And then there's another patch set that is specific to applications.
Starting point is 00:12:17 Like you can say some applications use anon pages, some applications use cache pages. So let's say one application is not sensitive to the cache, but during boot-up it allocates most of its file pages, and those file pages consume most of the memory of your hot memory tier. And when the anon pages come, which are very sensitive, they have to be allocated to the lower, slower tier, and then either you have to promote them back to the topmost tier or you have to bear the latency of the CXL memory access. So there are page type-aware allocations. If we
Starting point is 00:12:57 say, okay, for this application the cache is not that sensitive, then from the very beginning of the application runtime all the cache pages will be allocated to some slow memory tier and all the anon pages will have the highest priority in the topmost tier. That's one patch set. Another is interleaving. Right now in NUMA, when you do interleaving, every NUMA node is of similar behavior and they are considered of the same value, so interleaving just does round robin: if you have three nodes, it will allocate one by one, one to each. But in the CXL world, different NUMA nodes may have different characteristics: some may have high bandwidth, some may have low latency. So based on the different NUMA nodes' behavior, we can set the weight of interleaving.
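To picture what weighted interleaving means in practice, here is a tiny pure-Python sketch of the allocation pattern being described, with made-up node IDs and an arbitrary weight ratio. It illustrates the policy idea only, not the kernel implementation:

    from itertools import cycle, islice

    def weighted_interleave(weights):
        """Endless stream of node IDs in weighted round-robin order:
        each node appears weights[node] times per round."""
        return cycle([node for node, w in weights.items() for _ in range(w)])

    # Hypothetical two-tier system: node 0 = fast DRAM, node 2 = slower CXL expander.
    weights = {0: 4, 2: 1}            # 4 pages on DRAM for every 1 on CXL (arbitrary ratio)
    placement = list(islice(weighted_interleave(weights), 15))
    print(placement)                  # [0, 0, 0, 0, 2, 0, 0, 0, 0, 2, ...]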
Starting point is 00:13:47 So maybe the topmost tier can get 10 pages, followed by two pages in the slowest tier. There you can set the ratio of how much of the allocation goes to the different tiers of memory. So this is another patch set. These two patch sets have been published. Among them, the generic TPP has already been merged in Linux version 5.18. And these interleaving or page type-aware allocation policies are yet to be merged, because those are very specific to different applications, and the Linux community always considers that whatever is generic should be merged first. So if it's not that generic, it's very hard to convince people, okay, this should go in. But we are very much hopeful that
Starting point is 00:14:31 these things, the page type-aware or application-aware policies, will be merged pretty soon. But the basic TPP is already in the Linux kernel. It's interesting what you're describing there, you know, that tiering. If we throw back to the storage world, you had early storage devices that would have had an SSD caching layer for hard drives, and whilst we're miles apart in terms of latency and throughput, with tiers of memory the functionality is the same, where you're keeping hot data on faster, closer storage. It's more readily accessible.
Starting point is 00:15:10 But normally those are proprietary-type products that provide that level of functionality. You're talking about this being integrated into the Linux kernel itself and making those decisions around the tiering. That would have to be observable, you know, there would need to be evidence as to how that was being used and addressed. So how can Linux people observe that, monitor that, and report against it? Okay, so far in our TPP patch there's some basic monitoring, but this is not comprehensive monitoring. Some of the monitoring is, when we are reclaiming,
Starting point is 00:15:48 what types of pages we are reclaiming, what's the rate of reclamation. So then you can say, okay, these are the specific types of pages that remain cold over time, and for that they are going to move somewhere else. And when we are promoting, what types of pages have been promoted,
Starting point is 00:16:07 what's the rate over time, what types of pages have been promoted at what granularity, that can be seen there. And during migration, if something fails, what are the reasons for the failure, where it's failing, which node is failing. So these very basic observations are
Starting point is 00:16:26 available today with TPP. But that's not enough. You need to know the application's behavior, and for that you may not want to modify the whole Linux kernel; you may need some different tools. So in our project at Meta we have a tool called Chameleon. It's not open source yet, but we are very hopeful to open source it within a couple of months. That tool is useful for monitoring. With it, when you run an application, you can see the behavior of this application's memory access pattern, how much and which chunks of its memory have been hot within a given time period.
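Chameleon itself isn't public yet, but the kind of characterization being described here can be sketched generically: take a stream of sampled page accesses from whatever sampler you have, bucket the samples into time windows, and report what fraction of the resident pages was touched in each window. The sampler and the numbers below are stand-ins for illustration only:

    from collections import defaultdict
    import random

    def hot_fraction(samples, resident_pages, window_s=120):
        """samples: iterable of (timestamp_seconds, page_id) access samples.
        Returns {window_index: fraction of resident pages touched in that window},
        a crude stand-in for the working-set heat characterization described above."""
        touched = defaultdict(set)
        for ts, page in samples:
            touched[int(ts // window_s)].add(page)
        return {w: len(pages) / resident_pages for w, pages in sorted(touched.items())}

    # Toy trace: 1,000 resident pages, accesses skewed toward a small hot set of 200 pages.
    random.seed(0)
    trace = [(t, random.randrange(200) if random.random() < 0.8 else random.randrange(1000))
             for t in range(600)]                      # one sample per second for 10 minutes
    for window, frac in hot_fraction(trace, resident_pages=1000).items():
        print(f"window {window}: {frac:.0%} of resident pages touched")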
Starting point is 00:16:57 Like maybe you can say that within a two-minute time period, 20% of this application's memory remains hot. What is the re-access granularity of the pages? Maybe within two minutes,
Starting point is 00:17:19 80% of its pages have been re-accessed. So that means most of the pages are being touched over time. And what is their temperature? You can have a heat map, I can say. And there are different approaches to doing that. One could be using the Linux kernel's page access bit tracking; another could be things like PEBS-based sampling supported by Intel, and AMD's processors have a similar mechanism. So using these CPU counters you can have some idea, okay, for this particular application this is the memory access behavior, and considering this behavior, whether this application
Starting point is 00:18:06 is applicable for traditional setups or whether it can be moved to the CXL world, and if we move it to the CXL world, how much performance loss we can expect. With CXL you have to bear the latency bottleneck, so there should be some performance degradation there, but how much degradation will we have? And if we can be very intelligent in moving the right portions of memory to different tiers, maybe we won't have those performance bottlenecks. Like at Meta we find that even if we reduce the topmost tier to only 20 percent of the whole working set
Starting point is 00:18:48 and 80% could be in the CXL tier, we find we lose only 5% of the performance. So you can know these things by characterizing the application. This kind of tool we have already developed, and it should be able to be open sourced. Once the paper is published, it is easier to open source the stuff,
Starting point is 00:19:05 because at Meta, you know, there's a lot of review process around whether we should publish it or not. But there's another company, MemVerge. They already have a product, Memory Viewer, that can also be useful to understand the same kinds of things. So yeah, I think basic Linux,
Starting point is 00:19:23 it should have some counters, it should have some monitoring systems for the very basic things. But in Linux you cannot have the freedom of doing everything, so there should be some user-space tools. And with CXL switches there will be more counters. Using those counters you can see, okay, this specific node has this amount of load, heat, those kinds of things. And based on that, you can come up with other tools for monitoring the memory access behavior at different layers or different nodes or different
Starting point is 00:19:55 devices. And yeah, from there you can have a more robust and stronger characterization of applications. So I'm glad that you brought up the Meta results, because I know that's been another thing that you've done: some benchmarking, some work characterizing the performance of transparent page placement. And it's really pretty positive. Like you said, you don't have to have the entire working set
Starting point is 00:20:22 in the hottest tier of memory. You can do it on a page-by-page basis. You showed some serious performance improvements by implementing this. Do you think that eventually... I mean, we've been sort of assuming that tiered memory is going to be a normal part of compute architecture going forward. Do you think that eventually transparent page placement is just going to be something that the kernel does, and it doesn't have a name anymore because it's just something that's on and you just use it? Yeah, why not? If you consider today's NUMA systems, you are basically doing some kind of page placement already, right? So what's the difference with
Starting point is 00:21:05 CXL? It's a complicated NUMA system, right? And you have to do some page placement for these types of things. In today's NUMA systems everyone is homogeneous, and future NUMA systems will be heterogeneous, so there should be some support for the heterogeneity perspective, and TPP is the stepping stone for this heterogeneity. There could be many more improvements over it. Like in TPP right now we are demoting to the next possible NUMA node, but with switching, with CXL 2.0 or 3.0 when switching is available, multi-layer switching, there could be GFAM, there could be some other Type 2 devices there. So in that case, which layer or which specific NUMA node could be useful for these applications? Not always having the topmost
Starting point is 00:21:58 tier is helpful for the application. If you consider multiple applications running on the same system, and every one has different performance demands, in that case maybe the next tier down can be enough to maintain its desired performance. So in that case, which tier can you choose,
Starting point is 00:22:19 or what fraction of which tier is enough to provide the performance benefit for these applications? That could be a good challenge, and we are working on that.
Starting point is 00:22:30 And there could be many more challenges that can be approached; people will work on them and are already working on them. And obviously some of them will be merged into Linux, because the whole industry is moving into that area, so you have to have some support somewhere.
Starting point is 00:22:53 The Linux community has always been very good at baking in a lot of kernel support for a lot of hardware solutions. Are you seeing other areas of the Linux community contributing anything else around CXL and memory tiering? You know, knowing the technology is going in this direction, it's one of these things that will benefit all the users of Linux and all the contributors. So, just as I said before, for a memory monitoring tool, for what the working set's behavior is, there's already a
Starting point is 00:23:32 patch set that was merged in Linux 5.15. It's from SeongJae Park from Amazon; it's called DAMON. So for a specific workload, what fraction is hot or cold, you can already have some view of that world, so that's already there. I know at Samsung they are doing some very good work, or they have some work in the pipeline, to support CXL ecosystems. That could be things like auto-healing or resiliency. Earlier everything was within a single machine, and now we are going beyond a single machine's domain, so a machine can fail at any time,
Starting point is 00:24:17 your main memory can be unavailable at any time, so you have to provide some resiliency for that. People are working in that area. And memory sharing, consistency, coherency management, these things are being addressed. I think the whole industry is working on different aspects of these problems. And maybe within one or two years,
Starting point is 00:24:46 I think there will be lots of patches merged into the kernel, and maybe CXL will be ready to use by everyone by the end of 2023 or 2024. Yeah, and I think we also have to give credit to Intel for doing a ton of the basic plumbing for CXL within the Linux kernel. I know that they've submitted and just keep submitting more and more, including recently a bunch of support for Type 3 memory devices in CXL.
Starting point is 00:25:16 So I think that it's great to see so many people contributing to the development of the Linux kernel, because, of course, a lot of systems out there, I think probably an easy majority, if not a vast majority, are going to be running Linux at this point. Though, as we've talked with other companies, I mean, we talked to VMware, they're certainly planning on having a lot of CXL support as well.
Starting point is 00:25:40 But I mean, right now, Linux is really where it's at. Does that match your opinion, Hasan? Yeah. I think everyone is trying to build the whole ecosystem so that it's open to all and everyone can easily use it. And I think almost all the big companies are focusing on how to do this work, make it open source, and make it even more usable by everyone. So it sounds as though the community are contributing to empower the proprietary solutions they need to build, to do advanced monitoring, advanced memory tiering solutions, but they're building in the mechanisms within the kernel space
Starting point is 00:26:25 so that they facilitate their user-space requirements? In some part. There are some mechanisms that are fundamental, like this placement stuff: from user space, without modifying the OS, you cannot do much. Let me share my experience. In the Meta project, initially we tried with this characterization tool, and we found, okay, this is the heat map of a specific
Starting point is 00:26:58 application, and from that heat map, why not do this memory management from user space? The problem there was that some memories, some pages, are very short-lived; they come and go, and by the time you decide, okay, these are the pages we need to migrate, those pages are gone. And from user space you need user-to-kernel-space context switching, and that can take time, and you don't have that much freedom to work. So from that point, we decided a user-space solution may not be sufficient. We need to modify the kernel and apply these specific things from the kernel so that it could be easy, it could be faster, and it
Starting point is 00:27:45 could be generic. And that's why, even though we had the freedom of doing everything from user space, eventually we had to go into the Linux kernel and modify it. And I know some cloud vendors, at the hypervisor level you can say, don't have the opportunity to understand what the application is doing, so maybe they are completely oblivious to the behavior of the application. They have some idea of what it is doing over there, but a per-page idea, tracking per page, can cost them very high latency.
Starting point is 00:28:20 So from their perspective, modifying the hypervisor could be the best option to go with, so there could be some solutions that work at the hypervisor level. And yet, let's say you have some very intelligent mechanism, you want to run some ML, let's say you run some machine learning algorithm to understand what the application behavior is or what its other properties are. That ML application you cannot integrate with the kernel, because it will be too heavyweight if we run ML inside the kernel. For that you need some user-space tool or some other mechanism that can communicate with the kernel. So
Starting point is 00:29:01 based on your use case, based on your requirements, I think different solutions are needed at different layers. They could communicate with each other, or they could be completely independent solutions. And I feel there's also some room for proprietary solutions, and there's some fundamental stuff that needs to be done inside the kernel, which is what Intel, Samsung, and some other companies are doing. Have you observed much of an overhead in tracking these page values? You know, if you're tracking heat, there's obviously an overhead to that. Is there a point of reflection? Whenever you are tracking, there will be an overhead.
Starting point is 00:29:38 Let's say we are tracking with the idle bitmap. In that case, you have to set and reset the page idle flag of every page over there. And if your working set size, let's say, is a terabyte, then you have to frequently modify the bits of all the pages within that one terabyte.
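A quick back-of-the-envelope makes the scale of that clear, assuming the usual 4 KiB pages and one idle bit per page:

    WORKING_SET = 1 << 40                 # 1 TiB working set, per the example above
    PAGE_SIZE = 4 << 10                   # 4 KiB pages

    pages = WORKING_SET // PAGE_SIZE      # idle bits to set, then re-read, every scan
    bitmap_bytes = pages // 8             # the bitmap packs one bit per page

    print(f"pages tracked per scan : {pages:,}")                       # 268,435,456
    print(f"bitmap touched per scan: {bitmap_bytes // (1 << 20)} MiB")  # 32 MiB
    # ...and every sampling interval repeats both the write (mark idle) and the
    # read-back (see which pages were touched), on top of the per-page flag
    # manipulation inside the kernel.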
Starting point is 00:30:04 And it will be a tremendously high overhead, so that's one reason why we did not go the idle bit tracking way. Then we found, okay, Intel CPUs give us PEBS, PMUs are available there, so from there we can get some CPU counters; let's just sample from the CPU counters and work with that. But even when you're using the CPU counters, you need to burn some CPU cores to get these numbers and parse them and then actually get some analysis. So that's another reason this PEBS-based tool, this counter-based tool, has not
Starting point is 00:30:40 been on the kernel side; we did not go that route, because if we use these CPU counters from the kernel, we lose that freedom, we have fewer counters available in user space, it will halt the CPU every now and then, and if the workload is very CPU-heavy, maybe your monitoring system will halt every now and then and your application may crash. So that's why we separated this into two separate tools: one is for characterization, which you can run every now and then and halt every now and then, and another is the very basic thing always running within the kernel. So two different tools: one is the characterization tool, another is the Linux kernel modification for page placement, which is very basic. So yeah, the overhead is very high when we run the Chameleon tool.
Starting point is 00:31:31 We found that for a very heavy workload it is around a 10% slowdown. And there are some other works available where they showed that in an IPT-based case you can have 20 to 90% overhead based on the workload behavior. So whenever you're running these things, you have these overheads. And even in TPP, we used AutoNUMA. With AutoNUMA you also have some kind of faulting mechanism, because in AutoNUMA there is a page fault happening, and from the page fault you are seeing which page is being accessed and which CPU is trying to access it.
Starting point is 00:32:02 So that has some overhead also. In our case, we always place pages in a way that the cold pages will always be in the CXL tier. And if we can manage that, a page becoming hot out of the cold data happens at a comparatively lower frequency. So that's how we actually hide the overhead of AutoNUMA. But if there's a workload that is thrashing everything across all the tiers of memory, then we will also pay the overhead of AutoNUMA sampling all the hot pages from the CXL memory.
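One way to picture the accuracy problem he keeps coming back to, promoting a page that was only briefly touched and then having to demote it again, is a simple counting filter over a sampling window. This is a conceptual sketch only, not the actual TPP or AutoNUMA heuristic:

    from collections import Counter

    def promotion_candidates(sampled_accesses, threshold=4):
        """sampled_accesses: page IDs observed (e.g. via hint faults) in the current
        window. Only pages seen at least `threshold` times are treated as hot enough
        to promote; one-off touches stay in the slow tier, which avoids the
        promote-then-demote ping-pong and the wasted bandwidth it causes."""
        counts = Counter(sampled_accesses)
        return {page for page, n in counts.items() if n >= threshold}

    window = [7, 7, 7, 7, 42, 42, 13, 7, 99, 7]   # toy sample of faulting page IDs
    print(promotion_candidates(window))            # {7}: only the genuinely hot page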
Starting point is 00:32:32 So in the memory world, no one solution is a good fit for everyone. You have to understand your use case, the behavior of the workload, and based on that, you need to come up with different solutions. Yeah. And that's probably true of any technology. You have to have the right horse for the right course. I wonder if you can wrap up by just sort of giving us a peek at what comes next. So you've got transparent page placement in the Linux kernel.
Starting point is 00:33:10 At least some of the code is in there, and you're looking at moving forward with more. What do you think is coming in '23, '24, '25 on the software side for CXL support that you're excited about? So in my opinion, what we did is for CXL 1.0; that's the proof of concept that CXL works. CXL 2.0 will come with switching, one layer of switching; we can connect multiple devices there, and that can go beyond the machine's boundary. And with CXL 3.0, which I guess will be available in the 2025 or 2026 timeframe, we will have multiple layers of switching.
Starting point is 00:33:50 That could be a gigantic network of memory. In that case, we can do sharing, we can have peer-to-peer communication, we will have GFAM, and 4,000 nodes can be connected. So that could be a really crazy time in that period. And in that case, how we are going to manage this coherency, how we are going to communicate between different devices,
Starting point is 00:34:17 like CPU and GPU, how they are going to share memory, how they are going to access each other's memory area, we need some support for that. And I think right now whatever people are doing is mostly focused on the Type 3 devices, where it could be memory extension, memory expansion. The next one is how we can do the disaggregation at rack scale, and theoretically this can go beyond a rack; it can connect the whole data center, theoretically, but I'm not sure whether any practical use case will be available or not. But who knows, you can
Starting point is 00:34:58 connect the whole data center through CXL, and in that case all the networking problems we are facing right now will be somewhat reinvented from the memory perspective, and we may need to handle everything we are handling right now in the network in the memory world as well. So a lot of work to be done, a lot of exciting stuff happening. And the cool thing is, you know, it works. And I think that's the thing that we're all most excited about. So thank you so much for joining us, Hasan. It's been really, really fascinating to hear this aspect of it, because we've spent most of our time on Utilizing CXL talking to the hardware companies.
Starting point is 00:35:39 And so it's great to hear a little bit about the Linux kernel support and the work that you've been doing. If people want to continue this discussion or learn more about TPP, where can they connect and where can they find it? Yeah, so you can connect with me on LinkedIn. My LinkedIn handle is Hasan Al Maruf, my name. So you can shoot me a message and ask me anything, whatever you want to know about this, or have more discussions. And
Starting point is 00:36:05 yeah, you can also send me an email at hasanal.umich.edu. I'm happy to have a chat with anyone. All right, cool. And as for me and Craig, you'll see us at Tech Field Day in March, where we're going to be talking to some of the CXL companies. You'll also see us here every week on Utilizing CXL. And of course, you can find us on social media and we'll include our links in the show notes. Thank you for listening to the Utilizing CXL podcast, part of the Utilizing Tech podcast series. If you enjoyed this discussion, please subscribe in your favorite application, and please do give us a rating and review. This podcast is brought to you by gestaltit.com, your home for IT coverage from across the enterprise. For show notes and more episodes,
Starting point is 00:36:50 go to utilizingtech.com or find us on Twitter at Utilizing Tech. Thanks for listening, and we'll see you next week.
