Storage Developer Conference - #199: CXL Memory Disaggregation and Tiering: Lessons Learned from Storage

Episode Date: January 29, 2024

...

Transcript
Starting point is 00:00:00 Hello, this is Bill Martin, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast Episode 199. There's been a lot of good stuff I've learned here at the conference. I don't know about you guys, but it's been certainly a pleasure to hear all the different people
Starting point is 00:00:46 talk about their experiences, especially the last one. His name was Andy, too. I don't know if you were in that one, but it seemed like we were both doing tiering. He kept his hair, and I didn't. So I'm losing mine. So I think it's interesting to see
Starting point is 00:01:00 how both different camps are following different principles. If you're in this last session, they talked about tiering as being kind of more dynamic, and then there's static and analysis-based tiering. We're going to talk a little bit here about the dynamic aspects. This is based on a real-world application we put together. So with that, let me kick off. The purpose of this talk here is really to try and recap a little what can we learn from storage. You know, storage has been around doing this thing for many, many years, right?
Starting point is 00:01:34 And I've personally been involved and touched many different aspects of storage over the years. And I'll just touch on that a little bit. But, you know, memory can learn a lot from storage, but scaling, high availability, recovery, RTO, RPO, for those of you who are familiar with the whole recovery objective scenario, because I think everyone's glossing over that little, you know, your blast radius for memory. You take out a shared memory pool device. It has much more dire consequences in many ways than storage did, which was more of a passive thing than memory. And stateful versus stateless, there's all kinds of things. VMware, we won't touch on that here, but there's a lot of different things that we can learn there. Topics covered here, we're going to do a quick refresh on the storage area networking,
Starting point is 00:02:19 a little bit of that. It's quite a complex set of things that evolved over the years, but we can certainly point out a few things, how caching tiering works. Going to go through a case study of a project that was very near and dear to me in my last company, which is now part of Smart, where we actually developed a dynamic tiering engine for storage for Linux and Windows applications
Starting point is 00:02:42 and actually partnered with people like AMD and Dell to distribute this out in real-world applications. There's about a million seats deployed, at least, of this technology. So this is real-world stuff, not a hypothetical. And then what are the lessons that we can potentially learn here going forward? A little bit of background on myself. I come from kind of a background of compute. Transputer, for those of you who have ever heard of it,
Starting point is 00:03:09 was one of my earliest projects working with multi-parallel processing systems in sonar submarines. Going into optical fiber, FTDI, Ethernet, networking, shared storage. Really got into storage in around 2000 when we started doing a lot of RAID and storage HA environments, involved with.hill, for example, other folks doing semiconductor-based storage devices, and eventually software-defined tiering, which is going to be the subject of here, or software-defined storage, with the focus on tiering is what we did for the last 10 years.
Starting point is 00:03:46 And then what I'm involved with now is I actually run a group of engineers inside of Smart that are developing CXL add-in products. If you see the E3S, that's our team produced that. There's add-in cards with DIMMs on there now that can do memory expansion. And we're now exploring the ability to tier between those various components. So a little bit of a recap on disaggregation and composability. For those of you who don't know, disaggregation has been around for like 20 years. It's not a new concept. It's just like a lot of things in storage and memory and compute. New things are really inventions of old things just
Starting point is 00:04:22 done better, or they now become practical realities. You can actually do them, right, because the industry was able to catch up and do it. So, you know, SANS is kind of a first attempt at disaggregation of storage when you think about it, right? It's the separation of the storage from the compute. And that developed the whole thing starting in the 90s. And then the word disaggregation really came out in 2013, around that time, to 2017, when Intel started popularizing the whole disaggregated server or disaggregated rack. And I remember being at several conferences where this was kind of bleeding edge, you know, will it happen? Will it not happen? And we're now experiencing and seeing that happen. And CXL has
Starting point is 00:05:03 been one of those, you know, those great adders or enablers of that technology, I think, as we move forward here. Composability has started to really take shape. Disaggregation is the act of separation, you know, separating into component parts. Composability is the ability to orchestrate or configure these components into something useful, right? And so NVMe over Fabrics was kind of the first attempt to kind of make that a little bit more dynamic, where you could make it command line driven from a central brain, as it were, in the system that can now allocate, you know, chunks of storage a lot easier than having to have a PhD in SAN or storage.
Starting point is 00:05:43 You know, you could make it a little bit more os command line driven gpus become disaggregated in 2020 you see companies like liquid and other folks like that come into the scene then the first demo of memory disaggregation really starts in 2022 so that's kind of the history let's recap a little bit on memory expansion types we we've got um look at a typical motherboard i'm going to take you right down to the depths of the server now. You've got standard DIMMs, possibly even custom modules. We as a company were involved in the OMI standard with the DDIM, for example, adopted heavily by IBM. But there are various ways of adding direct storage. And then you've got the newer concepts. We've literally, this week in our lab, brought up our first eight DIMM half-length board that they can plug in and allow
Starting point is 00:06:33 you to easily expand memory over CXL on the PCIe bus now. So if you want to add that half terabyte, one terabyte, up to potentially four terabytes on a card, you can now do it with a, quote, plug and play. We're not quite plug and play yet, but we're getting there. The industry is getting there. The other one, of course, which we've heard about throughout this conference is the ability to hook up a memory box or pool box. We see that happening a lot in our company.
Starting point is 00:06:59 We have Penguin Computing, who's involved in the HPC side, where we have to put together a lot of large memory model stuff now as we go forward. So we look at either one-to-one relationships with a JBOD expansion or a JBOM is the best way to look at it and push the bunch of memory. And then as we go out, we've seen, of course, CXL Fabric 3.0. A lot of people think for enterprise, that's where it really starts to come together for CXL. But personally, I see there's a lot of opportunity in these first two here. It's just the add-in memory expansion because we are hitting a dim limit, how many dims and how much memory you can put on a single CPU. And there are certainly applications breaking that now. The other one,
Starting point is 00:07:41 we've seen various substantiations of this diagram. Here's my own personal rendition of it, of the memory hierarchy. And I chose to draw this more from a, if you have like in a two socket system, you've got quite a bit of latency creeping in now. And that's the whole point. I think we've heard this before. CXL before, before CXL, before just aggregated memory, you just had the CPU with a bunch of memory. And that memory was one large blob of memory. Now what you're talking about here is different tiers of memory with different latencies. And nine times out of ten, some applications may not care. But mostly there's a lot of emerging applications that do care now.
Starting point is 00:08:22 Where are they running? So when a CPU comes up and just declares this is all one blob of memory, for example, it's clearly insufficient to just assume that I'm okay running down on the far right-hand corner there where you've got a CXL expansion device going through a switch, which might be something as much as half a microsecond up to a microsecond maybe with contention
Starting point is 00:08:42 and other things going on, versus am I running out of HBM or running out of local near DDR? You need some element of being able to map the workload to that. Hence, that's why the whole interest in transparent tiering and the ability to tier memory is really being talked about a lot. Okay, so a quick caching refresher. I'm not going to go over caching theory here, but just to be clear, caching and tiering are not the same thing. I just want to be clear about that.
Starting point is 00:09:12 The caching, as you look at caching, caching is really, and this was a very quick drawer of it. I apologize for the simplicity of the diagram, but you have basically a primary storage tier and a cache tier. When you read, you either have this hit or miss kind of operation going on. If I hit the cache, I'm going to be able to get very high speed, high performance out of that hit, that read, the red line shown on the circle there in the cache engine versus a general read where it misses. I'm going to pull it from the primary storage at the lower, but I'm making a copy in my cache for the next time I come back and it's there.
Starting point is 00:09:47 So pretty straightforward. Write-through operation is when you basically make a copy. You know, you write through to the media, but you make a copy in the cache maybe. I'm oversimplifying caching here. I appreciate, but, you know, there's a lot of complexity sometimes just in the whole caching world for that. And then you have write-back operation where you're going to put it in the cache and then lazily write it back, which gives you the benefit of writing very fast to the cache and then it goes back.
Starting point is 00:10:13 So that's caching. And typically caches have some, there's a diminishing point of return. In our own world over the last 10 years, we basically switched to full-on memory mapped page translation tiering because it gave you the benefit of having dedicated access to a chunk of storage. And the same will happen for memory here, as opposed to, you know, is it a hit or miss kind of thing. The unpredictability aspects of this, but sometimes for certain workloads, it's just too great.
Starting point is 00:10:44 But in most cases, it works fine. So that's why we see caching so prevalent, and it continues to be. The important concepts here, the copies of the data are managed in the cache, and all data eventually ends up on the primary storage. It has to be. That's the nature of a cache. It's temporary storage. Let's turn to transparent tiering. Transparent, there's various, I'm going to skip over a bunch of stuff for Linux. You have a number of newer base tiering, Transparent, there's various, I'm going to skip over a bunch of stuff for Linux. You have a number of newer-based tiering, load balancing. You have the whole application-based tiering. In this case, what we chose to build was a fully transparent tiering model. That meant the application, the operating system, had no idea that it was actually talking to a tiered device.
Starting point is 00:11:22 It was totally down in the lower layers that this device would make its decisions. And I'll show you a bit of the architecture on that in a second. So you've got a fast tier and a slow tier. And in general, all of that appears as one large bunch of memory, right? Or a bunch of storage. I'm going to use storage as the example here because that's what we actually built. So you've got one terabyte in the fast tier, 10 terabytes in the slow tier, for example.
Starting point is 00:11:47 You have 11 terabytes of available storage to you. It's no longer a copy. It's actually an island of storage that you're managing fractions of data between. And then in the background, you have this background tiering engine concept, which we'll talk about in a second. So when you read, it's a mapping operation now. When you're doing transparent tiering, you're actually consulting a lookup table, and you're saying, which one do I need to go to to go get that data? And you want to make that as low latency, as fast as you possibly can. And you have a lookup, a page translation table, very much like memory does today.
Starting point is 00:12:21 This is, of course, applied to storage. And then you have the write operations, same thing. You either look it up and you say, am I writing to the fast tier or am I writing to the slow tier? But the tiering is done, more importantly, after the fact. It's a balancing operation that happens after the fact. The reason you do it after the fact in a lot of this particular set of applications
Starting point is 00:12:40 is you want the data to land where it lands and then you learn over time. Certainly certain workloads can't tolerate that. That's the benefit of a write-back cache. You get the instantaneous benefit of writing to a cache if there's room. Whereas in tiering, you land where you land. And if you've done your predictive technology right, you're going to end up with an island of storage you're reading and writing to that's already in that fast tier if you do your predictive technology right. So that's kind of what, that's the difference between, you know, caching, tiering at the very, very high level. Data, the important concept is data is split across
Starting point is 00:13:15 fast and slow. So you can have a file that maybe has 10% on the fast tier, 90% can live on the slow tier. But the virtualization layer in that virtualization engine will make sure it still appears as one contiguous block of file or storage or memory. I'm going to go through this one pretty quickly. Disaggregate storage refresh from a SAM perspective. You know, I have to confess that I did live through the floppy net era where the only way to get data reliability
Starting point is 00:13:45 between machines was to literally copy it on a floppy, which ended up being a USB key, of course, nowadays. And you still get a lot of that going on. Then you went to LAN-based drive copy. By the time I ended up in the block storage, you know, Cyprico.Hill world, we were starting to look at file-based extent tiering. So tiering started to evolve around that, you know, early 2000s. There was various people doing it, but it was still very file-oriented. And it involved, it really was part of a backup process in many cases. I'm just copying stuff out to the extent now when they started to realize
Starting point is 00:14:22 we need to stop moving data around. That's why tiering really came into existence. I really don't want to keep moving data. I just want to move the bits I'm using. I don't want to have to keep pulling that big file up every time I need to use it, extract what I need and push it back down. Here, I'm just pulling up the pieces I need. File extent tiering was the first attempt to break it down into pieces so I could just pull up the pieces of that file I needed into the fast tier. So there's been various evolutions of that. Then eventually you kind of pushed into JBOD and the SAN appliance.
Starting point is 00:14:52 In around late 2000s, we were starting to build Turing into the box itself, you know, the sandbox. So that's a little bit of a refresher. It's kind of evolved over time. The most important concept, and I think this is being missed, obviously, in a lot of the discussions on CXL today is high availability. One of the things that Sam pretty much figured out before we got to the more scale-out architectures like the Google architecture today of I can assume a complete fail of a node, therefore I have multiple copies around the place. Before we lived with the whole concept of high availability,
Starting point is 00:15:27 meaning this unit needs to stay alive 24-7. And the only way to do that was to duplicate the power supplies, duplicate the controllers, duplicate the switches, duplicate everything essentially. So you see it gets pretty complex the way you end up wiring this thing up. So if you get access from one compute node down to the shared storage in a dual ported, this is where a dual ported device is coming in, it has to that dual ported device being an SSD needs to talk to either, you know, controller A or controller B through switch A or switch B. So if A path ever goes down, you have this alternative path. You know, this is HA simplified probably grossly here,
Starting point is 00:16:06 but that's essentially what HA is about. So let's move on to what we were doing in tiering. So with that backdrop, what I wanted to do here was just walk through the implementation we did a little over last year. I was actually personally involved in writing most of the code and the architecture for this, so know it well.
Starting point is 00:16:24 We chose an architecture for tiering that was designed for simplicity. And the emphasis was we actually, it was basically AMD's store in Miami that adopted a consumer version of this. We developed this originally for the Dell and the HPs of the world distributing this as an alternative to caching, you know,
Starting point is 00:16:46 out through the channels. So we got deployed in certainly a number of data center applications, but I think the biggest one that we got most of the volume was we need to make a bootable, very simplistic architecture. And what drove this simplicity in plug and play was consumers. You cannot go to consumers with a complex architecture and 10 hours of instructions on how to put it together. And we had noticed a lot of early tiering involved two people going into the site to get it configured and done. So we built this transparent tiering architecture back in, what, 2011, I think is when we had this first running in the lab. And we quickly came down to the key components being,
Starting point is 00:17:26 the first one was really page virtualization, the ability to virtualize and masquerade or emulate as a block device to the operating system or the applications, but then you take over all the devices below you. And that can be just a simple two SSD hard drive combination, SSD-SSD combination, so you get the page virtualization. We later added auto discovery and classification because we found some of our customers were putting the hard drive in the fast tier, for example, and the SSD in the slow tier.
Starting point is 00:17:55 Unwittingly, they had no idea. So they're getting reverse tiering going on. They had no idea why their system was automatically slowing down, which was not the intent. So this is very much a bottoms up design approach where it's like, oops, we better fix the intent. So this is very much a bottoms-up design approach, where it's like, oops, we better fix that one. So we added auto-discovery and classification. I think the memory, very similar thing. You can't just trust the numeric tables. You've really got to go in and see what's the real-world life performance
Starting point is 00:18:20 I'm getting out of this tier. What is this real latency that I measured, not what I'm just trusting in an ACPI table, for example, buried somewhere in the kernel. So auto-discovery became a very important component. Page virtualization, then hot page tracking, and then ranking. We had to develop a whole scheme
Starting point is 00:18:38 for tracking what the hot pages are and the cold pages, more importantly. And then the whole thing about migration was to preserve as much capacity as you could for the operating system. So instead of just reserving a whole block of fast memory, expensive SSD for example, we decided we wanted to exchange that. So we came up with a whole mechanism to exchange the two areas for the hot and cold as part of that. So you're automatically displacing, you know, a cold with a hot. So it's something that just dynamically keeps going till
Starting point is 00:19:11 it balances out. And essentially, once this thing balances out, we observed like four minutes of intense activity in some applications and then zero for 24 hours is the thing would then be balanced and it would be accessing the fast tier. And you're getting pretty much, you can step back. The beauty of not being a cache at that point is you're just doing a translation of LBAs, for example, in the case of storage for memory, a translation of memory accesses. That's all you have to do and just keep, you know, keep track and statistics. We'll get to that in a second.
Starting point is 00:19:41 Then the important thing after that was APIs. You know, we started to layer in over the, layer in over the five, six years after that, we started getting better at adding in these layers. So you ended up with something that looked like this. On the left-hand side, you can see the file system all the way down to the EFI bias drivers. We even had to come up with a virtualized boot, that EFI layer, because somebody wanted to be able to boot this as a boot drive.
Starting point is 00:20:03 So you had to understand your virtualized environment and make that into a boot volume that Windows or Linux could boot. But you can see you're essentially dropping this into an OS environment in the block layers. And we chose to go 100% kernel because, again, transparency. Remember our objectives for transparent cheering was you don't want the user to know what the heck's going on in theory. If you can do your algorithms right, you want them to just drop this in
Starting point is 00:20:32 and this thing will figure life out itself. So kernel was the best way to do it, and it also gave us access to a lot of the low-level APIs needed to be able to keep the performance. We were able to do this, by the way, and keep NVMe performance at NVMe performance. In fact, slightly higher in some cases because our queuing sometimes got a little bit better, sometimes worse, but it was in about 5% to 10%
Starting point is 00:20:53 of what the native performance would be. So you could take a Gen 5, for example, SSD today, put it with a hard drive, and you'd see Gen 5 performance for a lot of these applications because it would slab relocate most of the workloads up there and then you'd operate off that fast tier. But that's what the tier looks like. You present a number of virtual block devices and essentially get yourself access and it became more important to be able to get access through
Starting point is 00:21:18 RESTful JSON kind of sideband tools to be able to see and get visibility to what's going on. We'll talk a little bit more about that. The other key component, we talked about mapping, microtearing, and then the policy-driven stats. One of the other important aspects of this was to develop a policy engine that the user
Starting point is 00:21:38 could tune if they wanted to. You had defaults out of the box for certain applications, but what was really useful was they used to come and say, hang, hang on, I'm more heavy write-oriented or more heavy IOPS-driven versus bandwidth-driven. So we got much better, or we got endurance-driven. You could put endurance policies in there. Hang on, this tier is actually a low-endurance tier versus a high-endurance tier. So all of a sudden, this architecture became much more useful in intelligently mapping. So a little bit of an insight into, sorry for the vlog, but hopefully you can download this presentation now.
Starting point is 00:22:14 I think you can get access to all the presentation online now. You can see the basic flow from left to right is the host I.O. I'll see if we can get the pointer going here. The host I.O., you've got the data path as it flows through. There's your mapping layer, and there's your different components or block storage. You can have an SSD, hard drive, or SSD, or a SAN device in the case of the enterprise world sitting off here,
Starting point is 00:22:37 all running on the host within the kernel. This is all a kernel kind of viewpoint here. You can see what happens is, as part of the LBA command control path, you want to keep your data paths as clear as possible, unencumbered, and you want to really just capture as much as you can of the statistics. And you'll see in memory, there's a lot of discussion going on right now about where those statistics should be collected,
Starting point is 00:23:00 because one of the issues for storage was you have, you still even with SSDs, they're slow compared to memory, you know, dramatically slower. You have time to collect statistics. In memory, you don't have time. In fact, you're interfering with the whole flow if you start to use the host itself to collect statistics about what it's doing. So there's a large debate going on where you store this, but you generally need to store a table, an access patent table of what those IOs look like. How much memory am I using?
Starting point is 00:23:32 How focused is it? You need to put that somewhere. And that, in this particular architecture, which we built, is done in a RAM and then echoed out to, sorry, stored in metadata on the drives itself. So you have this kind of constant loop though. So once you come off here, you picked off and you collect the statistics,
Starting point is 00:23:53 you get a statistics kind of page-based statistics table here in RAM. You then go into this analytical modify repeat loop. You know, in our case, we just had a two second tick that went back in the background, and what it would do is look at the statistics and look for a rebalancing opportunity. And then rather than try and go and interfere
Starting point is 00:24:13 with the I.O. process, it would actually go off and schedule that. It would say to a data movement engine, okay, it's really better these guys here were on the fast tier, and these were down on the slow tier, and that kicks off the whole exchange. Then you sit back, and it's a background task.
Starting point is 00:24:27 Again, I emphasize background task because you don't want to interfere in a high-performance situation with the data flow. So again, the downside of that is there's a latency. There's a time it takes to react to a data pattern. And clearly, by the way, just while we're on that topic, you end up with some scenarios which just don't make sense for this architecture at all. In fact, we used to say that to people.
Starting point is 00:24:49 If you have a hybrid architecture, can I make it do, you know, like an iometer full random sweep of the whole volume? Just go buy all SSDs for that application if that's really what you intend to do. It's not a real-world application, but this is really good for tight locality, mostly reads kind of applications. Virtual page statistics, I just briefly touch on this. The way this kind of works here, just to decode what this drawing is,
Starting point is 00:25:19 you've got the virtual drive consists of a number of virtual pages that are mapped into fast tier pages and slow tier pages. These map to the physical devices. This is your virtual device. The operating system only sees this guy here. So, you know, 00P is really an indicator of I'm on tier zero. Page P in the fast tier. And this one here, for example, is currently mapped to 1X,
Starting point is 00:25:42 you know, is on the slow tier or the second tier on page X. And you can see, you know, you can see it's a very simple mapping technique. The trick is to do that as fast as you possibly can. You know, get in, get out, map that thing. And then in the background, you're moving stuff independent of the host. You obviously look for opportunities where the host's not busy, you know, to do that kind of stuff, but you basically are trying to do this in the gaps between, if you can, ideally, or if you really
Starting point is 00:26:12 have to hold off the host while you temporarily move stuff and get out of the way, it is beneficial. And that's where some of the cleverness of the algorithm starts to come in. And then, you know, just some points I did more for the handout slides more than anything else. You know, you know, just some points I did more for the handout slides, more than anything else. You know, you basically have pages in various states. I won't touch on them here, but there are those that are heating up and there are those that are cooling down. And you just have to keep track of those. And there are statistics for keeping track of that. So the point of this, as you can see, is a lot going on in a tiering engine behind, you know, under the hood, as it were, behind the covers.
Starting point is 00:26:51 The other lesson learned, by the way, was just, as you go to memory, this gets acutely worse, is the operating system, NUMA, and the whole allocation of processes gets pretty complex. So we found when we did this just for purely MDME, you know, CPU and numerous association becomes a big deal, even for NVMe drives or interioring, because if you're not careful, you can end up with the driver usually is associated, for the example,
Starting point is 00:27:14 for an NVMe driver is associated on the CPU that's the closest to the PCIe attach point because they don't want to have to keep going through, you know, through a hierarchy to go back and forth with the driver. But your process might be running on a totally different NUMA node. So you've got issues here that are potentially coming up on you don't really know where you're
Starting point is 00:27:36 going to be assigned in a truly transparent environment. Remember, you're trying not to touch the system. You're trying to embed yourself in and be clever and hide, as it were, in the background. So it's going to be interesting. One of the lessons we're going to have to go through here on memory is just how much can you get away with not influencing where NUMA balancing comes in versus where you come in versus. And, you know, there's going to be Linux, of course, is popularizing that right now with the NUMA load balancing. But we found it really is a case of trying to get the best affinity you can, you know, with where the memory, the storage tables are. Even our lookup tables got put on a different CPU, for example, if we just let it do all the allocation. So unfortunately, there's some tradeoffs there you've got to go through on where this stuff can live. The statistics table, just a quick peek into what that is, is just basically a bunch of the mega region counters
Starting point is 00:28:29 and the virtual page counters. We have mega regions. We had local regions for the pages. And you're tracking read, write, re-blocks, promotes pending, total promotes on the high end. And then you're tracking things like the same kind of things on the virtual page. And a virtual page might be a region of four megabytes, for example.
Starting point is 00:28:46 So you take four megabytes, you keep counts and statistics on all of those. Then you consult those with your Turing engine. Okay, what am I looking like? What's my curve? What's my hotness curve or my hotness pages look like? The other important things which are often glossed over, the rigidity controls. We have to build in certain controls where we said, hang on, we know this region here is used heavily by the OS, and we don't want to move it. We don't want it to keep jumping around here. So you have to also now be clever with the way you allocate these pages to say, you're sticky, you're not. You're kind
Starting point is 00:29:19 of sticky. You can live there for a while. So we got to the point where we even had to build in that kind of mechanism from a rigidity standpoint. So the policy engine, finally, was one of the other areas that we focused on to make it easier to tune to specific applications, promote on reads, promote on writes. Those are the easy. You heard the term maybe if you were here in the last session, the concept of promote and demote. Promote means you're being pushed up to the fast tier. Demote means you're being demoted from the fast tier. In the case of transparent page tiering, it's a page that's being demoted. You don't know if you're dragging along multiple, especially with storage, pages of other files, you don't know. So we used to call it slab
Starting point is 00:30:03 relocation. I mean, people talk about cache lines, 32K cache lines. This is really slab. You pull out a slab, you pull it down there, and yeah, you're dragging a bunch of stuff with it. But statistically, you ended up with a better performance in general for many of these applications. So again, trade-offs. You've got trade-offs there. So those policies help you control that a little bit better. Pinning was a really important one for us, an intelligent pinning. The ability to learn and then go in and retrospectively pin certain pages to the fast tier or pin them to the slow tier. There are certain things you don't want.
Starting point is 00:30:38 The example we used to always give was an MP3 file played here very frequently should not be taking all the premium resources on the top tier of an SSD, for example, versus a slower tier, because you don't need it. Yet it's played frequently. So you've got to be careful of those kind of things. So we used to have the override mechanism where you say, hey, stay. You stay down there. You're not allowed to come on up here. So pinning is a very important thing. So page locking, as we're referring to it here. So a lot of things going on. Finally, the thing that got really interesting and I think gave people some excellent insights
Starting point is 00:31:10 was what's going on. I think MemVerge have a really nice tool for this too. I think with their memory, like visibility into where all the processes are using memory. This was our version of it back in 20, let's see, it back in 2015. We produced the first iteration of this where you could go in and look at the workload
Starting point is 00:31:30 burst, the long-term activity, and start logging stuff as far as three months back of how this thing's been behaving. And what we started to observe in our customers also was, what's going on at 4 a.m.? Why is that doing what it's doing? And they started to uncover a lot of these background tasks that were going on at 4 a.m.? Why is that doing what it's doing? And they started to uncover a lot of these background tasks
Starting point is 00:31:45 that were going on on their system, causing numerous issues with the tiering engine because that was maybe a maintenance task where you need to shut off the tiering during that time. The biggest headache we had was virus scanning for a while. You know, you would go through and start touching it and touching all these files and you'd go, hang on a second, I need to ignore what you're doing.
Starting point is 00:32:05 You know, so you do need some element of handshake. Even though we're transparent, it's good to have those hints. So we have to develop the whole concept of hints or something running in the user domain that could actually say, hang on, ignore this activity right now. So there's a lot of complexity behind this,
Starting point is 00:32:21 but this tool was useful to be able to see both time-based and more importantly, across the volume itself. The nice thing about map tiering is that you can see what's going on in this part of the drive, that part of the virtual disk, that part of the virtual disk, and see how much it's shifting over time. And that's what this, if you ever play it back in real time, it gives you a nice kind of fluid motion of what's going on on the system. And finally, I really do have to evangelize a little bit more about the future of memory in terms of the HA appliance.
Starting point is 00:32:52 You know, we talk about it. The question is whether we can live with our, you know, the hyperscale kind of approach or the cluster environment where you're replicating data. Replicating memory, we're trying to get away from moving data. That's one of the things we're trying to do here. So the question is, does HA creep back in? It's more of a question than a probability here.
Starting point is 00:33:18 Do you need two CXL switches, or do you need that kind of hierarchy again? Do you need multiple controls in there? So we're going through kind of looking at that and trying to figure out, well, okay, you can do all the tiering and stuff behind there. You can go nice little offload engine, you can do this stuff. But apart from all that, you've got HA as a real environment to consider. You start to think about tiering in that environment, complexity
Starting point is 00:33:43 starts to go through the roof again, right? Because now you're tiering duplicate copies of things or you just let them autonomously operate, which we chose to do in this case. Just let them autonomously tier in both the copies here and figure out life itself. So to wrap up, you know, a couple of things we learned along the way. You know, I think, you know, our kernel-based VMAP, we called it a VMAP, a virtual map. It was a huge table of what the translation between virtual and physical was. It would get messed up occasionally. And you get very angry customers when that happens.
Starting point is 00:34:21 And you had to build in a lot of what I call the VMAT repair and nightmare, which kept me up at night many, many times in the first iterations of these things, trying to figure out what the heck happened to this customer here. And it'd be a power loss scenario coupled with something else going on, coupled with something else. But, you know, we, I'm glad to say over the first, about a year or two, three into this thing, we developed ways to make sure with the journaling methods of transferring data between the slow and the fast tiers and vice versa so you could replay anything and you never lost any data.
Starting point is 00:34:53 That was the key. Do not lose data. Rule number one in any storage company, you don't survive long if you lose data. So that was the first thing we had to get solid. So that was, you know, Vmap introduced another complexity to that because you're no longer talking one-to-one between the operating system and the application. The next one was processor affinity. I mean, we scratch our heads sometimes, especially when AMD, I think, Threadripper came out, multiple processes,
Starting point is 00:35:19 and it had some funky mechanism in the early years. Why are we going slower? You know, this is supposed to improve things. And you suddenly realize that even though I was running off, quote, the fast tier, I was now going through two or three layers of cores of stuff because things were getting, you know, spread out in ways that were not necessarily accessible to the I.O. engine. So that's where we learned about affinity. And Turing engines were, you know, were obviously the pieces of the Turing engine to get that parallelism going and the multi-thread.
Starting point is 00:35:47 Remember, MDME was the first environment where you had multiple threads and multiple OS and application threads talking to the same device. Before that, AHCI and SATA really were single-threaded when you look at it. And so I think this is the first instance where you really lit up this big engine of multiprocessor and multithreading. So that was another thing to deal with. Now, memory, order of magnitude more. I think it's going to be very interesting to see how we deal with that, and that's one of the areas that my team's going through looking at now. Translation of I.O. access.
Starting point is 00:36:22 One of the things that we're talking about is the table where it lives. It gets pretty big. The beauty of memory is there's already a page translation table, whereas in storage, we really didn't have one. So we had to invent our own. So that's going to be interesting to see how we play. We don't want to reinvent the wheel. There's a lot of good stuff going on in the Linux community today to solve that. But how do you add the value add aspects of tiering the policies, for example, that's where you can differentiate as a product guy, as a guy trying to ship a product to the market. So those are going to be interesting to kind of work through. Low level media device, SSD housekeeping and block migration for us,
Starting point is 00:37:00 less of an issue with CXL memory. But as we also make a smart, we make NV devices, we make SSDs, and all of that stuff eventually ends up on CXL. I think that's the general viewpoint here. So you're going to have to deal with multiple types, not just memory. You've got memory, you've got.io, you've got.cache, you've got all kinds of modes, as well as different kinds of media with different reactions going on there. So it's going to be important for us to figure out how to avoid interfering with the low-level intelligence that's going on.
Starting point is 00:37:35 And lastly, no one size fits all. You know, we thought we did a great job with the Turing engine, and yet we get customers showing, look how bad it is over here. So you're always going to get, you have to kind of shoot for the 80% if you can, if you're developing a product. And so in the end, that's why we ended up being largely in the gamer community, funnily enough, as opposed to, because that's where it's very predictable. You pulled in a large chunk of data, then they operated out of RAM most of the time. So all they wanted was to load next screen or load the program as fast as they could as they context switch between their different things going on on a PC.
Starting point is 00:38:10 But when we got put in, for example, I think it was like, who was it? ComStream, I think they were public when they talked about it. When they were doing stuff with petabytes of data captured into a federated server environment, you know, and using it as a means to lower the cost of storage, because the whole benefit of tiering is you can put cheap storage with a small amount of expensive storage.
Starting point is 00:38:33 They wanted to be able to capture that and process it. Well, they did show they got like a 2 to 3x improvement, but occasionally you get certain traffic patterns which would destroy your tiering engine. So it's no one size fits all. So I think it's going to be a combination of what Andy, I think, said in the last session, where you're having to analyze your workload environment and just see if it is a candidate for tiering or not. One little plug for the OCP CMS group. There's a lot of different things going on right now, but this is where there's a fairly healthy discussion going on
Starting point is 00:39:07 about composable memory systems. They are actually working on a draft specification. You can see more of the OCP. And SMART here, just a small plug for SMART. What my team have been working on here is obviously the first E3S modules are now here. We were demonstrating it just around the corner here in the hackathon yesterday. And the guy on the right is a mechanical sample, but we actually do have the real thing in-house now. And we're starting to see a lot of interest, funnily enough, in the guy on the right. A simple memory expansion, the ability to add an adapter
Starting point is 00:39:40 card in with a bunch of DIMMs, throw it in your system, and a nice plug-and-play way to get that up and running. And there'll be plenty more where they come from. Okay. So that's it. Thank you. Do we have time for questions? I guess, yeah, I guess we do. We've thrown a lot of stuff at you, so any questions?
Starting point is 00:40:11 Yeah. Say again? In the upstream? No. Yeah, this implementation here was a totally self-enclosed blob, like a closed blob with an open source wrapper. That's how we handled our Linux. But as we go forward into the memory world, we're starting to take a lot more closer look at what's going on there. So we haven't yet worked openly in that area, but
Starting point is 00:40:56 it's certainly something we focused on. The question, sorry, was are we doing any work in or have we been involved in the upstream stuff? I think there's going to be a lot of stuff, upstream Linux kernels. I think there's been a lot of good work going on there, and our plan is not to replicate, is to augment that effort and build more of the tools around it. Because I think what we found in our experience was the core tiering engine itself is, I don't want to say simple because I lost too much hair developing parts of what we have to do, but I think it's actually, it's really the tools and management and the ability to plug
Starting point is 00:41:30 into your environment are really the bits where it's going to get interesting. So for hyperscale environments, I think they've got their vertical. I think for general purpose enterprise, which we were focused on, you don't know what application you're going to be shipped into. So that's what makes it a little bit more complicated. And then you have the Windows problem. We developed this same engine for Windows and Linux, so we could cross-port between the two. It was important for us to be
Starting point is 00:41:56 able to go back and forth. We kept it proprietary for that reason that we could go back and forth. Now going forward, I think Linux is where we're going to see a lot of the early work here on Linux and tiering. Okay. Yeah, so the question was, we only have so much memory to track statistics
Starting point is 00:42:38 and keep track of what's going on in the system. And hence, one of my comments are about interference with yourself, you know, with the application. If you're running applications out of that same memory, it becomes quite problematic. So we had two methods. I mean, for the smaller capacities, we had a paging system that could handle up to like four petabytes of storage with about two gigabytes of RAM for statistics keeping. We kept a fairly efficient block if we could. Then we had a super block concept where we would keep
Starting point is 00:43:11 more detailed statistics where most of the activity was going. You had to do a two-tiered system. You can't keep track of everything. In the end, I think we could get away with the first generation, but as you go towards the petabyte threshold or the hundreds of terabytes, typically we get, I think our largest deployment was like 256 terabyte back in 2017, 2018,
Starting point is 00:43:33 when we did the first implementation of this. So that took up about a couple of hundred megabytes to just about a gigabyte of RAM. Server didn't care. That was fine. I had that. You try to go to petabytes and beyond you go to a switching architecture you know with the memory the good news about memory
Starting point is 00:43:50 is i mean there's there's less is so you're talking about 32 terabyte boxes you know being proposed today for external uh you know boxes that doesn't take a whole lot if your granularity of your page is is fairly large so we ended up tuning. We had a one, two, four, eight megabyte option you could go to on your page size. The bigger you go on the page size, the less the smaller the tables are because you're keeping track of a bigger chunk of the memory. But that's a trade-off. That's a trade-off. So we were moving towards, do you do a smaller chunk for the active areas and larger chunks for the inactive areas to optimize your memory? But you have to start playing games such as,
Starting point is 00:44:31 how do you keep that statistics table small by doing more of an activity curve, almost using your hotness map to define how much detail you keep. Okay? All right. Oh, hey, Jonathan. So for CSL memory, what would be the allocation of what's being used
Starting point is 00:44:52 to do with the data? What sort of memory is being shared in a certain amount of system servers? Do you have any insights on what you can store in a certain period that needs to be served to maybe have a platform and what you can store in this theory?
Starting point is 00:45:05 Do you think there's a particular platform and how that can be applied? I think we've always taken the approach we play with what's available rather than too much theory. So I think we're starting by building a first, how much can you put in a box? It's about 32 terabytes if you look at the math today sensibly, right, with DDR5 remote box behind the CXL, maybe, maybe 64, maybe 128. You're not talking petabytes though, right? But I think it's going to be interesting because the other ceiling you're hitting is the ability of the OS to address memory, right?
Starting point is 00:45:52 There are caps on how much Linux can address, for example, right? So how big can you go is going to be dictated by the application and many things like that. But I don't really have a good feel for it yet, Jonathan, not yet. But something we're going to be looking at. One of the team members is going through it right now. So, good question. I think that was it.
Starting point is 00:46:11 All right. Thank you so much. Appreciate it. Have a good day. Thanks for listening. For additional information on the material presented in this podcast, be sure to check out our educational library at snea.org slash library. To learn more about the Storage Developer Conference, visit storagedeveloper.org.
