Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Persistent Memory

Episode Date: June 20, 2023

...

Transcript
Starting point is 00:00:00 Okay, so I think this is the schedule, I hope this is correct. So we have persistent memory today, tomorrow we'll discuss the solution of task 2, and the introduction for task 3, which is going to be a buffer manager. There's no locking in it, so this will be in task 4. And then next week we'll continue with storage and networking and so on and so forth. Okay, so today we're going to cover basically everything around persistent memory that we can cover in 90 minutes. So we'll kind of look into what is persistent memory, how does it compare to DRAM and SSD, so these may be components you're more familiar with.
Starting point is 00:00:46 How do we access persistent memory? So how can we configure persistent memory and how then as a developer can we interact with it? How is PMEM actually accessed via the CPU? Then we kind of look at some of the pitfalls we can run into. So what about atomic writes and how do I actually guarantee that certain data is persistent or not? At the end, I kind of give a short intro into programming with persistent memory. So what are the interfaces, how would you write code for this, and then at the end if time allows for it we'll kind of go into some PMEM research that has been
Starting point is 00:01:15 done in the data management field, so kind of tying it back together also to the overall topic of this lecture. Okay, let's just start off with a very, very basic first kind of question. What is persistent memory? So the SNIA, which is the Storage and Network Industry Association, they define a lot of standards for these kind of things, and they have a standard NVM programming model. And this defines that persistent memory is a storage technology with performance characteristics suitable for load and store programming model.
Starting point is 00:01:46 Now this sounds kind of technical, and it is, but basically what this means, it's a storage technology, so it's persistent. This is what you would expect from your disk in your laptop or computer. I store something there, power goes out, it's still there. And this load and store programming model basically says it's byte addressable. So this is something we generally have in DRAM, where my CPU would issue a load or store instruction, and then I can actually access data at this certain location. So we're kind of in between DRAM, which is byte addressable but volatile, and block devices or storage, which is usually a block device and not byte addressable but persistent. And generally, persistent memory is known
Starting point is 00:02:26 under a few different names. Main reason for this is that persistent memory, as a concept, has been around for a while. And it's existed in very small forms for quite a while. And now also for larger scale, data management kind of topics, it's been hyped for probably like 10 years. So you'll also come across a non-volatile memory, or NVM. This was kind of cut out recently
Starting point is 00:02:48 because of naming conflicts with, for example, NVMe SSDs. That's also non-volatile memory, and it's kind of confusing. There's storage class memory, SCM. And NVRAM is something you'll still see a bit. And then there's PM and PMM, which are kind of also just abbreviations for PMEM. So in case you see any of these, they kind of mean the same thing, but not always. So it's a bit tricky.
Starting point is 00:03:13 I think I once read a paper, and it said NVM. And I was thinking about this as persistent memory all the time. And then after three quarters of the paper, I realized it's actually about block devices and SSDs. So it's not always super clear what's happening here. Yeah, so in general, also the standard, the JEDEC standard, so different, yeah. Question? So NVMe SSDs are not this?
Starting point is 00:03:33 JAN-FELIX SCHWARTZMANN- No, NVMe SSDs are not this. And so it's a very different thing, which is why I think NVM is not a great name anymore for persistent memory. This is like legacy kind of naming. NVMe is something very different. And I think this should be covered next Tuesday
Starting point is 00:03:48 in the lecture on storage, which is a different interface. So this is then based on PCIe and everything. Yes. So the JEDEC standard, again, another standardization organization. They define two types of PMEM. There's NVDIMM-N and NVDIMM-P, and so historically NVDIMM-N is something that has been around for a lot longer also, and this is basically just battery-backed DRAM. So on my DRAM
Starting point is 00:04:15 DIMM I have a small flash chip, and in case of power loss the battery will just make sure everything that's in DRAM is flushed onto the flash chip and then it's persisted. And then if I restart, then I can get everything back into DRAM. So this gives me byte addressability, but also persistence. But this basically also comes with the same limitations
Starting point is 00:04:37 that we have with DRAM. So it has the same capacity limitations, the same cost. But data's only also persisted in case of power loss. So this is actually something you'll see a lot in other embedded devices. So for example, a lot of SSDs will have a small persistent memory kind of area, which is like a tiny bit of DRAM that they can flush onto a flash chip if they need to. But that's kind of not what we're looking at here. What we're more interested in for the sake of this presentation or this lecture, is NVDIMM-P.
Starting point is 00:05:07 And this is basically what we would now consider real persistent memory, because it offers a regular DRAM interface. So it's a regular DRAM form factor, but it has all the properties of being persistent. So we'll get into that in a second. Yeah, so as the only really large-scale commercially available persistent memory technology, there's Intel Optane. So again, like large-scale persistent memory, talking about these DIMMs here you can see on the right. This will be in capacities between 128 and 512 gigabytes.
Starting point is 00:05:43 There is persistent memory that is a lot, lot, lot smaller. But for large scale data center kind of stuff, this is the range we're looking at. And internally, why I'm calling this real persistent memory is because it's based on actually a persistent memory storage technology called 3D XPoint. It's written XPoint here, but it's actually pronounced crosspoint.
Starting point is 00:06:01 I always find this very confusing. And so the Intel is kind of tight lipped on the internals of this. You can see this super generic picture down here, which says that basically this technology is a transistorless three dimensional circuit of columnar memory cells and selectors connected by horizontal wires.
Starting point is 00:06:23 Again, you may think of that as whatever you want. But this is basically the picture that describes this. But the important part is that we can address individual cells. So again, Intel is kind of tight-lipped about this because they don't want to share all their trade secrets. But I can address individual cells that have a lot lower granularity
Starting point is 00:06:41 than I can with block devices. And it's actually the same technology they also had in Optane SSDs. Storage hierarchy. So we have the classical storage hierarchy that most of you will be familiar with and have been familiar for a while, as we have rotating disks at the bottom.
Starting point is 00:07:02 Then we have flash disks or NAND chips or whatever, DRAM, cache, and CPU at the top. We know the higher we go in this hierarchy, the more speed we have. The lower we go, capacity increases. And then cost is also higher in the pyramid. And we can see between DRAM and SSD, so roughly access to DRAM will be around 100 nanoseconds,
Starting point is 00:07:21 maybe a bit faster, a bit slower. But random access to an SSD, for example, will be around 100 microseconds. Again, could be a bit faster depending on the model. So this is still like a 1,000x difference. So if your application has latency, if it's critical to have good latency, random access to SSD will kill your performance.
Starting point is 00:07:39 And we can also see we have about 100x increase in capacity between the two. And now in this new storage hierarchy, we kind of have PMEM in the middle. So it sits between DRAM and SSD. And we can see here now, for example, that the access latency is only around 5x. Again, variation is there.
Starting point is 00:07:56 We'll see that in the next slide. And we can see that the capacity is around 10x. So we can see it's really kind of in the middle between SSD and DRAM. But we can see that the access latency is actually a lot closer to DRAM than it is to SSD. So let's kind of look at some performance numbers that we actually measured on real hardware.
Starting point is 00:08:18 We'll kind of go through this row by row. So let's start off with the read latency. And that's kind of what I showed at the beginning. So this is sequential read latency. We can see for DRAM and Optane, this is very, very similar. And SSD is a lot higher. But again, for latency, we're usually more interested in random latency
Starting point is 00:08:36 because sequential latency often gets masked. So for random read latency, we can see that we're about 2x, 2 and 1 half x off here, whereas we're quite a bit off here. We're in the 100 microseconds, and then we're in the 100 nanoseconds area. So this is a factor of 1,000 difference. And we can see for write latency, this is where Optane goes apart a bit.
Starting point is 00:08:58 For DRAM, we can see around 100 to 200 nanoseconds here. But for PMEM, we can see it's between 200 nanoseconds and basically nearly a microsecond in latency for random writes. And especially random writes is where persistent memory, or at least Optane persistent memory, suffers the most in performance. And we're going to also see these numbers, this 2x for reads and this 5, 6x for writes.
Starting point is 00:09:22 We can also see this in the bandwidth. So we have the nearly 2 to 2.5x difference here for reads, and this 5 to 6x difference for writes, which is still a lot higher than regular SSDs. But there currently are also NVMe SSDs that can achieve a lot more than that. So just to give you kind of a ballpark of where we're at. OK, so now that you've basically got a very, very brief idea
Starting point is 00:09:51 of what are the general performance characteristics. So the performance characteristics are a lot closer to DRAM than they are to SSD. And we'll step by step kind of get to why this is the case and how we interact with PMEM and why this makes sense that these characteristics are as they are. Okay, so in general, PMEM can be run in two different modes. This is again specified by standards. And there's a memory mode in which persistent memory is basically just a larger volatile
Starting point is 00:10:19 memory area. And then the second mode is the App Direct mode, where DRAM and PMEM are separated. So we can see this on the right-hand side. Here, I can explicitly communicate with DRAM, and I can explicitly communicate with PMEM. I go into these modes in a second, or actually now. So for the first mode, the memory mode, here the CPU, or
Starting point is 00:10:43 the operating system considers something to be near memory and far memory, and all of the access in this mode is actually handled by the operating system, so you as a developer have zero control of where data is going, where it's coming from, what's happening underneath. The real advantage of this mode is I don't have to do anything. So let's say my server has 128 gigabytes of DRAM, I can now just add 1.5 terabytes of persistent memory, say this is also DRAM, and now my server suddenly has 1.5 terabytes of DRAM. And I don't have to care about anything, I don't need to reprogram anything,
Starting point is 00:11:19 everything will just work. And if we say performance is not super critical, we've seen we're in like 2 to 5x in performance for DRAM. For a lot of applications, like a lot of applications, this will be perfectly fine. But again, we don't have any control of where memory is going. And data persistence is not guaranteed. So as the operating system will write something to, actually I have another slide on this, yes I do. So the operating system will treat DRAM as kind of an L4 cache. So the data going in will first just be written to DRAM
Starting point is 00:11:58 because that's faster. So everything go to DRAM, DRAM, DRAM. And only when DRAM is full will it start writing stuff to persistent memory in the back. So this also means, as this is an L4 cache, that the total DRAM capacity, or the volatile memory capacity seen by the operating system, is that of PMEM, and not that of PMEM and DRAM together. But we've seen it's like an order of magnitude higher,
Starting point is 00:12:23 so probably the small amount of DRAM doesn't matter that much. But one interesting aspect also here is that we kind of have a high access latency in the worst case, because usually what we'll do, we'll check our caches if data's there, and then if it's not in the cache, then we'll go to main memory. And now we've introduced another level of caching. That means if our data is not in DRAM, now we need to do an additional hop to persist the memory, which is why in worst cases we get even higher access latency,
Starting point is 00:12:51 because we need to check multiple levels ever so often. OK, the other side of this is like the App Direct mode, where we explicitly control where data is stored and which data we want to access. And this, at least in research, was a lot more interesting, or is a lot more interesting, because it gives us fine granular control of data that's in DRAM, data that's in persistent memory.
Starting point is 00:13:15 So we need to fully handle this as a programmer. It gives us explicit control over where to do this and where to put memory. But we actually need to adapt existing applications to use this. So usually if you've written applications, if you write like C++ code, you've maybe used the keyword new to create something and it's created on the heap somewhere, or you've created an object on the stack and you're kind of thinking about memory already.
Starting point is 00:13:42 And in basically all other programming languages, you don't care about this at all. And now you need to go one level deeper and start thinking about, OK, do I want to allocate this in DRAM or do I want to allocate this in PMEM? And this is something that most people have never really thought about because they didn't have to. And most applications also don't really care about.
Starting point is 00:14:00 But if we want to have persistent memory in there somehow, then we need to start thinking about where to allocate our data. But the main advantage here, one second, is that if we do this correctly, then all data in persistent memory is actually persistent. So compared to the memory mode, this is where we actually get persistency guarantees from PMEM. A question? Who manages the L4 cache? Is it the operating system?
Starting point is 00:14:26 Yes. So the question is, who manages the L4 cache? So in this case, this is all completely done by the operating system. As a developer, I have no control of what's going on here. I'm telling the operating system, this is what you can use. And then it will do everything for us.
Starting point is 00:14:39 Actually, I don't have that much experience in using this. Because from a research perspective, again, it wasn't super interesting for me. I think some people at TU Munich were working on this, but I'm not quite sure what the outcome was. But yeah, so depending on how the operating system implements this and implements the access or how smart the strategies are to move data from and to DRAM, PMEM, and CPU,
Starting point is 00:15:04 this might be good or very bad. But we've seen some results where this, I think there was a paper that they had, they explicitly managed the control to DRAM and PMEM. And then in the end, it also turns out if you just give it to memory mode, it's good enough. But you don't have to do anything for it. So the performance is not horrible.
Starting point is 00:15:28 You probably get a pretty decent way of managing this because the operating system in general is actually quite good at managing memory. That's one of its main tasks. So with now the memory mode and the App Direct mode covered, let's look at how we can actually access persistent memory. So we've seen there's this different one
Starting point is 00:15:49 where we don't care about accessing it, and one which is this mode where we explicitly care, like how do I get from my application running here through the CPU, how do I actually get to persistent memory? So persistent memory, again, offers two different modes to do this. There's a lot of modes to do this.
Starting point is 00:16:05 There's a lot of modes to do things, a lot of options. And one here in the orange box, you can kind of see this is the more like regular traditional way of approaching this, where PMEM is just a regular file system. So you can use fopen, fread, fsync or whatever to do all your interactions with it. And this is basically just a drop-in replacement for a standard file system, which is usually faster. We actually have a paper on this from now two years ago, where we kind of just looked at how do I take persistent memory as a drop-in replacement file system,
Starting point is 00:16:34 and what kind of performance can we achieve. So this, again, would be the mode where I don't need to change anything in my application. I just have a faster file system. And the second mode, which is now where this byte addressability comes into play, is that I have the direct access mode where I can use mmap and like load and store instructions to actually move data from and to persistent memory and not like a file system. And I get to how these
Starting point is 00:16:56 modes can be combined later on and why this is quite interesting. So yeah, we can see basically on top we have the application, the application that either interacts via a file system or with mmap and some load and store instructions. These modes both go through the PMEM driver and the PMEM driver is then the actual thing that interacts with the DIMMs underneath. So I basically showed you this picture and now from the bottom up we'll kind of go through the stacking and look at how this looks like. So the underlying storage again, oh there's a typo there, I need to fix that. We have two modes again because PMEM loves to have two modes, we have two modes, two modes, two modes. Big tree to go down. And so the option one is that we interleave all this memory. So this is something similar to a RAID 0 configuration
Starting point is 00:17:48 for those of you who are familiar with this, which basically just means the first four kilobytes on DIMM1, the next four logical kilobytes on DIMM2, and then on DIMM3, DIMM4. So we can see down here that the first four kilobytes are here, the next four kilobytes here, and so on. And that means if I now access 24 consecutive kilobytes, I'm actually reading this from six DIMMs at once,
Starting point is 00:18:09 so I get a lot of parallelism out of the systems for free. On the other side, we have the non-interleaved mode, where data is not striped, so that basically just means that one DIMM will have its entire range in the address space, and then the next DIMM will just have the next one. So let's just say these are 128 gigabyte DIMMs. So here DIMM 1 would have 0 to 128 gigabytes, and then 128 gigabytes to 256 gigabytes would be the next, and so on and so on. And there's no easy parallelism in this, because if I access now, let's say, 24 consecutive
Starting point is 00:18:41 kilobytes again, then I'm just going to one single DIMM, and then I'm not using the parallelism of all DIMMs at once. But there are use cases where this is useful, because again this gives us a lot more control of where to put things. Okay, so now looking at the file I/O interfaces that we had. So in this case, persistent memory is just seen as a regular file in the file system. So we've kind of seen this here. We have our file system, and we just have some random data. And that's just how you can interact with it.
Starting point is 00:19:17 So you can use an EXT4, XFS file system that supports this. And everything just works. With one special thing is that this is a direct access file system. So the files are accessed directly underneath, and there's no page cache. So usually if you look at storage, the operating system will read something from disk,
Starting point is 00:19:38 will read it into an in-memory buffer, and then you interact with it because you can't interact on byte granularity with storage. In this case, we can. So we'll just pull data in, it's direct access, so everything that we're doing in the file directly goes onto the file, and we don't need to sync these things between memory and storage. So for those of you who are familiar with it, it's kind of like the O_DIRECT flag that we can specify, and this is supported in EXT4 and XFS.
Starting point is 00:20:05 And there's probably also other file systems that support this, but these are the two main ones on Linux. And then yeah, once I have this, I just have my file. I just write data to my file, I read data to my file, like you've probably done many times in different applications. I can use Open Read Write to do this. And the main difference is just it's on the persistent memory file system and not on a
Starting point is 00:20:28 regular file system. And the main point here is, or the main difference to the next one is, that in this case, PMEM is actually seen as a block device. So I said one of the main advantages of persistent memory is that it's byte addressable. And in this case, we're just throwing it away. Because if we want to have a file system on top, a file system only knows how to work with block devices, so we're telling the file system here's a block device, interact with it, and we're kind of throwing away a lot of the niceties of persistent memory, but again this is just drop
Starting point is 00:20:56 in, I can use it straight away, I don't really have to do anything. The next one we have is the direct memory access mode and this is again slightly more interesting because it kind of gives us more options on how to interact with persistent memory. And in this case persistent memory is actually seen as memory and not as a block device. So the big difference here is we basically now we can also have something as a character device. And that means I can access this device and I can say, give me byte 17 and we'll get byte 17. So we'll probably get it in a slightly larger granularity. Also from DRAM we can only access stuff in 64 bytes. But technically we can address every single byte and we can get that.
Starting point is 00:21:38 And yeah, the main way to interact with this now, we just do MMAP calls. So MMAP is a memory mapping call in the Linux kernel that just gives us a virtual address for some physical memory underneath. So let's say we want to allocate some memory, a persistent memory. Then we get a persistent memory location. We call MMAP.
Starting point is 00:22:04 And then in our application, this is just seen as a normal virtual address space. So then we just have these random addresses here. In my application, in your application, we can access this. And underneath, this will just map to a PMEM address underneath, and the operating system handles this for us, as it does with normal virtual memory.
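As a rough illustration of what this direct access path can look like in code, here is a minimal C++ sketch that maps a file on a DAX-mounted file system into the address space. The path, size, and helper name are assumptions for illustration; on recent Linux kernels you could additionally pass MAP_SHARED_VALIDATE | MAP_SYNC to insist on true DAX behavior.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Minimal sketch: map a file that lives on a DAX-mounted file system
// (e.g. ext4 or XFS mounted with -o dax) directly into our address space.
// The path and size are made up for illustration.
void* map_pmem_file(const char* path, std::size_t size) {
    int fd = open(path, O_RDWR);
    if (fd < 0) return nullptr;
    void* addr = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after closing the descriptor
    return addr == MAP_FAILED ? nullptr : addr;
}
// Every load and store through the returned pointer now reaches PMEM
// byte-addressably, without a page cache in between.
```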
Starting point is 00:22:25 Yeah, so the combining thing that I mentioned briefly at the beginning was this is now a character device. So there's a /dev, and then for Optane at least, it's then /dev/dax. And this is a character device. And I can actually just access this directly. But this is the same thing as if you have a block device, which is like /dev/sda or /dev/sdb or something, you usually don't want to just
Starting point is 00:22:47 directly access this because, well, unless you really really know what you're doing, you're going to break something. So again here you probably also don't want to do this, so what you can do is actually you can create a file system on top of PMEM and tell the operating system, okay, this is PMEM, this is a direct access file system. And now I can structure my files in this way. So I can say there's a directory which is for you, and there's a directory which is for you.
Starting point is 00:23:15 And then I can give you the access rights for this directory. And then you can write and create and delete files in that space. And someone can do that in their directory, like you know from a normal file system. But then when I access this data, I can say, okay, please access this underneath as persistent memory and indirect access.
Starting point is 00:23:31 And then I get all the nice properties that I have of persistent memory. So this is kind of the combination that you usually want to run things in because it gives you the structure of a file system to give PMEM access to different users, but you can also use byte addressability. One use case, for example, where you would actually want to directly specify bytes in the character device is, let's say, your database server, and you have full control of the server, and you have 100% control of all the devices on it, then there's an advantage of doing this directly,
Starting point is 00:24:04 because obviously adding a file system on top adds a bit of overhead. There's a bit of abstraction cost, which is usually between 5% and 10% for access. So if you don't need that, and you know exactly we're managing all of our memory directly, then you can kind of get rid of that. But it's a lot trickier and harder to get right.
Starting point is 00:24:47 OK, right. Okay. So, yeah, let's kind of look at... so now, are there any questions around any of the access modes, accessing persistent memory, integrating with it so far? Yes? If I do this combined mode, I don't put a regular file system on there? I can just ask the operating system to give out files that are mapped to this information? So actually it is a normal file system and the file system will still see it as a block device. So we're kind of limited by this, so everything we do is still in 4 kilobytes.
Starting point is 00:25:17 But if we mmap this correctly, then the driver will still know what's going underneath. So what the file system does is it uses a chunk of addresses that are mapped to this file. And then if I go through the driver, the driver will still know what to do internally. So we're not actually saying, so the 4 kilobyte stuff, this actually only happens if I do this. So if I do regular reads and writes, then it goes through the normal operating system path to do file access. And then in that case, the file system doesn't know what's going on because this is what's expected.
Starting point is 00:25:51 But if we combine it, then here somewhere underneath, so basically in here, the file system, the OS, and the driver will kind of figure out you can do this at byte level. And that's actually what happens. So this is also how I basically implemented all the stuff that I worked on. It's because it just gives you a nice structure on a server. Maybe you have multiple users accessing something.
Starting point is 00:26:12 You don't want them just to write on character devices. You want them to say, OK, here's your directory for user A, user B. Please do your stuff in there. But you can go down underneath and get byte access. OK, so now I think this is kind of the main part why we still kind of consider PMEM to be closer to the CPU, like DRAM and not like storage. The main reason is that persistent memory is actually
Starting point is 00:26:38 connected to the CPU via integrated memory controllers, IMC, which is exactly the same thing that connects the CPU to DRAM. So actually, PMEM is also on the same memory bus as DRAM, which also means that they need to have the same memory bus speed. So they're all connected. It's kind of the same thing. It's the same form factor. So if I open a server, I can take a persistent memory DIMM
Starting point is 00:27:03 and I can put it in a DRAM slot. It might not work because the CPU doesn't support this. And I could put a DRAM DIMM in a PMEM slot, and this should work. So yeah, like I said, they need to have the same memory bus speed. And this is actually quite interesting. In the first generation, PMEM was actually a bit slower
Starting point is 00:27:21 than what DRAM could offer. So also in the servers we have here, we need to reduce the memory bus speed of DRAM. In the second and third generation, this is not the case anymore, so PMEM clocks at the same memory bus speed as DRAM. And in the case of Optane, there's
Starting point is 00:27:39 a DDRT protocol, which is just a modified DDR4 protocol. So DDR4 is the standard protocol for the CPU to communicate with DRAM. And because PMEM has a few different characteristics, the DDRT protocol actually adds a few things, which is like supporting larger packages. There's some out-of-orderness in there, basically also because of the longer latency. When accessing PMEM, we have higher latency than with DRAM, and so a lot of timing issues may arise in this case, so there are some modifications in
Starting point is 00:28:13 there that allow us to handle persistent memory better. And so on the current Optane, the 200 series, so this is the second generation of Optane, which we also have here at HPI, we have two integrated memory controllers per CPU. We can have up to four channels per integrated memory controller. That means we can have up to eight DIMMs per CPU. And the largest DIMMs we can currently buy for Optane are 512 gigabytes. That means I can have up to four terabytes of memory per CPU, which is quite a lot. So this is where persistent memory will offer like a large portion of its benefit.
Starting point is 00:28:54 And actually companies like SAP and Oracle have looked into this or are looking into this just because, well, it's a huge amount of memory. And if you have something like SAP HANA, which is an in-memory database, it's really good for them to say, oh, we have a lot more memory here than we had before, because this is often a very, very limiting factor. Okay, so let's look at how we actually get data from the CPU to, in this case specifically, Optane. So I'm kind of using Optane and PMEM interchangeably here, purely for the fact that the only real available system
Starting point is 00:29:30 is Optane, or the only real available solution is Optane. But this doesn't necessarily have to be the case. So in this case, I think a large part of this is kind of like the protocol here is kind of implemented or specified the way that Optane implemented it, probably also because Intel was the only ones doing it, so they kind of defined the standard after this.
Starting point is 00:29:50 But some of the access granularities and stuff may be different in other implementations. So here on the red box on the left, we have the integrated memory controller, and on the yellow box or orange box, we have persistent memory DIMMs. So if we want to read data, then I'll access a call. The CPU will say, OK, please read cache line 1. So we have this cache line with the address 1. This goes into a read pending queue in the memory controller. And then the read pending queue, this will access data from the PMEM DIMM so this will cross the DDRT
Starting point is 00:30:27 protocol go into the PMEM controller and then underneath so I said in Optane we have this 3D crosspoint media underneath the access granularity for this is actually 256 bytes so and compared to DRAM which is usually 64 bytes here we need to access physically we always need to access 256 bytes so you can also kind of think of this as a block device but with a lot smaller block so a an SSD for example you'll usually access 4 kilobytes at once or maybe even 16 kilobytes at once here we access 256 bytes at once which is a smaller, but still larger than 64 bytes. So the thing is, if we only access cache line 1 here, then we go to the controller, we fetch cache line 1, but we can only fetch 256 bytes at once.
Starting point is 00:31:19 So we're actually also fetching three garbage cache lines from the CPU side, because this is just padding for the 256 bytes that we need to read internally. So now we send cache line one back, and now we have a four X read amplification, because we're sending four times as much data than we actually wanted to send. So this is inefficient.
Starting point is 00:31:42 But now if we actually access four consecutive addresses, so one, two, three, four, then the controller will actually fetch one, two, three, four, and then send all of these back. So these will actually be four cache lines individually. But we only need to access the media underneath once. In this case, we have no read amplification. So accessing persistent memory in multiples of 256 bytes is really
Starting point is 00:32:06 important for performance in this case. Again, Optane only. I can't make statements about other types of persistent memory. But in this case, it's something we need to be aware of, is that just because we can access data byte addressably doesn't necessarily mean we should do it. And there's a lot of work and a lot of research on how to design data structures specifically around this 256 bytes granularity. Now on the other side, we have when
Starting point is 00:32:40 we want to actually write data to Optane persistent memory, we have the same setup. And now we have here this top box here, which is a write combining buffer. So what it basically does, it takes writes, it buffers them for a while, tries to do some magic internally, and then writes them out. And I'll show this on the small example now.
Starting point is 00:32:58 So again, we want to write cache line number one. This goes now into our write pending queue, which is the opposite of the read pending queue. So we write this. At some point, the data will go from the write pending queue into the write combining buffer, and it'll just idle there for a bit. And at some point, there's no more data coming.
Starting point is 00:33:20 So this is the only cache line we have. So the write combining buffer now will have to say, OK, I need to flush this to my underlying storage media. So what we do is we have to perform a read, modify, write operation. So it's the same thing as before. I can't access just one cache line. I need to access four cache lines, so 256 bytes.
Starting point is 00:33:39 I need to load these from memory. I need to modify the single cache line I want to modify. I need to write back everything. So there's write amplification here, because we're writing four times as much memory as we actually want to write. Now the main benefit that we have is if we have... Let's see if I can get rid of... If we have now four consecutive cache lines coming in,
Starting point is 00:34:07 so we're just writing something to the queue. So we have cache line two, three, and four, and some other random stuff in there, which is kind of the nice property of the write combining buffer. It doesn't have to be exactly in this order, because it combines stuff in the order up here. And then we can see we have cache lines one, two, three,
Starting point is 00:34:22 and four together in the write combining buffer. Then Optane will be like, yay, I can flush this together. And then it doesn't need to do read, modify, write anymore. It can just take these four cache lines and just flush them down directly, which is a lot more efficient. And then we have no write amplification because we're writing exactly the 256 bytes
Starting point is 00:34:40 that we want to write. So write combining is very important for performance. If you don't pay attention to this with Optane, then your performance will be very low. And if you do a lot of random writes, small random writes to PMEM, even on a large server configuration, you're only going to get like three gigabytes per second of performance.
Starting point is 00:34:59 Whereas random writes in DRAM, if you do 64-byte granularity, will scale to probably like 70 gigabytes per second, again, depending on the source. So there's a huge difference here in what we can achieve. So there are some intricacies around accessing Optane PMEM with 256 bytes. And if you don't pay attention to that, then you're going to lose a lot of performance.
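Just to make the arithmetic from this section concrete, here is a tiny C++ sketch; the 256-byte figure is the Optane-specific media granularity discussed above, and the helper name is made up. It computes how many bytes the DIMM actually touches for a given access size.

```cpp
#include <cstddef>
#include <cstdio>

// Back-of-the-envelope helper: how much data the Optane media actually
// touches for an access of `bytes` bytes, given its 256-byte block size.
constexpr std::size_t kMediaBlock = 256;  // assumed 3D XPoint access granularity

std::size_t media_bytes_touched(std::size_t bytes) {
    return ((bytes + kMediaBlock - 1) / kMediaBlock) * kMediaBlock;
}

int main() {
    // One 64-byte cache line -> 256 bytes touched -> 4x amplification.
    std::printf("64 B access touches %zu B\n", media_bytes_touched(64));
    // Four consecutive cache lines (256 B) -> 256 bytes touched -> no amplification.
    std::printf("256 B access touches %zu B\n", media_bytes_touched(256));
}
```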
Starting point is 00:35:27 Okay. So we're kind of going to switch gears a bit now and look at the actual persistent part of persistent memory. So if I have something that needs to be persisted, in case of power loss, it's really important that I know when something is persisted. And in this case, there's this thing called the ADR, which purely from the name is very confusing because it has nothing to do with persistence at all.
Starting point is 00:36:04 And this is the asynchronous DRAM refresh domain. And this space just specifies the certain domain because for regular DRAM, we need to refresh, we need to give it power constantly to hold the data. And everything in it, in this ADR, will constantly get this power flush. And in the first generation of Optane, the ADR went into here. So we can see that as soon as data hits this ADR,
Starting point is 00:36:34 it is guaranteed to be persisted in case of power loss. So there's probably, again, a small battery somewhere in here on the CPU that says, I've lost power, but all of these cache lines here that I have in the write pending queue, I guarantee that these will be persisted. So if I lose power now, here it goes out, these will still be flushed and written to Optane underneath. So as soon as I know something's in my write pending queue, then I know I have guaranteed data persistence. A very, very
Starting point is 00:37:04 important part, though, is that the CPU caches are not in the ADR. So usually, you probably have never really thought about this when programming, is you write something, and then you don't care about where it is, because you know DRAM is volatile, so if your application crashes, it's gone anyway. Now, a very, very big difference is, for persistent memory,
Starting point is 00:37:23 I need to make sure that the data is not in the caches, but actually is in the write pending queue for me to know that it's persisted. So we need to explicitly flush data from the cache or from the CPU into the write pending queue. I'll show you some examples on that later. Now, this is kind of an annoying programming model because you constantly need to think about flushing data, flushing data, and flushing data. And so in the second generation of Optane,
Starting point is 00:37:54 Intel introduced the enhanced ADR, so EADR. And now the cache is also part of this DRAM refresh domain. That means as soon as data is in my CPU cache, then I know it's going to be persisted in case of power loss. And this is quite interesting or quite nice because this means I don't need any explicit flushes anymore. I can just take the data, I can write it. There's something small I still need to pay attention to, but if I do this, then I know my data is going to be persisted. And this makes programming a lot easier,
Starting point is 00:38:27 because I don't need to pay attention to all these flushes, which are incredibly hard to get right all the time. And I'll show you this later also. OK, with that, we're going to take a three minute break. I'm a bit ahead of time, which is good. So we can have a bit of a break and rest and stretch. But before that, are there any questions? OK.
Starting point is 00:38:49 So as I mentioned, in the first generation of Optane, I explicitly need to flush data from the CPU to the integrated memory controller, so that it actually then gets flushed to persistent memory. So this is the first time probably any of you have actually thought about moving data explicitly from the CPU cache to somewhere else. And there are currently three options to do this on x86.
Starting point is 00:39:13 And the first one is a cache line write back. So what this does is it takes a cache line that I've modified and it sends it to the integrated memory controller. But it keeps data in cache. So what I do is I write something. I know it's going to be persisted after this call. And then the next access to this cache line is going to be answered from cache. So then I have the nice property that I don't need to go to persistent memory to get data
Starting point is 00:39:40 again, but I can just access it maybe from my L1, 2, or 3 cache. And the instruction to do this usually in the C++ world is _mm_clwb, the cache line write back. So the underscore mm, most of you will be familiar with now after the first programming exercise, but these are some of the Intel intrinsics that you have. The second option is a non-temporal store, and this is again something most of you have probably never thought about.
Starting point is 00:40:09 So what we want to do here, we have a non-temporal store. And basically what this means is we're saying we don't need this data in the near future. So there's no temporal locality. And this means, for example, if I want to just log something, I want to write it somewhere, but I actually don't need to access it at all unless maybe there's an error case or maybe once at the end I need to scan this sequentially, but I'm not going to access this specific cache line again in the near future. And so what we do with this is we just bypass the cache hierarchy entirely and that means we don't pollute our cache with random data that we don't actually need to access. This is faster than cache line writebacks
Starting point is 00:40:43 because we're kind of skipping the entire cache hierarchy. I think the numbers are around like 20 to 30% faster. And we can use, for example, this _mm512_stream_si512 instruction. And here I think most of you will have used the store instructions for the first SIMD task. And stream in this case just says this is non-temporal. There are also non-temporal loads, but they do nothing on x86
Starting point is 00:41:06 because the concept doesn't really exist, but there's instructions for it. So yeah, what we're basically just doing here is we're taking an entire cache line, so it's 512 bits. We're saying just write this directly to the memory location, no caches are involved. And if I access this cache line next iteration, then I will actually need to go to PMEM to fetch it.
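A minimal sketch of such a non-temporal store in C++, assuming an AVX-512 CPU and a 64-byte-aligned destination pointer into a PMEM mapping; the function name and setup are made up for illustration.

```cpp
#include <immintrin.h>

// Sketch: write one 64-byte cache line with a non-temporal store,
// bypassing the cache hierarchy entirely. `dst` is assumed to be a
// 64-byte-aligned pointer into a PMEM-mapped region, `src` a source buffer.
// Compile with AVX-512 support (e.g. -mavx512f).
void stream_cache_line(void* dst, const void* src) {
    __m512i line = _mm512_loadu_si512(src);                  // load 64 bytes
    _mm512_stream_si512(static_cast<__m512i*>(dst), line);   // store, no caching
    _mm_sfence();  // still needed to order/drain the store (see the fence discussion below)
}
```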
Starting point is 00:41:25 The third one I'm just going to mention for completeness. This is kind of legacy and shouldn't really be used anymore. And this is cache line flush. And this is just because at the beginning, cache line writeback didn't exist. So there was only CL flush. And what it does, it takes the cache line,
Starting point is 00:41:43 it writes it to the integrated memory controller, but it also invalidates it. That means if I have my next access, then I still need to go to PMEM to fetch it, because it's not in my caches. And this is not really useful anymore, because the newer generations all implement cache line write back in the NT store.
Starting point is 00:42:01 So there's no need for cache line flush, because if I have it in my cache but then invalidated, then I'm just wasting space in my cache, because if it's invalid, I need to fetch it next time anyway. So one thing that's also, again, going one level deeper, as I said, that we need to flush data from the CPU to the integrated memory controller, and that data in the caches, at least in the first generation,
Starting point is 00:42:28 is not part of the persistent domain. But actually, in the CPU, there's one level lower than that. And that's I issue a store. And this first goes to a store buffer. So my CPU, I don't actually want to write all the data. I want to gather data before I write it somewhere, because writing is very, very expensive compared to reading.
Starting point is 00:42:48 And everything that is in the store buffer is not actually part of the cache, because everything in the store buffer is not visible to any other thread, or any other core running on the CPU. So this is really only the single core, there's no way to interact with this store buffer, so it's not visible. And the main point is that basically the way we can define something to be part of the caches,
Starting point is 00:43:13 and in this case also, in the second generation, part of the domain that is persisted, is we need to ensure that a write is visible. So if you've kind of dug into C++ memory models or C memory models and atomics and these kind of things, then you're probably familiar with the term visibility. When is a write visible to something else? And basically, as soon as data is in my cache, then the CPU guarantees that if another thread wants to access this data, it will see this data.
Starting point is 00:43:45 And that's what all the cache coherence protocols are for. In this case, then data is visible. But to actually do this, so the CPU, like I said, it will issue a store instruction. It will go to some store buffer. And then the CPU is out of order. So it will execute stuff out of order. The compiler may reorder your instructions. So just because I write a store instruction here in my program
Starting point is 00:44:10 doesn't mean that this is exactly where the store instruction happens, and maybe another instruction will happen before it, or another instruction will happen, like something will be moved around it. And this is a big problem, mainly for the fact that in my application I write, please persist this cache line to persistent memory, and now another thread sees this, says, ah, okay, you've written this, it's persisted, and then I make a decision based on the fact that something else was persisted already, but this data's not actually persisted, then I have a problem.
Starting point is 00:44:45 Because then I'm making assumptions about persisted state that I can't recover. So what we want to have is that if I persist something, I want to have a way to be sure that every single instruction that is executed after this is guaranteed to be persisted. So in case of a power loss, I know that I can get back to this state. And this is really important, because otherwise it's very, very hard to reason about persistence
Starting point is 00:45:09 and when something is persistent or not. And to do this, there's an instruction called a store fence or S-fence. And this instruction guarantees global visibility. So I'm just taking this from the 5,000-page Intel software developer manual, which says, the processor ensures that every store prior to an S-Fence, so above an S-Fence, is globally visible before any store after the S-Fence becomes globally visible. So once the S-Fence instruction is completed,
Starting point is 00:45:39 this means everything that in my application that I've written in my code, actually written my code before this instruction, is guaranteed to be globally visible. So this basically means the compiler cannot move any instructions from above the S-fence instruction below it. And the CPU needs to ensure that once this S-fence is completed, that all the stores before this are visible.
Starting point is 00:46:03 And this basically means it needs to ensure that the store buffer is flushed, it's empty, and then I know all my data must be at least in the caches. And once it's in the caches, I know that it's globally visible to other CPUs or other cores. So like I mentioned, this stops the compiler from reordering, which is very, very important because this is what the compiler can do and does a lot. So if you don't have this, then maybe you'll make an assumption about persistence before it's actually persisted.
Starting point is 00:46:35 And the SFence waits for the store buffer to be flushed. And once we've had this, we've issued this SFence command, then we know for sure everything running after this will be persisted. So a very, very common pattern you'll see, or basically the most common pattern you'll see to do this is we have data, we write some value to it somewhere, then we call the cache line write back instruction that I showed you on the previous slides.
Starting point is 00:46:59 We say, please write this address to persistent memory. Again, we're just writing a single cache line here. So this issues exactly this. So wherever this address is in, we'll flush exactly the cache line that this pointer's in. And then after we've done that, we'll issue an SFence instruction, which says, please make sure that nothing is reordered here.
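Written out as code, the pattern just described could look roughly like this C++ sketch for a first-generation (ADR) platform; the function and variable names are made up, and it assumes a CPU with the CLWB instruction.

```cpp
#include <immintrin.h>
#include <cstdint>

// Sketch of the store + flush + fence pattern. `value` is assumed to live
// in a DAX-mapped PMEM region. Compile with CLWB support (e.g. -mclwb).
void persist_value(uint64_t* value, uint64_t new_value) {
    *value = new_value;   // 1. regular store, lands in the CPU cache
    _mm_clwb(value);      // 2. write the dirty cache line back towards the
                          //    memory controller, but keep it in cache
    _mm_sfence();         // 3. no reordering across this point; once it
                          //    completes, the store is in the ADR domain
}
```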
Starting point is 00:47:21 And now I can guarantee that. So basically, any instruction that comes here, I'm guaranteed to know that this is persisted. And everyone else can see this. This is a very, very important property in programming persistent memory. So now if we take a step back again and look at this ADR versus EADR comparison,
Starting point is 00:47:41 EADR does not require any flushes anymore. So like I showed in this example, in the first generation, I need to issue this cache line write back instruction here. In the second generation, I don't need to do this. But in the second generation, I still need to issue an SFence. This is purely for the fact that in the second generation still, I have something in my store buffer. And I need to make sure it's in the caches.
Starting point is 00:48:03 And if I do an SFence instruction, I can guarantee it's in my caches, and then I know it's in the EADR, and I know it will be persisted. And there's a nice paper here from OSDI 2020, which investigates bugs in PMEM applications and PMEM libraries, also the official PMEM libraries from Intel. And there are a lot of bugs around incorrectly flushing this.
Starting point is 00:48:27 So if you think about a very, very simple path, where it's like, oh, I write something, I need to flush one value, and then I do an S-fence, that's pretty easy to get right. But let's think about we have multiple access paths. We have a for loop that goes somewhere. Then we return out of this for loop, and we have different states. So this is somewhat similar to if you think about C and its memory model.
Starting point is 00:48:52 If I've called new to create an object, actually checking all the paths that I don't have a memory leak is quite tricky, which is why C++ kind of tries to circumvent this with different idioms. But this is kind of the same problem. How do I ensure that all the paths I take are actually correctly flushing and persisting data? And also an important part why this is interesting
Starting point is 00:49:13 is because the choice of the flush instruction I use actually impacts performance. So I briefly mentioned the non-temporal store is a lot faster than a cache line writeback because we're bypassing the caches. So depending on what I want to do, it might be beneficial to use different instructions.
Starting point is 00:49:28 For example, if I have sequential writes. Can anyone tell me why in EADR I might still want to have explicit flushes for sequential writes. If we think many, many, many slides back to the slide where I showed you how data is actually written to Optane and what happens when I transfer cache lines. Yes? I mean, if the writes are, like, reordered, then maybe we try to, like, get it to 256 bytes to, like, send it to the Optane DIMM. But if it's, like, reordered, maybe the, like, full cache line is not in there already. So maybe we just get, like, only two, three, and we'd be using a read-modify-write, right?
Starting point is 00:50:15 But if it's, like, reordered, maybe the, like, full cache line is not in there already. So maybe we just, like, we get, like, only two, three, and we'd be using a read-body type, right? Yeah, that's the correct answer. Thanks for that. So let me just repeat for the microphone also. What will happen if I don't explicitly flush data, then data will just be in my caches at some point, and then my cache will randomly evict data.
Starting point is 00:50:38 So I don't control when data is evicted from cache. I have no control over this unless I use these instructions. So what will happen, or what may happen, is I want to write address 1, 2, 3, 4. I write this by don't issue a flush instruction. Now what happens, at some point cache line 3 is evicted from cache. And now cache line 2 is evicted from cache.
Starting point is 00:50:58 And then 1 and 4 just stay in the cache forever. And now I have to flush. And then now Optane needs to flush two and three together, but we have write amplification, so we need to do the read, modify, write. But if I explicitly flush all four together, then all four addresses together will go into the write combining buffer.
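As a small sketch of that point: even on an EADR system it can pay off to flush four consecutive cache lines explicitly, so they reach the write-combining buffer together instead of being evicted at random times. The function name and the 256-byte assumption are illustrative and Optane-specific.

```cpp
#include <immintrin.h>
#include <cstring>

// Sketch: write one full 256-byte block (four cache lines) and flush the
// lines back-to-back, so Optane can write the block without a
// read-modify-write. `dst` is assumed to be 256-byte aligned and in PMEM.
void write_block_256(char* dst, const char* src) {
    std::memcpy(dst, src, 256);       // four dirty cache lines in the cache
    for (int i = 0; i < 4; ++i)
        _mm_clwb(dst + i * 64);       // push them out together
    _mm_sfence();                     // drain the store buffer / order flushes
}
```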
Starting point is 00:51:17 Optane will see this and write this at once, which is really good for performance. So yeah, depending on your access patterns, it really makes sense to think about when I need flushes and when not to use them. OK, switching to another interesting problem that we have with Optane or PMEM in general, Intel only supports eight byte atomic stores.
Starting point is 00:51:42 So some CPUs technically also do 16 byte atomic stores, but this is not really important for what we want to discuss here. We have 8 byte atomic stores. So let's say we want to write the string Tilmann with a null byte terminator to some pointer. And then we push this. This is 8 bytes. And this means that we can write this once. And the CPU, depending on a few things, but the CPU will guarantee that this is written atomically. So either the entire string in this case is written, or none of it's written.
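As a small illustration of such an 8-byte store in C++ (the helper is hypothetical; it assumes an 8-byte-aligned destination in PMEM):

```cpp
#include <cstdint>
#include <cstring>

// Sketch: pack "Tilmann" plus the null terminator into exactly 8 bytes and
// write them with a single 8-byte store. On x86, an aligned 8-byte store is
// atomic: a reader (or a crash) sees either the old or the new 8 bytes,
// never a mix. Persistence still requires a flush and fence afterwards.
void store_name(char* dst /* 8-byte aligned, in PMEM */) {
    const char name[8] = {'T', 'i', 'l', 'm', 'a', 'n', 'n', '\0'};
    std::uint64_t packed;
    std::memcpy(&packed, name, sizeof(packed));
    *reinterpret_cast<volatile std::uint64_t*>(dst) = packed;  // one 8-byte store
}
```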
Starting point is 00:52:19 For PMEM, we actually really need to pay attention to this on a CPU level and issue the instructions correctly to make sure that stuff is written atomically, because, well, I'll show some examples where this is a big problem in a second. And actually just as a small side note, most SSDs may take care of this for you. So for a very, very long time, I believed that SSDs will also flush an entire page atomically, but this is actually not the case. Some SSDs do, but a lot of SSDs actually don't.
Starting point is 00:52:45 So it's really important to think about atomic writes here. Does the pointer need to be 8 bytes aligned? Yes. So that's why the question was, does the pointer need to be aligned? Yes, I'm pretty sure it does. I'm just thinking that at least on Intel, there's a lot of weird trickery around, for example,
Starting point is 00:53:05 cross-cache line writes. So I'm not sure if, for example, it's 4-byte aligned and crosses a cache line. There's specific stores for these kind of things to happen too. But in general, for something to be an atomic 8-byte write, it needs to be 8-byte aligned, yes. So in general, it's better to be safe. I'm not 100% sure about the implementation details
Starting point is 00:53:24 if it's not 8-byte aligned. Yes, so 8 bytes are either stored completely or not at all. But now later on in my application, I want to now change the name because someone got promoted, got a new job, and now it's Prof. Dr. Tilmann Rabl. Or we don't want to have just a first name in the application. And together with the null terminator, if I count correctly, these are 23 bytes.
Starting point is 00:53:47 And now what can happen is, so first of all, we need three atomic writes for this, because we need 23 bytes. So we need to write 24 bytes. So what can happen is we write the entire string. We're done. And this is the normal path. This will usually happen, because your application
Starting point is 00:54:02 doesn't crash all the time, hopefully. But what may happen, which is really important to think about, is that I could write the first eight bytes, but the rest is still old garbage. So we have this. And now if I restart my application, I'm in an inconsistent state because I'm not in the old state. I'm not in the new state, somewhere in between. And this is really, really bad because now this has persisted. I restart my application. It's broken. What can also happen is that I have the old name at the beginning, so just Tilmann here,
Starting point is 00:54:31 and then I've written the last bytes. And this is really important to understand: the order in which the cache lines are flushed is not guaranteed. So we have no control over the order in which cache lines are flushed and in which order they're actually written somewhere. So it's really, really important to make sure, if I have a data structure
Starting point is 00:54:53 that needs to write multiple bytes atomically, that I somehow have metadata that indicates whether something has happened or not. Because otherwise, and this is a big difference again from DRAM: if my DRAM application crashes, I restart, I'm good. If my PMEM application crashes and I restart, I have inconsistent state in my application, and there's no way to recover this. Same as if I have corrupt data on my SSD.
Starting point is 00:55:16 If I haven't taken care of how to recover this correctly, there's no way for me to get my data back. OK, let's look into how to program with persistent memory. I'll get into some code examples here. So to program for Optane currently, the best way to do this is to use Intel's Persistent Memory Development Kit, or PMDK.
Starting point is 00:55:46 It's on GitHub, and it's quite a big project. And this is the main way, or the de facto standard, to interact with persistent memory currently. There are a lot of libraries in here. The standard ones would be libpmem and libpmem2, which is the newer version; these just offer general persistent memory functionality. There's libpmemobj and libpmemobj++, for C and C++. And these actually offer you some nice abstractions around transactions, objects, and memory allocations.
Starting point is 00:56:19 So like I said, it's really tricky to get atomic writes correct, and this library will take care of this for you. So there's some overhead; it's not for free, because of how they have to manage this. But as a normal developer, I don't really need to take care of this. It will just write something. I can say, write this atomically as a transaction, and then either all of it's written or none
Starting point is 00:56:39 of it's written, which is quite nice. We can do persistent memory over RDMA. So I think RDMA will be covered in two lectures, in the network lecture, which is remote direct memory access. So I can also do remote direct access into persistent memory. There's libpmemlog for logging purposes, and libpmemblock for atomic block updates. So again, if I want to write an entire block atomically,
Starting point is 00:57:04 this will take over. And general around this ecosystem, there's a lot of documentation, a lot of examples how to implement this and how to work with it. So now if we look at how we actually want to implement persistent memory, like I said, we kind of have this combined mode between PMEM as a file and then actually accessing it in a bi-addressable way.
Starting point is 00:57:28 So if we look back at this picture we had here at the beginning, we have our application. This sees some file which is mounted on persistent memory, so in this case, file.data, which is on a PMEM-aware file system on a block device and so on. And you can kind of think of a file as a typed file. So each file has a root object, which has some kind of type, which is kind of like the metadata
Starting point is 00:57:54 for the structure of the file. So this root object can be any arbitrary, let's say, C object I can just serialize or just some struct I serialize into my file. And then there's usually some metadata somewhere in here which we need for allocations and offsets and these kind of things. But you can basically think of the root as the entry point
Starting point is 00:58:13 into the file. And it's really, really important that from this root I can access every other object in my file. Because let's say I have an object somewhere down here, object 4, there's no way for me to reach it, I have no idea if it's there or not. I have no idea what it is, I have no idea how to get there, and so on. So then in that case I have something that's called a persistent memory leak. Because somewhere I've allocated memory for this, my allocator says there's something stored here, the allocator doesn't know what it is. It just knows this memory is not available anymore.
Starting point is 00:58:47 But I have no way to reach this or access this data. So the memory is wasted. It's allocated. So it's a memory leak. But the thing is, if I restart my application, it's still there. So it's persistent. And we want to avoid this because otherwise we
Starting point is 00:59:01 have memory leaks and we run out of memory, but they're persisted forever. And now if we have the application on the right, we can see we can just mmap this file. So this is obviously a bit simplified. I'll show you some real-code examples later. And then the root object we can get, we can just do data plus some offset, so we can actually get this object. And then I can interact with this as a normal C or C++ object. I can say, please retrieve object one on this, please retrieve object two on this.
Starting point is 00:59:26 I can update some values in these objects. I can actually persist them. So then I can just interact with my file like I would with normal objects, while all this is being backed by some kind of file, which we don't really need to care about, which is a very, very neat abstraction. We just need to kind of get the file
Starting point is 00:59:43 into our virtual address space once. And then when I'm in my virtual address space, I can just operate on normal objects. OK, so just a brief example of how this would be done in C. So we want to write some strings. So we have this header here we can include. And then, like I just showed you previously, there's the mmap call.
Starting point is 01:00:04 There's a pmem_map_file call here, which just takes the path to a file, the length that we want to allocate, or the length of the file that we want to create here, and the permissions. And then we have to put in two pointers, which will say how much we've actually mapped and whether it's actually in persistent memory or not, because I can use this to allocate stuff that's not on PMEM, and then the is_pmem
Starting point is 01:00:31 flag will be false. Then I can just take this address here, my file, store a random string to it, and persist it, and then all of this is done for me. The pmem_persist call you can see in line 11 is basically just the multiple cache line writebacks and store fences that I showed before, neatly wrapped behind an API, so you don't need to take care of this yourself. Now you can see that basically as soon as I'm here, I can do whatever I want with this pmemaddr. It's just like a normal pointer in my application. So this is quite neat.
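The slide code itself isn't reproduced in this transcript, so as a hedged reconstruction, a typical libpmem snippet along these lines looks roughly as follows (the file path and sizes are just illustrative, and the line numbering won't match the slide; pmem_map_file, pmem_persist, pmem_msync, and pmem_unmap are the standard libpmem calls):

```cpp
#include <libpmem.h>
#include <cstring>
#include <cstdio>

int main() {
    size_t mapped_len;
    int is_pmem;

    // Map (and create if necessary) a file on the PMEM-aware file system.
    char *pmemaddr = static_cast<char *>(pmem_map_file(
        "/mnt/pmem0/myfile", 4096, PMEM_FILE_CREATE, 0666,
        &mapped_len, &is_pmem));
    if (pmemaddr == nullptr) {
        perror("pmem_map_file");
        return 1;
    }

    // From here on it is just a pointer: store a string...
    const char *msg = "Hello, persistent memory!";
    std::strcpy(pmemaddr, msg);

    // ...and make it durable. pmem_persist is essentially the cache line
    // write-backs plus store fence shown earlier, wrapped in one call.
    if (is_pmem)
        pmem_persist(pmemaddr, std::strlen(msg) + 1);
    else
        pmem_msync(pmemaddr, std::strlen(msg) + 1);  // fallback if not PMEM

    pmem_unmap(pmemaddr, mapped_len);
    return 0;
}
```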
Starting point is 01:00:56 Just for completeness, I'll also show you this in C++, because this goes more towards the object-oriented, typed-file approach that I mentioned briefly. So here we have our root object, which just contains my string. And now we can create a file, or an object pool, with some flags again here for which permissions we want to have.
Starting point is 01:01:25 And then we have this object, which is my root object. And then I can kind of just operate on this normally. So again, a bit of API stuff. I can get the root object, have some persistent pointer, and then I can do all my atomic guarantees on top of this. So I can say, please write some data here, do this atomically, and the application, or the library, will take care of all of this for me, which is quite neat.
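Again the slide isn't reproduced here, but a small sketch of the usual libpmemobj++ pattern looks something like this (the pool path, layout name, and Root struct are made up for the example; pool::create, root(), and transaction::run are the library's standard API):

```cpp
#include <libpmemobj++/pool.hpp>
#include <libpmemobj++/p.hpp>
#include <libpmemobj++/transaction.hpp>
#include <cstdint>

namespace pobj = pmem::obj;

// The typed root object: the entry point from which everything else in the
// pool must be reachable.
struct Root {
    pobj::p<uint64_t> counter;
};

int main() {
    // Create the object pool (throws if the file already exists; open()
    // would be used in that case). Path, layout, and size are illustrative.
    auto pop = pobj::pool<Root>::create(
        "/mnt/pmem0/example_pool", "example_layout", PMEMOBJ_MIN_POOL, 0666);

    auto root = pop.root();  // persistent pointer to the root object

    // Everything inside the lambda is executed as one transaction: after a
    // crash, either the whole update is visible or none of it is.
    pobj::transaction::run(pop, [&] {
        root->counter = root->counter.get_ro() + 1;
    });

    pop.close();
    return 0;
}
```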
Starting point is 01:01:47 Okay. Let's get to some thinking again. After showing you some of the things that we went through, can you spot some programming Yes. So in line four and line three, we access data at the position three slot. In line six and seven, we access three slot at three slot. Is this correct? It looks a bit weird, at least. That's, I'm not sure. Three slot, okay, yeah, I can't hear you.
Starting point is 01:02:46 Okay, so I know where it is, yeah. Yes? I'm a bit concerned about you writing data without marking that slot as used. So, is there a race condition? Okay, so actually this is a good point. So the point was that we're marking this afterwards. This is actually what you would commonly do in persistent memory, because I want to write data. Once it's written, then I say, now I
Starting point is 01:03:10 can guarantee you that this data is written. So this is a very common pattern to write something first, then update the metadata. Because if the metadata still says it's invalid, then whatever's written in there is considered garbage, and it doesn't really matter. So this would be a very common pattern. But maybe in the answer, I've already given a hint to what's missing here.
Starting point is 01:03:28 I have another thing. The data we write in line 3 is more than one cache line, so we need multiple calls to the cache line write back. Yes, exactly. So this is actually a very common thing. I call the cache line write back instruction on the address here, and this will just pick whatever cache line the data, or rather the pointer, is in and write back those 64 bytes.
Starting point is 01:03:51 But we have 200 bytes here that we need to write. So what we actually need to do is loop over this multiple times and write back all the cache lines that this data is in. So this is actually the first bug we have. And there's a second one also. Still looking for another hand. I'll give you a second. Think about the example I showed.
Starting point is 01:04:11 This is a very, very common pattern. When I flush data, what do I also need to pay attention to? What can happen? So I said the CPU and the compiler may reorder instructions, and I said the metadata needs to be updated after the data has been written. What can go wrong? Yes? Well, if the metadata write is reordered before the data write, that's no good. So maybe there's a barrier missing here? Yes, exactly.
Starting point is 01:04:48 So those are the two things that we need to do. First of all, we need to persist 200 bytes. For simplicity, I just used this persist call here. So we actually guarantee that all the data is persisted. And in that persist call, there's an S fence, which is very, very, very important. And we also need one here. So we need to guarantee that stuff is not reordered.
Starting point is 01:05:13 I think I have this on the slide here. Yeah, so the persist is basically just a cache line writeback plus an S fence. And we need this here to make sure that once we're actually writing the metadata, to say this slot is occupied, I can guarantee that the data has been persisted before. And also after I've written the metadata, I need to ensure that there's an S fence. Okay, it's a bit unfortunate that I've kind of given you the answer already, because the animation is broken. So PMEM bugs like this are actually very, very subtle and hard to see.
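To make the corrected pattern concrete, here is a minimal sketch of the idiom just discussed (my own illustration, not the slide code; the Slot struct and the 200-byte payload are made up, and the persist helper uses the _mm_clwb and _mm_sfence intrinsics mentioned earlier):

```cpp
#include <immintrin.h>  // _mm_clwb (compile with -mclwb) and _mm_sfence
#include <cstdint>
#include <cstddef>
#include <cstring>

constexpr uintptr_t kCacheLine = 64;

// "persist" = cache line write-back for every line the range touches,
// followed by a store fence.
void persist(const void *addr, size_t len) {
    uintptr_t start = reinterpret_cast<uintptr_t>(addr) & ~(kCacheLine - 1);
    uintptr_t end   = reinterpret_cast<uintptr_t>(addr) + len;
    for (uintptr_t line = start; line < end; line += kCacheLine)
        _mm_clwb(reinterpret_cast<void *>(line));
    _mm_sfence();
}

struct Slot {             // hypothetical slot: 200-byte payload plus valid flag
    char    data[200];
    uint8_t valid;
};

void write_slot(Slot *slot /* in PMEM */, const char *payload, size_t len) {
    std::memcpy(slot->data, payload, len);
    persist(slot->data, len);   // 1) all cache lines of the data, then fence
    slot->valid = 1;            // 2) only now mark the slot as occupied
    persist(&slot->valid, 1);   // 3) flush and fence the metadata as well
}
```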
Starting point is 01:05:48 So you could see it took you a while to figure this out. And if you have a lot of complex access paths throughout your code, this is very, very hard to get right. So it's really important to understand how to store stuff, when to store stuff, and how to do this correctly. And then EADR system, just because the answers are there already. And we wouldn't actually need the persist call here.
Starting point is 01:06:09 We wouldn't need the cache line write back here. But we would still need two S fences. So I was going to skip that, but I can't, because it says on the slide already. OK. Persistent memory and data management. So I'm going to use the last few minutes to kind of discuss some research that's been happening in persistent memory, some research we've been working on.
Starting point is 01:06:32 So we've been actually doing a lot. And my thesis will actually also be about persistent memory. So we've been doing quite a lot in this space. There's lots of research on PMEM in databases, on database operators, on storage engines in general, on how to do logging with it, how to design data structures, so B-trees, Radix trees, hash tables, and these kind of things for PMEM. There's a lot of work on interesting DRAM PMEM co-designs, so what data do I have in DRAM, which data do I have in PMEM, and how do I
Starting point is 01:07:05 get better performance out of this. If you're interested, there's quite a bit of work at recent database conferences and related venues. SIGMOD and VLDB are the two big database ones, DaMoN is a workshop on data management on modern hardware, and FAST is a file and storage technologies conference. And ASPLOS is more of an operating systems and programming languages conference, which also tackles this from a different angle. And I'll just very briefly show you three projects, which
Starting point is 01:07:37 is the FPTree, Dash, and Viper. And I'll just kind of show you what the interesting parts about those are. So FPTree, this is actually a paper from 2016 from Ismail Orkid, who was at TU Dresden. And this was done before Optane was available. So they have some interesting ideas on how to design this with some of the specifications that
Starting point is 01:07:57 were kind of known beforehand, but not really. And interestingly enough, this design was done pre-Optane, but it actually worked quite well even after Optane came out. There are some data structures that are targeted more towards Optane, but this design in general actually works quite well because it's quite simple. So the first idea they kind of looked at is to have selective persistence, which means we only actually persist the data that we need for recovery.
Starting point is 01:08:19 So in that case, they have a B-tree, which means we only ever need to persist the leaves, because from the leaves I can reconstruct the entire tree. And the leaves are the majority of the data I have in a B-tree, and the upper structure is a lot smaller. And you can also see that they still called it storage class memory back then. So within each of their B-tree nodes, they have some neat features, one of which is fingerprinting, which is now actually quite commonly used across data structures to say,
Starting point is 01:08:49 instead of having to look up all my keys in persistent memory, which is quite expensive, I store a one byte hash of this key. And then let's say here I have, what is it, these are six. So I can compare six keys at once using this fingerprint. We could do this, for example, with a SIMD instruction. And only if I have a match within these fingerprints, I actually need to go to PMEM to check my keys, which
Starting point is 01:09:13 is a lot more expensive. And we can see that, back at the time, this significantly outperformed the other systems. Let me just quickly check. Yeah, so lower is better here, because this is latency. You can see the blue line here is quite low at the bottom. It's actually quite close to DRAM.
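Coming back to the fingerprinting idea for a moment, a rough sketch of the pattern could look like this (my own illustration, not the FPTree code; the leaf layout, slot count, and hash function are made up, and the scan is a plain loop where FPTree would use a SIMD comparison over the fingerprint array):

```cpp
#include <cstdint>

// One-byte fingerprints are checked first; the full keys, which are expensive
// to read from PMEM, are only touched on a fingerprint match.
struct Leaf {
    static constexpr int kSlots = 32;   // illustrative slot count
    uint8_t  fingerprints[kSlots];      // 1-byte hash per key
    uint64_t keys[kSlots];              // full keys in PMEM
    uint64_t values[kSlots];
};

inline uint8_t fingerprint(uint64_t key) {
    return static_cast<uint8_t>((key * 0x9E3779B97F4A7C15ULL) >> 56);
}

bool lookup(const Leaf &leaf, uint64_t key, uint64_t &value_out) {
    const uint8_t fp = fingerprint(key);
    for (int i = 0; i < Leaf::kSlots; ++i) {
        // Cheap filter first; the expensive PMEM key compare only on a match.
        if (leaf.fingerprints[i] == fp && leaf.keys[i] == key) {
            value_out = leaf.values[i];
            return true;
        }
    }
    return false;
}
```

A real leaf would additionally track which slots are actually occupied, for example with a small bitmap.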
Starting point is 01:09:32 And even now, for a lot of systems, this is actually still a pretty good candidate. So a lot of the concepts used in here have been adopted in later versions of different systems. And the second one we kind of look at is Dash. This is probably the first real hash table design for Optane specifically, from VLDB 2020. And like I said, there's a lot of attention to doing stuff 256-byte aligned.
Starting point is 01:10:02 So they obviously take a hash bucket which is aligned to exactly 256 bytes. And you can see, again, there's these fingerprints in here and these kind of things, which they also use from FPTree. So accessing data exactly in this 256 byte granularity is very efficient. They use a version of cuckoo style hashing. So in cuckoo hashing, if I have a collision, that means I basically try to displace the object I'm
Starting point is 01:10:28 colliding with into another bucket. So I have a fixed number of buckets in which my entries can be. In this case, there's two. And these two buckets are exactly next to each other. So I have two adjacent 256 byte loads, which means I can load this in one 512 byte access, which is again very efficient.
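A rough sketch of that probing scheme (just the concept, not the actual Dash code; the bucket layout, slot count, and hash function are made up, and it assumes at least two buckets):

```cpp
#include <cstdint>
#include <cstddef>
#include <initializer_list>

// Each bucket is exactly 256 bytes; a key may live in bucket b or in its
// neighbor b + 1, so a lookup reads one contiguous 512-byte region.
struct alignas(256) Bucket {
    static constexpr int kSlots = 14;   // illustrative slot count
    uint8_t  count;
    uint8_t  fingerprints[kSlots];
    uint64_t keys[kSlots];
    // values and overflow metadata omitted for brevity
};

inline uint64_t hash64(uint64_t key) {
    return key * 0x9E3779B97F4A7C15ULL;   // any decent hash works here
}

bool contains(const Bucket *table, size_t num_buckets, uint64_t key) {
    const uint64_t h  = hash64(key);
    const size_t   b  = h % (num_buckets - 1);          // keep b + 1 in range
    const uint8_t  fp = static_cast<uint8_t>(h >> 56);
    for (const Bucket *bucket : {&table[b], &table[b + 1]}) {
        for (int i = 0; i < bucket->count; ++i)
            if (bucket->fingerprints[i] == fp && bucket->keys[i] == key)
                return true;
    }
    return false;
}
```

The point is that both candidate buckets sit next to each other, matching Optane's access granularity.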
Starting point is 01:10:48 And this again significantly outperforms the other designs, and to this day it still does by quite a bit. And I actually think there's a new Redis competitor. It's not as big yet, but it's a system kind of trying to be a new Redis, so like a cache, and they're actually explicitly building on this paper, even though they're using it entirely in DRAM. So a lot of the design here actually also works very well for DRAM. This is a very, very nice paper to read. And the third one, just to plug some of our own research here, this is my work
Starting point is 01:11:20 from VLDB 2021 and so so here we're looking at the intersection between DRAM and persistent memory and a key value store kind of setup. It's saying, OK, we have random access in the hash table. This is expensive. So we want to do this in DRAM, because random writes and random reads to PMEM are a lot more expensive than they are to DRAM.
Starting point is 01:11:40 So let's keep a small index in DRAM, and since we can actually write a lot of stuff sequentially to PMEM very efficiently, let's write the data itself to PMEM. So what we kind of do is have a slotted storage layout for this, which means for each page I want to write to, I have this slotted design here, where we have slot 0 to some slot n. And if I write to this sequentially, which is what I'm actually doing in the system, writing entry after entry, I can write sequentially and always issue my flush instructions and s-fence instructions. So the Optane media underneath can actually
Starting point is 01:12:13 do write combining very efficiently. And again, back at the time, this was the best out there. Nowadays, it's not anymore. So there's been quite a lot happening around this, and I think there are probably half a dozen papers right now that are better than this design. But it's quite good to see that this design, again because it's very, very simple,
Starting point is 01:12:34 It actually is still usually in the top two or three candidates in the designs. So this is basically just because you're really taking into account what the access patterns are. OK. I think I'm getting closer towards the end. It's good. So there's a lot of additional resources.
Starting point is 01:12:56 There's PMEM.io, which is a website that covers a lot of PMEM implementations and how to do this. GitHub.com slash PMEM is the main GitHub organization for persistent memory. And this is led by Intel. And there's actually also a book about persistent memory. Feel free to ask us.
Starting point is 01:13:13 Again, we've been doing a lot of work on this, and a lot of students have been working on this. So I think the first one and the last two are papers by me, but the other ones in between are actually student papers that we supervised, which have gone to very nice conferences. So yeah, we've been doing a lot of work in this space. It's been a lot of fun. PMEM at HPI: we have six PMEM servers at HPI. We might actually be getting more, which I don't really understand why, but that might be the case. We have both
Starting point is 01:13:45 100 and 200 series obtain here which is quite nice we have a wide range of pmem servers and we have up to four terabytes on one machine so that you can actually really large experiments um now just i think this is the next next slide i didn't want to start with this purely for the fact that then you probably wouldn't have paid attention to the last 90 minutes. Intel discontinued Optane. This, as of now, unfortunately, is dead technology. So last year, they decided this is not worth it, too expensive, whatever, and they cut it entirely with their focus on shifting purely towards CPU data center kind of company. They announced this in July 22. They said the 300 series will still be available afterwards.
Starting point is 01:14:33 And that one we might actually be getting at HPI, because it has CXL 1.1 in it, which is quite nice. But yeah, unfortunately, Optane is dead. But even though this is gone for now, there are actually some very, very interesting insights and problems out of persistent memory research that we can apply to something like CXL. I think CXL will be the last lecture in the semester.
Starting point is 01:14:55 This is Compute Express Link. And this basically is a new interconnect technology to allow for memory pooling. So there will be an entire lecture on this. But basically what it covers is saying, I have my memory pool somewhere, and different applications can kind of connect remotely to this and share stuff in between. And then we have a lot of problems, again, around
Starting point is 01:15:15 a longer access latency, different characteristics between DRAM and CXL attached memory. We have stuff around how does the prefetch interact with this, and also a lot around this store fence and flushing, when is memory visible, when it's not visible. So these are some of the problems that PMEM has tackled. We have a very, very short paper on some of the outlook on this,
Starting point is 01:15:40 how this might be dealt with in CXL in the future. So we think even though Optane for now is kind of gone, a lot of the insights are still applicable, and a lot of people still believe that PMEM in one shape or form will return again, just not in the form of Intel's Optane, because Intel is making different business decisions right now. Okay, we're actually good on time today. So just to recap the lecture, persistent memory
Starting point is 01:16:07 is byte-addressable and persistent. So that's the one thing you kind of need to take away. You have byte addressability, you have persistence, it's a combination of DRAM and SSD, which is really, really nice and can offer a lot of interesting things. And large-scale companies, even though it's discontinued, are still using this.
Starting point is 01:16:22 So we had a talk from Oracle here two weeks ago. They are still actively using this internally for caching, because it's a lot more efficient than doing this in DRAM, since it's not as expensive, and it's a lot faster than SSD. So this makes sense for caching. Persistent memory can be used just as larger DRAM, without taking special care of it, or as separate memory regions, in which case we can fully control which data is where and how it's persisted. And we can use it either in a file I/O way or with the load-store semantics that we have from DRAM. And very important, when developing for persistent memory, we explicitly need to flush and do
Starting point is 01:17:00 the memory fencing. So if you do this, you actually, for the first time probably ever, start thinking about when stuff is visible, how stuff is persisted, when data is flushed, when something is guaranteed to be persisted. So this requires careful code design to ensure correctness. And this is done wrong a lot, all the time. So I've probably implemented quite a few bugs. I've found bugs in other projects.
Starting point is 01:17:22 I've found bugs in papers. This happens all the time because a lot of people are not aware of how to do this correctly. So this is very tricky. So it's important to think about this from the beginning and understand what your model is, which abstraction you want to give. And with that, I'm nearly perfectly in time.
Starting point is 01:17:40 Thanks for your attention. Are there any questions? Okay, then next Tuesday there will be a storage lecture on, yeah, I'm not 100% sure what's covered in there. I think it was changed a bit from last year. So storage, we'll look into some disks and these kind of things. And tomorrow we will discuss the solution of the art programming exercise. And we will introduce task three,
Starting point is 01:18:07 which will be a buffer manager. So I will see maybe some of you again tomorrow, depending on who has conflicting lectures or not. And with that, thank you for your attention. And see you tomorrow.
