Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Storage

Episode Date: June 11, 2024

...

Transcript
Starting point is 00:00:00 Good morning and welcome everybody to today's lecture in Hardware-Conscious Data Processing. My name is Martin Boissier. Tilmann Rabl is not going to be here this week, and today we're going to talk about storage. This is the topic for today. A short announcement, or not really an announcement, just a reminder: you have less than a week left for the second task. So please, well, you should have started already.
Starting point is 00:00:31 If you have any questions, if you are stuck, please let us know, because some people have been stuck. Please ask either via Moodle or just come directly to our office if you have any questions or are stuck with this task. Where are we right now? Over the last several weeks, we talked mainly about the CPU.
Starting point is 00:00:59 So we talked about how CPUs work internally. We talked about what registers are, pipelines, how caching works in the CPU, and then how the CPU is able to process in parallel, for example using SIMD. Then last week we talked about multiple CPUs, or the newer architectures that we see in modern servers. And starting this week, we will talk about the peripherals. The first part in this section is storage, which we are going to talk about today. Tomorrow, Marcel is going to talk about CXL,
Starting point is 00:01:38 the Compute Express Link. And then we talk about networks, GPUs, FPGAs, and also the upcoming tasks. You can see that here. So today is storage, tomorrow is Marcel with CXL, next week on Tuesday we have task 3, which will be a buffer manager, and then, yeah, you see the rest: GPU, network, FPGAs, and then the last task, task 4. And we will probably end with a data center tour, if possible. In this lecture, we will first give an overview of storage: what kinds of storage do we have today in our systems, especially in server systems and larger modern compute systems. Then we will deep dive into solid state drives, or SSDs. We will talk about the physical
Starting point is 00:02:33 interfaces, how you connect modern storage devices to your PC or to your server. We talk about NVMe, and the last point will be the I/O frameworks in Linux: how do you actually communicate with the storage device using Linux? Okay, so these are the most common forms of storage devices you see today. First of all, we still sometimes have optical devices. You barely see them today; maybe you have them in your Xbox, and probably nowhere else anymore. They still exist, but probably not for much longer. We have magnetic devices, so HDDs. There are still a lot of those, even though you might not have them anymore in your modern laptops. They are still the way to store large amounts of data. For example, this is probably what Amazon uses for AWS and so on, or EC2. And
Starting point is 00:03:41 then we have flash devices. This is probably what you have in your laptop: SSDs, and also those little thumb drives. And we still have tape. Future technologies for storage are, for example, glass, and people try to store data in glass and also in DNA, but we're not going to talk about these things. So, hard drives. We still see a lot of hard drives around. In the data center, most of the data you write and read will probably be on a typical boring hard disk drive. They still have large capacities, so terabytes of data, not the largest, but large capacities, and relatively high speed. If you read sequentially, it's usually an acceptable speed that you get
Starting point is 00:04:36 there. But the main issue with these HDDs is slow random access, and this is mainly because there are still physically moving parts. So if you have a random read somewhere, we have to move the head and rotate the disk, and this takes quite some time. As you can see here, it depends on the speed of your disk: slow disks rotate with something like 5,000 rotations per minute, fast enterprise disks have 15,000 rotations, but that's pretty much it. So you can't really get much faster there with the rotations. And the rotation basically determines how long it takes
Starting point is 00:05:16 to get to the item that you want to read: on average you have to wait half a rotation, and on top of that your arm needs to move to the given sector. So this is what you always have to expect for a random access. So random accesses on HDDs are rather slow. These devices are usually connected via SATA, which is Serial ATA; on server systems,
Starting point is 00:05:45 it is often SAS, Serial Attached SCSI. Where a lot of data actually is, still today, is also tape. This is also something we have in our pretty modern data center here at the HPI, so this is not just really old stuff; we still store lots of data on tape. And the reason for that is that tape is just cheap. The overall cost to store large amounts of data is relatively low on tape, and it has a very long lifespan of 15 or more years. So it's a very good device to store archival data,
Starting point is 00:06:27 data that you write once and don't read that often. And on top of that, it is very energy efficient, because the way you usually use tapes is that you have something like a robot that takes the tape, puts it into the reader, writes it and reads it, and then the cartridge is taken out of the system and just stored somewhere. So there's no energy usage at all in that moment. So when you store data on tape, it's in general quite energy efficient. The dominating technology, or vendor, well, the dominating technology is Linear Tape-Open, LTO, which covers basically all the tapes you see nowadays. Right now we are talking about generation 9. On the
Starting point is 00:07:18 right bottom you see the upcoming generations, but with generation 9, the current generation, you have around 18 terabytes of capacity, which can be up to 45 terabytes of compressed storage, depending on what data you store. You might have different compression factors, but often you get a factor of two or more, so with this 18 terabyte capacity you can store more than 40 terabytes of data. We have 20k end-to-end passes, and then you see here we have 350 full writes and reads, which is not a lot. So you don't want to use this for your swap space, you don't want to use that for data that is frequently written to, but for archival purposes this is a
Starting point is 00:08:06 pretty decent choice. And for the next generations, we can see here up to generation 12, people currently aim to have 144 terabytes of capacity per tape, which can be up to 360 terabytes of compressed storage. Optical devices: some of you might still know CDs, little round things that stored music, probably not widely used anymore today. Then we have DVDs and Blu-rays, mostly, I guess, for movies today or for games on some gaming consoles. And the advantage is still that they are really cheap to manufacture. So, yeah, it's just cheap. There are not many companies doing that anymore, but it's still possible to get them at a pretty decent price. And once they are in the cartridge, they are very easy to transport. So if you need to transport data,
Starting point is 00:09:08 they are still quite light and easy to transport. But once they're out of the cartridge, they're actually quite fragile, so you also have to be pretty careful using them. And the main issue here is that the read and write speed is quite slow. What is most interesting for us today is flash storage.
Starting point is 00:09:34 So this is what you have in solid state drives. There are two main technologies: NOR- or NAND-based flash storage. Nowadays, at least in consumer devices, you pretty much only see NAND devices, or NAND-based storage. In this case, we don't have any moving parts anymore, and there are many different interfaces. The first SSDs that came to the consumer market were connected via SATA. Later it was mainly PCIe. Nowadays, in notebooks and also desktops, you
Starting point is 00:10:13 often have this M.2 interface, but it can also be USB. So, different interfaces. And on the right you can see the architecture; we'll talk about it in more detail later. But what you can see here is that on the left there's basically the PC, and then you have the host interface, basically where you connect to or talk to the SSD. And then there is this SSD controller, which has a processor and also a buffer manager.
Starting point is 00:10:39 So there's a little bit of RAM to buffer data, to allow prefetching and other interesting things, and also to buffer writes. And then you have those channels that write to a certain number of flash memory packages. This is where the data is actually stored. But before the flash storage, there's the SSD controller, and this controller does a lot of magic that we are going to talk about today. Yeah, so there has been this work by Gabriel Haas a few years
Starting point is 00:11:14 ago, and he plotted the gigabytes per dollar in this graph. It's a log scale, so please be careful there. But this comparison shows how the amount of storage capacity you got per dollar developed over the past 20 years. What you can see here is that between disk and DRAM there's a pretty wide gap price-wise, but both are somewhat stagnating.
Starting point is 00:11:46 Especially RAM did not really get any cheaper over the past years. HDDs, so disks, are still getting cheaper, and it's a log scale, right? But at a comparatively lower rate than a couple of years ago. And flash, shown in blue here in the middle, just steadily improves. It's getting cheaper and cheaper. It's also getting faster and faster.
Starting point is 00:12:15 So flash is something that we care about a lot, and that most people care about a lot if we talk about hardware-conscious computing and also fast and efficient computing. And one interesting thing is that flash started extremely expensive. The first flash devices you could buy almost 20 years ago were more expensive than DRAM back then. But since then, the trend is quite clear. Right now we see that it's around 20 times cheaper than DRAM for the same capacity, and it is approaching, or might even approach, the price of disks. Here we have a short overview of the typical bandwidth. This is
Starting point is 00:13:10 the maximum achievable bandwidth you can get or can expect for different devices. On the very left you have DRAM. You can get a little bit faster depending on which CPUs you use and how many sockets, but for a single-socket modern CPU you can somewhat assume, if you use all the threads, that you can achieve about 100 gigabytes per second reading and somewhat less writing; that is about the maximum you can expect. Then there has been a technology that we usually talked about in this lecture but we don't talk about anymore, which is persistent memory. The main reason is that it has been discontinued.
Starting point is 00:13:51 So persistent memory was a product by Intel called 3D XPoint, which was basically a DIMM that was almost as fast as DRAM, but persistent. So this is a DIMM that you can put on your mainboard, and it keeps the stored data even when you shut down the server. But it has been discontinued, I think this year, or last year, or two years ago. So it's no longer available on the market. It was quite fast and also had pretty nice latencies. But again, it's discontinued.
Starting point is 00:14:26 Optical devices are extremely slow, as you can see here. We are talking about single-digit megabytes per second. For SSDs, what you have or might be able to purchase for your home PC today is something that gives you around five, six, seven gigabytes per second, which is PCIe generation 4, I think. There's already generation 5, which will be a little bit faster, but most SSDs that you will buy right now have roughly the 5 gigabytes per second that you can see here. And for HDDs, yeah, there's not really a trend any longer.
Starting point is 00:15:06 So those are stuck pretty much at 150 megabytes per second, a little bit faster or slower, depending also on the rotation speed. But there hasn't been real development there, so they are stuck at this speed. Yes? Do you know why they discontinued the persistent memory? Was it not as useful as they thought, or were there some technical reasons?
Starting point is 00:15:30 Many reasons for that. For one, it was not as successful commercially, so they couldn't really scale it to a volume where the price per capacity was where they wanted it to be. They expected it to be much cheaper than DRAM, but they didn't achieve that because the volume just wasn't there. And then there have also been different modes for how to program it, and it was actually not easy,
Starting point is 00:15:59 or it was not obvious, how to program it and what the best way to program it is. Even though it is persistent, the question remains what you actually do with it: it still looks like DRAM, so you can just write to it, or you can mount a file system on it, but then you have other disadvantages and so on. So it was not really obvious what to do with it. Yeah, and this basically led to the situation that not many people bought it, thus the volumes were too low for the price, and so there have been multiple factors why this was not the success that they hoped for.
Starting point is 00:16:39 Here we see an overview comparing HDDs to SSDs. Again, there are also tape and optical, but if we talk about relatively fast computing in modern data centers, we either talk about HDDs or SSDs. We don't have to go through all of this right now, but in the left column you see the main metrics. So if you have an HDD, a single disk seek
Starting point is 00:17:09 will be like two milliseconds. That doesn't sound like a lot, but two milliseconds for many disk reads easily sums up. So two milliseconds is quite slow. And to read one megabyte sequentially, it was 700 microseconds. If you now compare this to modern SSDs, here we have a random read of 16 microseconds. The very fast or the fastest SSDs, like Intel Optane, if they are still produced, I don't know, even achieve single-digit microsecond random reads. But usually what you have is two-digit random reads,
Starting point is 00:17:51 two-digit microsecond random reads, which is, as you can see, much faster than HDDs. And also the bandwidth: the sequential read performance here, with 39 or 40 microseconds, is much better than what you get with an HDD. And what you can see on the left is basically the stack of how an application can talk to data on an SSD. If I/O performance doesn't matter, you basically use the POSIX API that's shown in the user space. That's just the simplest thing.
Starting point is 00:18:30 You probably want to use buffered I/O, so the OS buffers data for you in case you read it again, and also prefetches for you. That's good enough if you interact with an HDD, and if performance doesn't really matter, it's the same with an SSD. It's just a super simple interface, the buffering mostly makes sense, so that's a decent thing to use. If we are in a situation where I/O performance is crucial, this can be different. Then what we usually do in database systems is that we have our own custom buffer manager.
Starting point is 00:19:16 So we don't want the OS in between, we don't want the OS buffering data for us, because we want to know which data is buffered, when data is buffered, and when, for example, dirty data is written out or not. We want to be the one who decides when which data is buffered, so we would use direct POSIX I/O in this case and do custom buffer management.
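As an illustration that is not from the lecture itself, a minimal sketch of such a direct-I/O read on Linux could look like this; the file path, page size, and error handling are placeholder assumptions:

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
  // O_DIRECT bypasses the OS page cache, so the application (e.g. a database
  // buffer manager) decides what stays cached. It requires aligned buffers,
  // offsets, and lengths; path and page size here are placeholders.
  const size_t kPageSize = 4096;
  int fd = open("/tmp/data.bin", O_RDONLY | O_DIRECT);
  if (fd < 0) { perror("open"); return 1; }

  void* buf = nullptr;
  if (posix_memalign(&buf, kPageSize, kPageSize) != 0) { close(fd); return 1; }

  ssize_t n = pread(fd, buf, kPageSize, 0);  // read one page, no OS buffering
  printf("read %zd bytes\n", n);

  free(buf);
  close(fd);
  return 0;
}
```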
Starting point is 00:19:57 For the best SSD performance, we would probably not even use POSIX. We would use the lower access layers that are shown down there in the kernel space. So we would directly interact with the NVMe protocol, which we'll talk about in a few minutes. But the main takeaway right now is: if you really want the best performance for SSDs, you're not going to use the simple read and write calls that you usually use. You really want to go down and use whatever is closest to the SSD, basically. Okay, so as we have said, modern SSDs usually contain NAND storage, and there are different types of cells.
Starting point is 00:20:43 So data is stored in those cells. And the difference is how many bits we have per cell. There are SLC devices, single-level cell, so we have just one bit per cell. And we also have other variants, for example multi-level cell, MLC, which you often see in modern devices, or TLC for triple-level cell. And how many bits you have per cell determines what the
Starting point is 00:21:21 capacity is. More bits per cell give you higher capacity, but also mean that your device wears out faster, and it also has implications on the performance. For that reason, I think you don't usually see SLC SSDs in consumer hardware, but in the enterprise field you still see SLC disks with lower capacities but good performance and long lifespans. And for consumers or for archival it's mostly MLC or TLC, so you have higher capacities and not the highest peak performance, but still quite decent performance. Here on this slide you can see how flash is organized, so how the flash drive
Starting point is 00:22:19 that you probably use in your laptop is organized. What we have here are two example logical units, LUNs, and these LUNs store our pages in the end. So per LUN we have multiple blocks, you see per LUN block one, two, three, and so on, and then within each of these blocks we have the actual pages which store the data. And to access the data within a page, you go through the page register per plane. Can you see that? Yes? Okay. So here, in this LUN, we have two planes. Each of these two planes has a page register. This register is basically the way through which we have to write and read data: the data has to move from a page into the page register, and from there we can read it. And different operations have different latencies. For example,
Starting point is 00:23:28 writes are slower than reads. This is important: even though the difference is much smaller than with HDDs, there is still a difference between writing and reading with SSDs. And the smallest unit for your writes and reads is a page. This depends on the device you have.
Starting point is 00:23:51 So this can be 2 KB, 4 KB, up to 16 KB. And the number of pages per block differs between devices, but 128 or more are common, so often you have single-digit megabytes of storage per block. That's actually important because, when you read and write, you have to be aligned with the page size. So you read and write whole pages. But the important thing with SSDs is that pages cannot be overwritten: you cannot go to a single page and overwrite the content there; you would have to erase the entire block first. We'll come to
Starting point is 00:24:47 it in a minute, but this is why it's quite important to understand that the storage is basically partitioned into these LUNs, each LUN has a certain number, for example 128, of blocks, and in these blocks there are the pages. For writing, we sometimes have to erase the entire block; you can't just erase a single page. But we'll talk about that more in a minute. So how are those LUNs connected? Every LUN is connected by two channels to the channel controller. As you can see on the upper right, there is the shared data path, the data channel, which is basically where the data flows through, and then there is an individual control channel per LUN. And since the data channel is shared, what modern controllers do to hide latency (SSDs are fast, but they still have latency) is pipelining: they try to interleave your requests so that the latency of one access is hidden, as you can hopefully see here, by pipelining those requests, grouping them, and carrying them out in a way that hides the latency.
Starting point is 00:26:35 But there's actually much more going on than just this channel controller doing some optimizations for the data path. One of those things is the ECC engine, the error-correcting code engine. The reason why this engine is so important for SSDs is that SSDs are actually extremely sensitive. There's this characteristic that you don't want to have long runs of ones or zeros on an SSD. Since there's a lot of noise, it's bad for SSDs if you have long chains of ones or zeros. So what this error correction engine does,
Starting point is 00:27:27 besides doing error detection and hopefully correction, is that it basically scrambles the data. So even if you write a long sequence of ones, those ones are scrambled into a somewhat random pattern of ones and zeros. Those are written to flash, and they are descrambled when you read the data again from the engine. How this is done is usually not public, but basically all modern SSDs do something like that. And here, you can see pretty much the entire SSD that we have in modern systems.
Starting point is 00:28:17 On the left, you have the host interface controller, which usually implements the NVMe controller. So this is the interface that you connect to, for example talking the NVMe protocol, and this is where you transfer the data, or the entry point to access data on the SSD. Then you have the firmware. Here we have the FTL, the flash translation layer.
Starting point is 00:28:44 And this does all the page mapping, wear leveling, garbage collection, and also the scheduling. So this is basically the secret magic of most SSDs, and it is responsible for the performance: where to write data, to which LUN, how to read it, how to schedule all the reads and writes so that you get the maximum performance. And then you have the storage controller that directly interacts with the pages and performs the scrambling, the error correction, and so on. Here we can now see what we already discussed: write and erase. A very simplified example here, but say we have two blocks, and each block has only
Starting point is 00:29:33 four pages. On the left you see block 1000, where three of the four pages are already used, so we have pages X, Y, Z, and block 2000 is completely free. Now we want to write a page, so we want to update page X, just change some data in this page. And as we've already said, we cannot update a single page in place, so we cannot directly update page X where it is. What we do here is that we write the new version of X to this free slot there, and
Starting point is 00:30:15 then we only invalidate the previous version of this page X, saying, okay, this page is no longer valid. The data in this page slot is no longer valid, and the page mapping, so the disk, now needs to remember, not the application code, but the disk now needs to know: okay, this page X that I'm supposed to store is no longer at offset zero,
Starting point is 00:30:42 it is now at offset three. Now, at some point, we want to get rid of this previous version of X, because we can't use it, it's outdated data, it's invalid. Since we can't erase a single page, only whole blocks, what we need to do is basically copy over the block's remaining valid data, without X, to the other block. So now we have free slots that we can write to
Starting point is 00:31:10 in block 2000, so we can again use slot zero. But again, to reclaim that slot zero, we have to erase the entire block. And this has quite some implications for writing, write amplification and so on. So this is really something that SSDs heavily optimize for, because, as you can see, this can be quite an amount of work if there are small writes to the same regions.
Starting point is 00:31:46 And as you might already know, the problem that we have here is called write amplification. A single write, so just changing a few bytes of an existing page, can lead to much larger costs, even though the write is pretty simple. On the one hand, this can be because we have a pretty large page size on the device, so even if you just write very small data, it can still mean that we have to write 16 kilobytes. And on top of that, we might have to do some block erasure and move all the valid pages to be able to erase the block.
Starting point is 00:32:30 If we, for example, need to make some space, we need to find invalid pages, but we also need to move all the valid data out of this block before we can entirely erase it. What we can do here, or what most systems actually do to avoid this write amplification, is that we align writes to the page size and try to organize our writes in a way that avoids those scenarios.
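As a back-of-the-envelope illustration, not from the slides, with made-up device parameters (16 KB pages, 128 pages per block) and the worst case where a whole block has to be relocated because of one 4 KB update:

```cpp
#include <cstdio>

int main() {
  const double page_kb         = 16.0;   // assumed flash page size
  const double pages_per_block = 128.0;  // assumed pages per erase block
  const double host_write_kb   = 4.0;    // the application only changes 4 KB

  // Worst case: the 4 KB update dirties one 16 KB page, and reclaiming the
  // stale page later forces the device to relocate every remaining valid page
  // of the block before the block can be erased.
  const double flash_written_kb = page_kb + (pages_per_block - 1.0) * page_kb;

  printf("write amplification up to ~%.0fx\n", flash_written_kb / host_write_kb);
  return 0;
}
```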
Starting point is 00:33:12 And the device then does a lot of magic on top of that to avoid this: it moves around pages that are written over and over again, not by moving all the blocks again and again, but so that the block erasures can be optimized and single blocks are not worn out very early. Because, again, flash is pretty sensitive, so you want to distribute the wear and maximize the lifetime by shifting the block erasures around over the entire device.
Starting point is 00:33:53 And the component that does most of this is the flash translation layer, FTL, which is part of the SSD controller. This part has the mapping from the page that you say you want to store to the actual physical location on the flash. And as we have seen in the previous example, this mapping needs to change when we update X and store it somewhere else. This mapping basically needs to be updated because the location of this updated page has changed. And this mapping is usually stored
Starting point is 00:34:40 in DRAM, in the RAM of the SSD, but it's also persisted in case there's a power outage, so that the SSD can restore this mapping. For performance reasons, while running, it is kept in the RAM of the SSD. And what the FTL also does is garbage collection and wear leveling. So how does a very simplified FTL look? We have one free flash block open per LUN, so we can write our data there. We try to buffer our writes so that we can, for example, group them, and to simplify it we can, for example, do the logical-to-physical mapping round robin, so that we distribute all our writes round robin over the LUNs and the entire device. Then, when the buffer has a full block, we can flush it and send it away, so we no longer have those many single writes. This allows us to maximize the channel utilization, so it gives us more performance and also helps a little bit to counter some write characteristics of SSDs. Then, we don't have updates in place; updates, as shown with the little cross in the previous example, invalidate entries.
Starting point is 00:36:16 So when there's an update, there's a new page for it, the old page is invalidated, and the mapping table needs to be updated. And for garbage collection, we also need to remember that, so that at some point in time we can use the space of those invalid versions again. So there needs to be some garbage collection that runs from time to time, collects the invalid pages, and reorganizes the storage.
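To make the mapping idea concrete, here is a toy sketch that is not how any real FTL is implemented; the sizes and the simple append-only placement are invented for illustration:

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

// Toy flash translation layer: logical pages map to physical pages, updates go
// out of place, and the old physical page is merely marked invalid. Real FTLs
// additionally pick LUNs/channels round robin, buffer full blocks, run garbage
// collection, and persist the mapping table.
struct ToyFTL {
  std::unordered_map<uint64_t, uint64_t> l2p;  // logical page -> physical page
  std::vector<bool> valid;                     // validity bit per physical page
  uint64_t next_free = 0;                      // append-only write pointer

  explicit ToyFTL(uint64_t physical_pages) : valid(physical_pages, false) {}

  void write(uint64_t logical_page) {
    auto it = l2p.find(logical_page);
    if (it != l2p.end()) valid[it->second] = false;  // invalidate the old version
    const uint64_t phys = next_free++;               // out-of-place write
    valid[phys] = true;
    l2p[logical_page] = phys;
  }
};

int main() {
  ToyFTL ftl(1024);
  ftl.write(17);  // first write of logical page 17
  ftl.write(17);  // update: the old physical page becomes stale garbage
  printf("logical page 17 now lives at physical page %lu\n",
         static_cast<unsigned long>(ftl.l2p[17]));
  return 0;
}
```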
Starting point is 00:36:51 And since we don't do updates in place, that means we always need to over-provision. So the SSD that you see in a system with a certain number of gigabytes actually has more space, just to account for the fact that it often tries not to rewrite an entire block immediately when you have updates, and to counter the fact that some pages will wear out over time. So this is something that the SSD already accounts for, that things are not getting better over time. And this garbage collection is part of the SSD controller
Starting point is 00:37:44 and ensures that stale pages are erased at some point in time and are free again so that we can write to them. We do that because of those P/E cycles; P/E stands for program and erase, so program is basically writing, and then there's the erase. All the pages in an SSD have a certain limited number of those P/E cycles, usually in the thousands to hundred thousands, so less than a million, and after that they will at some point fail. This is hopefully something that your error correction code engine will catch, and there will be enough free space that can be used then.
Starting point is 00:38:31 But if you write a lot, it will happen that those cells wear out. And now the garbage collection tries to counter that, to avoid the scenario that some areas are heavily used, accumulate many P/E cycles, and so wear out earlier than others. What modern SSDs offer and what operating systems use here is TRIM. What is that? If you talk about HDDs, the OS, the layer above the storage, does not necessarily tell the device when data is deleted.
Starting point is 00:39:26 So when you tell your OS to delete a file, the OS does not really delete the pages and write something new there. Usually just the pointer saying that some data is stored at this page is invalidated or marked as deleted. So the HDD does not really know that the data is deleted and no longer accessed. The OS might later overwrite it with a new file, but it is not deleted in a physical sense. That's why you usually can easily recover deleted files: physically, the data is not deleted. For SSDs that can be a problem, because they want to know if large parts, entire blocks for example, are totally unused. Because if entire blocks
Starting point is 00:40:19 are unused because a large file no longer exists, then that means they can be used for garbage collection. We don't have to copy blocks before we can write new pages there; we can just take the entire block, erase it, and use it as free space again. So this is the TRIM process. The idea is that the OS basically tells the device: okay, these pages are no longer valid, feel free to do whatever you want with them, the OS is no longer accessing these pages. And now the SSD can use this information, for example as part of the next garbage collection run, to optimize its storage, which is usually done asynchronously, so it's not done immediately.
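As a hedged illustration of how a discard looks from the host side on Linux: normally the file system issues these for you (for example via the discard mount option or a periodic fstrim run); the device path and range below are placeholders, running this needs root privileges, and it would destroy data:

```cpp
#include <fcntl.h>
#include <linux/fs.h>   // BLKDISCARD
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int main() {
  // Placeholder device; a discard tells the SSD this range is no longer needed.
  int fd = open("/dev/nvme0n1", O_WRONLY);
  if (fd < 0) { perror("open"); return 1; }

  uint64_t range[2] = {0, 1 << 20};  // {offset, length}: discard the first 1 MiB
  if (ioctl(fd, BLKDISCARD, &range) != 0) perror("BLKDISCARD");
  // The SSD may now reclaim these pages during a later garbage collection run.

  close(fd);
  return 0;
}
```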
Starting point is 00:40:57 As we've already said, there is wear leveling: NAND flash cells have a very limited lifespan, again something like thousands, 10,000, 100,000 cycles,
Starting point is 00:41:26 depending on the device. We have seen before that SLC devices, the single-level cell devices, usually have much higher P/E counts than the devices with higher capacity. And if you write to a single block over and over again, you will quickly exceed the limit. So the SSD tries to counter that. There is dynamic wear leveling.
Starting point is 00:41:54 That means that the mapping from the logical block to the physical block changes, meaning that we write to a new location. So if data is frequently written, we move this frequently written data around so that we don't have a hot spot where we easily exceed the P/E cycle limit that we don't want to reach. And then there's also static wear leveling,
Starting point is 00:42:25 which basically means we move data around the entire SSD. So not only the frequently written data is moved around, but all the data is shifted and shuffled around so that we have a somewhat even distribution of writes on the SSD. Here we see the data path of an SSD. On the very left, there's the host controller, the host memory.
Starting point is 00:42:56 So this is your server. Now we contact the host interface controller, which has a data buffer. This is the buffer that buffers data in the SSD's RAM. There is some caching policy there: when to store, which data to keep in this DRAM, and when to write it out or keep it in the cache. Then we have the FTL, and the flash translation layer does this mapping between
Starting point is 00:43:28 logical and physical pages. And what comes into play here is the data placement and the garbage collection policies: where do we actually keep those pages, when I have to update this mapping, where do I place the new page, how can I optimize the performance, and how do I keep track of the wear leveling of my SSD? Then, from there, we have the low-level scheduler. Now we are going to the LUNs, and the controller basically needs to say: okay, I have, for example, a certain number of read requests, how can I schedule them to optimally use my channels to my LUNs? There's a shared data path, so I need to think about how to
Starting point is 00:44:20 schedule these requests to get the optimal performance, and since I can only read from a certain page at a time, I have to time that in a manner that gives me the best performance. And what also comes into play here is correctness and longevity, which basically means we scramble the data and also check whether data has been corrupted. Okay, so this was it about SSD internals, and you don't have to remember all the aspects,
Starting point is 00:44:58 but the important thing is that it's pretty complex to interact with an SSD. All the things that go on in an SSD that you don't necessarily see from the outside are pretty complex. The SSD internally handles parallelism, error correction, and all those many things that make flash what it is today.
Starting point is 00:45:21 So it has to care about wear leveling, about the performance characteristics, and so on and so forth. It's really a complex thing, and not every SSD is the same. That also means that it's really hard to optimize for SSDs in general, because the number of LUNs, the parallelism, how wear leveling is done, whether there is TRIM or not, all of this depends sometimes on the OS and then also on the hardware, on the SSD. So it's really hard to fully optimize for an SSD, but there are still some things we can do to get pretty decent performance on SSDs, which we will talk about later.
Starting point is 00:46:09 But it's complex. Okay, so now we talk about the physical storage interfaces. This is an ATX board, a little bit outdated, but it still looks somewhat like what modern motherboards look like. You have the external ports here, so here's a USB port and many older, outdated ports, but then you have your CPU socket where your CPU is placed, you have your memory slots here where you put your DIMMs in, and then for storage we have mainly two connections nowadays: we have the SATA ports here,
Starting point is 00:46:57 which you might still remember, and we have the PCIe slots. There were PCI slots, but nowadays it's mainly PCIe. Yeah, and those data buses come in two fashions. There is the parallel bus, and what we have in the parallel bus is what we see here on the right in a very simple example: here we have eight data links. So in this parallel bus, to send one byte, we can send one bit over each link at the same time and thus send a byte in parallel. So the speed of the parallel link is proportional to the number of bits sent at once, so the number of data links we have.
Starting point is 00:47:52 But we have the problem that if we want to increase the performance, we have to increase the number of data links. And there is a lot of crosstalk and interference between those links on parallel lines, especially if you have longer lines. So the longer the bus gets, and the more links you add for higher capacity, the more problems you get with interference and crosstalk. The alternative is the serial bus. Here we have a single data link that can send only one bit at a time. The advantage is that we have no clock
Starting point is 00:48:38 skew, so it's much easier to use, because we don't have to care about the clock skew of different, parallel transmitted data bits; we have only this one single serial data link. And thus we have fewer cables, which means less crosstalk, and it's also cheaper to produce. So in the end, the serial bus is often superior, or at least nowadays often used, compared to parallel buses. So the first SSDs that you as a consumer might have bought were connected via SATA. SATA means Serial ATA.
Starting point is 00:49:36 This was the way to connect SSDs and was announced around the year 2000. The cable size was much smaller and the cost was also lower compared to parallel ATA. You might remember your old IDE cables to your drives, which were wider and much slower; that was parallel ATA. And coming from a parallel bus to a serial bus, we then had Serial ATA. So this is a serial bus, and it was used for mass storage devices; it supported HDDs, optical drives, SSDs, and so on. It came in three versions. Today we are at version 3, which came out in 2009.
Starting point is 00:50:34 Serial ATA started with 150 megabytes per second, which was enough for HDDs back then. But then, with the increasing performance of SSDs, we needed better connections, so SATA increased the data rate to account for SSDs, which had increased the bandwidth compared to HDDs. The next step to connect mass storage was PCI.
Starting point is 00:51:10 This is the Peripheral Component Interconnect. PCI has one bus clock of 33 megahertz, which can also be lowered to zero megahertz to save power or, in some cases, doubled to 66 megahertz. A transfer has two phases: there's an address phase, where you basically say which address you want, and then in the data phase you receive the data.
Starting point is 00:51:42 And it supported plug and play, so theoretically you can take out these cards and put them back in while the system is running; this was supported by PCI back then. How this was usually connected is shown here: we have a PCI bridge, which is connected to the main memory and to the CPU, and then we see the PCI bus in the middle that connects, for example, USB, your graphics card, and so on. So this is the PCI bus, connected via the PCI bridge. The successor then was PCI Express, PCIe in short, a high-speed serial computer expansion bus from around 2003.
Starting point is 00:52:38 And this is again really the serial replacement of a previous parallel version, PCI, the same story as with PATA, parallel ATA, and Serial ATA. We have the same thing again here. And similar to the cables that got much smaller with SATA, you can also see on this graphic those PCIe slots here at the bottom in light blue: the same performance has been achieved with a PCIe x1 slot, and what you can see is a
Starting point is 00:53:21 much smaller size. It has two channels for each connection, bidirectional, which means that for PCIe x1 you have two channels, for PCIe x8 you have 16 channels, and so on. And how does that look? Now there are point-to-point connections, which means that the devices talk to each other via a switch. There is also an error detection code, which counters a little bit the problems that PCI had on its bus. And the switches, or the connections, can be relatively long,
Starting point is 00:54:09 given what you know from motherboards; these can be up to 50 centimeters, and it's even possible to extend that with another switch. And here you can see that this switch is connected to the bridge chip, but on modern CPUs, for really high performance, you often see that this is integrated into the CPU. So it is part of the CPU then, and not a separate chip on the motherboard.
Starting point is 00:54:38 The protocol stack is what you would expect. There's a physical layer that basically deals with everything you want to transmit. You have a link layer that is also responsible for retransmission: in case an error is detected, it says, please transmit this data again. There's a CRC check, a cyclic redundancy check. Then we have the transaction layer that handles the bus actions, and we have the software layer, which is part of the OS. Here we can see the bandwidths of the different generations. Right now the most recent CPUs, I think starting last year, support PCIe generation 5. So this is the
Starting point is 00:55:31 most recent standard you can buy. And you can see here that with an x16 link we can theoretically achieve up to 63 gigabytes per second. What most SSD vendors nowadays try to achieve, and almost do, is to fully exploit an x4 link, so we have modern PCIe 5 SSDs with data rates up to 13 to 14 gigabytes per second. These are the most recent SSDs we can buy, and they almost saturate the bandwidth of a PCIe generation 5 x4 link.
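As a quick sanity check of the 63-gigabyte figure, assuming 32 GT/s per lane and 128b/130b encoding for PCIe 5.0:

```cpp
#include <cstdio>

int main() {
  const double transfers_per_s = 32e9;     // PCIe 5.0: 32 GT/s per lane
  const double encoding = 128.0 / 130.0;   // 128b/130b line encoding overhead
  const double lanes = 16.0;

  // Bytes per second per direction for an x16 link (before protocol overhead).
  const double gb_per_s = transfers_per_s * encoding / 8.0 * lanes / 1e9;
  printf("PCIe 5.0 x16: ~%.0f GB/s per direction\n", gb_per_s);  // ~63 GB/s
  return 0;
}
```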
Starting point is 00:56:41 Okay, so now how do we talk to our very fast SSD with all its magic in the controller? How do we talk to this thing efficiently, connected via the PCIe bus? And this is the Non-Volatile Memory Express protocol, NVMe for short. This is a host controller interface specification, and it basically says how to efficiently communicate with SSDs that are directly connected to the PCIe fabric. The first specification is more than 10 years old and was led by Intel, and now, pretty recently, we have NVMe version 2.0. What you can see here is basically the lowest level of the OS: we have the storage at the bottom, this is our SSD connected via the PCIe fabric, and then we have the NVMe driver, which has a very efficient interface, very efficiently connected to the SSD.
Starting point is 00:57:34 On top of that, the OS gives you the block layer, so it handles the block management for you, puts a file system on top of it if you want to, and then with the POSIX API you can just read and write files, as you know it, in your application. That's the stack that you usually see in modern servers with modern SSDs. NVMe has several transport modes, depending on what you want to do: you can talk Fibre Channel or InfiniBand and do RDMA, but we keep it simple for this lecture. We only talk about PCI Express and connecting to storage here.
Starting point is 00:58:15 So why is NVMe efficient, and what does it do to be efficient and optimized for SSDs connected via PCIe? As we have seen before, we have lots of potential parallelism in SSDs. The controller can issue many independent requests to different channel controllers, and there are different LUNs that can be accessed in parallel, so there's a lot of parallelism. What we now need to do in order to exploit this potential performance is to have a high number of concurrent requests.
Starting point is 00:59:05 So we don't want to request one page or block, wait for the response, and then issue the next request. We want to have a large queue of requests that we provide to the NVMe controller, which can then be processed by the SSD. And this is what you can see here: this is really a high amount of concurrency, we have up to 64K commands per queue, and we can have up to 64K of those queues. So there are really a lot of commands that we can issue there in parallel,
Starting point is 00:59:45 or at least we can queue them in parallel. And what you can see here is that we have, for example, one I/O submission queue per core. On the very left, we have the controller management: this is the admin submission queue, for example to create or delete queues and so on. And then we have, per core, a submission queue and a completion queue. And how does that work from a programming perspective? First of all, the application, or the host in this case, writes a command to the submission queue. So you have an entry saying, okay, I would like to read page 17 and pages 18, 19, and
Starting point is 01:00:31 so on, and you write this to your submission queue. And then there is this doorbell signal, that's what it is called, which basically signals to the controller: okay, here is my submission queue, this is the last entry that I have written, so please process this submission queue up to this entry. The controller then fetches the items in this queue and writes the results into the completion queue. And when this is done, the controller signals on its side how far the completion queue has been written, and then the host can start to process those items that are now in the completion
Starting point is 01:01:28 queue. Later, the host signals that it has processed all of these, so that this space can be used again, since it is a ring buffer, and the controller can write new completed commands into this completion queue. So we have many of these very long queues to really allow the controller to pull many concurrent commands and process them in parallel.
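The following is a purely conceptual sketch of that submission/completion handshake, with invented structure layouts, not real NVMe driver code:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct NvmeCommand    { uint64_t lba; uint32_t num_blocks; uint8_t opcode; };
struct NvmeCompletion { uint16_t command_id; uint16_t status; };

struct QueuePair {
  std::vector<NvmeCommand>    sq;  // submission queue (up to 64K entries per queue)
  std::vector<NvmeCompletion> cq;  // completion queue
  uint32_t sq_tail = 0;            // host write position in the submission queue
  uint32_t cq_head = 0;            // host read position in the completion queue

  void submit(const NvmeCommand& cmd) {
    sq[sq_tail] = cmd;
    sq_tail = (sq_tail + 1) % static_cast<uint32_t>(sq.size());
    // "Ring the doorbell": the host writes sq_tail to a device register so the
    // controller knows up to which entry it may fetch and process commands.
  }

  void reap_completions() {
    // The controller writes completion entries into cq; the host consumes them,
    // advances cq_head, and rings the completion-queue doorbell so the slots
    // can be reused (both queues are ring buffers).
  }
};

int main() {
  QueuePair qp;
  qp.sq.resize(1024);
  qp.cq.resize(1024);
  qp.submit({17, 1, 0x02});  // e.g. read (opcode 0x02) one block at LBA 17
  qp.reap_completions();
  printf("queued %u command(s)\n", qp.sq_tail);
  return 0;
}
```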
Starting point is 01:02:09 As we have said, there are admin commands, for example for creating or deleting those submission and completion queues. With these admin commands we can also ask the device: what type of device are you, what protocol version do you support, what capabilities do you have, how large are you, and so on. And then we have three I/O command sets. There is read/write, which is the typical NVMe command set. Then we have zoned commands,
Starting point is 01:02:33 where we can partition the data space of our device into zones. And then we have key-value commands. These are basically there to give an alternative to this classical block-wise storage, managing data block-wise. And this key-value command set basically works like a key-value store.
Starting point is 01:03:04 So you don't talk about pages any longer; you can say, okay, I want to write this key-value pair. It's a typical CRUD interface, so you can create it, read it, update it, delete it, and this is supported by NVMe via these command sets. In this case you don't have to care about pages anymore, right? You just have this nicer key-value interface. Yeah, and there's also a stream directive,
Starting point is 01:03:38 which is basically another way to define streams, but we don't go into much detail right now. Let's keep it a little bit simpler. Okay, so the last point for today will be the Linux I/O frameworks. I/O has been in Linux forever, but the question now is how we can use the NVMe protocol, how we can really leverage this parallelism as a programmer. So what
Starting point is 01:04:11 does Linux give us there? The classical case was always POSIX on top of a block device, or a file system on a block device. And what you probably know, or what this block management level gives you, comparable to the virtual memory manager, is that it says: okay, there's a linear array of blocks that you can write to and read from. You don't have to care where those blocks are, where they're located, or how to actually manage them.
Starting point is 01:04:43 This is all abstracted by the system for you. And when you use POSIX, you can just say, okay, read this logical block address, or write it with a payload; it works just fine. And in your applications, what you usually do is map your data structures into files. The issue, however, is that this entire interface is synchronous. When you read or write, this interface is blocking. There's no direct way to issue many concurrent reads, writes, or commands to really exploit the performance of the SSD; this is synchronous.
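For illustration, the classic blocking path looks roughly like this; the file name and sizes are placeholders:

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int fd = open("/tmp/data.bin", O_RDONLY);  // placeholder file
  if (fd < 0) { perror("open"); return 1; }

  char buf[4096];
  for (off_t offset = 0; offset < 4 * 4096; offset += 4096) {
    // Each pread() blocks the calling thread until the data has arrived,
    // so only a single request per thread is ever outstanding.
    ssize_t n = pread(fd, buf, sizeof(buf), offset);
    if (n <= 0) break;
  }
  close(fd);
  return 0;
}
```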
Starting point is 01:05:28 So the first approach that was then introduced to account for modern SSDs was AIO, and this was actually more of an emulation or a workaround. What this did was create multiple threads, and those threads issued the synchronous calls. So those threads were blocked, but you as the application just issued requests to a queue, for example, and in the background the other threads were blocked while you were not. So that faked, basically, an asynchronous-looking interface for you,
Starting point is 01:06:10 so you could issue asynchronous I/Os, but in the background there were multiple threads that actually did synchronous I/O calls. The next step then was libaio, and this really used asynchronous I/O. This works via the io_submit command, and what you provide to this call is also a callback: whenever the call is processed, your read for example, your callback gets called and you know, okay, my data is now there and I can process it. For that to work, you need to bypass the OS file buffer, which you do via the O_DIRECT flag, in case you don't know it. Another limitation was that this was only supported by a certain number of file systems. And one of the main issues was that it required two system calls to do I/O, and since switches between user and kernel space are slow, this can be a problem. It's not directly a problem if your SSD is rather slow, you might not even notice it, but with modern SSDs with very low latencies and gigabytes of bandwidth per second, this will at some point hit you.
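A minimal libaio sketch of that submit-and-reap pattern, with a placeholder file name and without full error handling, could look like this:

```cpp
#include <libaio.h>     // link with -laio
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
  io_context_t ctx = 0;
  if (io_setup(32, &ctx) != 0) return 1;     // kernel queue with depth 32

  // libaio is only truly asynchronous with O_DIRECT (and an aligned buffer).
  int fd = open("/tmp/data.bin", O_RDONLY | O_DIRECT);
  if (fd < 0) { perror("open"); return 1; }
  void* buf = nullptr;
  if (posix_memalign(&buf, 4096, 4096) != 0) return 1;

  struct iocb cb;
  struct iocb* cbs[1] = {&cb};
  io_prep_pread(&cb, fd, buf, 4096, 0);      // describe the read
  io_submit(ctx, 1, cbs);                    // system call #1: submit

  struct io_event events[1];
  io_getevents(ctx, 1, 1, events, nullptr);  // system call #2: wait and reap
  printf("read completed, result = %ld\n", (long)events[0].res);

  io_destroy(ctx);
  close(fd);
  free(buf);
  return 0;
}
```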
Starting point is 01:08:05 So what is the most recent I/O framework, the one that most people use now? It is io_uring. This was introduced a couple of years ago, and it basically has the same idea as the NVMe protocol: you have submission and completion queues. And the trick here is that these queues are shared between user space and kernel space, so we don't have to switch between them as often. As the application, you just write into the submission queue, the kernel then issues multiple of these calls to the NVMe controller and writes the results directly into the completion queue, and you have much fewer switches between kernel and user space.
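A minimal io_uring sketch using liburing, again with placeholder file name and sizes, could look like this:

```cpp
#include <liburing.h>   // link with -luring
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
  struct io_uring ring;
  if (io_uring_queue_init(256, &ring, 0) != 0) return 1;  // queue depth 256

  int fd = open("/tmp/data.bin", O_RDONLY | O_DIRECT);
  if (fd < 0) { perror("open"); return 1; }
  void* buf = nullptr;
  if (posix_memalign(&buf, 4096, 4096) != 0) return 1;

  // Queue one read; many SQEs could be queued before a single submit call.
  struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
  io_uring_prep_read(sqe, fd, buf, 4096, 0);
  io_uring_submit(&ring);                    // one syscall for the whole batch

  struct io_uring_cqe* cqe;
  io_uring_wait_cqe(&ring, &cqe);            // wait for a completion entry
  printf("read returned %d\n", cqe->res);
  io_uring_cqe_seen(&ring, cqe);             // mark the CQE slot as consumed

  io_uring_queue_exit(&ring);
  close(fd);
  free(buf);
  return 0;
}
```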
Starting point is 01:08:30 So this can give you, if you do it right, much better performance than, for example, libaio. You can read that later if you're interested, but the takeaway is that io_uring is the most recent way in Linux to work asynchronously with storage, and it basically also takes the idea of submission and completion queues. Again, you have to use the O_DIRECT flag. And then, going a step further, Intel built the SPDK framework, which, I hope you can read it, is basically an NVMe driver in user space. So this really allows you to do pretty much everything
Starting point is 01:09:32 that NVMe allows you to do, but also with everything that comes with it. You now really need to communicate with this driver directly. It's faster, but not as simple as using io_uring, which in turn is already harder to use than POSIX. So the faster you want to get, the more complex these things generally become. And one of the main aspects of SPDK is that it does not need to go to the kernel at all, and there is zero copy: you don't copy the data from the submission queue somewhere else into kernel space. Even though there are few calls, because multiple requests are sent at once, SPDK avoids this copy entirely.
Starting point is 01:10:26 So there's no copy here. And as an alternative, there is a rather recent proposal shown at the very bottom, xNVMe. This is a user-space library that tries to abstract completely from all of that: whenever possible, it gives you the performance of SPDK, but
Starting point is 01:10:51 still provides a very nice interface. This is shown here; it's very recent, and to be honest I don't know what the current status is, but what you can see here is that you can use the POSIX API to access the block layer, you can use io_uring, you can use NVMe passthrough with io_uring, and you can use SPDK for the user-level NVMe driver. All of that is basically abstracted away, and you have one uniform I/O layer
Starting point is 01:11:24 provided by this xNVMe library. It's a pretty neat idea. Okay, I think two last slides here. There's a nice paper by Haas and Leis from last year, so this is really recent. They measured what the impact of these frameworks is: do we really need those frameworks, and what is the advantage? And there are many interesting things here.
Starting point is 01:11:57 On the very left you see IOPS. So this is not bandwidth, this is I/O operations per second. They used eight, back then very recent, SSDs by Samsung. Each of those SSDs delivers one and a half million IOPS. So what they wanted to see is: can I actually achieve the roughly 12 million combined IOPS if I have these eight very recent SSDs in my system, and how many threads do I need? In other words, what is the I/O depth, so how many requests do I need to issue concurrently, to reach this IOPS number?
Starting point is 01:12:39 And as you can see here, we can achieve those 12 million IOPS, but we need an I/O depth of 3,000 concurrent commands. So we really need many concurrent requests to get this maximum possible performance. This was done using SPDK, and in the middle they also compared the throughput of the different frameworks. You can see that they compared SPDK against io_uring, either polling or standard, and libaio, and they checked how many threads are needed to achieve the full, maximum throughput. And with SPDK, since it is completely in user space and does zero copy
Starting point is 01:13:33 and other things, you can already achieve the maximum throughput with three, it looks like three threads, or four for sure. With io_uring you already need around 32 threads if you're polling, and polling means there's a lot of CPU usage for that.
Starting point is 01:13:57 And with io_uring in the standard way, and with libaio, you're not able to achieve this at all. But still, it's almost 10 million IOPS, so still rather decent. The takeaway is: if you really want to have the maximum performance and want to be efficient, you probably need to go to SPDK, or at least invest CPU into io_uring to get the maximum performance. On the right, I hope you can read it, they compared TPC-C, the database benchmark that we talked about a couple of weeks ago. They compared the performance when data is buffered. On the very left is the in-memory case, so the 160 gigabytes of data are not buffered but fully in memory; we get 1.4 million transactions per second. And if we now have a buffer size of a tenth of that, so 16 gigabytes, with SPDK, on the very right, we achieve 0.7 million transactions per second. So even though we are now on SSD,
Starting point is 01:15:10 we still achieve half of the maximum transaction count compared to in-memory, which is quite impressive, I think. With SPDK, most of the time now goes into transaction processing, and the rest is the eviction in the buffer manager, right, when data needs to be evicted out of the buffer pool. That is something you can't really get around, so you have to invest this time.
Starting point is 01:15:39 For io_uring with polling, I hope you can see it, this is the next best performance, but you see this little violet bar there: that is the polling, so there's CPU invested in polling, and there's also the I/O submission, which takes quite an amount of processing. So not all the processing goes into eviction and transaction processing; we also have to spend time on the polling and the I/O submissions. There you can see the disadvantages if we really maximize throughput for this database benchmark. Okay, so, yeah, time. What did we talk about today? We had an overview of storage: optical, HDDs, SSDs, and even tape. We talked about the stack: when we have modern SSDs, there is the NVMe driver, the block layer, and the file system, all within the kernel space.
Starting point is 01:16:47 And then, on top of that, we have the POSIX API. If we want to be really performant, we bypass multiple layers and can be more efficient. We talked about how solid state drives are built, how much concurrency there is, how much concurrency we need to issue to really exploit their performance, and what they do for wear leveling. And we talked about NVMe and the I/O frameworks. We didn't talk about open-channel SSDs, which are also super interesting, so programmable SSDs, and we didn't talk about Optane SSDs. That is also really interesting stuff, but we just don't have the time to cover it. Tomorrow, not next week, Marcel is going to talk about the Compute Express Link, CXL, and that's it for today. Any questions?
Starting point is 01:17:59 Okay, then thank you and see you tomorrow.
