Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Storage
Episode Date: June 27, 2023...
Transcript
Okay, so welcome everybody. Although we're not that many yet, let's get started today, because I have many slides and I'm not actually going to be able to finish all of them.
But that doesn't matter; some of the stuff towards the end is interesting
but mostly text, so you can basically read it for yourself.
So, today we're going to talk about storage. Mostly modern storage, but I'll also touch on the peripherals.
So usually if we're talking about fast hardware,
we're gonna talk about going across PCI Express,
but there's all other interfaces as well.
We'll talk a bit about this today as well.
So today we're gonna talk about this part.
So the disk in different forms.
And we'll talk about this part.
So the interconnect PCI Express and predecessors and alternatives to that.
So, yeah, not much changed so far.
There might be a few switches, but we're getting close towards the end.
So everything stabilizes, I hope.
So I'll give an overview of storage today and then go into more detail.
The overview will be quite quick, about the different kinds of storage technologies that are in use right now.
Then we'll dive a bit deeper into solid state drives and their internals.
And although you most likely will never be exposed to this kind of detail
while programming something,
it's good to have an understanding
in order to use them efficiently
and to know why they behave the way they behave.
A similar, I mean, if you do a database course,
typically you will learn about the internals
of a hard drive.
And I also have a slide for this, just to know and understand
how you're supposed to use the hard drive. Similarly, we need to know how the solid-state
drive works internally to use it properly. Then we're going to talk about the interfaces, meaning the different ways we connect to storage. Most importantly,
PCI Express. Then I'll say a bit more about NVMe
and the Linux I/O frameworks.
And here, I don't know how far I will get.
So this is basically where then maybe you can look at this
in more detail at home as well.
So this is, I mean, there's lots of details.
This is mainly based on
a tutorial by Philippe Bonnet and Alberto Lerner, and there they have even much more detail on this,
so many more details on how to use SSDs, especially NVMe SSDs, efficiently.
And so, I mean, this is basically, if you really want to deep dive, then go there.
They have, I think, the slides are online somewhere.
If not, you can ask them.
And they also wrote a book about it.
At least I think so. I mean, either it's already published or they will be publishing it soon.
And the other book that I'm using is Structured Computer Organization, again by Tanenbaum.
It gives a good overview, let's say, of the basics of PCI Express, etc.
Okay, so you've probably seen, well not necessarily seen all of these devices, but you've heard of probably all of these devices.
So you have classical magnetic disks, then optical drives,
not really common anymore in most user or consumer systems, but still used a lot.
I mean, if you think about consoles, for example, you typically use optical devices.
Then you have all kinds of different flash devices and you have tape.
So this is what you will currently see in any
kind of data center, in one way or the other. I mean, optical devices are
moving more towards a niche market, but for certain use cases they're
still used a lot. If you need to transport small amounts of
information, you will typically get a DVD or something like that,
because it's quite robust and super cheap, but it's not super fast.
And if you have archival storage, you will go to tape.
But most of you probably never really worked with tape themselves,
or even touched tape.
But this is not the end. I mean, this is what's around right now,
but people are actively searching and researching
for other types of storage.
And two things that people are trying is glass,
for example, and DNA,
both of which are still super experimental.
But using glass as well as DNA,
you can get much denser storage
than currently possible, though right now it's also way, way slower.
So using this, you could have very cheap archival storage,
but right now it's just prohibitively, on the one hand,
I mean, still expensive, just the technology,
but also prohibitively slow.
Okay, so step back.
Who has attended a database course
where they already saw the hard drive details?
Not everybody.
Okay, then let me quickly dive into this. So, okay, hard drives, you still have them
today a lot. Just because they're right now the, let's say, the fastest active,
not the fastest, the cheapest active storage. So if you want to store large amounts of data
and want to work with the data in a reasonable amount of time,
then hard disk drives are still a good way to go.
Because the per-gigabyte price is just still cheaper
than any kind of SSD,
and so much cheaper that it actually makes sense to use them even
though they're considerably slower. So they have larger
capacity than SSDs still, they have a high read speed, but they have very slow
random access.
So the idea is, I mean, you can basically see this here, this is the internal setup. Any kind of hard drive
will look something like this.
You have these kind of platters and multiple of those
in a stack, so from 3 or
I mean old drives might have had two up to six, something like that.
And then they have two sides and there's kind of an arm with a sensor or actuator that actually reads the platter.
And the arm, like all of these arms, like there's basically two sides and all these arms work in parallel.
They don't move independently and that means whenever we want to read something, this arm first has to go to the right track.
The platter is basically built up or segmented into individual tracks, and these are then again segmented
into different kinds of sectors. So we're basically splitting up the
platter like a pie, and then we have individual track sectors which we can
read at a time and of course all of the arms work in parallel,
so we're not just going to do this on one platter,
but on all platters in parallel,
in order to get the higher bandwidth.
Otherwise, we would just read one platter at a time,
which of course does much less than just reading everything at a time.
And because of that,
because we want to read all of them in parallel, we also call these tracks, stacked one on top of the other, a cylinder,
because we're basically going through all of the platters at a time.
And everything that's close to each other, or within a track, we can read fast,
basically just at the rotational speed of the disk. And the rotational
speed is something between 5,000 and 15,000 rotations per minute, and this
is a physical limit, right? So if we go much faster than that, at a certain point the disk
will basically explode, because of the spinning there will be a lot of forces at the edges of the
disk pulling it apart. And I mean, this is well within safe bounds, but
if you go much faster, then the physical limit is there, and
you would have to build it so much stronger that it's just, again, not really cost-effective
anymore.
So, finding something basically means we have to move the arm to the right track and then we have to rotate until we find the right position
and that's basically 4 to 9 milliseconds and that's a lot.
So that's basically where time goes and then we can read at something like 100 megabytes
per second depending on the type of drive.
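Just to make that concrete, here is a tiny back-of-the-envelope sketch in C; the 7,200 RPM, 4 ms average seek, and 100 MB/s numbers are assumed, illustrative values, not numbers from the slides:

```c
#include <stdio.h>

int main(void) {
    /* Assumed, illustrative numbers for a spinning disk. */
    double rpm           = 7200.0;
    double avg_seek_ms   = 4.0;     /* arm movement to the right track */
    double transfer_mb_s = 100.0;   /* sequential read rate */

    /* On average we wait half a rotation until the sector passes the head. */
    double avg_rotation_ms = 0.5 * 60.0 * 1000.0 / rpm;            /* ~4.2 ms */
    double transfer_4k_ms  = 4.0 / 1024.0 / transfer_mb_s * 1000.0;

    double total_ms = avg_seek_ms + avg_rotation_ms + transfer_4k_ms;
    printf("random 4 KiB read: %.2f ms -> roughly %.0f IOPS per drive\n",
           total_ms, 1000.0 / total_ms);
    return 0;
}
```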
So it's a bit faster if they spin faster; a 4,500 rounds per minute drive will actually be slower
than that. And then these are typically connected via either Serial ATA or Serial
Attached SCSI. And I'll come to some of these abbreviations later.
So I mean, somehow people built these interconnect names
out of nested abbreviations.
So Serial ATA is short for Serial AT Attachment,
and there the AT appears to be short
for Advanced Technology.
So the abbreviations are kind of packed one into the other.
And Serial Attached SCSI
is then Serial Attached
Small Computer System Interface.
And this you don't need
to remember. You just need to know
like, I mean, SATA is something that
you will find every now and then.
SAS is another type.
But modern drives will be connected through PCI Express,
so PCIe, typically.
Okay.
It's a physical device, so it actually needs to spin,
but also if it's powered down, it doesn't have to spin.
It uses no power at all.
But typically, I mean, if you shut it down,
spinning it up again will give you an even much higher delay, so in an active setup you will keep
it spinning all the time. Okay, so this is the go-to storage for cheap active storage. So, if you have a large Hadoop cluster, whatever,
or S3, cheap, large storage,
in some kind of RAID fashion,
you will go with HDDs,
or you will find HDDs today, basically,
just because they're so cheap.
And, yeah, cheap, period. But they're somewhat slow.
If you have even larger amounts of data,
or it's really about, let's say, archival storage,
so you don't need active access,
you just need to make sure that somewhere you store this away,
then you will go to tape today still.
Because here tape is even cheaper than disk.
And it also has this kind of nice feature.
You can actually remove the tape.
I mean, of course, you can also remove the disk, but these tapes typically come in some kind of robot setup where there's an arm that puts the tape in
and out, so there's no actual connection to the tape while it's not in use. So this is
really good for archival storage because it gives you quite a safe setup. So if the tape
is out, nobody can touch it, nobody can change it. Well, say if you
have like your cold storage on hard drives, if the system is corrupt, something like that, then
that might still be overwritten somehow, but the tape is sort of safe. It also has a long lifespan
because if it's taken out, nothing is done with it, right? It doesn't need to spin, nothing. It's quite durable. It's like this small plastic cartridge and it's quite energy efficient.
And so currently there's basically one format that you can use, and this is the Linear Tape-Open, LTO, setup, and currently we're at generation 9.
One of these cartridges is about this size, right? So this is like a
floppy used to be earlier. So it's not large, I mean it's a bit thicker, but it's quite a small thing, and it has 18 terabytes of capacity,
and you can read and write it basically 20,000 times,
no, passes, so that's 20,000 end-to-end passes, but actually you need to...
It's kind of organized in a complicated way where there's multiple bands on the tape
and then different tracks and wraps.
So it's kind of a hierarchical organization where you can
basically read both sides, like upper and lower parts, and in those then you have
multiple, let's say tracks again. And that also means you need to actually go
through the tape multiple times in order to fully read and write it.
And you can see, I mean, it's still,
there's still new developments here with,
well, as I said, up to 18 terabytes native.
And of course, if you compress, you can get even more there. And there will be new versions,
which are supposed to go even up to 144 terabytes at some point, Gen 12 would be.
Tricky part is there's currently actually only one manufacturer, which is always a bit of a problem.
I mean, there are multiple distributors, but in the end, it's really being manufactured by only one company.
And that means if there's a problem in that one company,
you're screwed if you have to work only with this.
But of course, you can also buy these in bulk for some time.
And they're cheap, as I said.
So this is a good thing for archival.
And actually, we also have this.
So when we go to the data center,
you can actually see there's like a small window in the system and you can see that there's the
tapes left and right. And if Tobias has a good day, then he will basically show you how the arm
moves around and takes one of these cartridges in and out or even he can show
it to you.
It's an open tape, meaning it's not like a cassette or like a VCR thing, where you have
two rings or something, but it's really like one long
thing that you can just pull out
and that will then basically
be pulled into the machine
back and forth in order
to read and write it.
Yes?
How is the data compressed,
on the chart on the right?
And what does the
logic have to do with the data compression? So basically, the drives already have a certain compression built in.
And how that is done, I don't know, to be honest.
So there is basically, let's not say native, but they already have certain standards how to compress.
And then this is an average.
So of course, if you write random data to the drive,
there is not much to compress.
And then you will end up with only the native capacity.
But that's something that they basically assume.
So you can basically say, I want to use this drive in compressed
mode, and then it will be able to write more data.
Okay, then, well, there are still optical devices.
So if you get an X-ray, for example, this is what you will get to take home with you,
or if you play PlayStation at home, this is what you will have,
something like this. So they're nice for transport because they're easy,
they're not heavy, and they're not too fragile. They are somewhat fragile, but not
too fragile actually, and they're cheap. But there are CDs, DVDs and Blu-rays, and they're
actually slow in read and write speed. But I mean the Blu-ray for example
already also has not a bad capacity but of course nowhere close to what you
get with tape. So for archival storage, it doesn't really make that much sense.
But if you want to move some data in the gigabyte range,
then this is actually a good alternative.
So because, yeah.
And it's quite durable.
So if you write the CDs yourself or DVDs,
they can last up to 10 years.
If they're manufactured properly, they can last much longer than that.
Okay.
More towards the future, right?
So this stuff has been around for a long time,
but now we're seeing Flash taking over everything.
Not everything, but let's say becoming the storage for active data processing. This type of storage, in a way, when we talked about how this is
done internally, it's similar to the persistent memory setup,
slightly different, but in general the concepts are somewhat similar in terms of how the physical or electrical circuitry underneath works.
But these are even more dense than what you get with persistent memory.
And in this case, this is typically NAND-based.
So if you remember quite early how we can build caches, etc.
So these are these NAND gates typically.
The flash devices have no moving components, unlike disks or optical devices, which means
there is basically less, they're less prone to problems with moving or shutting the server on and off.
So which often is a huge problem with hard drives.
So if you have hard drives and you have a power outage, for example,
so we have to shut down the data center because it's overheating.
Then if you spin it up again, usually a couple of hard drives will be broken.
This is just because they have to stop, they have to power off, then they have to be turned on again.
So there's just some mechanical wear and tear, which then basically eventually breaks the drive.
And so this is not so much a problem with the flash drives.
But they're much more complicated inside.
They're all chips, right?
And you use different granularities for reading, writing and erasing.
Unlike the magnetic devices, where while you're reading you can overwrite, and you're basically using the same granularity for reading and writing.
And writing is basically updating, which means you also don't have to delete first and then write something new. Here, in order to write something new, you
actually have to erase a whole amount of space in order to get space for new
writes, basically. I don't want to go into too much detail about how all of this works,
because I have a couple of slides explaining all of this in detail.
But something that's important is that there are many different form factors
and there are also many different speeds.
You can see there's a lot of logic in here, a lot of circuitry, and
depending on the type of device this will dramatically differ in how complex
and how performant it is. I mean, back here, the flash memory packages, so the actual
storage cells, they are somewhat similar, but then how these
are connected and how much parallelism you can use will be very different if you have
a smart PCI Express drive or something like this, which would be connected
probably either via this small SATA or M.2 connector, which is again a PCI Express type of interconnect,
or let's say a USB stick, where then all of the details,
like the flash controller, etc., in here will be much simpler
and much less efficient.
Also because the speed that you get through USB will be much less.
Okay, so just as some fundamental trends, right?
So why does it actually make sense or why is there a shift in how we use this, right?
So like not too long ago,
Flash was actually quite expensive.
And I mean, while it was fast
and basically people
started using it or manufacturers started using it and speeding up certain
applications where the speed would actually make such a difference that
that you pay an extra price on that, we can see that the prices are actually
going down a lot. So here you can see, this is from a paper
by Viktor Leis. You can see that here we have the GB per dollar.
So basically higher here means we're paying less per GB. And you can see
while disk is kind of flattening out over the time, because the discs are not getting much larger anymore.
So here we're kind of also, I mean, there's new technology,
basically moving or layering the storage in the disc as well.
So rather than just saying, okay, on each position where my arm is I'm just
going to have one bit, I can basically interlace and interleave stuff, so I can more densely pack
things and get more storage. But you can see, I mean, this is a log scale, so
a somewhat flattening curve doesn't mean there's nothing going on, it's still increasing
what you get for your money, but not as much anymore as in the past. The same
is true with DRAM. I mean, there we see kind of a seasonal pattern, and we know there was
some crisis here and there, so prices went up and down. But with flash we can see this is
still increasing, meaning flash is getting cheaper and cheaper. And well, at this point,
basically we are with flash where memory used to be in terms of performance, and at this point flash is
basically 20 times cheaper than DRAM.
So we're actually getting a much cheaper storage while
still being quite fast compared to DRAM. Which, in the past, and that's
not too long ago, right, so basically 20 years ago, flash was the same
price as or even more expensive than DRAM.
Okay, so looking at kind of bandwidths, and I mean, we basically have to see this
on a per DIMM basis or per unit basis.
With DRAM, we know we can get something like
up to 100 gigabyte per second.
So typically like modern Intel servers,
if you fully pack them, this would be 48 GB.
But it's just going to be one DIMM.
And then you have many of those, meaning per CPU or per socket,
you can have eight of those, for example.
So that means, like we had this two lectures ago,
it means something like close to 400 gigabytes per second for DRAM,
if you just fully pack it.
PMEM, factor of three, four, or two to four less,
but still quite fast.
Optical, much, much slower, right?
Because you need to move the medium, so it takes a long
time to read it, and it also depends on what kind of drive, of
course. But then we're looking at SSD and HDD, which is basically what we're
mostly interested in, and again per device, right? So not the maximum speed,
but per individual device: a hard drive will give you up to
250 megabytes per second. And similarly, for read and write bandwidth in an SSD, a single device will
give you up to one gigabyte per second, even a bit more if you have very fast devices today.
And that's just going to be one device, but you can basically pack multiple devices
into a single server.
Okay.
So moving from, so I mean, important here
is that we're still like considerably slower than DRAM,
but we're catching up.
So this is not the same kind of factor that we had with hard drives.
So if we package multiple SSDs into a server,
then we're much closer to the bandwidth.
And the other thing which is important for us, the latency will be much
lower than the latency with a hard drive.
So this is also something that I actually have here.
So in the SSD, we're something in the microseconds, so low microseconds for the latency, and somewhere in the hundreds of microseconds up to a millisecond for reading a megabyte,
while for the disk seek, we're in the milliseconds, typically.
And this is really just because the arm has to move
and the disk has to spin to find the right position.
So this random lookup is basically what's killing disk.
And again, this kind of heads up here, an SSD, this is not a uniform device, right?
So SSDs come in many different form factors, with many different interfaces.
And while we're getting closer to some kind of consolidation,
so when I started working with SSDs,
it was completely unclear how this will go.
But now, at least we have interfaces
where we can use these devices somewhat natively,
rather than looking at them just like a hard drive.
There's still a lot of variables.
It's not clear.
It's not standardized or there's not one way how we typically interface with those.
If I/O performance does not matter, and that's the case in many applications, then typically you will do everything in POSIX, just the Linux standard way of accessing files.
So you have your standard block interface with some kind of buffered I/O.
I mean, you're not just going to synchronously read everything
from the application, so there will be a bit of buffered I/O,
but in general it's block-based,
and the individual reads will then still be synchronous.
However, this is not efficient.
It's not even efficient on modern HDDs, but it's for sure completely inefficient if we're talking about SSDs.
So if we want to have fast I.O. with a hard disk drive, then we're going to build something like our own buffer management, as you have to do right now, on top of POSIX IOs.
On the disk, the disk is block-based, so we'll still use this block-based interface and then
directly communicate with that in order to be fast enough. And if we want to use SSD efficiently and we have a fast SSD,
then we cannot use POSIX.
So then we have to do something else.
And there's new standards like NVMe.
So then you need to really use a different type of protocol to talk to the SSD and really do asynchronous accesses to the SSD in order to fully utilize the parallelism that's in these systems or in these devices.
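To give an idea of what such asynchronous access looks like from the application side, here is a minimal sketch using Linux's io_uring via liburing. The file name, queue depth, and 4 KiB block size are just assumptions for the example, and error handling is kept minimal; it's a sketch of the idea, not a production read path.

```c
/* build: cc uring_read.c -luring */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define QUEUE_DEPTH 32
#define BLOCK_SIZE  4096

int main(void) {
    /* O_DIRECT bypasses the page cache; buffers must be 4 KiB aligned. */
    int fd = open("data.bin", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    struct io_uring ring;
    io_uring_queue_init(QUEUE_DEPTH, &ring, 0);

    /* Submit many reads at once so the SSD sees parallel requests. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        void *buf;
        if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE) != 0) return 1;
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, BLOCK_SIZE, (off_t)i * BLOCK_SIZE);
        io_uring_sqe_set_data(sqe, buf);   /* remember which buffer this was */
    }
    io_uring_submit(&ring);

    /* Reap the completions; they can arrive in any order. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        if (cqe->res < 0) fprintf(stderr, "read failed: %d\n", cqe->res);
        free(io_uring_cqe_get_data(cqe));
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```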
Okay, so yes.
If you have a SATA drive and performance is crucial?
Well, I actually don't know what kind of interfaces you can get on top of those. I mean, for sure, yeah, that I
would have to look up, if there's something special. For sure you can
also do something in user space rather than using POSIX,
and try to
have as many parallel accesses as possible to fully utilize the device.
But if there's a standardized way or let's say a good best practice to do this, I don't
know, to be honest.
So I would have to look it up.
I don't know if there is something like NVMe over SATA.
I don't know. I mean we can check. I have some stuff on the throughputs of the different interfaces, also SATA.
There I would have to look up.
OK, so let's talk about solid state drives internally more. So you get an idea of what these actually look like internally
and how they process your accesses to this drive.
As I said, this is NOR or NAND flash memory.
So flash memory is either NOR or NAND, and in SSDs it's typically NAND cells.
And then there are different types of cells, and these cells can store either one bit, or two,
three, or four bits. So you can have single-level cells, and double-,
triple-, or quad-level cells. And if you want to have high performance, then
you're going to go with single-level cells or
double-level cells, so-called multi-
level cells. If you want archival storage, so you want your storage packed more densely, you're going to go with
triple- or quad-level cells. And this is basically
not your choice, right? So you're not basically picking
how is my SSD structured internally, but you're going to buy either one or the other. Yes?
What's the difference between NOR and NAND flash memory? Does it come down to the gates?
Yes, exactly. That's what it boils down to. So basically, how are the gates structured?
Like really, what's the circuitry in the chips, essentially?
How do these things exactly work?
I mean, it's not really...
It's basically the physical manufacturing,
how you build up these cells individually.
If this is kind of logically a NOR or a NAND gate.
And here you have an overview. With a single-level cell, on each block or each die, you can get like 8 gigabytes.
And you have a read latency in the microsecond range.
You have a programming latency of 100 microseconds, and the page size will be two or four
kilobytes. And then we can see that for triple-level cells and the other types, you will get much
higher latencies, but you will also get much denser storage. And, I mean, the
physical
details are something
that you would have to look up.
Also, then, at a certain stage,
a lot of the details are
kind of
trade secrets,
as soon as we're
looking too far into the
storage. But it's just to get an overview here.
Important, something that's relevant for us still
is this page size down here.
So this is something to remember.
Based on this page size,
this is basically your access granularity still to the SSD.
So while we don't want this kind of big block-based device,
we're still not reading individual bytes or bits.
We're reading in pages, just as we did for memory, et cetera.
But here, these are again somewhat bigger
and typical performance devices will have two
or four kilobytes and often 4 kilobyte pages.
So that's just something to remember.
If they're larger, that means like for each individual access we're going to read more data.
So going through the device setup in more detail.
So the flash internally is organized in logical units that will store individual pages.
So each 4 kilobyte page, for example, will be stored in a logical unit or LUN. And then these logical units also have registers, which is basically where the actual data
will be served from.
So it's stored somewhere internally in these blocks, your individual pages.
But then if you want to read it, it needs to go into this register in order to be read
out. And then you will have multiple of these logical units
that will then basically form your storage.
And how you're basically reading towards these registers
and how you're organizing this, again,
is kind of a performance or a throughput latency trade-off.
And the smallest unit for read and write, as I said,
it's a page size and that's, again, hardware dependent.
And then in these blocks, basically,
will be 128 or 256 pages.
And for the pages,
we have three different kind of operations or within a logical unit.
We can read and write and erase.
And so we can basically read any kind of page at any point in time.
If there's free space, only if it's empty, we can write.
And if not enough space is there anymore,
so if our page is kind of full,
or we want to delete the page or something like that,
then we have erase.
But erase operations will not be on the individual page level,
but will be on the block level.
So not on the complete logical unit level,
but say here, if this is full and half of it
needs to be deleted somehow, it's basically outdated data, for example, then we want to
erase this block in order to reuse it, then that basically means all of this data can
later on, or all of these pages can later on be written again.
So I'll come into a bit more detail later on.
But important, reading and writing is on an individual page level.
Erasing is on a whole block level.
And as I said, block means 128 to 256 or 256 pages typically.
It could of course also be different if the manufacturer decides
I want to have blocks with 1024 pages for example.
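To make the different granularities concrete, here is a tiny toy model in C, using the page and block sizes mentioned above; everything else is invented for illustration. Pages can be read any time and programmed only once; freeing them again only works by erasing the whole block.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE        4096   /* read/program granularity */
#define PAGES_PER_BLOCK   128   /* erase granularity: the whole block */

enum page_state { PAGE_FREE, PAGE_VALID };

struct flash_block {
    enum page_state state[PAGES_PER_BLOCK];
    char data[PAGES_PER_BLOCK][PAGE_SIZE];
    unsigned erase_count;
};

/* Programming is only allowed on a free (erased) page: no update in place. */
bool program_page(struct flash_block *b, int page, const char *src) {
    if (b->state[page] != PAGE_FREE) return false;
    memcpy(b->data[page], src, PAGE_SIZE);
    b->state[page] = PAGE_VALID;
    return true;
}

/* Erase always wipes all pages of the block at once. */
void erase_block(struct flash_block *b) {
    for (int i = 0; i < PAGES_PER_BLOCK; i++) b->state[i] = PAGE_FREE;
    b->erase_count++;
}

int main(void) {
    static struct flash_block b;          /* zero-initialized: all pages free */
    char page[PAGE_SIZE] = "some data";

    printf("first write ok? %d\n", program_page(&b, 0, page));  /* 1 */
    printf("overwrite ok?   %d\n", program_page(&b, 0, page));  /* 0: must erase first */
    erase_block(&b);                      /* the only way to free page 0 again */
    printf("after erase ok? %d\n", program_page(&b, 0, page));  /* 1 */
    return 0;
}
```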
Basically the logical units are then connected to a controller, which is like a channel controller.
So there are basically two channels that go to each of the logical units.
There's a control channel and there's a data channel.
And the data channel is shared, while the control channels are individual. And the control channel basically says, oh, I want to read this page, or please erase this page, etc.
So this would be control information, and then on the data channel the actual data comes out of the logical units.
And so basically, yeah,
they're controlled independently,
but the data is read through,
like, through the single shared channel down here.
And then as we saw here, basically, the data that we want to read or write needs to be in these page registers.
So this is basically where the data channel will be connected to.
And that will take some time.
So to basically read one page into the page register or to basically write the page from the page register into a page, into one block.
So that's in order to save time here or in order to fully utilize this data channel here,
the channel controller will interleave these operations.
So then basically it will say, well, on chip one, please start a read
or I'm sending a read command
and on chip two, on chip three, for example, in parallel.
And then there's a certain read latency
until the page is loaded into the page register.
And then again, basically the data could be read out
through the data channel from the page registers.
And so then, I mean, while we still have this kind of latency for the individual pages,
we are basically continuously utilizing the channel by using
some form of pipeline parallelism here,
right? So we're basically, in a round-robin fashion,
utilizing the individual logical units here
in order to fully optimize the throughput.
So that's basically like just reading and writing and erasing or the communication that's done in there.
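Just to see why this interleaving pays off, here is a small back-of-the-envelope sketch; the 50 µs read latency, 4 KiB page, and 400 MB/s channel rate are assumed, illustrative numbers, not values from a specific device:

```c
#include <stdio.h>

int main(void) {
    /* Assumed, illustrative numbers -- real values are device-specific. */
    double read_latency_us  = 50.0;     /* cell array -> page register */
    double page_kib         = 4.0;
    double channel_mb_per_s = 400.0;    /* shared data channel */

    /* Time to ship one page register over the shared data channel. */
    double transfer_us = page_kib / 1024.0 / channel_mb_per_s * 1e6;

    /* How many LUNs must be interleaved so the channel never sits idle. */
    double luns_needed = read_latency_us / transfer_us + 1.0;

    printf("page transfer: %.1f us, LUNs to keep the channel busy: ~%.0f\n",
           transfer_us, luns_needed);
    return 0;
}
```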
However, that's not everything. The thing is that the flash is extremely sensitive.
So if you're basically writing just once into a flash cell,
this will basically directly influence the neighboring cells.
So you basically need to make sure that the data that you're writing is roughly uniformly
or is roughly randomly distributed.
So you don't want to write long runs of ones and zeros,
because then the cells will also wear much faster. So they will wear out, they will break faster
if you just, say, overload them in one direction.
So you want to actually scramble the data first
before you write it.
And at the same time, if you're reading,
because it's sensitive, you will also get a lot of errors.
And for that, you need some error correction.
And so then there's an error correction code engine that's in front of the channel controller
which will basically scramble the data that you write and descramble the data that you're
reading, and then also, once the data is descrambled, you're basically also going to do
error correction and check if what's coming out is actually correct or not, or if there's any errors,
it will try to fix those errors. I mean, just regular error correction codes.
However, details are not public information. How exactly which drive does it?
And that's also a reason why there is different kind of interface standards.
So at a certain point here, there will be an interface standard but this part is basically not standardized and this means you
cannot exchange like the different kind of chips on different kind of devices
okay so going one step further so basically we have the error correction engine and the
general controller multiple of these basically connecting to multiple LUNs and having different channels that we can use in parallel.
So we know that we can do basically interleaving on each individual channel in order to fully utilize multiple of these blocks
or of these LUNs in parallel, not blocks.
But then we also have multiple of these channels in parallel, which actually can be fully utilized in parallel.
So we can read and write completely parallel plus this interleaving.
And for this, we basically have two additional steps. We have the host interface controller, which
does implement the actual interface to PCI Express etc. So basically the
protocols that are needed to talk to the server and also this part does the data transfer in and out of the device itself.
So this is all still on the SSD. And then we have firmware and the firmware implements the flash
translation layer and initially this was really basically translating from this HDD block device to this kind of internal parallel channel
based device. And nowadays, I mean, we want to exhibit more of this, more of the
parallelism to the outside in order to be able to fully utilize this.
But still, basically, all this internal part is kind of hidden,
at least like some of it,
behind this error correction engine, et cetera.
So how exactly the data is written here.
But through this part, basically,
one tries to control where exactly data is located
and how data is read, written, and deleted, especially written and deleted,
in order to make sure that the accesses are somewhat distributed across channels
and also somewhat distributed across different kinds of blocks.
Because again, on one hand they are sensitive and they
only have a limited lifetime so individual cells or individual pages
will break after just writing and reading too often or then not too often
like basically a certain amount of reads and writes eventually they will break
and so we want to distribute this across multiple blocks and pages in order
to read all of them in kind of the same speed.
So we're not basically breaking one part of the SSD and the SSD all of a sudden only has
half of the capacity.
Finally, we also have the storage controller, which basically has all this part.
That's basically what we discussed so far.
So now, let's look at how writing and erasing work.
So this is kind of a small block.
We said one block would be 128 or 256 pages typically. And then we have on our block 1000,
we have some data.
Block 2000 is still completely free.
And we have XYZ as individual pages on the block.
And now we want to update page X.
So what will happen, rather than replacing X in place, we actually
have to write a new X. So we're basically remembering in the buffer somewhere or in
the storage controller, we're remembering that X is deleted.
So, we have a map and know that
X' is the new version.
And whenever we are going to read again,
we are actually going to read X' rather than X.
And X is basically
a dead page right now, because we cannot use it.
So, in order to overwrite this again or to reuse this, we have to erase.
And this is basically, as I said, done on a block base.
So, if we want to reuse the block 1000, we actually have to copy the data over. So what will happen is that we'll just copy
the block 1000. All of the pages that are in use and have valid data will be copied over
to a free block and the old block will be erased. So you can see after this basically block 2000 has one
slot free because this is basically the block where the old X was and block 1000
is completely free again. And erasing takes a bit more time than writing and also reading.
So this is basically something which will be done in the background if possible.
And again, like write-erase cycles,
there's only a limited amount of numbers how often you can do this.
And so you also have this kind of write amplification, which I mean you have the same if you go to
memory, right?
So, we always have to write on page size, so we cannot write anything smaller than page
size.
This means, even if you just write one byte, you will end up writing a complete page, which
will be 2 to 16 kilobytes typically.
If you have to erase a block while you're doing this,
so basically you have to completely copy this,
you have an additional write amplification,
which would be the block size, so up to 256 pages.
So if you have 4 kilobyte pages, then we're basically at a megabyte here, in
terms of write amplification, that we all of a sudden have to copy. And because of the
page movement, etc. So in order to be efficient, of course we want to align the writes on the page size
and we're going to try to write chunks of data that are either page size or multiple
page sizes.
And with this we can actually get maximum throughput.
And that's also, again, why we use buffer management in a database,
because we want to work in this page granularity rather than writing individual bytes, etc.
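As a small sketch of what page-aligned, page-sized writes look like from user space on Linux (assuming a 4 KiB page size and a scratch file name; O_DIRECT is used so the aligned writes really go down as issued):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE 4096

int main(void) {
    /* O_DIRECT requires the buffer, offset, and length to be aligned. */
    int fd = open("scratch.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, PAGE, PAGE) != 0) return 1;
    memset(buf, 'x', PAGE);

    /* One full, aligned page per write -- no partial-page updates. */
    for (int i = 0; i < 4; i++) {
        if (pwrite(fd, buf, PAGE, (off_t)i * PAGE) != PAGE) {
            perror("pwrite");
            break;
        }
    }
    free(buf);
    close(fd);
    return 0;
}
```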
However, the device also has something for this.
So the device in its controller will have some logic and some buffers
for small individual writes in order to not do all of the individual writes separately.
So if the device, rather than do random writes everywhere
or inefficiently use pages and blocks, it will try to align stuff in the controller
to make stuff work more efficiently.
Okay, a bit more deep dive into the Flash translation layer.
This is part of the controller, we said, and this also maps logical block
addresses to physical block addresses and logical block addresses from the
device point of view. This translation also exists on hard drives,
but it's much more complicated in SSDs. So also on hard drives, you have some extra reserved space for like if
some bits break somewhere, or you also have some error correction, etc. So you can basically
move pages or addresses can be translated to different areas on the hard drive. And the same is true in a flash drive.
So here, you could have a completely random distribution
depending on how the flash device decides to distribute
the data onto the logical units.
So in order to maximize the throughput across the channels,
and in order to make an even wear and tear across the different kind of cells.
So basically, it uses a map for this, and this map is stored internally inside the controller.
So basically, we're addressing individual blocks or individual pages on the device with a page address.
If we go from our host device, and then this mapping will basically tell the device,
okay, this is actually located here inside this logical unit
in block so-and-so, page so-and-so.
And this can change over time,
depending on the garbage collection
and depending on the wear leveling.
Basically meaning, as soon as I have to erase a block,
for example, I will move, as we saw earlier,
I will move the still good blocks or the still good pages
to another block, and then I'm updating this mapping
inside the controller.
And of course, it also needs, like this is in RAM, but it also needs to persist it in
case of power failure.
So, as soon as there is no power anymore, then this kind of mapping will be stored into
the drive as well.
So, I mean, a simple flash translation layer would look something like this.
We have one block per LUN and the writes are buffered.
And we have kind of a logical-to-physical mapping in a round-robin fashion.
And as soon as a buffer is full, we're going to flush it to the actual cells.
And if we have round robin, we're going to get good channel utilization, because we're
going to use the channels one after the other.
Or you can also do this in parallel. And we don't necessarily update in place, because
any kind of update would then basically invalidate these logical-to-physical addressing tables.
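A toy version of such a round-robin logical-to-physical mapping could look like this; the channel count, sizes, and the append-only write pointer are all invented for illustration, and a real FTL is far more involved:

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_CHANNELS     4
#define LOGICAL_PAGES 4096

/* map[logical page number] -> (channel, physical page within that channel) */
static uint32_t map_chan[LOGICAL_PAGES];
static uint32_t map_page[LOGICAL_PAGES];
static uint32_t write_ptr[NUM_CHANNELS];   /* naive append-only write pointer */

/* Out-of-place write: pick the next channel round-robin, append there,
   and remap the logical page. The old physical page simply becomes stale. */
void ftl_write(uint32_t lpn) {
    static uint32_t rr = 0;
    uint32_t chan = rr++ % NUM_CHANNELS;
    map_chan[lpn] = chan;
    map_page[lpn] = write_ptr[chan]++;
}

void ftl_read(uint32_t lpn) {
    printf("LPN %u -> channel %u, physical page %u\n",
           lpn, map_chan[lpn], map_page[lpn]);
}

int main(void) {
    ftl_write(7);   /* first version of logical page 7 */
    ftl_read(7);
    ftl_write(7);   /* update: lands on the next channel, new physical page */
    ftl_read(7);    /* the mapping now points to the new location */
    return 0;
}
```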
And well, we need to do a garbage collection
for invalid pages.
And we also need some over provisioning
in order to be able to move pages around, right?
So this is what I showed you earlier.
If I want to do an erase,
I basically have to copy the data over to some empty
space. So I basically cannot give all of the space on the device to the user,
because otherwise there's no space for moving data around anymore. So we need to reclaim the writes after some time.
So, let me quickly check time.
There's still a few slides until we go to the PCI Express.
So, the garbage collection in the drive: basically, as we saw, right,
we're not updating in place.
We're basically writing new pages if we have some data coming in, because we cannot update in place.
Otherwise, we would have to erase first and copy everything over again.
So that's why we're basically copying to new free blocks and
then the SSD needs to basically erase the block again in order to free
this space again. The garbage collection will do this asynchronously.
The garbage collection will basically check which pages contain a lot of stale data and
where does it make sense to basically move data around.
However for this information to work, the drive needs to know which blocks do actually
contain or which pages are actually invalid.
And so for this, there needs to be some information from the OS or from the user to explicitly
explain what information is invalid.
So this basically only works if the SSD controller sees that there is logically free space.
So I mean, if you're deleting a file, right, you're not necessarily sending a delete message or something,
you're only updating if you write, like even on a disk.
On a disk, you would never delete.
You would just overwrite, and it doesn't really matter.
And this is how these interfaces are typically designed.
So on the SSD, this is a problem, right?
Because if you're just writing new data and you're not overwriting in place, in the same space,
then the SSD controller won't really see this. So
there needs to be some extra information for the SSD to actually do
this garbage collection, and for this there's a trim command. This basically lets you explicitly
tell the SSD which pages can be deleted, so which pages are empty,
and then with this, which blocks can be erased. And this is something that
didn't exist for a long time in POSIX.
So then basically there was this problem.
You're just basically writing new data
and the deleted files would never be marked properly.
So for this, there's this extra command,
which will explicitly tell the SSD,
okay, these blocks you can actually erase.
But that is just an information to the SSD, right?
And then the SSD decides when to actually erase something or not.
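For illustration, this is roughly how such a hint reaches a device on Linux; the sketch uses the BLKDISCARD ioctl on a raw block device. The device path is just a placeholder, it needs root, and it irreversibly throws away the data in that range, so read it as a sketch of the mechanism rather than something to run casually. On a mounted filesystem you would normally rely on the discard mount option or a periodic fstrim instead.

```c
#include <fcntl.h>
#include <linux/fs.h>      /* BLKDISCARD */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void) {
    /* Placeholder device -- discarding destroys the data in that range. */
    int fd = open("/dev/nvme0n1", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Tell the device that this 1 MiB range no longer holds valid data.
       The SSD is then free to erase the underlying blocks whenever it wants. */
    uint64_t range[2] = { 0, 1 << 20 };   /* {offset, length} in bytes */
    if (ioctl(fd, BLKDISCARD, &range) < 0)
        perror("BLKDISCARD");

    close(fd);
    return 0;
}
```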
So, a bit more on the wear leveling.
So, each of these cells has a limited lifetime.
So, this is like a...
I don't have a number right now for the number of cycles that it can go to, but it's in the thousands, but still not unlimited.
Meaning, after a certain amount of write and erase cycles, the block or the page is basically broken. It won't respond. It will basically freeze at the last version
and it won't accept any new erases.
And for this to not happen too quickly on hot data,
so not all of your data is written and updated in the same frequency.
So you have some files that the OS uses,
which are constantly changed, right?
And some files will be not touched at all.
And in order to not break certain parts
of the SSD quite quickly,
you have wear level mechanisms,
which basically map from one block to the other.
And there's two ways,
where either you have dynamic wear leveling,
where you're only moving data that's changed all the time.
So whenever you're changing data, you're updating,
you're not gonna update in place.
Of course, you cannot update in place,
but you will basically move the data around
as soon as you're erasing a block and you're
making sure that these blocks are not changed too frequently or that whatever
block is still free you're basically counting how often they've been erased
so far and you're switching to something that's not been erased that often. In the
static wear leveling, you're moving everything around in order to make sure
that everything, or sort of everything, right, so also the data that's static,
gets the same kind of wear leveling across all of the blocks that are on the SSD. So that's a bit more work, it's a bit slower,
but you get a better utilization of all of the blocks.
So the dynamic wear leveling will be a bit faster,
but you get like different kind of wear
on different areas of the SSD.
The static wear leveling will give you
like a better utilization,
but the drive, it will be a bit slower and the drive will
be broken more quickly because you're reading and writing everything more frequently.
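A minimal sketch of the dynamic variant, with all numbers invented: when a fresh block is needed for a write, pick the free block with the lowest erase count, so the wear spreads out across the device.

```c
#include <stdio.h>

#define NUM_BLOCKS 8

struct block_meta {
    int free;              /* 1 if the block is erased and available */
    unsigned erase_count;  /* how often this block has been erased so far */
};

/* Dynamic wear leveling: among the free blocks, prefer the least-worn one. */
int pick_block(const struct block_meta *blocks, int n) {
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (blocks[i].free &&
            (best < 0 || blocks[i].erase_count < blocks[best].erase_count))
            best = i;
    }
    return best;   /* -1 means no free block: garbage collection needed */
}

int main(void) {
    struct block_meta blocks[NUM_BLOCKS] = {
        {1, 12}, {0, 3}, {1, 5}, {1, 5}, {0, 20}, {1, 9}, {1, 2}, {0, 7}
    };
    printf("next block to write into: %d\n", pick_block(blocks, NUM_BLOCKS));
    return 0;
}
```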
Looking at the access completely and very quickly, if we basically want to access
the device, we're going from the host memory
through the host interface controller to
a data buffer. So there we have some kind of caching
policies on the device. Then we have the
Flash translation layer which will do
this logical to physical mapping
where we have data placement, we have garbage
collection policies, then we have a low level scheduler where we will basically go through
the different kind of queues. So we have the different kind of channels where then we will
have individual logical units connected to it and there we have certain channel utilization
policies in order to get good parallelism out of the device and then we have the actual flash
controller which does the error correction and longevity policies so basically this
data scrambling right so where we're making sure that we're not just
basically writing the same kind of data,
like same kind of bit types all the time,
but get some kind of a randomness
in the data that we're storing.
Okay, so quick internal takeaways, right?
So it's a complex device, and this is what you should take away.
I mean, you get kind of an overview of what the device needs to do.
You see it's parallel.
There's a lot of hidden design decisions.
And that means you will see very different characteristics depending on these design decisions.
I mean, they get good performance, but if you really want to get the most out of the device,
you will have to really benchmark and test an individual device.
And it also means it will have different characteristics depending on the age,
depending on the filling degree of the device.
So if your device is much fuller,
there's less space to swap data around,
then the device will probably be slower.
And there's many different types of SSDs,
not only the interfaces,
but also internally how they're actually structured.
So if you have, like, an SSD-optimized system, that's probably nonsense, because it
really can be optimized for one type of SSD, but you cannot
optimize for all types of SSDs, because they're so diverse. And, well,
within the device there's a lot of complex software and diversity,
but then also in how to access it. So this is something that we'll also see towards the end
of the lecture, probably not today, towards the end of the lecture. So now let's do a quick three
minutes break, and then we're going to talk about PCI Express.
Let's look at the interfaces. How do we connect to this device? So this is kind of an old ATX motherboard. So modern motherboard will look slightly different but a lot of these connectors
will still look similar. So you have external ports, you have the CPU socket, you have the
memory slots, you have power controller, etc. But then you have these things here, right?
And this is where basically your storage and other peripherals will go. So this would be a PCI Express slot, and these are older PCI slots, and here you have SATA ports where
you would connect your regular SATA drives to. And these are then connected
to the CPU via different buses, and these are internal buses, similar to the inter-CPU connections.
But for the PCI Express, you have basically like a network
that connects to the CPU, to the PCI Express controller.
So the different devices will be interconnected
and can communicate with each other and can communicate with memory etc.
And there's different ways of building these buses. There's parallel buses and serial buses.
And parallel buses basically have multiple channels or links, not channels but links actually,
from the transmitter to the receiver and you're in parallel sending
multiple bits at a time so say for example a complete byte by having all of these links
like in this case eight links at a time. And that's good in general, or in theory, because you can do a whole byte in a single
connection, or a single
send operation,
and
basically the speed that you get then scales with the number of bits you send at a time.
However, there's a bit of a problem again.
I mean, on the one hand, this needs to be clocked in the same way,
so everything needs to be done completely in parallel.
And there's crosstalk.
So basically, there's an interference between these individual links,
and this worsens the longer this gets, right?
So if, I mean, for a short connection
might not be a big problem,
but if you have a longer connection,
all of a sudden you get kind of this crosstalk
across these lines, which then again,
you need error correction, et cetera,
makes everything more expensive.
And because of that, or not because of that,
but as an alternative, there would be a serial bus. And the serial bus only sends
basically one bit at a time. And because you only have a single channel or a
single link, there's no clock skew. So you don't have to synchronize between these different links.
You need less cables, but of course you cannot send as much in parallel.
But you don't have as much crosstalk and it's cheaper and smaller in space than having parallel
connections.
So looking at the different interconnects,
one would be SATA.
I said it's Serial ATA, which is Serial AT Attachment,
and the AT comes from the IBM PC AT,
where it probably stood for Advanced Technology, never really disclosed, but probably
in the end it means Serial Advanced Technology Attachment, although the abbreviation, or
SATA, really stands for Serial AT Attachment. It was announced in 2000 and was kind of a successor to Parallel ATA, or PATA,
where you would have the classical IDE interconnect. So this is something you probably have also seen, like these flat cables where you used to connect
your HDDs.
Maybe you've not seen them, but that's basically the successor.
And with SATA 3.0, right now we can get up to 600 megabytes per second, and that's from
2009.
And then there's different versions
with different kind of smaller updates.
But the performance hasn't really increased much.
And really depends on basically the clock speed
and that this bus gives you, essentially.
And SATA supports HDD, optical drives and SSD, but it cannot fully utilize
modern SSDs, as we'll see. I mean, slower SSDs, yes. Fast SSDs, no.
For a more general interconnect, Intel developed the Peripheral Component Interconnect or PCI that uses one clock.
It used to run at 33 MHz or up to 66 MHz, but can also be completely powered down to not use power. And it basically has two kinds of phases.
It's basically again a bus, right?
So everything's connected to one communication channel.
And then there's different interactions.
So you have an address phase where you basically
in the communication say,
well, I want to talk to this and that.
And then there's a data phase where then on this bus,
the device
will send the information that it actually wants to send. So in
terms of like if you think about communicating to a disk through this
then first the disk would say for example say I want to send data to the
memory now and then it would actually send the data. And it's burst-oriented, so there's a master or head and target relationship
and this would usually be controlled by the host, so the CPU basically says,
okay, let's just communicate with the SSD or the disk for example and then you have this communication.
And the boards are supposed to be plug and play so you can actually connect them in
and then they start talking to the bus and can communicate with the host and the PCI controller. And this is PCI, not PCI Express.
So this is what you would see in a Pentium processor.
And this would look something like this, for example.
So you have the PCI bus,
which is based on a PCI bridge,
and multiple connectors or multiple systems are connected to this.
USB would be connected to this.
The SCSI driver or controller would be connected to this.
You might actually have another ISA bridge, which is like an older bus, where then other
disks like the IDE disk could be connected to, etc.
This would then interface with the CPU and the memory in local buses. So this is basically the
smaller stuff on chip.
Older devices would have separate dies on the motherboard somewhere;
in modern chips, this is basically integrated into the die itself.
Because of course at a certain point this was too slow,
there was a newer version or an extension,
or a new standard let's say, in 2003,
which is PCI Express.
And PCI was actually a parallel bus, and this is a serial bus.
And because it's a serial bus, the connectors are much smaller, right? Because we only need like a
single connector basically
to do the communication, rather than, if we want to have a parallel connection, say for example for a byte, we need at least eight cables to communicate.
So then we can see this is actually much smaller.
So here you can see on an actual motherboard,
you can see this would be a classical PCI connection.
And this would have as many channels as this PCI Express x1 connection.
So here, for each connection, you would have two channels, bi-directional, and each of those
has two wires, one for signal and one for grounding, essentially, in order not to get crosstalk again.
And then you have these different lengths.
So you have x1, which would be two channels.
You have x4, which would be eight channels,
and x16, which would be 32 channels in this case.
And then we have kind of a network setup
where you have point-to-point serial connections in there.
And this is done through a switch, meaning we are sending packets between devices.
Then we have some error detection code in these packets,
and we can have quite long connections.
So basically these connections can be up to 50 centimeters
and we can have multiple switches.
It's a network, right?
So meaning we could have like another switch
where we again connect devices,
which then would connect to some kind of bridge chip
with the CPU.
But typically, as I said,
this could actually be integrated into the CPU itself today.
There is an actual protocol stack. So we have a physical layer
that does the bit transmission.
We have a link layer that basically deals with the packets.
So we have complete packets then.
We have redundancy checks, so basically error correction and retransmission if there's an error.
There's a question?
Yeah.
Yes.
Does the chipset on the motherboard, does this already play in here?
Yes.
So this is all done inside the chip.
This is not done in software.
So basically the PCI controller does everything here.
Because that is usually located quite far away from the processor, right?
Like physically far away.
You mean the chipset?
So where it's located on the motherboard, actually I don't know.
I think, again, it's a network, right?
So you can have multiple controllers connected to each other, talking to each other.
So, I mean, you will have the interfaces on each of the devices. Where exactly it is, I mean,
where it's connected or where it is on the motherboard, I don't know exactly.
Consumer motherboards, they have these chipsets, like AMD has the 300-something series. Yeah, I thought that those are like on the lower right half of the board?
So this plays into it, but also the CPU needs to be able to speak this, so there is something on the CPU that controls part of it. So say, for example,
you need a certain CPU version in order to get a certain PCI Express
version so that's not just part of the motherboard but how it... I don't know. So the CPU doesn't need to go through the chipset and then through the driver?
Yes.
So it's basically, they actually communicate with each other.
The switch does just the routing.
Yeah, then we have transaction layer and software layer, which basically the software layer gives the interface
to the operating system.
Okay, so PCI Express comes in different generations, and you also have this reduced form factor. So we saw
this earlier, like this, what you would see on a motherboard, right?
This here, the different kinds of widths, so how many channels we have. There are also versions for
laptops, which would be the mobile version, so this would be like a x4 connection,
and you see this, like if you have an SSD for a laptop, you will have these
slightly different connectors there. And then there's different versions. Most servers right now
still run on generation 3, if I'm not wrong, or newer ones already have generation 4, and the just-upcoming servers, so
Sapphire Rapids for example, would have generation 5 PCI Express. And then,
with 16 channels or 16 lanes, we're basically getting up to 63 gigabytes per
second throughput, hypothetically.
But this is basically what we can theoretically
read from a single device.
Again, then we might have, or we will
have more connections that actually go to the CPU.
So having more devices will actually
give us even more performance or bandwidth across these devices.
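As a quick sanity check on that 63 gigabytes per second figure, here is the per-lane arithmetic; the raw per-lane rates and the 128b/130b encoding are the standard PCIe values, and the result is a theoretical upper bound, not what a real device delivers:

```c
#include <stdio.h>

int main(void) {
    /* Raw signalling rate per lane in gigatransfers per second. */
    double gen_rate_gt[] = { 8.0, 16.0, 32.0 };   /* Gen3, Gen4, Gen5 */
    const char *name[]   = { "Gen3", "Gen4", "Gen5" };
    int lanes = 16;

    for (int i = 0; i < 3; i++) {
        /* 128b/130b encoding: 128 payload bits per 130 transferred bits,
           then divide by 8 to go from gigabits to gigabytes per second. */
        double gb_per_lane = gen_rate_gt[i] * 128.0 / 130.0 / 8.0;
        printf("%s x%d: ~%.0f GB/s theoretical\n",
               name[i], lanes, gb_per_lane * lanes);
    }
    return 0;
}
```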
I would say, in the interest of time, let's stop here.
Now you have an overview of what
the different storage technologies are, what an SSD looks like internally, and how
we connect it, so this is basically PCI Express. And next time, meaning tomorrow, I will at least
briefly touch on Non-Volatile Memory
Express, which is the interface that we're using.
This is basically a software interface, how we're communicating with the SSD then.
We can use different interfaces, but non-volatile memory express is the one
that actually gives us the best throughput and is specifically designed for SSD.
Okay, thank you very much. Questions?
No questions?
Then thanks a lot, see you tomorrow.