Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Storage
Episode Date: June 27, 2023...
Transcript
Okay, so welcome everybody. Although we're not that many yet, let's get started today, because I have many slides and I'm not actually going to be able to finish all of them.
But that doesn't matter; some of the stuff towards the end is interesting
but mostly text, so you can basically read it for yourself.
So, today we're going to talk about storage. Mostly modern storage, but I'll also touch on the peripherals.
So usually if we're talking about fast hardware,
we're gonna talk about going across PCI Express,
but there's all other interfaces as well.
We'll talk a bit about this today as well.
So today we're gonna talk about this part.
So the disk in different forms.
And we'll talk about this part.
So the interconnect PCI Express and predecessors and alternatives to that.
So, yeah, not much changed so far.
There might be a few switches, but we're getting close towards the end.
So everything stabilizes, I hope.
So I'll give an overview of storage today and then go into more detail.
The overview will be quite quick, about the different kinds of storage technologies that are in use right now.
Then we'll dive a bit deeper into solid state drives and their internals.
And although you most likely will never be exposed to this kind of detail
while programming something,
it's good to have an understanding
in order to use them efficiently
and to know why they behave the way they behave.
A similar, I mean, if you do a database course,
typically you will learn about the internals
of a hard drive.
And I also have a slide for this, just to know and understand
how you're supposed to use the hard drive. Similarly, we need to know how the solid-state
drive works internally to use it properly. Then we're going to talk about the interfaces, meaning the different ways we connect to storage. Most importantly,
PCI Express. Then I'll say a bit more about NVMe
and the Linux I/O frameworks.
And here, I don't know how far I will get.
So this is basically where then maybe you can look at this
in more detail at home as well.
So this is, I mean, there's lots of details.
This is mainly based on
a tutorial by Philippe Bonnet and Alberto Lerner, and there they have even much more detail on this,
so many more details on how to use SSDs, especially NVMe SSDs, efficiently.
And so, I mean, this is basically, if you really want to deep dive, then go there.
They have, I think, the slides are online somewhere.
If not, you can ask them.
And they also wrote a book about it.
At least I think so. I mean, either it's already published or they will be publishing it soon.
And the other book that I'm using is Structured Computer Organization, again by Tanenbaum.
It gives a good overview, let's say, of the basics of PCI Express, etc.
Okay, so you've probably seen, well not necessarily seen all of these devices, but you've heard of probably all of these devices.
So you have classical magnetic disks, then optical drives,
not really common anymore in most user or consumer systems, but still used a lot.
I mean, if you think about consoles, for example, you typically use optical devices.
Then you have all kinds of different flash devices and you have tape.
So this is what you will currently see in any
kind of data center, in one way or the other. I mean, optical devices are
moving more towards a niche market, but for certain use cases they're
still used a lot. If you need to transport small amounts of
information, you will typically get a DVD or something like that,
because it's quite robust and super cheap, but it's not super fast.
And if you have archival storage, you will go to tape.
But most of you probably never really worked with tape themselves,
or even touched tape.
But this is not the end. I mean, this is what's around right now,
but people are actively searching and researching
for other types of storage.
And two things that people are trying is glass,
for example, and DNA,
both of which are still super experimental.
But using glass as well as DNA,
you can get much denser storage
than currently possible, though right now it's also way, way slower.
So using this, you could have very cheap archival storage,
but right now it's just prohibitively, on the one hand,
I mean, still expensive, just the technology,
but also prohibitively slow.
Okay, so step back.
Who has attended a database course
where they already saw the hard drive details?
Not everybody.
Okay, then let me quickly dive into this. So, okay, hard drives, you still have them
today a lot. Just because they're right now the, let's say, the fastest active,
not the fastest, the cheapest active storage. So if you want to store large amounts of data
and want to work with the data in a reasonable amount of time,
then hard disk drives are still a good way to go.
Because the per-gigabyte price is just still cheaper
than any kind of SSD,
and so much cheaper that it actually makes sense to use them even
though they're considerably slower. So they have larger
capacity than SSDs still, they have a high read speed, but they have very slow
random access.
So the idea is, I mean, you can basically see this here, this is the internal setup. Any kind of hard drive
will look something like this.
You have these kind of platters and multiple of those
in a stack, so from 3 or
I mean old drives might have had two up to six, something like that.
And then they have two sides and there's kind of an arm with a sensor or actuator that actually reads the platter.
And the arm, like all of these arms, like there's basically two sides and all these arms work in parallel.
They don't move independently and that means whenever we want to read something, this arm first has to go to the right track.
The platter is basically built up or segmented into individual tracks, and these are then again segmented
into different kinds of sectors. So we're basically splitting up the
platter like a pie, and then we have individual track sectors which we can
read at a time and of course all of the arms work in parallel,
so we're not just going to do this on one platter,
but on all platters in parallel,
in order to get the higher bandwidth.
Otherwise, we would just read one platter at a time,
which of course does much less than just reading everything at a time.
And because of that,
because we want to read all of them in parallel, we also call these tracks, stacked one on top of the other, a cylinder,
because we're basically going through all of the platters at a time.
And everything that's close to each other, or within a track, we can read fast,
basically just at the rotational speed of the disk. And the rotational
speed is something between 5,000 and 15,000 rotations per minute, and this
is a physical limit, right? So if we go much faster than that, at a certain point the disk
will basically explode, because of the spinning there will be a lot of forces at the edges of the
disk pulling it apart. And I mean, this is well within safe bounds, but
if you go much faster, then the physical limit is there, and
you would have to build it so much stronger that it's just, again, not really cost-effective
anymore.
So, finding something basically means we have to move the arm to the right track and then we have to rotate until we find the right position
and that's basically 4 to 9 milliseconds and that's a lot.
So that's basically where time goes and then we can read at something like 100 megabytes
per second depending on the type of drive.
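Just to make that concrete, here is a tiny back-of-the-envelope sketch in C; the 7,200 RPM, 4 ms average seek, and 100 MB/s numbers are assumed, illustrative values, not numbers from the slides:

```c
#include <stdio.h>

int main(void) {
    /* Assumed, illustrative numbers for a spinning disk. */
    double rpm           = 7200.0;
    double avg_seek_ms   = 4.0;     /* arm movement to the right track */
    double transfer_mb_s = 100.0;   /* sequential read rate */

    /* On average we wait half a rotation until the sector passes the head. */
    double avg_rotation_ms = 0.5 * 60.0 * 1000.0 / rpm;            /* ~4.2 ms */
    double transfer_4k_ms  = 4.0 / 1024.0 / transfer_mb_s * 1000.0;

    double total_ms = avg_seek_ms + avg_rotation_ms + transfer_4k_ms;
    printf("random 4 KiB read: %.2f ms -> roughly %.0f IOPS per drive\n",
           total_ms, 1000.0 / total_ms);
    return 0;
}
```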
So it's a bit faster if they spin faster; a 4,500 rounds per minute drive will actually be slower
than that. And then these are typically connected via either Serial ATA or Serial
Attached SCSI. And I'll come to some of these abbreviations later.
So I mean, somehow people built these interconnect names
out of nested abbreviations.
So Serial ATA is short for Serial AT Attachment,
and there the AT appears to be short
for Advanced Technology.
So the abbreviations are kind of packed one into the other.
And Serial Attached SCSI
is then Serial Attached
Small Computer System Interface.
And this you don't need
to remember. You just need to know
like, I mean, SATA is something that
you will find every now and then.
SAS is another type.
But modern drives will be connected through PCI Express,
so PCIe, typically.
Okay.
It's a physical device, so it actually needs to spin,
but also if it's powered down, it doesn't have to spin.
It uses no power at all.
But typically, I mean, if you shut it down,
spinning it up again will give you an even much higher delay, so in an active setup you will keep
it spinning all the time. Okay, so this is the go-to storage for cheap active storage. So, if you have a large Hadoop cluster, whatever,
or S3, cheap, large storage,
in some kind of RAID fashion,
you will go with HDDs,
or you will find HDDs today, basically,
just because they're so cheap.
And, yeah, cheap, period. But they're somewhat slow.
If you have even larger amounts of data,
or it's really about, let's say, archival storage,
so you don't need active access,
you just need to make sure that somewhere you store this away,
then you will go to tape today still.
Because here tape is even cheaper than disk.
And it also has this kind of nice feature.
You can actually remove the tape.
I mean, of course, you can also remove the disk, but these tapes typically come in some kind of robot setup where there's an arm that puts the tape in
and out, so there's no actual connection to the tape while it's not in use. So this is
really good for archival storage because it gives you quite a safe setup. So if the tape
is out, nobody can touch it, nobody can change it. Well, say if you
have like your cold storage on hard drives, if the system is corrupt, something like that, then
that might still be overwritten somehow, but the tape is sort of safe. It also has a long lifespan
because if it's taken out, nothing is done with it, right? It doesn't need to spin, nothing. It's quite durable. It's like this small plastic cartridge and it's quite energy efficient.
And so currently there's basically one format that you can use, and this is the Linear Tape-Open, LTO, setup, and currently we're at generation 9.
One of these cartridges is about this size, right? So this is like a
floppy used to be earlier. So it's not large, I mean it's a bit thicker, but it's quite a small thing, and it has 18 terabytes of capacity,
and you can read and write it basically 20,000 times,
no, passes, so that's 20,000 end-to-end passes, but actually you need to...
It's kind of organized in a complicated way where there's multiple bands on the tape
and then different tracks and wraps.
So it's kind of a hierarchical organization where you can
basically read both sides, like upper and lower parts, and in those then you have
multiple, let's say tracks again. And that also means you need to actually go
through the tape multiple times in order to fully read and write it.
And you can see, I mean, it's still,
there's still new developments here with,
well, as I said, up to 18 terabytes native.
And of course, if you compress, you can get even more there. And there will be new versions,
which are supposed to go even up to 144 terabytes at some point, Gen 12 would be.
Tricky part is there's currently actually only one manufacturer, which is always a bit of a problem.
I mean, there are multiple distributors, but in the end, it's really being manufactured by only one company.
And that means if there's a problem in that one company,
you're screwed if you have to work only with this.
But of course, you can also buy these in bulk for some time.
And they're cheap, as I said.
So this is a good thing for archival.
And actually, we also have this.
So when we go to the data center,
you can actually see there's like a small window in the system and you can see that there's the
tapes left and right. And if Tobias has a good day, then he will basically show you how the arm
moves around and takes one of these cartridges in and out or even he can show
it to you.
It's an open tape, meaning it's not like a cassette or like a VCR thing, where you have
two rings or something, but it's really like one long
thing that you can just pull out
and that will then basically
be pulled into the machine
back and forth in order
to read and write it.
Yes?
How is the data compressed,
on the chart on the right?
And what does the
logic have to do with the data compression? So basically, the drives already have a certain compression built in.
And how that is done, I don't know, to be honest.
So there is basically, let's not say native, but they already have certain standards how to compress.
And then this is an average.
So of course, if you write random data to the drive,
there is not much to compress.
And then you will end up with only the native capacity.
But that's something that they basically assume.
So you can basically say, I want to use this drive in compressed
mode, and then it will be able to write more data.
Okay, then, well, there are still optical devices.
So if you get an X-ray, for example, this is what you will get to take home with you,
or if you play PlayStation at home, this is what you will have,
something like this. So they're nice for transport because they're easy,
they're not heavy, and they're not too fragile. They are somewhat fragile, but not
too fragile actually, and they're cheap. But there are CDs, DVDs and Blu-rays, and they're
actually slow in read and write speed. But I mean the Blu-ray for example
already also has not a bad capacity but of course nowhere close to what you
get with tape. So for archival storage, it doesn't really make that much sense.
But if you want to move some data in the gigabyte range,
then this is actually a good alternative.
So because, yeah.
And it's quite durable.
So if you write the CDs yourself or DVDs,
they can last up to 10 years.
If they're manufactured properly, they can last much longer than that.
Okay.
More towards the future, right?
So this stuff has been around for a long time,
but now we're seeing Flash taking over everything.
Not everything, but let's say becoming the storage for active data processing. This type of storage, in a way, when we talked about how this is
done internally, it's similar to the persistent memory setup,
slightly different, but in general the concepts are somewhat similar in terms of how the physical or electrical circuitry underneath works.
But these are even more dense than what you get with persistent memory.
And in this case, this is typically NAND-based.
So if you remember quite early how we can build caches, etc.
So these are these NAND gates typically.
The flash devices have no moving components, unlike disks or optical devices, which means
there is basically less, they're less prone to problems with moving or shutting the server on and off.
So which often is a huge problem with hard drives.
So if you have hard drives and you have a power outage, for example,
so we have to shut down the data center because it's overheating.
Then if you spin it up again, usually a couple of hard drives will be broken.
This is just because they have to stop, they have to power off, then they have to be turned on again.
So there's just some mechanical wear and tear, which then basically eventually breaks the drive.
And so this is not so much a problem with the flash drives.
But they're much more complicated inside.
They're all chips, right?
And you use different granularities for reading, writing and erasing.
Unlike the magnetic devices, where while you're reading you can overwrite, and you're basically using the same granularity for reading and writing.
And writing is basically updating, which means you also don't have to delete first and then write something new. Here, in order to write something new, you
actually have to erase a whole amount of space in order to get space for new
writes, basically. I don't want to go into too much detail about how all of this works,
because I have a couple of slides explaining all of this in detail.
But something that's important is that there are many different form factors
and there are also many different speeds.
You can see there's a lot of logic in here, a lot of circuitry, and
depending on the type of device this will dramatically differ in how complex
and how performant it is. I mean, back here, the flash memory packages, so the actual
storage cells, they are somewhat similar, but then how these
are connected and how much parallelism you can use will be very different if you have
a smart PCI Express drive or something like this, which would be connected
probably either via this small SATA or M.2 connector, which is again a PCI Express type of interconnect,
or let's say a USB stick, where then all of the details,
like the flash controller, etc., in here will be much simpler
and much less efficient.
Also because the speed that you get through USB will be much less.
Okay, so just as some fundamental trends, right?
So why does it actually make sense or why is there a shift in how we use this, right?
So like not too long ago,
Flash was actually quite expensive.
And I mean, while it was fast
and basically people
started using it or manufacturers started using it and speeding up certain
applications where the speed would actually make such a difference that
that you pay an extra price on that, we can see that the prices are actually
going down a lot. So here you can see, this is from a paper
by Viktor Leis. You can see that here we have the GB per dollar.
So basically higher here means we're paying less per GB. And you can see
while disk is kind of flattening out over the time, because the discs are not getting much larger anymore.
So here we're kind of also, I mean, there's new technology,
basically moving or layering the storage in the disc as well.
So rather than just saying, okay, on each position where my arm is I'm just
going to have one bit, I can basically interlace and interleave stuff, so I can more densely pack
things and get more storage. But you can see, I mean, this is a log scale, so
a somewhat flattening curve doesn't mean there's nothing going on, it's still increasing
what you get for your money, but not as much anymore as in the past. The same
is true with DRAM. I mean, there we see kind of a seasonal pattern, and we know there was
some crisis here and there, so prices went up and down. But with flash we can see this is
still increasing, meaning flash is getting cheaper and cheaper. And well, at this point,
basically we are with flash where memory used to be in terms of performance, and at this point flash is
basically 20 times cheaper than DRAM.
So we're actually getting a much cheaper storage while
still being quite fast compared to DRAM. Which, in the past, and that's
not too long ago, right, so basically 20 years ago, flash was the same
price as or even more expensive than DRAM.
Okay, so looking at kind of bandwidths, and I mean, we basically have to see this
on a per DIMM basis or per unit basis.
With DRAM, we know we can get something like
up to 100 gigabyte per second.
So typically like modern Intel servers,
if you fully pack them, this would be 48 GB.
But it's just going to be one DIMM.
And then you have many of those, meaning per CPU or per socket,
you can have eight of those, for example.
So that means, like we had this two lectures ago,
it means something like close to 400 gigabytes per second for DRAM,
if you just fully pack it.
PMEM, factor of three, four, or two to four less,
but still quite fast.
Optical, much, much slower, right?
Because you need to move the medium, so it takes a long
time to read it, and it also depends on what kind of drive, of
course. But then we're looking at SSD and HDD, which is basically what we're
mostly interested in, and again per device, right? So not the maximum speed,
but per individual device: a hard drive will give you up to
250 megabytes per second. And similarly, for read and write bandwidth in an SSD, a single device will
give you up to one gigabyte per second, even a bit more if you have very fast devices today.
And that's just going to be one device, but you can basically pack multiple devices
into a single server.
Okay.
So moving from, so I mean, important here
is that we're still like considerably slower than DRAM,
but we're catching up.
So this is not the same kind of factor that we had with hard drives.
So if we package multiple SSDs into a server,
then we're much closer to the bandwidth.
And the other thing which is important for us, the latency will be much
lower than the latency with a hard drive.
So this is also something that I actually have here.
So in the SSD, we're something in the microseconds, so low microseconds for the latency, and somewhere in the hundreds of microseconds up to a millisecond for reading a megabyte,
while for the disk seek, we're in the milliseconds, typically.
And this is really just because the arm has to move
and the disk has to spin to find the right position.
So this random lookup is basically what's killing disk.
And again, this kind of heads up here, an SSD, this is not a uniform device, right?
So SSDs come in many different form factors, with many different interfaces.
And while we're getting closer to some kind of consolidation,
so when I started working with SSDs,
it was completely unclear how this will go.
But now, at least we have interfaces
where we can use these devices somewhat natively,
rather than looking at them just like a hard drive.
There's still a lot of variables.
It's not clear.
It's not standardized or there's not one way how we typically interface with those.
If I/O performance does not matter, and that's the case in many applications, then typically you will do everything in POSIX, just the Linux standard way of accessing files.
So you have your standard block interface with some kind of buffered I/O.
I mean, you're not just going to synchronously read everything
from the application, so there will be a bit of buffered I/O,
but in general it's block-based,
and the individual reads will then still be synchronous.
However, this is not efficient.
It's not even efficient on modern HDDs, but it's for sure completely inefficient if we're talking about SSDs.
So if we want to have fast I.O. with a hard disk drive, then we're going to build something like our own buffer management, as you have to do right now, on top of POSIX IOs.
On the disk, the disk is block-based, so we'll still use this block-based interface and then
directly communicate with that in order to be fast enough. And if we want to use SSD efficiently and we have a fast SSD,
then we cannot use POSIX.
So then we have to do something else.
And there's new standards like NVMe.
So then you need to really use a different type of protocol to talk to the SSD and really do asynchronous accesses to the SSD in order to fully utilize the parallelism that's in these systems or in these devices.
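To give an idea of what such asynchronous access looks like from the application side, here is a minimal sketch using Linux's io_uring via liburing. The file name, queue depth, and 4 KiB block size are just assumptions for the example, and error handling is kept minimal; it's a sketch of the idea, not a production read path.

```c
/* build: cc uring_read.c -luring */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define QUEUE_DEPTH 32
#define BLOCK_SIZE  4096

int main(void) {
    /* O_DIRECT bypasses the page cache; buffers must be 4 KiB aligned. */
    int fd = open("data.bin", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    struct io_uring ring;
    io_uring_queue_init(QUEUE_DEPTH, &ring, 0);

    /* Submit many reads at once so the SSD sees parallel requests. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        void *buf;
        if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE) != 0) return 1;
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, BLOCK_SIZE, (off_t)i * BLOCK_SIZE);
        io_uring_sqe_set_data(sqe, buf);   /* remember which buffer this was */
    }
    io_uring_submit(&ring);

    /* Reap the completions; they can arrive in any order. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        if (cqe->res < 0) fprintf(stderr, "read failed: %d\n", cqe->res);
        free(io_uring_cqe_get_data(cqe));
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```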
Okay, so yes.
If you have a SATA drive and performance is crucial?
Well, I actually don't know what kind of interfaces you can get on top of those. I mean, for sure, yeah, that I
would have to look up, if there's something special. For sure you can
also do something in user space rather than using POSIX,
and try to
have as many parallel accesses as possible to fully utilize the device.
But if there's a standardized way or let's say a good best practice to do this, I don't
know, to be honest.
So I would have to look it up.
I don't know if there is something like NVMe over SATA.
I don't know. I mean we can check. I have some stuff on the throughputs of the different interfaces, also SATA.
There I would have to look up.
OK, so let's talk about solid state drives internally more. So you get an idea of what these actually look like internally
and how they process your accesses to this drive.
As I said, this is NOR or NAND flash memory.
So flash memory is either NOR or NAND, and in SSDs it's typically NAND cells.
And then there are different types of cells, and these cells can store either one bit, or two,
three, or four bits. So you can have single-level cells, and double-,
triple-, or quad-level cells. And if you want to have high performance, then
you're going to go with single-level cells or
double-level cells, so-called multi-
level cells. If you want archival storage, so you want your storage packed more densely, you're going to go with
triple- or quad-level cells. And this is basically
not your choice, right? So you're not basically picking
how is my SSD structured internally, but you're going to buy either one or the other. Yes?
What's the difference between NOR and NAND flash memory? Does it come down to the gates?
Yes, exactly. That's what it boils down to. So basically, how are the gates structured?
Like really, what's the circuitry in the chips, essentially?
How do these things exactly work?
I mean, it's not really...
It's basically the physical manufacturing,
how you build up these cells individually.
If this is kind of logically a NOR or a NAND gate.
And here you have an overview. With a single-level cell, on each block or each die, you can get like 8 gigabytes.
And you have a read latency in the microsecond range.
You have a programming latency of 100 microseconds, and the page size will be two or four
kilobytes. And then we can see that for triple-level cells and the other types, you will get much
higher latencies, but you will also get much denser storage. And, I mean, the
physical
details are something
that you would have to look up.
Also, then, at a certain stage,
a lot of the details are
kind of
trade secrets,
as soon as we're
looking too far into the
storage. But it's just to get an overview here.
Important, something that's relevant for us still
is this page size down here.
So this is something to remember.
Based on this page size,
this is basically your access granularity still to the SSD.
So while we don't want this kind of big block-based device,
we're still not reading individual bytes or bits.
We're reading in pages, just as we did for memory, et cetera.
But here, these are again somewhat bigger
and typical performance devices will have two
or four kilobytes and often 4 kilobyte pages.
So that's just something to remember.
If they're larger, that means like for each individual access we're going to read more data.
So going through the device setup in more detail.
So the flash internally is organized in logical units that will store individual pages.
So each 4 kilobyte page, for example, will be stored in a logical unit or LUN. And then these logical units also have registers, which is basically where the actual data
will be served from.
So it's stored somewhere internally in these blocks, your individual pages.
But then if you want to read it, it needs to go into this register in order to be read
out. And then you will have multiple of these logical units
that will then basically form your storage.
And how you're basically reading towards these registers
and how you're organizing this, again,
is kind of a performance or a throughput latency trade-off.
And the smallest unit for read and write, as I said,
it's a page size and that's, again, hardware dependent.
And then in these blocks, basically,
will be 128 or 256 pages.
And for the pages,
we have three different kind of operations or within a logical unit.
We can read and write and erase.
And so we can basically read any kind of page at any point in time.
If there's free space, only if it's empty, we can write.
And if not enough space is there anymore,
so if our page is kind of full,
or we want to delete the page or something like that,
then we have erase.
But erase operations will not be on the individual page level,
but will be on the block level.
So not on the complete logical unit level,
but say here, if this is full and half of it
needs to be deleted somehow, it's basically outdated data, for example, then we want to
erase this block in order to reuse it, then that basically means all of this data can
later on, or all of these pages can later on be written again.
So I'll come into a bit more detail later on.
But important, reading and writing is on an individual page level.
Erasing is on a whole block level.
And as I said, block means 128 to 256 or 256 pages typically.
It could of course also be different if the manufacturer decides
I want to have blocks with 1024 pages for example.
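To make the different granularities concrete, here is a tiny toy model in C, using the page and block sizes mentioned above; everything else is invented for illustration. Pages can be read any time and programmed only once; freeing them again only works by erasing the whole block.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE        4096   /* read/program granularity */
#define PAGES_PER_BLOCK   128   /* erase granularity: the whole block */

enum page_state { PAGE_FREE, PAGE_VALID };

struct flash_block {
    enum page_state state[PAGES_PER_BLOCK];
    char data[PAGES_PER_BLOCK][PAGE_SIZE];
    unsigned erase_count;
};

/* Programming is only allowed on a free (erased) page: no update in place. */
bool program_page(struct flash_block *b, int page, const char *src) {
    if (b->state[page] != PAGE_FREE) return false;
    memcpy(b->data[page], src, PAGE_SIZE);
    b->state[page] = PAGE_VALID;
    return true;
}

/* Erase always wipes all pages of the block at once. */
void erase_block(struct flash_block *b) {
    for (int i = 0; i < PAGES_PER_BLOCK; i++) b->state[i] = PAGE_FREE;
    b->erase_count++;
}

int main(void) {
    static struct flash_block b;          /* zero-initialized: all pages free */
    char page[PAGE_SIZE] = "some data";

    printf("first write ok? %d\n", program_page(&b, 0, page));  /* 1 */
    printf("overwrite ok?   %d\n", program_page(&b, 0, page));  /* 0: must erase first */
    erase_block(&b);                      /* the only way to free page 0 again */
    printf("after erase ok? %d\n", program_page(&b, 0, page));  /* 1 */
    return 0;
}
```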
Basically the logical units are then connected to a controller, which is like a channel controller.
So there are basically two channels that go to each of the logical units.
There's a control channel and there's a data channel.
And the data channel is shared, while the control channels are individual. And the control channel basically says, oh, I want to read this page, or please erase this page, etc.
So this would be control information, and then on the data channel the actual data comes out of the logical units.
And so basically, yeah,
they're controlled independently,
but the data is read through,
like, through the single shared channel down here.
And then as we saw here, basically, the data that we want to read or write needs to be in these page registers.
So this is basically where the data channel will be connected to.
And that will take some time.
So to basically read one page into the page register or to basically write the page from the page register into a page, into one block.
So that's in order to save time here or in order to fully utilize this data channel here,
the channel controller will interleave these operations.
So then basically it will say, well, on chip one, please start a read
or I'm sending a read command
and on chip two, on chip three, for example, in parallel.
And then there's a certain read latency
until the page is loaded into the page register.
And then again, basically the data could be read out
through the data channel from the page registers.
And so then, I mean, while we still have this kind of latency for the individual pages,
we are basically continuously utilizing the channel by using
some form of pipeline parallelism here,
right? So we're basically, in a round-robin fashion,
utilizing the individual logical units here
in order to fully optimize the throughput.
So that's basically like just reading and writing and erasing or the communication that's done in there.
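Just to see why this interleaving pays off, here is a small back-of-the-envelope sketch; the 50 µs read latency, 4 KiB page, and 400 MB/s channel rate are assumed, illustrative numbers, not values from a specific device:

```c
#include <stdio.h>

int main(void) {
    /* Assumed, illustrative numbers -- real values are device-specific. */
    double read_latency_us  = 50.0;     /* cell array -> page register */
    double page_kib         = 4.0;
    double channel_mb_per_s = 400.0;    /* shared data channel */

    /* Time to ship one page register over the shared data channel. */
    double transfer_us = page_kib / 1024.0 / channel_mb_per_s * 1e6;

    /* How many LUNs must be interleaved so the channel never sits idle. */
    double luns_needed = read_latency_us / transfer_us + 1.0;

    printf("page transfer: %.1f us, LUNs to keep the channel busy: ~%.0f\n",
           transfer_us, luns_needed);
    return 0;
}
```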
However, that's not everything. The thing is that the flash is extremely sensitive.
So if you're basically writing just once into a flash cell,
this will basically directly influence the neighboring cells.
So you basically need to make sure that the data that you're writing is roughly uniformly
or is roughly randomly distributed.
So you don't want to write long runs of ones and zeros,
because then the cells will also wear much faster. So they will wear out, they will break faster
if you just, say, overload them in one direction.
So you want to actually scramble the data first
before you write it.
And at the same time, if you're reading,
because it's sensitive, you will also get a lot of errors.
And for that, you need some error correction.
And so then there's an error correction code engine that's in front of the channel controller
which will basically scramble the data that you write and descramble the data that you're
reading, and then also, once the data is descrambled, you're basically also going to do
error correction and check if what's coming out is actually correct or not, or if there's any errors,
it will try to fix those errors. I mean, just regular error correction codes.
However, details are not public information. How exactly which drive does it?
And that's also a reason why there is different kind of interface standards.
So at a certain point here, there will be an interface standard but this part is basically not standardized and this means you
cannot exchange like the different kind of chips on different kind of devices
okay so going one step further so basically we have the error correction engine and the
general controller multiple of these basically connecting to multiple LUNs and having different channels that we can use in parallel.
So we know that we can do basically interleaving on each individual channel in order to fully utilize multiple of these blocks
or of these LUNs in parallel, not blocks.
But then we also have multiple of these channels in parallel, which actually can be fully utilized in parallel.
So we can read and write completely parallel plus this interleaving.
And for this, we basically have two additional steps. We have the host interface controller, which
does implement the actual interface to PCI Express etc. So basically the
protocols that are needed to talk to the server and also this part does the data transfer in and out of the device itself.
So this is all still on the SSD. And then we have firmware and the firmware implements the flash
translation layer and initially this was really basically translating from this HDD block device to this kind of internal parallel channel
based device. And nowadays, I mean, we want to exhibit more of this, more of the
parallelism to the outside in order to be able to fully utilize this.
But still, basically, all this internal part is kind of hidden,
at least like some of it,
behind this error correction engine, et cetera.
So how exactly the data is written here.
But through this part, basically,
one tries to control where exactly data is located
and how data is read, written, and deleted, especially written and deleted,
in order to make sure that the accesses are somewhat distributed across channels
and also somewhat distributed across different kinds of blocks.
Because again, on one hand they are sensitive and they
only have a limited lifetime so individual cells or individual pages
will break after just writing and reading too often or then not too often
like basically a certain amount of reads and writes eventually they will break
and so we want to distribute this across multiple blocks and pages in order
to read all of them in kind of the same speed.
So we're not basically breaking one part of the SSD and the SSD all of a sudden only has
half of the capacity.
Finally, we also have the storage controller, which basically has all this part.
That's basically what we discussed so far.
So now, let's look at how writing and erasing work.
So this is kind of a small block.
We said one block would be 128 or 256 pages typically. And then we have on our block 1000,
we have some data.
Block 2000 is still completely free.
And we have XYZ as individual pages on the block.
And now we want to update page X.
So what will happen, rather than replacing X in place, we actually
have to write a new X. So we're basically remembering in the buffer somewhere or in
the storage controller, we're remembering that X is deleted.
So, we have a map and know that
X' is the new version.
And whenever we are going to read again,
we are actually going to read X' rather than X.
And X is basically
a dead page right now, because we cannot use it.
So, in order to overwrite this again or to reuse this, we have to erase.
And this is basically, as I said, done on a block base.
So, if we want to reuse the block 1000, we actually have to copy the data over. So what will happen is that we'll just copy
the block 1000. All of the pages that are in use and have valid data will be copied over
to a free block and the old block will be erased. So you can see after this basically block 2000 has one
slot free because this is basically the block where the old X was and block 1000
is completely free again. And erasing takes a bit more time than writing and also reading.
So this is basically something which will be done in the background if possible.
And again, like write-erase cycles,
there's only a limited amount of numbers how often you can do this.
And so you also have this kind of write amplification, which I mean you have the same if you go to
memory, right?
So, we always have to write on page size, so we cannot write anything smaller than page
size.
This means, even if you just write one byte, you will end up writing a complete page, which
will be 2 to 16 kilobytes typically.
If you have to erase a block while you're doing this,
so basically you have to completely copy this,
you have an additional write amplification,
which would be the block size, so up to 256 pages.
So if you have 4 kilobyte pages, then we're basically at a megabyte here, in
terms of write amplification, that we all of a sudden have to copy. And because of the
page movement, etc. So in order to be efficient, of course we want to align the writes on the page size
and we're going to try to write chunks of data that are either page size or multiple
page sizes.
And with this we can actually get maximum throughput.
And that's also, again, why we use buffer management in a database,
because we want to work in this page granularity rather than writing individual bytes, etc.
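As a small sketch of what page-aligned, page-sized writes look like from user space on Linux (assuming a 4 KiB page size and a scratch file name; O_DIRECT is used so the aligned writes really go down as issued):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE 4096

int main(void) {
    /* O_DIRECT requires the buffer, offset, and length to be aligned. */
    int fd = open("scratch.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, PAGE, PAGE) != 0) return 1;
    memset(buf, 'x', PAGE);

    /* One full, aligned page per write -- no partial-page updates. */
    for (int i = 0; i < 4; i++) {
        if (pwrite(fd, buf, PAGE, (off_t)i * PAGE) != PAGE) {
            perror("pwrite");
            break;
        }
    }
    free(buf);
    close(fd);
    return 0;
}
```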
However, the device also has something for this.
So the device in its controller will have some logic and some buffers
for small individual writes in order to not do all of the individual writes separately.
So if the device, rather than do random writes everywhere
or inefficiently use pages and blocks, it will try to align stuff in the controller
to make stuff work more efficiently.
Okay, a bit more deep dive into the Flash translation layer.
This is part of the controller, we said, and this also maps logical block
addresses to physical block addresses and logical block addresses from the
device point of view. This translation also exists on hard drives,
but it's much more complicated in SSDs. So also on hard drives, you have some extra reserved space for like if
some bits break somewhere, or you also have some error correction, etc. So you can basically
move pages or addresses can be translated to different areas on the hard drive. And the same is true in a flash drive.
So here, you could have a completely random distribution
depending on how the flash device decides to distribute
the data onto the logical units.
So in order to maximize the throughput across the channels,
and in order to make an even wear and tear across the different kind of cells.
So basically, it uses a map for this, and this map is stored internally inside the controller.
So basically, we're addressing individual blocks or individual pages on the device with a page address.
If we go from our host device, and then this mapping will basically tell the device,
okay, this is actually located here inside this logical unit
in block so-and-so, page so-and-so.
And this can change over time,
depending on the garbage collection
and depending on the wear leveling.
Basically meaning, as soon as I have to erase a block,
for example, I will move, as we saw earlier,
I will move the still good blocks or the still good pages
to another block, and then I'm updating this mapping
inside the controller.
And of course, it also needs, like this is in RAM, but it also needs to persist it in
case of power failure.
So, as soon as there is no power anymore, then this kind of mapping will be stored into
the drive as well.
So, I mean, a simple flash translation layer would look something like this.
We have one block per LUN and the writes are buffered.
And we have kind of a logical-to-physical mapping in a round-robin fashion.
And as soon as a buffer is full, we're going to flush it to the actual cells.
And if we have round robin, we're going to get good channel utilization, because we're
going to use the channels one after the other.
Or you can also do this in parallel. And we don't necessarily update in place, because
any kind of update would then basically invalidate these logical-to-physical addressing tables.
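A toy version of such a round-robin logical-to-physical mapping could look like this; the channel count, sizes, and the append-only write pointer are all invented for illustration, and a real FTL is far more involved:

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_CHANNELS     4
#define LOGICAL_PAGES 4096

/* map[logical page number] -> (channel, physical page within that channel) */
static uint32_t map_chan[LOGICAL_PAGES];
static uint32_t map_page[LOGICAL_PAGES];
static uint32_t write_ptr[NUM_CHANNELS];   /* naive append-only write pointer */

/* Out-of-place write: pick the next channel round-robin, append there,
   and remap the logical page. The old physical page simply becomes stale. */
void ftl_write(uint32_t lpn) {
    static uint32_t rr = 0;
    uint32_t chan = rr++ % NUM_CHANNELS;
    map_chan[lpn] = chan;
    map_page[lpn] = write_ptr[chan]++;
}

void ftl_read(uint32_t lpn) {
    printf("LPN %u -> channel %u, physical page %u\n",
           lpn, map_chan[lpn], map_page[lpn]);
}

int main(void) {
    ftl_write(7);   /* first version of logical page 7 */
    ftl_read(7);
    ftl_write(7);   /* update: lands on the next channel, new physical page */
    ftl_read(7);    /* the mapping now points to the new location */
    return 0;
}
```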
And well, we need to do a garbage collection
for invalid pages.
And we also need some over provisioning
in order to be able to move pages around, right?
So this is what I showed you earlier.
If I want to do an erase,
I basically have to copy the data over to some empty
space. So I basically cannot give all of the space on the device to the user,
because otherwise there's no space for moving data around anymore. So we need to reclaim the writes after some time.
So, let me quickly check time.
There's still a few slides until we go to the PCI Express.
So, the garbage collection in the drive: basically, as we saw, right,
we're not updating in place.
We're basically writing new pages if we have some data coming in, because we cannot update in place.
Otherwise, we would have to erase first and copy everything over again.
So that's why we're basically copying to new free blocks and
then the SSD needs to basically erase the block again in order to free
this space again. The garbage collection will do this asynchronously.
The garbage collection will basically check which pages contain a lot of stale data and
where does it make sense to basically move data around.
However for this information to work, the drive needs to know which blocks do actually
contain or which pages are actually invalid.
And so for this, there needs to be some information from the OS or from the user to explicitly
explain what information is invalid.
So this basically only works if the SSD controller sees that there is logically free space.
So I mean, if you're deleting a file, right, you're not necessarily sending a delete message or something,
you're only updating if you write, like even on a disk.
On a disk, you would never delete.
You would just overwrite, and it doesn't really matter.
And this is how these interfaces are typically designed.
So on the SSD, this is a problem, right?
Because if you're just writing new data and you're not overwriting in place, in the same space,
then the SSD controller won't really see this. So
there needs to be some extra information for the SSD to actually do
this garbage collection, and for this there's a trim command. This basically lets you explicitly
tell the SSD which pages can be deleted, so which pages are empty,
and then with this, which blocks can be erased. And this is something that
didn't exist for a long time in POSIX.
So then basically there was this problem.
You're just basically writing new data
and the deleted files would never be marked properly.
So for this, there's this extra command,
which will explicitly tell the SSD,
okay, these blocks you can actually erase.
But that is just an information to the SSD, right?
And then the SSD decides when to actually erase something or not.
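For illustration, this is roughly how such a hint reaches a device on Linux; the sketch uses the BLKDISCARD ioctl on a raw block device. The device path is just a placeholder, it needs root, and it irreversibly throws away the data in that range, so read it as a sketch of the mechanism rather than something to run casually. On a mounted filesystem you would normally rely on the discard mount option or a periodic fstrim instead.

```c
#include <fcntl.h>
#include <linux/fs.h>      /* BLKDISCARD */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void) {
    /* Placeholder device -- discarding destroys the data in that range. */
    int fd = open("/dev/nvme0n1", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Tell the device that this 1 MiB range no longer holds valid data.
       The SSD is then free to erase the underlying blocks whenever it wants. */
    uint64_t range[2] = { 0, 1 << 20 };   /* {offset, length} in bytes */
    if (ioctl(fd, BLKDISCARD, &range) < 0)
        perror("BLKDISCARD");

    close(fd);
    return 0;
}
```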
So, a bit more on the wear leveling.
So, each of these cells has a limited lifetime.
So, this is like a...
I don't have a number right now for the number of cycles that it can go to, but it's in the thousands, but still not unlimited.
Meaning, after a certain amount of write and erase cycles, the block or the page is basically broken. It won't respond. It will basically freeze at the last version
and it won't accept any new erases.
And for this to not happen too quickly on hot data,
so not all of your data is written and updated in the same frequency.
So you have some files that the OS uses,
which are constantly changed, right?
And some files will be not touched at all.
And in order to not break certain parts
of the SSD quite quickly,
you have wear level mechanisms,
which basically map from one block to the other.
And there's two ways,
where either you have dynamic wear leveling,
where you're only moving data that's changed all the time.
So whenever you're changing data, you're updating,
you're not gonna update in place.
Of course, you cannot update in place,
but you will basically move the data around
as soon as you're erasing a block and you're
making sure that these blocks are not changed too frequently or that whatever
block is still free you're basically counting how often they've been erased
so far and you're switching to something that's not been erased that often. In the
static wear leveling, you're moving everything around in order to make sure
that everything, or sort of everything, right, so also the data that's static,
gets the same kind of wear leveling across all of the blocks that are on the SSD. So that's a bit more work, it's a bit slower,
but you get a better utilization of all of the blocks.
So the dynamic wear leveling will be a bit faster,
but you get like different kind of wear
on different areas of the SSD.
The static wear leveling will give you
like a better utilization,
but the drive, it will be a bit slower and the drive will
be broken more quickly because you're reading and writing everything more frequently.
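A minimal sketch of the dynamic variant, with all numbers invented: when a fresh block is needed for a write, pick the free block with the lowest erase count, so the wear spreads out across the device.

```c
#include <stdio.h>

#define NUM_BLOCKS 8

struct block_meta {
    int free;              /* 1 if the block is erased and available */
    unsigned erase_count;  /* how often this block has been erased so far */
};

/* Dynamic wear leveling: among the free blocks, prefer the least-worn one. */
int pick_block(const struct block_meta *blocks, int n) {
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (blocks[i].free &&
            (best < 0 || blocks[i].erase_count < blocks[best].erase_count))
            best = i;
    }
    return best;   /* -1 means no free block: garbage collection needed */
}

int main(void) {
    struct block_meta blocks[NUM_BLOCKS] = {
        {1, 12}, {0, 3}, {1, 5}, {1, 5}, {0, 20}, {1, 9}, {1, 2}, {0, 7}
    };
    printf("next block to write into: %d\n", pick_block(blocks, NUM_BLOCKS));
    return 0;
}
```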
Looking at the access completely and very quickly, if we basically want to access
the device, we're going from the host memory
through the host interface controller to
a data buffer. So there we have some kind of caching
policies on the device. Then we have the
Flash translation layer which will do
this logical to physical mapping
where we have data placement, we have garbage
collection policies, then we have a low level scheduler where we will basically go through
the different kind of queues. So we have the different kind of channels where then we will
have individual logical units connected to it and there we have certain channel utilization
policies in order to get good parallelism out of the device and then we have the actual flash
controller which does the error correction and longevity policies so basically this
data scrambling right so where we're making sure that we're not just
basically writing the same kind of data,
like same kind of bit types all the time,
but get some kind of a randomness
in the data that we're storing.
Okay, so quick internal takeaways, right?
So it's a complex device, and this is what you should take away.
I mean, you get kind of an overview of what the device needs to do.
You see it's parallel.
There's a lot of hidden design decisions.
And that means you will see very different characteristics depending on these design decisions.
I mean, they get good performance, but if you really want to get the most out of the device,
you will have to really benchmark and test an individual device.
And it also means it will have different characteristics depending on the age,
depending on the filling degree of the device.
So if your device is much fuller,
there's less space to swap data around,
then the device will probably be slower.
And there's many different types of SSDs,
not only the interfaces,
but also internally how they're actually structured.
So if you have, like, an SSD-optimized system, that's probably nonsense, because it
really can be optimized for one type of SSD, but you cannot
optimize for all types of SSDs, because they're so diverse. And, well,
within the device there's a lot of complex software and diversity,
but then also in how to access it. So this is something that we'll also see towards the end
of the lecture, probably not today, towards the end of the lecture. So now let's do a quick three
minutes break, and then we're going to talk about PCI Express.
Let's look at the interfaces. How do we connect to this device? So this is kind of an old ATX motherboard. So modern motherboard will look slightly different but a lot of these connectors
will still look similar. So you have external ports, you have the CPU socket, you have the
memory slots, you have power controller, etc. But then you have these things here, right?
And this is where basically your storage and other peripherals will go. So this would be a PCI Express slot, and these are older PCI slots, and here you have SATA ports where
you would connect your regular SATA drives to. And these are then connected
to the CPU via different buses, and these are internal buses, similar to the inter-CPU connections.
But for the PCI Express, you have basically like a network
that connects to the CPU, to the PCI Express controller.
So the different devices will be interconnected
and can communicate with each other and can communicate with memory etc.
And there's different ways of building these buses. There's parallel buses and serial buses.
And parallel buses basically have multiple channels or links, not channels but links actually,
from the transmitter to the receiver and you're in parallel sending
multiple bits at a time so say for example a complete byte by having all of these links
like in this case eight links at a time. And that's good in general, or in theory, because you can do a whole byte in a single
connection, or a single
send operation,
and
basically the speed that you get then scales with the number of bits you send at a time.
However, there's a bit of a problem again.
I mean, on the one hand, this needs to be clocked in the same way,
so everything needs to be done completely in parallel.
And there's crosstalk.
So basically, there's an interference between these individual links,
and this worsens the longer this gets, right?
So if, I mean, for a short connection
might not be a big problem,
but if you have a longer connection,
all of a sudden you get kind of this crosstalk
across these lines, which then again,
you need error correction, et cetera,
makes everything more expensive.
And because of that, or not because of that,
but as an alternative, there would be a serial bus. And the serial bus only sends
basically one bit at a time. And because you only have a single channel or a
single link, there's no clock skew. So you don't have to synchronize between these different links.
You need less cables, but of course you cannot send as much in parallel.
But you don't have as much crosstalk and it's cheaper and smaller in space than having parallel
connections.
So looking at the different interconnects,
one would be SATA.
I said it's Serial ATA, which is Serial AT Attachment,
and the AT comes from the IBM PC AT,
where it probably stood for Advanced Technology, never really disclosed, but probably
in the end it means Serial Advanced Technology Attachment, although the abbreviation, or
SATA, really stands for Serial AT Attachment. It was announced in 2000 and was kind of a successor to Parallel ATA, or PATA,
where you would have the classical IDE interconnect. So this is something you probably have also seen, like these flat cables where you used to connect
your HDDs.
Maybe you've not seen them, but that's basically the successor.
And with SATA 3.0, right now we can get up to 600 megabytes per second, and that's from
2009.
And then there's different versions
with different kind of smaller updates.
But the performance hasn't really increased much.
And really depends on basically the clock speed
and that this bus gives you, essentially.
And SATA supports HDD, optical drives and SSD, but it cannot fully utilize
modern SSDs, as we'll see. I mean, slower SSDs, yes. Fast SSDs, no.
For a more general interconnect, Intel developed the Peripheral Component Interconnect or PCI that uses one clock.
It used to run at 33 MHz or up to 66 MHz, but can also be completely powered down to not use power. And it basically has two kinds of phases.
It's basically again a bus, right?
So everything's connected to one communication channel.
And then there's different interactions.
So you have an address phase where you basically
in the communication say,
well, I want to talk to this and that.
And then there's a data phase where then on this bus,
the device
will send the information that it actually wants to send. So in
terms of like if you think about communicating to a disk through this
then first the disk would say for example say I want to send data to the
memory now and then it would actually send the data. And it's burst-oriented, so there's a master or head and target relationship
and this would usually be controlled by the host, so the CPU basically says,
okay, let's just communicate with the SSD or the disk for example and then you have this communication.
And the boards are supposed to be plug and play so you can actually connect them in
and then they start talking to the bus and can communicate with the host and the PCI controller. And this is PCI, not PCI Express.
So this is what you would see in a Pentium processor.
And this would look something like this, for example.
So you have the PCI bus,
which is based on a PCI bridge,
and multiple connectors or multiple systems are connected to this.
USB would be connected to this.
The SCSI driver or controller would be connected to this.
You might actually have another ISA bridge, which is like an older bus, where then other
disks like the IDE disk could be connected to, etc.
This would then interface with the CPU and the memory in local buses. So this is basically the
smaller stuff on chip.
Older devices would have separate dies on the motherboard somewhere;
in modern chips, this is basically integrated into the die itself.
Because of course at a certain point this was too slow,
there was a newer version or an extension,
or a new standard let's say, in 2003,
which is PCI Express.
And PCI was actually a parallel bus, and this is a serial bus.
And because it's a serial bus, the connectors are much smaller, right? Because we only need like a
single connector basically
to do the communication, rather than, if we want to have a parallel connection, say for example for a byte, we need at least eight cables to communicate.
So then we can see this is actually much smaller.
So here you can see on an actual motherboard,
you can see this would be a classical PCI connection.
And this would have as many channels as this PCI Express x1 connection.
So here, for each connection, you would have two channels, bi-directional, and each of those
has two wires, one for signal and one for grounding, essentially, in order not to get crosstalk again.
And then you have these different lengths.
So you have x1, which would be two channels.
You have x4, which would be eight channels,
and x16, which would be 32 channels in this case.
And then we have kind of a network setup
where you have point-to-point serial connections in there.
And this is done through a switch, meaning we are sending packets between devices.
Then we have some error detection code in these packets,
and we can have quite long connections.
So basically these connections can be up to 50 centimeters
and we can have multiple switches.
It's a network, right?
So meaning we could have like another switch
where we again connect devices,
which then would connect to some kind of bridge chip
with the CPU.
But typically, as I said,
this could actually be integrated into the CPU itself today.
There is an actual protocol stack. So we have a physical layer
that does the bit transmission.
We have a link layer that basically deals with the packets.
So we have complete packets then.
We have redundancy checks, so basically error correction and retransmission if there's an error.
There's a question?
Yeah.
Yes.
Does the chipset on the motherboard, does this already play in here?
Yes.
So this is all done inside the chip.
This is not done in software.
So basically the PCI controller does everything here.
Because that is usually located quite far away from the processor, right?
Like physically far away.
You mean the chipset?
So where it's located on the motherboard, actually I don't know.
I think, again, it's a network, right?
So you can have multiple controllers connected to each other, talking to each other.
So, I mean, you will have the interfaces on each of the devices. Where exactly it is, I mean,
where it's connected or where it is on the motherboard, I don't know exactly.
Consumer motherboards, they have these chipsets, like AMD has the 300-something series. Yeah, I thought that those are like on the lower right half of the board?
So this plays into it, but also the CPU needs to be able to speak this, so there is something on the CPU that controls part of it. So say, for example,
you need a certain CPU version in order to get a certain PCI Express
version so that's not just part of the motherboard but how it... I don't know. So the CPU doesn't need to go through the chipset and then through the driver?
Yes.
So it's basically, they actually communicate with each other.
The switch does just the routing.
Yeah, then we have transaction layer and software layer, which basically the software layer gives the interface
to the operating system.
Okay, so PCI Express comes in different generations, and you also have this reduced form factor. So we saw
this earlier, like this, what you would see on a motherboard, right?
This here, the different kinds of widths, so how many channels we have. There are also versions for
laptops, which would be the mobile version, so this would be like a x4 connection,
and you see this, like if you have an SSD for a laptop, you will have these
slightly different connectors there. And then there's different versions. Most servers right now
still run on generation 3, if I'm not wrong, or newer ones already have generation 4, and the just-upcoming servers, so
Sapphire Rapids for example, would have generation 5 PCI Express. And then,
with 16 channels or 16 lanes, we're basically getting up to 63 gigabytes per
second throughput, hypothetically.
But this is basically what we can theoretically
read from a single device.
Again, then we might have, or we will
have more connections that actually go to the CPU.
So having more devices will actually
give us even more performance or bandwidth across these devices.
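As a quick sanity check on that 63 gigabytes per second figure, here is the per-lane arithmetic; the raw per-lane rates and the 128b/130b encoding are the standard PCIe values, and the result is a theoretical upper bound, not what a real device delivers:

```c
#include <stdio.h>

int main(void) {
    /* Raw signalling rate per lane in gigatransfers per second. */
    double gen_rate_gt[] = { 8.0, 16.0, 32.0 };   /* Gen3, Gen4, Gen5 */
    const char *name[]   = { "Gen3", "Gen4", "Gen5" };
    int lanes = 16;

    for (int i = 0; i < 3; i++) {
        /* 128b/130b encoding: 128 payload bits per 130 transferred bits,
           then divide by 8 to go from gigabits to gigabytes per second. */
        double gb_per_lane = gen_rate_gt[i] * 128.0 / 130.0 / 8.0;
        printf("%s x%d: ~%.0f GB/s theoretical\n",
               name[i], lanes, gb_per_lane * lanes);
    }
    return 0;
}
```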
I would say, in the interest of time, let's stop here.
Now you have an overview of what
the different storage technologies are, what an SSD looks like internally, and how
we connect it, so this is basically PCI Express. And next time, meaning tomorrow, I will at least
briefly touch on Non-Volatile Memory
Express, which is the interface that we're using.
This is basically a software interface, how we're communicating with the SSD then.
We can use different interfaces, but non-volatile memory express is the one
that actually gives us the best throughput and is specifically designed for SSD.
Okay, thank you very much. Questions?
No questions?
Then thanks a lot, see you tomorrow.