Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Storage & Networking

Episode Date: June 28, 2023

...

Transcript
Starting point is 00:00:00 Welcome everybody, let's get started. So today we're going to finish up storage, NVMe mainly. And then we're going to switch to networking. And I have too many slides, so I'm not going to be able to finish both of them today. So we'll do half here and then we're going to start this part, basically. We'll see how far we get in RDMA. Next week, for sure, we're going to do GPU. And this means we're going to run over with RDMA to the 11th
Starting point is 00:00:40 and basically probably push part or all of FPGA to the 12th, into this Q&A session. Because next week the GPU lectures will be held by Ilin, a PhD student of mine who's been working on GPUs a lot, so that's a good fit. And since we have this extra slot for Q&A, it's not really a problem. So then all of that should be fine. Okay, so just as a reminder, we've been talking about storage devices and I gave an overview of all the different kinds of storage that we have. Then I went, I hope, deep enough into solid state drives, mainly to explain to you what solid state drives look like internally, their inner workings,
Starting point is 00:01:35 and that there's a lot of variability in the design, which also means there's a lot of variability in the hardware and a lot of variability in the performance. And that again means: if you want to get the best performance out of it, you really have to adapt to the device. However, most importantly, it's a very parallel device. And that means you need parallel access to it in order to fully utilize it. And that's not given with the traditional interfaces that you have for classical hard drives. Classical hard drives are highly sequential or serial devices.
Starting point is 00:02:26 So, for those you really don't need this kind of access. So, and because of that we need new interfaces. So, classical devices which you would connect through SATA or SCSI or things like that, or even PATA before that, they have a block-based interface and very short queues basically, like command queues, because you're more or less synchronously, continuously accessing them.
Starting point is 00:02:59 And for the SSDs, as I said, we need this kind of more parallel access, and this is given by NVMe. So let's switch right here. Today we're going to talk firstly about the Non-Volatile Memory Express standard, which is the standard for how we ideally access SSDs. So this is basically a new type of protocol and then of course the according hardware
Starting point is 00:03:33 that allows us to talk more efficiently to SSD drives. And maybe to also give you this relation again, right? So we had this last time. I even had another one, this one here, right? So on the board, you don't have an NVMe connector. You have a PCI Express connector, right? So this is basically a PCI Express connector. There you can connect an SSD that can talk NVMe. And similarly, you have the SATA ports, and that was a question last time.
Starting point is 00:04:14 Could we also talk NVMe through SATA? No, basically the NVMe standard is defined over PCI Express. So, SATA, you have AHCI, something like that. That's basically the older standard on how to communicate with a drive. That's defined for this. And you could also connect through older standards through PCI Express, but you actually
Starting point is 00:04:41 want to go through this non-volatile memory express standard. Okay, so it's basically an interface specification or the non-volatile memory host controller interface specification that is just on top of the SSD. So this is basically, let me take a pointer here. This is basically the kind of driver or the kind of protocol that the SSD will talk. So underneath, of course, then we're not just there yet. So this doesn't mean this directly goes to hardware,
Starting point is 00:05:25 but in the SSD, we remember there's a couple of additional stages that the data needs to go through, a couple of additional software and hardware layers before we're actually ending up in the actual pages and blocks and logical units. However, this basically gives us an interface to connect the SSD directly to the PCI Express fabric. And fabric here means the network. So, as we said, PCI Express is kind of a network which has switches, etc., and then directly communicates with the CPU.
Starting point is 00:06:06 Rather than going through additional layers, we can directly talk with NVMe to the SSD. It's fairly new, in terms of what new means by hardware standards. So it was first specified in 2011, and the newer version, NVMe 2.0, was released in 2021. And the new version also specifies different kinds of transport models. So basically rather than just using PCI Express, you can also use other kinds of fabrics, other kinds of interfaces to talk NVMe. And this basically gives you the opportunity also to connect your drives directly through something
Starting point is 00:06:59 like networking, like say, InfiniBand or Fiber Channel. So rather than having the SSD just on your board, you could also have it somewhere else in your rack and talk to the SSD through more like a message passing interface. But we're not really concerned with this today, so what's of interest to us is really this kind of memory mode where we have the PCI Express and then we're basically getting data to and from the SSD.
Starting point is 00:07:52 The communication, we basically need to set up queues. So it's a parallel device. We know that it takes some latency for the device to process individual requests, but it can process more requests or multiple requests at a time. So that means we somehow have to queue things up, right? So we cannot do a synchronous access to the device. So if we would do synchronous, we would basically say, okay, here is a command.
Starting point is 00:08:20 Please give me, for example, this page. And then I would have to wait and do nothing. And the SSD would basically just serve this one page and come back. And in order for this not to happen, we need certain queues, where we basically send multiple commands or continuously send commands. And the SSD basically continuously processes these commands and sends the results back. And these are actually quite a few, right?
Starting point is 00:08:53 So each queue can have 64,000 commands and there can be 64,000 queues. And this is basically also to serve these multiple channels that we have in the SSD. So we have different channels in this SSD which connect multiple logical units. In order to fully saturate the multiple logical units, we need a queue. In order to fully saturate multiple channels, we need multiple queues. The queues can be located in the device or in the host, in a kind of memory-mapped space. And here you can basically see what this could look like. So we have a kind of management queue, and then you have individual I/O queues, which would basically connect to the SSD and would individually send requests. And this follows a certain protocol. And the way this basically works, there's a doorbell register at the SSD. So you're basically,
Starting point is 00:10:08 with this doorbell, signaling that there will be commands next. So this is basically what the host does. If you want to, say, read some data, then the host will write the commands to a submission queue, which basically is the queue where you say, okay, these are the reads and writes that I would like the SSD to do, and then signal this via the doorbell. And then the controller of the SSD will fetch these commands and execute them. And of course this is continuously done, right? So you will continuously write into the queue, and the controller will continuously fetch commands, because only then can you actually fully saturate this, and it will write the results into a completion queue.
Starting point is 00:11:10 And then basically, once the commands are completed, while the controller writes to the completion queue, it can also generate, or does generate, an interrupt to signal that something has been completed. And then the host can basically fetch the results from the completion queue. So there's a submission queue and a completion queue. And once all of the commands have been done,
Starting point is 00:11:48 then the host will again update the doorbell in order to, for example, finish up this queue. So then our queue would be done. And we could start a new queue, for example. And there can be many queues in parallel. So this is kind of the continuous process. But of course, there can be many queues, and there can be many entries within a queue.
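To make this submission/completion flow a bit more concrete, here is a heavily simplified, hypothetical host-side sketch in C. All struct layouts, field names, and the doorbell handling are invented for illustration and hide a lot of what a real NVMe driver has to do (PRP/SGL lists for data buffers, interrupt handling, the real command and completion formats, and so on).

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, heavily simplified host-side view of one NVMe I/O queue pair.
 * All struct layouts and field names here are invented for illustration;
 * real NVMe command/completion formats and doorbell registers differ. */
struct nvme_cmd { uint8_t opcode; uint16_t cid; uint64_t lba; uint64_t buf; };
struct nvme_cpl { uint16_t cid; uint16_t status; uint8_t phase; };

struct queue_pair {
    struct nvme_cmd   *sq;           /* submission queue in host memory    */
    struct nvme_cpl   *cq;           /* completion queue in host memory    */
    volatile uint32_t *sq_doorbell;  /* MMIO register on the device        */
    volatile uint32_t *cq_doorbell;  /* MMIO register on the device        */
    uint32_t sq_tail, cq_head, depth;
    uint8_t  phase;                  /* expected phase bit for new entries */
};

/* Host side: put a read command into the submission queue, then ring the
 * doorbell so the controller knows there is something to fetch. */
void submit_read(struct queue_pair *qp, uint64_t lba, uint64_t buf, uint16_t cid)
{
    struct nvme_cmd *cmd = &qp->sq[qp->sq_tail];
    cmd->opcode = 0x02;                       /* "read" in the NVM command set */
    cmd->lba = lba; cmd->buf = buf; cmd->cid = cid;

    qp->sq_tail = (qp->sq_tail + 1) % qp->depth;
    *qp->sq_doorbell = qp->sq_tail;           /* the doorbell write */
}

/* Host side: poll the completion queue, then tell the controller how far we
 * have consumed it by updating the completion doorbell. */
void reap_completions(struct queue_pair *qp)
{
    while (qp->cq[qp->cq_head].phase == qp->phase) {
        printf("command %u done, status %u\n",
               qp->cq[qp->cq_head].cid, qp->cq[qp->cq_head].status);
        qp->cq_head = (qp->cq_head + 1) % qp->depth;
        if (qp->cq_head == 0) qp->phase ^= 1; /* phase flips on wrap-around */
    }
    *qp->cq_doorbell = qp->cq_head;
}
```

The important part is only the pattern: the host fills submission queue entries in host memory, a single write to the doorbell tells the controller there is work, and completions are picked up from a separate queue, so many commands can be in flight at once.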
Starting point is 00:12:19 There's different kinds of commands in NVMe. So there's the admin commands, which basically create and delete the queues. And then, yeah, some primitives for identifying the device, because it's PCI Express, right? So it's plug and play. You could basically open up your server,
Starting point is 00:12:43 put in a new device, and then through NVMe basically detect the device. Of course, first it needs to be detected through PCI Express, be on the bus, but then you can identify the device, get log pages, et cetera, identify the capabilities of the device. And then you have the command sets, and there are different types of command sets. So you have the, let's say, namespace or classical NVM commands, which are basically read, write, etc. commands.
Starting point is 00:13:25 And this gives you the kind of block-based abstraction that you would regularly use if you're using a hard drive. And then there's also a zoned namespace command set. And there you can basically split up the device into different zones and work with those independently, basically like a partitioning. And then finally, there is also a key value interface,
Starting point is 00:13:59 which would make the device essentially like a key value store. So this is also part of the standard. Rather than using it as a file or as a block-based device, you can use it like an object storage in a way, or just like a key value store, where you have a certain key size, so 16-byte key size, and then you can basically store and retrieve, so basically just a regular CRUD, so create, read, update, delete, but you can also just check if something exists.
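Just to illustrate the shape of such an interface: the following declarations are purely hypothetical, they are not taken from the NVMe specification or from any particular library, but a user-space wrapper around a key-value namespace could look roughly like this, with keys of up to 16 bytes taking the place of block addresses.

```c
#include <stddef.h>

/* Hypothetical wrapper around an NVMe key-value namespace; the names are
 * made up for illustration. The point is that the application addresses
 * data by key (up to 16 bytes) rather than by logical block address. */
int kv_store(int dev, const void *key, size_t keylen, const void *val, size_t vallen);
int kv_retrieve(int dev, const void *key, size_t keylen, void *buf, size_t buflen);
int kv_delete(int dev, const void *key, size_t keylen);
int kv_exist(int dev, const void *key, size_t keylen);   /* just check existence */
```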
Starting point is 00:14:42 And then there's fabric command sets, which basically helps you to set up like any kind of connections between the devices. Yeah, so we basically, well, we covered this to some degree. So we have the admin commands, which I already said, like identify the device, etc. And create and delete the queues. And also with the directives or these certain kind of NVMe directives, the host can cooperate with the SSD. So basically the host can tell the device, okay, I'm going to stream certain amounts of data directly to the device without additional block-based identification. Or basically you can have even multiple streams sending the data to the device or from the
Starting point is 00:15:54 device, and identify complete streams in order to logically, let's say, make them into one unit that can, in a combined fashion, be garbage collected, et cetera. So if you have, say, different streams of data or different partitions of data, then you can say, okay, I will deal with those separately. So I can logically separate the device into multiple parts. So in the end, with these kinds of interfaces, I can on the one hand use the device just like a regular disk, and that's probably the way you will use it most of the time. But you can also split it up into multiple partitions and use those as separate partitions, or you can use it with
Starting point is 00:17:00 this kind of key value store interface, where you don't really have to deal with the block-based addressing at all, but you're just using keys to address the data. And so you basically have much more functionality through this. However, most of the time you're just going to use it as is, as just a storage device, basically. And in order to do so, you need the Linux I/O frameworks.
Starting point is 00:17:32 So you've briefly seen the stack already. So classically, what you have is your hard disk drive, then you have your SATA driver. The SATA driver gives you this block-based layer, right? So it gives you a block-based interface to the device. On top of that, you will set up your file system, or the OS will set up a file system. And then POSIX, this is kind of the standard API for file systems or for block-based interfaces. The POSIX API will allow you to basically interact with the file system and this block-based abstraction,
Starting point is 00:18:20 on top of which you will have your actual application. The block-based interface will give you this nice abstraction of a linear array of fixed-size blocks. So it's similar to our virtual address space, right? A similar abstraction, just that with the virtual address space we're talking about smaller units, and here we have these larger blocks, but again fixed-size blocks and a linear abstraction.
Starting point is 00:19:06 You're basically addressing by the logical block address. Then you can either read or write at that block address. And the POSIX interface or POSIX API basically abstracts the storage as a collection of files. And I mean, we didn't talk about file systems so far. This is something that I'm discussing in Big Data Systems. But usually, the file system will also be based on different kinds of blocks and links from blocks to blocks, so basically directory blocks, et cetera, that map ranges of data into a collection
Starting point is 00:20:01 of blocks, basically. And the original specification of POSIX was all synchronous commands. So basically the way you would interface with the file system and the SATA driver and then finally also the drive is, you're saying, okay, I need this block basically. Say for example, you're reading a logical block or you want to have one block with a logical block address
Starting point is 00:20:29 then basically your CPU would wait until this block is served and then get the next one. Of course, we know the CPU will do more stuff. You will get kind of a page fault, you can do other stuff in between, so the CPU doesn't have to wait only for this, but you can do some other stuff in between. However, you're not going to send additional requests at the same time. So basically the system is waiting, or the POSIX command would be waiting, until you get something back. And all your application will basically map everything that needs to be written to disk
Starting point is 00:21:14 into these files, and those again will be mapped into the blocks. And then there's this whole ecosystem around this. And synchronous means really that there's no queue, right? So it means basically you have, I don't know, your memory or something and you need a logical block. So then you're basically asking for reading this logical block or writing this logical block. And then eventually it will come back once it's done. And of course this is slow, especially if you have parallelism in your device. If your device can do multiple things at the same time, then you cannot really use this parallelism in any way if you don't have some kind of queue, if you cannot send multiple requests at the same time.
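As a minimal sketch of this synchronous style (the file name is just a placeholder and error handling is mostly omitted): each pread() only returns once its block has been served, so the device never sees more than one outstanding request from this loop.

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <stdlib.h>

int main(void)
{
    int fd = open("datafile", O_RDONLY);   /* hypothetical file */
    if (fd < 0)
        exit(1);

    char buf[4096];
    /* Read 1000 logical blocks of 4 KiB, strictly one after the other. */
    for (off_t block = 0; block < 1000; block++) {
        if (pread(fd, buf, sizeof(buf), block * 4096) < 0)
            exit(1);
        /* ... process buf ... */
    }
    close(fd);
    return 0;
}
```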
Starting point is 00:22:07 And so in order to get some parallelism here, different kinds of asynchronous I/O operations were introduced in the mid-2000s. And this basically means we have some kind of queue to which our commands will be sent. And then eventually the system or the drive will process them as fast as it can, or as parallel as it can, and complete the results somewhere. As we saw for NVMe, this would also basically go into another queue here.
Starting point is 00:22:50 So we have basically a command queue or a request queue and then a completion queue in order to get the results back. One of the first APIs or libraries for asynchronous I/O was AIO, which was a user space emulation that basically used a thread that held a queue internally and used the synchronous I/Os through POSIX, in order to at least have this worker queue, or just this request queue, at the application level. Then later there was libaio, which actually uses asynchronous I/O through an io_submit system call. So this is then already a kernel call, where you already have these kinds
Starting point is 00:23:55 of queues and you get results back through a system call again, in order to have this callback function to get the result back into your application. I don't want to go through all of the details here today, but you can see there are multiple different ways or multiple different iterations on how this was built. In order to get even more performance or a better abstraction, more modern I/O frameworks were built. One of those is io_uring, which has pairs of circular submission and completion queues, so basically ring buffers for submission and completion, which are shared between
Starting point is 00:24:57 kernel and user space. This is basically one way of doing this, and this is one library that you can use today in order to get efficient access to SSDs through NVMe. And io_uring is available through liburing, which has all of the helper functions that you need. And I mean, this is a bit text heavy here, but you can basically see they also have different ways of accessing the NVMe device. So you can bypass the buffer cache, which would basically be your page cache in the file system. You can even bypass the file system, so you're directly writing the blocks or pages on the device rather than going through the file system.
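As a hedged sketch of what this looks like with liburing (the file name is a placeholder, error handling is omitted, and a real engine would keep refilling the ring instead of doing one batch): several reads are queued into the submission ring, handed to the kernel with a single submit, and reaped from the completion ring later, so many requests can be in flight at once.

```c
#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>

#define DEPTH 64

int main(void)
{
    struct io_uring ring;
    int fd = open("datafile", O_RDONLY);          /* hypothetical file */
    static char bufs[DEPTH][4096];

    io_uring_queue_init(DEPTH, &ring, 0);         /* set up the SQ/CQ ring buffers */

    /* Queue DEPTH reads without waiting for any of them. */
    for (int i = 0; i < DEPTH; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i], 4096, (off_t)i * 4096);
        io_uring_sqe_set_data(sqe, (void *)(long)i);
    }
    io_uring_submit(&ring);                       /* one call submits the whole batch */

    /* Reap the completions; they may arrive in any order. */
    for (int i = 0; i < DEPTH; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        /* cqe->res holds the number of bytes read (or a negative errno). */
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```

Opening the file with O_DIRECT (and suitably aligned buffers) would additionally bypass the page cache, which is one of the bypass options mentioned here.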
Starting point is 00:26:13 You can also bypass the block layer, so directly write the pages on the device, and if you want, there are different ways to do that. And then if you want to get even more efficient, you can completely bypass the kernel and do all of the interaction with the SSD in user space. And for this there is SPDK. That's an Intel development framework where you can, using a C or C++ API, directly access the device through NVMe without touching the kernel at all. And using SPDK you can have zero copies, so without any additional copies in memory. If you write something in your application through the kernel, then there would basically be copies in memory that go from the application memory to the kernel memory, etc.
Starting point is 00:27:29 But here you can directly write through. So there are no additional copies. And so you get lower latencies, but you have some more CPU utilization through using this framework. You also see all of the PCI Express and NVMe limitations. So you basically have to deal with the queues, etc., which you would be shielded from if you go through the kernel; the kernel will basically deal with all these different problems for you if you don't want to do this directly. io_uring, again, would shield you from PCI Express and NVMe, but basically has, again, different software layers,
Starting point is 00:28:34 so you won't have zero copies, so you're basically copying the data in memory again and into the hardware queues. So that basically gives you some additional latency. Then there's something called xNVMe, which is a user space library that basically gives you an abstraction across all these different kinds of frameworks.
Starting point is 00:29:01 And so that's something under development at Samsung and by some colleagues in Copenhagen. There you can see that xNVMe basically gives you the abstraction to, by configuration, either use the POSIX API, use the different kinds of asynchronous libraries that we saw, go to the block layer using io_uring, use io_uring with NVMe pass-through, or, using SPDK, go directly to the SSD device. So it's kind of a uniform abstraction layer over this. And now, finally, just as an example of what the difference is if you're using these different kinds of interfaces: there is very recent research from TU Munich friends, Haas and Leis, where they basically tried to see what kind of interface we
Starting point is 00:30:20 actually want to use, or how we can use SSDs most efficiently in our system. So here they basically were questioning whether, in an OLTP-like system, we can actually saturate modern SSDs, each of which can deliver a significant amount of performance. So they can do, I think, 12 million IOPS basically. And then... No, I think each of them can do 1 million IOPS, or 1 point something million IOPS, and then all in all they can do basically 12 million IOPS, which then is
Starting point is 00:31:20 basically also what you can get through PCI Express here. And then they basically checked how many outstanding IOs do they need to fully saturate the setup. So this is basically what you can see here. It's a bit small, but you can see here is the number of IOPS. So this is not the throughput. It's just like IO operations, so reads and writes etc. And you can see that they basically need a queue of 4,000 elements or 3,000 to 4,000 elements to
Starting point is 00:31:54 actually fully saturate these SSDs. So this is basically the outstanding requests, this is what we need in order to have the drives working all the time. And then they tested the different frameworks. And you can see that with libaio and io_uring without the pass-through, you basically cannot fully saturate these disks. But using io_uring with enough threads, you can actually saturate all of the disks. And with SPDK, even much earlier, you can fully saturate the system.
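A back-of-the-envelope way to see why so many outstanding I/Os are needed is Little's Law: outstanding requests ≈ throughput × latency. Assuming, just for illustration, an average device latency of around 300 microseconds (a plausible figure, not one quoted in the talk), saturating 12 million IOPS requires roughly 12,000,000 × 0.0003 ≈ 3,600 outstanding requests, which is in the same ballpark as the 3,000 to 4,000 queue entries seen in the experiment.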
Starting point is 00:32:46 And basically what they saw is by further analysis that basically the blocking requests are a problem and then kernel interrupts. So going through the kernel basically gives you additional latency. These additional copies, etc. will make the accesses slower and will not allow them to fully saturate the SSDs. And this is, I mean, it's interesting in a way because usually we would think that basically
Starting point is 00:33:18 the disk is always the bottleneck. But here we see it's not really the disk that is the bottleneck in these cases, but it's really the way we access the disk. Meaning, just giving more disks to the system won't speed up your system anymore, because we're not limited by the disk speed anymore, but we're limited by the I/O interface. So using a block-based interface, or using too large blocks, for example, would be a problem. So this is also why they are using four kilobyte pages there. In the original design, they had 16 kilobyte pages, but they saw basically just additional reading and writing.
Starting point is 00:34:06 If you have 16 kilobyte pages and the drives work with 4 kilobyte pages, then every time you're reading something small or writing something small, you have an extra four-fold data read or data write, so an amplification which will slow your system down. For example, a transaction that only touches a single 4 kilobyte page still drags a full 16 kilobyte page through the device. In the end, your I/O throughput, the total bandwidth you're getting in terms of data that you're reading and writing, might be higher using the 16 kilobytes, because the drive actually has to do more, but the useful I/O, so what the system actually utilizes, will be lower.
Starting point is 00:34:48 And this is also what they basically show here. This is the relative CPU time that the system uses in terms of useful transactions. So the green part is basically useful transactions. The yellow part is eviction, meaning writing data back to disk if your buffer is full. Because if you have data on disk, then basically if you want to read something new in, you have to write out something else.
Starting point is 00:35:24 So this is basically this data transport. And you can see with SPDK, you're only doing useful or almost only doing useful work and eviction just because this is basically all what will cost you something, right? While using the other frameworks, a lot of time basically goes into this polling and IO submission. So mainly sending new IO requests, which would then lead to these copies in the kernel space, etc. Okay, interesting read anyway. So if you're interested in more details how to fully utilize NVMEs in a modern system, I would recommend to read this. It's just very fresh research right now.
Starting point is 00:36:14 Also, if you're interested in much more detail on NVMe, etc., and the architecture, then, as I mentioned last time already, and it's linked on the front page, there's a tutorial by Alberto Lerner and Philippe Bonnet, who really discuss in depth all these different kinds of frameworks and how they act together, and also this xNVMe layer on top. Unfortunately, Haas and Leis didn't try the xNVMe abstraction on top, which also would have been interesting, to see if that actually works in this seamless way as promised.
Starting point is 00:36:57 So then testing these four should have been easy in a way. Okay, with that, this concludes the storage part. Any questions on the storage so far? The takeaway for you is really: if I have this kind of hard drive, so this would be this Optane super fast SSD thing, then I cannot use it like a regular hard drive. So I really need to talk to it in parallel. Again, everything we see here in this course is basically always this idea:
Starting point is 00:37:42 I somehow need to parallelize my stuff, I need to be aware of the latencies and the trade-offs, and the same goes for storage. Essentially, just using something faster won't give me better performance today. Okay, with that we're going to switch to networking, but because we're almost halfway through, I would say let's do a quick break here. Five minutes, and then we're going to continue with networking. Good. So, first part of networking. Still, typically everything today is connected through PCI Express, just like this, also network, GPU, FPGA. And then, as we said, towards the end,
Starting point is 00:38:30 we're going to talk about CXL, which is yet another standard on top of PCI Express. And today, we're just going to start. So I don't know how far I'll get. All of the RDMA stuff will for sure be next time. And as I said, this will not be next week, but then the week after. So I'm going to move a bit here, because this needs to be in that week, the two GPU lectures. Yeah, so today, in the rest of today's session, we're gonna talk about parallel database management systems
Starting point is 00:39:11 and Rack Scale Computing and parallel programming. So first, what's, I mean, we've seen some parallelism already, but now we're talking about parallelism and distribution beyond a single server. So this is basically when networking comes into play. And well, if we talk about distributed and parallel setups, then we can also talk data center. So this is where it really gets huge,
Starting point is 00:39:49 just like our data center. Our data center is not as big as this, but it's multiple racks, basically. And today the network is not simple and slow anymore; actually, the network is quite smart and quite fast today. And if you want to use a multi-node setup for a database system and basically do data processing on more than a single node, then in a way you need to do the same thing that you would do in order to use multiple cores, etc.
Starting point is 00:40:32 You need to somehow distribute the work. But on a distributed setup, what you're typically doing is splitting up the database into different parts. And then you want to execute your queries, either with inter-query parallelism or intra-query parallelism, across these multiple servers in parallel. Who remembers which was which?
Starting point is 00:41:01 So intra-query parallelism means intra-inside. So we have a single query and we're splitting it up into multiple nodes. Inter-query parallelism means we have multiple individual queries and we're sending them to individual nodes. So if we have an OLTP type of workload, that would mean ideally we can run these individual queries locally on a single node. So we somehow have to be smart about partitioning the data in a way that each individual server can answer them separately.
Starting point is 00:41:42 Then things can be really parallel and we can get a nice speedup, at least for most of them. We won't be able to guarantee this for everything if transactions use multiple data items at a time. If we're talking about large-scale data analysis, this means we want to run a huge query on multiple servers. This means we somehow have to break up the query into different parts and be able to do the data processing separately. There's different ways
Starting point is 00:42:18 we can do this. There's shared nothing, which means, and this is what we often see in clusters set up like ours. So we have separate servers, basically. Each of them have a CPU, have RAM, and have storage. And then there's a network connecting them. And it's shared nothing because they don't share any resources and they only communicate over the network. Different from this, and this is what you would see in the cloud for example. In the cloud you have a setup like this, which is a shared disk setup.
Starting point is 00:42:58 So you have multiple servers, each of them have CPU and have RAM. Technically they also have storage, but you're using one big storage somewhere in the background that contains all your data, that has kind of one uniform large address space again. Again, this will not be one single disk. It will be multiple servers, but it's separated. The compute is separated from the storage.
Starting point is 00:43:26 So this means we have multiple servers, each with private CPUs and memory, also with private storage somewhere, but we have one large shared storage system. So if you think about AWS, your S3 would be this large separate storage, right? And your individual EC2 instances would be the compute nodes that you do your data processing on. And there's, I mean, from the get-go, because databases have always been large, data tends to grow
Starting point is 00:44:07 quite quickly, or data sets do. So this means already in the 80s, people built parallel database systems, all of them, of course, on-premise. So back then, nobody would think about the cloud or doing this kind of processing over large-scale networks, because the networks were just way too slow. But on a smaller scale, networks could already be reasonably fast. So there were a couple of systems like Volcano, Gamma, NonStop SQL from Tandem, and Teradata.
Starting point is 00:44:40 So these are classic systems, all of them initially shared nothing, meaning all of them have separate nodes, each with their individual storage, and then you partition the data across these individual nodes in order to process the data in parallel. Then after 2010, there are new parallel database systems like SAP HANA, which is a shared nothing setup. You have Snowflake as a cloud database, for example, which uses this shared disk setup, meaning a disaggregated setup where you have the S3 storage in the background
Starting point is 00:45:21 and then your EC2 instances as compute nodes. Then similarly, Amazon Aurora, which is also a very similar cloud database management system. Of course, there are also many in between here. Oracle, for example, is a typical shared disk setup. While Oracle can be a standalone instance, there are also these parallel Exadata setups, which then have a huge storage network, a huge storage system, and multiple compute nodes. And if we're talking about large-scale processing, the one typical unit of scale would be a rack. So this is also what we have in our data center. So this is kind of a deployment unit, a rack. So this is typically 42 rack units. And you will see when we go to the data center,
Starting point is 00:46:34 this looks pretty much the same. Again, it's not as packed, it's not as many, but it looks something like this. So we have 42 height units, and a server can occupy one or multiple of those. These height units will actually host the individual processing units or compute resources. It's kind of a sweet spot between larger clusters and a single server, because this is the typical way
Starting point is 00:47:05 things are deployed, right? And this is also the way database systems are often sold. Companies will sell you one rack of database system as a complete deployment with everything that you need in there. You can of course also buy individual servers, which would then be one rack unit in height, for example, or two or four, that's also possible. But you can also buy the complete rack with everything in there, and if you need more, then you will get multiple racks. But you probably won't get one and a half racks or something like this. This is one scale unit. Well, a rack-scale computer then is usually some kind of pre-packaged setup, or of course custom configured, but you will have some
Starting point is 00:48:03 compute nodes, you might have some accelerators. So today for sure you will have some graphics cards or graphics processing units in there for all kinds of AI applications, etc. You will have storage: hot, warm, and cold disks, meaning different speeds and different sizes. So cold storage would be something where more of your long-term archival data, or not even archival,
Starting point is 00:48:34 just data that you're rarely using, would be lying around on spinning disks. Data that's hot will be lying around on SSDs, or even, with larger caches, on some kind of memory appliance where your data is much faster to access. Then you have interconnects, fast interconnects, and all of the servers within one rack will be connected to the same switch, which means latencies will also be quite low. Then you have something like a top-of-rack switch, which would connect your rack with
Starting point is 00:49:13 other racks again, and the bandwidth might be the same, but the latency will be higher because you have additional hops in here. So within the rack you actually have quite nice connectivity, and you typically try to set it up in a way that it's nicely balanced between compute and storage. This would basically be a server-centric data center setup. You would basically have these individual appliances then connected to the data center network. So then this could be your rack with individual servers. In the individual server, we already know, there's this intra-server network-on-chip fabric,
Starting point is 00:50:13 but then also PCI Express on the motherboard, etc., which connects everything to everything. Then you have the network card, which actually goes to the data center network. And this is kind of a server centric, but like recently or increasingly also, the data centers are scaled in a resource centric way. So we already saw database management systems or cloud database management systems
Starting point is 00:50:49 are frequently split up into this compute and storage. And we also call this disaggregation, meaning we're not basically deploying servers that already have their storage packed in, but we're building something separate, like separate storage that we can scale independently of the compute. And increasingly, rather than having everything aggregated
Starting point is 00:51:14 in a single server and packaged from the get-go, this is disaggregated. And we're basically using the fast network, say for example PCI Express across multiple servers, in order to further split up the servers and disaggregate the individual resources. And a very simple example is storage: having separate storage nodes and not relying on the local storage of the nodes. But there is also the idea, as more products are coming, that you can do this for memory as well. And basically say, rather than having a fixed memory budget per node, we're being more flexible by having a fast interconnect in between the memories, or having separate memory appliances
Starting point is 00:52:12 where then, rather than just using our local memory, we can also use other memory before we actually put something on the disk. So we have different kinds of pools. Basically, rather than having individual server pools, where we scale by server, and we might not utilize all of the CPU of the server because we are mainly memory bound, or we just use a lot of memory, we
Starting point is 00:52:36 can say we're scaling independently memory, CPU, and storage. And then hopefully, we have more other applications which give us some kind of diverse workload such that we can saturate everything. So in the end the idea is we have kind of compute somewhere, we have storage somewhere, we have fast storage somewhere, we might have
Starting point is 00:53:01 accelerators somewhere and we have memory separately. And depending on the application, basically what we're doing right now, we can utilize whatever we actually need. And then also we're able, from a data center deployment perspective, to separately add more of whatever resource we need. So rather than saying, okay, I'm doing disaggregation, but actually I'm just using communication,
Starting point is 00:53:28 like fast networks, to simulate this disaggregation, I actually have a separate memory appliance, something that has a lot of memory that can be used across other servers. And this is only possible if basically the network is fast enough. Okay. So this will come basically.
Starting point is 00:53:52 I mean, to some degree, this has already happened. As I said, for storage, this has already been there for a while. It will also increasingly come for memory. And, well, database management systems were one of the first rack-scale applications and deployments. And here you can see basically what
Starting point is 00:54:22 a typical Exadata machine would look like. So basically you can go to the Oracle website and you can order such a rack, which then would have a large storage grid. So you can basically see, well, we have lots of high-capacity disks and flash in order to have tiered storage: some storage for the hot data, and large amounts of storage for the, say, petabytes of data in general. And then we basically have additional or separate servers
Starting point is 00:55:01 for doing the actual processing, the compute, the actual transaction processing. And then these servers would basically just use the data appliance to retrieve and push the data in and out. And in order to be fast with this, or in order to make this work efficiently and fast enough, you really need really fast networking in here. And for this, for example, you have RoCE,
Starting point is 00:55:37 so RDMA over Converged Ethernet, with 100 gigabit per second. So this is a really fast network in here. So this is kind of an overview of what large-scale database systems in modern appliances would look like. So you're separating this up. And now we're going to look at, for the last couple of minutes, how we can program this. There are basically different ways how we can go about this.
Starting point is 00:56:13 This is basically something that we've already seen before. So if we have multiple cores on our single socket, all of those cores will have a uniform memory access on the same socket to the same RAM. Unless we have these very large CPUs, which already are basically split into multiple NUMA groups, then we already have non-uniform access there. But in a regular laptop setup, all of the cores will have the same access to the DRAM. It's basically the same speed.
Starting point is 00:56:55 And we're using the whole memory as one uniform address space, and everything has kind of the same speed as well. If you're looking at larger machines, like server machines with multiple sockets, the memory actually, and we've seen this, is connected to the separate sockets. And then there's an internal fabric in between the different sockets. So Infinity Fabric or UPI, or what was the cross? I think it's X something. Yeah, cross link, I think. So anyway, there are different kinds of interconnects. And
Starting point is 00:57:49 basically, the cores have their own DRAM, but if you want to, you can also access the other cores' or the other sockets' DRAM through this interconnect, which will give you a higher latency and will be limited by this fabric in between. So this we already know, right? This is a single server, but it means we have non-uniform access, so we're not seeing everything in the same way. And in a cluster setup, well, either we're using kind of classical networking, which
Starting point is 00:58:30 means every server basically has its own RAM, and we can access other servers' RAM only by message passing, by messaging the other servers, which means the cores, actually the CPUs, have to talk to each other and send the data to the other server. So this means, well, if core 3, for example, needs something from this DRAM, then it will talk to this core, this core will need to read the data and will send it over. Basically, we have a lot of communication, etc. The network, of course, is shared between the different cores. Today, we also have remote direct memory access, meaning we don't necessarily have to go through the CPU.
Starting point is 00:59:27 So, whenever we're doing Ethernet and TCP IP, basically, then all of this stack, all of this communication, so whenever it's IP communication, it needs to go through the CPU. The CPU is always involved in packaging up the data, communicating with the others, splitting up the package, having this whole network stack that you've probably learned in your network classes, right? The whole protocol stack. But using remote direct memory access, there's a way that through the network and through PCI Express, basically
Starting point is 01:00:12 the different servers can talk directly to DRAM rather than going through the CPU. So with this, you don't have to utilize another server's CPU if you want to access their memory. There are different ways how to do this, and this is actually something that most likely we'll see next time in more detail. I'll need some power soon. Well, we'll probably manage. Let's see. Okay.
Starting point is 01:00:46 Now the question is, I mean, this is basically what the systems look like, right? So this is the physical way of how the servers are connected. Now the question is, how do we actually use this? And there's
Starting point is 01:01:02 different kind of programming models that we can use. And independent of what's underneath, these programming models will give us a different kind of abstraction on top. And one abstraction is shared memory programming. It's basically we're looking at all of the memory across all of the servers as a unified or shared address space. Then all of the communication in between is basically done by the framework or by the compiler for the programming model. Say for example if you have Pthreads or OpenMP. So if you're using this,
Starting point is 01:01:47 so OpenMP would be a library and compiler framework for your parallel program that lets you program in a way as if you have one large memory, and internally translates this to all of the message passing or network communication. Then you can also have a partitioned global address space. And this is when you are aware of the separate address spaces. This is basically as soon as we're talking about NUMA, right? And we're aware of NUMA,
Starting point is 01:02:30 like the non-uniform address space, this is what we would program against. So we know we can access the data somewhere else, but we also know it's gonna be slower. So it makes sense to keep our local data and we're pinning threads accordingly, for example, to make sure that we try to be as local as possible, but we still have this global address space.
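To make the shared-memory end of this spectrum concrete, here is a minimal OpenMP sketch. It runs on a single shared-memory node; frameworks that stretch this model across servers keep the same programming style and generate the communication underneath. The array sizes and names are just for illustration.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];
    double sum = 0.0;

    /* The programmer sees one big array in one address space.
     * OpenMP splits the loop across threads; where the data physically
     * lives (which socket, which NUMA node) is hidden by the model. */
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * b[i];
        sum += a[i];
    }

    printf("sum = %f\n", sum);
    return 0;
}
```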
Starting point is 01:02:56 And finally, there's explicit distributed programming. So this means we actually say these are separate nodes, these are separate memory regions, and we explicitly program the communication using, for example, message passing. So MPI would be the standard library and approach to program in parallel by using message passing. MPI stands for Message Passing Interface. So there, basically, you're directly saying, okay, I'm sending some information from node 1 to node 2, and I'm expecting or requesting some kind of response here.
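A minimal MPI sketch of this explicit, two-sided style (the ranks and the payload are just for illustration; run with at least two processes, e.g. mpirun -np 2):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double value = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Explicit two-sided communication: rank 0 sends, rank 1 receives.
     * Both sides have to participate, and the programmer decides exactly
     * what is sent to whom. */
    if (rank == 0) {
        value = 42.0;
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}
```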
Starting point is 01:03:49 Partitioned global address space would be like this NUMA style of programming, for example, if you have, say, large distributed memory machines. As I said, this can be one NUMA machine, for example, or an RDMA machine, or it can even be like regular Ethernet connections, but there is an abstraction layer through the programming model that shows you just this global address space. And then when you use this partitioned global address space, if you have separate nodes, then the compiler will add the code that you need for remote variable access. So, in this sense, you as a programmer can basically access the remote variable and just use it as a regular variable. And then the compiler will basically add the code
Starting point is 01:05:12 for the communication. But as a programmer, you also need to be aware that if you're accessing something remotely, this will cost you something, right? You need to be aware of the implicit data movement. So in order to get good performance, you need these kinds of NUMA-like optimizations. Meaning, again, you want to
Starting point is 01:05:33 be local as much as possible. You don't want to have too much communication and you don't want to move data around all the time in order to be performant. Remote memory access, again, is kind of a shared memory programming abstraction. So basically, we can also have explicit read and write operations to other memory. So we can basically say, well, I have five nodes and from my current node, from my current processor, basically I want to write something into another processor's memory. And this will be explicit read and write operations, but you don't have to, again, don't have to code the actual message passing if you have this. But
Starting point is 01:06:31 the data needs to be explicitly loaded into the cache coherency domain, but this can be done by the compiler again. Or, depending on the way your programming interface is done, you have to explicitly load it and you have to explicitly write it back, basically, if you don't have the same cache coherency domain. In remote memory access,
Starting point is 01:07:01 you also have one-sided operations. And this is basically where you really get good performance or additional performance boosts. Because, rather than having a communication pattern where you say, okay, this CPU talks to that CPU about how the memory needs to be changed, you can just say, I want to push data exactly into this memory location. I mean, there needs to be certain memory already prepared for this, so a memory region on another node that one node is allowed to access. But then the remote CPU doesn't need to be involved.
Starting point is 01:07:41 So only the CPU issuing the read or the write is involved, and it accesses the memory directly. So this means we can directly access another node's memory. And the CPU on the other side can, as soon as the data is there, continue to work on it. There are also some atomic operations, where you can basically do fetch-and-add or compare-and-swap operations directly.
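As a hedged sketch of what such one-sided operations can look like at the programming level, here is a minimal MPI RMA example. The exposed memory region is the MPI "window", and fences are just one of several synchronization options; ranks and payload are for illustration only (run with at least two processes).

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double local = 0.0;          /* memory region exposed for remote access */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every rank exposes one double as a window that others may write into. */
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        double payload = 42.0;
        /* One-sided put: write directly into rank 1's window.
         * Rank 1 does not post a matching receive. */
        MPI_Put(&payload, 1, MPI_DOUBLE, 1 /* target rank */,
                0 /* displacement */, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);       /* completes the access epoch on all ranks */

    if (rank == 1)
        printf("rank 1 sees %f without posting a receive\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```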
Starting point is 01:08:21 So this remote memory access is also available in message passing libraries or in IB verbs, which would be InfiniBand verbs, for example. So here's another image of how this basically works. Remote memory access means we're accessing another node's memory. And this can be done through regular networking, but it can also be done through the hardware, and this is then called remote direct memory access. So if the hardware directly supports this, then basically in the application we set up certain buffers, which are basically memory regions on each of the servers. And then, rather than copying this in the CPU and communicating through the CPU,
Starting point is 01:09:20 these buffers can directly be accessed by the network interface cards. So then this application could basically say, I want to write something in the buffer of the other application, and it will tell this to the network interface card, and this NIC can then directly write into this buffer, rather than having to talk to the OS and having to talk to the application. So then basically we have certain memory regions that we can directly access and that we can directly write to. There's still this way of basically ensuring, so we have two-sided communication, so I'm sending something, the other side says, oh, I've received something,
Starting point is 01:10:07 so in order to have the synchronous communication, in order to ensure now something's happening, if I want the other side to react directly on my communication, then I need two-sided communications, but I can also say, I don't care when it's done, or I'm just using the additional memory, for example, just to push more data over there. And I might take it back later. Then I can use this one-sided communication.
Starting point is 01:10:34 So I'm just pushing the data through and pulling it back. Okay. So then this finishes up until here. And since I'm running out of battery and we're close towards the end, then I would say let's finish it here. Do we have questions so far? No? Good. Well, then we're going to continue with networks and RDMA
Starting point is 01:11:05 the week after next week. Next week we're going to talk about GPUs. So Ilin will tell you all about GPUs. And yeah, thank you very much. And see you on... I'll be there on Tuesday as well.
