Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Networking
Episode Date: July 2, 2024...
Transcript
Okay, so welcome to our next session on hardware-conscious data processing.
So, just to say, we're going to continue with networking and RDMA.
You had an intermission with some GPU in between.
I hope you enjoyed that.
And so, but I'll just
get back to where we were, right?
So we were on
PCI Express talking to the
network. We had the GPU already.
We had the disk. And we're
close to actually wrapping everything
up, right? So today we're going to finish
networking or RDMA.
Then we're going to start
tomorrow with FPGAs,
have our last task, then FPGA second time,
and we'll have an invited talk.
So this one's important to me, right?
I repeatedly said that this is going to be interesting.
We're going to hear about FPGA in practical applications,
which actually I also am quite interested
in.
And then we do the summary and data center tour.
So are you going to join the data center tour?
You're interested in looking at the servers, et cetera?
Because then I will set this up.
So then we're going to do like half of the session.
We're going to do a bit of wrap up,
maybe some discussions.
Also, if you think of things that we should do differently in the future,
or where we can improve,
or if there are topics where you say,
okay, this I did not fully get,
or this was too slow,
or please do more on this end, then
feel free to collect this and let us know. We're always trying to improve this somehow.
Okay, so I've basically talked up to here. Now I want to cover RDMA basics and how RDMA is used in database
management systems, for example. We've already covered this to some extent with the rack-scale
applications, right? So if you have a rack-level server, using InfiniBand and RDMA makes a lot of sense because you get really
high-throughput networks. So this is the slide where we stopped last time. Hi folks, move to the front.
Okay so this is the slide. So you remember basically, I mean, this is
kind of the hierarchy that we always have,
right? We have the registers, we have the caches,
several levels of caches,
then we have the local memory.
And in the past,
in traditional hardware setups,
we would say, well, then for a long time nothing,
then we have disk, and then again for a long time nothing,
and then we have the network.
But nowadays, using remote memory through low latency networking,
we can actually shift this, right?
So remote memory is shifting up, into the range where the SSD and HDD used to be.
And there, to be honest, there's actually always kind of a,
let's say, a bit of fighting against each other.
So disks are becoming, or SSDs are becoming faster.
So you can get lower latencies, although not necessarily
the low latencies that you might get with remote memory,
especially if you're thinking about also CXL devices, which
we've covered also before.
So far, I almost exclusively talked about InfiniBand,
if I remember correctly.
And there is an alternative.
There's a couple of alternatives to this.
So while InfiniBand is the method of choice for RDMA because it has the lowest latency and it has very high throughputs,
you can actually also get the same kind of functionality using classical Ethernet networks.
And this is called RDMA over Converged Ethernet, or RoCE (pronounced "Rocky").
So instead of using InfiniBand,
which is a special type of networking infrastructure
where you need special switches and special NICs,
you can use Ethernet switches and NICs.
They also need to support that.
So it's not your general off-the-shelf basic Ethernet switch and NIC,
but they need to be RoCE enabled.
And then you can embed your RDMA messages in UDP datagrams.
So basically you're using UDP to send InfiniBand messages.
This generally also works with
standard Ethernet, but then it's slow. So if you just package this
in regular Ethernet UDP, you basically get
plain Ethernet or IP performance.
So as an example, if you have a 100 gigabit Ethernet
network, which is kind of similar to EDR
InfiniBand, then you get the same kind of 12 gigabyte per second
goodput.
I pulled this out of some research paper
by Ericsson
where they compare different kinds of
high-performance network
options. There you can see that with
native InfiniBand
you're in the sub-microsecond range, so hundreds
of nanoseconds basically. With
RoCE you get a slightly higher latency, but it's the same ballpark. If you're using native Ethernet for the same
kind of stuff, then you're about one order of magnitude higher in terms of latency, right? So
this is just a lot slower. But RoCE can get very similar, slightly
higher, but very similar throughput. And if you look at the bandwidth, you get basically the same kind of
bandwidth that you would get with InfiniBand. The only slight difference might be in the
final goodput. So you remember, goodput is the amount of
throughput that you actually get
for your application.
So the bandwidth that the network gives you,
in terms of, for example, 100 gigabit Ethernet,
means 12.5 gigabytes per second in total.
But then you have all the packetization, right? So you have the
Ethernet headers, you have the IP headers, et cetera,
and with this you end up with a certain goodput. So basically, the goodput is the amount of bandwidth
that you get when you subtract all of the packet overhead.
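Just to make that concrete with rough numbers (my own back-of-the-envelope estimate, assuming roughly 4 KB payloads and around 80 bytes of combined Ethernet, IP, UDP, transport-header, and framing overhead per packet; exact header sizes depend on the protocol variant):

$$ \text{goodput} \approx 12.5\ \text{GB/s} \times \frac{4096}{4096 + 80} \approx 12.3\ \text{GB/s} $$

which is in the ballpark of the roughly 12 gigabytes per second mentioned above; with smaller messages the header share grows and the goodput drops accordingly.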
So looking at this, InfiniBand versus RoCE
bandwidth is, again, very similar.
So you get the same kind of bandwidth,
and you get a very similar latency.
So RoCE is a good alternative today
and much better than native Ethernet.
There's yet another alternative,
because InfiniBand
is more of a rack-scale networking technology.
Right, so meaning if you have a whole data center, you're not going to have a single
InfiniBand switch connecting everything, and also the cable lengths are quite limited.
So if you're using copper cables, this typically means maximum of seven meters of
cable length. The higher the throughput or the bandwidth, typically the lower the length. So
this might be only three meters then. And this means you're not going to get much more cabling
than a single rack. Then otherwise you're going to have a problem. If you want to be across data centers,
then there's iWARP,
which is officially not an abbreviation, it's just called iWARP.
So that's basically RDMA over TCP.
And that again means you can just use
your regular TCP network,
and that means you can use RDMA, et cetera, over long distances.
So again, this is a special type of hardware.
Otherwise, it's going to be slow.
Otherwise, we're back to this kind of stuff here.
And modern NICs often support both.
So modern Ethernet NICs that do converged Ethernet can do RoCE, and they can also do iWARP.
Of course, iWARP has higher latencies, but it's good if you really want to do this across very large network topologies. So this is just an alternative, in essence. So far we've basically covered the network hardware,
and now we're going to go into RDMA itself. So essentially now we have the hardware and the
protocols to do RDMA messaging and this type of communication.
Okay, so let's set up RDMA. RDMA is remote direct memory access. So the idea is that I can somehow directly access the memory from one server on another. And of course I cannot
access it directly, I need some kind of networking infrastructure in between,
but the networking infrastructure
does the communication for me.
It does, like I just, from an application point of view,
I just set up the connection points,
I just set up the memory regions,
and then I connect these memory regions
and the networking infrastructure does the copying for me.
So it basically copies data back and forth. And for this we need buffers.
So we need areas in our memory that will be copied back and forth, where we basically have designated memory that can be used by the NICs. So this means the memory actually needs to be pinned,
so it's not going to get lost somehow.
So it's not paged out, which would be slow,
because then the OS would need to bring it back,
and the NIC does not know about that.
So it needs to be pinned.
We need the address translation information.
So in order to know what kind of
addresses on the one end are reflected in the address space on the other end.
And we need to set the permissions and then we need kind of local and remote keys that are used
by the adapters for the RDMA operations.
So that's kind of this.
We have memory regions on both ends,
and we have kind of a channel between them
that we're starting to set up.
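To make this setup step a bit more concrete, here is a minimal sketch of how registering such a pinned memory region looks with the libibverbs API (my own simplification, not lecture material; the function name is hypothetical and error handling is omitted):

```c
/* Minimal sketch of registering a pinned RDMA buffer with libibverbs.
 * Assumes ctx is an already-opened ibv_context and size is a multiple
 * of the page size; error handling and cleanup are omitted. */
#include <infiniband/verbs.h>
#include <stdlib.h>

struct ibv_mr *register_buffer(struct ibv_context *ctx, size_t size, void **buf_out) {
    struct ibv_pd *pd = ibv_alloc_pd(ctx);          /* protection domain */
    void *buf = aligned_alloc(4096, size);          /* page-aligned buffer */

    /* Registration pins the pages and returns lkey/rkey for the adapters. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    /* mr->lkey is used in local work requests,
     * mr->rkey is what the remote side needs for one-sided read/write. */
    *buf_out = buf;
    return mr;
}
```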
The actual communication itself is then based on queues.
And that doesn't really depend on what kind of operations
that we're doing,
but on both ends, or on both NICs basically, we'll have sets of queues, and in particular
three queues in any case. So we have the send, we have the receive, and we have the completion
queue. And maybe I'll close the door just quickly.
There's some noise from outside. Okay. So we need these three queues basically for telling the NICs what needs to be done.
So there's a send, a receive and a completion queue.
The send and the receive queues are the work queues.
And these are always created in pairs.
And you can have multiple of these pairs.
And then there's a completion queue, which basically tells us what is the work that is already done.
And this is always, how would you say,
mirrored on both ends, like on both parts of the connection.
So you're connecting server A and B,
then A and B will have a send and receive
and a completion queue.
And the send queue on A reflects basically
or is basically connected to the receive queue
on the other end and vice versa.
And on both sides, we have the completion queue
to just say, okay, what has been done and what's not.
The send and receive queues are there
to schedule the work that needs to be done.
So the send queue basically says on this side,
I need to copy this buffer, for example,
I need to copy this memory region to the other side.
The receive queue says, okay, I need to copy this into
my memory region here. So one end basically says data goes to the NIC and is sent through the
network. The other side says data from the NIC goes into the memory. And the completion queue,
once this copying has been done, the completion queue basically has entries in there to say these work items have been done.
This is basically just to notify when work has been done, so that the application does not have to wait synchronously for the communication, but
the communication can be done asynchronously by using these completion queues.
So applications issue jobs using work requests, and these send or receive requests can be put into the
work queues. And essentially these reference buffers: in the send queue, a work request is basically a
pointer to the data that needs to be sent; in the receive queue, it's basically a pointer to
where the data that has been received should be placed.
When this has been done,
well, then there's a completion queue element
that will be put in the completion queue.
The application can basically see,
okay, this has been done.
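As a rough illustration of these queue mechanics (again my own sketch with libibverbs, not from the lecture; the queue pair, completion queue, and memory region are assumed to be set up already, and error handling is left out):

```c
/* Minimal sketch of two-sided RDMA send/receive with libibverbs. */
#include <infiniband/verbs.h>
#include <stdint.h>

void post_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len) {
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad;
    ibv_post_recv(qp, &wr, &bad);        /* enqueue a pointer to the target buffer */
}

void post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len) {
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 2, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_SEND,
                              .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad;
    ibv_post_send(qp, &wr, &bad);        /* enqueue a pointer to the source buffer */
}

void wait_completion(struct ibv_cq *cq) {
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0) /* busy-poll the completion queue */
        ;
    /* wc.status == IBV_WC_SUCCESS means the work request finished */
}
```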
From a stack point of view,
this looks like this.
So we have our application that posts these work requests.
And each work request is essentially a message or a unit of work.
Then we have the interface.
We have, yeah, it could be a library, for example,
that basically gives us the kind of commands that we have.
So it can be one-sided or two-sided commands.
We'll go through this in a bit more detail in a second.
And then we have the driver that maintains the work queues, manages the address translation, and has the
mechanisms to fill the completion queue and raise these completion events.
And then underneath we have the actual hardware, right? So we have the network interface card
and the network protocol where we have different kind of transport layer depending
on the underlying networking infrastructure. So in InfiniBand that might be reliable, in Ethernet
it's unreliable or we might even put this on top. So if we do InfiniBand over Ethernet, so classical, without any special hardware,
then there might be a whole network stack on top,
which is underneath the driver.
There, we then also packetize the messages
and implement in software the whole protocol that needs to be done in there.
So just as a kind of performance view on this whole thing: we said we have InfiniBand, currently we're at NDR, so typically one network connector,
so if you have one cable that would consist of four links,
then you would get up to 50 gigabytes per second.
So 400 gigabit per second or 50 gigabyte per second.
The specification for XDR is also already out.
That would then give us double again, so 800 gigabit per second, or 100 gigabytes per second.
So this is how InfiniBand currently is scaled.
So just by doubling this.
It used to be less than that.
So QDR was 32 gigabit per second, FDR 54.
But now from EDR, they're basically just doubling
and doubling all the time.
If you look at modern RoCE hardware,
I just looked this up yesterday, you can also get up to 400 gigabit per second, so the same kind of
performance basically. Again, similarly, with a single adapter, meaning a single cable, you can get up
to 400 gigabit per second.
This would be, in this case, for example,
a PCI Express 5 x16 connector.
So it would look basically the same as the InfiniBand setup.
And again, we can also do this using iWARP.
Right, and you see this is still scaling.
So there's also new, and I don't remember
what comes after XDR, but there's new protocol levels
on the horizon.
RDMA itself is just a mechanism, right?
It also does not really specify any semantics of the data transfer.
It's really just, okay, copy data back and forth.
It's just basically about how we access the remote memory.
And there are two ways, or two models, of how to access the memory.
There's one-sided access, which basically means read, write, and atomic operations. Meaning I can just say, okay, I want to write this data into the remote memory, or I want to read this region of memory from the remote memory. Or I have these atomic operations that we discussed
during locking. I can also say I want to do an atomic fetch-and-update or something like the
compare-and-swap operation. I can do this remotely in the remote memory
and I will have this in an atomic fashion.
So nothing else can basically then happen in between.
If I do the compare and swap,
I basically get the memory, the value back.
It's locked on the other side.
I can exchange the value.
Obviously, one shouldn't say obviously,
but it makes sense that this is slow.
So this basically needs a complete round trip to the other memory.
So this will have a fairly high latency, much higher than if you do this locally.
And then there's two-sided access.
So where we have send and receive, which basically
means both ends actually wait for
these things to happen. So I'm sending the data and the other side receives it.
Say I have an application that runs on two servers.
Application on server A sends the data
and application on server B waits for the data and receives it.
And that's different from the read and write,
because in the read and write, the data just appears in the memory
and the CPU is not involved.
In the send and receive, both applications, both CPUs do something.
Both CPUs basically are part of the communication.
So like classical networking, where you have like a socket and you're reading from the socket.
So this is more the way send and receive works.
So send and receive is more this traditional message passing, where both the source and the destination
are actively involved in the communication. Both have their queues in the application. So you have
a send and a receive queue and the completion queue for the queue pair.
And then the sender, in its work request,
basically has a pointer to the buffer
that needs to be sent.
And that's then enqueued in the send queue.
And the receiver has a pointer to the memory region
where the data will be received,
and that receive work request is also enqueued. So let's look at how
this actually looks like. So we have the send queue, we have the receive queue, and we have
the completion queue on both ends and so this will be now a send operation for example and these are
just pointers right so that's again important. I'm not enqueuing the data here into these queues.
I'm just enqueuing pointers here. So I have my registered
memory that's pinned and I have a buffer
that needs to be transferred. On the other side, I have the same thing.
I have my pinned memory and a buffer where the data should be
received. And to some degree, in applications we will always use some kind of
send and receive when we're establishing communication.
Unless we have some kind of global knowledge about all the data transfers,
so everything is completely static, as soon as data sizes, for example,
are dynamic, I will need some kind of communication to set up
these buffers, to make sure that there is enough buffer space for the stuff that I want to
send there, for example. So this part is usually done with a send and receive operation.
So if I want to transfer this, the first thing is basically I create a work queue item in the send queue.
At the same time, on the system that receives, there needs to be a work queue item or work queue entry on the receive queue.
And that then basically points to the buffer regions where the data actually should go to.
And once this is enqueued, then the NICs know what to do.
So this is basically the setup for the network interface
cards to know where data should be sent.
On this side, basically,
where do I need to read the data; on that side,
where do I need to write the data.
As soon as this is done,
the NICs can actually do the actual copying.
So the NICs then do this independently.
There's no CPU involvement for that.
That's basically the NIC just reads the memory region,
sends it over.
And once it is done, so once this side has sent everything, it creates a completion
queue element. Once the other side has received everything, it creates a completion queue element.
And with this, we know that our transfer is done. So with this, basically, we have one RDMA message
sent through. And from the CPU point of view,
it's really just establishing basically
these queue pair elements.
So like here, I'm creating the send queue element
and I'm waiting for a completion element on this side,
like here, and on the other side,
I'm establishing basically a receive queue element and again
waiting for the completion element here. Okay if I'm doing a read and write then only the sender
side, or basically the side issuing the request, is active. This would basically mean, in a read and write operation,
rather than having a receive queue and completion queue here,
I only need this on this side.
I'm only setting this up here.
Then I'm sending this through the NIC.
The memory region needs to be pinned.
So the other side must be set up in a way that the NIC knows
where to put the data,
but the application does not need to be aware
what's going on on this end.
So basically just, I'm just basically sending this through.
And once this is completed here, I mean,
the NIC will still have their queues here.
But the CPU doesn't need to do anything on this side.
So I'll just basically send the data through.
So the passive side doesn't issue any operations here, uses no CPU cycles,
and doesn't get any kind of information
that the transfer has been done or not.
So basically on the passive side,
if I'm writing data to the passive side,
I don't know if it's written.
So on the passive side,
I don't know if I've received anything or not.
I can basically
just poll every now and then, I can have some kind of status bit, or I'm using specific
kinds of read or send and receive operations. Additionally, in order to signal that
all my writes, et cetera, are done, I will use some communication again to say,
okay, now everything is done.
And so that would be a pattern that I can use.
If I want to do this read and write,
then I basically need the remote side's
virtual memory address
and the remote side's memory registration key.
This all needs to be set up, right? So the remote side needs to be aware that something can
happen there, and that the sender is actually allowed to write something into the memory region.
So the active side needs the address and the key. And this is typically exchanged through these traditional send
and receive mechanisms.
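Sticking with the same hypothetical verbs-based sketch as before (my own illustration, not lecture material), a one-sided write then just carries that remote virtual address and registration key in the work request:

```c
/* Minimal sketch of a one-sided RDMA write with libibverbs.
 * remote_addr and rkey must have been exchanged beforehand (e.g. via
 * send/receive); qp and mr are assumed set up; error handling omitted. */
#include <infiniband/verbs.h>
#include <stdint.h>

void rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *local_buf,
                uint32_t len, uint64_t remote_addr, uint32_t rkey) {
    struct ibv_sge sge = { .addr = (uintptr_t)local_buf,
                           .length = len, .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 42, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_RDMA_WRITE,
                              .send_flags = IBV_SEND_SIGNALED };
    wr.wr.rdma.remote_addr = remote_addr;  /* remote virtual address */
    wr.wr.rdma.rkey        = rkey;         /* remote memory registration key */
    struct ibv_send_wr *bad;
    ibv_post_send(qp, &wr, &bad);
    /* The passive side's CPU is not involved; only the local completion
     * queue will eventually tell us that the write has been placed. */
}
```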
So looking at this, and this is more of a quick overview
of the type of operation.
So just maybe for you to look it up,
this is also in the slides.
It's obviously in the slides, but you can use the link to check how to set this up. So this is basically the whole set of operations that you need to do in order to actually do an RDMA write here. So I initialize the whole setup, the connection, et cetera,
in order to then do the actual write.
So it's basically first from here and continues from there.
So you can see this is maybe the last box here
is the first box over here.
Okay, there's a couple of things that you can optimize. So we see, I mean, we've seen there is some involvement.
I mean, there's multiple steps, right?
So even if we do just a send and receive,
it's multiple communication ops, right?
So we need to exchange
the queue information first, then we do the actual transfer, and then we need to exchange
the completion events, in order to know that things actually happened, right, that the
transfers have worked. And so the completion queue can actually make things slow, right?
So if I'm synchronously sending things,
basically I have to wait until the data
has actually been transferred.
I have to wait for the acknowledgement.
So this is always round trips that I'm basically wasting
while I could maybe already send some more data
because I'm sending the data.
The data has been sent.
I get a completion event.
Then I basically go back to the application and say,
okay, now next thing to do.
So there's some cycles that get lost.
In order to not lose these cycles,
what I can do is I can do so-called doorbell batching.
Rather than issuing individual
work queue elements and then waiting until my work queue element is completed,
I can actually issue multiple of these before going to the completion queue. And this is doorbell batching; there's a small sketch of this a bit further down. Another thing is that these NICs
typically even have multiple processing units. So in there it's not just a single
processor essentially but there's multiple of these in order to get some parallelism.
And this parallelism we can actually use. So rather than having a single queue,
we can have multiple queues
and use these processing units in parallel.
So if everything goes to a single queue,
then only a single processing unit can run,
and that might also be the bottleneck
at a certain point in time.
And in general, Atomics might seem like a really good idea
if you have a distributed data structure,
but they are actually slow.
So if you put everything, like all of your logic
into these kind of atomics in RDMA,
then your application will wait for a long time
because it basically has nothing to do for a long time.
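As a rough sketch of what doorbell batching could look like at the verbs level (my own illustration, assuming between 1 and 16 prepared send buffers per batch), you chain the work requests and ring the doorbell once:

```c
/* Minimal sketch of doorbell batching: chain several work requests and post
 * them with one ibv_post_send call, signaling only the last one.
 * qp and mr assumed set up; assumes 1 <= n <= 16; error handling omitted. */
#include <infiniband/verbs.h>
#include <stdint.h>

void post_batch(struct ibv_qp *qp, struct ibv_mr *mr,
                void **bufs, uint32_t *lens, int n) {
    struct ibv_sge sge[16];
    struct ibv_send_wr wr[16];
    for (int i = 0; i < n; i++) {
        sge[i] = (struct ibv_sge){ .addr = (uintptr_t)bufs[i],
                                   .length = lens[i], .lkey = mr->lkey };
        wr[i]  = (struct ibv_send_wr){ .wr_id = (uint64_t)i, .sg_list = &sge[i],
                                       .num_sge = 1, .opcode = IBV_WR_SEND };
        wr[i].next = (i + 1 < n) ? &wr[i + 1] : NULL;   /* chain the requests */
    }
    wr[n - 1].send_flags = IBV_SEND_SIGNALED;  /* only one completion event */
    struct ibv_send_wr *bad;
    ibv_post_send(qp, wr, &bad);               /* one doorbell for the batch */
}
```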
Okay, so this is an overview to RDMA so far.
So the basic concepts, I guess, clear, right?
Okay, so now that we have an understanding of how this works
in contrast to traditional networking,
let's see how we can use this in a database system.
And classically, many databases are shared-nothing architectures. We already said there's also this shared-disk architecture that's been done for some time, and that is also where fast networks have been used
before, because in a shared-disk kind
of setup, you typically ship the data to your processing units.
So you have multiple processing units.
You ship the data there.
And for that, you need really fast network.
Otherwise, the data transfer will always be a bottleneck.
The alternative, if you have slow networking,
then what you do is you try to have a smart partitioning over the nodes.
So if you have classical Ethernet
and a large-scale setup, you're trying to basically have the data nicely partitioned across
the nodes, and for each node in the setup we try to keep all of the accesses as local as possible.
This works for many applications because many applications actually partition nicely. We also
discussed this earlier because
you're dealing with individual customers, you're dealing with individual
warehouses, you're dealing with individual sites
in your setup. So that makes life a bit easier.
As soon as you're distributing stuff, as soon as you have to go
across the network, then things actually get slower.
So then you need to communicate. And if this
is socket send and receive, this actually costs
cycles. And it costs latency because it's slow.
So in order to do this in this shared nothing architecture,
then you really want to have everything local,
like all of the operations local,
and you want to co-partition data.
Meaning if you have a classic database schema,
I mean, if we're thinking about an analytical workload, for example,
then you might use some of your dimensions in your data warehouse schema that nicely partition the data set. As I said, customers, for example,
you basically co-partition your database by certain customers and make sure
that individual customers end up with all of the data that you might need, end up completely in
these individual nodes. So this co-partitioning of data is important. As soon as you're just
randomly partitioning, then there's lots and lots of communication.
So if I'm just horizontally partitioning my data without any further information, then
many operations will basically have to cross the network, which will most likely,
just kind of like in a classic network setup, be the bottleneck. An alternative is this kind of shared-nothing architecture
with IP over InfiniBand, where rather than using
Ethernet, I'm just using
the much faster InfiniBand.
So in this case, basically, I don't need to change the code, right?
I can just use the database as is.
And unlike with classical Ethernet, any kind of cross-partition or cross-node
workload will not harm me as much, because the network is actually faster. So with this, as we saw, I can get 50 gigabytes per second using HDR, for example, or NDR, and that with a single connector, right? So a single cable, and I might have multiple of these,
so I can have fairly decent bandwidth in between the nodes,
also with lowish latency.
But if I use the classical code still,
I still have this overhead through the small messages.
And also, if I have classic networking,
I still need to go through the CPU.
So if I still have my classical IP network stack,
then my CPU needs to do all the packetization.
And the problem with this is that at a certain point, this actually
makes up a major part of the overall CPU utilization. So if I have IP networking over
InfiniBand and I have very fast InfiniBand, all of a sudden my CPU is utilized 10 to 30% just with networking.
Just basically packetizing
data and
sending it over by copying buffers, right?
So whenever we have different levels of packets,
I need to copy data into these different buffers
in order to get the packets aligned.
And so that basically doesn't necessarily fully utilize the network and it saturates
the CPU to some degree.
So we're basically losing a lot of efficiency. So an alternative to this is using
like a shared memory architecture with RDMA.
And the difference here is that we're not using the IP stack.
So we're really trying to basically keep
certain memory regions attached with RDMA.
And with this,
the NIC can do all of the transport for us.
Not all of it.
We have to issue some stuff,
some communication, but the actual data transfer,
like all of the small packetization, et cetera,
this is completely done by the network.
This can be completely done by the RDMA,
InfiniBand, or RoCE hardware. So as soon as these buffers here are established and I'm telling the
network, the NIC, to copy data back and forth here, then my CPU doesn't have to do anything
locally anymore and I don't need to deal
with this. If I need data from another node, I can basically just say okay I need to issue an RDMA
read and if I have something to do while the read is done, it doesn't really cost me anything from
the CPU perspective. I just pay some latency until this is done,
but if I have local data that I can process as well, I can start with issuing requests to the
remote memory, process the local memory while the remote memory is basically copied over,
and then I can do the local copy. As long as I'm not fully saturating the network here, and I'm not just
waiting for the data to be copied because I have some local data, the remote access
is more or less for free. Of course, I've utilized the bandwidth, but I might be able to split this up somehow smartly. So this basically,
if we fully take advantage of this, we can fully utilize the hardware, we can have low latency,
and we minimize the CPU overhead, right? So this is important. This, however, requires us to use
RDMA. So we cannot use traditional sockets.
And this means we need to change the database management system, which
we do here anyway all the time.
So in essence, this creates a shared memory architecture.
So we're basically looking at the whole memory
as one large memory from a conceptual point of view, where then we just copy back
and forth what we need locally.
And we can do this either with send and receive work requests, or we can do it with read and write
work requests.
So send and receive whenever we need to establish where the data needs to go,
so which buffers are copied where;
and read and write just in order to get what we need locally. If we're updating
something, I can just issue a write; if I need remote memory in order to read it, I can just
issue a read on the remote memory, and the remote CPU doesn't have to be involved.
So this is also called
a network-attached memory architecture, or NAM.
And here the key assumption would be
that network communication is not expensive, right?
So this is kind of an extreme point of view,
as it really depends: if a lot of your data is local and you can do
a lot of the processing locally, then networking is not expensive
anymore, because you have this high throughput, and you can use it to a much larger degree
than in classical setups.
So if we think about this from a distributed system
point of view, then often a problem is kind of we have this,
yeah, we have these coordination efforts.
So if we have lots of small messages,
means lots of small communication,
we get lots of latency increase,
but that's not so much a problem because of RDMA,
because RDMA has low latencies,
especially if we can use this asynchronously, right?
If the CPU doesn't have to wait for this.
If we have distributed data flow, as soon as we have skew, this might be a problem. And also if we have slow bandwidth or skew,
then this means that whenever we have to go through the network, this might slow us down.
But InfiniBand has the same order of magnitude of bandwidth as the memory.
And with load balancing, well, we still have to do some load balancing.
So in order to have this kind of network attached memory setup or architecture, we want to use RDMA.
We want to have locality still. If everything's distributed, everything
will need to go through the network.
And even though the network is fast,
it will be the bottleneck.
So we have, say, 50 gigabytes per second of throughput,
but that's not the same as what we have locally.
One DIMM gives us around 50 gigabytes per second of throughput
in DDR5, or a bit less, but the same order of magnitude,
right? So in a single server, you can get a couple of hundred gigabytes per second of bandwidth. And this will be
for sure the bottleneck if everything needs to go through a network. But if we can keep some locality, we're going to be fine.
And also we want, I mean, there's network everywhere, right?
On all of the nodes.
So this means we can use it on all of the nodes.
So we want to make sure that we can somehow leverage the access to all nodes.
And this also goes back to some degree to this,
where it's kind of a plug again for CXL, right?
So as soon as we can somehow disaggregate this even more,
we can make sure that we can utilize
all of the resources as well as we can, right?
So just basically say, okay, I want to make sure I'm utilizing all my CPU well,
I'm utilizing all my memory well, I also want to utilize all my network well.
And by disaggregating this, I might even be able to disaggregate the memory to some degree and
say utilize the memory from different locations. That's not in here yet.
Okay let's look at this from an experimental point of view. So if we're looking at this kind of
setup here, right, we're using RDMA to connect the memories, and we either do send and receive, where the CPUs are involved, or we do read and write,
where only one CPU would be involved and the data is directly written to memory.
If we're using classical TCP connections...
oh no, we're doing shared-nothing with two-sided RDMA here. This is already RDMA.
And we have a scale-out experiment on TPC-C. So TPC-C is the OLTP benchmark. So we have
small update workloads. It's not these very large analytical workloads, but small transactions. So we touch a couple of rows, basically, typically localized.
So it's not necessarily everything distributed.
If we have shared nothing, two-sided RDMA,
then you can see we're, let's say,
200, 250,000 transactions per second.
And you can see scaling from 1 to up to 15 nodes here.
If we're doing NAM-DB, so this kind of setup
where we use one-sided RDMA and there is no co-location,
so the data is completely distributed or randomly distributed more or less across the nodes,
then we see this nicely linearly scales with the number of nodes,
so up to four million transactions per second.
If we're co-locating,
so we're making sure that most of the accesses are actually local,
so we're co-partitioning,
but some of the accesses might go to another node, then we can see that it scales even much
better, right? So in this case, basically, I think 90% of the transactions
are local.
10% might be distributed.
So it's 10% of the accesses might actually hit the network.
Then we can see we scale up to 6 million transactions
per second, which is actually quite significant throughput. Okay, so this is kind of an overview of how we can
basically utilize RDMA in a database architecture,
but from a high level point of view, meaning that if
we're using send and receive,
we have lots, still lots of communication,
we're including the CPUs.
If we're doing writes, we get better throughput
because only a single CPU is utilized.
If we're doing local, we can get even higher.
So questions so far?
Then we'll do a quick break here,
and then we'll look into RDMA-based joins next.
So for the last part,
and it's probably not going to take all of the time,
I want to talk about RDMA-based joins,
and we're going to revisit the infamous radix join
and see how we can basically optimize this,
or how we can implement this through RDMA.
So first, some general good practices.
All right. So, in general, the cost of memory region registration
is basically proportional to the number of registered pages.
And just registering and deregistering is costly,
so we don't really want this.
We want to have a fairly stable set of registered memory pages,
because otherwise this will take a lot of time,
because this needs coordination, right?
And we also don't want to pin large parts of the memory,
because then we cannot use it for something else.
So what typically is done is that we assign a certain portion of the memory as a buffer pool for this RDMA communication. And then we need some efficient buffer management for that, in order to be able to reuse the existing registered pages.
And if we want an initial implementation of disaggregated memory, we would use exactly this. So if you remember disaggregation using CXL,
you can emulate a similar idea
using RDMA by basically
having RDMA regions on each of the servers
and making them available to the other servers.
So with this, you get the same kind of disaggregation because I can utilize this memory through
RDMA from other servers.
I can expand the amount of memory.
I can do this kind of pooling, but I need to explicitly implement this.
And again, I need to reuse the memory.
I need some buffer management to make this efficient.
And the big difference to CXL in this case
is that I don't have any coherence, right?
So all of the coherence mechanisms
I need to implement myself if I want to have this.
In order to be fast, RDMA also requires asynchronous communication.
If I do it synchronously, it will be slow, because basically I'm waiting for a couple of round trips while the communication is happening.
The NIC can do most of the work by itself, but it needs some time to do this.
And so during this time, the CPU, or your application, should
do something else.
So meaning in a database system, I want to ensure that I'm issuing my communication at
a point in time where I still have other things to do.
And I overlap my computation with the communication.
So I might stage this in one way
if I definitely have to communicate.
I might communicate some stuff
and then work with the data while communicating more stuff.
So then I can overlap communication and computation.
I don't have to wait all the time.
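A minimal sketch of this interleaving pattern, under the same assumptions as the earlier verbs snippets (the two process_* functions are hypothetical application hooks, not from the lecture):

```c
/* Overlap communication with computation: issue an RDMA read, work on
 * local data while the NIC copies, then poll for the completion. */
#include <infiniband/verbs.h>
#include <stdint.h>

/* Hypothetical application hooks, assumed to exist elsewhere. */
void process_local_partition(void);
void process_remote_partition(void *buf, uint32_t len);

void process_with_overlap(struct ibv_qp *qp, struct ibv_cq *cq,
                          struct ibv_mr *mr, void *remote_copy, uint32_t len,
                          uint64_t remote_addr, uint32_t rkey) {
    struct ibv_sge sge = { .addr = (uintptr_t)remote_copy,
                           .length = len, .lkey = mr->lkey };
    struct ibv_send_wr wr = { .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_RDMA_READ,
                              .send_flags = IBV_SEND_SIGNALED };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    struct ibv_send_wr *bad;
    ibv_post_send(qp, &wr, &bad);        /* 1. kick off the remote read */

    process_local_partition();           /* 2. useful local work meanwhile */

    struct ibv_wc wc;                    /* 3. only now wait for the read */
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    process_remote_partition(remote_copy, len);
}
```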
And of course, accessing remote memory
is slower than local memory.
So we're talking about order, like in the order
of a microsecond here.
So we saw this in the latencies earlier.
So it might be sub-microsecond, still in the nanoseconds, but this is the kind of
order that we're talking about.
It's slower than CXL; with CXL we're talking in the hundreds of nanoseconds.
Here we're in the order of a microsecond, at least in the numbers that I know of.
If we're in local memory, we're in the tens of nanoseconds,
or 100 nanoseconds, something like this.
So again, an order of magnitude faster than what RDMA
is capable of doing.
So this means it's definitely slower.
And hopefully, locally, we also have our caches.
This is not cache coherent.
So this is not in our caching domain.
So this needs explicit work here.
So we need to hide the network latency
by interleaving computation and communication.
And this essentially gives us these kinds of NUMA effects.
Because we're creating these buffers, these buffers are somewhere in memory.
And only the threads that are in the same NUMA region will
have this local access, right, so they will not exhibit any NUMA effects there.
And in general this might get even more complicated if I have a very
complicated setup in my machine. So I might have my
NIC connected to one
memory controller,
but if I register the memory
on another socket,
then I still need to go through the
internal fabric, and this might
eventually become more of a bottleneck,
or at least the communication back and
forth might become a bottleneck, because I might have the thread in the same place as the NIC but
the memory somewhere else. So I need to do both passes, which basically adds some latency or
decreases the bandwidth. And if we're talking about 50 gigabytes per second, then eventually
the UPI or whatever interconnect, at least with some other
cross communication going on, might all of a sudden become the bottleneck. So make sure that the buffers that you're
using are actually local to your threads, so you don't get into any kind of problems with NUMA.
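A minimal sketch of keeping the RDMA buffer on the NIC's socket, assuming libnuma and that the NIC's NUMA node has been looked up beforehand (on Linux it can typically be read from /sys/class/infiniband/<device>/device/numa_node); the function name is my own:

```c
/* Allocate the RDMA buffer on the NUMA node the NIC is attached to,
 * then register it; link with -lnuma and -libverbs. */
#include <infiniband/verbs.h>
#include <numa.h>
#include <stddef.h>

struct ibv_mr *register_local_to_nic(struct ibv_pd *pd, size_t size, int nic_numa_node) {
    /* Place the pages on the NIC's socket so neither the NIC nor the
     * worker threads pinned there have to cross the inter-socket fabric. */
    void *buf = numa_alloc_onnode(size, nic_numa_node);
    return ibv_reg_mr(pd, buf, size,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```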
So, I mean, again, if we want to have the best performance, we just really need to be aware of the overall hardware topology that we're dealing with. So if we have many cores, if we have many sockets,
we have different kind of architecture,
then we need to deal with these things.
OK, so let's take a step back.
So who remembers the radix join?
What does the radix join look like?
You were rolling your eyes when I said radix join.
Not again a radix join?
No, I misinterpreted that.
So what's a radix join? It's a join. First step.
I don't remember it completely correctly, but
I think it was similar to a hash join
where we join based on a certain
number of bits. So that's already
the first keyword that I wanted to hear.
It's a hash join. Yes, it's not only
similar to a hash join. It is a form of a hash
join. Yes. So that's
already very good.
And,
at least typically, it's a hash join.
And we're trying to partition the data in this join.
It's a partitioned join.
So rather than having like just a single hash table,
we're splitting our hash table up in individual partitions.
And we do this by what?
Do you remember?
How do we build these partitions?
Based on the certain number of bits from the hash?
Exactly.
Yes, so we're hashing the values, and then we're looking at the bits.
Typically, most significant bits could also be least significant bits,
depends on the implementation.
And with these bits, we can directly influence
or very clearly influence the number of partitions
because basically every bit gives us two partitions.
So n bits gives us two to the power of n partitions. And with this,
we can now say exactly, okay, this is the number of partitions that we want.
If the hash function were perfect, which it's not going to be,
then with the hash function we would get a neat distribution of hash values.
With the radix bits, we would then get a perfect distribution across these individual hash tables or individual partitions.
And then we would have a perfect way to just simply distributing data across multiple threads, across multiple machines, et cetera.
So this is the basic idea, so that the Radix partitioning
makes it easy for us to partition data
and to control the number of partitions.
And with the hash function, we hope
that we have a very nice and clean and smooth partition
because that's not necessarily always the case. The hash function is never perfect, right, so
we never get a completely uniform distribution; we hope to get something in that
order, but because of that we don't know exactly how well it partitions. So we'll do a
histogram first. We're first going to check how this actually distributes, so
how small are my partitions going to be. First I know how many items I have, then I apply
the hash function, and then I can check basically
how many radix bits I need to actually get partitions of a size that would be good
for me. Typically, I would align this with cache sizes, right, or cache line sizes,
or memory pages, et cetera, to just make sure that on each level I get
the best performance that I can have and the cache efficiency is good.
And the same thing I, of course, will also want to have in RDMA, right?
So I basically compute the histogram to figure out how does my data distribute.
Then I do the radix partitioning based on the bits. So again, I'm basically
computing a hash key and then I take a certain number of bits to identify the individual partitions.
So that certain number of bits basically says: this is the hash table that this value goes into.
All of the hash values that start
with the same bits basically go into the same hash table.
And then within the hash table, the rest of the hash key
will basically identify where it goes in the hash table.
Because of that, it also makes sense in many cases to start with the most significant
bits, because I'd rather have some partitions running empty than have everything within my partition
going into the same bucket in the hash table, right? So I want to have a good hash table, and I can deal with some imbalance across the partitions maybe more easily.
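To make the radix idea concrete, here is a minimal sketch (my own illustration, with an arbitrary mix hash and 10 radix bits) of computing the partition number from the most significant hash bits and building the histogram first:

```c
/* Radix histogram sketch: hash each key and use the top RADIX_BITS bits
 * of the hash as the partition number, counting partition sizes first. */
#include <stdint.h>
#include <stddef.h>

#define RADIX_BITS 10                        /* 2^10 = 1024 partitions */
#define NUM_PARTITIONS (1u << RADIX_BITS)

static uint64_t hash64(uint64_t k) {         /* simple mix hash (illustrative) */
    k ^= k >> 33; k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33; k *= 0xc4ceb9fe1a85ec53ULL;
    return k ^ (k >> 33);
}

static inline uint32_t partition_of(uint64_t key) {
    return (uint32_t)(hash64(key) >> (64 - RADIX_BITS));  /* most significant bits */
}

void build_histogram(const uint64_t *keys, size_t n, size_t hist[NUM_PARTITIONS]) {
    for (size_t p = 0; p < NUM_PARTITIONS; p++) hist[p] = 0;
    for (size_t i = 0; i < n; i++)
        hist[partition_of(keys[i])]++;       /* partition sizes before moving data */
}
```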
Okay, so let's look at the...
And then within each partition, I mean, just to finish it up,
within each partition, I just do the classical basic hash join.
So for both sides, I have partitioned both sides,
both tables in the same way.
With one table, I'm gonna do a build.
I build a hash table with the other table,
I'm gonna probe the same hash table.
Okay, so if I'm doing this,
like an RDMA version of the Radix join,
I basically start with the histogram computation phase. So the histogram is really just to
identify how well this will distribute, right? How many bits do I need, and how do I then partition in the end?
So within a machine, I basically assume the data
is distributed across all machines, right?
So this is the assumption.
Otherwise, I first, of course, need to distribute the data across the machines.
But initially, it's not partitioned in the way that we need it.
So say I have a horizontal partitioning,
but not on the join key.
This would be a typical way when I want to do this.
Then within each of the nodes,
I do a histogram with all of the threads.
Then we exchange the histograms into a machine level histogram
in order to have an overview of what is the data
that I have on a single node.
And these then can be exchanged.
So in order to have a global histogram,
in order to get an idea of the partitioning sizes.
Meaning if I use these many bits,
what will the partitions look like?
So if I have say 10 bits, meaning a thousand partitions,
how large will these partitions be?
Like each of these partitions,
this will be my histogram.
And based on this, then I also can assign the partitions to individual machines, right? So
then I can basically say how large the buffers need to be and where I want to copy
data back and forth. So with this information, I can already set up the buffers across the network.
It basically informs how the data distribution
then needs to be done.
Then in the partitioning phase,
I have two partitioning passes.
I have a local partitioning pass
where I want to make sure that the partitions fit into caches.
But first, I have the networking partition pass,
where I make sure that each partition or the partitions
go to the right node.
So this is basically where I partition the data such
that for each Radix partition,
I have the data in there.
It's more or less like a shuffle phase.
And so in the network partitioning pass,
I want to have a pool of RDMA buffers.
I have multiple buffers, and with this I
can basically fill one buffer locally and send it while I'm filling the next buffer. So this
also gives me this kind of parallelism, and I can of course also have multiple queues. Using this, I
can make sure that the NIC gets multiple things to do, I can
send to multiple nodes at the same time, et cetera, to just make sure that I fully
saturate the network and I fully saturate the NIC. And the buffers are private, so
we don't need to do any synchronization. Then we have the build and probe.
And locally, we additionally partition so that things fit into caches.
That's already what I said.
After the partitioning, all of the partitions that belong together,
so with the same partitioning key, should be on the same node.
So then basically all of the data is local and we can do the complete join locally as
we would have done earlier.
So we're also going to split this up into cache-sized chunks, so then it's even going to be
thread local.
The individual partitions are so small
that we want to have this in the individual threads,
at least in the build phase.
So we want to be able to have the hash tables ideally
cache resident.
And then we can just probe with the also potentially larger probe side,
but still without these random memory accesses
that we would need if we have a memory resident hash table
or even network resident hash table.
Yeah, and then the matching results
go either into a local buffer, or we can also write them
to an RDMA buffer in order to have one large output.
And finally, either we reuse this or we keep it local;
depending on whether we want to further continue locally or whether we want to have one large output,
we might send the result out again through the network.
And there's a nice paper on this;
there are more details, and also
the implementation is available somewhere.
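Conceptually, the buffer-pool rotation in that network partitioning pass might look roughly like this (my own sketch, not the paper's code; completion handling before buffer reuse is only hinted at):

```c
/* Fill one registered buffer per target node, ship it asynchronously, and
 * keep filling the next buffer from the pool while the NIC transmits. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define POOL_SIZE 4
#define BUF_BYTES (1 << 20)

struct partition_sender {
    struct ibv_qp *qp;
    struct ibv_mr *mr;                 /* covers all pool buffers */
    char          *pool[POOL_SIZE];    /* registered, pinned buffers */
    size_t         fill;               /* bytes used in the current buffer */
    int            cur;                /* index of the buffer being filled */
};

static void flush_buffer(struct partition_sender *s) {
    struct ibv_sge sge = { .addr = (uintptr_t)s->pool[s->cur],
                           .length = (uint32_t)s->fill, .lkey = s->mr->lkey };
    struct ibv_send_wr wr = { .wr_id = (uint64_t)s->cur, .sg_list = &sge,
                              .num_sge = 1, .opcode = IBV_WR_SEND,
                              .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad;
    ibv_post_send(s->qp, &wr, &bad);          /* NIC ships it asynchronously */
    s->cur  = (s->cur + 1) % POOL_SIZE;       /* rotate; CPU keeps partitioning */
    s->fill = 0;
    /* A real implementation would check the completion queue before
     * reusing a buffer whose previous send may still be in flight. */
}

void append_tuple(struct partition_sender *s, const void *tuple, size_t len) {
    if (s->fill + len > BUF_BYTES)
        flush_buffer(s);
    memcpy(s->pool[s->cur] + s->fill, tuple, len);
    s->fill += len;
}
```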
And if we look at this
from a performance point of view,
then we are using an FDR or QDR network.
So if you remember,
let me check so I don't lie here:
FDR, I think, was the 54 gigabit,
QDR was the 32 gigabit network.
Yeah, exactly, QDR 32, FDR 54 gigabit per second. So if you look at this, then you can see the message sizes, this is in bytes. We need to be in kilobyte-sized messages in order to fully utilize the network and not basically
be in just a message creation mode, et cetera,
in the algorithm all the time.
And so in the kilobyte message size
here, or hundreds of bytes to kilobytes,
this is where we fully saturate the network. So here you see something which would be 48 gigabits, round about.
So we're close to what the 54 gigabit FDR network can actually do.
So with this, we're actually saturating the network nicely.
Same here, right, for the QDR.
So FDR and QDR network.
But this, of course, means that we need to chunk this properly.
So looking at a bit more breakdowns,
if we use TCP with IP over InfiniBand,
then we can see, yeah, I mean,
basically, like the network partitioning
is what really costs us, right?
The local partitioning is always the same amount of work.
In any kind of implementation, this is all local.
We don't need any networking.
And here, we have four machines with 32 CPUs and two times
two billion tuples basically being joined. We can see that the histogram computation
is actually fairly little work, right? We just need to go locally through this.
We need to communicate the histograms, not a lot of data. So that's actually cheap.
The network partitioning, that's the expensive part.
Local partitioning is not that much, but it takes some time.
And then the actual build and probe phase,
that's actually very little.
So that's the smallest part, except for the histogram
computation that needs to be done.
So that's the actual join, right?
That's all local, all cache resident,
and that's why this is super fast.
If we use non-interleaved communication,
but InfiniBand, then basically we can already scale this down
by a factor of 3, round about, in terms of just the
network partitioning. But you see, these parts are constant, right? So
that's why we get a significant performance improvement, but we don't get the
full speedup that we would expect from the InfiniBand network, because this part is just static.
This is something where the network doesn't help us.
If we interleave network
communication with computation, we can get another
20% improvement. So the network and
local partitioning, in the end, are still
the most expensive part. This is basically this shuffle phase, you can say.
So we're basically partitioning the data, sorting the data you could say, in a way such that everything can be easily computed locally.
Looking at this from a scale-out perspective,
so going up to 10 nodes.
So here, you remember we were here.
From two nodes, we're scaling nicely to four nodes.
And it's not a factor of two, but it's still a lot.
But as you can see, this part just stays
more or less constant.
Well, actually that's not quite true, right?
The local parts actually should decrease
because we get more aggregate throughput,
but the communication does not decrease as much, because we still
need to communicate across all nodes, right? If we have more nodes, we need more communication; we need to send more around.
So this doesn't scale perfectly.
It doesn't scale linearly.
So when from going from two to 10 nodes,
we only get a factor of close to three
because the network partitioning phase actually still
has a significant part.
While the other parts are decreasing
with the number of nodes, it's also not perfectly decreasing.
But also, the smaller or the more nodes we have,
the larger the fraction of data that needs to be sent around.
Eventually, almost all of the data locally
needs to be sent somewhere else.
So if I only have two nodes, I mean, worst case,
or not, average case, let's say I have two nodes
in the average case, I'm going to send around half of the data.
If I have three nodes, I'm going to send around two-thirds
of the data. If I have ten nodes, I'm going to send around nine-tenths
of the data. Eventually, I'm basically
always sending around all of the data because of this partitioning, because of the randomness
of the partitioning.
And this means like some additional congestion across the network also.
Okay, but still we can get some improvement. But you can see it doesn't really pay off to do this at
very large scale eventually, because the speedups are not getting
very significant anymore. So if we go to 100 nodes, we're probably not going to see a factor of 30. But I don't know. One would have to try.
Okay, with this, that's actually the end of networking today. So last time I told you a bit
about the parallel database management systems. We came back to this to some degree with this NAM-DB setup, comparing it to a shared-nothing setup.
And we talked a bit about parallel programming and then the fast networks
with low latencies, which are the basis for doing RDMA types of communication. And today we really talked about RDMA, so how RDMA works internally
and, on a high level, how you can use it in database setups, right? So basically as this kind of
remote memory or shared memory setup, like some kind of disaggregation of memory from compute.
And then even with a specialized join,
so which parts of a radix join I would put where,
and where I would use
the distributed communication in which way.
With that, questions so far?
No? Wasn't that hard, right?
So next time it's not going to be
GPU. I didn't fix
that, but it's going to be FPGA.
So next time
we're going to talk about
FPGAs and
we'll do this
tomorrow and the week
after tomorrow.
And then that's all.
So thank you so much and see you tomorrow.