Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Field Programmable Gate Arrays
Episode Date: July 11, 2023...
Transcript
Today, later on, probably, we're going to talk about FPGAs.
But before we do that, I want to just give you an update on the timeline.
And before we also do that, we're going to talk about RDMA.
So we're in the third to last week in this semester and in this course.
So today we're going to talk mainly probably about RDMA.
Then tomorrow FPGA.
Maybe I'll be able to finish up FPGA tomorrow.
I don't fully know yet.
Unfortunately, the external speaker that I invited declined.
And so we're not going to have an industry speaker this time.
But this gives us a bit of time to finish up everything smoothly.
So we're not going to be rushed.
And we can have a Q&A session here if need be.
So we're going to have the task.
Then finally, CXL and the Summary and Data Center
tour, if you want that.
In the last week, unfortunately, I cannot be here.
CXL, anyway, Marcel will do.
So that's not a big issue.
The summary, I will ask Lawrence to give you an overview again and do the tour with you, for those of you who are interested.
And then we'll see. So for me, basically, this is where we're going to end; from my end we're going to end here. I mean, if there's time, I can even do a quick summary here already and then leave the rest to him to summarize by himself, basically. So then, if you're lucky, we're just going to have the data center tour here.
Okay, so going back to networking, where we finished off, or did not fully finish, last time. Let me find my cursor here.
So we talked about networking, right? I gave you a bit of an overview of networking and parallelism last time, so how to program and make databases at least to some degree parallel across networks, so across multiple machines, and we talked about rack-scale computing.
And now we're going to go into more detail about networking.
And this is also going to be the rest of this part of the lecture.
So most likely I'll be able to finish this up before the end
and then we switch to FPGA,
but the rest is basically all about fast networks
and how to use them.
And so far, and if you've seen big data systems, one of the major mantras in distributed computing is basically to communicate as little as possible.
Try to do everything locally.
And I mean, this is also true in modern hardware and for networking in general.
I mean, we've seen this on the chip, right?
So we want to do something NUMA local, we want to be NUMA aware, we want to
have everything in the caches, etc. So locality first is still the mantra. And this is, of course,
also true for networking. But networking used to be the thing that's really a no-no, right? Basically the slowest link in the whole system.
So if you have like a distributed system, even if you have to go to disk,
in the past, networking would be slower than going to the disk, for example.
So the idea was that you always try
to minimize the network costs
by using certain partitioning schemes.
So you co-partition the data
so that the data can be used locally.
You use scheduling algorithms to ship the computation
to the data to make sure whatever you need to do
is in the right place.
And you do, again, the computation locally,
and you try to restrict remote operations
as much as possible.
So you just, in general, avoid network traffic.
And I mean, this is basically how systems were designed for a long time, but that also means that very complex schemas or workloads are not really well supported in this kind of setup, or ad hoc workloads where you really don't know what the partitioning will look like.
So you will find a lot of research
where people basically try to find
very good scheduling algorithms or partitioning schemes based on previously seen workloads.
But if the workloads change, basically you have to start from scratch again. And then
you can try to somehow replicate, etc., but it will be costly in a way. However, today we have faster networks. I mean, networks like InfiniBand have always been around, but they're increasingly becoming affordable, and with this, as with any change in the hardware, we can start to rethink the database architecture.
And so what is InfiniBand? InfiniBand is one type of networking infrastructure. So it's basically
an alternative to Ethernet. And here we have much higher bandwidths than in Ethernet typically. I mean, while we can also get faster Ethernet bandwidth these days,
InfiniBand usually is faster than Ethernet.
So you can basically get FDR, EDR, nowadays HDR InfiniBand, which gives you bandwidths in the tens of gigabytes per second. There we're basically close to memory bandwidth, so that means we're all of a sudden in the same ballpark as the memory, and that means we're probably faster than disk in many cases. However, as we already know,
right, you've seen like in the storage lecture, we've also talked about basically getting ever
faster storage. So there, if you get very fast storage, again,
you're going to be in the same ballpark.
And then we're all of a sudden always limited basically
by the PCI Express interconnect.
But the interesting thing is with RDMA,
we can basically saturate that.
So we can basically get networking as fast as PCI Express can give us
with enough channels. I mean we can have single or multiple channels meaning multiple connections
to the same server. And so it's really hardware and low-level protocols. In InfiniBand we have channel adapters, which are basically the network cards and the end nodes in the subnet, and you have the switches, of course. The network cards can have single or multiple ports for multiple connections, and they create and consume packets. And one machine, even if, like this one you can see, it has two ports, you can still plug multiple of those cards into a machine to get more bandwidth.
And in there we have a direct memory access engine that allows direct remote memory access.
This is kind of the game changer.
We've talked about direct memory access before and I think I've hinted to that already.
Let me maybe get a pointer. But basically already on the card we have something where we can then directly interact
with the memory.
And yeah, I said one or two ports.
And then we have the links.
And InfiniBand has three link widths, meaning the number of connections: one, four times, or twelve times. And each of the links then has four wires, one for receiving and one for transmission, or for sending, basically. So we have the transmitting signal and the receiving signal, and each of those is basically a pair so we don't get too much interference. So this is why we need four wires in total for a single link. And there are different media. The classical one would be a copper cable, but then we also have optical cables, where, I actually don't know, but I'm assuming we only need a single connection per transmitting and receiving signal, so meaning one out and one return. And then we could also have printed circuits.
And these cables usually are quite thick.
So if you go to the data center and you
look at the network cards, you will see that there's like the regular ethernet cables that
you know, right? These wobbly small things. And then you will find the InfiniBand cables. And
these are thick cables that usually somehow weirdly hang out of the cages because they cannot really bend very well.
Let me see, maybe I can even show you. Right, so here you can see this is basically the type of
cable that you would have. So there, I mean, you can imagine they're basically quite thick and not
easy to bend.
This is also one thing that people always complain about if they have to set this up.
And because they're thick and heavy, they're also expensive.
But still much more affordable than in the past.
And so now using this InfiniBand technology,
there's two ways to use it.
And like we can use it just like we would use ethernet
using IP, right?
And that's called IP over InfiniBand or IPOIB.
So this is usually what you will see.
And that means we have the regular TCP/IP stack over InfiniBand to just do regular packet transmission and sending. So you have your regular sockets that you would use, and you can use this out of the box in your system. The downside is, if you do this, it will just be as efficient, or inefficient, as IP regularly is, because if you have the whole TCP/IP stack, everything goes through your CPU. You have layers of protocols, and these have to be computed and sorted out by the CPU, typically.
And also because of the different protocols,
you will have copying, right?
So you copy your data from one packet into the next and into the other protocol, and that usually means a lot of CPU overhead. However, with InfiniBand you also have remote direct memory access, and today you also have this with converged Ethernet, so there's something called remote direct memory access over Converged Ethernet, or RoCE, where you can also do this.
But this is usually the way you want to use InfiniBand mainly.
So this means we're not working with the OS at all.
We're not copying data around.
We're just directly basically sending and receiving data
through the network using RDMA verbs.
And there's one-sided or two-sided verbs. One-sided verbs means
one side, meaning one CPU will initiate everything and the other side doesn't have to do anything.
So we're basically just defining memory regions on the different servers and then in the one-sided
communication, the sending or receiving server will just initiate the communication
and will go to the remote memory location, but the other server doesn't have to deal with it.
So there then the NIC, the network interface card, will do all the communication or will do everything in the memory itself.
So it gets a reserved region in memory on the remote server
that it can interact with.
And then we also have the two-sided communication.
This is basically send-receive,
where you would do regular packet connections.
You still want to do this asynchronously,
but at least both sides are aware of the communication. So we're basically, again, building queues on both sides saying,
hey, I'm going to send you this packet,
and then the other side has this basically in their receiving queue.
In the one-sided communication,
we don't have a queue on the other side.
We're just basically touching the other memory directly.
And then the cool thing about this is that the actual communication or the actual processing
is not done by the CPU anymore, but this is done by the NIC, so by the network interface card. Basically, if we go back here, this thing here
will basically directly interact with memory,
read the data from memory, copy the data into memory,
and then whatever it gets, send it through the link
into the other server.
And so the cool thing about this, or why this actually helps us in processing and why we can get much faster networking, is that we're bypassing the CPU. So we're not using the CPU, or at least not completely.
So we have to basically initiate communication somewhere, right? So we will basically say on the one side of the communication,
if we're doing one-sided communication,
the server that initiates something will have to do something,
will have to initiate the communication.
But then everything else, right?
All of the networking, all of the copying, etc.,
this will be done by the network interface card.
And the CPU doesn't have to do anything.
Especially the remote CPU won't have to do anything.
We're bypassing the operating system, meaning we don't have interrupts,
we have no context switching, we have no copying.
So this is also, we're not copying the data on the memory bus.
We're not copying data around into different kind of network packages, etc.
And we have this kind of message-based transaction, right?
So we're sending messages back and forth.
In order to have this efficiently, we need to do this asynchronously.
Because the CPU is not really involved in the communication,
the CPU shouldn't wait for each individual interaction.
Because otherwise we're back in the synchronous operation
and we're not really utilizing this additional computation and networking capabilities of the network interface card.
Because then we'll just talk with the card, the card will do something, the CPU will wait, etc.
So we want to have this completely or as much as possible asynchronously.
So there's different ways of doing this. I already said we can have the TCP IP stack,
or we can directly go to RDMA.
So here we basically see the classical TCP IP stack.
So we would have like a socket layer,
then we have TCP or we have UDP.
Underneath we have IP, then we have Ethernet, we have an adapter driver, and then we would actually have the network card, right? And all of this, I mean at least all of this up here, would still be in the kernel, and to some degree maybe in the user space.
And each individual layer here means copying data, right?
From one protocol into the next,
from one packet type into the next.
We're basically packing packets
and we have to split them up depending on the protocol.
And only then we can actually
send it.
And otherwise we can also use different kinds of libraries. So uverbs would be a library which has some kernel module to use with it, or we can directly use this in user space, similarly to how we would interact with an NVMe with SPDK, if you remember.
So in this way, basically we were saving all of these individual copy operations and we're
directly talking to the network interface card.
Okay, so here's another comparison of the two, right? So if we do TCP/IP, then we have our application, which is in user space and which has some data. If we go through InfiniBand, for example, we would still use the NIC with the direct memory access methods, so we would have a direct memory access buffer. But if we're requesting some data, for example, then the CPU will first have to copy this, the kernel will again have to translate it for the network interface card, then we have the actual communication, and the same thing happens again on the other side.
Right?
So on every layer of the stack, we basically have additional communication.
We have additional copies.
If we're using InfiniBand, all of this is basically done by the network interface card.
Right? So we're directly talking to the network interface card.
Of course, we still have to process our data
in a way to fit into the RDMA verbs
or let's say the messages for now.
But that's all we have to do.
Then we basically say this is in this kind of memory region, and from there on the network interface card will do the rest for us, right? So the network interface card will directly connect to the memory, do the transmission, and on the other side go to the right memory region, read or write the data, and send it back. And in the one-sided operations, we really only have one of the CPUs working on this, right? The other CPU doesn't have to do anything at all. We need some setup, but after that we can basically just do everything through the NIC here, on the actual hardware, directly.
And so we can see, I mean, we need a certain packet size in order to saturate the bandwidth.
So here are some measurements by Carsten Binnig et al., where they basically measured how we can use RDMA.
And so on this side you can see the throughput,
on the other side the latency.
And you can see this log scale, right?
So these differences are huge; this difference is two orders of magnitude, basically, right? The difference in speed that we can see. So here we have classical IP over Ethernet, and of course, if we're here with one gigabit Ethernet, then this is basically what we typically would see, around 100 megabytes per second.
And here, and actually I don't remember
what kind of InfiniBand they used,
probably EDR, so not HDR,
so not the fastest one available yet,
but already quite fast.
But we can see, if we're using IP over InfiniBand, we can get a higher throughput than we would have with Ethernet. Significantly higher, but we're still going to be slower than using RDMA directly.
And we need larger packets in order to,
or we need more data that we send in one shot
in order to get this higher throughput, right?
So Ethernet is basically already maxed out at, I don't remember exactly, 128 bytes or something like this. And for InfiniBand, we need message sizes of something like two kilobytes in order to saturate the bandwidth.
And for the latency, we can see that, well, for IP over Ethernet,
the latency is in the 50 microseconds, something like that.
And here with RDMA, we can get into the one microsecond range.
So much, much faster.
And we kind of keep a constant latency, more or less.
Again, this is also log scale. So even a slight bend here already means a significant increase, though not that much. Up to two kilobytes, we're basically more or less stable and we get quite nice latency here, much lower again than if we're using IP over InfiniBand, because with IP we again have all the copying, and this is what we see here, right? While we're still getting a good throughput, not as high as directly using RDMA, we're getting much worse latency because of all this copying; this is basically what makes it expensive here. So with RDMA you get much faster throughput and you get much better latency.
And the latency, of course, this is not in the range of accessing memory,
but we're not that far away anymore.
So we're somewhere where we can say,
okay, we can actually touch remote memory in a reasonable amount of time.
So if we're talking about the memory hierarchy, like expanding the memory hierarchy with remote memory, then we can see something. Of course, we have the registers, then we have the different caches, then we have local memory, and then we're in the microseconds with remote memory. And finally, after that is basically where we would be with disk, usually. Again, depending on the type of SSD and NVMe, there might be a constant battle about this lower ground, but if we're talking about traditional hard disk drives or regular SSDs, then with InfiniBand for sure we're going to be somewhere here. And that basically means, all of a sudden, for certain applications, or whenever you need more memory for your applications, it might make sense to rather go to remote memory than to stay on the same machine and use disk.
And this is, all of a sudden, again a switch in architecture.
Because initially we would have always thought, okay, let's keep everything on our single node, use the disk there, and only if we then have to communicate,
we would talk to another node because the network is so slow.
But here, with remote direct memory access,
the remote memory all of a sudden becomes an option
that might be faster than using your disk locally.
And also, this gives you additional flexibility
in how you use memory.
So in clouds, for example, you will see that systems
or cloud vendors, for example, start setting up the clusters
in a way that you can actually use the memory
across a rack more flexibly.
Because memory often is underutilized on some machines and overutilized on other machines.
And then through RDMA, you can basically
try to use this more flexibly, more as a memory
pool across machines.
And we will learn more about this in the last lecture about CXL,
which is basically another technology which is particularly built for this.
But RDMA is kind of already the direction to go there,
to expand your memory.
Think about the memory in your distributed cluster,
at least rack-scale cluster, as one large memory.
With the big caveat that you don't have cache coherence.
So this is something that you have to deal with yourself.
Okay, so far as kind of an overview, now let's talk about some RDMA basics.
So how do we set up RDMA?
So RDMA works with channels.
And in order to have this zero CPU utilization stuff, right, you need to set up buffers. So basically you're reserving certain memory regions for the network card that the network card can write to and read from. This way, the network card has a safe zone where it can interact with the memory directly, and another network card can also know this is a region it can write to. So another server can say, I will write into this area of the memory directly.
And in order to do that, you basically
have to pin the memory so it cannot be swapped,
because otherwise the network card cannot access it anymore.
You need the address translation in the network interface card
in order to know where to actually write.
Then you need the permissions for the memory region,
so the network card needs to have permissions to write into the memory region.
And you have some remote and local key in order to ensure that this is actually correct.
So this is used by the adapters for the RDMA operations.
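To make the registration part a bit more concrete, here is a minimal sketch in C using the libibverbs API; it assumes a protection domain `pd` has already been allocated with `ibv_alloc_pd` on an opened device, and error handling is left out:

```c
#include <infiniband/verbs.h>
#include <stdlib.h>

/* Minimal sketch: register a buffer as an RDMA memory region.
 * Assumes a protection domain `pd` was already allocated via ibv_alloc_pd()
 * on an opened device; error handling is omitted. */
struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t size)
{
    void *buf = malloc(size);               /* the region the NIC may access */
    struct ibv_mr *mr = ibv_reg_mr(
        pd, buf, size,
        IBV_ACCESS_LOCAL_WRITE |            /* NIC may write into it locally */
        IBV_ACCESS_REMOTE_READ |            /* remote side may read          */
        IBV_ACCESS_REMOTE_WRITE);           /* remote side may write         */
    /* Registration pins the pages and installs the address translation on
     * the NIC. mr->lkey is used in local work requests; mr->rkey is the key
     * the remote side needs for one-sided access. */
    return mr;
}
```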
And then you have three queues.
So for the communication, you don't need all of them all the time, but for complete communication you will have a send, a receive, and a completion queue.
And the work queues, that is, the send and receive queues, are always created as a queue pair. And these are used to schedule the work that needs to be done.
And this is basically what the CPU needs to do or your program needs to do. That basically,
you're writing something into the queue that needs to be sent. And the completion queue
is a queue for information when the work has actually been performed.
Well, I actually have some examples later.
And so what you do in the application
is basically you're issuing work requests
or work queue elements.
And this is basically just a small struct in the send queue,
which is basically a pointer to the message that needs to be sent,
or a pointer to the incoming message where it should be placed later on.
And then in the completion, once you
have the message in the completion queue,
you will know that this has actually been done.
So this will be done by the network interface card.
And this will, again, be enqueued, basically.
And you will try to keep these queues full in order to fully saturate your bandwidth. So here's an overview of the stack. So the
application basically posts the work request to the queue and each work
request is a message or a unit of work, so either a receive or a send.
Then you have this verbs interface.
This is this, basically you can think of it like a library.
So the different kind of interactions,
the different kind of messages that you can send.
Then the adapter driver, the RDMA adapter driver,
maintains the work queues and manages the address
translations and provides this completion mechanism.
So it's basically informing the CPU or the application that some kind of networking has been performed.
And then on the RDMA or the NIC itself you have the transport layer, so you have reliable and
unreliable transmissions, you have the packetization, the RDMA control, the end-to-end reliability, etc. So if your packets get dropped,
etc. And you have the actual delivery.
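As a rough sketch of what posting work looks like with libibverbs, assuming an already connected queue pair `qp`, its completion queue `cq`, and a registered region `mr` (names and the work request id are illustrative):

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

/* Minimal sketch: post one send work request and poll for its completion.
 * Assumes `qp` is an already connected queue pair, `cq` is its completion
 * queue, and `mr` is a registered region holding the message. */
void send_one_message(struct ibv_qp *qp, struct ibv_cq *cq,
                      struct ibv_mr *mr, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,     /* where the message lives        */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,                /* local key of the region        */
    };
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    wr.wr_id      = 42;                    /* opaque id, comes back in the CQE */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;           /* two-sided: peer must post a receive */
    wr.send_flags = IBV_SEND_SIGNALED;     /* ask for a completion entry     */

    ibv_post_send(qp, &wr, &bad_wr);       /* enqueue into the send queue    */

    /* The CPU is free to do other work here; eventually we poll. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                                  /* busy-poll the completion queue */
    /* wc.status == IBV_WC_SUCCESS means the NIC finished the transfer.      */
}
```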
There are multiple different versions of InfiniBand, and there are continuously new versions, right? Right now we're at HDR. With QDR you can basically get 32 gigabit per second; this is already a bit older, from 2008. Then, around 2015, we had EDR, and now you usually get HDR, which would be 200 gigabit per second.
I already mentioned there's also RDMA over Converged Ethernet, or RoCE; you will also see this every now and then, and it's basically doing the same thing over Ethernet. There you can get 10 gigabit per second, 40 gigabit per second, or I think even a hundred gigabit per second already. So using Ethernet it will usually be a bit slower, but we're getting there in terms of speeds. So InfiniBand is always a bit faster, but also always a bit more expensive than RoCE.
And finally, there's also iWARP, the internet Wide Area RDMA Protocol. InfiniBand is usually something within a rack or within a couple of racks, because the cables are actually heavy and the link length is quite limited. So you cannot do it like Ethernet, where you can easily have a 50-meter cable or even more.
And there's different specifications.
And for InfiniBand, these links are usually much shorter.
But with the iWarp, you can also do this over internet-wide connections, basically.
Okay, so RDMA itself is just a mechanism, right? It's just basically how do you connect the memory
but then you have to specify the semantics
of the data transfer in your application.
And we have two types of memory access models in RDMA.
There's one-sided and two-sided, already mentioned this.
In the one-sided, you have read and write
and some kind of atomic operations.
So this basically means only the sender, only one of the CPUs, one of the partners in communication, knows about the communication at all. The other side will not be involved in the communication.
This will all be handled on the network interface card.
And this, for example, is cool if you want to do pooling, memory pooling. If I say I want to use the remote memory here just to enlarge my memory, I can just use the RDMA one-sided communication and the other CPU doesn't have to deal with the memory region at all. It can still access the memory region, of course, with the right permissions, etc., so you can also do more fancy things with one-sided communication. But if you directly want to communicate with another CPU or another part of your application,
then you would use something like send and receive.
So where basically you can say, oh, here I have new information for you,
or please give me this information.
Then you have two-sided communication, which means
both CPUs have to do something, but again, this would be done asynchronously.
Well, send and receive is basically classical or traditional message passing,
and both sides are actively involved in the communication.
So both need to have the queue pairs for send and receive and the completion queue for the
queue pair.
And basically the sender would put a pointer to a buffer that needs to be sent into the send queue, and the receiver would have a pointer to an empty buffer, basically saying, please put the data into this buffer in memory, where then I will be able to read it.
And then, of course, once the transmission has worked out,
then the data from the send buffer will be copied into the receive buffer on the other side.
And the work queue element that was in the send queue will be removed, and there will be a new entry in the completion queue on the sender side. On the receiver side, the work queue element in the receive queue will be removed, and there will be a completion element in the receiver's completion queue.
So this is basically how this would look like.
So if we have an RDMA send, then you have your host memory, you have the registered memory for the RDMA transfer,
you have a certain buffer that you want to transfer.
Then what you do is basically you put something on the sender side in your send queue, which is a pointer to the memory region.
On the other side, you have also registered memory with a buffer for receiving the information.
You will have a work queue element on the receiving queue.
And then you have the direct transfer. So from here on the CPU doesn't have to do anything anymore.
So the CPU basically just put the elements into the queues
and assign the memory region.
Then the network interface card will just directly copy this over
because using RDMA or the direct memory access,
the network interface card can just copy information into the memory and copy information from the memory.
So we'll basically copy the data into the network interface card, send it across using InfiniBand magic,
copy it from the network interface card into
the memory and once this is done, it will go basically on both sides in the completion
queue.
And then both systems basically know that the transmission has worked out, so system A can reuse the buffer for something else, and system B knows, oh, I can read the information that I wanted, or that was sent to me, from the buffer here.
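A corresponding sketch of the receiver side, again with libibverbs and assuming a connected queue pair and a registered buffer (names are illustrative):

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Minimal sketch of the receiver side of a two-sided transfer: post a work
 * queue element that points to an empty, registered buffer. Assumes `qp`
 * is a connected queue pair and `mr` a registered memory region. */
void post_receive_buffer(struct ibv_qp *qp, struct ibv_mr *mr)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,     /* empty buffer the NIC will fill */
        .length = (uint32_t)mr->length,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = {0}, *bad_wr = NULL;
    wr.wr_id   = 7;                        /* shows up in the completion entry */
    wr.sg_list = &sge;
    wr.num_sge = 1;

    /* An incoming SEND from the other side is matched against this element;
     * the NIC writes the payload directly into the buffer and then places a
     * completion element into the receiver's completion queue. */
    ibv_post_recv(qp, &wr, &bad_wr);
}
```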
However, there is also read and write, so the one-sided communication, where only one side is active, the sender for a write, or the receiver for a read. On the passive side, there's basically no operation that needs to be done, so we don't have any queues where we add completion or send or receive elements; this will only be done on the active side.
So the passive side doesn't do anything.
And also the passive side doesn't get any information that
something has happened, right?
So the passive side doesn't know if there was a data transfer or not.
So basically all of a sudden, something
might pop up in the registered memory region
that the CPU could use, or the passive side
could do something with, or could not do something with.
So it's basically completely unaware of what's going on.
Yes?
Are there any guarantees towards atomicity, so that when both try to write to the same address, there's not half of one and half of the other written?
So yes.
So basically, initially, the buffers
need to be registered properly.
And then if the buffer is registered, basically,
for the transfer, then only one can basically access this.
So you would have, but actually I don't know. I mean, if you could set up a setting where you were mixing up the memory in one way or the other using the one-sided communication, I assume so. So I guess you can break stuff here.
But I don't know the details. There are dedicated atomic operations where you can say, please do a compare-and-swap, something like this, which you can also use for coordinating these transfers then. But presumably, you can also break something here.
So here, basically, to do this read or write, you need the remote side's virtual memory address and memory registration key. And I would assume, and again I'm not too deep into the details, but I would assume that you can basically say, this buffer is now used for writing something, so I'm not going to write into this location. I'm not sure if you cannot break it, so maybe you can.
So the active side basically needs the address and the key, and for this you actually do some kind of communication, right? Initially, in order to set up the buffers, you do some communication, so the receiver side is actually aware something is going to happen. But then, for the actual transfer, there's nothing, no indication, on the passive side.
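A minimal sketch of such a one-sided write with libibverbs, assuming the remote virtual address and rkey were exchanged once during setup:

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

/* Minimal sketch of a one-sided RDMA WRITE. Assumes the remote side's
 * virtual address `remote_addr` and registration key `rkey` were exchanged
 * once during setup; after that, the passive side's CPU is not involved. */
void rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                uint64_t remote_addr, uint32_t rkey, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_mr->addr,
        .length = (uint32_t)len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided: no receive needed */
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;        /* where to write remotely */
    wr.wr.rdma.rkey        = rkey;               /* permission to do so     */

    ibv_post_send(qp, &wr, &bad_wr);
    /* The completion shows up only in the active side's completion queue;
     * the passive side gets no notification at all. */
}
```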
So, in order to set this up, well, I'm not going to walk you through the details. This is more for your information if you want to look at it later on; you can find it in the slides. There are basically the different operations in order to set these things up: you will initialize InfiniBand, then you set up the connections, you set up the buffers, etc., and then you can do the actual writes.
So here are the different kind of commands that you can use
and like a link to a simple example if you want to try it out.
And if you want to try it out, a lot of the servers in our data lab
have InfiniBand connections.
So there you can actually play around with InfiniBand.
I think both of the clusters, actually all of them are connected through InfiniBand,
so you can do with 16 nodes different kinds of experiments there.
Okay, so let's get started again and see how, for example, we can use this in a database
system.
And of course, there's many different ways. On the one hand, we can just drop it in instead of regular Ethernet.
But using this one-sided communication, etc.,
there's lots of interesting research on how we can utilize this
to make certain algorithms
and a certain type of data processing faster. Okay, so traditionally, well, there are different kinds of architectures for distributed processing, but the one that I showed you earlier, the shared-nothing architecture, is used a lot. And
that means we're partitioning the data across all of the nodes in the network and we're trying to
keep all of the data access local and this means also each of the nodes only
has access to the local data and as soon as we want to have the distributed data, we need some kind of coordination and data flow
for the communication with other nodes.
This is usually done with socket connections,
so kind of send and receive communication.
And here, as we already said, we want to have data locality as much as possible.
So we're keeping everything in the local memory, everything on local disks.
And we're partitioning the data in a way that we can have local accesses.
And this works for many applications quite well.
So if you think about ATM transactions, etc., there is usually a high sort of locality to it, right? You have certain customers that are in certain branches, etc. They're not moving around too much. So you basically can say, I do this customer on this server, the other customer on that server, and I can get a nice load balance.
But in other setups, in other environments, in other applications,
you might have more change, more fluctuation in the communication,
and your entities are moving around across different parts of the data
set or you have different connections and then you don't keep this locality or you often
have to change this and then the partitioning won't work that well.
For this you have to think about something else, essentially. Using IP over InfiniBand already gives us faster networking.
So that means, basically, we can still
keep the database management system code unchanged.
So you still use the socket.
And we saw earlier from the experiments
that this will give us much higher throughput,
like much higher bandwidth, not perfect latency.
So the latency will still be much higher than with RDMA, but will still be a bit lower than with Ethernet.
And especially if we have large messages, so large amounts of data that need to be shipped around,
then the database management system actually benefits from the higher bandwidth.
If we're talking about small individual packets, then we're still in kind of Ethernet speed,
because we cannot saturate InfiniBand here.
Right? And I mean, in general we still have a problem using IP over InfiniBand, because we cannot fully utilize the network bandwidth, so there's an overhead, and we saw that we're not going to be as fast. And if we have small messages, this is actually more costly in relation to larger messages than in Ethernet.
So this means if we have small communication, small distributed transactions and distributed coordination,
that will actually lead to a lot of individual small messages
which will not be able to fully utilize InfiniBand, meaning that we're not fully utilizing the
connection bandwidth that we would have.
And so we're not fully utilizing the system.
So in that sense, it actually makes sense to rethink the architecture and use a distributed
shared memory architecture using RDMA.
This means we can fully take advantage of the RDMA communications.
So here, we try to fully utilize the bandwidth, but we need to use the verbs API instead of sockets.
And that means we fully,
like we completely have to change the code
for all of the communication across the nodes.
And the idea is to create a shared memory architecture.
So using the memory across all nodes
as a shared memory region,
but we don't have cache coherence here, right?
So this is basically what we have to think about.
So if two nodes access the same cache lines,
we don't have cache coherence.
So the system won't take care of this for you.
So you have to deal with this yourself in your code.
And again, there are the two communication patterns, send and receive and the one-sided verbs.
And so this means we basically, we're trying to keep the memory as one large pool, ideally
still with locality, because accessing remote memory will still be slower
than accessing local memory.
But using the one-sided or two-sided communication, ideally mostly the one-sided communication, we can basically get very fast access to the other memory, without any additional CPU utilization.
And so one idea is NAM-DB, which is also by Carsten Binnig et al. from TU Darmstadt.
So here they basically go with the assumption that network communication is not that expensive anymore, but the distributed coordination, so the small messages and the latency for them, that's basically what's costly. But if you get low latency for this, and if you have bandwidth similar to memory bandwidth through RDMA, then accessing remote memory in larger chunks will be similar to accessing local memory, albeit with a higher latency.
And the other thing, basically, is
while we're trying to keep the data local,
while we're trying to keep the data somehow partitioned,
if we have some kind of workload skew,
we can basically get some kind of load balancing
by just using this remote memory,
by just computing or distributing the computation
while still having the data distributed,
or just on the fly accessing remote memory.
So the idea is that we use RDMA for the network communication
and try to get larger message sizes in order to get the full bandwidth
and still keep a low latency. We don't need to have locality first, but we still want to have some kind of locality. If we can keep locality, it's good, but handling the computation skew is more important, so fully utilizing the individual CPUs, basically.
And we can fully utilize all of the memory resources.
So by using RDMA we can just see the memory as one large pool.
We don't have to split up the memory and then see how we can fit stuff into one part of the memory and the other.
Maybe overload or have one memory almost completely full and the other part of the memory almost
completely empty. By seeing this as one large buffer or one large memory pool,
we can mitigate some of the data skew and data access skew.
So here you can see an experiment that they did in 2017 already, where they basically do a scale-out experiment on TPC-C. TPC-C, again, is the OLTP benchmark by the TPC, so basically small transactions, people buying stuff from a warehouse, and here they have over 50 machines working together doing distributed transactions.
On the one hand, in a shared nothing architecture,
where they would do kind of a send and receive communication,
like probably just basically IP over InfiniBand.
And you can see, in millions of transactions per second, we're not in the millions, right? We're in the hundreds of thousands of transactions per second. And if you use no co-location, so the data is completely distributed, using RDMA communication for everything but no data locality at all,
So then you get a nice scalability,
you can see across the 50 nodes
with almost up to 4 million transactions per second.
And now if we additionally also do some locality,
then we can basically double, almost double this again,
with, in this case, 90% local transactions and 10% distributed transactions. So this means, in contrast to the earlier one, where we basically have no idea where the data actually resides and some of the requests will be served locally, some will be served in a distributed fashion, here we know 90% of the transactions will be local and 10% will be distributed. We can see that we can get one and a half times to almost two times the throughput that we would have with no co-location, no locality at all.
And way, way more than using the traditional shared-nothing architecture with TCP/IP socket-based connections.
Okay, so then we'll briefly look at how we can use this for joins, for example. We'll look at a Radix join as an example.
So, in general, some of the good practices for RDMA: we need to register the memory, and the registration of memory needs communication, we said this earlier. So we cannot do one-sided communication before we have fully specified where the communication will happen.
So it's not that I'm using one-sided communication
and just writing somewhere in another computer's memory.
So there's like a protocol that says,
okay, this region, please, is free for writing something.
And that costs, right?
This is some work that needs to be done.
And the more pages we need to register, the more it costs, and if you want large transfers, at a certain point you might just register large parts of your memory that you then cannot fully utilize for other stuff. So this means you need some kind of buffer management, and that also means you want to reuse all of the RDMA-enabled buffers as much as possible. You don't want to re-register all the time; you basically register something once and then reuse it all the time using some kind of buffer management.
Also, the communication, I said that earlier, needs to be asynchronous. We're writing to the queues, and then we have to do something else, because it will take some time until the work is completed. So rather than issuing a single communication and waiting until it's finished, we want to continuously issue work and then eventually pick up whatever has been done, not overloading the queue but basically trying to keep the queue full and doing other stuff in between. So we need to overlap the computation with the communication. And accessing remote memory, of course, is still slower than local memory.
So we want to hide the network latency.
So if we access remote memory, we basically
want to do something else while we're loading the data.
It's similar, again, to when we want to access memory locally, until it comes up through the caches and gets into the registers. Rather than just waiting for this to happen, the CPU will try to do something else in the meantime. And here, too, we can try to hide the latency by doing something asynchronously.
And finally, an important part is NUMA effects, right?
So if you register some memory somewhere,
you want to use the CPU or the chiplet or whatever
that is local to this region rather than something else.
Otherwise you will have to deal with NUMA effects on top.
And you will basically get slower communication here
again.
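Putting these practices together, a sketch of the overlap pattern could look like the following; the helpers for producing work, recycling buffers, and doing local computation are hypothetical application code, and the budget of eight in-flight requests is just an example:

```c
#include <infiniband/verbs.h>
#include <stdint.h>

#define IN_FLIGHT_BUDGET 8      /* how many work requests we keep outstanding */

/* The helpers below are hypothetical application code: fill a free
 * pre-registered buffer and post a work request for it, return a buffer
 * to the pool, and do unrelated local computation. */
int  have_more_work(void);
void produce_and_post(void);          /* assumed to succeed while a pooled buffer is free */
void reuse_buffer(uint64_t wr_id);    /* recycle the buffer behind this id, no re-registration */
void do_local_work(void);

void pipeline(struct ibv_cq *cq)
{
    int in_flight = 0;
    struct ibv_wc wc[IN_FLIGHT_BUDGET];
    while (have_more_work() || in_flight > 0) {
        /* 1. keep the send queue full, up to a budget
         *    (the pool is assumed to hold IN_FLIGHT_BUDGET buffers) */
        while (in_flight < IN_FLIGHT_BUDGET && have_more_work()) {
            produce_and_post();
            in_flight++;
        }
        /* 2. pick up whatever the NIC has finished in the meantime */
        int done = ibv_poll_cq(cq, IN_FLIGHT_BUDGET, wc);
        for (int i = 0; i < done; i++, in_flight--)
            reuse_buffer(wc[i].wr_id);    /* reuse, never re-register */
        /* 3. overlap: do other computation while transfers are in flight */
        do_local_work();
    }
}
```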
OK, so let's look at a Radix join.
So you remember Radix join?
Who remembers Radix join?
We had it twice already.
We basically do a radix partition and then join inside the partition.
Exactly.
So the idea is that we're using radix partitioning, meaning we basically partition based on the keys, or rather on the binary representation of the keys, and with this we can nicely size the partitions as we need them, such that we can later on process them nicely. And using the bits we can also control the size of the partitions.
We can make sure that the partitions fit into caches.
And usually in order to figure out how large these partitions should be, we start with
some kind of histogram phase.
So we're basically starting with figuring out what the data distribution is, then we do the actual partitioning, and then, once we have everything partitioned, we'll do the actual join. The last step, the join, typically is a hash join, basically.
So everything's partitioned nicely. And the idea is that, besides sizing the partitions rightly to fit into caches, etc., we can also adjust the parallelism by basically ensuring, okay, we have this many partitions, in factors of powers of two, like if we're just adding one additional bit every time.
And of course, we can do this also adaptively. So, in the RDMA version, we usually start with some kind of histogram phase where we're figuring out what the data distribution is and how we need to partition. This is basically what we would do in the first step, and this can be done fully locally. So now we're thinking about having a large cluster with data on each of the nodes, and we start working on each of the nodes. The first step is the histogram computation: each node does the same thing, just reads through the data and checks how the data is distributed, just to know how we'll do the partitions.
Once they've done this, they exchange the histograms, which is usually not that much data, and combine them into one global histogram that basically gives an overview of which data is located where, so, how do we want to partition? This is done through the network, and we have one big global histogram in the end, using send and receive operations.
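A minimal sketch of what such a local histogram computation could look like, with illustrative names and assuming 64-bit keys:

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal sketch of the local histogram phase of a radix join: count how
 * many tuples fall into each radix partition, using the lowest `bits` bits
 * of the key (or of a hash of it). The partition id later also decides
 * which node a tuple is sent to in the network partitioning phase.
 * Names are illustrative, not from the lecture. */
void radix_histogram(const uint64_t *keys, size_t n,
                     unsigned bits, uint64_t *hist /* size: 1 << bits */)
{
    const uint64_t mask = (1ULL << bits) - 1;
    for (size_t p = 0; p < ((size_t)1 << bits); p++)
        hist[p] = 0;
    for (size_t i = 0; i < n; i++)
        hist[keys[i] & mask]++;            /* radix = low-order bits of the key */
    /* Each node computes this locally; the histograms are then exchanged
     * via send/receive and summed into one global histogram, which tells us
     * how large the RDMA buffers per target partition and node have to be. */
}
```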
Then, in the second step, we have the partitioning. Now we basically know how we want to partition and where we want to partition. So one part is basically doing the partitioning through the network and sending the data to the correct nodes. And one part is local, meaning this part of the data we will keep locally. And this is what we will compute on later on; this is what our node, the current node, will do the join on.
And here we still do partitioning in order for the data
to fit into the caches.
And in the network partitioning, we're basically building pools of RDMA-enabled buffers that we can write the partitions to.
And we know how to size these buffers from the histogram phase.
We basically know how much data should go into which node
from the histograms.
And we're doing this asynchronously.
So we're filling up one buffer and sending the buffer, and while this is being sent, we're partitioning into the next buffers. And all of the buffers should be private to the threads in order to not have any synchronization. Of course, we could also use single global buffers on one node, but then there would be synchronization overhead.
And this way we basically keep separate buffers for the separate threads that then can be also
sent separately. We just need to ensure that our message sizes are large enough and that we keep the send queues full. That's basically what we need to ensure.
And then in the final phase, we basically have like traditional hash join where we have
a build and a probe phase where then on a local node, basically after the communication,
everything is local, right?
So then the partitions should all be on the same node
after the partitioning phase, but they're in separate partitions.
So then in the first phase, we're building,
so we're reading all of the partitions for the build side
and build a hash table.
And then with the other side, with the probe side, so the second table, we're going to
start probing this hash table.
And I mean, here we could even do a decision based on each individual node saying, okay,
which side do we want to build?
Which side do we want to probe?
So usually we want to build the smaller side to have a smaller hash table.
And then either we can write the result locally, or, if we want to have a global result somewhere, we again would write it into the network buffers while we continue to create new results. So again, we have to interleave this.
And here you can basically see an example. So this is this kind of join on two different InfiniBand networks, FDR and QDR, with different message sizes that they're using, up to what would be 100 megabytes, right? So these are the message sizes that they're exchanging, and you can see how the bandwidth basically develops.
So in the bytes or low kilobyte range, we basically don't see any difference between the FDR and QDR networks. And remember, FDR is basically twice the speed of QDR. And nowadays, again, we have EDR and then HDR, HDR being up to 200 gigabit per second. So here you can see that only when we get into the kilobyte range can we actually fully utilize the FDR network and get a good performance for the join.
So we need a certain message size.
And now let's look at the join performance in comparison for different kinds of network interfaces, so two billion tuples basically, on four machines with FDR. If we're using IP over InfiniBand, just using InfiniBand without anything else, then the network partitioning, this part, is the most expensive part, right? This will basically take the most time, because the communication is just much more inefficient. If we do the communication with RDMA, we're going to be better, and then, interleaving communication and computation, we can get an additional 20%. Then we're at the point where basically local partitioning and network partitioning take up the same amount of time.
However, you can still see that this is basically
the most costly part.
And this is similar to, if you remember, the GPU lecture, right? If you think about the sorting there, or GPU joins, stuff like this: communication is basically still where the time goes. And this is also the case here, right?
So we're moving data around, this is what costs us.
The actual build and probe phase,
I think this is this part up here.
This, well, it's not that much in comparison
And this gets even more pronounced if we think about more machines. So, going back, right, you see this is basically blowing up this part. So if we're only using two machines, then local and network partitioning are in the same range.
If we're increasing this, going up to 10 machines, we can see that the local partitioning scales down,
because each of the nodes has to do less partitioning in relation to the number of nodes. However, the network partitioning
means more communication. We have to send to more nodes. So this doesn't scale as well. So there,
we won't get the same kind of performance improvement. So while we're basically shaving
off time here in the build and probe phase, which is parallelizable, the local partitioning is parallelizable
and the histogram computation is completely parallel. This
communication part basically doesn't really get improved as much because
we have to transmit more data across the network. Right.
This is also logical, because overall this is strong scaling that we're looking at: we have the same problem size, but we're using more machines, meaning we need to move a larger share of the data across the network. With two nodes, we basically have to move 50% of the data across the network; with 10 nodes, we have to move 90% of the data across the network. So this is why we don't get a super nice speedup here in total.
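As a small back-of-the-envelope check, assuming the data is uniformly partitioned across $n$ nodes, the expected share of tuples that has to leave its node is

\[ \frac{n-1}{n}, \qquad \text{so } \tfrac{1}{2} = 50\% \text{ for } n = 2 \quad\text{and}\quad \tfrac{9}{10} = 90\% \text{ for } n = 10. \]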
Okay, so with this we're actually through with networking. So
we talked about parallel and rack scale computing and how to
program this and then a lot today about fast networks. We don't have that
much time left but maybe I'll show you something before you leave.
Then I can actually show you more next time; we want to talk about FPGAs next. Right, so are there questions on RDMA so far? It's really a different way of writing or doing the communication, but then you can actually get much higher throughput and much higher bandwidth. And for the next lecture, I don't want to do too much today, because it doesn't really make sense to start that much now, but I want to show you something, right? So in the next lecture, mainly, we want to talk about FPGAs.
So this is an example of an FPGA, and this is also an example of an FPGA.
So I brought one, actually.
And the cool thing about this one is that you can actually program it completely with open source, unlike most of these other things.
And so this is what we're going to talk about
in the next lecture
is a type of hardware that can be reconfigured.
So, so far, I'm just going to give you a teaser today.
Where's my mouse?
So far, we've basically looked at hardware in a way that, oh, we have this hardware, now we have to deal with it, right?
So we have the CPU, we have multi-core CPU.
We have to somehow figure out how to use this best in order to get good performance.
The GPU, like a type of hardware that's super parallel.
So we need to think super parallel.
And now we're getting to the even more efficient stuff.
And, I mean, we're not going to get all the way there, but what we basically want to look at tomorrow is field programmable gate arrays, a type of hardware that you can reconfigure, not reconfigure during runtime, but reconfigure while writing your program. So the idea is that in your program,
rather than specify the code and the instructions that will be executed,
you specify the circuits that will be on the board.
So the processor, not a CPU, the processor on this thing would be this small part here. This is a small one, right? It consists of reconfigurable hardware, of circuitry and of lookup tables, where you can specify what should be on there.
And this then means, well, you're building this for a certain type of program, say, for example, some kind of join algorithm, and then it will be able to do only that, but do that very fast.
And this is kind of a trade-off between an ASIC, so a fully hard-coded, fully specialized circuit or chip, and something we have to write a program for, where all of the instructions need to be coded beforehand. Here on the FPGA, we say, okay, we want this kind of circuitry on there.
And there are basic building blocks for circuitry, and these will be connected in the way you want on the chip. You then get much higher efficiency, potentially also higher throughput, while these devices are usually much less powerful. They have far fewer cores, for example, much less chip space, fewer transistors, let's say, than something like a GPU or a CPU.
Okay, but with that as a teaser, I'll show you this next time.
And I can actually show you, it actually does something.
Thanks to Richard, who programmed this for me.
It can count.
That's all it can do right now.
This is specialized computing.
So then we'll chat more about this tomorrow.
Thank you.