Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Field Programmable Gate Arrays
Episode Date: July 11, 2023...
Transcript
Today, later on, probably, we're going to talk about FPGAs.
But before we do that, I want to just give you an update on the timeline.
And before we also do that, we're going to talk about RDMA.
So we're in the third to last week in this semester and in this course.
So today we're going to talk mainly probably about RDMA.
Then tomorrow FPGA.
Maybe I'll be able to finish up FPGA tomorrow.
I don't fully know yet.
Unfortunately, the external speaker that I invited declined.
And so we're not going to have an industry speaker this time.
But this gives us a bit of time to finish up everything smoothly.
So we're not going to be rushed.
And we can have a Q&A session here if need be.
So we're going to have the task.
Then finally, CXL and the Summary and Data Center
tour, if you want that.
In the last week, unfortunately, I cannot be here.
CXL, anyway, Marcel will do.
So that's not a big issue.
The summary, I will ask Lawrence to give you an overview again and do the tour with you, for those of you who are interested.
And then we'll see. So for me, basically, this is where we're going to end; from my end we're going to end here. I mean, if there's time, I can even do a quick summary here already and then leave the rest to him to summarize by himself, basically. So then, if you're lucky, we're just going to have the data center tour here.
Okay, so going back to networking, where we finished off, or did not fully finish, last time. Let me find my cursor here.
So we talked about networking, right? I gave you a bit of an overview of networking and parallelism last time, so how to program and make databases at least to some degree parallel across networks, so across multiple machines, and we talked about rack-scale computing.
And now we're going to go into more detail about networking.
And this is also going to be the rest of this part of the lecture.
So most likely I'll be able to finish this up before the end
and then we switch to FPGA,
but the rest is basically all about fast networks
and how to use them.
And so far, and if you've seen big data systems, one of the major mantras in distributed computing is basically to communicate as little as possible.
Try to do everything locally.
And I mean, this is also true in modern hardware and for networking in general.
I mean, we've seen this on the chip, right?
So we want to do something NUMA local, we want to be NUMA aware, we want to
have everything in the caches, etc. So locality first is still the mantra. And this is, of course,
also true for networking. But networking used to be the thing that's really a no-no, right? Basically the slowest link in the whole system.
So if you have like a distributed system, even if you have to go to disk,
in the past, networking would be slower than going to the disk, for example.
So the idea was that you always try
to minimize the network costs
by using certain partitioning schemes.
So you co-partition the data
so that the data can be used locally.
You use scheduling algorithms to ship the computation
to the data to make sure whatever you need to do
is in the right place.
And you do, again, the computation locally,
and you try to restrict remote operations
as much as possible.
So you just, in general, avoid network traffic.
And I mean, this is basically how systems were designed for a long time, but that also means that very complex schemas or workloads are not really well supported in this kind of setup, or ad hoc workloads where you really don't know what the partitioning will look like.
So you will find a lot of research
where people basically try to find
very good scheduling algorithms or partitioning schemes based on previously seen workloads.
But if the workloads change, basically you have to start from scratch again. And then
you can try to somehow replicate, etc., but it will be costly in a way. However, today we have faster networks. I mean, networks like InfiniBand have always been around, but they're increasingly becoming affordable, and with this, as with any change in the hardware, we can start to rethink the database architecture.
And so what is InfiniBand? InfiniBand is one type of networking infrastructure. So it's basically
an alternative to Ethernet. And here we have much higher bandwidths than in Ethernet typically. I mean, while we can also get faster Ethernet bandwidth these days,
InfiniBand usually is faster than Ethernet.
So you can basically get FDR, EDR, nowadays HDR InfiniBand, which gives you bandwidths in the tens of gigabytes per second. There we're basically close to memory bandwidth, so that means we're all of a sudden in the same ballpark as the memory, and that means we're probably faster than disk in many cases. However, as we already know,
right, you've seen like in the storage lecture, we've also talked about basically getting ever
faster storage. So there, if you get very fast storage, again,
you're going to be in the same ballpark.
And then we're all of a sudden always limited basically
by the PCI Express interconnect.
But the interesting thing is with RDMA,
we can basically saturate that.
So we can basically get networking as fast as PCI Express can give us
with enough channels. I mean we can have single or multiple channels meaning multiple connections
to the same server. And so it's really hardware and low-level protocols. In InfiniBand we have channel adapters, which are basically the network cards and the end nodes in the subnet, and you have the switches, of course. The network cards can have single or multiple ports for multiple connections, and they create and consume packets. And one machine, even if, like this one you can see, it has two ports, you can still plug multiple of those cards into a machine to get more bandwidth.
And in there we have a direct memory access engine that allows direct remote memory access.
This is kind of the game changer.
We've talked about direct memory access before and I think I've hinted to that already.
Let me maybe get a pointer. But basically already on the card we have something where we can then directly interact
with the memory.
And yeah, I said one or two ports.
And then we have the links.
And InfiniBand has three link widths, meaning the number of connections: one, four times, or twelve times. And each of the links then has four wires, one for receiving and one for transmission, or for sending, basically. So we have the transmitting signal and the receiving signal, and each of those is basically a pair so we don't get too much interference. So this is why we need four wires in total for a single link. And there are different media. The classical one would be a copper cable, but then we also have optical cables, where, I actually don't know, but I'm assuming we only need a single connection per transmitting and receiving signal, so meaning one out and one return. And then we could also have printed circuits.
And these cables usually are quite thick.
So if you go to the data center and you
look at the network cards, you will see that there's like the regular ethernet cables that
you know, right? These wobbly small things. And then you will find the InfiniBand cables. And
these are thick cables that usually somehow weirdly hang out of the cages because they cannot really bend very well.
Let me see, maybe I can even show you. Right, so here you can see this is basically the type of
cable that you would have. So there, I mean, you can imagine they're basically quite thick and not
easy to bend.
This is also one thing that people always complain about if they have to set this up.
And because they're thick and heavy, they're also expensive.
But still much more affordable than in the past.
And so now using this InfiniBand technology,
there's two ways to use it.
And like we can use it just like we would use ethernet
using IP, right?
And that's called IP over InfiniBand or IPOIB.
So this is usually what you will see.
And that means we have the regular TCP/IP stack over InfiniBand to just do regular packet transmission and sending. So you have your regular sockets that you would use, and you can use this out of the box in your system. The downside is, if you do this, it will just be as efficient, or inefficient, as IP regularly is, because if you have the whole TCP/IP stack, everything goes through your CPU. You have layers of protocols, and these have to be computed and sorted out by the CPU, typically.
And also because of the different protocols,
you will have copying, right?
So you copy your data from one packet into the next and into the other protocol, and that usually means a lot of CPU overhead. However, with InfiniBand you also have remote direct memory access, and today you also have this with converged Ethernet, so there's something called remote direct memory access over Converged Ethernet, or RoCE, where you can also do this.
But this is usually the way you want to use InfiniBand mainly.
So this means we're not working with the OS at all.
We're not copying data around.
We're just directly basically sending and receiving data
through the network using RDMA verbs.
And there's one-sided or two-sided verbs. One-sided verbs means
one side, meaning one CPU will initiate everything and the other side doesn't have to do anything.
So we're basically just defining memory regions on the different servers and then in the one-sided
communication, the sending or receiving server will just initiate the communication
and will go to the remote memory location, but the other server doesn't have to deal with it.
So there then the NIC, the network interface card, will do all the communication or will do everything in the memory itself.
So it gets a reserved region in memory on the remote server
that it can interact with.
And then we also have the two-sided communication.
This is basically send-receive,
where you would do regular packet connections.
You still want to do this asynchronously,
but at least both sides are aware of the communication. So we're basically, again, building queues on both sides saying,
hey, I'm going to send you this packet,
and then the other side has this basically in their receiving queue.
In the one-sided communication,
we don't have a queue on the other side.
We're just basically touching the other memory directly.
And then the cool thing about this is that the actual communication or the actual processing
is not done by the CPU anymore, but this is done by the NIC, so by the network interface card. Basically, if we go back here, this thing here
will basically directly interact with memory,
read the data from memory, copy the data into memory,
and then whatever it gets, send it through the link
into the other server.
And so the cool thing about this, or why this actually helps us in processing and why we can get much faster networking, is that we're bypassing the CPU. So we're not using the CPU, or at least not completely.
So we have to basically initiate communication somewhere, right? So we will basically say on the one side of the communication,
if we're doing one-sided communication,
the server that initiates something will have to do something,
will have to initiate the communication.
But then everything else, right?
All of the networking, all of the copying, etc.,
this will be done by the network interface card.
And the CPU doesn't have to do anything.
Especially the remote CPU won't have to do anything.
We're bypassing the operating system, meaning we don't have interrupts,
we have no context switching, we have no copying.
So this is also, we're not copying the data on the memory bus.
We're not copying data around into different kind of network packages, etc.
And we have this kind of message-based transaction, right?
So we're sending messages back and forth.
In order to have this efficiently, we need to do this asynchronously.
Because the CPU is not really involved in the communication,
the CPU shouldn't wait for each individual interaction.
Because otherwise we're back in the synchronous operation
and we're not really utilizing this additional computation and networking capabilities of the network interface card.
Because then we'll just talk with the card, the card will do something, the CPU will wait, etc.
So we want to have this completely or as much as possible asynchronously.
So there's different ways of doing this. I already said we can have the TCP IP stack,
or we can directly go to RDMA.
So here we basically see the classical TCP IP stack.
So we would have like a socket layer,
then we have TCP or we have UDP.
Underneath we have IP, then we have Ethernet, we have an adapter driver, and then we would actually have the network card, right? And all of this, I mean at least all of this up here, would still be in the kernel, and to some degree maybe in the user space.
And each individual layer here means copying data, right?
From one protocol into the next,
from one packet type into the next.
We're basically packing packets
and we have to split them up depending on the protocol.
And only then we can actually
send it.
And otherwise we can also use different kinds of libraries. So uverbs would be a library which has some kernel module to use with it, or we can directly use this in user space, similarly to how we would interact with an NVMe with SPDK, if you remember.
So in this way, basically we were saving all of these individual copy operations and we're
directly talking to the network interface card.
Okay, so here's another comparison of the two, right? So if we do TCP/IP, then we have our application, which is in user space and which has some data. If we go through InfiniBand, for example, we would still use the NIC with the direct memory access methods, so we would have a direct memory access buffer. But if we're requesting some data, for example, then the CPU will first have to copy this, the kernel will again have to translate it for the network interface card, then we have the actual communication, and the same thing happens again on the other side.
Right?
So on every layer of the stack, we basically have additional communication.
We have additional copies.
If we're using InfiniBand, all of this is basically done by the network interface card.
Right? So we're directly talking to the network interface card.
Of course, we still have to process our data
in a way to fit into the RDMA verbs
or let's say the messages for now.
But that's all we have to do.
Then we basically say this is in this kind of memory region, and from there on the network interface card will do the rest for us, right? So the network interface card will directly connect to the memory, do the transmission, and on the other side go to the right memory region, read or write the data, and send it back. And in the one-sided operations, we really only have one of the CPUs working on this, right? The other CPU doesn't have to do anything at all. We need some setup, but after that we can basically just do everything through the NIC here, on the actual hardware, directly.
And so we can see, I mean, we need a certain packet size in order to saturate the bandwidth.
So here are some measurements by Carsten Binnig et al., where they basically measured how we can use RDMA.
And so on this side you can see the throughput,
on the other side the latency.
And you can see this log scale, right?
So these differences are huge; this difference is two orders of magnitude, basically, right? The difference in speed that we can see. So here we have classical IP over Ethernet, and of course, if we're here with one gigabit Ethernet, then this is basically what we typically would see, around 100 megabytes per second.
And here, and actually I don't remember
what kind of InfiniBand they used,
probably EDR, so not HDR,
so not the fastest one available yet,
but already quite fast.
But we can see, if we're using IP over InfiniBand, we can get a higher throughput than we would have with Ethernet. Significantly higher, but we're still going to be slower than using RDMA directly.
And we need larger packets in order to,
or we need more data that we send in one shot
in order to get this higher throughput, right?
So Ethernet is basically already maxed out at, I don't remember exactly, 128 bytes or something like this. And for InfiniBand, we need message sizes of something like two kilobytes in order to saturate the bandwidth.
And for the latency, we can see that, well, for IP over Ethernet,
the latency is in the 50 microseconds, something like that.
And here with RDMA, we can get into the one microsecond range.
So much, much faster.
And we kind of keep a constant latency, more or less.
Again, this is also log scale. So even a slight bend here already means a significant increase, though not that much. Up to two kilobytes, we're basically more or less stable and we get quite nice latency here, much lower again than if we're using IP over InfiniBand, because with IP we again have all the copying, and this is what we see here, right? While we're still getting a good throughput, not as high as directly using RDMA, we're getting much worse latency because of all this copying; this is basically what makes it expensive here. So with RDMA you get much faster throughput and you get much better latency.
And the latency, of course, this is not in the range of accessing memory,
but we're not that far away anymore.
So we're somewhere where we can say,
okay, we can actually touch remote memory in a reasonable amount of time.
So if we're talking about the memory hierarchy, like expanding the memory hierarchy with remote memory, then we can see something. Of course, we have the registers, then we have the different caches, then we have local memory, and then we're in the microseconds with remote memory. And finally, after that is basically where we would be with disk, usually. Again, depending on the type of SSD and NVMe, there might be a constant battle about this lower ground, but if we're talking about traditional hard disk drives or regular SSDs, then with InfiniBand for sure we're going to be somewhere here. And that basically means, all of a sudden, for certain applications, or whenever you need more memory for your applications, it might make sense to rather go to remote memory than to stay on the same machine and use disk.
And this is, all of a sudden, again a switch in architecture.
Because initially we would have always thought, okay, let's keep everything on our single node, use the disk there, and only if we then have to communicate,
we would talk to another node because the network is so slow.
But here, with remote direct memory access,
the remote memory all of a sudden becomes an option
that might be faster than using your disk locally.
And also, this gives you additional flexibility
in how you use memory.
So in clouds, for example, you will see that systems
or cloud vendors, for example, start setting up the clusters
in a way that you can actually use the memory
across a rack more flexibly.
Because memory often is underutilized on some machines and overutilized on other machines.
And then through RDMA, you can basically
try to use this more flexibly, more as a memory
pool across machines.
And we will learn more about this in the last lecture about CXL,
which is basically another technology which is particularly built for this.
But RDMA is kind of already the direction to go there,
to expand your memory.
Think about the memory in your distributed cluster,
at least rack-scale cluster, as one large memory.
With the big caveat that you don't have cache coherence.
So this is something that you have to deal with yourself.
Okay, so far as kind of an overview, now let's talk about some RDMA basics.
So how do we set up RDMA?
So RDMA works with channels.
And in order to have this zero CPU utilization stuff, right, you need to set up buffers. So basically you're reserving certain memory regions for the network card that the network card can write to and read from. This way, the network card has a safe zone where it can interact with the memory directly, and another network card can also know this is a region it can write to. So another server can say, I will write into this area of the memory directly.
And in order to do that, you basically
have to pin the memory so it cannot be swapped,
because otherwise the network card cannot access it anymore.
You need the address translation in the network interface card
in order to know where to actually write.
Then you need the permissions for the memory region,
so the network card needs to have permissions to write into the memory region.
And you have some remote and local key in order to ensure that this is actually correct.
So this is used by the adapters for the RDMA operations.
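To make the registration part a bit more concrete, here is a minimal sketch in C using the libibverbs API; it assumes a protection domain `pd` has already been allocated with `ibv_alloc_pd` on an opened device, and error handling is left out:

```c
#include <infiniband/verbs.h>
#include <stdlib.h>

/* Minimal sketch: register a buffer as an RDMA memory region.
 * Assumes a protection domain `pd` was already allocated via ibv_alloc_pd()
 * on an opened device; error handling is omitted. */
struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t size)
{
    void *buf = malloc(size);               /* the region the NIC may access */
    struct ibv_mr *mr = ibv_reg_mr(
        pd, buf, size,
        IBV_ACCESS_LOCAL_WRITE |            /* NIC may write into it locally */
        IBV_ACCESS_REMOTE_READ |            /* remote side may read          */
        IBV_ACCESS_REMOTE_WRITE);           /* remote side may write         */
    /* Registration pins the pages and installs the address translation on
     * the NIC. mr->lkey is used in local work requests; mr->rkey is the key
     * the remote side needs for one-sided access. */
    return mr;
}
```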
And then you have three queues.
So for the communication, you don't need all of them all the time, but for complete communication you will have a send, a receive, and a completion queue.
And the work queues, that is, the send and receive queues, are always created as a queue pair. And these are used to schedule the work that needs to be done.
And this is basically what the CPU needs to do or your program needs to do. That basically,
you're writing something into the queue that needs to be sent. And the completion queue
is a queue for information when the work has actually been performed.
Well, I actually have some examples later.
And so what you do in the application
is basically you're issuing work requests
or work queue elements.
And this is basically just a small struct in the send queue,
which is basically a pointer to the message that needs to be sent,
or a pointer to the incoming message where it should be placed later on.
And then in the completion, once you
have the message in the completion queue,
you will know that this has actually been done.
So this will be done by the network interface card.
And this will, again, be enqueued, basically.
And you will try to keep these queues full in order to fully saturate your bandwidth. So here's an overview of the stack. So the
application basically posts the work request to the queue and each work
request is a message or a unit of work, so either a receive or a send.
Then you have this verbs interface.
This is this, basically you can think of it like a library.
So the different kind of interactions,
the different kind of messages that you can send.
Then the adapter driver, the RDMA adapter driver,
maintains the work queues and manages the address
translations and provides this completion mechanism.
So it's basically informing the CPU or the application that some kind of networking has been performed.
And then on the RDMA or the NIC itself you have the transport layer, so you have reliable and
unreliable transmissions, you have the packetization, the RDMA control, the end-to-end reliability, etc. So if your packets get dropped,
etc. And you have the actual delivery.
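As a rough sketch of what posting work looks like with libibverbs, assuming an already connected queue pair `qp`, its completion queue `cq`, and a registered region `mr` (names and the work request id are illustrative):

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

/* Minimal sketch: post one send work request and poll for its completion.
 * Assumes `qp` is an already connected queue pair, `cq` is its completion
 * queue, and `mr` is a registered region holding the message. */
void send_one_message(struct ibv_qp *qp, struct ibv_cq *cq,
                      struct ibv_mr *mr, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,     /* where the message lives        */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,                /* local key of the region        */
    };
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    wr.wr_id      = 42;                    /* opaque id, comes back in the CQE */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;           /* two-sided: peer must post a receive */
    wr.send_flags = IBV_SEND_SIGNALED;     /* ask for a completion entry     */

    ibv_post_send(qp, &wr, &bad_wr);       /* enqueue into the send queue    */

    /* The CPU is free to do other work here; eventually we poll. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                                  /* busy-poll the completion queue */
    /* wc.status == IBV_WC_SUCCESS means the NIC finished the transfer.      */
}
```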
There are multiple different versions of InfiniBand, and there are continuously new versions, right? Right now we're at HDR. With QDR you can basically get 32 gigabit per second; this is already a bit older, from 2008. Then, around 2015, we had EDR, and now you usually get HDR, which would be 200 gigabit per second.
I already mentioned there's also RDMA over Converged Ethernet, or RoCE; you will also see this every now and then, and it's basically doing the same thing over Ethernet. There you can get 10 gigabit per second, 40 gigabit per second, or I think even a hundred gigabit per second already. So using Ethernet it will usually be a bit slower, but we're getting there in terms of speeds. So InfiniBand is always a bit faster, but also always a bit more expensive than RoCE.
And finally, there's also iWARP, the internet Wide Area RDMA Protocol. InfiniBand is usually something within a rack or within a couple of racks, because the cables are actually heavy and the link length is quite limited. So you cannot do it like Ethernet, where you can easily have a 50-meter cable or even more.
And there's different specifications.
And for InfiniBand, these links are usually much shorter.
But with the iWarp, you can also do this over internet-wide connections, basically.
Okay, so RDMA itself is just a mechanism, right? It's just basically how do you connect the memory
but then you have to specify the semantics
of the data transfer in your application.
And we have two types of memory access models in RDMA.
There's one-sided and two-sided, already mentioned this.
In the one-sided, you have read and write
and some kind of atomic operations.
So this basically means only the sender, only one of the CPUs, one of the partners in communication, knows about the communication at all. The other side will not be involved in the communication.
This will all be handled on the network interface card.
And this, for example, is cool if you want to do pooling, memory pooling. If I say I want to use the remote memory here just to enlarge my memory, I can just use the RDMA one-sided communication and the other CPU doesn't have to deal with the memory region at all. It can still access the memory region, of course, with the right permissions, etc., so you can also do more fancy things with one-sided communication. But if you directly want to communicate with another CPU or another part of your application,
then you would use something like send and receive.
So where basically you can say, oh, here I have new information for you,
or please give me this information.
Then you have two-sided communication, which means
both CPUs have to do something, but again, this would be done asynchronously.
Well, send and receive is basically classical or traditional message passing,
and both sides are actively involved in the communication.
So both need to have the queue pairs for send and receive and the completion queue for the
queue pair.
And basically the sender would put a pointer to a buffer that needs to be sent into the send queue, and the receiver would have a pointer to an empty buffer, basically saying, please put the data into this buffer in memory, where then I will be able to read it.
And then, of course, once the transmission has worked out,
then the data from the send buffer will be copied into the receive buffer on the other side.
And the work queue element that was in the send queue will be removed, and there will be a new entry in the completion queue on the sender side. On the receiver side, the work queue element in the receive queue will be removed, and there will be a completion element in the receiver's completion queue.
So this is basically how this would look like.
So if we have an RDMA send, then you have your host memory, you have the registered memory for the RDMA transfer,
you have a certain buffer that you want to transfer.
Then what you do is basically you put something on the sender side in your send queue, which is a pointer to the memory region.
On the other side, you have also registered memory with a buffer for receiving the information.
You will have a work queue element on the receiving queue.
And then you have the direct transfer. So from here on the CPU doesn't have to do anything anymore.
So the CPU basically just put the elements into the queues
and assign the memory region.
Then the network interface card will just directly copy this over
because using RDMA or the direct memory access,
the network interface card can just copy information into the memory and copy information from the memory.
So we'll basically copy the data into the network interface card, send it across using InfiniBand magic,
copy it from the network interface card into
the memory and once this is done, it will go basically on both sides in the completion
queue.
And then both systems basically know that the transmission has worked out, so system A can reuse the buffer for something else, and system B knows, oh, I can read the information that I wanted, or that was sent to me, from the buffer here.
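A corresponding sketch of the receiver side, again with libibverbs and assuming a connected queue pair and a registered buffer (names are illustrative):

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Minimal sketch of the receiver side of a two-sided transfer: post a work
 * queue element that points to an empty, registered buffer. Assumes `qp`
 * is a connected queue pair and `mr` a registered memory region. */
void post_receive_buffer(struct ibv_qp *qp, struct ibv_mr *mr)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,     /* empty buffer the NIC will fill */
        .length = (uint32_t)mr->length,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = {0}, *bad_wr = NULL;
    wr.wr_id   = 7;                        /* shows up in the completion entry */
    wr.sg_list = &sge;
    wr.num_sge = 1;

    /* An incoming SEND from the other side is matched against this element;
     * the NIC writes the payload directly into the buffer and then places a
     * completion element into the receiver's completion queue. */
    ibv_post_recv(qp, &wr, &bad_wr);
}
```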
However, there is also read and write, so the one-sided communication, where only one side is active, the sender for a write, or the receiver for a read. On the passive side, there's basically no operation that needs to be done, so we don't have any queues where we add completion or send or receive elements; this will only be done on the active side.
So the passive side doesn't do anything.
And also the passive side doesn't get any information that
something has happened, right?
So the passive side doesn't know if there was a data transfer or not.
So basically all of a sudden, something
might pop up in the registered memory region
that the CPU could use, or the passive side
could do something with, or could not do something with.
So it's basically completely unaware of what's going on.
Yes?
Are there any guarantees towards atomicity, so that when both try to write to the same address, there's not half of one and half of the other written?
So yes.
So basically, initially, the buffers
need to be registered properly.
And then if the buffer is registered, basically,
for the transfer, then only one can basically access this.
So you would have, but actually I don't know. I mean, if you could set up a setting where you were mixing up the memory in one way or the other using the one-sided communication, I assume so. So I guess you can break stuff here.
But I don't know the details. There are dedicated atomic operations where you can say, please do a compare-and-swap, something like this, which you can also use for coordinating these transfers then. But presumably, you can also break something here.
So here, basically, to do this read or write, you need the remote side's virtual memory address and memory registration key. And I would assume, and again I'm not too deep into the details, but I would assume that you can basically say, this buffer is now used for writing something, so I'm not going to write into this location. I'm not sure if you cannot break it, so maybe you can.
So the active side basically needs the address and the key, and for this you actually do some kind of communication, right? Initially, in order to set up the buffers, you do some communication, so the receiver side is actually aware something is going to happen. But then, for the actual transfer, there's nothing, no indication, on the passive side.
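A minimal sketch of such a one-sided write with libibverbs, assuming the remote virtual address and rkey were exchanged once during setup:

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

/* Minimal sketch of a one-sided RDMA WRITE. Assumes the remote side's
 * virtual address `remote_addr` and registration key `rkey` were exchanged
 * once during setup; after that, the passive side's CPU is not involved. */
void rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                uint64_t remote_addr, uint32_t rkey, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_mr->addr,
        .length = (uint32_t)len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided: no receive needed */
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;        /* where to write remotely */
    wr.wr.rdma.rkey        = rkey;               /* permission to do so     */

    ibv_post_send(qp, &wr, &bad_wr);
    /* The completion shows up only in the active side's completion queue;
     * the passive side gets no notification at all. */
}
```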
So, in order to set this up, well, I'm not going to walk you through the details. This is more for your information if you want to look at it later on; you can find it in the slides. There are basically the different operations in order to set these things up: you will initialize InfiniBand, then you set up the connections, you set up the buffers, etc., and then you can do the actual writes.
So here are the different kind of commands that you can use
and like a link to a simple example if you want to try it out.
And if you want to try it out, a lot of the servers in our data lab
have InfiniBand connections.
So there you can actually play around with InfiniBand.
I think both of the clusters, actually all of them are connected through InfiniBand,
so you can do with 16 nodes different kinds of experiments there.
Okay, so let's get started again and see how, for example, we can use this in a database
system.
And of course, there's many different ways. On the one hand, we can just drop it in instead of regular Ethernet.
But using this one-sided communication, etc.,
there's lots of interesting research on how we can utilize this
to make certain algorithms
and a certain type of data processing faster. Okay, so traditionally, well, there are different kinds of architectures for distributed processing, but the one that I showed you earlier, the shared-nothing architecture, is used a lot. And
that means we're partitioning the data across all of the nodes in the network and we're trying to
keep all of the data access local and this means also each of the nodes only
has access to the local data and as soon as we want to have the distributed data, we need some kind of coordination and data flow
for the communication with other nodes.
This is usually done with socket connections,
so kind of send and receive communication.
And here, as we already said, we want to have data locality as much as possible.
So we're keeping everything in the local memory, everything on local disks.
And we're partitioning the data in a way that we can have local accesses.
And this works for many applications quite well.
So if you think about ATM transactions, etc., there is usually a high sort of locality to it, right? You have certain customers that are in certain branches, etc. They're not moving around too much. So you basically can say, I do this customer on this server, the other customer on that server, and I can get a nice load balance.
But in other setups, in other environments, in other applications,
you might have more change, more fluctuation in the communication,
and your entities are moving around across different parts of the data
set or you have different connections and then you don't keep this locality or you often
have to change this and then the partitioning won't work that well.
For this you have to think about something else, essentially. Using IP over InfiniBand already gives us faster networking.
So that means, basically, we can still
keep the database management system code unchanged.
So you still use the socket.
And we saw earlier from the experiments
that this will give us much higher throughput,
like much higher bandwidth, not perfect latency.
So the latency will still be much higher than with RDMA, but will still be a bit lower than with Ethernet.
And especially if we have large messages, so large amounts of data that need to be shipped around,
then the database management system actually benefits from the higher bandwidth.
If we're talking about small individual packets, then we're still in kind of Ethernet speed,
because we cannot saturate InfiniBand here.
Right? And I mean, in general we still have a problem using IP over InfiniBand, because we cannot fully utilize the network bandwidth, so there's an overhead, and we saw that we're not going to be as fast. And if we have small messages, this is actually more costly in relation to larger messages than in Ethernet.
So this means if we have small communication, small distributed transactions and distributed coordination,
that will actually lead to a lot of individual small messages
which will not be able to fully utilize InfiniBand, meaning that we're not fully utilizing the
connection bandwidth that we would have.
And so we're not fully utilizing the system.
So in that sense, it actually makes sense to rethink the architecture and use a distributed
shared memory architecture using RDMA.
This means we can fully take advantage of the RDMA communications.
So here, we try to fully utilize the bandwidth, but we need to use the verbs API instead of sockets.
And that means we fully,
like we completely have to change the code
for all of the communication across the nodes.
And the idea is to create a shared memory architecture.
So using the memory across all nodes
as a shared memory region,
but we don't have cache coherence here, right?
So this is basically what we have to think about.
So if two nodes access the same cache lines,
we don't have cache coherence.
So the system won't take care of this for you.
So you have to deal with this yourself in your code.
And again, there are the two communication patterns, send and receive and the one-sided verbs.
And so this means we basically, we're trying to keep the memory as one large pool, ideally
still with locality, because accessing remote memory will still be slower
than accessing local memory.
But using the one-sided or two-sided communication, ideally mostly the one-sided communication, we can basically get very fast access to the other memory, without any additional CPU utilization.
And so one idea is NAM-DB, which is also by Carsten Binnig et al. from TU Darmstadt.
So here they basically go with the assumption that network communication is not that expensive anymore, but the distributed coordination, so the small messages and the latency for them, that's basically what's costly. But if you get low latency for this, and if you have bandwidth similar to memory bandwidth through RDMA, then accessing remote memory in larger chunks will be similar to accessing local memory, albeit with a higher latency.
And the other thing, basically, is
while we're trying to keep the data local,
while we're trying to keep the data somehow partitioned,
if we have some kind of workload skew,
we can basically get some kind of load balancing
by just using this remote memory,
by just computing or distributing the computation
while still having the data distributed,
or just on the fly accessing remote memory.
So the idea is that we use RDMA for the network communication
and try to get larger message sizes in order to get the full bandwidth
and still keep a low latency. We don't need to have locality first, but we still want to have some kind of locality. If we can keep locality, it's good, but handling the computation skew is more important, so fully utilizing the individual CPUs, basically.
And we can fully utilize all of the memory resources.
So by using RDMA we can just see the memory as one large pool.
We don't have to split up the memory and then see how we can fit stuff into one part of the memory and the other.
Maybe overload or have one memory almost completely full and the other part of the memory almost
completely empty. By seeing this as one large buffer or one large memory pool,
we can mitigate some of the data skew and data access skew.
So here you can see an experiment that they did in 2017 already, where they basically do a scale-out experiment on TPC-C. TPC-C, again, is the OLTP benchmark by the TPC, so basically small transactions, people buying stuff from a warehouse, and here they have over 50 machines working together doing distributed transactions.
On the one hand, in a shared nothing architecture,
where they would do kind of a send and receive communication,
like probably just basically IP over InfiniBand.
And you can see, in millions of transactions per second, we're not in the millions, right? We're in the hundreds of thousands of transactions per second. And if you use no co-location, so the data is completely distributed, using RDMA communication for everything but no data locality at all,
So then you get a nice scalability,
you can see across the 50 nodes
with almost up to 4 million transactions per second.
And now if we additionally also do some locality,
then we can basically double, almost double this again,
with, in this case, 90% local transactions and 10% distributed transactions. So this means, in contrast to the earlier one, where we basically have no idea where the data actually resides and some of the requests will be served locally, some will be served in a distributed fashion, here we know 90% of the transactions will be local and 10% will be distributed. We can see that we can get one and a half times to almost two times the throughput that we would have with no co-location, no locality at all.
And way, way more than using the traditional shared-nothing architecture with TCP/IP socket-based connections.
Okay, so then we'll briefly look at how we can use this for joins, for example. We'll look at a Radix join as an example.
So, in general, some of the good practices for RDMA: we need to register the memory, and the registration of memory needs communication, we said this earlier. So we cannot do one-sided communication before we have fully specified where the communication will happen.
So it's not that I'm using one-sided communication
and just writing somewhere in another computer's memory.
So there's like a protocol that says,
okay, this region, please, is free for writing something.
And that costs, right?
This is some work that needs to be done.
And the more pages we need to register, the more it costs, and if you want large transfers, at a certain point you might just register large parts of your memory that you then cannot fully utilize for other stuff. So this means you need some kind of buffer management, and that also means you want to reuse all of the RDMA-enabled buffers as much as possible. You don't want to re-register all the time; you basically register something once and then reuse it all the time using some kind of buffer management.
Also, the communication, I said that earlier, needs to be asynchronous. We're writing to the queues, and then we have to do something else, because it will take some time until the work is completed. So rather than issuing a single communication and waiting until it's finished, we want to continuously issue work and then eventually pick up whatever has been done, not overloading the queue but basically trying to keep the queue full and doing other stuff in between. So we need to overlap the computation with the communication. And accessing remote memory, of course, is still slower than local memory.
So we want to hide the network latency.
So if we access remote memory, we basically
want to do something else while we're loading the data.
It's similar, again, to when we want to access memory locally, until it comes up through the caches and gets into the registers. Rather than just waiting for this to happen, the CPU will try to do something else in the meantime. And here, too, we can try to hide the latency by doing something asynchronously.
And finally, an important part is NUMA effects, right?
So if you register some memory somewhere,
you want to use the CPU or the chiplet or whatever
that is local to this region rather than something else.
Otherwise you will have to deal with NUMA effects on top.
And you will basically get slower communication here
again.
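Putting these practices together, a sketch of the overlap pattern could look like the following; the helpers for producing work, recycling buffers, and doing local computation are hypothetical application code, and the budget of eight in-flight requests is just an example:

```c
#include <infiniband/verbs.h>
#include <stdint.h>

#define IN_FLIGHT_BUDGET 8      /* how many work requests we keep outstanding */

/* The helpers below are hypothetical application code: fill a free
 * pre-registered buffer and post a work request for it, return a buffer
 * to the pool, and do unrelated local computation. */
int  have_more_work(void);
void produce_and_post(void);          /* assumed to succeed while a pooled buffer is free */
void reuse_buffer(uint64_t wr_id);    /* recycle the buffer behind this id, no re-registration */
void do_local_work(void);

void pipeline(struct ibv_cq *cq)
{
    int in_flight = 0;
    struct ibv_wc wc[IN_FLIGHT_BUDGET];
    while (have_more_work() || in_flight > 0) {
        /* 1. keep the send queue full, up to a budget
         *    (the pool is assumed to hold IN_FLIGHT_BUDGET buffers) */
        while (in_flight < IN_FLIGHT_BUDGET && have_more_work()) {
            produce_and_post();
            in_flight++;
        }
        /* 2. pick up whatever the NIC has finished in the meantime */
        int done = ibv_poll_cq(cq, IN_FLIGHT_BUDGET, wc);
        for (int i = 0; i < done; i++, in_flight--)
            reuse_buffer(wc[i].wr_id);    /* reuse, never re-register */
        /* 3. overlap: do other computation while transfers are in flight */
        do_local_work();
    }
}
```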
OK, so let's look at a Radix join.
So you remember Radix join?
Who remembers Radix join?
We had it twice already.
We basically do a radix partition and then join inside the partition.
Exactly.
So the idea is that we're using radix partitioning, meaning we basically partition based on the keys, or rather on the binary representation of the keys, and with this we can nicely size the partitions as we need them, such that we can later on process them nicely. And using the bits we can also control the size of the partitions.
We can make sure that the partitions fit into caches.
And usually in order to figure out how large these partitions should be, we start with
some kind of histogram phase.
So we're basically starting with figuring out what the data distribution is, then we do the actual partitioning, and then, once we have everything partitioned, we'll do the actual join. The last step, the join, typically is a hash join, basically.
So everything's partitioned nicely. And the idea is that, besides sizing the partitions rightly to fit into caches, etc., we can also adjust the parallelism by basically ensuring, okay, we have this many partitions, in factors of powers of two, like if we're just adding one additional bit every time.
And of course, we can do this also adaptively. So, in the RDMA version, we usually start with some kind of histogram phase where we're figuring out what the data distribution is and how we need to partition. This is basically what we would do in the first step, and this can be done fully locally. So now we're thinking about having a large cluster with data on each of the nodes, and we start working on each of the nodes. The first step is the histogram computation: each node does the same thing, just reads through the data and checks how the data is distributed, just to know how we'll do the partitions.
Once they've done this, they exchange the histograms, which is usually not that much data, and combine them into one global histogram that basically gives an overview of which data is located where, so, how do we want to partition? This is done through the network, and we have one big global histogram in the end, using send and receive operations.
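A minimal sketch of what such a local histogram computation could look like, with illustrative names and assuming 64-bit keys:

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal sketch of the local histogram phase of a radix join: count how
 * many tuples fall into each radix partition, using the lowest `bits` bits
 * of the key (or of a hash of it). The partition id later also decides
 * which node a tuple is sent to in the network partitioning phase.
 * Names are illustrative, not from the lecture. */
void radix_histogram(const uint64_t *keys, size_t n,
                     unsigned bits, uint64_t *hist /* size: 1 << bits */)
{
    const uint64_t mask = (1ULL << bits) - 1;
    for (size_t p = 0; p < ((size_t)1 << bits); p++)
        hist[p] = 0;
    for (size_t i = 0; i < n; i++)
        hist[keys[i] & mask]++;            /* radix = low-order bits of the key */
    /* Each node computes this locally; the histograms are then exchanged
     * via send/receive and summed into one global histogram, which tells us
     * how large the RDMA buffers per target partition and node have to be. */
}
```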
Then, in the second step, we have the partitioning. Now we basically know how we want to partition and where we want to partition. So one part is basically doing the partitioning through the network and sending the data to the correct nodes. And one part is local, meaning this part of the data we will keep locally. And this is what we will compute on later on; this is what our node, the current node, will do the join on.
And here we still do partitioning in order for the data
to fit into the caches.
And in the network partitioning, we're basically building pools of RDMA-enabled buffers that we can write the partitions to.
And we know how to size these buffers from the histogram phase.
We basically know how much data should go into which node
from the histograms.
And we're doing this asynchronously.
So we're filling up one buffer and sending the buffer, and while this is being sent, we're partitioning into the next buffers. And all of the buffers should be private to the threads in order to not have any synchronization. Of course, we could also use single global buffers on one node, but then there would be synchronization overhead.
And this way we basically keep separate buffers for the separate threads that then can be also
sent separately. We just need to ensure that our message sizes are large enough and that we keep the send queues full. That's basically what we need to ensure.
And then in the final phase, we basically have like traditional hash join where we have
a build and a probe phase where then on a local node, basically after the communication,
everything is local, right?
So then the partitions should all be on the same node
after the partitioning phase, but they're in separate partitions.
So then in the first phase, we're building,
so we're reading all of the partitions for the build side
and build a hash table.
And then with the other side, with the probe side, so the second table, we're going to
start probing this hash table.
And I mean, here we could even do a decision based on each individual node saying, okay,
which side do we want to build?
Which side do we want to probe?
So usually we want to build the smaller side to have a smaller hash table.
And then either we can write the result locally, or, if we want to have a global result somewhere, we again would write it into the network buffers while we continue to create new results. So again, we have to interleave this.
And here you can basically see an example. So this is this kind of join on two different InfiniBand networks, FDR and QDR, with different message sizes that they're using, up to what would be 100 megabytes, right? So these are the message sizes that they're exchanging, and you can see how the bandwidth basically develops.
So in the bytes or low kilobyte range, we basically don't see any difference between the FDR and QDR networks. And remember, FDR is basically twice the speed of QDR. And nowadays, again, we have EDR and then HDR, HDR being up to 200 gigabit per second. So here you can see that only when we get into the kilobyte range can we actually fully utilize the FDR network and get a good performance for the join.
So we need a certain message size.
And now let's look at the join performance in comparison for different kinds of network interfaces, so two billion tuples basically, on four machines with FDR. If we're using IP over InfiniBand, just using InfiniBand without anything else, then the network partitioning, this part, is the most expensive part, right? This will basically take the most time, because the communication is just much more inefficient. If we do the communication with RDMA, we're going to be better, and then, interleaving communication and computation, we can get an additional 20%. Then we're at the point where basically local partitioning and network partitioning take up the same amount of time.
However, you can still see that this is basically
the most costly part.
And this is similar to, if you remember, the GPU lecture, right? If you think about the sorting there, or GPU joins, stuff like this: communication is basically still where the time goes. And this is also the case here, right?
So we're moving data around, this is what costs us.
The actual build and probe phase,
I think this is this part up here.
This, well, it's not that much in comparison
And this gets even more pronounced if we think about more machines. So, going back, right, you see this is basically blowing up this part. So if we're only using two machines, then local and network partitioning are in the same range.
If we're increasing this, going up to 10 machines, we can see that the local partitioning scales down,
because each of the nodes has to do less partitioning in relation to the number of nodes. However, the network partitioning
means more communication. We have to send to more nodes. So this doesn't scale as well. So there,
we won't get the same kind of performance improvement. So while we're basically shaving
off time here in the build and probe phase, which is parallelizable, the local partitioning is parallelizable
and the histogram computation is completely parallel. This
communication part basically doesn't really get improved as much because
we have to transmit more data across the network. Right.
This is also logical, because overall this is strong scaling that we're looking at: we have the same problem size, but we're using more machines, meaning we need to move a larger share of the data across the network. With two nodes, we basically have to move 50% of the data across the network; with 10 nodes, we have to move 90% of the data across the network. So this is why we don't get a super nice speedup here in total.
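As a small back-of-the-envelope check, assuming the data is uniformly partitioned across $n$ nodes, the expected share of tuples that has to leave its node is

\[ \frac{n-1}{n}, \qquad \text{so } \tfrac{1}{2} = 50\% \text{ for } n = 2 \quad\text{and}\quad \tfrac{9}{10} = 90\% \text{ for } n = 10. \]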
Okay, so with this we're actually through with networking. So
we talked about parallel and rack scale computing and how to
program this and then a lot today about fast networks. We don't have that
much time left but maybe I'll show you something before you leave.
Then I can actually show you more next time; we want to talk about FPGAs next. Right, so are there questions on RDMA so far? It's really a different way of writing or doing the communication, but then you can actually get much higher throughput and much higher bandwidth. And for the next lecture, I don't want to do too much today, because it doesn't really make sense to start that much now, but I want to show you something, right? So in the next lecture, mainly, we want to talk about FPGAs.
So this is an example of an FPGA, and this is also an example of an FPGA.
So I brought one, actually.
And the cool thing about this one is that you can actually program it completely with open source, unlike most of these other things.
And so this is what we're going to talk about
in the next lecture
is a type of hardware that can be reconfigured.
So, so far, I'm just going to give you a teaser today.
Where's my mouse?
So far, we've basically looked at hardware in a way that, oh, we have this hardware, now we have to deal with it, right?
So we have the CPU, we have multi-core CPU.
We have to somehow figure out how to use this best in order to get good performance.
The GPU, like a type of hardware that's super parallel.
So we need to think super parallel.
And now we're getting to the even more efficient stuff.
And, I mean, we're not going to get all the way there, but what we basically want to look at tomorrow is field programmable gate arrays, a type of hardware that you can reconfigure, not reconfigure during runtime, but reconfigure while writing your program. So the idea is that in your program,
rather than specify the code and the instructions that will be executed,
you specify the circuits that will be on the board.
So the processor, not a CPU, the processor on this thing would be this small part here. This is a small one, right? It consists of reconfigurable hardware, of circuitry and of lookup tables, where you can specify what should be on there.
And this then means, well, you're building this for a certain type of program, say, for example, some kind of join algorithm, and then it will be able to do only that, but do that very fast.
And this is kind of a trade-off between an ASIC, so a fully hard-coded, fully specialized circuit or chip, and something we have to write a program for, where all of the instructions need to be coded beforehand. Here on the FPGA, we say, okay, we want this kind of circuitry on there.
And there are basic building blocks for circuitry, and these will be connected in the way you want on the chip. You then get much higher efficiency, potentially also higher throughput, while these devices are usually much less powerful. They have far fewer cores, for example, much less chip space, fewer transistors, let's say, than something like a GPU or a CPU.
Okay, but with that as a teaser, I'll show you this next time.
And I can actually show you, it actually does something.
Thanks to Richard, who programmed this for me.
It can count.
That's all it can do right now.
This is specialized computing.
So then we'll chat more about this tomorrow.
Thank you.