Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Networking
Episode Date: July 2, 2024...
Transcript
Okay, so welcome to our next session on hardware-conscious data processing.
So, just to say, we're going to continue with networking and RDMA.
You had an intermission with some GPU in between.
I hope you enjoyed that.
And so, but I'll just
get back to where we were, right?
So we were on
PCI Express talking to the
network. We had the GPU already.
We had the disk. And we're
close to actually wrapping everything
up, right? So today we're going to finish
networking or RDMA.
Then we're going to start
tomorrow with FPGAs,
have our last task, then FPGA second time,
and we'll have an invited talk.
So this one's important to me, right?
I repeatedly said that this is going to be interesting.
We're going to hear about FPGA in practical applications,
which actually I also am quite interested
in.
And then we do the summary and data center tour.
So are you going to join the data center tour?
You're interested in looking at the servers, et cetera?
Because then I will set this up.
So then we're going to do like half of the session.
We're going to do a bit of wrap up,
maybe some discussions.
Also, if you think of things that we should do differently in the future,
or where we can improve,
or if there are topics where you say,
okay, this I did not fully get,
or this was too slow,
or please do more on this end, then
feel free to collect this and let us know. We're always trying to improve this somehow.
Okay, so I've basically talked up to here. Now I want to cover RDMA basics and how RDMA is used in database
management systems, for example. We've already covered this to some extent with the rack-scale
applications, right? So if you have a rack-level server, using InfiniBand and RDMA makes a lot of sense because you get really
high-throughput networks. So this is the slide where we stopped last time. Hi folks, move to the front.
Okay so this is the slide. So you remember basically, I mean, this is
kind of the hierarchy that we always have,
right? We have the registers, we have the caches,
several levels of caches,
then we have the local memory.
And in the past,
in traditional hardware setups,
we would say, well, then for a long time nothing,
then we have disk, and then again for a long time nothing,
and then we have the network.
But nowadays, using remote memory through low latency networking,
we can actually shift this, right?
So remote memory is shifting up, into the range where the SSD and HDD used to be.
And there, to be honest, there's actually always kind of a,
let's say, a bit of fighting against each other.
So disks are becoming, or SSDs are becoming faster.
So you can get lower latencies, although not necessarily
the low latencies that you might get with remote memory,
especially if you're thinking about also CXL devices, which
we've covered also before.
So far, I almost exclusively talked about InfiniBand,
if I remember correctly.
And there is an alternative.
There's a couple of alternatives to this.
So while InfiniBand is the method of choice for RDMA because it has the lowest latency and it has very high throughputs,
you can actually also get the same kind of functionality using classical Ethernet networks.
And this is called RDMA over Converged Ethernet, or RoCE (pronounced "Rocky").
So instead of using InfiniBand,
which is a special type of networking infrastructure
where you need special switches and special NICs,
you can use Ethernet switches and NICs.
They also need to support that.
So it's not your general off-the-shelf basic Ethernet switch and NIC,
but they need to be RoCE enabled.
And then you can embed your RDMA messages in UDP datagrams.
So basically you're using UDP to send InfiniBand messages.
This generally also works with
standard Ethernet, but then it's slow. So if you just package this
in regular Ethernet UDP, you basically get
plain Ethernet or IP performance.
So as an example, if you have a 100 gigabit Ethernet
network, which is kind of similar to EDR
InfiniBand, then you get the same kind of 12 gigabyte per second
goodput.
I pulled this out of some research paper
by Ericsson
where they compare different kinds of
high-performance network
options. There you can see that with
native InfiniBand
you're in the sub-microsecond range, so hundreds
of nanoseconds basically. With
RoCE you get a slightly higher latency, but it's the same ballpark. If you're using native Ethernet for the same
kind of stuff, then you're about one order of magnitude higher in terms of latency, right? So
this is just a lot slower. But RoCE can get very similar, slightly
higher, but very similar throughput. And if you look at the bandwidth, you get basically the same kind of
bandwidth that you would get with InfiniBand. The only slight difference might be in the
final goodput. So you remember, goodput is the amount of
throughput that you actually get
for your application.
So the bandwidth that the network gives you,
in terms of, for example, 100 gigabit Ethernet,
means 12.5 gigabytes per second in total.
But then you have all the packetization, right? So you have the
Ethernet headers, you have the IP headers, et cetera,
and with this you end up with a certain goodput. So basically, the goodput is the amount of bandwidth
that you get when you subtract all of the packet overhead.
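Just to make that concrete with rough numbers (my own back-of-the-envelope estimate, assuming roughly 4 KB payloads and around 80 bytes of combined Ethernet, IP, UDP, transport-header, and framing overhead per packet; exact header sizes depend on the protocol variant):

$$ \text{goodput} \approx 12.5\ \text{GB/s} \times \frac{4096}{4096 + 80} \approx 12.3\ \text{GB/s} $$

which is in the ballpark of the roughly 12 gigabytes per second mentioned above; with smaller messages the header share grows and the goodput drops accordingly.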
So looking at this, InfiniBand versus RoCE
bandwidth is, again, very similar.
So you get the same kind of bandwidth,
and you get a very similar latency.
So RoCE is a good alternative today
and much better than native Ethernet.
There's yet another alternative,
because InfiniBand
is more of a rack-scale networking technology.
Right, so meaning if you have a whole data center, you're not going to have a single
InfiniBand switch connecting everything, and also the cable lengths are quite limited.
So if you're using copper cables, this typically means maximum of seven meters of
cable length. The higher the throughput or the bandwidth, typically the lower the length. So
this might be only three meters then. And this means you're not going to get much more cabling
than a single rack. Then otherwise you're going to have a problem. If you want to be across data centers,
then there's iWARP,
which is officially not an abbreviation, it's just called iWARP.
So that's basically RDMA over TCP.
And that again means you can just use
your regular TCP network,
and that means you can use RDMA, et cetera, over long distances.
So again, this is a special type of hardware.
Otherwise, it's going to be slow.
Otherwise, we're back to this kind of stuff here.
And modern NICs often support both.
So modern Ethernet NICs that do converged Ethernet can do RoCE, and they can also do iWARP.
Of course, iWARP has higher latencies, but it's good if you really want to do this across very large network topologies. So this is just an alternative, in essence. So far we've basically covered the network hardware,
and now we're going to go into RDMA itself. So essentially now we have the hardware and the
protocols to do RDMA messaging and this type of communication.
Okay, so let's set up RDMA. RDMA is remote direct memory access. So the idea is that I can somehow directly access the memory from one server on another. And of course I cannot
access it directly, I need some kind of networking infrastructure in between,
but the networking infrastructure
does the communication for me.
It does, like I just, from an application point of view,
I just set up the connection points,
I just set up the memory regions,
and then I connect these memory regions
and the networking infrastructure does the copying for me.
So it basically copies data back and forth. And for this we need buffers.
So we need areas in our memory that will be copied back and forth, where we basically have designated memory that can be used by the NICs. So this means the memory actually needs to be pinned,
so it's not going to get lost somehow.
So it's not paged out, which would be slow,
because then the OS would need to bring it back,
and the NIC does not know about that.
So it needs to be pinned.
We need the address translation information.
So in order to know what kind of
addresses on the one end are reflected in the address space on the other end.
And we need to set the permissions and then we need kind of local and remote keys that are used
by the adapters for the RDMA operations.
So that's kind of this.
We have memory regions on both ends,
and we have kind of a channel between them
that we're starting to set up.
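To make this setup step a bit more concrete, here is a minimal sketch of how registering such a pinned memory region looks with the libibverbs API (my own simplification, not lecture material; the function name is hypothetical and error handling is omitted):

```c
/* Minimal sketch of registering a pinned RDMA buffer with libibverbs.
 * Assumes ctx is an already-opened ibv_context and size is a multiple
 * of the page size; error handling and cleanup are omitted. */
#include <infiniband/verbs.h>
#include <stdlib.h>

struct ibv_mr *register_buffer(struct ibv_context *ctx, size_t size, void **buf_out) {
    struct ibv_pd *pd = ibv_alloc_pd(ctx);          /* protection domain */
    void *buf = aligned_alloc(4096, size);          /* page-aligned buffer */

    /* Registration pins the pages and returns lkey/rkey for the adapters. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    /* mr->lkey is used in local work requests,
     * mr->rkey is what the remote side needs for one-sided read/write. */
    *buf_out = buf;
    return mr;
}
```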
The actual communication itself is then based on queues.
And that doesn't really depend on what kind of operations
that we're doing,
but on both ends, or on both NICs basically, we'll have sets of queues, and in particular
three queues in any case. So we have the send, we have the receive, and we have the completion
queue. And maybe I'll close the door just quickly.
There's some noise from outside. Okay. So we need these three queues basically for telling the NICs what needs to be done.
So there's a send, a receive and a completion queue.
The send and the receive queues are the work queues.
And these are always created in pairs.
And you can have multiple of these pairs.
And then there's a completion queue, which basically tells us what is the work that is already done.
And this is always, how would you say,
mirrored on both ends, like on both parts of the connection.
So you're connecting server A and B,
then A and B will have a send and receive
and a completion queue.
And the send queue on A reflects basically
or is basically connected to the receive queue
on the other end and vice versa.
And on both sides, we have the completion queue
to just say, okay, what has been done and what's not.
The send and receive queues are there
to schedule the work that needs to be done.
So the send queue basically says on this side,
I need to copy this buffer, for example,
I need to copy this memory region to the other side.
The receive queue says, okay, I need to copy this into
my memory region here. So one end basically says data goes to the NIC and is sent through the
network. The other side says data from the NIC goes into the memory. And the completion queue,
once this copying has been done, the completion queue basically has entries in there to say these work items have been done.
This is basically just to notify when work has been done, so that the application does not have to wait synchronously for the communication, but
the communication can be done asynchronously by using these completion queues.
So applications issue jobs using work requests, and these send or receive requests can be put into the
work queues. And essentially these reference buffers: in the send queue, a work request is basically a
pointer to the data that needs to be sent; in the receive queue, it's basically a pointer to
where the data that has been received should be placed.
When this has been done,
well, then there's a completion queue element
that will be put in the completion queue.
The application can basically see,
okay, this has been done.
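As a rough illustration of these queue mechanics (again my own sketch with libibverbs, not from the lecture; the queue pair, completion queue, and memory region are assumed to be set up already, and error handling is left out):

```c
/* Minimal sketch of two-sided RDMA send/receive with libibverbs. */
#include <infiniband/verbs.h>
#include <stdint.h>

void post_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len) {
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad;
    ibv_post_recv(qp, &wr, &bad);        /* enqueue a pointer to the target buffer */
}

void post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len) {
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 2, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_SEND,
                              .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad;
    ibv_post_send(qp, &wr, &bad);        /* enqueue a pointer to the source buffer */
}

void wait_completion(struct ibv_cq *cq) {
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0) /* busy-poll the completion queue */
        ;
    /* wc.status == IBV_WC_SUCCESS means the work request finished */
}
```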
From a stack point of view,
this looks like this.
So we have our application that posts these work requests.
And each work request is essentially a message or a unit of work.
Then we have the interface.
We have, yeah, it could be a library, for example,
that basically gives us the kind of commands that we have.
So it can be one-sided or two-sided commands.
We'll go through this in a bit more detail in a second.
And then we have the driver that maintains the work queues, manages the address translation, and has the
mechanisms to fill the completion queue and raise these completion events.
And then underneath we have the actual hardware, right? So we have the network interface card
and the network protocol where we have different kind of transport layer depending
on the underlying networking infrastructure. So in InfiniBand that might be reliable, in Ethernet
it's unreliable or we might even put this on top. So if we do InfiniBand over Ethernet, so classical, without any special hardware,
then there might be a whole network stack on top,
which is underneath the driver.
There, we then also packetize the messages
and implement in software the whole protocol that needs to be done in there.
So just as a kind of performance view on this whole thing: we said we have InfiniBand, currently we're at NDR, so typically one network connector,
so if you have one cable that would consist of four links,
then you would get up to 50 gigabytes per second.
So 400 gigabit per second or 50 gigabyte per second.
The specification for XDR is also already out.
That would then give us double again, so 800 gigabit per second, or 100 gigabytes per second.
So this is how InfiniBand currently is scaled.
So just by doubling this.
It used to be less than that.
So QDR was 32 gigabit per second, FDR 54.
But now from EDR, they're basically just doubling
and doubling all the time.
If you look at modern RoCE hardware,
I just looked this up yesterday, you can also get up to 400 gigabit per second, so the same kind of
performance basically. Again, similarly, with a single adapter, meaning a single cable, you can get up
to 400 gigabit per second.
This would be, in this case, for example,
a PCI Express 5 x16 connector.
So it would look basically the same as the InfiniBand setup.
And again, we can also do this using iWARP.
Right, and you see this is still scaling.
So there's also new, and I don't remember
what comes after XDR, but there's new protocol levels
on the horizon.
RDMA itself is just a mechanism, right?
It also does not really specify any semantics of the data transfer.
It's really just, okay, copy data back and forth.
It's just basically about how we access the remote memory.
And there are two ways, or two models, of how to access the memory.
There's one-sided access, which basically means read, write, and atomic operations. Meaning I can just say, okay, I want to write this data into the remote memory, or I want to read this region of memory from the remote memory. Or I have these atomic operations that we discussed
during locking. I can also say I want to do an atomic fetch-and-update or something like the
compare-and-swap operation. I can do this remotely in the remote memory
and I will have this in an atomic fashion.
So nothing else can basically then happen in between.
If I do the compare and swap,
I basically get the memory, the value back.
It's locked on the other side.
I can exchange the value.
Obviously, one shouldn't say obviously,
but it makes sense that this is slow.
So this basically needs a complete round trip to the other memory.
So this will have a fairly high latency, much higher than if you do this locally.
And then there's two-sided access.
So where we have send and receive, which basically
means both ends actually wait for
these things to happen. So I'm sending the data and the other side receives it.
Say I have an application that runs on two servers.
Application on server A sends the data
and application on server B waits for the data and receives it.
And that's different from the read and write,
because in the read and write, the data just appears in the memory
and the CPU is not involved.
In the send and receive, both applications, both CPUs do something.
Both CPUs basically are part of the communication.
So like classical networking, where you have like a socket and you're reading from the socket.
So this is more the way send and receive works.
So send and receive is more this traditional message passing, where both the source and the destination
are actively involved in the communication. Both have their queues in the application. So you have
a send and a receive queue and the completion queue for the queue pair.
And then the sender, in its work request,
basically has a pointer to the buffer
that needs to be sent.
And that's then enqueued in the send queue.
And the receiver has a pointer to the memory region
where the data will be received,
and that receive work request is also enqueued. So let's look at how
this actually looks like. So we have the send queue, we have the receive queue, and we have
the completion queue on both ends and so this will be now a send operation for example and these are
just pointers right so that's again important. I'm not enqueuing the data here into these queues.
I'm just enqueuing pointers here. So I have my registered
memory that's pinned and I have a buffer
that needs to be transferred. On the other side, I have the same thing.
I have my pinned memory and a buffer where the data should be
received. And to some degree, in applications we will always use some kind of
send and receive when we're establishing communication.
Unless we have some kind of global knowledge about all the data transfers,
so everything is completely static, as soon as data sizes, for example,
are dynamic, I will need some kind of communication to set up
these buffers, to make sure that there is enough buffer space for the stuff that I want to
send there, for example. So this part is usually done with a send and receive operation.
So if I want to transfer this, the first thing is basically I create a work queue item in the send queue.
At the same time, on the system that receives, there needs to be a work queue item or work queue entry on the receive queue.
And that then basically points to the buffer regions where the data actually should go to.
And once this is enqueued, then the NICs know what to do.
So this is basically the setup for the network interface
cards to know where data should be sent.
On this side, basically,
where do I need to read the data; on that side,
where do I need to write the data.
As soon as this is done,
the NICs can actually do the actual copying.
So the NICs then do this independently.
There's no CPU involvement for that.
That's basically the NIC just reads the memory region,
sends it over.
And once it is done, so once this side has sent everything, it creates a completion
queue element. Once the other side has received everything, it creates a completion queue element.
And with this, we know that our transfer is done. So with this, basically, we have one RDMA message
sent through. And from the CPU point of view,
it's really just establishing basically
these queue pair elements.
So like here, I'm creating the send queue element
and I'm waiting for a completion element on this side,
like here, and on the other side,
I'm establishing basically a receive queue element and again
waiting for the completion element here. Okay if I'm doing a read and write then only the sender
side, or basically the side issuing the request, is active. This would basically mean, in a read and write operation,
rather than having a receive queue and completion queue here,
I only need this on this side.
I'm only setting this up here.
Then I'm sending this through the NIC.
The memory region needs to be pinned.
So the other side must be set up in a way that the NIC knows
where to put the data,
but the application does not need to be aware
what's going on on this end.
So basically just, I'm just basically sending this through.
And once this is completed here, I mean,
the NIC will still have their queues here.
But the CPU doesn't need to do anything on this side.
So I'll just basically send the data through.
So the passive side doesn't issue any operations here, uses no CPU cycles,
and doesn't get any kind of information
that the transfer has been done or not.
So basically on the passive side,
if I'm writing data to the passive side,
I don't know if it's written.
So on the passive side,
I don't know if I've received anything or not.
I can basically
just poll every now and then, I can have some kind of status bit, or I'm using specific
kinds of read or send and receive operations. Additionally, in order to signal that
all my writes, et cetera, are done, I will use some communication again to say,
okay, now everything is done.
And so that would be a pattern that I can use.
If I want to do this read and write,
then I basically need the remote side's
virtual memory address
and the remote side's memory registration key.
This all needs to be set up, right? So the remote side needs to be aware that something can
happen there, and that the sender is actually allowed to write something into the memory region.
So the active side needs the address and the key. And this is typically exchanged through these traditional send
and receive mechanisms.
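Sticking with the same hypothetical verbs-based sketch as before (my own illustration, not lecture material), a one-sided write then just carries that remote virtual address and registration key in the work request:

```c
/* Minimal sketch of a one-sided RDMA write with libibverbs.
 * remote_addr and rkey must have been exchanged beforehand (e.g. via
 * send/receive); qp and mr are assumed set up; error handling omitted. */
#include <infiniband/verbs.h>
#include <stdint.h>

void rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *local_buf,
                uint32_t len, uint64_t remote_addr, uint32_t rkey) {
    struct ibv_sge sge = { .addr = (uintptr_t)local_buf,
                           .length = len, .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 42, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_RDMA_WRITE,
                              .send_flags = IBV_SEND_SIGNALED };
    wr.wr.rdma.remote_addr = remote_addr;  /* remote virtual address */
    wr.wr.rdma.rkey        = rkey;         /* remote memory registration key */
    struct ibv_send_wr *bad;
    ibv_post_send(qp, &wr, &bad);
    /* The passive side's CPU is not involved; only the local completion
     * queue will eventually tell us that the write has been placed. */
}
```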
So looking at this, and this is more of a quick overview
of the type of operation.
So just maybe for you to look it up,
this is also in the slides.
It's obviously in the slides, but you can use the link to check how to set this up. So this is basically the whole set of operations that you need to do in order to actually do an RDMA write here. So I initialize the whole setup, the connection, et cetera,
in order to then do the actual write.
So it's basically first from here and continues from there.
So you can see this is maybe the last box here
is the first box over here.
Okay, there's a couple of things that you can optimize. So we see, I mean, we've seen there is some involvement.
I mean, there's multiple steps, right?
So even if we do just a send and receive,
it's multiple communication ops, right?
So we need to exchange
the queue information first, then we do the actual transfer, and then we need to exchange
the completion events, in order to know that things actually happened, right, that the
transfers have worked. And so the completion queue can actually make things slow, right?
So if I'm synchronously sending things,
basically I have to wait until the data
has actually been transferred.
I have to wait for the acknowledgement.
So this is always round trips that I'm basically wasting
while I could maybe already send some more data
because I'm sending the data.
The data has been sent.
I get a completion event.
Then I basically go back to the application and say,
okay, now next thing to do.
So there's some cycles that get lost.
In order to not lose these cycles,
what I can do is I can do so-called doorbell batching.
Rather than issuing individual
work queue elements and then waiting until my work queue element is completed,
I can actually issue multiple of these before going to the completion queue. And this is doorbell batching; there's a small sketch of this a bit further down. Another thing is that these NICs
typically even have multiple processing units. So in there it's not just a single
processor essentially but there's multiple of these in order to get some parallelism.
And this parallelism we can actually use. So rather than having a single queue,
we can have multiple queues
and use these processing units in parallel.
So if everything goes to a single queue,
then only a single processing unit can run,
and that might also be the bottleneck
at a certain point in time.
And in general, Atomics might seem like a really good idea
if you have a distributed data structure,
but they are actually slow.
So if you put everything, like all of your logic
into these kind of atomics in RDMA,
then your application will wait for a long time
because it basically has nothing to do for a long time.
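As a rough sketch of what doorbell batching could look like at the verbs level (my own illustration, assuming between 1 and 16 prepared send buffers per batch), you chain the work requests and ring the doorbell once:

```c
/* Minimal sketch of doorbell batching: chain several work requests and post
 * them with one ibv_post_send call, signaling only the last one.
 * qp and mr assumed set up; assumes 1 <= n <= 16; error handling omitted. */
#include <infiniband/verbs.h>
#include <stdint.h>

void post_batch(struct ibv_qp *qp, struct ibv_mr *mr,
                void **bufs, uint32_t *lens, int n) {
    struct ibv_sge sge[16];
    struct ibv_send_wr wr[16];
    for (int i = 0; i < n; i++) {
        sge[i] = (struct ibv_sge){ .addr = (uintptr_t)bufs[i],
                                   .length = lens[i], .lkey = mr->lkey };
        wr[i]  = (struct ibv_send_wr){ .wr_id = (uint64_t)i, .sg_list = &sge[i],
                                       .num_sge = 1, .opcode = IBV_WR_SEND };
        wr[i].next = (i + 1 < n) ? &wr[i + 1] : NULL;   /* chain the requests */
    }
    wr[n - 1].send_flags = IBV_SEND_SIGNALED;  /* only one completion event */
    struct ibv_send_wr *bad;
    ibv_post_send(qp, wr, &bad);               /* one doorbell for the batch */
}
```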
Okay, so this is an overview to RDMA so far.
So the basic concepts, I guess, clear, right?
Okay, so now that we have an understanding of how this works
in contrast to traditional networking,
let's see how we can use this in a database system.
And classically, many databases are shared-nothing architectures. We already said there's also this shared-disk architecture that's been done for some time, and that is also where fast networks have been used
before, because in a shared-disk kind
of setup, you typically ship the data to your processing units.
So you have multiple processing units.
You ship the data there.
And for that, you need really fast network.
Otherwise, the data transfer will always be a bottleneck.
The alternative, if you have slow networking,
then what you do is you try to have a smart partitioning over the nodes.
So if you have classical Ethernet
and a large-scale setup, you're trying to basically have the data nicely partitioned across
the nodes, and for each node in the setup we try to keep all of the accesses as local as possible.
This works for many applications because many applications actually partition nicely. We also
discussed this earlier because
you're dealing with individual customers, you're dealing with individual
warehouses, you're dealing with individual sites
in your setup. So that makes life a bit easier.
As soon as you're distributing stuff, as soon as you have to go
across the network, then things actually get slower.
So then you need to communicate. And if this
is socket send and receive, this actually costs
cycles. And it costs latency because it's slow.
So in order to do this in this shared nothing architecture,
then you really want to have everything local,
like all of the operations local,
and you want to co-partition data.
Meaning if you have a classic database schema,
I mean, if we're thinking about an analytical workload, for example,
then you might use some of your dimensions in your data warehouse schema that nicely partition the data set. As I said, customers, for example,
you basically co-partition your database by certain customers and make sure
that individual customers end up with all of the data that you might need, end up completely in
these individual nodes. So this co-partitioning of data is important. As soon as you're just
randomly partitioning, then there's lots and lots of communication.
So if I'm just horizontally partitioning my data without any further information, then
many operations will basically have to cross the network, which will most likely,
just kind of like in a classic network setup, be the bottleneck. An alternative is this kind of shared-nothing architecture
with IP over InfiniBand, where rather than using
Ethernet, I'm just using
the much faster InfiniBand.
So in this case, basically, I don't need to change the code, right?
I can just use the database as is.
And unlike with classical Ethernet, any kind of cross-partition or cross-node
workload will not harm me as much, because the network is actually faster. So with this, as we saw, I can get 50 gigabytes per second using HDR, for example, or NDR, and that with a single connector, right? So a single cable, and I might have multiple of these,
so I can have fairly decent bandwidth in between the nodes,
also with lowish latency.
But if I use the classical code still,
I still have this overhead through the small messages.
And also, if I have classic networking,
I still need to go through the CPU.
So if I still have my classical IP network stack,
then my CPU needs to do all the packetization.
And the problem with this is that at a certain point, this actually
makes up a major part of the overall CPU utilization. So if I have IP networking over
InfiniBand and I have very fast InfiniBand, all of a sudden my CPU is utilized 10 to 30% just with networking.
Just basically packetizing
data and
sending it over by copying buffers, right?
So whenever we have different levels of packets,
I need to copy data into these different buffers
in order to get the packets aligned.
And so that basically doesn't necessarily fully utilize the network and it saturates
the CPU to some degree.
So we're basically losing a lot of efficiency. So an alternative to this is using
like a shared memory architecture with RDMA.
And the difference here is that we're not using the IP stack.
So we're really trying to basically keep
certain memory regions attached with RDMA.
And with this,
the NIC can do all of the transport for us.
Not all of it.
We have to issue some stuff,
some communication, but the actual data transfer,
like all of the small packetization, et cetera,
this is completely done by the network.
This can be completely done by the RDMA,
InfiniBand, or RoCE hardware. So as soon as these buffers here are established and I'm telling the
network, the NIC, to copy data back and forth here, then my CPU doesn't have to do anything
locally anymore and I don't need to deal
with this. If I need data from another node, I can basically just say okay I need to issue an RDMA
read and if I have something to do while the read is done, it doesn't really cost me anything from
the CPU perspective. I just pay some latency until this is done,
but if I have local data that I can process as well, I can start with issuing requests to the
remote memory, process the local memory while the remote memory is basically copied over,
and then I can do the local copy. As long as I'm not fully saturating the network here, and I'm not just
waiting for the data to be copied because I have some local data, the remote access
is more or less for free. Of course, I've utilized the bandwidth, but I might be able to split this up somehow smartly. So this basically,
if we fully take advantage of this, we can fully utilize the hardware, we can have low latency,
and we minimize the CPU overhead, right? So this is important. This, however, requires us to use
RDMA. So we cannot use traditional sockets.
And this means we need to change the database management system, which
we do here anyway all the time.
So in essence, this creates a shared memory architecture.
So we're basically looking at the whole memory
as one large memory from a conceptual point of view, where then we just copy back
and forth what we need locally.
And we can do this either with send and receive work requests, or we can do it with read and write
work requests.
So send and receive whenever we need to establish where the data needs to go,
so which buffers are copied where;
and read and write just in order to get what we need locally. If we're updating
something, I can just issue a write; if I need remote memory in order to read it, I can just
issue a read on the remote memory, and the remote CPU doesn't have to be involved.
So this is also called
a network-attached memory architecture, or NAM.
And here the key assumption would be
that network communication is not expensive, right?
So this is kind of an extreme point of view,
as it really depends: if a lot of your data is local and you can do
a lot of the processing locally, then networking is not expensive
anymore, because you have this high throughput, and you can use it to a much larger degree
than in classical setups.
So if we think about this from a distributed system
point of view, then often a problem is kind of we have this,
yeah, we have these coordination efforts.
So if we have lots of small messages,
means lots of small communication,
we get lots of latency increase,
but that's not so much a problem because of RDMA,
because RDMA has low latencies,
especially if we can use this asynchronously, right?
If the CPU doesn't have to wait for this.
If we have distributed data flow, as soon as we have skew, this might be a problem. And also if we have slow bandwidth or skew,
then this means that whenever we have to go through the network, this might slow us down.
But InfiniBand has the same order of magnitude of bandwidth as the memory.
And with load balancing, well, we still have to do some load balancing.
So in order to have this kind of network attached memory setup or architecture, we want to use RDMA.
We want to have locality still. If everything's distributed, everything
will need to go through the network.
And even though the network is fast,
it will be the bottleneck.
So we have, say, 50 gigabytes per second of throughput,
but that's not the same as what we have locally.
One DIMM gives us around 50 gigabytes per second of throughput
in DDR5, or a bit less, but the same order of magnitude,
right? So in a single server, you can get a couple of hundred gigabytes per second of bandwidth. And this will be
for sure the bottleneck if everything needs to go through a network. But if we can keep some locality, we're going to be fine.
And also we want, I mean, there's network everywhere, right?
On all of the nodes.
So this means we can use it on all of the nodes.
So we want to make sure that we can somehow leverage the access to all nodes.
And this also goes back to some degree to this,
where it's kind of a plug again for CXL, right?
So as soon as we can somehow disaggregate this even more,
we can make sure that we can utilize
all of the resources as well as we can, right?
So just basically say, okay, I want to make sure I'm utilizing all my CPU well,
I'm utilizing all my memory well, I also want to utilize all my network well.
And by disaggregating this, I might even be able to disaggregate the memory to some degree and
say utilize the memory from different locations. That's not in here yet.
Okay let's look at this from an experimental point of view. So if we're looking at this kind of
setup here, right, we're using RDMA to connect the memories, and we either do send and receive, where the CPUs are involved, or we do read and write,
where only one CPU would be involved and the data is directly written to memory.
If we're using classical TCP connections...
oh no, we're doing shared-nothing with two-sided RDMA here. This is already RDMA.
And we have a scale-out experiment on TPC-C. So TPC-C is the OLTP benchmark. So we have
small update workloads. It's not these very large analytical workloads, but small transactions. So we touch a couple of rows, basically, typically localized.
So it's not necessarily everything distributed.
If we have shared nothing, two-sided RDMA,
then you can see we're, let's say,
200, 250,000 transactions per second.
And you can see scaling from 1 to up to 15 nodes here.
If we're doing NAM-DB, so this kind of setup
where we use one-sided RDMA and there is no co-location,
so the data is completely distributed or randomly distributed more or less across the nodes,
then we see this nicely linearly scales with the number of nodes,
so up to four million transactions per second.
If we're co-locating,
so we're making sure that most of the accesses are actually local,
so we're co-partitioning,
but some of the accesses might go to another node, then we can see that it scales even much
better, right? So in this case, basically, I think 90% of the transactions
are local.
10% might be distributed.
So it's 10% of the accesses might actually hit the network.
Then we can see we scale up to 6 million transactions
per second, which is actually quite significant throughput. Okay, so this is kind of an overview of how we can
basically utilize RDMA in a database architecture,
but from a high level point of view, meaning that if
we're using send and receive,
we have lots, still lots of communication,
we're including the CPUs.
If we're doing writes, we get better throughput
because only a single CPU is utilized.
If we're doing local, we can get even higher.
So questions so far?
Then we'll do a quick break here,
and then we'll look into RDMA-based joins next.
So for the last part,
and it's probably not going to take all of the time,
I want to talk about RDMA-based joins,
and we're going to revisit the infamous radix join
and see how we can basically optimize this,
or how we can implement this through RDMA.
So first, some general good practices.
All right. So, in general, the cost of memory region registration
is basically proportional to the number of registered pages.
And just registering and deregistering is costly,
so we don't really want this.
We want to have a fairly stable set of registered memory pages,
because otherwise this will take a lot of time,
because this needs coordination, right?
And we also don't want to pin large parts of the memory,
because then we cannot use it for something else.
So what typically is done is that we assign a certain portion of the memory as a buffer pool for this RDMA communication. And then we need some efficient buffer management for that, in order to be able to reuse the existing registered pages.
And if we want an initial implementation of disaggregated memory, we would use exactly this. So if you remember disaggregation using CXL,
you can emulate a similar idea
using RDMA by basically
having RDMA regions on each of the servers
and making them available to the other servers.
So with this, you get the same kind of disaggregation because I can utilize this memory through
RDMA from other servers.
I can expand the amount of memory.
I can do this kind of pooling, but I need to explicitly implement this.
And again, I need to reuse the memory.
I need some buffer management to make this efficient.
And the big difference to CXL in this case
is that I don't have any coherence, right?
So all of the coherence mechanisms
I need to implement myself if I want to have this.
In order to be fast, RDMA also requires asynchronous communication.
If I do it synchronously, it will be slow, because basically I'm waiting for a couple of round trips while the communication is happening.
The NIC can do most of the work by itself, but it needs some time to do this.
And so during this time, the CPU, or your application, should
do something else.
So meaning in a database system, I want to ensure that I'm issuing my communication at
a point in time where I still have other things to do.
And I overlap my computation with the communication.
So I might stage this in one way
if I definitely have to communicate.
I might communicate some stuff
and then work with the data while communicating more stuff.
So then I can overlap communication and computation.
I don't have to wait all the time.
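A minimal sketch of this interleaving pattern, under the same assumptions as the earlier verbs snippets (the two process_* functions are hypothetical application hooks, not from the lecture):

```c
/* Overlap communication with computation: issue an RDMA read, work on
 * local data while the NIC copies, then poll for the completion. */
#include <infiniband/verbs.h>
#include <stdint.h>

/* Hypothetical application hooks, assumed to exist elsewhere. */
void process_local_partition(void);
void process_remote_partition(void *buf, uint32_t len);

void process_with_overlap(struct ibv_qp *qp, struct ibv_cq *cq,
                          struct ibv_mr *mr, void *remote_copy, uint32_t len,
                          uint64_t remote_addr, uint32_t rkey) {
    struct ibv_sge sge = { .addr = (uintptr_t)remote_copy,
                           .length = len, .lkey = mr->lkey };
    struct ibv_send_wr wr = { .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_RDMA_READ,
                              .send_flags = IBV_SEND_SIGNALED };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    struct ibv_send_wr *bad;
    ibv_post_send(qp, &wr, &bad);        /* 1. kick off the remote read */

    process_local_partition();           /* 2. useful local work meanwhile */

    struct ibv_wc wc;                    /* 3. only now wait for the read */
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    process_remote_partition(remote_copy, len);
}
```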
And of course, accessing remote memory
is slower than local memory.
So we're talking about order, like in the order
of a microsecond here.
So we saw this in the latencies earlier.
So it might be sub-microsecond, still in the nanoseconds, but this is the kind of
order that we're talking about.
It's slower than CXL; with CXL we're talking in the hundreds of nanoseconds.
Here we're in the order of a microsecond, at least in the numbers that I know of.
If we're in local memory, we're in the tens of nanoseconds,
or 100 nanoseconds, something like this.
So again, an order of magnitude faster than what RDMA
is capable of doing.
So this means it's definitely slower.
And hopefully, locally, we also have our caches.
This is not cache coherent.
So this is not in our caching domain.
So this needs explicit work here.
So we need to hide the network latency
by interleaving computation and communication.
And this essentially gives us these kinds of NUMA effects.
Because we're creating these buffers, these buffers are somewhere in memory.
And only the threads that are in the same NUMA region will
have this local access, right, so they will not exhibit any NUMA effects there.
And in general this might get even more complicated if I have a very
complicated setup in my machine. So I might have my
NIC connected to one
memory controller,
but if I register the memory
on another socket,
then I still need to go through the
internal fabric, and this might
eventually become more of a bottleneck,
or at least the communication back and
forth might become a bottleneck, because I might have the thread in the same place as the NIC but
the memory somewhere else. So I need to do both passes, which basically adds some latency or
decreases the bandwidth. And if we're talking about 50 gigabytes per second, then eventually
the UPI or whatever interconnect, at least with some other
cross communication going on, might all of a sudden become the bottleneck. So make sure that the buffers that you're
using are actually local to your threads, so you don't get into any kind of problems with NUMA.
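A minimal sketch of keeping the RDMA buffer on the NIC's socket, assuming libnuma and that the NIC's NUMA node has been looked up beforehand (on Linux it can typically be read from /sys/class/infiniband/<device>/device/numa_node); the function name is my own:

```c
/* Allocate the RDMA buffer on the NUMA node the NIC is attached to,
 * then register it; link with -lnuma and -libverbs. */
#include <infiniband/verbs.h>
#include <numa.h>
#include <stddef.h>

struct ibv_mr *register_local_to_nic(struct ibv_pd *pd, size_t size, int nic_numa_node) {
    /* Place the pages on the NIC's socket so neither the NIC nor the
     * worker threads pinned there have to cross the inter-socket fabric. */
    void *buf = numa_alloc_onnode(size, nic_numa_node);
    return ibv_reg_mr(pd, buf, size,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```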
So, I mean, again, if we want to have the best performance, we just really need to be aware of the overall hardware topology that we're dealing with. So if we have many cores, if we have many sockets,
we have different kind of architecture,
then we need to deal with these things.
OK, so let's take a step back.
So who remembers the radix join?
What does the radix join look like?
You were rolling your eyes when I said radix join.
Not again a radix join?
No, I misinterpreted that.
So what's a radix join? It's a join. First step.
I don't remember it completely correctly, but
I think it was similar to a hash join
where we join based on a certain
number of bits. So that's already
the first keyword that I wanted to hear.
It's a hash join. Yes, it's not only
similar to a hash join. It is a form of a hash
join. Yes. So that's
already very good.
And,
at least typically, it's a hash join.
And we're trying to partition the data in this join.
It's a partitioned join.
So rather than having like just a single hash table,
we're splitting our hash table up in individual partitions.
And we do this by what?
Do you remember?
How do we build these partitions?
Based on the certain number of bits from the hash?
Exactly.
Yes, so we're hashing the values, and then we're looking at the bits.
Typically, most significant bits could also be least significant bits,
depends on the implementation.
And with these bits, we can directly influence
or very clearly influence the number of partitions
because basically every bit gives us two partitions.
So n bits gives us two to the power of n partitions. And with this,
we can now say exactly, okay, this is the number of partitions that we want.
If the hash function were perfect, which it's not going to be,
then with the hash function we would get a neat distribution of hash values.
With the radix bits, we would then get a perfect distribution across these individual hash tables or individual partitions.
And then we would have a perfect way to just simply distributing data across multiple threads, across multiple machines, et cetera.
So this is the basic idea, so that the Radix partitioning
makes it easy for us to partition data
and to control the number of partitions.
And with the hash function, we hope
that we have a very nice and clean and smooth partition
because that's not necessarily always the case. The hash function is never perfect, right, so
we never get a completely uniform distribution; we hope to get something in that
order, but because of that we don't know exactly how well it partitions. So we'll do a
histogram first. We're first going to check how this actually distributes, so
how small are my partitions going to be. First I know how many items I have, then I apply
the hash function, and then I can check basically
how many radix bits I need to actually get partitions of a size that would be good
for me. Typically, I would align this with cache sizes, right, or cache line sizes,
or memory pages, et cetera, to just make sure that on each level I get
the best performance that I can have and the cache efficiency is good.
And the same thing I, of course, will also want to have in RDMA, right?
So I basically compute the histogram to figure out how does my data distribute.
Then I do the radix partitioning based on the bits. So again, I'm basically
computing a hash key and then I take a certain number of bits to identify the individual partitions.
So that certain number of bits basically says: this is the hash table that this value goes into.
All of the hash values that start
with the same bits basically go into the same hash table.
And then within the hash table, the rest of the hash key
will basically identify where it goes in the hash table.
Because of that, it also makes sense in many cases to start with the most significant
bits, because I'd rather have some partitions running empty than have everything within my partition
going into the same bucket in the hash table, right? So I want to have a good hash table, and I can deal with some imbalance across the partitions maybe more easily.
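To make the radix idea concrete, here is a minimal sketch (my own illustration, with an arbitrary mix hash and 10 radix bits) of computing the partition number from the most significant hash bits and building the histogram first:

```c
/* Radix histogram sketch: hash each key and use the top RADIX_BITS bits
 * of the hash as the partition number, counting partition sizes first. */
#include <stdint.h>
#include <stddef.h>

#define RADIX_BITS 10                        /* 2^10 = 1024 partitions */
#define NUM_PARTITIONS (1u << RADIX_BITS)

static uint64_t hash64(uint64_t k) {         /* simple mix hash (illustrative) */
    k ^= k >> 33; k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33; k *= 0xc4ceb9fe1a85ec53ULL;
    return k ^ (k >> 33);
}

static inline uint32_t partition_of(uint64_t key) {
    return (uint32_t)(hash64(key) >> (64 - RADIX_BITS));  /* most significant bits */
}

void build_histogram(const uint64_t *keys, size_t n, size_t hist[NUM_PARTITIONS]) {
    for (size_t p = 0; p < NUM_PARTITIONS; p++) hist[p] = 0;
    for (size_t i = 0; i < n; i++)
        hist[partition_of(keys[i])]++;       /* partition sizes before moving data */
}
```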
Okay, so let's look at the...
And then within each partition, I mean, just to finish it up,
within each partition, I just do the classical basic hash join.
So for both sides, I have partitioned both sides,
both tables in the same way.
With one table, I'm gonna do a build.
I build a hash table with the other table,
I'm gonna probe the same hash table.
Okay, so if I'm doing this,
like an RDMA version of the Radix join,
I basically start with the histogram computation phase. So the histogram is really just to
identify how well this will distribute, right? How many bits do I need, and how do I then partition in the end?
So within a machine, I basically assume the data
is distributed across all machines, right?
So this is the assumption.
Otherwise, I first, of course, need to distribute the data across the machines.
But initially, it's not partitioned in the way that we need it.
So say I have a horizontal partitioning,
but not on the join key.
This would be a typical way when I want to do this.
Then within each of the nodes,
I do a histogram with all of the threads.
Then we exchange the histograms into a machine level histogram
in order to have an overview of what is the data
that I have on a single node.
And these then can be exchanged.
So in order to have a global histogram,
in order to get an idea of the partitioning sizes.
Meaning if I use these many bits,
what will the partitions look like?
So if I have say 10 bits, meaning a thousand partitions,
how large will these partitions be?
Like each of these partitions,
this will be my histogram.
And based on this, then I also can assign the partitions to individual machines, right? So
then I can basically say how large the buffers need to be and where I want to copy
data back and forth. So with this information, I can already set up the buffers across the network.
It basically informs how the data distribution
then needs to be done.
Then in the partitioning phase,
I have two partitioning passes.
I have a local partitioning pass
where I want to make sure that the partitions fit into caches.
But first, I have the networking partition pass,
where I make sure that each partition or the partitions
go to the right node.
So this is basically where I partition the data such
that for each Radix partition,
I have the data in there.
It's more or less like a shuffle phase.
And so in the network partitioning pass,
I want to have a pool of RDMA buffers.
I have multiple buffers, and with this I
can basically fill one buffer locally and send it while I'm filling the next buffer. So this
also gives me this kind of parallelism, and I can of course also have multiple queues. Using this, I
can make sure that the NIC gets multiple things to do, I can
send to multiple nodes at the same time, et cetera, to just make sure that I fully
saturate the network and I fully saturate the NIC. And the buffers are private, so
we don't need to do any synchronization. Then we have the build and probe.
And locally, we additionally partition so that things fit into caches.
That's already what I said.
After the partitioning, all of the partitions that belong together,
so with the same partitioning key, should be on the same node.
So then basically all of the data is local and we can do the complete join locally as
we would have done earlier.
So we're also going to split this up into cache-sized chunks, so then it's even going to be
thread local.
The individual partitions are so small
that we want to have this in the individual threads,
at least in the build phase.
So we want to be able to have the hash tables ideally
cache resident.
And then we can just probe with the also potentially larger probe side,
but still without these random memory accesses
that we would need if we have a memory resident hash table
or even network resident hash table.
Yeah, and then the matching results
go either into a local buffer, or we can also write them
to an RDMA buffer in order to have one large output.
And finally, either we reuse this or we keep it local;
depending on whether we want to further continue locally or whether we want to have one large output,
we might send the result out again through the network.
And there's a nice paper on this;
there are more details, and also
the implementation is available somewhere.
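Conceptually, the buffer-pool rotation in that network partitioning pass might look roughly like this (my own sketch, not the paper's code; completion handling before buffer reuse is only hinted at):

```c
/* Fill one registered buffer per target node, ship it asynchronously, and
 * keep filling the next buffer from the pool while the NIC transmits. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define POOL_SIZE 4
#define BUF_BYTES (1 << 20)

struct partition_sender {
    struct ibv_qp *qp;
    struct ibv_mr *mr;                 /* covers all pool buffers */
    char          *pool[POOL_SIZE];    /* registered, pinned buffers */
    size_t         fill;               /* bytes used in the current buffer */
    int            cur;                /* index of the buffer being filled */
};

static void flush_buffer(struct partition_sender *s) {
    struct ibv_sge sge = { .addr = (uintptr_t)s->pool[s->cur],
                           .length = (uint32_t)s->fill, .lkey = s->mr->lkey };
    struct ibv_send_wr wr = { .wr_id = (uint64_t)s->cur, .sg_list = &sge,
                              .num_sge = 1, .opcode = IBV_WR_SEND,
                              .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad;
    ibv_post_send(s->qp, &wr, &bad);          /* NIC ships it asynchronously */
    s->cur  = (s->cur + 1) % POOL_SIZE;       /* rotate; CPU keeps partitioning */
    s->fill = 0;
    /* A real implementation would check the completion queue before
     * reusing a buffer whose previous send may still be in flight. */
}

void append_tuple(struct partition_sender *s, const void *tuple, size_t len) {
    if (s->fill + len > BUF_BYTES)
        flush_buffer(s);
    memcpy(s->pool[s->cur] + s->fill, tuple, len);
    s->fill += len;
}
```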
And if we look at this
from a performance point of view,
then we are using an FDR or QDR network.
So if you remember,
let me check so I don't lie here:
FDR, I think, was the 54 gigabit,
QDR was the 32 gigabit network.
Yeah, exactly, QDR 32, FDR 54 gigabit per second. So if you look at this, then you can see the message sizes, this is in bytes. We need to be in kilobyte-sized messages in order to fully utilize the network and not basically
be in just a message creation mode, et cetera,
in the algorithm all the time.
And so in the kilobyte message size
here, or hundreds of bytes to kilobytes,
this is where we fully saturate the network. So here you see something which would be 48 gigabits, round about.
So we're close to what the 54 gigabit FDR network can actually do.
So with this, we're actually saturating the network nicely.
Same here, right, for the QDR.
So FDR and QDR network.
But this, of course, means that we need to chunk this properly.
So looking at a bit more breakdowns,
if we use TCP with IP over InfiniBand,
then we can see, yeah, I mean,
basically, like the network partitioning
is what really costs us, right?
The local partitioning is always the same amount of work.
In any kind of implementation, this is all local.
We don't need any networking.
And here, we have four machines with 32 CPUs and two times
two billion tuples basically being joined. We can see that the histogram computation
is actually fairly little work, right? We just need to go locally through this.
We need to communicate the histograms, not a lot of data. So that's actually cheap.
The network partitioning, that's the expensive part.
Local partitioning is not that much, but it takes some time.
And then the actual build and probe phase,
that's actually very little.
So that's the smallest part, except for the histogram
computation that needs to be done.
So that's the actual join, right?
That's all local, all cache resident,
and that's why this is super fast.
If we use non-interleaved communication,
but InfiniBand, then basically we can already scale this down
by a factor of 3, round about, in terms of just the
network partitioning. But you see, these parts are constant, right? So
that's why we get a significant performance improvement, but we don't get the
full speedup that we would expect from the InfiniBand network, because this part is just static.
This is something where the network doesn't help us.
If we interleave network
communication with computation, we can get another
20% improvement. So the network and
local partitioning, in the end, are still
the most expensive part. This is basically this shuffle phase, you can say.
So we're basically partitioning the data, sorting the data you could say, in a way such that everything can be easily computed locally.
Looking at this from a scale-out perspective,
so going up to 10 nodes.
So here, you remember we were here.
From two nodes, we're scaling nicely to four nodes.
And it's not a factor of two, but it's still a lot.
But as you can see, this part just stays
more or less constant.
Well, actually that's not quite true, right?
The local parts actually should decrease
because we get more aggregate throughput,
but the communication does not decrease as much, because we still
need to communicate across all nodes, right? If we have more nodes, we need more communication; we need to send more around.
So this doesn't scale perfectly.
It doesn't scale linearly.
So when from going from two to 10 nodes,
we only get a factor of close to three
because the network partitioning phase actually still
has a significant part.
While the other parts are decreasing
with the number of nodes, it's also not perfectly decreasing.
But also, the smaller or the more nodes we have,
the larger the fraction of data that needs to be sent around.
Eventually, almost all of the data locally
needs to be sent somewhere else.
So if I only have two nodes, I mean, worst case,
or not, average case, let's say I have two nodes
in the average case, I'm going to send around half of the data.
If I have three nodes, I'm going to send around two-thirds
of the data. If I have ten nodes, I'm going to send around nine-tenths
of the data. Eventually, I'm basically
always sending around all of the data because of this partitioning, because of the randomness
of the partitioning.
And this means like some additional congestion across the network also.
Okay, but still we can get some improvement. But you can see it doesn't really pay off to do this at
very large scale eventually, because the speedups are not getting
very significant anymore. So if we go to 100 nodes, we're probably not going to see a factor of 30. But I don't know. One would have to try.
Okay, with this, that's actually the end of networking today. So last time I told you a bit
about the parallel database management systems. We came back to this to some degree with this NAM-DB setup, comparing it to a shared-nothing setup.
And we talked a bit about parallel programming and then the fast networks
with low latencies, which are the basis for doing RDMA types of communication. And today we really talked about RDMA, so how RDMA works internally
and, on a high level, how you can use it in database setups, right? So basically as this kind of
remote memory or shared memory setup, like some kind of disaggregation of memory from compute.
And then even with a specialized join,
so which parts of a radix join I would put where,
and where I would use
the distributed communication in which way.
With that, questions so far?
No? Wasn't that hard, right?
So next time it's not going to be
GPU. I didn't fix
that, but it's going to be FPGA.
So next time
we're going to talk about
FPGAs and
we'll do this
tomorrow and the week
after tomorrow.
And then that's all.
So thank you so much and see you tomorrow.