Disseminate: The Computer Science Research Podcast - Matthias Jasny | P4DB - The Case for In-Network OLTP | #10
Episode Date: August 8, 2022

Summary: In this episode Matthias Jasny from TU Darmstadt talks about P4DB, a database that uses a programmable switch to accelerate OLTP workloads. The main idea of P4DB is that it implements a transaction processing engine on top of a P4-programmable switch. The switch can thus act as an accelerator in the network, especially when it is used to store and process hot (contended) tuples on the switch. P4DB provides significant benefits compared to traditional DBMS architectures and can achieve a speedup of up to 8x.

Questions:
0:55: Can you set the scene for your research and describe the motivation behind P4DB?
1:42: Can you describe to listeners who may not be familiar with them, what exactly is a programmable switch?
3:55: What are the characteristics of OLTP workloads that make them a good fit for programmable switches?
5:33: Can you elaborate on the key idea of P4DB?
6:46: How do you go about mapping the execution of transactions to the architecture of a programmable switch?
10:13: Can you walk us through the lifecycle of a switch transaction?
11:04: How does P4DB determine the optimal tuple placement on the switch?
12:16: Is this allocation static or is it dynamic, can the tuple order be changed at runtime?
12:55: What happens if a transaction needs to access tuples in a different order than that laid out on the switch?
14:11: Obviously you can't fit all data on the switch, only the hot data, how does P4DB execute transactions that access some hot and some cold data that's not on the switch?
16:04: How did you evaluate P4DB? What are the results?
18:28: What was the magnitude of the speed up in the scenarios in which P4DB showed performance gains?
19:29: Are there any situations in which P4DB performs non-optimally and what are the workload characteristics of these situations?
20:36: How many tuples can you get on a switch?
21:23: Where do you see your results being useful? Who will find them the most relevant?
21:57: Across your time working on P4DB, what are the most interesting, perhaps unexpected, lessons that you learned?
22:39: That leads me into my next question, what were the things you tried while working on P4DB that failed? Can you give any words of advice to people who might work with programmable switches in the future?
23:24: What do you have planned for future research?
24:24: Is P4DB publicly available?
24:53: What attracted you to this research area?
25:42: What's the one key thing you want listeners to take away from your research and your work on P4DB?

Links: Paper, Presentation, Website, Email, Google Scholar, P4DB

Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate, the computer science research podcast.
I'm your host, Jack Wardby.
This is episode 10 and the final episode of our SIGMOD 2022 series.
I'm delighted to say I'm joined today by Matthias Jasny, who will be talking about his paper P4DB, the case for in-network OLTP.
Matthias is a PhD student at the Technical University of Darmstadt and his research focuses primarily on scalable data management and programmable networks. Matthias, thanks for
joining us on the show. Hi Jack, thanks for having me. Let's dive straight in. Can you set the scene
for your research and describe the motivation behind P4DB? Yes, so my research mainly builds
on the observation that database development always lags a bit behind what is currently available in hardware, for example, high-speed networks.
And in my work, especially with the focus on programmable switches, there has been only limited work on how to use these programmable switches in databases, mainly by offloading some components into the network,
for example, the lock manager or a key-value store for caching some values.
But at the time when I started my research, I couldn't find anything
that did full transaction processing on the switch.
So I thought back then that this would be a nice challenge to tackle.
Awesome. So can you describe to the listeners who may not be familiar with them,
what exactly is a programmable switch?
Yeah, maybe I should start by quickly saying what a switch in general is.
So a switch sits in the network and is connected to database nodes,
or nodes in general,
and its task is mainly to route packets between the nodes.
And normal switches do this by looking into the packet contents
and seeing what the destination is and out of which port the packet needs to go.
And these switches, still talking about normal switches,
can also have firewall functions
and only allow packets of a specific VLAN or something else to go to some node.
And they can also drop packets if it's some malicious attack or so.
And these normal switches are for a user kind of a black box,
and they only support a fixed set of protocols.
For example, TCP/IP, UDP/IP, VLAN tags, Ethernet, and so on.
And since these programmable switches came to the market,
they gave the users the opportunity to develop their own protocols.
So you basically define what the data layout in the packets is:
what fields are in the packet headers,
how they are combined together,
how they should be interpreted and matched, and so on.
This architecture is realized in a switch ASIC.
And when you want to reconfigure it, you don't need to buy a new ASIC.
You can basically flash a new firmware, and then your upgraded protocol is running.
And these programmable switches are quite flexible. And as of right now,
they are on par with normal switches,
and some are even better.
And you can get them for the same price, basically.
And this is also one interesting point
because in the next year,
23% of all Ethernet switches will be programmable.
And some users might not even know that the network switch
inside the data center network is programmable.
So basically, you get some computing platform for free inside your network.
So bringing it back to your research,
what are the characteristics of OLTP workloads
that make them a good fit for these programmable switches?
Yes, so in the database world, you can distinguish between two major types of workloads,
as many of you might know: OLAP, online analytical processing, and OLTP, online transaction processing.
In OLAP, the transactions are long-running, mainly joins between warehouses, for example,
and OLTP transactions have the characteristic that they are short and they just access a few records.
And in OLTP, it's also very common that workloads have skew,
where only a few tuples are touched by a majority of the transactions.
And this creates many challenges, because data access needs to be regulated
for a limited set of resources,
in this case, tuples.
Think of an online shop, for example,
with popular items
where many users want to purchase
some new book or DVD.
And so our idea was to look at OLTP
because of these attributes.
And we thought that they map quite well to the switch model.
For example, on the switch, the memory is a bit limited, and the accesses to the memory are also constrained by the switch architecture, because it needs to route the packets very fast. And we saw that this pattern is very similar to OLTP,
and then had the idea to do this OLTP processing on the switch.
I know you touched on it a bit there,
but can you elaborate on what the key idea behind P4DB is?
Yes, so the key idea of my research was to take the hot tuples of our workload
and put them onto the switch and store them in the switch's memory.
And then also let the switch execute full transactions
on the stored data.
So when you now move the data to the switch,
you get a lower access latency, because the network path is halved:
you don't need to go through the switch to some other node,
you just need to go to the switch.
And the transactions are processed by the switch
in a pipelined and lock-free manner.
So you don't need to worry
about any concurrency control
and you have certain performance guarantees
because you don't need to acquire any locks.
And when you think of the bandwidth the switch has,
for example, with 40 input ports each at 25G,
you can get to a throughput of around 1.5 billion transactions per second,
which is very high when you just compare this to the clock speed of CPUs.
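To make that figure concrete, here is the back-of-the-envelope arithmetic as a small Python sketch (the port numbers come from what was just said; the minimum-size-packet assumption is mine, for illustration):

```python
# Back-of-the-envelope switch throughput estimate (illustrative assumptions).
PORTS = 40                 # input ports, as mentioned above
GBPS_PER_PORT = 25         # 25 Gbit/s per port
MIN_FRAME_BYTES = 64 + 20  # minimum Ethernet frame plus preamble and gap

total_bits_per_sec = PORTS * GBPS_PER_PORT * 1e9
packets_per_sec = total_bits_per_sec / (MIN_FRAME_BYTES * 8)
print(f"{packets_per_sec / 1e9:.2f} billion packets/s")  # ~1.49 billion
```

If every packet carries one switch transaction, the packet rate translates directly into the transaction rate, which is where the roughly 1.5 billion figure comes from.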
Nice. So how do you go about mapping the execution of transactions to the architecture of a programmable switch?
Yeah, so when I started with the research, I had quite a few iterations of how to do this.
But let's first talk about how the switch works internally.
So as hinted previously, the switch uses a pipeline architecture,
and this is similar to a water pipe.
Packets come in at an input port, then flow through the pipe, and then go out at the output port.
And the switch works similarly,
but at the beginning you have a parser, which takes the byte stream, so the zeros and ones from the wire, and interprets them, or so-called parses them, into meaningful header instances, which can then be used for further processing.
So these header instances can be, for example, the IP header or the TCP header or the UDP header.
And then the packet moves further through the pipeline,
and the innermost part of the pipeline consists
of so-called MAU stages.
These are match-action units that are chained together
to form a chain of multiple stages.
And they allow us to execute different actions
based on the packet contents.
So, for example, we want to route a packet to some node.
Then in one MAU stage, we look at the destination IP address,
do a lookup on which output port it is, and then set the output port for the packet.
And then after the packet passes through the stages,
it goes to the deparser where the packet is then reassembled to a byte stream
and sent out to the wire.
And this pipeline mechanism works in a way
that you only have one packet in each stage
and they progress further on each clock cycle.
And when you think of a clock speed of around 2 GHz,
you can get to the routing speed of 1.5 billion packets per second.
This is a very important aspect, because on CPUs
you can have as many random accesses as you want,
but on the switch, each packet can only access the resources
in the stage it is currently in.
And in the next clock cycle, it moves to the following stage and cannot access the resources
of the previous stage anymore.
It can only access the stage local resources.
So this kind of programming model is different from CPUs, and when you design switch programs, you need to keep it in mind.
And by having the switch pipeline, we can actually get full ACID guarantees.
Atomicity, consistency, and isolation
are basically given to us for free by the pipeline.
We only have one packet in each stage, and accesses always succeed.
And when you think of the packets in line,
with each clock cycle they always consistently see the changes
of the previous packets in the next stage.
Then there is durability, which we handle on the nodes by efficiently logging
operations in a write-ahead log, basically.
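To give a feel for this execution model, here is a minimal Python simulation of a pipeline with stage-local registers (an illustrative sketch, not the paper's implementation; the stage logic and field names are invented):

```python
# One packet per stage per clock; each stage owns local registers that
# only it can touch. Later packets consistently see earlier writes.

class Stage:
    def __init__(self, name):
        self.name = name
        self.registers = {}                  # stage-local memory

    def process(self, pkt):
        key = pkt.get(self.name)             # does the packet use this stage?
        if key is not None:
            self.registers[key] = self.registers.get(key, 0) + pkt["delta"]
            pkt["result_" + self.name] = self.registers[key]

def run_pipeline(stages, packets):
    slots = [None] * len(stages)             # slots[i]: packet inside stage i
    pending, done = list(packets), []
    while pending or any(p is not None for p in slots):
        if slots[-1] is not None:
            done.append(slots[-1])           # packet exits to the deparser
        slots = [None] + slots[:-1]          # all packets advance one stage
        if pending:
            slots[0] = pending.pop(0)        # next packet enters from parser
        for stage, pkt in zip(stages, slots):
            if pkt is not None:
                stage.process(pkt)           # one packet per stage per clock
    return done

stages = [Stage("s0"), Stage("s1")]
pkts = [{"s0": "hot", "delta": 1}, {"s0": "hot", "delta": 2}]
print(run_pipeline(stages, pkts))            # the second packet reads 3
```

Because a packet can only occupy a stage after its predecessor has left it, every transaction sees earlier writes in full or not at all, which is exactly the atomicity and isolation for free described above.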
Nice. Could you maybe walk us through the lifecycle of one of
these switch transactions? Yes, so the user
sends out a special packet, which is our switch transaction, and
this packet is then executed by our transaction engine on the switch.
In the switch transaction, the user can encode different instructions,
which mimic the operations a normal transaction does.
And when it passes through the switch pipeline, it's executed from top to bottom
and can modify the tuples stored on the switch as well. So after the execution, when all instructions
of the switch transaction have been successfully executed and the results have been written into this packet, the packet is then routed back to the sender.
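As an illustration of what such a special packet might carry, here is a hedged sketch of encoding a switch transaction as a list of fixed-size instructions (the opcodes and field layout are invented for illustration; the actual header format is defined in the paper and the P4DB code):

```python
import struct

# Invented opcodes and layout, purely to illustrate the idea.
OP_READ, OP_WRITE, OP_ADD = 0, 1, 2

def encode_switch_txn(txn_id, instructions):
    """Header: txn id + instruction count, then one fixed-size slot per
    instruction (opcode, tuple id, operand), all big-endian."""
    buf = struct.pack("!IB", txn_id, len(instructions))
    for opcode, tuple_id, operand in instructions:
        buf += struct.pack("!BIq", opcode, tuple_id, operand)
    return buf

# Read tuple 7, then add 100 to tuple 42: one packet, two instructions.
wire = encode_switch_txn(1, [(OP_READ, 7, 0), (OP_ADD, 42, 100)])
print(len(wire), "bytes")  # 5 + 2 * 13 = 31 bytes on the wire
```

The switch's parser would interpret these fixed-size slots as header fields, execute them stage by stage from top to bottom, overwrite the operand fields with results, and return the same packet to the sender.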
How does P4DB determine which tuples to place on the switch?
Yes, so this tuple placement is very important
because when the packet flows through the pipeline,
its accesses need to follow the pipeline
because if this is not the case,
it needs to do multiple passes through the pipeline
and this is, of course, a bit costly.
So to optimize this data layout, we model the accesses of our transactions as a graph.
We define the tuples as nodes, and each access a transaction makes, we define as a
directed edge in our graph. So when tuples are accessed very frequently, the edge weights are high.
And when we now have the graph, we can partition it using a maximum cut graph algorithm and
basically cut the graph on the edges with the highest weights.
And then we get partitions of tuples.
And when we then order these partitions topologically, using the directed edges,
we get the optimal data placement of our tuples
into the different stages of the switch pipeline.
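To make the placement idea concrete, here is a toy Python version (my own simplified stand-in: it weights directed edges by access frequency and uses a greedy ordering instead of the paper's actual partitioning algorithm):

```python
from collections import defaultdict

# Toy placement: weight directed edges by access frequency, then order
# tuples so that most accesses flow "forward" through the pipeline stages.

def build_access_graph(transactions):
    weights = defaultdict(int)
    for txn in transactions:              # txn = ordered list of tuple ids
        for a, b in zip(txn, txn[1:]):
            weights[(a, b)] += 1          # directed edge, weight = frequency
    return weights

def greedy_stage_order(weights):
    # Tuples that are mostly accessed before others get positive scores
    # and land in earlier stages (a greedy stand-in for the real algorithm).
    score = defaultdict(int)
    for (a, b), w in weights.items():
        score[a] += w                     # outgoing accesses push earlier
        score[b] -= w                     # incoming accesses push later
    return sorted(score, key=score.get, reverse=True)

txns = [["A", "B", "C"], ["A", "C"], ["B", "C"], ["A", "B"]]
print(greedy_stage_order(build_access_graph(txns)))  # ['A', 'B', 'C']
```

A transaction whose accesses follow this order can finish in a single pipeline pass; any backward access forces another pass, which is exactly what the weighting tries to minimize.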
Is the allocation static or is it dynamic?
Can the tuple order be changed at runtime?
For our paper, we had it static to better compare against other baselines.
But you can think of management transactions which take one tuple and write it to another location.
So you can have dynamic layouts and also adaptive layouts.
For example, when workloads are shifting and the hot items, the very frequently accessed items in our shopping system, move,
then we can also offload tuples from the switch and load other tuples onto it.
This is possible in the design.
What happens if a transaction needs to access tuples in a different order than that laid out on the switch?
So the data layout algorithm optimizes the placement, but it cannot always be optimal.
There can always be transactions that access tuples in a different order.
So we handle this by allowing the switch transaction to pass multiple times through a switch pipeline.
And this is done by sending the switch transaction to a special port on the switch, which loops
back to some input port.
And this allows a switch transaction
to make multiple passes through the switch pipeline.
But when you think about it,
this also violates some of the criteria
which our switch pipeline guaranteed at the beginning.
You can have inconsistent states,
some switch transactions might read intermediate updates
from other switch transactions and so on.
So to prevent this, we add a locking mechanism
at the beginning of our switch pipeline,
which prevents the switch from executing other transactions
while one multi-pass transaction is running.
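A minimal sketch of that front-of-pipeline lock (my own illustration, assuming a single lock register checked before admission):

```python
# A single lock register at the start of the pipeline: while a multi-pass
# transaction holds it, other transactions are not admitted.

lock_holder = None

def admit(txn_id, multipass):
    global lock_holder
    if multipass:
        if lock_holder is None:
            lock_holder = txn_id          # acquire on the first pass
        return lock_holder == txn_id      # only the holder may re-enter
    return lock_holder is None            # single-pass txns wait if locked

def release(txn_id):
    global lock_holder
    if lock_holder == txn_id:
        lock_holder = None                # free again after the final pass

print(admit(1, multipass=True))    # True: txn 1 takes the lock
print(admit(2, multipass=False))   # False: deferred while txn 1 runs
release(1)
print(admit(2, multipass=False))   # True: the lock is released
```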
Obviously, you can't fit all of the data on the switch, right?
Only the hot data.
So how does P4DB execute transactions that need to access some hot data
and some cold data that's not on the switch?
Yes, so we also thought of this case in our paper,
and we gave these transactions a special name,
so-called warm transactions,
because they access hot and cold data at the same time.
And we integrated these warm transactions
into the two-phase commit protocol,
which is used in distributed database systems.
So since these switch transactions do not abort and are always executed lock-free on the switch,
we have some constraints.
So how can we now execute these transactions that access both hot and cold data?
First, we need to ensure that once we send out the hot transaction to the switch, the
whole transaction cannot abort anymore due to the cold part.
So we do this by obtaining locks for the cold parts and waiting until it's in a kind of
pre-commit state.
Now we know that the cold part cannot abort anymore.
And then we send out a switch transaction
and receive the results.
We can do some further computation on the results
and then fully commit the cold part
and the whole transaction.
And this scheme is needed because, as I said,
these hot transactions cannot abort
and also cannot be rolled back on the switch.
In case a transaction wants to access hot data and cold data,
we can also temporarily offload these cold tuples
into some dedicated memory of the switch
and execute it as if it were a hot transaction.
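Sketched in Python, the warm-transaction flow described above might look like this (illustrative only: the node and switch interfaces are invented, and the real system integrates this into its commit protocol):

```python
# Warm-transaction flow: the cold part is locked and pre-committed first,
# so the abort-free hot part on the switch never has to be rolled back.

def run_warm_transaction(txn, node, switch):
    locks = node.lock(txn.cold_keys)            # 1. lock all cold tuples
    try:
        node.prepare(txn.cold_part)             # 2. pre-commit: the cold part
                                                #    can no longer abort
        result = switch.execute(txn.hot_part)   # 3. hot part runs lock-free
                                                #    on the switch, no aborts
        node.apply(txn.cold_part, result)       # 4. finish the cold writes
                                                #    using the switch results
        node.commit(txn)                        # 5. commit the whole txn
        return result
    finally:
        node.unlock(locks)                      # always release the locks
```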
How did you go about evaluating P4DB
and what were the key results?
For our evaluation,
we used a database system
based on two-phase locking,
with all the transaction execution modules,
which was running without a switch,
and compared the same database
with an active switch.
So a passive switch, where the switch is just routing packets through the network, against
an active switch, where it's effectively executing these hot transactions inside the network.
And to show the benefits P4DB can provide, we implemented three OLTP workloads. YCSB, which, as everyone knows, is a key-value store workload;
to simulate transactions, we grouped together eight operations.
And SmallBank, which is a banking application.
And then we also implemented TPCC.
And for YCSB and SmallBank, these can be fully implemented as hot transactions
on the switch, because they are not that complex and all operations can be done in one pipeline pass.
And for these workloads, we saw in our evaluation significant improvements,
especially when the skewness factor was very high. And this was due to the pipelined and lock-free execution model,
because you don't need to coordinate the accesses,
and the transactions are executed as fast
as the packets are routed through the switch.
For TPCC, we needed to rely on the techniques
of warm transactions because the new order
and payment transactions in TPCC contain some table inserts and string lookups.
So for these, we only executed the hot part,
which caused the most contention, on the switch,
and then let the cold part be executed normally
on the nodes.
So for TPCC, we also saw speedups.
They were not as high as for YCSB and SmallBank,
but they were significant speedups too.
And for TPCC, this was basically limited by the cold subset,
because this is still the major part where the nodes synchronize
and a lot of time goes to waste.
Well, not wasted exactly, but unused.
What was the magnitude of the speedup in the scenarios in which P4DB showed performance gains?
For YCSB and SmallBank, for a certain skew factor, say 80% of all accesses going to 20% of the tuples,
you can see speedups of up to 8x.
And for TPCC, the speedups were around 1.5x to 2x.
I want to highlight one very interesting fact.
So for the switch, it does not matter how the workload is skewed.
It does not matter if all switch transactions access one tuple or multiple tuples
because they always take the same amount of time since it's clocked by the pipeline.
And this is very interesting when you look at the graphs
because for different write-read ratios in the transactional workloads,
the throughput is exactly the same.
Are there any situations in which P4DB's performance is non-optimal?
And what are the workload characteristics of these situations?
Yes, there are workloads with certain characteristics.
For example, we cannot support scans or similarly complex operations in P4DB due to the limited hardware capabilities.
Another obvious aspect is if the workload cannot be partitioned into a hot and cold part.
So if the workload is very uniform, the switch cannot do as much to accelerate the workload
as it would be able to when we have a very distinctive hot
portion in our data.
This is due to the fact that when less data access goes to the switch, it basically
cannot accelerate the workload as effectively.
But with our system, this is not a disadvantage.
We also have a microbenchmark on this, where we increase the hot set, and there P4DB performs asymptotically as if there were no switch in the system.
So you don't have any drawback when you use P4DB and the workload doesn't match.
How many tuples can you actually get on one of these switches?
This also depends on the switch model, but with the switch we use,
we could store around 800,000 64-bit tuples. It also depends on how many tables
you want to store on the switch. If you have some read-only
data, you could also replicate it, so that you store table A in the first pipeline stages and
table A again in the last pipeline stages, to optimize for different data access orders.
So this also factors into how much data you can store on the switch, but the rough
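As a rough sanity check on that number, here is a quick back-of-the-envelope calculation (illustrative only, not from the paper):

```python
# Rough storage arithmetic for the quoted figure (illustrative).
tuples, tuple_bytes = 800_000, 8         # 64-bit tuples
print(tuples * tuple_bytes / 1e6, "MB")  # 6.4 MB of on-switch register memory
# Replicating a read-only table in both early and late stages would halve
# the number of distinct tuples that fit: capacity traded for access order.
```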
ballpark is, as I said, around 800,000.
Where do you see your results being the most useful?
Who do you think will find the results the most relevant?
Yes, I think in general data center networks can greatly benefit
from these programmable switches.
And as I said in the beginning, in the next year,
like one quarter of all purchased Ethernet switches will be programmable.
So even if you don't use it specifically for transaction processing, you can benefit greatly from the programmability of network hardware.
Across your time working on P4DB, what was the most interesting or perhaps unexpected lesson that you learned?
I think for me the different programming model of switches,
which is radically different from CPUs,
was very challenging and unexpected for me in the beginning.
So first getting your thinking into this pipeline model of processing
and then getting a working design out of it took some time,
but in the end it was worth it.
What were the things that you tried while working on P4DB that failed?
Can you give any words of advice to people who might, in the future,
want to work with programmable switches?
I think the best advice is to forget how you would program the switch
using your normal programming techniques
like data structures,
which you commonly use for CPUs.
You should start from scratch
and then start to find a solution for your problem.
At the beginning, I also made the mistake
of having some designs
which in the end weren't able to compile, because I violated some constraints,
for example, accessing a register twice or in a different order.
But yeah, it takes some time.
But once you're in, I think building the new systems and also different systems becomes easier and easier.
Where do you go next with P4DB?
I think I will continue further research in this networking context as the main topic of my PhD.
But I'm also looking at some upcoming architectures for switches.
For example, there will now be FPGAs embedded next to the switching ASIC,
and this can open up many new possibilities for new designs.
For example, designs that fail the constraints of the
ASIC can then be moved to the FPGA, where you
can synthesize a specific processing pipeline.
And another direction I plan to look into
are other dedicated network accelerator cards,
for example, in servers.
And there I look into how to speed up different protocols
also in a database context.
And I think this is an interesting area.
Is P4DB publicly available? Where can the listener
go and find it? Yes, I uploaded all the source code
to the GitHub repository of my lab,
so it's github.com/DataManagementLab/p4db. And
we also published an extended technical report
next to our paper, where we cover some other aspects in more detail, which unfortunately couldn't make it into the full paper.
What attracted you to this research area in the first place?
Yeah, so I started this topic as a master's thesis, so I kind of slid into it a bit. And as I said at the beginning, it was a bit hard, especially
mapping this very abstract problem statement onto the hardware. But then I started to like it
and I'm now continuing further in this area. And one point is that there are more and more
accelerators popping up around every corner, and for each there are different
development techniques, and they are popping up at an exponential rate. And I think a big
challenge is to tame all these developments and use them in the way best suited for the project,
or in my context, databases.
What is the one key thing you want listeners to take away from your research
and your work on P4DB?
I'm getting a bit philosophical now, I think,
but I would say you should take a look at the whole infrastructure you have
and see what other components can be utilized to solve your tasks.
At the beginning, it might look unconventional,
but you might see some surprisingly good effects
on your workload in the end.
And sometimes the solution is not even that intuitive.
For example, in our case with P4DB,
we could easily achieve great speedups for certain workloads,
basically without even having to buy new hardware,
because the programmable switch is often already in the network.
We will end it there.
Thanks so much, Matthias, for coming on the show.
If you are interested in knowing more about Matthias' work,
all the links to his paper and all the other relevant materials
will be put in the show notes.
This episode concludes our SIGMOD 2022 series.
We hope you've enjoyed listening.
We'll be back soon with another series
focusing on a different conference.
So keep an eye on our Twitter account,
that is @DisseminatePod, for updates about that.
See you all next time.