Disseminate: The Computer Science Research Podcast - Matthias Jasny | P4DB - The Case for In-Network OLTP | #10
Episode Date: August 8, 2022

Summary: In this episode Matthias Jasny from TU Darmstadt talks about P4DB, a database that uses a programmable switch to accelerate OLTP workloads. The main idea of P4DB is that it implements a transaction processing engine on top of a P4-programmable switch. The switch can thus act as an accelerator in the network, especially when it is used to store and process hot (contended) tuples on the switch. P4DB provides significant benefits compared to traditional DBMS architectures and can achieve a speedup of up to 8x.

Questions:
0:55: Can you set the scene for your research and describe the motivation behind P4DB?
1:42: Can you describe to listeners who may not be familiar with them, what exactly is a programmable switch?
3:55: What are the characteristics of OLTP workloads that make them a good fit for programmable switches?
5:33: Can you elaborate on the key idea of P4DB?
6:46: How do you go about mapping the execution of transactions to the architecture of a programmable switch?
10:13: Can you walk us through the lifecycle of a switch transaction?
11:04: How does P4DB determine the optimal tuple placement on the switch?
12:16: Is this allocation static or is it dynamic, can the tuple order be changed at runtime?
12:55: What happens if a transaction needs to access tuples in a different order than that laid out on the switch?
14:11: Obviously you can't fit all data on the switch, only the hot data, how does P4DB execute transactions that access some hot and some cold data that's not on the switch?
16:04: How did you evaluate P4DB? What are the results?
18:28: What was the magnitude of the speed up in the scenarios in which P4DB showed performance gains?
19:29: Are there any situations in which P4DB performs non-optimally and what are the workload characteristics of these situations?
20:36: How many tuples can you get on a switch?
21:23: Where do you see your results being useful? Who will find them the most relevant?
21:57: Across your time working on P4DB, what are the most interesting, perhaps unexpected, lessons that you learned?
22:39: That leads me into my next question, what were the things you tried while working on P4DB that failed? Can you give any words of advice to people who might work with programmable switches in the future?
23:24: What do you have planned for future research?
24:24: Is P4DB publicly available?
24:53: What attracted you to this research area?
25:42: What's the one key thing you want listeners to take away from your research and your work on P4DB?

Links: Paper, Presentation, Website, Email, Google Scholar, P4DB

Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate, the computer science research podcast.
I'm your host, Jack Wardby.
This is episode 10 and the final episode of our SIGMOD 2022 series.
I'm delighted to say I'm joined today by Matthias Jasny, who will be talking about his paper P4DB, the case for in-network OLTP.
Matthias is a PhD student at the Technical University of Darmstadt and his research focuses primarily on scalable data management and programmable networks. Matthias, thanks for
joining us on the show. Hi Jack, thanks for having me. Let's dive straight in. Can you set the scene
for your research and describe the motivation behind P4DB? Yes, so my research mainly builds
on the observation that database development always lags a bit behind what is currently available in hardware, for example, high-speed networks.
And in my work, especially with the focus on programmable switches, there has been only limited work on how to use these programmable switches in databases, mainly by offloading some components into the network,
for example, the lock manager or a key-value store for caching some values.
But at the time when I started my research, I couldn't find anything
that did full transaction processing on the switch.
So I thought back then that this would be a nice challenge to tackle.
Awesome. So can you describe to the listeners who may not be familiar with them,
what exactly is a programmable switch?
Yeah, maybe I should start by quickly saying what a switch in general is.
So a switch sits in the network and is connected to database nodes,
or nodes in general,
and its task is mainly to route packets between the nodes.
And normal switches do this by looking into the packet contents
and seeing what the destination is and out of which port the packet needs to go.
And these switches, still talking about normal switches,
can also have firewall functions
and only allow packets of a specific VLAN or something else to go to some node.
And they can also drop packets if it's some malicious attack or so.
And these normal switches are for a user kind of a black box,
and they only support a fixed set of protocols.
For example, TCP/IP, UDP/IP, VLAN tags, Ethernet, and so on.
And since these programmable switches came to the market,
they gave the users the opportunity to develop their own protocols.
So you basically define what the data layout in the packets is:
what fields are in the packet headers,
how they are combined together,
how they should be interpreted and matched, and so on.
This architecture is realized in a switch ASIC.
And when you want to reconfigure it, you don't need to buy a new ASIC.
You can basically flash a new firmware, and then your upgraded protocol is running.
And these programmable switches are quite flexible. And as of right now,
they are on par with normal switches,
and some are even better.
And you can get them for the same price, basically.
And this is also one interesting point
because in the next year,
23% of all Ethernet switches will be programmable.
And some users might not even know that the network switch
inside the data center network is programmable.
So basically, you get some computing platform for free inside your network.
So bringing it back to your research,
what are the characteristics of OLTP workloads
that make them a good fit for these programmable switches?
Yes, so in the database world, you can distinguish between two major types of workloads,
as many of you might know: OLAP, online analytical processing, and OLTP, online transaction processing.
In OLAP, the transactions are long-running, mainly joins between warehouses, for example,
and OLTP transactions have the characteristic that they are short and they just access a few records.
And in OLTP, it's also very common that workloads have skew,
where only a few tuples are touched by a majority of the transactions.
And this creates many challenges, because data access needs to be regulated
for a limited set of resources,
in this case, tuples.
Think of an online shop, for example,
with popular items
where many users want to purchase
some new book or DVD.
And so our idea was to look at OLTP
because of these attributes.
And we thought that they map quite well to the switch model.
For example, on the switch, the memory is a bit limited, and the accesses to the memory are also constrained by the switch architecture, because it needs to route the packets very fast. And we saw that this pattern is very similar to OLTP,
and then had the idea to do this OLTP processing on the switch.
I know you touched on it a bit there,
but can you elaborate on what the key idea behind P4DB is?
Yes, so the key idea of my research was to take the hot tuples of our workload
and put them onto the switch and store them in the switch's memory.
And then also let the switch execute full transactions
on the stored data.
So when you now move the data to the switch,
you get a lower access latency, because the network path is halved:
you don't need to go through the switch to some other node,
you just need to go to the switch.
And the transactions are processed by the switch
in a pipelined and lock-free manner.
So you don't need to worry
about any concurrency control
and you have certain performance guarantees
because you don't need to acquire any locks.
And when you think of the bandwidth the switch has,
for example, with 40 input ports each at 25G,
you can get to a throughput of around 1.5 billion transactions per second,
which is very high when you just compare this to the clock speed of CPUs.
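To make that figure concrete, here is the back-of-the-envelope arithmetic as a small Python sketch (the port numbers come from what was just said; the minimum-size-packet assumption is mine, for illustration):

```python
# Back-of-the-envelope switch throughput estimate (illustrative assumptions).
PORTS = 40                 # input ports, as mentioned above
GBPS_PER_PORT = 25         # 25 Gbit/s per port
MIN_FRAME_BYTES = 64 + 20  # minimum Ethernet frame plus preamble and gap

total_bits_per_sec = PORTS * GBPS_PER_PORT * 1e9
packets_per_sec = total_bits_per_sec / (MIN_FRAME_BYTES * 8)
print(f"{packets_per_sec / 1e9:.2f} billion packets/s")  # ~1.49 billion
```

If every packet carries one switch transaction, the packet rate translates directly into the transaction rate, which is where the roughly 1.5 billion figure comes from.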
Nice. So how do you go about mapping the execution of transactions to the architecture of a programmable switch?
Yeah, so when I started with the research, I had quite a few iterations of how to do this.
But let's first talk about how the switch works internally.
So as hinted previously, the switch uses a pipeline architecture,
and this is similar to a water pipe.
Packets come in at an input port, then flow through the pipe, and then go out at the output port.
And the switch works similarly,
but at the beginning you have a parser, which takes the byte stream, so the zeros and ones from the wire, and interprets them, or so-called parses them, into meaningful header instances, which can then be used for further processing.
So these header instances can be, for example, the IP header or the TCP header or the UDP header.
And then the packet moves further through the pipeline,
and the innermost part of the pipeline consists
of so-called MAU stages.
These are match-action units that are chained together
to form a chain of multiple stages.
And they allow us to execute different actions
based on the packet contents.
So, for example, we want to route a packet to some node.
Then in one MAU stage, we look at the destination IP address,
do a lookup on which output port it is, and then set the output port for the packet.
And then after the packet passes through the stages,
it goes to the deparser where the packet is then reassembled to a byte stream
and sent out to the wire.
And this pipeline mechanism works in a way
that you only have one packet in each stage
and they progress further on each clock cycle.
And when you think of a clock speed of around 2 GHz,
you can get to the routing speed of 1.5 billion packets per second.
This is a very important aspect, because on CPUs
you can have as many random accesses as you want,
but on the switch, each packet can only access the resources
in the stage it is currently in.
And in the next clock cycle, it moves to the following stage and cannot access the resources
of the previous stage anymore.
It can only access the stage local resources.
So this kind of programming model is different from CPUs, and when you design switch programs, you need to keep it in mind.
And by having the switch pipeline, we can actually get full ACID guarantees.
Atomicity, consistency, and isolation
are basically given to us for free by the pipeline.
We only have one packet in each stage, and accesses always succeed.
And when you think of the packets in line,
with each clock cycle they always consistently see the changes
of the previous packets in the next stage.
Then there is durability, which we handle on the nodes by efficiently logging
operations in a write-ahead log, basically.
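To give a feel for this execution model, here is a minimal Python simulation of a pipeline with stage-local registers (an illustrative sketch, not the paper's implementation; the stage logic and field names are invented):

```python
# One packet per stage per clock; each stage owns local registers that
# only it can touch. Later packets consistently see earlier writes.

class Stage:
    def __init__(self, name):
        self.name = name
        self.registers = {}                  # stage-local memory

    def process(self, pkt):
        key = pkt.get(self.name)             # does the packet use this stage?
        if key is not None:
            self.registers[key] = self.registers.get(key, 0) + pkt["delta"]
            pkt["result_" + self.name] = self.registers[key]

def run_pipeline(stages, packets):
    slots = [None] * len(stages)             # slots[i]: packet inside stage i
    pending, done = list(packets), []
    while pending or any(p is not None for p in slots):
        if slots[-1] is not None:
            done.append(slots[-1])           # packet exits to the deparser
        slots = [None] + slots[:-1]          # all packets advance one stage
        if pending:
            slots[0] = pending.pop(0)        # next packet enters from parser
        for stage, pkt in zip(stages, slots):
            if pkt is not None:
                stage.process(pkt)           # one packet per stage per clock
    return done

stages = [Stage("s0"), Stage("s1")]
pkts = [{"s0": "hot", "delta": 1}, {"s0": "hot", "delta": 2}]
print(run_pipeline(stages, pkts))            # the second packet reads 3
```

Because a packet can only occupy a stage after its predecessor has left it, every transaction sees earlier writes in full or not at all, which is exactly the atomicity and isolation for free described above.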
Nice. Could you maybe walk us through the lifecycle of one of
these switch transactions? Yes, so the user
sends out a special packet, which is our switch transaction, and
this packet is then executed by our transaction engine on the switch.
In the switch transaction, the user can encode different instructions,
which mimic the operations a normal transaction does.
And when it passes through the switch pipeline, it's executed from top to bottom
and can modify the tuples stored on the switch as well. So after the execution, when all instructions
of the switch transaction have been successfully executed and the results have been written into this packet, the packet is then routed back to the sender.
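As an illustration of what such a special packet might carry, here is a hedged sketch of encoding a switch transaction as a list of fixed-size instructions (the opcodes and field layout are invented for illustration; the actual header format is defined in the paper and the P4DB code):

```python
import struct

# Invented opcodes and layout, purely to illustrate the idea.
OP_READ, OP_WRITE, OP_ADD = 0, 1, 2

def encode_switch_txn(txn_id, instructions):
    """Header: txn id + instruction count, then one fixed-size slot per
    instruction (opcode, tuple id, operand), all big-endian."""
    buf = struct.pack("!IB", txn_id, len(instructions))
    for opcode, tuple_id, operand in instructions:
        buf += struct.pack("!BIq", opcode, tuple_id, operand)
    return buf

# Read tuple 7, then add 100 to tuple 42: one packet, two instructions.
wire = encode_switch_txn(1, [(OP_READ, 7, 0), (OP_ADD, 42, 100)])
print(len(wire), "bytes")  # 5 + 2 * 13 = 31 bytes on the wire
```

The switch's parser would interpret these fixed-size slots as header fields, execute them stage by stage from top to bottom, overwrite the operand fields with results, and return the same packet to the sender.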
How does P4DB determine which tuples to place on the switch?
Yes, so this tuple placement is very important
because when the packet flows through the pipeline,
its accesses need to follow the pipeline
because if this is not the case,
it needs to do multiple passes through the pipeline
and this is, of course, a bit costly.
So to optimize this data layout, we model the accesses of our transactions as a graph.
We define the tuples as nodes, and each access a transaction makes, we define as a
directed edge in our graph. So when tuples are accessed very frequently, the edge weights are high.
And when we now have the graph, we can partition it using a maximum cut graph algorithm and
basically cut the graph on the edges with the highest weights.
And then we get partitions of tuples.
And when we then order these partitions topologically, using the directed edges,
we get the optimal data placement of our tuples
into the different stages of the switch pipeline.
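To make the placement idea concrete, here is a toy Python version (my own simplified stand-in: it weights directed edges by access frequency and uses a greedy ordering instead of the paper's actual partitioning algorithm):

```python
from collections import defaultdict

# Toy placement: weight directed edges by access frequency, then order
# tuples so that most accesses flow "forward" through the pipeline stages.

def build_access_graph(transactions):
    weights = defaultdict(int)
    for txn in transactions:              # txn = ordered list of tuple ids
        for a, b in zip(txn, txn[1:]):
            weights[(a, b)] += 1          # directed edge, weight = frequency
    return weights

def greedy_stage_order(weights):
    # Tuples that are mostly accessed before others get positive scores
    # and land in earlier stages (a greedy stand-in for the real algorithm).
    score = defaultdict(int)
    for (a, b), w in weights.items():
        score[a] += w                     # outgoing accesses push earlier
        score[b] -= w                     # incoming accesses push later
    return sorted(score, key=score.get, reverse=True)

txns = [["A", "B", "C"], ["A", "C"], ["B", "C"], ["A", "B"]]
print(greedy_stage_order(build_access_graph(txns)))  # ['A', 'B', 'C']
```

A transaction whose accesses follow this order can finish in a single pipeline pass; any backward access forces another pass, which is exactly what the weighting tries to minimize.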
Is the allocation static or is it dynamic?
Can the tuple order be changed at runtime?
For our paper, we had it static to better compare against other baselines.
But you can think of management transactions which take one tuple and write it to another location.
So you can have dynamic layouts and also adaptive layouts.
For example, when workloads are shifting and the hot items, the very frequently accessed items in our shopping system, move,
then we can also offload tuples from the switch and load other tuples onto it.
This is possible in the design.
What happens if a transaction needs to access tuples in a different order than that laid out on the switch?
So the data layout algorithm optimizes the placement, but it cannot always be optimal.
There can always be transactions that access tuples in a different order.
So we handle this by allowing the switch transaction to pass multiple times through a switch pipeline.
And this is done by sending the switch transaction to a special port on the switch, which loops
back to some input port.
And this allows a switch transaction
to make multiple passes through the switch pipeline.
But when you think about it,
this also violates some of the criteria
which our switch pipeline guaranteed at the beginning.
You can have inconsistent states,
some switch transactions might read intermediate updates
from other switch transactions and so on.
So to prevent this, we add a locking mechanism
at the beginning of our switch pipeline,
which prevents the switch from executing other transactions
while one multi-pass transaction is running.
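A minimal sketch of that front-of-pipeline lock (my own illustration, assuming a single lock register checked before admission):

```python
# A single lock register at the start of the pipeline: while a multi-pass
# transaction holds it, other transactions are not admitted.

lock_holder = None

def admit(txn_id, multipass):
    global lock_holder
    if multipass:
        if lock_holder is None:
            lock_holder = txn_id          # acquire on the first pass
        return lock_holder == txn_id      # only the holder may re-enter
    return lock_holder is None            # single-pass txns wait if locked

def release(txn_id):
    global lock_holder
    if lock_holder == txn_id:
        lock_holder = None                # free again after the final pass

print(admit(1, multipass=True))    # True: txn 1 takes the lock
print(admit(2, multipass=False))   # False: deferred while txn 1 runs
release(1)
print(admit(2, multipass=False))   # True: the lock is released
```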
Obviously, you can't fit all of the data on the switch, right?
Only the hot data.
So how does P4DB execute transactions that need to access some hot data
and some cold data that's not on the switch?
Yes, so we also thought of this case in our paper,
and we gave these transactions a special name,
so-called warm transactions,
because they access hot and cold data at the same time.
And we integrated these warm transactions
into the two-phase commit protocol,
which is used in distributed database systems.
So since these switch transactions do not abort and are always executed lock-free on the switch,
we have some constraints.
So how can we now execute these transactions that access both hot and cold data?
First, we need to ensure that once we send out the hot transaction to the switch, the
whole transaction cannot abort anymore due to the cold part.
So we do this by obtaining locks for the cold parts and waiting until it's in a kind of
pre-commit state.
Now we know that the cold part cannot abort anymore.
And then we send out a switch transaction
and receive the results.
We can do some further computation on the results
and then fully commit the cold part
and the whole transaction.
And this scheme is needed because, as I said,
these hot transactions cannot abort
and also cannot be rolled back on the switch.
In case a transaction wants to access hot data and cold data,
we can also temporarily offload these cold tuples
into some dedicated memory of the switch
and execute it as if it were a hot transaction.
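Sketched in Python, the warm-transaction flow described above might look like this (illustrative only: the node and switch interfaces are invented, and the real system integrates this into its commit protocol):

```python
# Warm-transaction flow: the cold part is locked and pre-committed first,
# so the abort-free hot part on the switch never has to be rolled back.

def run_warm_transaction(txn, node, switch):
    locks = node.lock(txn.cold_keys)            # 1. lock all cold tuples
    try:
        node.prepare(txn.cold_part)             # 2. pre-commit: the cold part
                                                #    can no longer abort
        result = switch.execute(txn.hot_part)   # 3. hot part runs lock-free
                                                #    on the switch, no aborts
        node.apply(txn.cold_part, result)       # 4. finish the cold writes
                                                #    using the switch results
        node.commit(txn)                        # 5. commit the whole txn
        return result
    finally:
        node.unlock(locks)                      # always release the locks
```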
How did you go about evaluating P4DB
and what were the key results?
For our evaluation,
we used a database system
based on two-phase locking,
with all the transaction execution modules,
which was running without a switch,
and compared the same database
with an active switch.
So a passive switch, where the switch is just routing packets through the network, against
an active switch, where it's effectively executing these hot transactions inside the network.
And to show the benefits P4DB can provide, we implemented three OLTP workloads. YCSB, which, as everyone knows, is a key-value store workload;
to simulate transactions, we grouped together eight operations.
And SmallBank, which is a banking application.
And then we also implemented TPCC.
And for YCSB and SmallBank, these can be fully implemented as hot transactions
on the switch, because they are not that complex and all operations can be done in one pipeline pass.
And for these workloads, we saw in our evaluation significant improvements,
especially when the skewness factor was very high. And this was due to the pipelined and lock-free execution model,
because you don't need to coordinate the accesses,
and the transactions are executed as fast
as the packets are routed through the switch.
For TPCC, we needed to rely on the techniques
of warm transactions because the new order
and payment transactions in TPCC contain some table inserts and string lookups.
So for these, we only executed the hot part,
which caused the most contention, on the switch,
and then let the cold part be executed normally
on the nodes.
So for TPCC, we also saw speedups.
They were not as high as for YCSB and SmallBank,
but they were significant speedups too.
And for TPCC, this was basically limited by the cold subset,
because this is still the major part where the nodes synchronize
and a lot of time goes to waste.
Well, not wasted exactly, but unused.
What was the magnitude of the speedup in the scenarios in which P4DB showed performance gains?
For YCSB and SmallBank, for a certain skew factor, say 80% of all accesses going to 20% of the tuples,
you can see speedups of up to 8x.
And for TPCC, the speedups were around 1.5x to 2x.
I want to highlight one very interesting fact.
So for the switch, it does not matter how the workload is skewed.
It does not matter if all switch transactions access one tuple or multiple tuples
because they always take the same amount of time since it's clocked by the pipeline.
And this is very interesting when you look at the graphs
because for different write-read ratios in the transactional workloads,
the throughput is exactly the same.
Are there any situations in which P4DB's performance is non-optimal?
And what are the workload characteristics of these situations?
Yes, there are workloads with certain characteristics.
For example, we cannot support scans or similarly complex operations in P4DB due to the limited hardware capabilities.
Another obvious aspect is if the workload cannot be partitioned into a hot and cold part.
So if the workload is very uniform, the switch cannot do as much to accelerate the workload
as it would be able to when we have a very distinctive hot
portion in our data.
This is due to the fact that when less data access goes to the switch, it basically
cannot accelerate the workload as effectively.
But with our system, this is not a disadvantage.
We also have a microbenchmark on this, where we increase the hot set, and there P4DB performs asymptotically as if there were no switch in the system.
So you don't have any drawback when you use P4DB and the workload doesn't match.
How many tuples can you actually get on one of these switches?
This also depends on the switch model, but with the switch we use,
we could store around 800,000 64-bit tuples. It also depends on how many tables
you want to store on the switch. If you have some read-only
data, you could also replicate it, so that you store table A in the first pipeline stages and
table A again in the last pipeline stages, to optimize for different data access orders.
So this also factors into how much data you can store on the switch, but the rough
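As a rough sanity check on that number, here is a quick back-of-the-envelope calculation (illustrative only, not from the paper):

```python
# Rough storage arithmetic for the quoted figure (illustrative).
tuples, tuple_bytes = 800_000, 8         # 64-bit tuples
print(tuples * tuple_bytes / 1e6, "MB")  # 6.4 MB of on-switch register memory
# Replicating a read-only table in both early and late stages would halve
# the number of distinct tuples that fit: capacity traded for access order.
```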
ballpark is, as I said, around 800,000.
Where do you see your results being the most useful?
Who do you think will find the results the most relevant?
Yes, I think in general data center networks can greatly benefit
from these programmable switches.
And as I said in the beginning, in the next year,
like one quarter of all purchased Ethernet switches will be programmable.
So even if you don't use it specifically for transaction processing, you can benefit greatly from the programmability of network hardware.
Across your time working on P4DB, what was the most interesting or perhaps unexpected lesson that you learned?
I think for me the different programming model of switches,
which is radically different from CPUs,
was very challenging and unexpected for me in the beginning.
So first getting your thinking into this pipeline model of processing
and then getting a working design out of it took some time,
but in the end it was worth it.
What were the things that you tried while working on P4DB that failed?
Can you give any words of advice to people who might, in the future,
want to work with programmable switches?
I think the best advice is to forget how you would program the switch
using your normal programming techniques
like data structures,
which you commonly use for CPUs.
You should start from scratch
and then start to find a solution for your problem.
At the beginning, I also made the mistake
of having some designs
which in the end weren't able to compile, because I violated some constraints,
for example, accessing a register twice or in a different order.
But yeah, it takes some time.
But once you're in, I think building the new systems and also different systems becomes easier and easier.
Where do you go next with P4DB?
I think I will continue further research in this networking context as the main topic of my PhD.
But I'm also looking at some upcoming architectures for switches.
For example, there will now be FPGAs embedded next to the switching ASIC,
and this can open up many new possibilities for new designs.
For example, designs that fail the constraints of the
ASIC can then be moved to the FPGA, where you
can synthesize a specific processing pipeline.
And another direction I plan to look into
are other dedicated network accelerator cards,
for example, in servers.
And there I look into how to speed up different protocols
also in a database context.
And I think this is an interesting area.
Is P4DB publicly available? Where can the listener
go and find it? Yes, I uploaded all the source code
to the GitHub repository of my lab,
so it's github.com/DataManagementLab/p4db. And
we also published an extended technical report
next to our paper, where we cover some other aspects in more detail, which unfortunately couldn't make it into the full paper.
What attracted you to this research area in the first place?
Yeah, so I started this topic as a master's thesis, so I kind of slid into it a bit. And as I said at the beginning, it was a bit hard, especially
mapping this very abstract problem statement onto the hardware. But then I started to like it
and I'm now continuing further in this area. And one point is that there are more and more
accelerators popping up around every corner, and for each there are different
development techniques, and they are popping up at an exponential rate. And I think a big
challenge is to tame all these developments and use them in the way best suited for the project,
or in my context, databases.
What is the one key thing you want listeners to take away from your research
and your work on P4DB?
I'm getting a bit philosophical now, I think,
but I would say you should take a look at the whole infrastructure you have
and see what other components can be utilized to solve your tasks.
At the beginning, it might look unconventional,
but you might see some surprisingly good effects
on your workload in the end.
And sometimes the solution is not even that intuitive.
For example, in our case with P4DB,
we could easily achieve great speedups for certain workloads,
basically without even having to buy new hardware,
because the programmable switch is often already in the network.
We will end it there.
Thanks so much, Matthias, for coming on the show.
If you are interested in knowing more about Matthias' work,
all the links to his paper and all the other relevant materials
will be put in the show notes.
This episode concludes our SIGMOD 2022 series.
We hope you've enjoyed listening.
We'll be back soon with another series
focusing on a different conference.
So keep an eye on our Twitter account,
that is @DisseminatePod, for updates about that.
See you all next time.