Disseminate: The Computer Science Research Podcast - Jinkun Geng | Nezha: Deployable and High-Performance Consensus Using Synchronized Clocks | #42

Episode Date: October 23, 2023

Summary: In this episode Jinkun Geng talks to us about Nezha, a high-performance consensus protocol. Nezha can be deployed by cloud tenants without support from cloud providers. Nezha bridges the gap between protocols such as MultiPaxos and Raft, which can be readily deployed, and protocols such as NOPaxos and Speculative Paxos, that provide better performance, but require access to technologies such as programmable switches and in-network prioritization, which cloud tenants do not have. Tune in to learn more! Links: Jinkun's Homepage, Nezha VLDB'23 Paper, Nezha GitLab Repo. Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, the computer science research podcast. I'm your host, Jack Waudby. A reminder that if you do enjoy the show, do please consider supporting us through Buy Me a Coffee. It really helps us to keep making the show. So yeah, please do consider supporting us. Today, I'm joined by Jinkun Geng, who will be telling us everything we need to know about Nezha: deployable and high-performance consensus using synchronized clocks. Jinkun is a PhD student at Stanford University. Jinkun, welcome to the show. Yeah, thanks for the invitation, Jack. Great stuff. So let's jump straight in then. So can you tell us a little bit more about yourself
Starting point is 00:00:53 and how you became interested in databases, distributed systems, and all the cool research that you do? Yeah, cool. Sure. So my name is Jinkun Geng, and I'm now a fifth-year PhD student in the CS department of Stanford University. My research direction is networked systems. More specifically, my current research project is focusing on high-performance distributed protocols, including consensus protocols and concurrency control protocols. Before I joined Stanford in 2019, I completed my undergraduate study and my master's study both in China. I completed my undergraduate study at Beihang University, and then I entered Tsinghua University to obtain my master's degree. So regarding my research career, I think I can divide that into two parts.
Starting point is 00:01:44 So the first part is before my PhD. Mostly I conducted research at Tsinghua University, advised by Professor Dan Li. At that time, I was doing high-performance networking, which means I used DPDK and RDMA to accelerate the communication in distributed systems. Further, based on that, we could continue to bring performance improvements to distributed applications, such as distributed machine learning applications. And that is what I was doing in my master's.
Starting point is 00:02:17 And when I came to Stanford, I finally found my current supervisors. So I joined the group of Professor Balaji Prabhakar. And Balaji is my primary advisor. And in the meantime, I'm also co-advised by Professor Mendel Rosenblum and also Professor Anirudh Sivaraman. So Mendel is another professor
Starting point is 00:02:39 at Stanford, and Anirudh is a professor at New York University. So in our current group, the general direction is called time-sensitive networks. So the motivation for this general direction is like this. So dating back to decades ago, such as 10 years or 20 years ago,
Starting point is 00:03:01 at that time, clock synchronization was not yet a mature technique. So the clock synchronization error could be quite bad dating back to 20 years ago. Different nodes could even see error bounds of 100 milliseconds. So the early researchers in distributed systems could not make strong assumptions about clock synchronization. But in recent years, especially in the past three to five years, we have seen that clock synchronization is developing very fast. And the traditional protocol like NTP has improved its synchronization accuracy considerably.
Starting point is 00:03:42 And some new clock synchronization protocols have also been proposed, like the synchronization algorithm we have used in the Nezha protocol, which is called Huygens, and which was developed by a senior labmate in our group. So with the new synchronization algorithm, we can reach clock synchronization accuracy
Starting point is 00:04:04 at the nanosecond level. So in that way, with this new powerful tool, we come to ask whether we can revisit and even revolutionize the previous distributed systems research and achieve much better performance. And the consensus protocol is one such scenario. Fantastic, that's a great mission statement for your lab. I mean, yeah, like you say, it's a different ball game from 20 years ago, right? The underlying assumptions have changed, we've got better hardware, so how does that change the state of play? I guess that's what we're going to be talking about today in some more depth. So cool, let's start
Starting point is 00:04:43 off with a little bit more background for the listener. So before we dig into all the nitty-gritty and everything, can you tell us more about consensus protocols generally? What are they and why do we need them? Yeah, okay. So, you know, actually consensus protocols are now widely applied in many distributed systems, because distributed systems require fault tolerance. So consensus protocols provide the fault tolerance for these systems,
Starting point is 00:05:10 so they can ensure service availability for the distributed systems. When one server fails, you just need to use the other servers to replace the failed ones, and you can resume the service for the systems. So simply speaking, the consensus protocol can be defined in a scenario with multiple servers. So the protocol needs to coordinate the multiple servers so that they can reach an agreement on a single data value. So this is commonly known as single-decree Paxos, or you can call that single-decree consensus, because Paxos has already become a synonym for consensus. But of course, in practical cases, single-decree consensus is not sufficient for the applications.
Starting point is 00:05:58 So in many practical cases, we can make an abstraction of the application's execution. We represent the execution history of the application as an ordered log list. So it becomes the responsibility of the consensus protocol to make sure that every server has the same ordered log list. So in that way, when the active server fails, we just need to use another server which has the same ordered log list to replace it. By re-executing the log list, we can recover the same application state and then we can resume the service.
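[Editor's note: to make that concrete, here is a minimal sketch of the state machine replication idea Jinkun describes, assuming a simple key-value application. The class and names are illustrative, not from any particular system.]

```python
# Replicas that apply the same ordered log deterministically end up in the
# same state, so a failed server can be replaced by any peer with the same log.

class Replica:
    def __init__(self):
        self.log = []     # ordered list of commands agreed by consensus
        self.state = {}   # application state (here: a simple key-value map)

    def append(self, command):
        self.log.append(command)

    def execute_all(self):
        # Re-executing the ordered log deterministically rebuilds the state.
        self.state = {}
        for key, value in self.log:
            self.state[key] = value
        return self.state

primary, backup = Replica(), Replica()
for cmd in [("x", 1), ("y", 2), ("x", 3)]:
    primary.append(cmd)
    backup.append(cmd)   # consensus guarantees identical ordered logs

assert primary.execute_all() == backup.execute_all()  # backup can take over
```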
Starting point is 00:06:35 Fantastic. That's a great description of what consensus is for. I mean, I personally work with one every day. We have Raft at work, so I have fun with that quite a lot of the time. But cool, right? Anyway, I'm getting sidetracked, so let me keep on focus. So of course, you mentioned there that Paxos is sort of almost the synonym for consensus protocols, and I know there's a host of other ones like Raft out there, but what are
Starting point is 00:07:00 the problems with the current state-of-the-art protocols? And why do we need a new one? What's the motivation here? The problems and then, I guess, leading into the motivation for your work. Yeah. I think, based on my understanding, for the existing state-of-the-art consensus protocols, the drawbacks can be summarized into two dimensions. So the first dimension is performance, and the second dimension is deployability.
Starting point is 00:07:29 As you mentioned, for the traditional protocols like MultiPaxos and Raft, when they were designing those protocols, they could not make strong assumptions. So for these protocols, of course, they do not have strong assumptions, so they can be deployed everywhere. But the performance is not very good. And these protocols usually suffer from a throughput bottleneck. And also the commit latency can be very long.
Starting point is 00:07:55 So that will become the bottleneck for the systems and applications running on top of them. So these drawbacks of the traditional protocols have already been realized by some recent works. So some recent protocols like NOPaxos and Speculative Paxos are trying to achieve high performance. So in order to do that, they decided to use some network support, such as programmable switches and SDN control. But in that case, it introduces another problem, which is deployability.
Starting point is 00:08:28 Because we know in the public cloud, which is the general environment for many systems and applications, we do not have such strong support. From the perspective of the cloud tenant, the network is just like a black box, so you cannot use programmable switches or SDN control. So in that way, the high-performance protocols
Starting point is 00:08:50 developed in recent years cannot be easily deployed in the public cloud. So that is what motivates us to design a new protocol. We hope that we can achieve both high performance and deployability
Starting point is 00:09:02 so that we can provide a benefit to the cloud systems and applications. Yeah, that's fantastic. I mean, whenever I come across a cool paper that's used some sort of modern hardware that's cutting edge, and it says, yeah, look how fast we can make this system go using this hardware, or we can solve these problems using this hardware and it goes really fast.
Starting point is 00:09:24 There's this deployability issue you mentioned there: that hardware's not generally available in the public cloud, and therefore it's, well, not irrelevant, but I can't use it day to day, because not everyone has access to it, right? So I think it's great to factor that sort of dimension into your design process. So I guess with these two dimensions in mind then, give us the elevator pitch for Nezha. And I think I'm pronouncing it right. Is it Nezha?
Starting point is 00:09:50 Nezha, yeah. Yeah, okay. What does it mean? What does it mean? Nezha is the name of a Chinese legendary figure. So this figure has three heads and six arms.
Starting point is 00:10:04 So you can imagine that if you cut down one head, this guy is still alive. So it has strong fault tolerance. I get it. I get it. That's a great name. That's a great name. Cool. So give us the elevator pitch then. How are we going to achieve these two goals? Yeah, I think there are three pillars in the Nezha design. The first aspect, of course,
Starting point is 00:10:26 is clock synchronization. So let's first think about this question. Why does clock synchronization accelerate consensus protocols? So essentially, consensus protocols are trying to establish the agreement
Starting point is 00:10:42 on all the log lists. So you can imagine that if you have multiple replicas and you have many clients, imagine that each request is multicast to every replica, and imagine that each request arrives at every replica in the same order. So in that way, these replicas will establish the same ordered log list by nature.
Starting point is 00:11:03 So we do not even need coordination. This is the ideal case, but this does not always happen. This can hardly happen, actually, in the public cloud, because in the public cloud, we have observed that message reordering occurs very frequently. So in that way, if we can reduce the message reordering in the network, then we should be able to accelerate the consensus protocols. This design principle has already been realized by the existing works like what I
Starting point is 00:11:32 mentioned, such as NOPaxos and Speculative Paxos. So they decided to resort to network support like SDN control, to control the network transmission in order to reduce the message reordering. But also, as I mentioned, this network support is not available in the public cloud, whereas clock synchronization is already available in the public cloud. So we decided that we can use clock synchronization to reduce the message reordering so that we can accelerate the consensus. So motivated by that, we first developed a primitive, which is called the Deadline-Ordered
Starting point is 00:12:05 Multicast primitive, and we can talk about that later. And based on the DOM primitive, we continued to develop the Nezha protocol to achieve high performance. So this is the first aspect I want to highlight. The second aspect I want to mention is the speculative design, which we included in the Nezha protocol. So if we take a review of the traditional protocols like MultiPaxos and Raft, we can see that these protocols ask the leader to execute the request only after the leader has confirmed that the request has been replicated to a majority of replicas. So the confirmation operation is also known as the quorum check, which means that in the traditional protocols
Starting point is 00:12:52 like MultiPaxos and Raft, only after the quorum check is completed can the execution be done. So this actually prolongs the commit workflow. So we don't want that. So in our design, we just ask the leader to aggressively execute the request. So that means it is no longer the responsibility of the leader to do the quorum check. And we have another entity to help the protocol do the quorum check to decide whether this request is committed or not. So this is the second aspect, the speculative execution. And following that, the third aspect is also related to the second one. So in our protocol, we have introduced a stateless proxy design. So the proxy becomes an important component.
Starting point is 00:13:43 It helps us to do the request multicast and helps us to do the quorum check. So in that way, these heavy operations, which used to be done by the leader, we have offloaded to the proxy. So in that way, the leader's bottleneck is much reduced. And because the proxy is stateless, we can scale the number of proxies up and down based on the workload. So if your workload is heavy, you can deploy more proxies. Otherwise, you just reduce the number of proxies. So because the proxy is stateless, the failure of a proxy does not affect the correctness of the protocol.
Starting point is 00:14:20 And because it is scalable, it helps us remove much of the bottleneck at the leader. And additionally, because the proxy also serves as a middleware, it actually serves as an adapter for us to migrate distributed systems from legacy protocols to Nezha. For example, many systems are running on top of MultiPaxos and Raft. By just implementing an adapter in the proxy, we can easily migrate a system from Raft or MultiPaxos to Nezha so that it can also enjoy the high performance provided by the Nezha protocol. Fantastic. So just to recap there briefly, the three pillars of Nezha are the deadline-ordered multicast, which is helping us with our...
Starting point is 00:15:09 I did have a question on that, actually, before we continue. You mentioned that in the public cloud, it's been observed that message reordering happens frequently. Do you have any sort of numbers on that, as a rough guideline of how frequent we're talking? Oh, yes. So in our paper, we have included that number. We have included that experiment. So based on our... First, we defined a metric, which is called the
Starting point is 00:15:31 reordering score, to measure the frequency of the reordering. And based on our measurement, we have found that so long as you have multiple senders, and so long as the sending rate increases, you will reach a very high reordering score, which can be 20% to 40%. So that indicates very, very frequent reordering in the public cloud when you do not have any control.
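[Editor's note: as a rough illustration of what such a metric can look like — the paper defines its own reordering score, and this stand-in may well differ from that definition — one simple formulation counts how often a message arrives after a message that was sent later.]

```python
# Illustrative stand-in only, NOT the paper's definition of reordering score.
# It counts messages that arrive with a smaller send timestamp than some
# message already received (i.e., the network delivered them out of order).

def reordering_score(arrival_order):
    """arrival_order: send timestamps in the order the messages arrived."""
    out_of_order = sum(
        1 for i in range(1, len(arrival_order))
        if arrival_order[i] < max(arrival_order[:i])
    )
    return out_of_order / max(len(arrival_order) - 1, 1)

print(reordering_score([1, 2, 5, 3, 4, 6]))  # -> 0.4, i.e., 40% reordered
```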
Starting point is 00:15:51 Yeah, that's larger than I thought it would be. That's very interesting. Cool. So just to continue the quick recap there: we also have the speculative execution, so we don't have this sort of commit latency waiting for the leader to do
Starting point is 00:16:12 everything. And then the last thing is we have this stateless proxy design, which allows us to offload the multicast and the quorum check and all that to some sort of middleware. So again, we're offloading stuff from the leader. Awesome. Cool. Great. So let's dig into the details a little bit more. So let's start off with deadline-ordered multicast. Tell us more about it. Yeah. Explain it to us. Yeah.
Starting point is 00:16:31 As I have mentioned, message reordering is a performance killer for consensus protocols. So that is why we developed the deadline-ordered multicast, or simply we can call it DOM. We developed the DOM primitive to help us reduce the message reordering in the public cloud. So the design of the DOM primitive is quite simple. So suppose that we have multiple senders and we have multiple receivers, and the senders will multicast messages to the receivers. We hope that the receivers can receive the same sequence of messages. So in order to do that, first,
Starting point is 00:17:05 we synchronize the clocks of the senders and receivers. So with the clock synchronization, the sender can measure the one-way delay between the sender and the receivers. So in that way, when the sender wants to multicast the messages, it will attach a deadline to the messages. So the deadline is decided by the sending time plus the maximum estimated one-way delay between the sender and all the receivers. So in the good case, in most cases, when the request is multicast,
Starting point is 00:17:35 it should be able to arrive at each receiver before its deadline. So after receiving a message, the receiver will just hold the message until its deadline, and then it will append the message to the log list. So we can imagine that if DOM can eliminate all the message reordering, then each receiver will establish an ordered log list which
Starting point is 00:18:03 follows the increasing order of their deadlines. So in that way, we can achieve consistency among these receivers. But we should also notice that the DOM primitive is just a best-effort primitive. That means we are trying our best to reduce the message reordering, but we do not guarantee that we can eliminate all the reordering. So that is why, for Nezha, we have a fast path to commit a request efficiently
Starting point is 00:18:31 if the message reordering is completely eliminated. But if message reordering still occurs, it can still lead to inconsistency. So in that way, we still need to have a slow path in the Nezha
Starting point is 00:18:46 protocol to resolve the inconsistency and finally commit a request. Awesome. Cool. So I'm just trying to get my head around the idea of the DOM primitive then. So I guess every time we want to send a message, the clocks are synced, and I guess the parameter that you control there is how long you set the deadline to be, right? I guess that's something you can manually control. Is that
Starting point is 00:19:10 sort of configurable? Like, how do you know how long to set the deadline for, basically? Yeah, we decide the deadline in this way. So as I mentioned, we have the clock synchronization. So periodically we will send some probing packets from the sender to every receiver. So in each probing packet, we include the sending time. And when the receiver receives the probing packet, it can calculate how long it took for the probing packet to arrive at the receiver. So that is what we call the one-way delay between the sender and the receiver. And by collecting
Starting point is 00:19:45 these one-way delay samples, we can make an estimation of the one-way delay between the sender and receivers. So for example, if you have three receivers and the sender sends a probing packet to these three receivers: between receiver 1 and the sender, maybe the one-way delay is 20 microseconds. And between the sender and receiver 2, it is maybe 40 microseconds. And between the sender and receiver 3, it may be 30 microseconds. So since we know these one-way delays
Starting point is 00:20:15 between the sender and receivers, when the sender wants to multicast the messages, it will use the sending time plus the max of the one-way delays. So that should be the sending time plus 40 microseconds. And that will be the deadline to multicast this message. So in the good case, this message should be able to arrive at all the receivers before the deadline. And then if it arrives early,
Starting point is 00:20:40 then the receiver will hold the message until its deadline, which means the receiver will just check its clock time. And then when the clock time passes the deadline, it will release the request and append it to the log list.
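[Editor's note: to make the mechanics concrete, here is a minimal sketch of a DOM sender and receiver as just described, assuming synchronized clocks and the probed one-way delays from the example above. All names are illustrative, not the actual Nezha implementation.]

```python
import heapq
import itertools
import time

_seq = itertools.count()  # tie-breaker so equal deadlines never compare messages

class DomSender:
    def __init__(self, estimated_owd_us):
        # e.g. {"r1": 20, "r2": 40, "r3": 30}, learned from probing packets
        self.estimated_owd_us = estimated_owd_us

    def multicast(self, msg, receivers):
        send_time_us = time.time() * 1e6
        # Deadline = sending time + max estimated one-way delay, so in the
        # good case the message reaches every receiver before its deadline.
        deadline = send_time_us + max(self.estimated_owd_us.values())
        for r in receivers:
            r.on_receive(deadline, msg)

class DomReceiver:
    def __init__(self):
        self.held = []  # min-heap of (deadline, seq, msg)
        self.log = []   # released messages, in increasing deadline order

    def on_receive(self, deadline, msg):
        heapq.heappush(self.held, (deadline, next(_seq), msg))

    def tick(self):
        # Hold each message until its deadline passes, then release it in
        # deadline order; receivers with synced clocks build identical logs.
        now_us = time.time() * 1e6
        while self.held and self.held[0][0] <= now_us:
            deadline, _, msg = heapq.heappop(self.held)
            self.log.append((deadline, msg))
```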
Starting point is 00:21:09 Fantastic. I'm really excited to hear some numbers on this later on, but cool. So before we jump to that, we need to get through the fast path and the slow path. But maybe before we even do that, we should maybe talk through the architecture of Nezha a little bit more and some of the assumptions that you've made while you were designing this protocol. Yeah. So briefly speaking, I should say that the setting of Nezha is quite similar to the setting of traditional protocols. So we still have 2f+1 replicas. So typically you can have three replicas, five replicas — an odd number of replicas. And among these 2f+1 replicas,
Starting point is 00:21:30 one replica is elected as the leader and the other replicas are followers. So when the client wants to commit its request, it directly talks to one proxy, and the proxy will do the request multicast, and later it will also do the quorum check to decide whether this request is committed. And if the request is committed, it will return the execution result
Starting point is 00:21:53 to the client. And here, as I mentioned, with the DOM primitive we have a fast path: if there is no message reordering, then this request should be committed in the fast path. But if there is message reordering, it can lead to inconsistency among replicas. But that is not a big worry, because the proxy will detect the inconsistency. And then finally, the replicas can commit the request in the slow path. Amazing stuff. So let's go through the fast path then. So walk us through the fast path, the happy case where we've not got many messages out of order.
Starting point is 00:22:26 The DOM primitive has worked perfectly. Cool, yeah, take it away. Yeah, yeah, yeah. The fast path works in this way. So the proxy receives the request from the client, and it uses DOM to multicast the request to all the replicas.
Starting point is 00:22:43 In the happy case, we do not have message reordering. So in that way, every replica will receive the request and hold the request until its deadline. And then the request is released and appended to the log list of the replica. So here we have a smart design whereby only the leader executes the requests, following the increasing order of their deadlines. But the followers do not execute the requests.
Starting point is 00:23:09 So followers only need to maintain the log list. And then every replica will send a reply to the proxy. And we call this reply the fast reply. So inside the fast reply, every replica will include a hash value. And this hash value represents the log list on this replica. So after receiving the fast replies, the proxy can check the hash values inside the fast replies. So in that way, it can know whether these replicas have consistent states.
Starting point is 00:23:39 So if the proxy receives the fast reply from the leader and also receives matching fast replies from f plus half of the followers, then this request is considered committed in the fast path. And because the leader's fast reply will include the execution result, the proxy will obtain the execution result and reply to the client. So this is the fast path.
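[Editor's note: for readers who like to see the check spelled out, here is a hedged sketch of that fast-path quorum check at the proxy, reading "f plus half of the followers" as f + ⌈f/2⌉ matching fast replies. The reply format and function names are illustrative assumptions, not Nezha's actual API.]

```python
from math import ceil

def fast_path_committed(leader_reply, follower_replies, f):
    """Return the execution result if the fast quorum is reached, else None."""
    if leader_reply is None:
        return None
    # Followers whose hash matches the leader's hold an identical log prefix.
    matching = [r for r in follower_replies
                if r["log_hash"] == leader_reply["log_hash"]]
    if len(matching) >= f + ceil(f / 2):
        return leader_reply["result"]  # the leader speculatively executed it
    return None                        # otherwise, fall back to the slow path

# With 3 replicas, f = 1, so the leader plus 2 matching followers commit fast.
leader = {"log_hash": "abc", "result": "ok"}
followers = [{"log_hash": "abc"}, {"log_hash": "abc"}]
print(fast_path_committed(leader, followers, f=1))  # -> "ok"
```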
Starting point is 00:24:08 Nice, cool. So I guess now we've got the fast path, we need the slow path, right? So yeah, what happens when things are out of order and we can't take this speedy route, we can't take the shortcuts, we've got to go the long way around? How does that work? Yeah, cool. So as you mentioned,
Starting point is 00:24:19 the fast path is just about the happy case. So we cannot guarantee that every request arrives before its deadline. Let's say some request arrives very late, and it arrives at some replicas after its deadline. Then the replica receives this request and tries to append it to the log list.
Starting point is 00:24:39 But suddenly, the replica notices: oh, I have already appended a request with a larger deadline. So I cannot continue to append this incoming one, because this one has a smaller deadline. If I append it, that will violate the increasing order of deadlines in my log list. So how should we address that? So the leader replica and the follower replicas will take different actions. The leader replica is given more power. So if the leader replica finds that, okay, this incoming request's deadline
Starting point is 00:25:12 is too small, then the leader replica will modify the deadline of the incoming request to be a larger value. And then this request becomes eligible to be appended to the log list of the leader. But the follower replicas do not have so much power. So when a follower notices, okay, this incoming request's deadline is too small, we cannot append it, the follower will just put the request aside. It will not handle it
Starting point is 00:25:37 at this moment. So in that way, you can see, if the request is appended on some replicas but not appended on the other replicas, then these replicas will have an inconsistency. So in that way, they will generate different hash values in their fast replies. Then the proxy will detect this inconsistency. So the proxy will not be able to commit this request in the fast path. So here, we make the leader replica take an active role.
Starting point is 00:26:10 So that means every time the leader appends a request into its log list, the leader will also broadcast its log indexes to the other follower replicas. So after the followers receive the log indexes from the leader, the followers will listen to the leader. A follower may notice that, okay, my log list is inconsistent with the leader's, so I must modify my log list to keep consistent with the leader.
Starting point is 00:26:38 So here I hope you still remember, when I was describing the fast path, I said that only the leader executes the requests. Followers only maintain the log list. So this is quite useful in resolving the inconsistency between the leader and followers. Imagine that if we also asked the followers to execute the requests at the beginning, then when we try to resolve the inconsistency, maybe we would need to roll back at the follower. So that can be quite expensive and even unsupported by the application. But in our case, the followers do not execute the
Starting point is 00:27:13 requests. So to resolve the inconsistency, the followers just need to modify the log list, and then they can rapidly fix the inconsistency and keep their log lists consistent with the leader's. So after a follower has completed resolving the inconsistency, it will send another reply to the proxy, which we call the slow reply. And the proxy, after it receives the fast reply from the leader and also the slow replies from the followers, will consider this request committed in the slow path. And again, it can obtain the execution result from the leader's fast reply and reply that to the client.
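[Editor's note: here is a rough sketch of the two slow-path behaviors just described — the leader bumping a too-small deadline, and a follower adopting the leader's log rather than rolling anything back. Names and structure are illustrative assumptions, not the actual Nezha implementation.]

```python
def on_arrival(role, log, deadline, request, set_aside):
    """Append (deadline, request) to a log kept in increasing deadline order."""
    if not log or deadline > log[-1][0]:
        log.append((deadline, request))  # in order: normal append
    elif role == "leader":
        bumped = log[-1][0] + 1          # the leader may raise the deadline so
        log.append((bumped, request))    # the request still fits the ordering
    else:
        set_aside[request] = deadline    # a follower parks the late request and
                                         # waits for the leader's log indexes

def follower_adopt(leader_log):
    # Followers never execute requests, so repairing an inconsistency is
    # cheap: overwrite the log to mirror the leader's, then send the slow
    # reply to the proxy. No rollback of application state is needed.
    return list(leader_log)
```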
Starting point is 00:27:53 Fantastic. So I'm really interested to hear about the performance behind the fast path versus the slow path, and versus the other protocols out there. So I think it's time to talk some numbers. Can you tell us about your experiments and your setup, and what sort of benchmarks you ran and what you compared Nezha against? Yeah. Actually, we compared the Nezha protocol with multiple baselines — I remember it is about six to seven baselines —
Starting point is 00:28:20 including MultiPaxos, Fast Paxos, NOPaxos, Raft, EPaxos, Domino, and so on. So at the very beginning, we only conducted the evaluation in a local area network. So we just employed VMs in Google Cloud. So we ran the comparative evaluation in a single region. And we achieved much higher throughput and better latency. So the speedup can be about 1.3 to 20.9x compared with the baselines. But later, when we submitted to VLDB, we got some feedback from the reviewers, and the reviewers suggested that we should also do some evaluation in the wide area network with
Starting point is 00:29:01 multiple regions. And then when we supplemented the evaluation in the wide area network, actually the performance advantage of Nezha became even more distinct, because we know that in the wide area network, perhaps the most well-known protocol is EPaxos. EPaxos claims it can achieve optimal latency
Starting point is 00:29:23 in the wide area network. But that assumption no longer holds when we separate the clients and the replicas into different regions. So in the more general setting, when we have different regions for the clients and the replicas, only the Nezha protocol can achieve the optimal latency, which means we can commit the request in just one RTT in the fast path.
Starting point is 00:29:50 For EPaxos and MultiPaxos, they still need at least two or even 2.5 RTTs to commit the request. So that makes our latency performance much more distinct compared with the baselines. Yeah, for sure. I mean, one round trip versus 2 to 2.5, that's a big difference. Yeah, cool. And so I guess, kind of on this then,
Starting point is 00:30:13 this works well when, obviously, the DOM primitive is working great and our messages aren't being reordered. Did you do any experiments sort of in the extremes? What happens when — well, I don't know — what is the performance tipping point when a lot more of my messages are arriving out of order and I need to do some reordering, so I end up on the slow path
Starting point is 00:30:37 more than I do on the fast path? Yeah, actually, we have included such an experiment in our conference version. So, for DOM, the key parameter that we need to care about is how we should decide the estimated deadline. So here we are using a sliding window to estimate the one-way delay, and further we can decide the deadline. So for the sliding window, we maintain a hyperparameter which is called the percentile. So that means you collect some one-way delay samples, and then you just choose maybe the 50th percentile
Starting point is 00:31:17 or the 75th percentile as the estimated one-way delay. Of course, the larger the percentile you use, the longer the holding delay you will suffer, but in that way, you can reduce more message reordering. So we chose different percentiles to estimate the one-way delay and further decide the deadline. And finally,
Starting point is 00:31:38 we found that with the 50th percentile, we can achieve the best trade-off between the holding delay and the fast commit ratio. As we increase the deadline value, then, of course, we can continue to increase the fast commit ratio, but that increase is just marginal. We are just suffering from more holding delay, which causes longer latency.
Starting point is 00:32:06 So in our future work, we need to do more experiments to maybe develop an automatic mechanism to select the estimation method. But in our current design, we have used the sliding window estimation with the 50th percentile as the estimated value, which has proved to achieve the best trade-off.
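[Editor's note: here is a small sketch of the sliding-window percentile estimator described above; the window size, API, and defaults are assumptions for illustration, not Nezha's actual code.]

```python
from collections import deque

class OwdEstimator:
    def __init__(self, window=100, percentile=0.5):  # 50th percentile default
        self.samples = deque(maxlen=window)          # sliding window of OWDs
        self.percentile = percentile

    def add_sample(self, owd_us):
        self.samples.append(owd_us)                  # from a probing packet

    def estimate(self):
        # Larger percentiles cut more reordering but add holding delay; per
        # the discussion above, the 50th percentile gave the best trade-off.
        ordered = sorted(self.samples)
        idx = min(int(len(ordered) * self.percentile), len(ordered) - 1)
        return ordered[idx]

est = OwdEstimator()
for s in [20, 25, 40, 22, 31, 28]:
    est.add_sample(s)
print(est.estimate())  # median-ish OWD used to set the next deadline
```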
Starting point is 00:32:25 Yeah, I guess that probably is going to be the best answer in most cases, right? So for now it's good, but it's interesting, as a future research direction, sort of having this monitored at runtime and varied depending on what's happening in the underlying network or whatever, and changing it over time based on various factors. That'd be really cool to see. Awesome. Yeah, I guess, sort of building on top of this then — I mean, I always kind of ask this question — but are there any sort of scenarios in which Nezha is
Starting point is 00:33:00 suboptimal in terms of its performance? What are its limitations, I guess? Yeah, for the limitations, actually, I should admit it has limitations. Actually, I have uploaded the technical report of Nezha to arXiv. So in the appendix, I have talked about the limitations. I think the biggest limitation is still about the clock synchronization. As I have mentioned — actually I should emphasize — the Nezha protocol's correctness does not depend on the clock synchronization. So we only rely on clock synchronization for performance,
Starting point is 00:33:36 rather than correctness. But even if the clock synchronization fails, the protocol is still correct; only the performance will suffer some negative impact. So, for example, if the clock synchronization suddenly fails, then the DOM primitive is likely to generate very large deadlines, and when a replica receives a message with a very large deadline, this replica will just hold the message for a very long time. And this long holding time will be counted into the commit latency of this request. So that will lead to very bad latency performance. So in the appendix of the technical report,
Starting point is 00:34:24 I have discussed this issue and also discussed some mechanisms to address this problem. For example, when we have noticed that the holding time is very long, we can force this request to be committed in the slow path. So, in that way, we do not need to suffer from the long holding time caused by the wrong deadline. And here I want to highlight one thing: this kind of issue can theoretically be a problem, but in practical cases, it can hardly happen. I say this for two reasons.
Starting point is 00:35:01 The first reason is the Huygens algorithm we used in Nezha. So Huygens is a very robust clock synchronization algorithm, which means that it is using a robust probing mesh. So even if some clock synchronization agents fail, so long as there is no Byzantine node, the remaining healthy nodes can still synchronize with each other, and they can quickly converge to synchronized clocks. So the second reason is,
Starting point is 00:35:32 even if there is some Byzantine node which keeps polluting the clock synchronization, in the recent literature, I have noticed that there are some mechanisms developed. For example, you can use some watchdog mechanism, and the watchdog mechanism can detect the Byzantine node.
Starting point is 00:35:52 And then if this node keeps polluting my clock synchronization, we can just detect that and kick that node out of the synchronization scope. So in that way, these mechanisms will prevent the long-lasting negative impact from Byzantine nodes. Fascinating. So yeah, it's interesting to see. I mean, I'm a little bit detached from the whole Byzantine protocols, BFT protocols, world and don't keep up to speed with them too much. So it's really interesting to see that they're, in their own way, working on solutions around that — like a watchdog or whatever — and how that can be leveraged
Starting point is 00:36:27 for these sorts of protocols as well, which don't have the same sort of underlying assumptions. It's really cool. I had another question as well: how do you handle changing the member set of the protocol, like rolling in new members and rolling out old members?
Starting point is 00:36:45 How does Nezha handle that? The short answer is, for now, we just inherited the view-based approach. We will just use the Viewstamped Replication approach. But in the near future,
Starting point is 00:36:56 I think we can design more efficient membership changes. For example, maybe you have noticed the recent protocol called Matchmaker. I think it is also published in VLDB. So the Matchmaker protocol can achieve even more efficient reconfiguration. And I believe such a protocol can be adapted and applied
Starting point is 00:37:18 to support the membership management in a future version of Nezha. Fantastic stuff. So I guess, leading on from that then, where do you go next with Nezha? What's the next thing on the research agenda? Yeah, good question. So actually, you know, this is not the end of the story of Nezha. So currently
Starting point is 00:37:37 in my research, there are still three ongoing directions related to Nezha. So the first direction is to productize the Nezha protocol. So after the Nezha protocol was developed, my supervisors and I
Starting point is 00:37:53 all believe that the Nezha protocol can become a promising alternative and replacement for legacy protocols such as Raft and MultiPaxos. So currently I'm working with some industry colleagues and we are trying to refactor
Starting point is 00:38:10 the code base of the Nezha protocol. And after we have done that, we will continue to integrate Nezha into some industry systems. For example, Kubernetes. You know, currently Kubernetes is using etcd as the storage backend. And etcd actually is a standard implementation of Raft.
Starting point is 00:38:31 And based on the feedback from some industry staff, they told us that Kubernetes is suffering from a performance bottleneck due to the etcd backend. So if we can replace etcd with another protocol which offers higher performance, then we should be able to bring a performance boost to the Kubernetes system. And similar systems
Starting point is 00:38:54 such as Apache Pulsar, which is now using ZooKeeper — and Apache Pulsar is also suffering from a bottleneck due to ZooKeeper. And in the future, maybe we can also replace ZooKeeper with Nezha to bring more performance improvement for Pulsar. And another scenario is the financial system.
Starting point is 00:39:16 You know, financial systems also require fault tolerance. So this is something related to my first project, which is called CloudEx. And in the future, I think I will also apply the Nezha protocol to my first project to provide fault tolerance for the financial system. So this is the first direction on my schedule. And the second direction related to Nezha is to extend the application scope from consensus protocols to concurrency control protocols.
Starting point is 00:39:48 Oh, by the way, I listened to your podcast published last month when you were talking about the Detock project with Cuong. Yeah. That is a very insightful podcast and I enjoyed it a lot.
Starting point is 00:40:02 So, you know, based on my understanding, you see, the concurrency control protocol actually is trying to establish the ordering agreement across multiple shards. So that means you submit a transaction, and this transaction
Starting point is 00:40:17 will be executed across multiple shards, but we need to satisfy strict serializability. So the concurrency control protocol is trying to establish ordering agreement across multiple shards, while the consensus protocol is trying to establish the ordering agreement across multiple replicas. So these two share much similarity. So if we can do high-performance consensus with clock synchronization, then we should also be able to do high-performance concurrency control with clock synchronization. So actually, currently, I'm already implementing such a new protocol
Starting point is 00:40:53 which can provide both consensus and concurrency control. Nowadays, I have completed the design of the new protocol, and I'm now working on implementing and evaluating it. So hopefully, I think this new protocol will come out very soon in the near future. And maybe when that day comes, we can have another talk about the new protocol. And yeah, this is the second direction. And for the third direction, as you have mentioned, I'm also looking at some literature on the Byzantine fault-tolerant scenario. So at this moment, Nezha is just a crash fault-tolerant consensus protocol.
Starting point is 00:41:32 So compared with CFT consensus, BFT consensus is even more challenging, because in BFT consensus, we should not only handle the case where a node fails; we should also handle the case where a node is dishonest. So, how should we achieve a high-performance BFT consensus protocol? In the past three years, we have seen that many researchers are becoming more interested in using
Starting point is 00:41:55 TEEs to accelerate BFT. So TEE stands for Trusted Execution Environment. For example, Intel SGX is a very good TEE technique. So I'm thinking that by using the TEE technique and combining the TEE technique with the Nezha protocol,
Starting point is 00:42:17 we can also make Nezha work in the BFT scenarios. So those are the three directions. Fascinating. I mean, you're going to be busy for a long time. Those sound like fascinating directions, and I'm really excited. I mean, my next question was going to be, what impact do you think your work can have? But if you manage to get Nezha used in Kubernetes and in Apache Pulsar, and then obviously the financial application, that'd be some fantastic
Starting point is 00:42:41 industry impact, right? That'd be amazing. Exactly. And yes, you can definitely come on the show again and talk about the extension to concurrency control protocols. That'd be a fascinating chat, I'm sure. Yeah. And then the third direction as well, the BFT — that seems fascinating too. So yeah, I'm sure there's going to be loads of really interesting work coming out from yourself and from your lab over the coming years, for sure. So, I mean, I guess my next question was going to be, what impact do you think your work can have? But I don't know if you want to add anything more onto that, like how do you think people can leverage your findings
Starting point is 00:43:12 in their day-to-day sort of work? Yeah, as we have mentioned in the previous question, I think that by replacing the legacy protocols with Nezha, we can bring performance improvements to many distributed systems. Another thing which is also on my to-do list, maybe in the far future, is the DOM primitive: we think that we can make the DOM primitive a general middleware. So if we can make DOM a general middleware, we can plug it into any distributed system
Starting point is 00:43:51 which requires consistency. So in that way, it should be able to bring a general acceleration to distributed systems. So that can also be a promising direction that we want to pursue. But of course, implementing this general middleware may be more challenging. We need to consider many other issues.
Starting point is 00:43:51 So that can also be a promising direction that we need to do. But of course, by implementing this general middleware, it may be more challenging. We need to consider many other issues. Fair. I find that'd be really interesting. Yeah, for sure. I guess. Yeah. So of course, I am my, my, um, kind of just changing direction a little bit here. We sort of working on, on, on the, on the nether project. What's the most interesting thing that you've learned? Uh, actually, yeah, I, I think probably interesting thing that I want to share to, I think I want to share two lessons I learned. One good news and one bad news. Okay.
Starting point is 00:44:27 Let's first talk about the bad news. Yeah, sure. Yeah, so first let's talk about the bad news. So in distributed systems, while I was doing the research on consensus protocols, I realized that this area is really crowded. Consensus protocols have been
Starting point is 00:44:45 developed for more than 30 years, so there are too many smart people and they have spent so much effort developing different kinds of protocols. So this is what I call the history burden, because we are doing this 30 years after the pioneers. So
Starting point is 00:45:02 when we develop a new protocol, in order to make the new protocol stand out, we need to defeat many, many baselines. This is what we encountered in the reviewer feedback. So when we submitted the Nezha paper to some conference tracks, our reviewers were just arguing,
Starting point is 00:45:19 why don't you compare with baseline A? Why don't you compare with baseline B? So in order to satisfy the reviewers, we needed to add more and more evaluation in order to distinguish the novel protocol from the existing works. So for the young researchers and students who want to enter
Starting point is 00:45:35 this area and make some contributions in consensus protocols, I suggest that you must be prepared for the laborious implementation and evaluation work that you will need to do. So this is the bad news.
Starting point is 00:45:51 But there is also some good news. So the good news is, we should be aware that compared with the young researchers 30 years ago, we are now enjoying much better implementation conditions. You have many powerful techniques available that you can use to improve the protocols. So
Starting point is 00:46:13 if you want to make a contribution in consensus protocols, I'm sure that you can do something. Based on my understanding, there are two paths that you can consider. So the first path is that you can consider using some special network support, like what NOPaxos and Speculative Paxos have done. You can also use RDMA, programmable switches, and SDN. You can use these new techniques to enhance the performance of consensus protocols
Starting point is 00:46:41 and build your own contribution. And the second path is to use some software-based and deployable techniques, such as clock synchronization and TEEs. And by the way, TEEs are also available in public clouds like AWS and Microsoft Azure. So by using these new techniques, you can create a unique contribution, which you can do but the early researchers could not. So in that way, you can make your work stand out. I think that's a fantastic answer to that question. Yeah.
Starting point is 00:47:13 Great. I love the bad news and I love the good news as well. Great stuff. But it's so true, that kind of history burden. And it's true — obviously, I spend a lot of time in the same sort of field, and with database research it feels like everyone's already done it all in the 60s and 70s and 80s, right? So there's so much to learn and get your head around before you can then add some sort of layer on top and make
Starting point is 00:47:39 some incremental improvement in some area. But yeah, I guess that's the way the world is at the moment. But anyway, cool. Awesome. So yeah, I guess on the other side then, were there some things that you tried while you were working on Nezha that failed, that maybe the listener might be interested in knowing about? Yeah, yeah, I think this question is quite relevant to the previous question. Yeah, I think the problem is that in research — actually, this is a common issue for almost all systems-related research — behind the paper, you need to spend much, much more effort
Starting point is 00:48:17 on the implementation and evaluation. Actually, you know, the Nezha protocol was submitted three times, and it was accepted the third time. But the first two times, we submitted to other conferences and it was rejected. And the reason for the rejection was, they said, okay, we did not compare with the other baselines, or we did not do this experiment or that experiment. So, you know, this actually made me quite disappointed, because we had designed such a protocol.
Starting point is 00:48:48 The only problem is we needed to make our protocol defeat so many rivals. So this hard work cannot be reflected in the paper. So that is quite disappointing.
Starting point is 00:49:01 So when I got the rejection the first and the second time, my supervisors came to comfort me. They said, oh, Jinkun, do not be so disappointed. You know, the Raft paper was rejected six times. Really? Okay.
Starting point is 00:49:14 Well, yeah, that makes it easier now, right? You know, that got rejected six times — look how influential that protocol is. Yeah, yeah, yeah. Oh, yeah. Yeah, finally, so my supervisors always encouraged me. They said, we believe that the Nezha work is good work, so we just need to keep working on it.
Starting point is 00:49:28 And it is just a matter of time to get it published. And finally, yes, we got published in VLDB. And also the final review feedback was very positive. That encouraged us a lot. That's great. But it is kind of quite disheartening, right? You've got to do so much work that doesn't ever get any real credit in the paper, right?
Starting point is 00:49:50 And it's like, I don't know, maybe it's a kind of systemic problem with the system at the moment, I guess, with the academic system and the way that things work. And I think it's especially prevalent in systems research. But yeah, at least it got accepted eventually, and I'm glad it did get accepted, because it was a great read. So, we've got a couple more questions now. And this one is: can you maybe spend some time telling the listener about your other research? I'm sure you've worked on loads of interesting things in the past. So yeah, tell us about some of your other stuff. Yeah, yeah, yeah. Actually, I want to share some lessons,
Starting point is 00:50:26 share some tips that I have learned based on my past research experience. Yeah, go for it. Yeah, yeah. I think for PhD students, this is a common challenge that we all encounter. So how should you find the project that you think is promising
Starting point is 00:50:41 and that you think you can do? So I think PhD research can be divided into two steps. Step one, you need to have an idea — you should have an idea that you believe is promising. And step two, you need to implement your idea and demonstrate the effectiveness of your idea. So comparing these two steps, I think step one is harder, because this is the starting point. If you do not have the idea, you will just be blocked. You cannot make any progress.
Starting point is 00:51:09 So here I want to share my experience: how should we generate some interesting ideas? So there are two ways. The first way is, if you do not have an idea, I think at least you will have a general direction. For example, you can ask your supervisor to give you a general direction. Of course, the supervisor will not tell you what to do step by step.
Starting point is 00:51:29 That is your responsibility to figure out. So after you have got a general research direction, you will need to do a lot of literature review. For example, when I first decided to do something in consensus protocols, I just needed to read about 20 papers
Starting point is 00:51:49 which were published in the top conferences. So after you have read 20 papers, I believe that you should already get some sense — you should already have some inspiration. If you are saying, okay, after I read 20 papers, I still have no idea — then don't worry.
Starting point is 00:52:04 Keep going and continue to read another 20 papers, and keep doing this. Finally, you will find something interesting. This is the first option that I suggest to other PhDs. And the second option is to get your hands dirty. For example, if you do not know what to
Starting point is 00:52:20 do, you can find a mature system. For example, if you want to do something in consensus protocols but you do not know what to create, then you can start by implementing the Paxos protocol or by implementing the Viewstamped Replication protocol by yourself.
Starting point is 00:52:35 So when you are doing the implementation, you will learn some potential bottlenecks, some potential issues, that the other guys cannot know if they do not implement it themselves. So based on that, perhaps at the beginning, you will have a very small idea. And then you continue with your small idea. During the implementation, you will make your idea bigger and bigger. And finally, probably you will have a good work.
Starting point is 00:53:00 So that is the second tip I want to share with the other PhD students. That's fantastic. Honestly, I love that. I think that's fantastic advice: read, read, read, keep reading. It's a great way to generate ideas, right? That was sort of the approach that I took initially — I was like, if I can read all of these distributed transaction papers, then surely maybe one day I can have an idea of my own. And then, obviously, get your hands dirty, right? And I think that kind of advice is true in life as well, in general, in all things, right? You've got to crack a few eggs to make an omelette, right? So you've
Starting point is 00:53:39 got to get your hands dirty, get in there, and then ideas flow from there, right? And the first thing you do might be terrible, but everything is at first, right? The first podcast I recorded was terrible, and hopefully by now they've got a little bit better — I don't know, I'll let the listener be the judge of that. I'm joking, I'm joking. But no, I think that's brilliant. So I think that kind of answers — I normally ask about people's creative process, but you've kind of answered it there as well. So I think that's a really, really cool answer. I really enjoyed that.
Starting point is 00:54:08 And I guess, yeah, it's time for the last question now: what's the one thing you want the listeners to take away from this podcast today? Yeah, yeah. To summarize in one sentence, I would say clock synchronization is becoming the new powerful weapon that we can use
Starting point is 00:54:23 to accelerate distributed protocols. That is what I want to share with the audience. That's fantastic. Great. Let's end it there. Thank you so much, Jinkun, for coming on the show. It's been a fascinating chat. I've loved it.
Starting point is 00:54:36 If the listener wants to know more about Jinkun's work, we'll put a link to everything in the show notes. And again, if you do enjoy the show, please do consider supporting us through Buy Me a Coffee. It really helps us to keep making the show. And we'll see you all next time for some more awesome computer science research.
