Disseminate: The Computer Science Research Podcast - Jinkun Geng | Nezha: Deployable and High-Performance Consensus Using Synchronized Clocks | #42
Episode Date: October 23, 2023
Summary: In this episode Jinkun Geng talks to us about Nezha, a high-performance consensus protocol. Nezha can be deployed by cloud tenants without support from cloud providers. Nezha bridges the gap between protocols such as MultiPaxos and Raft, which can be readily deployed, and protocols such as NOPaxos and Speculative Paxos, that provide better performance, but require access to technologies such as programmable switches and in-network prioritization, which cloud tenants do not have. Tune in to learn more!
Links: Jinkun's Homepage | Nezha VLDB'23 Paper | Nezha GitLab Repo
Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate, the computer science research podcast. I'm your host, Jack
Wardby. A reminder that if you do enjoy the show, do please consider supporting us through
buying me a coffee. It really helps us to keep making the show. So yeah, please do consider
supporting us. Today, I'm joined by Jinkun Geng, who will be telling us everything we need to know
about Nezha: deployable and high-performance consensus using synchronized clocks. Jinkun
is a PhD student at Stanford
University. Jinkun, welcome to the show. Yeah, thanks for the invitation, Jack.
Great stuff. So let's jump straight in then. So can you tell us a little bit more about yourself
and how you became interested in databases, distributed systems, and all the cool research
that you do? Yeah, cool. Sure. So my name is Jinkun Geng, and I'm now a fifth-year PhD student in the CS department at Stanford University.
So my research direction is networking and systems.
So more specifically, my current research project is focusing on high-performance distributed protocols, including consensus protocols and concurrency control protocols. So before I joined Stanford in 2019,
I completed my undergraduate study and my master's study both in China.
So I completed my undergraduate study in Beihang University,
and then I entered Tsinghua University to obtain my master's degree.
So regarding my research career, I think I can divide that into two parts.
So the first part is before my PhD.
So mostly I conducted research at Tsinghua University,
advised by Professor Dan Li. So at that time, I was doing high-performance networking,
which means I used DPDK and RDMA to accelerate the communication in distributed systems.
So further, based on that, we can continue to bring performance improvement on distributed
applications, such as distributed machine learning applications.
And that is what I was doing during my master's.
And when I came to Stanford, I finally found my current supervisors.
So I joined the group of Professor Balaji Prabhakar.
And Balaji is my primary advisor.
And in the meantime,
I'm also co-advised
by Professor Mendel Rosenblum
and also Professor Anirudh Sivaraman.
So Mendel is another professor
at Stanford,
and Anirudh is a professor
at New York University.
So in our current group,
the general direction is called time-sensitive networks.
So the motivation for this general direction is like this.
So dating back to decades ago,
such as 10 years or 20 years ago,
at that time, clock synchronization
had not been a mature technique. So the clock
synchronization error could be quite bad 20 years ago. The clocks of different nodes could
even diverge by up to 100 milliseconds. So the early researchers in distributed systems
could not make strong assumptions about clock synchronization.
But in the recent years, especially in the past three to five years,
we have seen that clock synchronization is developing very fast.
And traditional protocols like NTP have improved their synchronization accuracy a great deal.
And some new clock synchronization protocols
have also been proposed,
like the synchronization algorithm
we have used in the Nezha protocol,
which is called Huygens,
which is developed by another senior lab mate in our group.
So with the new synchronization algorithm,
we can reach the clock synchronization accuracy
at nanosecond level.
So in that way, with this new powerful tool, we come to ask whether we can revisit and even
revolutionize the previous distributed system research and achieve much better performance.
And the consensus protocol is one such scenario. Fantastic, that's a great mission
statement for your lab. I mean, yeah, like you say, it's a different
ball game from 20 years ago, right? The underlying assumptions have changed, we've got
better hardware. So how does that change the state of play? I guess
that's what we're going to be talking about today in some more depth. So cool, let's start
off with a little bit more background for the listener.
So before we dig into all the nitty-gritty and everything,
can you tell us more about kind of consensus protocols generally?
What are they and why do we need them?
Yeah, okay.
So, you know, actually consensus protocol is not as widely applied
in many distributed systems because distributed systems require fault tolerance.
So consensus protocols provide the fault tolerance for the systems,
so it can ensure the service availability for the distributed systems.
When one server fails, you just need to use the other servers to replace the failed ones,
and it can resume the service for the systems. So simply speaking, the consensus
protocols can be defined in a scenario with multiple servers. So the protocol needs to
coordinate the multiple servers so that they can reach an agreement on a single data value.
So this is commonly known as single-decree Paxos, or you can call that single-decree consensus,
because Paxos has already become a synonym for consensus.
But of course, in practical cases, single-decree consensus is not sufficient for applications.
So in many practical cases, we can make abstraction of the application's execution.
We represent the execution history
of the application as an ordered log list. So it becomes the responsibility of the consensus
protocol to make sure that every server has the same ordered log list. So in that way,
when the active server fails, we just need to use another server which has the same ordered log list to replace it.
By re-executing the log list,
we can recover the same application state
and then we can resume the service.
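To make the replicated-log idea concrete, here is a minimal Python sketch (illustrative only, not Nezha-specific): if every server applies the same ordered log to the same deterministic state machine, replaying that log on a replacement server reproduces the same application state.

```python
# Illustrative sketch of state machine replication: replaying the same
# ordered log on a deterministic state machine reproduces the same state,
# which is why a server with the same log can take over after a failure.

class KVStateMachine:
    def __init__(self):
        self.state = {}

    def apply(self, cmd):
        op, key, *rest = cmd
        if op == "put":
            self.state[key] = rest[0]
        elif op == "delete":
            self.state.pop(key, None)

def recover(ordered_log):
    sm = KVStateMachine()
    for cmd in ordered_log:      # replay in the agreed order
        sm.apply(cmd)
    return sm

log = [("put", "x", 1), ("put", "y", 2), ("delete", "x")]
assert recover(log).state == recover(list(log)).state == {"y": 2}
```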
Fantastic.
That's a great description of what consensus is for.
I mean, I personally work with one every day.
So we have Raft at work.
So I have fun with that quite a lot of time.
But cool, right?
Anyway, I'm getting sidetracked, so let me keep on focus. So of course, you mentioned there that Paxos is sort of almost the synonym for consensus protocols. I know there's a
host of other ones like Raft out there, but what are
the problems with the current state-of-the-art protocols? And why do we need to have a new one?
What's the motivation here?
The problems and then, I guess, leading into the motivation for your work.
Yeah.
I think, based on my understanding, for the state of the art,
for the existing consensus protocols,
the drawbacks can be summarized into two dimensions.
So the first dimension is performance. and the second dimension is deployability.
As you mentioned, for the traditional protocols like MultiPaxos and Raft,
when they are designing the protocols, they cannot make strong assumptions.
So for these protocols, of course, they do not have strong assumptions,
so they can be deployed everywhere.
But the performance is not very good.
And for these protocols,
they usually suffer from the throughput bottleneck.
And also the commit latency can be very long.
So that will become the bottleneck
for the system and applications running on top of them.
So these drawbacks of the traditional protocols
have already been realized by some recent works.
So some recent protocols like NOPaxos and Speculative Paxos are trying to achieve high performance.
So in order to do that, they decided to use some network support, such as programmable switches and SDN control.
But in that case, it introduces another problem,
which is the deployability.
Because we know in the public cloud, which is the general environment for many systems and applications,
we do not have such strong support.
From the perspective of the cloud tenant,
the network is just like a black box,
so you cannot use programmable switches
or SDN control.
So in that way,
the high-performance protocols
developed in the recent years
cannot be easily deployed
in public cloud.
So that is why it motivates us
to design a new protocol.
We hope that we can achieve
both high performance
and deployability
so that we can provide benefit
to the cloud systems and applications.
Yeah, that's fantastic.
I mean, I've always had this sort of, whenever I come across a cool paper that's used some
sort of modern hardware that's cutting edge and it says, yeah, look how fast we can make
this system go using this hardware or we can solve these problems using this hardware and
it goes really fast.
It's what you mentioned there, this deployability: that's not generally
available in the public cloud, and therefore it's, well, not irrelevant, but I
can't use it day to day because not everyone has access to it, right? So I
think it's great to factor that sort of dimension into your design process. So I guess with
these sort of two dimensions in mind then,
give us the elevator pitch for Nezha.
And I think I'm pronouncing it right.
Is it Nezha?
Nezha, yeah.
Yeah, okay.
What does it mean?
What does it mean?
Nezha is the name of a Chinese legendary figure.
So this figure has three heads and six arms.
So you can imagine that if you cut off one head, this guy is still alive.
So it has stronger fault tolerance.
I get it. I get it. That's a great name.
That's a great name.
Cool. So give us the elevator pitch then.
How are we going to achieve these two goals?
Yeah, I think there are three pillars in the Nezha design.
The first aspect, of course,
is clock synchronization.
So let's first think about this question.
Why does clock synchronization
accelerate
consensus protocols?
So essentially,
consensus protocols are trying
to establish agreement
on the ordered log lists.
So you can imagine that if you have multiple replicas
and you have many clients,
imagine that each request is multicast to every replica,
and imagine that each request will arrive
at each replica in the same order.
So in that way, these replicas will establish
the same ordered log list by nature.
So we do not even need any coordination.
This is the ideal case, but this does not always happen.
This can hardly happen, actually, in the public cloud,
because in the public cloud, we have observed that
message reordering occurs very frequently.
So in that way, if we can reduce the message reordering in the network,
then we should be able to accelerate the consensus
protocols. This design principle has already been realized by the existing works like what I
mentioned, such as NOPaxos and Speculative Paxos. So they decided to resort to networking support
like SDN control to control the networking transmission in order to reduce the message
reordering. But also, as I mentioned, this network support is not available in the public cloud,
but clock synchronization is already available in the public cloud.
So we decided that we can use clock synchronization to reduce the message reordering
so that we can accelerate the consensus.
So motivated by that, we first developed a primitive,
which is called the Deadline-Ordered
Multicast primitive, and we can talk about that later. And based on the DOM primitive,
we continued to develop the Nezha protocol to achieve high performance. So this is the first
aspect I want to highlight. So the second aspect I want to say is the speculative design, which we
included in the Nezha protocol.
So if we take a review of the traditional protocols like MultiPaxos and Raft, we can
see that for these protocols, they ask the leader to execute the request after the leader
has confirmed that the request has been replicated to a majority of replicas. So the confirmation
operation is also known as the quorum check, which means that in the traditional protocols
like MultiPaxos and Raft, only after the quorum check is completed can the execution
be done. So this actually prolongs the commit workflow. So we don't want that. So in our design, we just ask the leader to aggressively execute the request.
So that means it is no longer the responsibility of the leader to do the quorum check.
And we have another entity to help the protocol do the quorum check to decide whether this request is committed or not.
So this is the second aspect, the speculative execution.
And following that, the third aspect is also related to the second one.
So which means in our protocol, we have extracted a stateless proxy design.
So the proxy becomes an important component.
It helps us to do the request multicast and help us to do the quorum check.
So in that way, for these heavy operations, which are used to be done by the leader,
we have offloaded them into the proxy.
So in that way, the leader's bottleneck becomes much reduced.
And because the proxy is stateless, we can scale up and down the number of the proxies based on the workload.
So if your workload is heavy, you can deploy more proxies.
Otherwise, you just need to reduce the number of proxies.
So because the proxy is stateless, the failure of the proxy does not affect the correctness of the protocol.
And because it is scalable, it helps us to reduce much of the bottleneck as a leader.
And additionally, because the proxy also serves as a middleware,
it actually serves as an adapter for us to migrate distributed systems from legacy protocols to Nezha.
For example, many systems are running on top of MultiPaxos and Raft. By just implementing an adapter in the proxy,
we can easily migrate the system from Raft and MultiPaxos to Nezha
so that it can also enjoy the high performance provided by the Nezha protocol.
Fantastic. So just to recap there briefly,
the three pillars of Nezha there are the deadline-ordered multicast, which is helping us with our...
I did have a question on that, actually, before we continue.
You mentioned that in the public cloud, it's been observed that message reordering happens frequently.
Do you have any sort of numbers with that, as a rough guideline of how frequent we're talking?
Oh, yes.
So in our paper, we have included that number. We have
included that experiment. So based
on our... First,
we define a metric, which is called
reordering score, and to measure
the frequency
of the reordering. And based on our measurement,
we have found that so long as
you have multiple senders, and so long
as the sending rate increases, you will
reach a very high reordering score,
which can be 20% to 40%.
So that indicates a very, very frequent reordering
in the public cloud
when you do not have any control.
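The exact formula for the reordering score isn't spelled out in this conversation (it is defined in the paper); as a purely hypothetical illustration, one simple way to quantify reordering at a receiver is the fraction of messages that arrive after a message that was sent later than them:

```python
# Hypothetical illustration only -- not necessarily the paper's metric.
# Count the fraction of messages that arrive "late", i.e. after a message
# with a higher send sequence number has already been received.

def reordering_fraction(arrival_order):
    """arrival_order: send sequence numbers in the order the receiver saw them."""
    if not arrival_order:
        return 0.0
    late, max_seen = 0, -1
    for seq in arrival_order:
        if seq < max_seen:
            late += 1
        max_seen = max(max_seen, seq)
    return late / len(arrival_order)

print(reordering_fraction([0, 2, 1, 3, 5, 4]))  # 2 of 6 arrive late -> ~0.33
```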
Yeah, that's larger than I thought it would be.
That's very interesting.
Cool.
So just to continue the quick recap there, we also have the speculative execution, so we don't
have this sort of commit latency waiting for the leader to do
everything. And then the last thing is we have this stateless proxy design, which allows us to
offload the multicast and quorum check and all that to some sort of middleware, so again we're offloading
stuff from the leader. Awesome, cool, great. So let's dig into the details a little bit more.
So let's start off with Deadline Ordered Multicast.
Tell us more about it.
Yeah.
Explain it to us.
Yeah.
As I have mentioned, message reordering is a performance killer for consensus protocols.
So that is why we developed the Deadline Ordered Multicast, or simply we can call it DOM.
We developed DOM primitive to help us reduce the message reordering in the public cloud.
So the design of DOM primitive is quite simple.
So suppose that we have multiple senders and we have multiple receivers,
and the senders will multicast the messages to the receivers.
We hope that the receivers can receive the same sequence of messages.
So in order to do that, first,
we synchronize the clocks of the sender and receivers. So with the clock synchronization,
the sender can measure the one-way delay between the sender and the receivers. So in that way,
when the sender wants to multicast the messages, it will attach a deadline on the messages.
So the deadline is decided by the sending time
plus the maximum estimated one-way delay
between the sender and all the receivers.
So in the good case, in most cases,
when the request is multicast,
it should be able to arrive at each receiver
before its deadline.
So in that way, after receiving the messages,
the receiver will just hold the message until its deadline.
And then it will append this message to the log list.
So we can imagine that if the DOM can eliminate
all the message reordering, then each receiver
should establish an order log list, which
follows the increasing order of their deadlines.
So in that way, we can achieve the consistency among these receivers.
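As a minimal sketch of the DOM idea described here (Python, with illustrative names; not the actual Nezha implementation): the sender stamps each message with a deadline of the send time plus the largest estimated one-way delay to any receiver, and each receiver holds messages and releases them in deadline order once their deadlines have passed.

```python
# Minimal sketch of deadline-ordered multicast (DOM) as described above.
# Names and structure are illustrative, not the Nezha code base. Messages
# that arrive after their deadline are the slow-path case handled by the
# protocol itself, not by this sketch.
import heapq

class DomSender:
    def __init__(self, owd_estimate_us):
        # owd_estimate_us: estimated one-way delay (microseconds) per receiver
        self.owd_estimate_us = owd_estimate_us

    def stamp(self, msg, send_time_us):
        # deadline = sending time + max estimated one-way delay to any receiver
        deadline = send_time_us + max(self.owd_estimate_us.values())
        return (deadline, msg)

class DomReceiver:
    def __init__(self):
        self.held = []   # min-heap of (deadline, msg) still being held
        self.log = []    # released messages, in increasing deadline order

    def on_message(self, deadline, msg):
        heapq.heappush(self.held, (deadline, msg))

    def release_due(self, now_us):
        # release every held message whose deadline has already passed
        while self.held and self.held[0][0] <= now_us:
            self.log.append(heapq.heappop(self.held))

sender = DomSender({"r1": 20, "r2": 40, "r3": 30})
deadline, msg = sender.stamp("request-1", send_time_us=1_000)
print(deadline)  # 1040: send time plus the largest one-way delay (cf. the
                 # 20/40/30 microsecond example later in the conversation)
```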
But we should also notice that the DOM primitive is just the best effort primitive.
That means we are trying our best to reduce the message reordering,
but we do not guarantee that we can eliminate all the reordering.
So that is why, for Nezha,
we should have a fast path
to commit a request efficiently
if the message reordering
is completely eliminated.
But if the message reordering still occurs,
it can still lead to inconsistency.
So in that way,
we still need to have a slow path in the Nezha
protocol to resolve the inconsistency and
finally commit a request. Awesome. Cool. So I'm
just trying to kind of get my head around the idea of the
DOM primitive then. So I guess the deadline is every time
we want to send a message, the clocks are synced. So I kind of
say, okay, I guess the parameter
that you control there is how long you
set the deadline to be, right? I guess that's something you can manually control. Is that
sort of configurable? Like, how do you know how long to set the deadline for, basically? Yeah,
we decide the deadline in this way. So as I mentioned, we have the clock synchronization,
so in that way, periodically, we will send some probing packets from the sender to every receiver.
So in the probing packet, we include the sending time.
And when the receiver receives the probing packet,
it can calculate how long it takes for the probing packet to arrive at the receiver.
So that is what we call one-way delay between the sender and the receiver.
And by collecting
these one-way delay samples, we can make an estimation of the one-way delay between the
sender and receivers. So for example, if you have three receivers and the sender sends a
probing packet to these three receivers, between receiver 1 and the sender, maybe
the one-way delay is 20 microseconds.
And between the sender and receiver 2,
it is maybe 40 microseconds.
And between the sender and receiver 3, it may be 30 microseconds.
So since we have known these one-way delays
between the sender and receivers,
so when a sender wants to multicast the messages,
it will use the sending time plus the max of the one-way delays.
So that should be sending time plus 40 microseconds.
And that will be the deadline to multicast this message.
So in the good case,
this message should be able to arrive at all the receivers before the deadline.
And then if it arrives early,
then the receiver will hold the message until its deadline,
which means the receiver will
just check its clock time. And then when the clock time passes the deadline, it will release the
request and append it to the log list. Fantastic. I'm really excited to kind of hear some numbers
on this later on, but cool. So before we jump to that, we need to get through the fast path and
the slow path. But maybe before we even do that, we should maybe talk through the architecture of Nezha
a little bit more and some of the assumptions that you've made while you were designing
this protocol.
Yeah.
Yeah.
So briefly speaking, I should say that the setting of Nezha is quite
similar to the setting of traditional protocols.
So we still have 2f+1 replicas.
So typically you can have 3 replicas or 5 replicas.
You can have an odd number of replicas.
And among these 2F plus 1 replicas,
one replica is elected as a leader
and the other replicas are followers.
So when the client wants to commit its request,
it directly talks to one proxy
and the proxy will do the request multicast,
and later it will also do the quorum
check to decide whether this request is committed
and if the request is committed it will return the execution result
to the client. And here, as I mentioned, with the DOM primitive we have a fast
path. If there is no message reordering, then
this request should be committed in the fast path. But if there is
message reordering, it can lead to the inconsistency among replicas. But that is not a big worry,
because proxy will detect the inconsistency. And then finally, the replicas can commit a request
in the slow path. Amazing stuff. So let's go through the fast path then. So walk us through
the fast path, the happy case where we've not got many messages out of order.
The DOM primitive has worked perfectly.
Cool, yeah, take it away.
Yeah, yeah, yeah.
For the fast path, it works in this way.
So the proxy receives
the request from the client.
And so it uses a DOM to multicast the request
to all the replicas.
In the happy case, we do not have message reordering.
So in that way, every replica will receive the request
and hold the request until its deadline.
And then this request is released and appended
to a log list of this replica.
So here we have a smart design so that only the leader will
execute the request following the increasing order of their deadline.
But the followers do not execute the request.
So followers only need to maintain the log list.
And then every replica will send a reply to the proxy.
And we call this reply fast reply.
So inside the fast reply,
every replica will include a hash value.
And this hash value represents the log list on this replica.
So after receiving the fast replies, the proxy can check the hash value inside the fast replies.
So in that way, it can know whether these replicas have consistent states.
So if the proxy receives the fast reply from the leader and also receives fast replies from f + ⌈f/2⌉ of the followers,
then this request is considered committed in the fast path.
And because the leader's fast reply will include the execution result,
so the proxy will obtain the execution result and reply to the client.
So this is the fast path.
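A hedged sketch of what the proxy's fast-path check might look like, based on the description here (Python, with illustrative names; the f + ⌈f/2⌉ follower count below follows this conversation, and the paper has the authoritative quorum rule):

```python
# Illustrative sketch of the proxy's fast-path quorum check, as described
# above: commit on the fast path only if the leader's fast reply and enough
# follower fast replies all carry the same log-list hash.
import math

def fast_path_result(leader_reply, follower_replies, f):
    """Replies are dicts like {'hash': ..., 'result': ...} (illustrative)."""
    if leader_reply is None:
        return None
    needed_followers = f + math.ceil(f / 2)
    matching = sum(1 for r in follower_replies
                   if r["hash"] == leader_reply["hash"])
    if matching >= needed_followers:
        return leader_reply["result"]   # leader's fast reply carries the result
    return None                         # otherwise wait for the slow path

# e.g. f = 1 (3 replicas): the leader plus 2 matching followers are needed
```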
Nice, cool.
So I guess now we've got the fast path,
we need the slow path, right?
So yeah, so what happens
when things are out of order
and we can't take this speedy route,
we can't take the shortcuts,
we've got to go the long way around.
How does that work?
Yeah, cool.
So as you have mentioned that,
this is the fast path
is just about the happy case.
So we cannot guarantee
that every request can arrive before its deadline.
Let's see if some requests arrive very late
and it arrives at some replicas after the deadline.
Then the replica receives this request.
It tries to append this request to the log list.
But suddenly, the replica notices,
oh, I have already appended a request with a larger deadline.
So I cannot continue to append this incoming one because this guy has a smaller deadline.
If I append that, that will violate the increasing order of deadline on my log list.
So how should we address that?
So for the leader replica and the follower replicas, they will take different actions.
So for the leader replica,
we gave them much power. So if the leader replica finds that, okay, this incoming request deadline
is too small, then the leader replica will modify the deadline of the incoming request to be a larger
value. And then this request becomes eligible to be appended to the log list of the leader.
But for the follower replicas, they do not have so much power.
So when the follower notices, okay,
the incoming request's deadline is
too small, we cannot append that, the
follower will just put the request aside.
It will not handle that
at this moment. So in that way, you can
see, if some request is
appended to some replicas but not appended on the other replicas, then
these replicas will have inconsistency. So in that way, they will generate different hash values
in the fast replies. Then the proxy will detect this inconsistency. So the proxy will not be able to commit this request
in a fast path.
So here, we make the leader replica take an active role.
So that means every time the leader
appends the request into a log list,
the leader will also broadcast its log indexes
into the other follower replicas.
So after the followers receive the log indexes from the leader,
the followers will listen to the leader.
A follower may notice that, okay, my log list is inconsistent with the leader's,
so I must modify my log list to keep consistent with the leader.
So here I hope you still remember, when I was describing the fast path,
I'm saying that only the leader will execute the request.
Followers only maintain the log list.
So this is quite useful in resolving the inconsistency between the leader and followers.
So imagine that if we also ask the followers to execute the request at the beginning,
then when we are trying to resolve the inconsistency,
maybe we will need to roll back at the follower.
So that can be quite expensive and even unsupported by the application. But in our case, the followers do not execute the
request. So to resolve the inconsistency, the followers just need to modify the log list,
and then it can rapidly fix the inconsistency and keep a log list consistent with the leader's.
So after a follower has completed resolving the inconsistency,
it will send another reply to the proxy,
which we call the slow reply.
And the proxy, after it receives the fast reply from the leader
and also slow replies from f of the followers, will consider
this request committed in the slow path. And again, it can obtain the execution result from the leader's fast reply and reply with that to the client.
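To make the slow-path behaviour concrete, here is a hedged sketch (illustrative Python, simplified from the description above, not the Nezha implementation) of how a replica might handle a request that is released after a later-deadline entry has already been appended:

```python
# Simplified sketch of the replica-side slow-path behaviour described above.
from dataclasses import dataclass, field

@dataclass
class Replica:
    is_leader: bool
    log: list = field(default_factory=list)      # (deadline, request) pairs
    pending: list = field(default_factory=list)  # set-aside requests (followers)

def on_release(replica, deadline, request):
    log = replica.log                      # kept in increasing-deadline order
    if not log or deadline > log[-1][0]:
        log.append((deadline, request))    # normal case: still in order
        return "appended"
    if replica.is_leader:
        # the leader may bump the deadline past the last entry and append
        new_deadline = log[-1][0] + 1
        log.append((new_deadline, request))
        return "appended with modified deadline"
    # a follower sets the request aside and later syncs its log with the
    # log indexes the leader broadcasts, before sending its slow reply
    replica.pending.append((deadline, request))
    return "deferred"
```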
Fantastic. So I'm really interested to hear
how the performance of the fast path compares with the slow path,
and with the other protocols out there. So I think it's time to talk some numbers.
Can you tell us about your experiments and your setup,
and what sort of benchmarks you ran, and what you compared Nezha against?
Yeah.
Actually, we have compared the Nezha protocol with multiple baselines,
including, I remember it is about six to seven baselines,
including MultiPaxos, Fast Paxos, NOPaxos, Raft,
EPaxos, Domino, and so on.
So at the very beginning, we only conducted our evaluation in a local area network.
So we just employed the VMs in Google Cloud.
So we ran the comparative evaluation in a single cloud region.
And we have achieved much higher throughput and better latency. So the speedup can be about 1.3 to 20.9x compared with the baselines.
But later, when we submitted to the VLDB, we got some feedback from the reviewers, and
reviewers suggested that we should also do some evaluation in the wide area network with
multiple regions.
And then when we supplemented the evaluation
in the wide area network,
actually the performance advantage of Nezha
becomes even more distinct
because we have known that in the wide area network,
perhaps the most well-known protocol is EPaxos.
EPaxos claims it can achieve optimal latency
in the wide area network.
But that claim no longer holds
when we separate the client and the replicas into different regions.
So now in the more general setting,
when we have different regions for the client and the replicas,
only the Nezha protocol can achieve the optimal latency,
which means we can commit the request
in just one RTT in the fast path.
For EPaxos and MultiPaxos,
they still need at least 2 or even 2.5 RTTs
to commit the request.
So that makes our latency performance
much more distinct compared with the baselines.
Yeah, for sure.
I mean, like one round trip
versus 2 to 2.5, that's a big difference. Yeah, cool. And so I guess, kind of on this then,
this works well when obviously the DOM primitive is working great and our messages
aren't being reordered. Did you do any experiments sort of in the extremes?
What happens when, well, I don't know,
what is the performance tipping point
when a lot more of my messages
are arriving out of order
and I need to do some reordering,
so I end up on the slow path
more than I do on the fast path?
Yeah, actually, I think
we have included such an experiment
in our conference version.
So, for DOM, the key parameter that we need to care about is how should we decide the estimated deadline.
So here we are using the sliding window to estimate the one-way delay and further we can decide the deadline.
So for the sliding window, we have maintained a hyperparameter which is called the percentile. So that means you collect
some one-way delay (OWD) samples, and then you will just choose maybe the 50th percentile
up to the 75th percentile as the estimated one-way delay. Of course, the larger the percentile you use,
then you will suffer from longer holding delay,
but in that way, you can reduce more
message reordering. So we have
chosen
different percentiles to
estimate the one-way delay and further
decide the deadline. And finally,
we found that with the 50th percentile,
we can achieve the best
trade-off between the holding delay
and the fast commit ratio.
As we increase the deadline value,
then, of course, we can continue to increase the fast commit ratio,
but that increase is just marginal.
We are just suffering from more holding delays, which causes longer latency.
So in our future work,
we need to do more experiments
to maybe develop an automatic mechanism
to select the estimation method.
But in our current design,
we have used the sliding window estimation
with the 50th percentile as the estimated value.
That has been proved to achieve the best trade-off.
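A sketch of the sliding-window, percentile-based estimation described here (Python, with assumed parameter names and window size): probe packets carry their send time, the receiver keeps the most recent one-way-delay samples, and a chosen percentile of that window becomes the estimate that feeds the DOM deadline.

```python
# Illustrative sketch of sliding-window, percentile-based one-way-delay
# estimation as described above; names and the window size are assumptions.
from collections import deque

class OwdEstimator:
    def __init__(self, window=100, percentile=50):
        self.samples = deque(maxlen=window)  # sliding window of recent OWD samples
        self.percentile = percentile         # 50 was the reported sweet spot

    def on_probe(self, send_time_us, recv_time_us):
        self.samples.append(recv_time_us - send_time_us)

    def estimate_us(self):
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        idx = min(len(ordered) - 1, len(ordered) * self.percentile // 100)
        return ordered[idx]

# A larger percentile means longer holding delays but fewer requests that
# miss their deadline; per the conversation, the 50th percentile gave the
# best trade-off between holding delay and fast-commit ratio.
```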
Yeah, I guess that probably is going to be the best answer in most
cases, right? So for now it's good, but it's interesting, like, as a future research direction,
sort of having this monitored at runtime and varied depending on what's
happening in the underlying network or whatever, and changing it over time based
on various factors.
It'd be really cool to see that. Awesome. Yeah, I guess,
building on top of this then, are there any sort of, I mean, I always like to
ask this question, but are there any scenarios in which Nezha is sort of
suboptimal in terms of its performance? Like, what are its limitations, I guess?
Yeah, for the limitations, actually,
I should admit it has limitations.
Actually, I have uploaded the technical report of Nezha to arXiv.
So in the appendix, I have talked about the limitations.
I think the biggest limitation is still about the clock synchronization.
As I have mentioned, actually I should emphasize that the Nezha protocol's correctness does not
depend on the clock synchronization. So we only rely on clock synchronization for performance
rather than correctness. So even if the clock synchronization fails, the protocol is still
correct, but the performance will suffer from some negative impact.
So, for example, if the clock synchronization suddenly fails,
then the DOM primitive is likely to generate very large deadlines,
and when a replica receives a message with a very large deadline, this replica will just hold the message for a very long time.
And this long holding time will be counted into the commit latency of this request.
So that will lead to very bad latency performance.
So in the appendix of the technical report,
I have discussed this issue and also discussed
some mechanisms to address this problem. For example, when we have noticed that the holding
time is very long, we can force this request to be committed in the slow path. So, in that way,
we do not need to suffer from the long holding time caused by the wrong deadline.
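A hedged sketch of that mitigation (names and the cap value are assumptions, loosely following the description of the technical report's appendix here): cap how long a replica will hold a request, and let a request released early simply go down the slow path.

```python
# Hedged sketch of the mitigation described above; the cap value and names
# are assumptions, not taken from the Nezha implementation.

MAX_HOLD_US = 10_000  # illustrative cap on how long a replica holds a request

def effective_release_time(arrival_us, deadline_us, max_hold_us=MAX_HOLD_US):
    """Release at the deadline, but never hold longer than max_hold_us."""
    return min(deadline_us, arrival_us + max_hold_us)

# A request released before its (bogus, far-future) deadline may be out of
# order on some replicas, so it is simply committed via the slow path
# instead of inflating the commit latency with a long holding time.
```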
And here I want to highlight one thing:
this kind of issue can theoretically be a problem,
but in practical cases, it can hardly happen.
I said this because of two reasons.
The first reason is because of the Huygens algorithm
we used in Nezha.
So Huygens is a very robust clock synchronization algorithm, which means that it is using a robust
probing mesh. So even if some clock synchronization agents fail, so long as there is no Byzantine node,
the remaining healthy nodes can still synchronize with each other and they can quickly converge
can still synchronize with each other and they can quickly converge
to the synchronized clocks.
So the second reason is
even if there is some Byzantine node
which keeps polluting
the clock synchronization,
in the recent literature,
I have noticed that
there are some mechanisms developed.
For example, you can use some watchdog mechanism,
and the watchdog mechanism can detect the Byzantine node.
And then if this node keeps polluting my clock synchronization,
we can just detect that and kick that node out of the synchronization scope.
So in that way, these mechanisms will prevent the long-lasting negative impact from Byzantine nodes.
Fascinating.
So yeah, it's interesting to see.
I mean, I kind of am a little bit sort of detached from the whole sort of Byzantine sort of protocols, BFT protocols, and sort of don't keep up to speed with them too much.
So it's really interesting to see that they're, in their way, working on solutions around that, like a watchdog or whatever,
and how that can be leveraged
for these sorts of protocols as well,
which don't have the same sort of underlying assumptions.
It's really cool.
I had another question as well about
how do you handle sort of changing the member set
of the protocol,
like rolling in new members
and rolling out old members.
How does Nezha handle that?
The short answer is,
for now,
we just inherited
the view-based approach.
We will just use
the Viewstamped Replication approach.
But in the near future,
I think we can design
more efficient membership change.
For example,
in the recent protocol,
maybe you have noticed
which is called Matchmaker.
I think it is also published in VLDB. So for the Matchmaker protocol, it can achieve even
more efficient reconfiguration. And I believe such protocol can be adapted and be applied
to support the membership management in the future version of Nezha.
Fantastic stuff. So I guess kind of leading on from that then, where
do you go next with Nezha? What's the next
thing on the research agenda?
Yeah, good question. So
actually, you know, this is not the end
of the story of Nezha. So
currently
in my
research, there are still three
ongoing directions related to Nezha.
So the first direction
is to productize
the Nezha protocol.
So after the Nezha protocol is developed,
my supervisors and I
all believe that this
Nezha protocol can become a
promising alternative and
replacement of the legacy protocols
such as Raft and MultiPaxos.
So for now, currently I'm working
with some industry colleagues
and we are trying to refactor
the code base of the Nezha protocol.
And after we have done that,
we will continue to integrate Nezha
into some industry systems.
For example, Kubernetes.
You know, currently the Kubernetes
is using etcd as the storage backend.
And etcd actually is a standard implementation of Raft.
And based on the feedback from some industry staff,
they told us that the Kubernetes is suffering from the performance bottleneck
due to the etcd backend.
So if we can replace the etcd with another protocol
which offers higher performance,
then we should be able to bring performance
boost to the Kubernetes
system. And similar systems
such as Apache Parse,
which is now using the loopkeeper,
and Apache Parse
is also suffering from the bottleneck
due to the loopkeeper. And in
the future, maybe we can also replace the loopkeeper with Nerja
to bring more performance improvement for PaaS.
And another scenario is the financial system.
You know, for financial systems, they also require the fault tolerance.
So this is something related to my first project, which is called CloudEx.
And in the future, I think I will also apply the Nezha protocol
into my first project to provide the fault tolerance for the financial system.
So this is the first direction in my schedule.
And the second direction related to Nezha is to extend the application scope
from the consensus protocol
to concurrency control protocol.
Oh, by the way,
I have listened to your podcast
published last month
when you were talking about
the Detock project with Cuong.
Yeah.
That is a very insightful podcast
and I enjoyed that a lot.
So, you know,
based on my understanding, you see,
the concurrency control protocol
actually is trying to establish
the ordering agreement
across multiple shards.
So that means you submit
a transaction, this transaction
will be executed across multiple
shots, but we need to satisfy
the strict serializability.
So, concurrency control protocol
is trying to establish the ordering agreement across multiple shards, while the consensus protocol is
trying to establish the ordering agreement across multiple replicas. So these two share much
similarity. So if we can do high-performance consensus protocol with ClockSync, then we should also be able to do high-performance concurrency control with ClockSync.
So actually, currently, I'm already implementing such a new protocol
which can provide both consensus and concurrency control.
Nowadays, I have completed the design of the new protocol,
and I'm now involved in implementing and evaluating.
So hopefully, I think this new protocol will come out very soon in the near future.
And maybe when that day comes, we can have another talk about the new protocol.
And yeah, this is the second direction.
And for the third direction, as you have mentioned, I'm also looking at some literature on the
Byzantine fault-tolerant scenario. So at this moment, Nezha is just a crash fault-tolerant consensus protocol.
So compared with the CFT consensus, the BFT consensus is even more challenging
because in BFT consensus, we should not only handle the case where a node fails,
but also handle the case where a node is dishonest. So,
how should we achieve high-performance
BFT consensus protocol?
In the past three years, we have seen that
many researchers are becoming
more interested in using
TEE to accelerate
BFT. So, TEE
stands for Trusted
Execution Environment.
For example, Intel SGX is a very good TEE technique.
So I'm thinking that by using the TEE technique
and combining the TEE technique with the Nezha protocol,
we can make Nezha work in the BFT scenarios as well.
So those are the three directions.
Fascinating.
I mean, you're going to be busy for a long time.
Those sound like fascinating directions. I'm really
excited. I mean, my next question was going to be what impact do you
think your work can have, but if you manage to get Nezha used in Kubernetes and
in Apache Pulsar, and then obviously the financial application, that'd be some fantastic
industry impact, right? That'd be amazing. Exactly. And yes, you can definitely come on the show again and talk about the extension to concurrency control
protocols, that'd be a fascinating chat, I'm sure. And then, yeah, the third direction as well,
the BFT one, that seems fascinating as well. So yeah, I'm sure there's going to be
loads of really interesting work coming out from yourself and from your lab over the
coming years, for sure. So I mean, I guess my next question
was going to be what impact do you think your work can have?
But I don't know if you want to add anything more onto that,
like how do you think people can leverage your findings
in their day-to-day sort of work?
Yeah, as we have mentioned in the previous question,
I think that by replacing the legacy protocols with Nezha,
we can bring the performance improvement
to many distributed systems.
Another thing which is also put in my
to-do list, maybe
in the far future, that is
for the DOM primitive,
we think that we can make the DOM primitive
as a general middleware.
So if we can
make the DOM as a general middleware, we can
plug into any distributed
systems which require consistency.
So in that way, it should be able to bring a general acceleration to the distributed systems.
So that can also be a promising direction that we need to do. But of course, by implementing this
general middleware, it may be more challenging. We need to consider many other issues.
Fair, I think that'd be really interesting. Yeah, for sure. So of course,
I'm kind of just changing direction a little bit here. While working on
the Nezha project, what's the most interesting thing that you've learned?
Actually, yeah, I think the interesting thing
is that I want to share two lessons I learned: one piece of good news and one piece of bad news.
Okay.
Let's first talk about the bad news.
Yeah, sure.
Yeah, so first let's talk about the bad news.
So in distributed systems,
while I was doing the research on consensus protocols,
I realized that this area is really crowded.
So for example, because
consensus protocols have been
developed for more than 30 years,
so there are too many smart people
and they have spent too much effort in
developing different kinds of protocols.
So this is what I call history burden.
Because we are doing that
30 years after
the pioneers. So
when we develop a new protocol,
in order to make another protocol stand out,
we need to defeat many, many baselines.
This is what we have encountered
from the reviewer feedback.
So when we submitted the Nezha paper
to some conference tracks,
our reviewers were just arguing that
why don't you compare with baseline A?
Why don't you compare with baseline B?
So in order to satisfy the reviewers, we need to
add more and more evaluation in
order to distinguish our novel protocol
compared with the existing works.
So for the young researchers
and students who want to enter
this area and make some contributions
in consensus
protocols, I suggest that
you must be prepared for
the laborious implementation
and evaluation work that you need to do.
So this
is bad news.
But there is also some good news.
So the good news is
we should
be aware that compared with
the young researchers 30 years ago,
we are now enjoying much
better implementation conditions.
You can have many available powerful techniques that you can use to improve the protocols. So
if you want to make a contribution in the consensus protocol, I'm sure that you can do something.
Based on my understanding, there are two paths that you can consider.
So the first path is that you can consider
to use some special network support
like what NOPaxos and Speculative Paxos have done.
You can also use RDMA, programmable switches, or SDN.
You can use these new techniques
to enhance the performance of the consensus protocols
and build your own contribution.
And the second path is to use some software and deployable techniques, such as the clock
synchronization and such as TEE.
And by the way, TEE is also available in public clouds like AWS and Microsoft Azure.
So by using these new techniques, you can create the unique contribution, which you can do, but the early researchers cannot do.
So in that way, you can make your work stand out.
I think that's a fantastic answer to that question.
Yeah.
Great.
I love the bad news and I love the good news as well.
I say it's great stuff.
But it's so true, that kind of history burden.
And it's true, obviously, I spend a lot of time in the same sort of
field, and, you know, with database research it feels like everyone's already done
it all in the 60s and 70s and 80s, right? So there's so much to sort of learn and
get your head around before you can then add some sort of layer on top and make
some incremental improvement in some area. But yeah, I guess that's the way the
world is at the moment. But anyway, cool, awesome. So yeah, I guess, on the
other side then, were there some sort of things that you tried while you were working
on Nezha that failed, that maybe the listener might be interested in knowing about?
Yeah, I think this question is quite relevant to the previous
question. Yeah, I think the problem is that in the research,
actually, this is a common issue for almost all the system-related research.
That is, behind the paper, you need to spend much, much more effort
trying to do the implementation and evaluation.
Actually, you know, the Nezha paper has been submitted three times,
and it was accepted the third time.
But the first two times, we submitted to other conferences and it was rejected.
And the reason for the rejection is because they said that, okay, we do not compare with
the other baselines, or we do not do this experiment or that experiment.
So, you know, this actually makes me quite disappointed because we have designed
such protocol.
The only problem is
we need to make
our protocol defeat
so many rivals.
So this hard work
cannot be reflected
in the paper.
So that is quite disappointing.
So when I got rejection
for the first
and the second time,
my supervisors came to comfort me.
They said, oh, Jinkun, do not be so disappointed.
You know, the Raft paper
was rejected six times.
Really? Okay.
Well, yeah, that makes it easier now, right?
You know, that got rejected six times.
Look how influential that protocol is.
Yeah, yeah, yeah.
Oh, yeah.
Yeah, finally, so my supervisors always encouraged me.
They said, we believe that Nezha is good work,
so we just need to keep working on that.
And it is just a matter of time to get it published.
And finally, yes, we got published in VLDB.
And also the final review feedback is very positive.
That encourages us a lot.
That's great.
But it is kind of quite disheartening, right?
You've got to do
so much sort of work that doesn't ever get any real credit in the paper, right?
And, I don't know, maybe it's a kind of systemic problem with
the academic system and the way that things work at the moment, and I
think it's especially prevalent in systems research. But yeah, at least
it got accepted eventually,
and I'm glad it did get accepted because it was a great read. So we've got a couple more
questions now, and this one is: can you maybe spend some time telling the listener about
your other research? I'm sure you've worked on loads of interesting things
in the past, so yeah, tell us about some of your other stuff. Yeah, yeah. Actually, I want to share some lessons,
share some tips that I have learned
based on my past research experience.
Yeah, go for it.
Yeah, yeah.
I think for the PhD students,
this is a common challenge that we will encounter.
So how should you find the project
that you think is promising
and what you think you can do?
So I think for the PhD research,
it can be divided into two steps. Step one, you need to have an idea. You should have an idea
that you believe is promising. Step two, you need to implement your idea
and demonstrate the effectiveness of your idea. So comparing these two steps, I think step one
is harder, because it is the starting point.
If you do not have the idea, you will just be blocked.
You cannot make any progress.
So here I want to share the experience.
How should we generate some interesting ideas?
So there are two ways.
The first way is if you do not have the idea,
I think at least you will have a general direction.
For example, you can ask your supervisor to give you a general direction.
Of course, the supervisor will not tell you
what to do step by step.
That is your responsibility
to figure out what to do step by step.
So after we have got a general research direction,
then you will need to do a lot of literature review.
For example, when I first decided
to do something in consensus protocols,
I just need to read about 20 papers
which are published in the top conferences.
So after you have read 20 papers,
I believe that you should already get some sense.
You should already have some inspiration.
If you are saying that,
okay, after I read 20 papers,
I still have no idea,
then don't worry.
Keep going and continue
to read another 20 papers. And keep
doing this. Finally, you will find something
interesting. This is the first option
that I suggested to other
PhDs. And the second option
is to get your hands dirty.
For example, if you do not know what to
do, you can find a
mature system. For example, if you want to do something
in the consensus protocol,
you do not know what to create.
Then you can start by
implementing the Paxos protocol
or by implementing the Viewstamped
Replication protocol by yourself.
So when you are doing the implementation,
you will learn some potential
bottlenecks, some potential issues
that the other guys cannot
know if they do not
implement it themselves. So based on that, perhaps at the beginning, you will have
a very small idea. And then you continue with your small idea. During the implementation,
you will make your idea bigger and bigger. And finally, probably you will have a good work.
So that is the second tip I want to share with the other PhD students.
That's fantastic.
Honestly, I love that. It's great. I think that's fantastic advice: read, read, read,
keep reading. And it's a great way to generate ideas, right? That was sort of the approach
that I took initially: if I can read all of these distributed
transaction papers, then surely maybe one day I can have an idea of my own. And then, obviously, get your hands dirty, right? And I think that kind of advice
is true in life as well, in general, in all things, right? Like, you've
got to crack a few eggs to make an omelette, right? So you've
got to get your hands dirty, get in there, and then ideas flow from there, right? And the first thing
you do might be terrible, but
everything is terrible at first, right? The first podcast I recorded was terrible, and hopefully by now
they've got a little bit better. I don't know, I'll let the listener be the judge of that. I'm joking,
I'm joking. But no, I think that's brilliant. So I think that kind
of answers it; I normally ask about people's creative process, but you've kind of
answered it there as well. So I think that's a really, really cool sort of answer.
I really enjoyed that.
And I guess, yeah,
the time for the last question now is,
what's the one thing you want the listeners
to take away from this podcast today?
Yeah, yeah.
To summarize in one sentence,
I would say ClockSync is becoming
the new powerful weapon that we can use
to accelerate the distributed protocols.
That is what I want to share with the audience.
That's fantastic.
Great.
Let's end it there.
Thank you so much, Jinkun, for coming on the show.
It's been a fascinating chat.
I've loved it.
If the listener wants to know more about Jinkun's work,
we'll put a link to everything in the show notes.
And again, if you do enjoy the show,
please do consider supporting us through Buy Me a Coffee.
It really helps us to keep making the show, and we'll see you all next time for some
more awesome computer science research.