Disseminate: The Computer Science Research Podcast - Suyash Gupta | Chemistry behind Agreement | #24
Episode Date: February 27, 2023
Summary: Agreement protocols have been extensively used by distributed data management systems to provide robustness and high availability. The broad spectrum of design dimensions, applications, and fault models has resulted in different flavours of agreement protocols. This has made it hard to argue their correctness and has unintentionally created a disparity in understanding their design. In this episode, Suyash Gupta tells us about a unified framework that simplifies expressing different agreement protocols. Listen to find out more!
Links: Paper | Website | Twitter
Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host, Jack Wardby.
Today we have another installment of our CIDR series and I'm delighted to say I'm joined by Suyash Gupta, who will be talking about his paper, Chemistry Behind Agreement.
Suyash is a postdoc at UC Berkeley and he's currently working on designing efficient fault-tolerant systems with a key focus on handling malicious attacks.
Suyash, thanks for joining us on the show.
Yeah, thank you for having me.
I mean, it's great to join here.
This is an exciting platform, and I think it will help the community a lot.
So yeah, I'm doing a postdoc at UC Berkeley, and this is a recent paper which we presented at CIDR, so I'm excited to explain it, and I look forward to any questions.
Awesome. Well, we're happy to have you as well.
So before we get into the paper, can you tell us a little bit more about yourself and how you
became interested in researching databases and distributed systems?
Yes. So I started doing database research at Purdue, where I began my PhD, and then I moved to UC Davis, where I graduated. We were taking a course which looked into designing commit and concurrency control protocols, and that sort of intrigued me: how are these protocols designed? Can we actually do something new? That's where I started to take a deep dive into distributed databases. And interestingly, this work is sort of an extension of those early days, because back then I was looking at how to design the two-phase commit protocol or the three-phase commit protocol, and then I soon moved on to consensus protocols, basically fault-tolerant consensus protocols. So this work sort of takes care of all of those protocols and tries to bring them into a single framework, in some sense.
Amazing. Yeah. I mean, concurrency control, right? That's what got me hooked as well. So
yeah, that's definitely the gateway drug for databases, maybe. I don't know. Yeah.
Cool. So you touched a little bit there on the general idea behind the paper, but can you maybe give us the elevator pitch for this work?
Yes. If I have to put it that way, I'll say that out in the wild there are many types of agreement protocols, which come in several flavors: there are commit protocols, there are protocols which can handle crashes, there are protocols which can handle malicious attacks. But despite this, we believe there's no single framework which can express all of these protocols.
As a result, what we feel is that there is a disparity in understanding and usability of these protocols. And such a disparity kills innovation and adoption. In fact, just at this CIDR, we were listening to some talks and to people discussing among themselves, and a common refrain was that Paxos is hard.
So there is still, people feel, no easy way to understand these protocols.
And that is where our work comes into the picture.
Chemistry Behind Agreement: we tried to design a chemistry-inspired framework, chemical per se. The idea is to design a framework which can express all of these protocols, and the hope is that, using this framework, developers can come up with new protocols and can actually adjust the system based on their requirements.
Yeah, I mean, it's super useful work. I know from personal experience, trying to understand all the different variants of Paxos and the whole spectrum of protocols out there, it's so hard to compare and get your head around each one, sort of teasing out what the primitives are and things.
So, yeah, definitely a really helpful piece of work on that front.
I really like the chemistry angle to it as well. I think that's a really good way of thinking about these things.
So you start off in the paper by defining four atoms. Can you maybe run us through each one, and why you decided on these four being the four atoms?
Yes.
So the idea behind this paper was, as you just said, we're trying to ask: what is chemistry? Chemistry has many pieces: there are atoms, there are elements, and then there are compounds. So the first thing we need to define is the atoms. In chemical terms, an atom is basically the smallest indivisible unit of an element. So similarly, we were trying to think: what are the atoms of different agreement protocols? How do these protocols actually come into being? One thing you realize is that all of these protocols have something which is called a failure model, meaning what type of attacks or faults they can handle. For example, some protocols can simply handle crashes and cannot do anything more, and that is good enough, while some protocols can actually handle Byzantine attacks, or malicious attacks, where the replicas could actually collude, they can lie, or do anything else. So the first thing is that we need to define the failure model. The next atom, according to us, is the quorum.
Basically, if we are designing a protocol, in that protocol there will be multiple parties. What is required from these multiple parties to come together and reach a common decision? That is what the quorum tries to tell you.
For example, in some protocols, the idea is that you need all the parties to come together: in a system with n parties, all n of them should agree. While some protocols will say, hey, I do not need all n of them, I only need a subset. For example, Paxos-style protocols say that if you have n parties, with n equal to 2f plus 1, then I only need f plus 1 of the parties to come together to reach a common decision.
And that is the quorum for them.
Similarly, in the BFT world, the idea has been that a lot of protocols say we need a two-thirds majority, 2f plus 1, to come together and actually reach a common decision.
So that is the quorum for us.
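To make the quorum arithmetic concrete, here is a small sketch in Python. The function names are illustrative shorthand, not anything from the paper; `f` is the number of faulty parties each protocol tolerates.

```python
# Quorum sizes for the three families of protocols discussed above.

def quorum_2pc(n: int) -> int:
    """2PC-style commit: every one of the n parties must vote yes."""
    return n

def quorum_paxos(f: int) -> int:
    """Crash fault tolerance: n = 2f + 1 parties, a majority of f + 1 suffices."""
    return f + 1

def quorum_pbft(f: int) -> int:
    """Byzantine fault tolerance: need a two-thirds majority of 2f + 1."""
    return 2 * f + 1
```

With f = 1, a Paxos-style system of three parties can decide after two agreements, while a PBFT-style system needs three.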
Then, okay, we have defined failure, we have defined quorum, what can we do next? The next thing, we thought, would be topology. Basically, where will these protocols run? How will these protocols run? Generally, in the broad outlook, the protocols can run either in a centralized setup or in a decentralized setup. In a centralized setup, there will be a leader which actually initiates the protocol, and everyone else follows whatever the leader is saying. In a decentralized setup, everyone can work together: they can send messages to each other, they can communicate with each other, and they do not need to wait for the leader to initiate things. Alternately, there are also some protocols which follow a ring design, where the idea is that I will just communicate with two of my neighbors: we assume a ring-like structure, everyone is sitting in a ring, and they are communicating. So these are some of the topologies we see as common.
And finally, the fourth atom we thought of is data distribution: how is the data distributed among these parties? Because we have these parties, and what are these parties actually holding on to? What are they actually executing? One way could be that all the parties are holding exactly the same data, and that is the common replication model. Another scenario could be where every party only has a distinct set of data items, and traditionally this is called sharding. Similarly, there are some models which use a flavor of both, where you have both sharding and replication happening at the same point in time. So this is how we generally define the
atoms in the system. Nice. How did these four things sort of fall out then? I mean, was there
certain other atoms you thought, do we make this an atom or not? Or was it pretty obvious, when you started trying to decompose this and break it down, that these were the four founding, indivisible atoms?
That is again a very good question. The reason we thought about only these four atoms is that we looked at a lot of different protocols, and, as with some of the protocols we'll be discussing later in this podcast, any protocol will follow some of these structures: it will first define how its parties should hold the data, it will of course define what type of attacks it can handle, and it will define how the parties should reach a decision. So these basically ended up being the fundamental constructs in a protocol. Of course, we're not saying that these will be the only four atoms forever. We believe these are the four fundamental atoms, and in the future more atoms can maybe be added: it is possible that some protocols might require some different units which are indivisible and need to be foundational. So yes, they can be added. But according to us, these are the four basics which will be part of almost any protocol.
Nice, that really makes sense. So now that we've got our atoms, in chemistry we need compounds as well, right? You give six of these in the paper. Can we maybe run through them: what are they, and why?
Yes. Now that we've got the atoms, the next thing we need to do, as I just said, is combine these atoms. And in some sense, what we view as elements in this paper are basically the phases of a protocol. We have different agreement protocols: we could have 2PC and 3PC, which are commit protocols; we could have Paxos-style protocols, which can handle crash failures; and then we have PBFT-style protocols, which can handle all the
malicious attacks. So across all of these protocols, what is common? That is what we try to term the elements: basically, the common phases. For example, a very common phase in any protocol will be the proposal. If a protocol is being run, that means someone must have sent a proposal. There must be a leader, there must be some party who has actually said: hey, let's run this proposal, let's complete this transaction, let's execute this transaction. That is what we call one of the key elements. Often this proposal is sent by the leader to all the other parties, or, in decentralized protocols, there could be multiple leaders, and all of them could be sending proposals at the same point in time.
Then, once a leader has actually sent a proposal, the other parties need to take some decision. Generally, the idea is that these parties try to reach an agreement on what to do with the proposal, so they try to vote on the proposal. That is, according to us, the next element: the parties will vote on whether they actually agree with what the primary, or the leader, is saying, or whether they do not agree. In some protocols, the vote might even determine whether they want to abort the transaction or commit the transaction. So the idea of the next element is the vote. Then, once the voting is done, a common decision has to be reached.
of the next element is to vote then once the the voting is done, now a common decision has to be reached.
So the votes, all of these parties, they vote.
They send these votes to the primary if there is a primary-backed protocol.
And now what the primary will do?
It will collect all of these votes.
And once the primary has actually collected all of these votes, it needs to gauge what is the result of this voting.
Has everyone agreed or has everyone disagreed?
So now based on those results, now the protocol needs to move forward.
So if it is possible, some protocols say that just enough votes are once we get sufficient number of votes, we can simply prepare or we can simply commit.
And that is what our next elements are.
Based on the common decisions,
protocols try to decide on the global decision.
Based on this global decision,
either you can just prepare,
move the protocol to the next stage,
or if you want multiple steps, you can also commit.
And similarly, we also have here another notion, which we call decentralization. If the protocol is just being led by the primary, then your prepare and commit are led by the primary: everyone sends to the primary, and the primary sends the response back. But you can have a decentralized version, where everyone can communicate with each other. So again, we have decentralized versions of these phases, and in the paper, if you have a look, we show a symbol to represent this decentralization. We are trying to show the difference between centralized prepare and commit and decentralized prepare and commit: basically, who is leading these next decision-making steps.
Finally, when all the decisions are done, you need to actually execute the transactions. Execution means you need to find the result, and then you need to reply to the client. So that stage we are again representing through an element. And often, when you have a lot of transactions, have completed execution, and have built up a lot of history, you need to do some checkpointing and garbage collection. So we again make an element for that.
And finally, it is possible that if the protocol is led by some leader, then that leader might fail. And if the leader fails, what should you do next? You need to select the next leader. And once you select the next leader, you need to ensure that all the other parties know this is the next leader and that they all have a common state. So we believe leader election could be another element.
So if you see, in essence, all of these elements are trying to show that the different phases of protocols are sort of common to every protocol, and those are exactly the elements: that's what we are trying to show.
Nice. So yeah, just to summarize
the six different elements we have here: proposal, vote, prepare, and commit, with their decentralized variants as well, plus execution, checkpoint, and leader election.
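One way to make this vocabulary concrete is as a small enum; the names below are illustrative shorthand, not the paper's notation, and a protocol then becomes just a sequence of elements.

```python
from enum import Enum

class Element(Enum):
    PROPOSAL = "proposal"
    VOTE = "vote"
    PREPARE = "prepare"              # also has a decentralized variant
    COMMIT = "commit"                # also has a decentralized variant
    EXECUTE = "execute"
    CHECKPOINT = "checkpoint"
    LEADER_ELECTION = "leader-election"

# A single centralized agreement round might then look like:
round_elements = [Element.PROPOSAL, Element.VOTE, Element.COMMIT, Element.EXECUTE]
```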
Great. We've got our elements. Now let's cook up some protocols from these elements. Let's mix them together.
So there's this really nice bit in your paper where, now that we've got our framework, you analyze four of probably the most famous agreement protocols out there. Maybe we can run through each one and talk about how this framework can be used to describe these algorithms.
Yes, I think that's a great thing. So now what we'll try to do is take the four atoms and the six elements and try to design four of the protocols. Basically, we'll try to see how we can design two-phase commit, three-phase commit, Paxos, and PBFT. The reason we chose these four basic protocols is because two of them come from the commit world, one of them comes from Paxos, the crash fault-tolerant world, and another one comes from the malicious fault-tolerant protocols.
So to understand that, we'll start with the 2PC protocol. In the 2PC protocol, the assumption is that there will be multiple parties, and what these parties try to do is reach an agreement on whether a transaction can be committed or has to be aborted.
The whole point here is that generally 2PC runs in a partitioned system, where all the parties are managing their own data. That is why they need to take their own individual decision on whether they can actually order the transaction or not. So in some sense, all of these parties are sending a vote for aborting or committing the transaction. The first phase is also our first element, the proposal: whichever party has access to the transaction says, let's propose, and it sends this proposal to all the other parties. Now, when the other parties get this proposal from the leader, they decide to vote on whether they should commit or abort the transaction. The thing here is that all of these parties are individually making a decision: they look into their own history and see whether they can individually commit or abort the transaction. This is why we represent this type of voting differently from the voting in Paxos and PBFT, which I'll discuss next. Once the primary gets the decision from all the parties, it decides: if everyone has agreed to commit the transaction, let's commit; if one person, or several people, say let's abort the transaction, they'll abort. So the idea is to show that there's a next phase, which is the commit phase, where you make a common decision on what to do with the transaction. Based on this decision, the transactions are executed: whether or not you actually need to execute the transaction, you need to reply to the client, update the history, and write the log. That's what the execution step does. So that is the very simple two-phase commit protocol.
But now that we have done the two-phase commit protocol, let's see what we need to do in the three-phase commit protocol. And this is where things become
interesting. The extension from the two-phase commit protocol to the three-phase commit protocol is quite trivial if we see it from our framework's perspective. Basically, the problem with the two-phase commit protocol was that it could block. Prior work showed that if the primary fails and one of the replicas fails at the same point in time, then it is very easy to create a scenario where all the other replicas are stuck: they are not able to make any progress, and the system comes to a halt. To avoid that, the three-phase commit protocol came along. Its idea was to say: hey, before you actually make any decision, ensure everyone is ready to move forward.
So that's why the three-phase commit protocol has a prepare phase. And this is where another element comes in.
So when we said you can either commit the transaction directly, or you can first prepare and then commit a transaction, that is where the three-phase commit protocol comes in. The proposal is the same: the leader sends a proposal, and everyone votes. After everyone votes, the leader does not make the final global decision yet. It decides whether it can actually move everyone to prepare, or whether it must abort. And once it knows that everyone is ready to move to commit, it finally commits the transaction. So basically, we have added two more elements to the whole framework. In two-phase commit, we had four elements: proposal, vote, commit, execute. Now we have six: proposal, vote, then you prepare, then you again vote, then you commit, and then you finally execute. So that's what the three-phase commit protocol looks like.
Okay, so now we've done the two-phase and three-phase commit protocols. How can we build Paxos from this knowledge? That is what we were trying to think about. What does Paxos do compared to the two-phase and three-phase commit protocols? Well, in the two-phase and three-phase world, all the parties were making individual decisions. In Paxos, there's a notion of a replicated system: all the parties are working on exactly the same data, they have the same data items, and whatever transaction comes, they try to order it exactly the same way. To do that, we need to run a consensus, a Paxos-based consensus. And what we'll see here is that the phases being used are exactly the same as in 2PC.
For example, in Paxos, let's assume a leader. There's an already-selected leader, and that leader has to propose. So the leader proposes a transaction; the same thing was happening in 2PC, where the leader was proposing. And now everyone is voting. But what they're voting on here is whether they can actually agree with the primary or not, so they simply send their agreement to the primary. If the primary receives agreement from a majority of the parties, in this case f plus 1, it can proceed. As we said, the quorum size is different: in 2PC the leader had to listen to everyone, while in Paxos the leader only needs to listen to f plus 1 parties. Once the leader hears from f plus 1 parties, it knows that f plus 1 parties have agreed to follow the decision, so it goes ahead and commits the transaction. And then finally, the leader asks everyone to execute the transaction.
So exactly the same four elements which are used in 2PC are used in Paxos. The only distinction is the way they are used. In 2PC, everyone does a distinct sort of voting: they have their own state, and they might be working on partitioned data, while in Paxos, everyone is working on exactly the same data. And in 2PC, the leader needs responses from everyone, while in Paxos, only f plus 1.
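The comparison just made can be sketched in a few lines of Python: 2PC and Paxos share the same phase structure, and what differs are two atoms, the quorum and the data distribution. The field names are my own shorthand, not the paper's notation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Protocol:
    phases: tuple   # the elements, in order
    quorum: str     # the quorum atom
    data: str       # the data-distribution atom

TWO_PC = Protocol(("proposal", "vote", "commit", "execute"),
                  quorum="all n parties", data="partitioned")
PAXOS = Protocol(("proposal", "vote", "commit", "execute"),
                 quorum="f + 1 of n = 2f + 1", data="replicated")

# Identical phases; the protocols differ only in their atoms.
assert TWO_PC.phases == PAXOS.phases
assert (TWO_PC.quorum, TWO_PC.data) != (PAXOS.quorum, PAXOS.data)
```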
And similarly, this allowed us to create PBFT, because, having already made Paxos, what do we need to do in PBFT? In PBFT, the difference is that you can have malicious attacks. And because you can have malicious attacks, just like in 3PC, you have to wait before you reach your decision. In 3PC, we had to first prepare and then commit, and that is exactly what we do in PBFT: we first have to prepare the replicas, and then we commit the transaction. But the key difference between 3PC and PBFT is the same as between 2PC and Paxos. And of course, in PBFT the quorum size is increased: the quorum which was f plus 1 in Paxos is now 2f plus 1 in PBFT. So the phases are exactly the same. This is how we were able to define the four different protocols using just these six elements.
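Putting the four walk-throughs together, the element sequences can be written out as plain tuples. The string names are shorthand of mine, not the paper's symbols; the point is that Paxos reuses 2PC's phase structure and PBFT reuses 3PC's, while the atoms (failure model, quorum size, data model) change between the pairs.

```python
# Element sequences for the four protocols as described above.
SEQ_2PC = ("proposal", "vote", "commit", "execute")
SEQ_3PC = ("proposal", "vote", "prepare", "vote", "commit", "execute")
SEQ_PAXOS = ("proposal", "vote", "commit", "execute")
SEQ_PBFT = ("proposal", "vote", "prepare", "vote", "commit", "execute")

# Same phases, different atoms.
assert SEQ_PAXOS == SEQ_2PC
assert SEQ_PBFT == SEQ_3PC
```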
I think that puts it all into perspective so much more clearly, in my mind, than reading these textbooks over the years. It just seems like a really nice framework, and it all fits together really nicely.
I mean, the thing that was kind of not obvious to me initially was that I was thinking the elements could only be used once, right? You could only use the prepare once. But you can use them multiple times, right? That's, I guess, the secret sauce to it in a way: you can mix and match the order in which you have these elements, which is awesome.
So, building off that, and you hinted at this earlier on as well: this framework could open up the space to thinking about different combinations of these elements that maybe haven't been explored yet. And I know you have a section in your paper where you look at more complex protocols. Maybe we can touch on that, and then you can maybe talk about the potential space you think this framework opens up.
Yes, you're absolutely right. I think that was one of the driving goals for us. We wanted to not restrict ourselves to just these four protocols. The idea was to say that this one framework could actually aggregate all the different protocols. In fact, while I was explaining the different atoms and elements, I also said that there could be centralized and decentralized versions. I think we also show this in the paper: the Paxos I was just describing was leader-based, where the leader was actually driving the protocol, sending the proposal, collecting all the responses, and taking decisions. We also show in the paper that we can simply make a decentralized version where, once the leader has sent the proposal, the leader is no longer in the middle: all the replicas can go ahead and talk among themselves, come to the common decision, and finally execute. Similarly, we can have a decentralized version of PBFT, where all the replicas communicate with each other, reach prepare, and then finally commit.
In the paper, we then go through several other protocols. I'll not discuss all of them, but to give a quick glimpse, we talk about protocols where there can be a notion of speculative consensus. For example, some protocols say you don't need to wait for the decision to be committed: you can take an eager decision. There are protocols like ZZ, whose idea is to say: hey, as soon as the primary sends me a request, a replica will go ahead and execute the request, and then in the future, they will decide whether that request can actually be committed or not. In case it could not be committed, they need to roll it back. We were able to simply express this in our framework, because we just needed two elements: a proposal phase and an execution phase. Of course, if we need to show the failure case, then we have to design a different model.
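The execute-then-decide pattern described here can be illustrated with a toy sketch; this is my own illustration of speculative execution with rollback, not code from any of the protocols mentioned.

```python
class SpeculativeReplica:
    """Apply requests eagerly, keep an undo record, roll back if agreement fails."""

    def __init__(self):
        self.state = {}
        self.undo = {}  # request id -> (key, previous value)

    def execute_speculatively(self, req_id, key, value):
        # Record the old value so the update can be undone later.
        self.undo[req_id] = (key, self.state.get(key))
        self.state[key] = value  # reply to the client before consensus finishes

    def finalize(self, req_id, committed):
        key, old = self.undo.pop(req_id)
        if not committed:  # consensus failed: roll the speculation back
            if old is None:
                self.state.pop(key, None)
            else:
                self.state[key] = old

r = SpeculativeReplica()
r.execute_speculatively(1, "x", 10)
r.finalize(1, committed=False)  # the eager write to "x" is undone
```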
And similarly, we were able to show several other protocols. For example, all of the protocols we've talked about so far assumed that there's only one leader, and that one leader is sending the proposal. But what if there could be multiple leaders? There are several protocols, like the Mencius protocol and the RCC protocol, which say that all the replicas can act as leaders at the same time. That means that in such a protocol, several requests are getting prepared and committed at the same point in time. So in our framework, we asked: if multiple consensuses are happening at the same point in time, how can we represent that? And representing that was very simple for us: we just added a sort of subscript. We took the common phases that everyone will be running and subscripted them by i, where i indicates how many consensuses you want to run in parallel. And then we added a new element, where we said that if all of these consensuses are happening in parallel, then you need to aggregate the responses. These replicas are running multiple consensuses because they have multiple leaders, but ultimately they need to execute these requests, and they have to come up with a common order across all of these consensuses. So we need a phase where this common ordering can be generated. That is how we express parallel protocols and multi-primary protocols.
And similarly, we went to other styles of protocols. What about protocols, or systems, where there's both sharding and replication? Spanner is one example of such a system. We tried to express those systems, showing how you represent sharding, and then you can even do replication on top. As we saw, that was just a change of notation: we needed a few more notations, but the phases and elements still remained the same. The only thing was that, yes, if you have multiple shards, you might require these shards to communicate with each other. For that, we added a new element: inter-shard communication. And that element was interesting, because it also helped us to define a different style of protocol, basically the geo-replication protocols, where you have multiple clusters and each cluster is running its own consensus. They're still working on the same data, and everyone is still replicating the same data, but the clusters now need to communicate. So that is what we were trying to do.
Next question. So far we've been talking about the happy path, right? We haven't really touched on failures too much. How did you go about incorporating failures into your framework?
I think you've asked a very good question. This is a concern for almost every protocol; what is challenging for every protocol is how to actually handle failures. Can we actually design a common path for all the failures? Can we see that even failures have the same sort of elements and atoms as the happy path we were talking about? And actually, that is quite true. If you look at a very basic failure model, it tells us that there has to be some leader who has failed. Because if a non-leader has failed, then basically that non-leader just needs to get up and recover by itself; it can ask others for help. But if the leader fails, then the system has an issue: now that the leader has failed, you need to select a new leader, and you need to ensure that all the replicas are in a common state. And that is what we're trying to do. With leader election, we believe there's the possibility of creating a common set of elements here.
And that's why we designed four different elements. The first is the timeout element: how do you know that the leader has failed? The idea is that every replica is actually waiting for some action to happen, and if that action does not happen for a prolonged period, then that replica will sort of time out, and it needs to take the next step. In our case, the replicas are waiting for the leader to do something, and if the leader does not do a specific task, they time out. Once they time out, they need to do the next thing, and the next thing, generally, is that they announce it. Because if I'm a replica and I've timed out, I need to know how I can make progress, and I cannot make progress by myself. So generally, I need to announce to everyone and say: hey, I've timed out, have you also timed out? Based on that, we can decide the next step. Often, if every replica is announcing to each other, they will come to a common conclusion: they'll find out the leader has actually failed. So now they need to select a new leader: the whole process of leader election.
And leader election is a very famous problem.
I think there has been a lot of papers and distributed systems and databases about how to do
leader election. So this simply fits
in here. So after you have done timeout,
there's announcement, another leader election comes.
And once the leader election has happened,
the idea is that everyone
should reach a common state. So generally the
idea is the leader will ensure everyone
has the common state.
In Paxos-style
protocol, the leader might check everyone's log
and maybe try to set their logs into correct state.
Or in BFT-style protocol, the idea is the leader
will ensure it gives everyone the common state
they can all reach in so that they can, again,
start the consensus.
So this is exactly what we believe are the four elements,
and these four elements can help to define any failure model.
And in the paper, we actually show how these can be used
to represent PBFT's failure model,
because we believe if we're able to represent PBFT's failure model,
we can also represent Paxos's failure model, which is somewhat easier.
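To make the four elements concrete, here is a minimal Python sketch of how a replica might string them together. The class, method names, and the rotating-election and longest-log rules are illustrative assumptions, not the paper's actual notation or Bedrock's implementation.

```python
import time

# Hypothetical sketch of the four failure-model elements discussed above:
# timeout, announce, leader election, and common-state recovery.

class Replica:
    def __init__(self, replica_id, peers, timeout_secs=5.0):
        self.replica_id = replica_id
        self.peers = peers                  # ids of the other replicas
        self.timeout_secs = timeout_secs
        self.last_leader_action = time.monotonic()
        self.timed_out_peers = set()

    # Element 1: timeout -- detect that the leader has stopped making progress.
    def leader_timed_out(self):
        return time.monotonic() - self.last_leader_action > self.timeout_secs

    # Element 2: announce -- tell everyone "I have timed out; have you too?"
    def announce_timeout(self):
        self.timed_out_peers.add(self.replica_id)
        return {"type": "TIMEOUT", "from": self.replica_id}

    def on_timeout_msg(self, msg):
        self.timed_out_peers.add(msg["from"])

    # Element 3: leader election -- here, a trivial rotating scheme.
    def elect_leader(self, view):
        all_ids = sorted(self.peers | {self.replica_id})
        return all_ids[view % len(all_ids)]

    # Element 4: common state -- the new leader brings everyone to one log,
    # e.g. Paxos-style, by adopting the longest known log.
    def sync_state(self, logs):
        return max(logs, key=len)
```

A replica that detects a timeout would announce it, wait for matching announcements, elect the next leader, and let that leader run `sync_state` before consensus resumes.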
Awesome. Yeah, I recommend the listener going and digging into that
and looking into that some more, for sure. And so, given the framework here, how did you then go about evaluating how effective the
framework is and how useful it was? How did you experimentally explore this, and I
guess, what questions were you trying to answer? Yes. So I think that is a good question. So the key challenge for us
while experimenting was to understand how these different protocols fare relative to each other.
First of all, at least in this paper, we are not presenting a new protocol. So people already know
how different protocols are supposed to behave relative to each other.
But one of the key goals of our implementation and evaluation was to highlight that, now that we have a framework which can actually express different protocols and even their failure models, can we similarly have an experimental framework where we can pick and play different protocols? We can have these elements I'm talking about,
like this proposal, vote, prepare, commit and all.
Can we have all of these elements as like modules?
And if we have these elements as modules
and then these atoms are basically
the general characteristics of different protocols,
then we can simply plug and play these modules
and we can create different protocols.
And that is exactly what we tried to do
in the experimental framework here.
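As a rough illustration of the plug-and-play idea, the sketch below treats each phase ("atom") as a named module and each protocol as an ordered list of phases. The phase names and protocol recipes are simplified assumptions for illustration, not the paper's exact decomposition.

```python
# Hypothetical plug-and-play sketch: phases are reusable modules,
# and a protocol is just an ordered composition of them.

PHASES = {
    "propose":  "leader broadcasts a proposal to all replicas",
    "vote":     "replicas reply with their vote",
    "prepare":  "all-to-all exchange to prepare the value",
    "commit":   "all-to-all exchange to commit the value",
    "execute":  "replicas execute the agreed transaction",
}

# Different agreement protocols as compositions of the same modules
# (illustrative recipes only).
PROTOCOLS = {
    "2PC":   ["propose", "vote", "commit", "execute"],
    "Paxos": ["propose", "vote", "commit", "execute"],
    "PBFT":  ["propose", "prepare", "commit", "execute"],
}

def run(protocol):
    """Return the sequence of (phase, description) pairs a protocol runs."""
    return [(phase, PHASES[phase]) for phase in PROTOCOLS[protocol]]
```

Swapping one phase in the recipe, or appending a new one, yields a different protocol without rewriting the shared modules.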
In this paper, we specifically use Bedrock as the framework.
And why did we use Bedrock?
Because in the prior work,
we have used Bedrock to actually test different types of BFT protocols.
So in this framework, what we did is that
we actually leveraged Bedrock
to create even 2PC, 3PC and Paxos.
And what we were trying to show here is that,
just by having these modules,
we were not only able to easily create these protocols,
but even able to achieve the same performance that is expected. So in general, assuming
there are no failures, the expectation is that 2PC will perform
well, because it just has a small number of phases, and all the replicas will achieve high throughput. And
similarly, the assumption is that PBFT would be the most expensive protocol, because it
has three phases of all-to-all communication and the leader has to send messages to all the replicas.
And somewhere in between there will be 3PC, there will be Paxos. And that is
exactly what we were able to yield: the same results. And the whole aim is that, now that we have created
such a framework with these small elements, and that framework is able to express these four
different protocols, we also evaluated PoE, which basically removes one phase from
PBFT, and we showed that PoE actually performs better than PBFT in the experiments.
So the idea is that with these four or five protocols, which we are now able to create,
maybe in the future we can also add more elements,
or maybe we can use these existing elements to create newer protocols.
So this experimental evaluation is sort of like an initial step
to allow developers to create a framework where they can easily mix and match different
protocols and create newer designs. Awesome. Yeah, I have a few questions that kind of follow up from
that. The first is Bedrock. I mean, can you tell us a little bit more about it? Because I'm not familiar
with that system at all. Yes. So Bedrock is actually work with one of the co-authors.
One of the co-authors on this paper is Mohammad Javad Amiri, who is from UPenn. So it is his ongoing work, and the idea of Bedrock is basically trying to understand
how different BFT protocols work in practice. So I think Mohammad has done a good job there. He
tried to implement almost 10 to 15 different BFT protocols.
And what it tries to show is how these protocols perform in practice: what is the throughput,
what is their latency?
It could be in the single cluster,
it could be across a geo cluster,
multiple regions and continents.
So I think that is what he's trying to experiment in Bedrock.
And I think that is why we use Bedrock for this framework.
Sure, yeah, that makes total sense.
I mean, I really liked what you were saying as well
about having these plug-and-play modules
for each element, right?
I mean, I can imagine me just coming along
and writing some config files:
today I want to try out this new protocol
I think could be great,
I want to evaluate the properties of it,
throughput, latency, see how it performs.
And I can just plug and play:
I want to do this round with this element, and stitch it all together. That's really cool. So
how easy was that to implement? To have these sort of independent modules, is that quite
difficult, or did you find things leaking into each other? Actually, that's a good question.
So in this paper, interestingly, we used Bedrock. This was something we
also attempted at ResilientDB while we were
doing this. And in both of these systems, I think what we did was we first started
by creating a simple protocol. Let's say we just took PBFT, for example. And once we designed PBFT,
we saw there are very common phases, the same things which we are talking of as
elements here, right? Like the prepare and commit,
the same things have to be done in there. And these could be simply created as independent
modules. Of course, as you said, they will have some common parameters, and these parameters will
be accessed by each other. So that is the same thing which is going to happen even here. But
we could easily separate out these as functions like a function called this prepare function
as a commit function or a vote function.
And now for these four protocols, for example,
we realized that these functions could be similarly used.
But yes, of course, there were some nitty gritty details.
For example, in 2PC, we realized that the vote
is basically telling whether you want to actually commit
or abort the transaction.
While in PBFT or Paxos, a vote basically means I'm ready to agree with the primary.
So in that case, we had to have some changes.
We had to have one vote for commit protocols and another for the consensus-style protocols.
Similarly, the execution phase, I think, was quite easy for almost all the protocols,
because you're simply executing the transaction. And even the leader election was quite simple.
So some of these phases were very easy to separate out.
Yes, but there were some other phases
which use common global parameters.
And I think there you have to link them together.
So there, I think, the developers have to do some editing.
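The vote distinction described above might look something like this in code; the enum, the keyword parameters, and the message shapes are hypothetical, chosen only to show how one module can carry two semantics.

```python
from enum import Enum

# Illustrative sketch: the "vote" module means different things in
# commit protocols (2PC) versus consensus protocols (Paxos/PBFT).

class VoteKind(Enum):
    COMMIT_ABORT = "commit_abort"   # 2PC: vote decides commit vs abort
    AGREE_LEADER = "agree_leader"   # Paxos/PBFT: "I agree with the primary"

def make_vote(kind, *, can_commit=None, accepts_primary=None):
    if kind is VoteKind.COMMIT_ABORT:
        # 2PC-style: each participant independently votes commit or abort.
        return {"kind": kind.value,
                "decision": "commit" if can_commit else "abort"}
    else:
        # Consensus-style: the vote signals agreement with the leader's proposal.
        return {"kind": kind.value, "agree": bool(accepts_primary)}
```

One shared vote module, parameterized by protocol family, is the kind of small edit the shared framework still needs from a developer.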
Okay, how big was the implementation effort overall? How long did it take?
Okay, so two things. As I said, part of it was we already had Bedrock
with us, so we didn't have to do a lot of implementation. The only implementation we had to do was for 2PC, 3PC, and Paxos.
But I think it did not take much time.
It just took us a matter of, I believe,
Mohammad told me, a couple of days
to implement these few protocols.
And I think that was the beauty about it.
Because generally, if you're going to write a protocol,
it is going to take you a lot of time.
But if you have a plug-and-play system, it just takes a couple of days to create
multiple protocols. I was thinking the answer was going to be in the magnitude of months rather than
days. Wow, that's awesome. Cool. So we've spoken about
how effective and how useful the framework
can be, and you've spoken about the evaluation, but are there any limitations to the framework? What
are the areas where it doesn't maybe fit perfectly? Actually, that's a great question. In
fact, you yourself, while asking me the question, pointed out a limitation. There is
sort of a limitation: the framework, the system right now, is not fully expressive. We need to do much more, I think.
To start with, one of the points you raised was that there are protocols which talk about read
and write sets, where you need to specify the read and write sets earlier in time, like the
deterministic protocols, which say that you need to know the read and write set of the transaction
ahead of time. And I think at this point in time, our
framework does not provide a way to
express such protocols, and we need to think about
how we can do that. And
similarly, right now, for all the
protocols that we have discussed,
we make a partial synchrony
or synchrony assumption.
The idea with leader election here
is that we're assuming the replicas will time out, and when they time out they will announce it,
and then they will go and select a leader, and so forth. But what
if the protocol is asynchronous? There are many asynchronous protocols where there's no
notion of timeouts. So how do we express such protocols in the framework? I think we still need
to work on that. And similarly, our framework right now, for example,
does not talk about things like reconfiguration.
I mean, what if a node wants to join the system?
What if a replica wants to leave the system?
Can we express those type of things in our framework?
That is something, again, we have to think about as a challenge.
And some more protocols, like protocols which have a DAG. Right now we are assuming a single ordering, and interestingly,
most of these protocols assume there's a single ordering. But there have been recent protocols which assume
DAG-based ordering, where there's a partial ordering happening. So
again, we need to think how we can include these types of designs in our framework. The DAG thing is
interesting because, I don't know whether this is related to what I'm going to ask, but how would you capture different
consistency guarantees in this framework, and is that something that you're
thinking about exploring as well? Actually, I'll say that is one of the things which
we are thinking about exploring. I do not have a right answer right now. That is something which has also
hovered around,
because right now we are assuming
just one guarantee.
We are assuming a single ordering across
all the replicas. But I think, yes,
you're right, we might have to think about
eventual consistency, or we might have to think about causal
consistency. Can we actually even talk about
these things in that framework?
It might be quite interesting. We
don't have a right answer right now, but this is something we are actively exploring.
Fantastic. Yeah, plenty of interesting things to be going at. Cool. I guess my next question
is then: maybe as a software developer, or someone working in databases or data management,
how can I leverage the findings from your research,
and what impact do you think
they can have on someone's day-to-day work and life?
Yes, I think, for that, I'll go back
a few years and I'll try to say
why we actually even came up with this idea.
I mean, this was not something
which we had been thinking about for years,
but during my PhD,
we had been working on these different consensus protocols, and that is where we designed this framework
which we call ResilientDB, a system which expresses different BFT protocols,
similar to Bedrock. And the idea of our ResilientDB framework was basically trying to say,
how can we express different protocols? The idea was not exactly plug and play, but just to give a framework which can allow
implementing different protocols. And we found this challenging, because there were speculative
protocols, there were geo-scale protocols, there were sharded replication protocols. We were trying to
handle these challenges. And I think this is the same sort of challenge which even developers have, because one of the people I was working with on this framework,
he is now working at Oracle, and he also said that he met different developers,
and these developers still find it hard to design Paxos, because it's very unclear to people how you can
even do simple sort of replication, or safe replication. And I think the use of this framework
would be that a lot of people easily know about the two-phase commit and three-phase commit protocols,
because these are the basic database protocols which are taught to everyone. And if we can show
that other protocols are quite
similar to these, that protocols like Paxos or PBFT are quite similar to the two-phase and
three-phase protocols, then people can easily reason about them. And this reasoning can actually help
in expressing different protocols, and I think might easily help developers to develop these
protocols and create the fallback cases. Because often I've seen developers saying, hey, PBFT is hard. Why is it hard? Because, oh, the failure case is very hard:
you have to think about sending a lot of messages, or receiving them, or actually generating
the common state. We believe that may not exactly be the case. It might just need a simple way of expressing it
first. And yeah, I think it definitely gives a common language
for people to talk about these things
because I think we all kind of have in our heads,
otherwise, our own way of understanding these protocols, right?
And then often my worldview is not compatible
with somebody else's
and we kind of get confused on terminology and whatnot.
So yeah, definitely kind of giving people a common baseline
to talk about these things on is a massive, massive win.
And I think as well, I mean, you touched on it a little bit there, where you said, and I was getting
this as well: two-phase commit, three-phase commit,
okay, I can
understand these. You go into Paxos, and Paxos sort of has this myth associated with it
being difficult to understand, right? It almost feels impenetrable sometimes.
But when you break it down like this, you just see: oh no, it's just doing a few extra phases, right? And that
kind of demystifies it and sort of takes away the scariness of it. So
yeah, definitely, I agree with everything you're saying. Cool. Yeah, I guess,
when you were working on this, what was maybe the thing that you learned
across working on it that kind of caught you off guard,
and you were like, huh, that's interesting,
I didn't expect to learn that sort of thing?
Yes, I think two things were quite interesting to me
while I was trying to understand
this framework. Specifically,
as I was describing the protocols,
we saw there are some similarities between PBFT, Paxos, 2PC and 3PC. I think that was quite interesting. Especially,
not just similarity in saying that, of course, they're all agreement protocols and
they have similar-style phases; it was that the number of phases were exactly the same, and
the only difference we found was whether
they were actually replicating the transactions, or they were not replicating transactions and
the nodes were taking their own decisions. And that was quite intriguing to us,
because we were like, okay, they're not quite different protocols, they have the
same style of phases. And I think this is exactly what the challenge has been in the community.
And we all have been looking at the protocols
in a different way,
but they're simple extensions of each other.
You just add more phases to Paxos,
you increase the quorum size,
now it's PBFT.
You change the quorum size,
you decide who will be making the decision,
now it's actually 2PC.
And how should I parallelize the protocol?
You just allow replicas to run multiple consensuses at the same point in time.
Now it's a parallel protocol, and you just need to collect and aggregate the state.
So I think this was quite interesting to us.
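One way to sketch these "simple extensions" is to describe each protocol by a few knobs and then diff them. The knob values below are a simplification for illustration (e.g. PBFT's quorum of 2f + 1 out of n = 3f + 1 replicas, where f is the number of tolerated faults), not the paper's notation.

```python
from dataclasses import dataclass

# Hedged sketch: each protocol is a small set of knobs, and changing
# one or two knobs turns it into another protocol.

@dataclass(frozen=True)
class Knobs:
    phases: int          # number of communication phases
    quorum: str          # size of the quorum needed to decide
    decider: str         # who makes the final decision

PROTOCOLS = {
    # 2PC: every participant must vote yes, and the coordinator decides.
    "2PC":   Knobs(phases=2, quorum="n (all)", decider="coordinator"),
    # Paxos: majority quorum, and the leader drives the decision.
    "Paxos": Knobs(phases=2, quorum="n/2 + 1", decider="leader"),
    # PBFT: an extra all-to-all phase and a larger quorum for malicious faults.
    "PBFT":  Knobs(phases=3, quorum="2f + 1",  decider="leader"),
}

def diff(a, b):
    """Which knobs change when turning protocol a into protocol b?"""
    ka, kb = PROTOCOLS[a], PROTOCOLS[b]
    return {field: (getattr(ka, field), getattr(kb, field))
            for field in ("phases", "quorum", "decider")
            if getattr(ka, field) != getattr(kb, field)}
```

Diffing Paxos against PBFT, for instance, surfaces exactly the two knobs mentioned above: one more phase and a bigger quorum.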
And this was something I think we didn't realize: it is quite easy to understand. Even
after working for several years on these protocols, we ourselves did not realize it
until we actually started writing down this framework and realized, yeah, we are
using exactly the same stages, and that is exactly easy to define. Nice, yeah, for sure. I mean, when you
were working on it as well,
when did the idea sort of happen initially?
When did you decide, you know what,
we're going to do this framework?
How long back was that?
So I think, in some sense,
the idea for this specific paper,
I'll say, was just last year.
But the idea in general, to design such a
framework or a way to express protocols, had been with us since, I think, 2018 or 2019. Since
2018, 2019, we wanted a way to express them. And I think the reason why we wanted to express them
is because, like everyone, when I started, I started with just understanding 2PC, 3PC. And then, of course, Paxos has been
difficult for everyone, I think. And then I just switched to
PBFT and tried to understand it. Once I understood PBFT, I think
I went back to Paxos: oh, it's not exactly difficult. It's quite similar
to what PBFT is doing. It just reduces phases, and
you don't require that much communication sometimes.
So I think that is what
was quite interesting for me.
But at that point, I was saying,
if this is the case,
why don't we just simply express
all these protocols in one framework?
And actually, that helped us
to design a few other protocols.
I think subsequently,
while I was understanding these protocols,
we had this GeoBFT paper, which was at VLDB, and we had another paper, RCC, which was at ICDE.
And I talk about these protocols in
this Chemistry Behind Agreement paper also. So the protocols PoE, RCC, and GeoBFT,
all three of the protocols,
they came to us in a matter of
one and a half to two months.
And why did they come to us?
Because once we saw how similar the phases were,
we were easily able to split them.
We saw that, oh, now you can simply parallelize,
and that will give you performance.
And now you can actually replicate the consensus across clusters, and that will give you
another different design, which will help you scale more. So I think since then we were trying
to think, okay, maybe we should come up with a common framework of our own, and finally
it happened. So I thought, yeah, finally we're going to have to write this framework
down. Awesome. Were there any dead ends you hit, though, when you finally came to the point of deciding,
right, let's write this down?
Were there some kind of bumps in the road with it?
Yes, there were.
Actually, there were a lot of bumps in the road.
For example, when we started out, we started out with the four simple protocols.
And I think they were very easy to express. But as we moved forward with further protocols, we realized that we had to come up with different notations.
For example, in chemistry, if you're trying to create a chemical bond, you have different types of arrows, or you have different types of bonding:
for benzene, you have a hexagon structure, or you have other structures. Similarly, we were quite perplexed: how
should we express multi-primary protocols, or how should we express sharding and replication?
In fact, if you are going to express sharding separately, it's fine, but how should
we express them in the same diagram, for a single protocol, and that
too in a concise manner? We cannot just have some comments written on top of a protocol, because
that will not be quite interesting. And I think for that we had to iterate many times,
because I remember we first tried to make some sort of brackets, or we tried to make some sort of figures or square boxes, and it was like, no, that's not concise, that's not enough.
I think that was quite interesting and challenging for us: how to come up with this compact
notation. And in fact, we still believe that for future protocols, as some of the things I was
telling you that we have yet to cover in our framework, like the different deterministic protocols, maybe we still need to come up with new notations and expand the set of
elements, atoms, and these notations to include those protocols. So I think that was quite challenging
for us, and that will still be challenging, I believe. Yeah, I think that's a nice segue into my
next question: what's next on the research agenda? Where do you go from here? So I think all the open problems, we have
started looking at them. Right now, for example, we only looked at a small set of protocols. I
think in this paper we only looked at around 20 to 25 protocols. I
think we want to move beyond that, because just in the BFT world, for example,
there are 50-plus protocols, and there are chaining protocols.
There are protocols which assume the leader election is happening as part of the happy
case, every time.
And then, as I talked about, deterministic protocols are there.
And even in the Paxos world, there are different types of protocols. I think right now, what we're
trying to look at next is how we can express more of these protocols.
How do we have to define the annotations? Maybe we need to come up with a much more
precise set, because I think right now this is just a start, we believe. At the largest,
we want to expand
from these 25 protocols
to 50 or 100 protocols.
And then we can say,
hey, now this is the whole
suite of protocols
that this one language
can actually express.
And maybe beyond that,
we are also trying to think
maybe we can have
sort of like automatic
proof generation
for this protocol.
Like, for example, we're trying to think,
can we actually use things like Datalog to write down these protocols
and prove that, whatever our designs are,
these elements we can simply plug and play there,
and we can show that we can easily design new protocols.
So I think those are some of the goals
which we are trying to actively think about right now. Awesome, yeah,
I'm really looking forward to seeing where this goes. It'd be nice as well to talk, I mean, you
mentioned ResilientDB earlier on, and maybe it'd be a good time to talk about
some of your other research as well that you've done. I mean, you mentioned a few VLDB papers
and an ICDE paper. Tell us what else you've been up to in recent times. Yeah. So ResilientDB was something
which I started during my PhD. I think that is a system which
was sort of created. And I think that also relates to sort of the key
focus of my research. So my research mainly focused about
designing fault-tolerant protocols which can actually handle
malicious attacks.
And I think with ResilientDB, we wanted to do that.
We wanted to basically create a framework
which has the same client-server model,
but now you can actually introduce different protocols
which can handle malicious failures.
And we had several of our protocols there.
We have RCC, we have PoE, we have GeoBFT,
we have RingBFT.
So these are different protocols.
And these protocols target different environments.
Some work in geo-scale environments,
some work in sharded replicated environments,
some assume multi-primaries,
some simply reduce phases.
So that is what we are doing.
But that was something
which I did as part of my PhD.
Right now I'm doing something more, in the sense that
we are trying to look ahead of just consensus,
at what is beyond consensus.
We are trying to look at, for example, serverless edge frameworks.
And we are trying to see, if we have a lot of these serverless functions
and we have these edge devices,
how can these edge devices make use
of these serverless functions,
without having to fully trust
the serverless function?
In the sense, they don't want to store anything
on the serverless hardware or cloud.
And how can they do that in an efficient manner?
And assuming some of the edge devices might be malicious,
or maybe the cloud provider might just fail
in some cases,
in such a scenario, how can we actually provide efficient responses?
That is something which is upcoming in our ICDE 2023 paper.
Then we are also working on something about trusted hardware,
where the assumption here is that you have systems
which have trusted hardware, and we saw that a lot
of these systems use protocols which claim to have more benefits than they actually have.
And this is one of our works which is in process. The idea is to show
this by defining a new suite of protocols, which we call FlexiTrust protocols. And the
idea is that these protocols are more efficient than both categories: the protocols which use trusted
hardware and the protocols which do not. Our idea is that
our protocols are more efficient: they provide a higher throughput, lower latency, and they're more resilient to liveness attacks.
And that is some work
which we are doing in parallel.
And finally, I think one more style of research
which we are trying to look at is:
assuming you have multiple clusters,
and these clusters are running whatever protocol they want,
how can these clusters talk with each other?
And why is this interesting?
Because these clusters could be anything. The clusters could be organizations. For example, you have top organizations
that want to communicate with each other. Now, if they want to talk with each other, then they need some
sort of a trusted person to relay their messages, or some sort of a medium which is specific
to their communication; someone needs to handle and manage that. But if I don't want to have some trusted network,
I want to do this in a distributed manner.
I don't want to trust anyone.
So how can we do that?
And that is what we are trying to look at.
Of course, there is a simple way: you can ask everyone in one system
to talk to everyone in another system,
but then you're creating too many copies of the message.
So we are trying to see if we can do this type of communication
in an efficient, copyless manner. So these are some of the directions which I'm trying to explore.
Fascinating. This is a whole host of really interesting directions. How
do you decide which to pursue, and how do you even come up with these in the first place? What's your
process for doing this? I think that's a very interesting question, in the sense that
I cannot just give credit to myself for coming up with most of this. I think we have a lot
of brainstorming sessions, where a lot of us are just sitting around and thinking
over the ideas. One of the things which we always do, and I'm pretty sure everyone
does, is we try to see what might be relevant to industry or to a wider audience.
For example, in the communication protocol which I was talking about, the idea
was basically that we believe it's a common problem for multiple organizations
to talk with each other, and there have been some solutions in that direction. So we were trying to
see what we can do in that space. We didn't
want to do something with trusted parties. We wanted to do it without trust, and we wanted to also
reduce the communication cost. So that is where it comes from. And then similarly, we were trying to
think, oh, there is a lot of hardware these days which has some sort of trusted components: Intel's SGX is
there, ARM's enclaves are there, Amazon also has a small sort of trusted computing component.
So these types of machines are coming forward, and people are trying to use them for computation.
Can we design consensus around these types of machines?
Of course, people have done this in the past, but we first tried to find some limitations.
We found that there are actually limitations in these designs,
and then we tried to come up with efficient designs. So I think most of the times when we try to look at problems, we're trying to see if that problem is actually
relevant: can it actually be implemented somewhere, can someone actually use it?
And of course, you have to read what other people have done. So I think
that is what gives an idea about what is something more interesting which we can pursue.
I guess if you put your prediction hat on now: what do you think things are
going to look like in sort of 10 years' time? What do you think are the biggest
challenges we're going to be facing that we need to solve over the next 10 years?
Huh, that's a hard question.
It's an impossible question, I'm sorry.
Yes.
I will not deny that.
First of all, there's a lot of research in AI
and people are doing excellent and beautiful work in that direction.
I'm not even an expert to even comment that,
but I believe AI in databases is very interesting.
A lot of people are doing.
But I'll just talk about from the perspective of a person
who's looking at distributed systems and databases.
From my perspective, I believe,
I think achieving that extra throughput and low latency
would still be an interesting problem.
But I think a lot of focus
has recently moved towards designing uh energy efficient systems and in fact uh one of our works
which is also under submissions we are actually trying to design so for example i because i work
in a malicious attack space um there's a whole world which works on protocols which can handle like
blockchain protocols, basically. And we have been looking at those directions. But I think
traditional protocols like search consume a lot of energy. So I think we have moved towards
designing some protocol which can consume less energy. In fact, the product in this trusted
space work, which I was which I was talking about, they're also one of the graphs which we are trying to show is that our designs try to use less energy.
And I think energy will be quite key.
In a recent presentation I was attending from Google engineers and researchers,
they were also saying that companies are driving towards designs which consume less energy and are more efficient.
I think that is something which might be interesting.
Another direction I believe might be interesting
is security and privacy.
Interestingly, some of our works coincide in that direction.
Right now, most database works
don't cater to security in general.
Recent works are moving towards privacy.
We have seen a lot of privacy work:
how to secure the data of your client,
how to ensure the data is not leaked.
That is something very interesting.
But I think we are still not looking at malicious attacks per se.
There have been a lot of works in the systems community,
but the database community still has very few.
I think this will become more relevant
because attacks on databases are quite common.
Data breaches happen very often.
And very recently we just saw
the FAA incident,
where the flight control system
got completely shut down
because of poor replication, in some sense.
Such outages
could just as well be caused by hackers
who easily compromise the data.
So I think security might be quite interesting.
So that's where these BFT protocols
will come into use.
Right now, nobody uses them
saying that they're expensive,
but I believe they will see
much more application in the future.
I guess as well, as energy efficiency becomes more important and those protocols improve, they become
more usable, right? Or more appealing to use, because, I mean, economics, right?
And I guess, yeah, it's time for the last word now. What's the one takeaway you want the listeners
to take away from this work,
from this paper?
So I think
one simple takeaway I would say is
that consensus protocols or agreement
protocols are not hard.
What is hard is the way they have been
expressed and there's a
huge disparity in their design.
And this paper tries to amend those disparities. It tries to promote a single framework which can help express these protocols.
Our hope is that this one framework can help in designing future protocols, that newer protocols can be expressed in it,
and that maybe we can simply keep extending the framework to express other protocols
as well.
So this framework can act as a bridge between different communities: communities
which are using commit protocols, crash fault tolerant protocols, or malicious fault tolerant
protocols.
I think this framework can become a bridge.
And the theorem provers used by one community can be used in another community to express the same protocols. Maybe the designs can become common,
the practices can become common, by using this framework. That is what I believe is useful
about this framework.
Great message to end it on. Yeah, we'll end the podcast there.
Thank you so much for coming on, an absolutely fascinating conversation. If the listeners are
interested in finding out more about your work, we'll put links to all of the relevant materials in the show notes. And yeah,
we'll see you all next time for some more awesome computer science research. Thank you.