Disseminate: The Computer Science Research Podcast - Tamer Eldeeb | Chablis: Fast and General Transactions in Geo-Distributed Systems | #46
Episode Date: February 12, 2024

In this episode, Tamer Eldeeb sheds light on the challenges faced by geo-distributed database management systems (DBMSes) in supporting strictly-serializable transactions across multiple regions. He discusses the compromises often made between low-latency regional writes and restricted programming models in existing DBMS solutions. Tamer introduces Chablis, a groundbreaking geo-distributed, multi-versioned transactional key-value store designed to overcome these limitations. Chablis offers a general interface accommodating range and point reads, along with writes within multi-step strictly-serializable ACID transactions. Leveraging advancements in low-latency datacenter networks and innovative DBMS designs, Chablis eliminates the need for compromises, ensuring fast read-write transactions with low latency within a single region, while enabling global strictly-serializable lock-free snapshot reads. Join us as we explore the transformative potential of Chablis in revolutionizing the landscape of geo-distributed DBMSes and facilitating seamless transactional operations across distributed environments.

Links: CIDR'24 Chablis Paper | OSDI'23 Chardonnay paper | Tamer's LinkedIn

Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host, Jack Waudby.
This is the first show of 2024, so I guess a belated Happy New Year. I don't know if we're still doing that or not really.
It's probably what, we're into February now, so I don't think we can really say Happy New Year anymore.
But anyway, I hope you all had good New Years and Happy Holidays.
So yeah, the usual reminder that if you do enjoy the show, please
do consider supporting us through Buy Me A Coffee. It really helps us to keep making the show.
Now onto today's episode. I'm really glad to say that I'm going to be joined today by Tamer
Eldeeb, who will be telling us everything we need to know about Chablis: fast and general
transactions in geo-distributed systems. Tamer is a PhD student at Columbia University and
is also a software engineer at Jane Street. So he's very, very busy, I can imagine. Great,
great stuff. So Tamer, thank you for coming on the show. Oh, thank you so much for having me. I'm
really excited. Well, let's jump straight in then. So I've given you a very brief introduction
there, but can you tell us a little bit more about yourself and how you became interested
in database management research? Sure, yes. So I grew up in Egypt. I studied computer science
and then I moved to the US to start my career in software engineering.
That was in 2011, 2012 timeframe.
And as fate would have it, I would start on the Azure storage team,
which is kind of a distributed storage database team.
And I found that I really enjoy that area.
That same year, like in 2012, two very influential papers came out.
So during that time, I think the conventional wisdom was generally that transactions are too slow, we don't need them, we're going to build all the cool NoSQL stuff, you know, Cassandra and Bigtable and Dynamo. There were all sorts of these things. And then in 2012, two papers came out: the Spanner paper from Google, and also the Calvin paper in SIGMOD 2012 on deterministic database systems. And they both kind of showed that, actually, transactions are still very useful. Even Google developers, who are known for being highly skilled, were struggling without transactions. And, you know, here's a way we can actually build this global-scale system. Calvin showed a radically different way that you can do this. I found both papers very exciting, and since then I've been keeping an eye on the area, until I decided to start a PhD myself in late 2019, early 2020. I didn't intend to work in that area specifically from the start, but I just naturally found myself gravitating to the same area.

Awesome stuff. So going back, as you said, you must have been in there pretty early doors, I guess. How long had the team, and that sort of project, been going when you joined them originally? Because it kind of feels like day one sort of stuff.

It wasn't really day one, but it was early. I think the entire Azure thing started in 2006, the famous Project Red Dog, and I think Azure Storage had been in production for maybe a year or two before I joined, or something like that. So it was clear that this was becoming a huge thing, there was already lots of data being stored and so on, but it was also fairly early days.

Yeah, no, that's really cool. And I'm sure a lot of our listeners will be familiar with the influential Spanner paper and the Calvin paper as well, so that's really cool. I just want to bring this up, because we spoke about it off-air before we started recording, and I think it's quite funny: why you decided to pursue a PhD during COVID, during the pandemic. So I'll let you tell the listeners what the main motivation was, and what you tried before and went, no, screw this.

I mean, I tried baking like most people did during the pandemic, but I didn't find it very, very interesting. So I basically decided to just continue working while also doing some research. Not that I necessarily recommend it in normal times, but it worked out.

Yeah, it definitely did, because we've got this awesome paper as well to prove it. So let's talk a little bit more about that then. Let's start off, I guess, with a little bit more background and set the scene for the listener. We're going to be talking about geo-distributed databases today, so can you tell us what they are and kind of why we need them?

Sure. So I think, to define it: a distributed database is a database that runs on multiple nodes.
It's not like a single node kind of thing.
And geo-distributed here means that these nodes span geographical regions.
It's not like all in the same data center, for example, or the same cloud region.
And the reason we need them, I think there's two main reasons.
One is that many applications have users that are geo-distributed. Think about, I don't know, your Twitter or your Facebook or something: the users are all around the globe, so you want to keep the data for each user near them, so that they can have low-latency access to it. Latency is very important. We are all used to a snappy app experience, and if the apps take too long to load, users just leave.
Companies lose money.
It's a pretty big deal.
The other reason, which is also pretty important, is just disaster recovery.
I'm sure your viewers are familiar with, you know, an AWS region going down and the internet stopping to work for a bit, or something like that. So for a lot of mission-critical apps, you really want to store a copy of your data in at least one more region than its home region, to protect from things like natural disasters that can happen and take a data center down, and so forth. So these are, I think, the two main reasons why you need geo-distribution.

Yeah, I mean, we're so used to everything being instantaneous. Any application we use these days, we're conditioned for it to be instantaneous, right? As soon as it's not, we're not going to use it anymore. So there's a big thought there. And obviously the disaster recovery angle as well; that kind of speaks for itself, right? You think, oh, one region will be fine, but whatever can go wrong will go wrong, and now it's lost. So we need to cover our backs there. That's cool.

So, as you alluded to when you were answering one of my earlier questions, distributed database research has been quite a fertile ground. There's been a lot of work over the last 10, 15 years, especially as people have refuted this notion that transactions, we don't need them, and then gone, hang on a minute, that's a really good thing, let's have those. And how can we do those performantly at large scale?
And so maybe,
you briefly touched on Spanner and Calvin,
but can you kind of give us a rundown
of maybe some of the more state-of-the-art systems
in that space?
And some of the problems that they kind of have still.
Sure.
The way that I think about this,
and like it wasn't very obvious
when I started doing research,
but it turns out you can generally like categorize the state of the art
systems into two buckets.
Like there's fast and there is general,
but there isn't both, at least until now.
Nice.
Yeah.
So in that general bucket,
I think like these are your systems that support,
you know,
traditional SQL.
They have this unrestricted API.
They give you very strong consistency semantics.
So things like Spanner, or CockroachDB, or YugabyteDB; they're all kind of Spanner-influenced systems. And they are general in the sense that, like I said, they give you transactions with all the semantics that you want, you know, serializability consistency. You can just run full SQL on those with all the features; there are no restrictions to the programming model. But they tend to be slow, especially if you have to do transactions that cross more than one partition. So, you know, if you can't partition your data in such a way that you only touch one partition per transaction, you run the well-known two-phase commit algorithm, or a variant of it. It's not always implemented in the textbook form, but it's always some variant of it. And you take a hit because, you know, you have to do two RPCs and two writes to storage, at least, for every transaction. And that takes a toll.
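To make that cost concrete, here is a minimal sketch of the textbook two-phase commit loop being described, with a hypothetical participant API (not any particular system's code): every cross-partition commit pays two RPC rounds plus at least two durable log writes.

```python
# Minimal two-phase commit sketch (illustration only; hypothetical participant API).
# Every cross-partition commit pays two RPC rounds (PREPARE, then COMMIT/ABORT)
# plus at least two durable log writes.

class Participant:
    """Stand-in for a partition server; prepare/commit would be RPCs plus log writes."""
    def __init__(self, name):
        self.name = name
        self.prepared = set()

    def prepare(self, txn_id):          # RPC round 1; participant force-writes a PREPARE record
        self.prepared.add(txn_id)
        return True                     # vote yes

    def commit(self, txn_id):           # RPC round 2; participant writes a COMMIT record
        self.prepared.discard(txn_id)

    def abort(self, txn_id):
        self.prepared.discard(txn_id)


def two_phase_commit(coordinator_log, participants, txn_id):
    votes = [p.prepare(txn_id) for p in participants]        # phase 1
    decision = "COMMIT" if all(votes) else "ABORT"
    coordinator_log.append((txn_id, decision))                # durable decision (log write)
    for p in participants:                                    # phase 2
        (p.commit if decision == "COMMIT" else p.abort)(txn_id)
    return decision == "COMMIT"


log = []
print(two_phase_commit(log, [Participant("A"), Participant("B")], txn_id=42))  # True
```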
So that's the general systems. The fast systems kind of look at this
and say, okay,
two-phase commit is this
really slow bottleneck
and we really
don't want to do it. So we are
going to design the system
in such a way that we don't do it.
And in doing so, you
kind of lose
what I call generality. Like you lose a property of the system that is fairly important.
So one such property, the easiest approach, is: let's just not have any cross-partition transactions at all. We will limit transactions to a single shard. Or we may let you do transactions that cross shards, but with weaker semantics. Or the deterministic systems, which are structured very differently: they can be fast and they do support cross-partition transactions, but they restrict the programming model in ways that are fairly restrictive. The code has to be deterministic, which usually rules out things like conversational queries, where you run a part of a query, look at the result, and then issue the other part, and so on. They do things to try and mitigate that, but in general you really cannot support a SQL interface on it.
You kind of have to come up with a different query language
to fit the system or something like that.
Nice, cool.
Yeah, so I liked your trade-off there.
Basically, you've got these two families of systems,
these two buckets of systems, the fast ones,
but they make some trade-offs, weaker semantics,
maybe for cross-partition transactions,
restrict the programming model
so we don't have the nice SQL
that probably developers are used to.
And then on the other side, we've got the general systems
and they take a performance hit for that.
When we're talking about a partition here,
are we talking about, kind of, is the physical location of the machine related to these partitions, or are we talking about something else?

A partition in that context usually just means a single machine. It's not exclusive to geo-distribution; even things within a single region have the same trade-offs. I'm just saying that once
you have to like run across multiple machines,
you usually have to either run two-phase commit
and suffer the penalty,
or do something else and sacrifice some property about the system.
Yeah, as soon as the network gets involved,
things are going to get a little bit slower.
That was the conventional wisdom. Now, I should note that this categorization is more about systems that primarily store the data on disk. For in-memory systems there's been another line of research, where you keep your data completely in RAM and you use things like RDMA and so on. And there you could have fast and general, but it's also pretty expensive because it's all in RAM. And really, I think the popularity of on-disk systems just shows that developers really care about cost too.
Yeah, exactly. Especially when you start using fancier hardware, it gets expensive, like you said, for one. And also it's not as generally available, right? It's not as easy to use. So there are definitely some challenges there as well. Cool. So we know all the problems now, but we've solved this trade-off, this fast versus general trade-off, with Chablis, right? So can you give us the high-level elevator pitch for the system first?
Okay, so with Chablis we are going after geo-distribution, right? And we have two goals. One is we want transactions that are local to one region to be fast, and the other is we want global, externally consistent, lock-free snapshot reads without slowing down our regional writes in the system. So two goals, and they are in many ways, until this point, kind of conflicting. Plus, I think that's the fast part; the general part just speaks for itself. You still want to be able to run SQL, have an unrestricted API, and all the nice general things. Now, geo-distribution is an even harder problem than normal distribution. So even systems that are fast or general within the data center,
when you deploy them as geo,
they have even more trade-offs that they have to make.
So let's take Spanner, right?
Spanner was famous because of its TrueTime API that lets you do externally consistent lock-free reads, and that's a big deal: you can just do a global snapshot, lock-free, fantastic, right? But, A, it uses a specialized clock which is not as widely available. There are startups and cloud providers that are trying to make it more widely available, but still it's not as generally available as you would like. And the other problem is that, because of this TrueTime API,
like every single write in Spanner
has to wait out the clock uncertainty
before releasing locks.
So before committing,
you have to wait out the clock uncertainty to generate version timestamps
that are guaranteed to have certain properties.
Now, this clock uncertainty bound
is many milliseconds in the Spanner paper.
I think they have improved it recently.
Like maybe it's like one millisecond or something like that.
But it still means that basically you have to slow down every single write in the system
to achieve the lock-free snapshot reads. Some other systems basically require you
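As a rough illustration of the commit wait being described, here is a sketch assuming a TrueTime-style interface that returns an (earliest, latest) interval; the names and the one-millisecond bound are our own illustrative assumptions, not Spanner's actual API. The point is that the write has to hold its locks until its timestamp is guaranteed to be in the past everywhere.

```python
import time

# Toy commit-wait sketch with a made-up TrueTime-like clock (illustrative assumptions).

UNCERTAINTY_S = 0.001                       # assumed clock uncertainty, ~1 ms

def tt_now():
    """Return an (earliest, latest) interval guaranteed to contain true time."""
    t = time.time()
    return (t - UNCERTAINTY_S, t + UNCERTAINTY_S)

def commit_with_commit_wait(release_locks):
    _, latest = tt_now()
    commit_ts = latest                      # a timestamp no node could have observed yet
    # Commit wait: keep holding locks until commit_ts is definitely in the past
    # everywhere, i.e. until even the earliest possible current time has passed it.
    while tt_now()[0] <= commit_ts:
        time.sleep(UNCERTAINTY_S / 10)
    release_locks()                         # every write eats roughly the uncertainty bound
    return commit_ts

commit_with_commit_wait(lambda: print("locks released"))
```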
to run consensus across the globe
for every single write you make,
which also kind of like slows down
every single write in the system
to let you like read lock-free.
And finally, so, these are the
slow versions, right? And then
the fast versions are
based on
determinism,
which, again,
sacrifices the
generality of the programming model.
So on top of
the fast and general trade-offs
for the distributed transactions, when you go geo, you also have to sacrifice speed for generality or generality for speed even more.
There's more trade-off there.
And what Chablis shows is that you actually don't have to anymore. And I think the pitch here is that you can have local writes with hundreds-of-microseconds latencies, and you can have snapshot reads with external consistency, lock-free, in the same system. So it's fast and general. Now, of course, if you have to run a transaction that spans the globe, that will be slow, but that's just unavoidable.

Yeah, there's some fundamental sort of limit to how fast such a transaction can go, right? But it sounds... yeah, continue, sorry.

I'm just saying that reads can just go, without impacting writes at all. So that's the thing that we showed is possible with Chablis.
Nice.
Awesome.
So yeah, it really sounds like we can kind of have our cake and eat it here.
So that's really cool.
So I just want to just pull on one thread real quick.
And you've mentioned it a few times while you've been talking about this notion of external consistency.
So maybe you can kind of give a brief sort of rundown of what that actually means in practice.
Yes.
So I think it's also known as strict serializability, although some people slightly distinguish between the two.
I basically just treat them as the same thing.
So serializability, I think, is well understood.
It means that all the transactions you execute, the result is equivalent to some serial order.
Okay, great.
But it doesn't make any guarantees about which serial order.
And one example here is that like,
basically like if you have a read-only query,
you can always order it like as of the beginning of the database.
You can always return null.
And that would still be valid serializability because...

It's a cop-out, but yeah, it's technically still valid.

I didn't violate anything. It's as if it happened before all the other writes that you have.
So that's obviously not very useful.
And that's where the
consistency semantics
come in. And like,
these kind of tell you
like roughly
how stale
things can be.
And the strongest
semantics that you can give is this external consistency, and it basically means that if you start a read, you are guaranteed to observe all the writes that committed before you started that read, in real time. So basically, I can commit a transaction and then, I don't know, call you on the phone and say, hey, you can read right now, and you go and execute the read, and the system has to show you the results of my write, because it committed before your read started.

Nice.

So basically, even if the two transactions do not coordinate beforehand, the system is not allowed to order one before the other, if one committed before the other started.
It's got to respect the clock on the wall, right?
It's got to.
Yeah. Yeah physical time, basically.
Yes.
Now, transactions that start overlapping with each other may be ordered arbitrarily by the system, but if one committed before the other one started, it must be that the order will reflect that real-time relationship. And basically this is considered the gold-standard semantics. I think Spanner was the first geo system that achieves it, but Calvin and its successors also do. That's the gold standard that all these systems are aiming for.
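As a toy illustration of that real-time guarantee (an invented example, not a formal checker): if T1 commits and then, strictly later on the wall clock, T2's read begins, an externally consistent system must show T1's writes to T2; plain serializability alone would not force that.

```python
# Toy illustration of the real-time constraint of external consistency.
# If T1 commits before T2's read starts (in wall-clock time), T2 must observe T1's
# writes. Overlapping transactions may be ordered either way.

def external_consistency_ok(t1_commit_time, t2_read_start_time, t2_observed_t1):
    if t1_commit_time < t2_read_start_time:   # T1 finished before T2 began, in real time
        return t2_observed_t1                  # so its effects must be visible
    return True                                # overlapping: either order is allowed

print(external_consistency_ok(10.0, 11.0, t2_observed_t1=False))  # False: a violation
print(external_consistency_ok(10.0, 11.0, t2_observed_t1=True))   # True
```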
Just a quick aside: I remember reading somewhere, or somebody told me this, that when serializability was defined there's no mention of real time, of wall-clock time. But they almost didn't have to think about it, because when they defined those semantics, the notion of a geo-distributed system almost didn't exist; everyone's on one box, right? So they just got that for free. But now, obviously...

That's right. When you think about ACID transactional semantics and so on, it was all conceived on a single box. And so you got that for free, because you are acquiring locks on things, and that just orders things. But once you have replicas and global distribution and stuff, this suddenly becomes a major source of surprises.

Yeah, exactly.
Awesome, cool. Right, so let's dig into the details now of Chablis. But I guess before we do that, we need to give some air time to its predecessor, Chardonnay. And I'm getting a theme with the names as well, so we need to touch on that at some point too. But yeah, tell us about Chardonnay and how that then led to Chablis, and let's start filling in the details here.

Sure. So, like I said, I had known about the Spanner paper and the Calvin paper, and I'd been thinking about the trade-offs they both make. And I concluded, okay, it stems from slow two-phase commit. And then at some point I was reading another paper, in NSDI 2019, I think it's called eRPC, where it says datacenter RPCs can be general and fast.
And it shows that
in bare metal data centers with
modern networks and things like kernel bypass
and so on, you can really have RPC latencies that are
five microseconds in the data center. Like, okay, that's interesting. That removes kind of one
reason why two-phase commit is really slow. And then I came across a bunch of other research on things like NVMe and Z-NAND, really low-latency storage that is also single-digit microseconds. And I was like, wait a minute, it sounds like we can actually have very low-latency two-phase commit right now. Back-of-the-envelope estimates: if these numbers actually hold, there's no reason why you can't design a two-phase commit protocol that finishes in 50 microseconds or something. But a typical SSD access is 300 microseconds; you know, a commodity SSD, not the fancy Z-NAND or Optane ones, the cheap one that you actually want to use to store your data, takes you, I don't know, 300 microseconds to access. So suddenly the latency of one I/O is actually higher than two-phase commit. And I was like, hmm, interesting. So maybe two-phase commit isn't the fundamental bottleneck that people think it is.
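A back-of-the-envelope sketch of that argument, using the rough, assumed figures quoted in the conversation (these are ballpark illustrations, not measurements):

```python
# Back-of-the-envelope latency budget with the rough numbers from the conversation.
rpc_rtt_us       = 5     # kernel-bypass datacenter RPC round trip (eRPC-style ballpark)
fast_log_us      = 10    # durable append to a very low-latency log device (assumption)
commodity_ssd_us = 300   # one read from a cheap commodity SSD

# Two-phase commit ~ prepare round + commit round, each with a durable log write
# (participants contacted in parallel).
two_pc_us = 2 * (rpc_rtt_us + fast_log_us)
print(f"2PC ~= {two_pc_us} us vs one commodity SSD access ~= {commodity_ssd_us} us")
# The data I/O, not the commit protocol, dominates the transaction's latency.
```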
And what would you do if that's the case?
What would the system look like?
Can you finally be fast and general?
And I mean, no shockers here.
The answer is yes.
That's the result of the Chardonnay paper. It was focused on a single data center. The idea was, let's
make the assumption that two-phase commit is fast
and design a system based on that.
And what we wanted to achieve is not just low latency,
but also the ability to handle high
contention workloads. So, you know, contention is where one
or a few records in the database become so popular that they get most of the accesses. And it's something that's very unpredictable: it's very easy to have an app where the load is very evenly balanced, but then suddenly something becomes very popular and you couldn't have known beforehand. And once you have high contention, in the slow systems the performance just drops really badly: either very high abort rates because of deadlocks, or, if you are using optimistic concurrency control, things just assume that the data isn't going to change, and then they try to commit, and the data changed, and they have to restart. So Chardonnay starts from the assumption
that transaction IOs are slow,
the network and the log are fast,
and it does a few things.
One is it uses the fast RPCs
to support the lock-free snapshot
mechanism.
And the way you do it in Chardonnay, the key idea here, is we have a service called the Epoch Service. It's a very, very simple service. All its job is maintaining a single counter. That counter is not incremented by transactions; it spontaneously, on a timer, just advances. All right? So it's not a sequencer in the traditional sense, where transactions would go and get a number and that determines the ordering. No, it's just kind of a clock. It's very coarse-grained. Because transactions are only reading it, we can actually make it distributed and scalable. It's not a bottleneck. And that's a
key difference from sequencer-based designs. But the key point here is that transactions,
they run in very much the classic way transactions run in a shared nothing system.
This is the root of the name Chardonnay: because, A, it's a sharded system.

Nice. Yeah. That's good.

And B, it's an architecture that's very classic, that we think has aged like fine wine.

Oh, it's vintage. Yeah, nice.

So in Chardonnay,
transactions run using two-phase locking. You know, these are things that were designed in the '80s or something, by Jim Gray and Phil Bernstein and people like that. A very classic design, shared-nothing: you figure out, okay, I need to read this key, then I'm going to go to the partition server that has that key, I'm going to take the lock on that key, and in the end I'm going to run two-phase commit to make sure that everything is atomically committed.
But we use very fast RPCs, so the network time is minimal compared to the I/O time. And while running the commit protocol, you just read the epoch from the Epoch Service: you issue an RPC in parallel, you get the value, and whatever value that was is the value that you use to version your records.
And the system just maintains the property
that a transaction that reads an epoch value 5
is ordered before a transaction
that reads an epoch value 6.
And that's very easy to guarantee.
We talk about the details in the paper,
but it's easy to just make sure that that happens
because transactions still acquire locks
and they are ordered.
So when a transaction finishes execution and it reads a value of the epoch, say five, that means that it cannot have depended on a transaction that read a value of six. And why is that? Because it has the locks, and because it finished execution before reading the epoch, it knows that all its dependencies finished and read the epoch when the value was five or less. It couldn't have been six. Does that make sense? Because the epoch is monotonically increasing by the system in real time, if you read a value five, that means that anything you depended on had a value of five or less. It couldn't have been six, because you read the latest value.

Right, okay, yeah.

So that's nice, because that means that the epoch boundaries are consistent points in the order, and you can read there lock-free.
Ah, okay. So you do your lock-free snapshot reads at an epoch version, basically, but I guess the one before the one that's the active epoch at the moment? Is that how that would work?

That's basically right. That's basically exactly right.
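A minimal sketch of the epoch idea as just described, with invented names (this is not the Chardonnay code): a counter that only a timer advances, which a committing transaction reads, while still holding its locks, and uses as a coarse-grained version for its writes.

```python
import threading, time

# Toy epoch service: a single counter that advances on a timer, not per transaction.
class EpochService:
    def __init__(self, interval_s=0.01):          # advance every ~10 ms, say
        self.epoch = 1
        self._lock = threading.Lock()
        threading.Thread(target=self._tick, args=(interval_s,), daemon=True).start()

    def _tick(self, interval_s):
        while True:                                # never driven by transactions
            time.sleep(interval_s)
            with self._lock:
                self.epoch += 1

    def read(self):
        with self._lock:
            return self.epoch

def commit(txn_writes, store, epochs, run_2pc):
    """Classic 2PL execution is assumed to have finished; locks are still held here."""
    e = epochs.read()                              # read the epoch (in parallel with 2PC)
    if run_2pc(txn_writes):                        # unchanged two-phase commit
        for key, value in txn_writes.items():
            store.setdefault(key, []).append((e, value))   # version records by epoch
        return True
    return False

# Dependencies finished (and read the epoch) before this transaction did, so they carry
# an equal-or-smaller epoch: epoch boundaries are consistent cuts in the serial order.
store = {}
svc = EpochService()
commit({"x": 1}, store, svc, run_2pc=lambda writes: True)
```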
Okay, so it reminds me of a very similar scheme, like epoch-based memory reclamation. I can't say the word correctly... reclamation, there we go, that's the word. A similar sort of thing, right, where you get so far in advance of this epoch that nothing can have a reference to that epoch, to something in the previous bin basically, so you can garbage collect it. So it's a similar sort of concept, I guess.

It is a very similar concept.
And it was inspired by a paper called Silo.
It's a single-node multicore database,
but we showed how to basically distribute it.
Awesome. Cool.
But, yeah, and the nice thing is
if you want the property of external consistency,
meaning any transaction that started before I did, sorry, that committed before I did, I want to observe it. All you have to do is wait for the epoch to advance once, then you read as of just before the new epoch, right? Because you know that any transaction that committed before you started has an epoch of at most, say, seven. So if you read everything that has an epoch of seven or less, then you're golden: once the epoch has advanced, nothing more can commit with a lower epoch. Very simple.
Now, there are some technicalities when you are reading that we talk about in the paper,
but that's basically it.
Use the epoch to coarse-grain version the transactions.
So transactions can have the same epoch, right?
And you would have to wait until after the epoch to read.
But the epoch boundaries are
the points where you do your lock-free snapshot reads.
Nice, cool.
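And, continuing that toy sketch (same invented EpochService and versioned store), a lock-free, externally consistent snapshot read: observe the epoch, wait for it to advance once, then read every key's latest version at or below the observed epoch, taking no locks. It glosses over the in-flight-commit technicalities the paper handles.

```python
import time

# Toy externally consistent, lock-free snapshot read over the epoch-versioned store.
def snapshot_read(keys, store, epochs, poll_s=0.001):
    snapshot_epoch = epochs.read()           # anything committed before us has epoch <= this
    while epochs.read() <= snapshot_epoch:   # wait for one epoch advance
        time.sleep(poll_s)
    result = {}
    for key in keys:                          # no locks taken anywhere
        visible = [(e, v) for (e, v) in store.get(key, []) if e <= snapshot_epoch]
        result[key] = max(visible, key=lambda ev: ev[0])[1] if visible else None
    return result
```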
We'll probably, I guess, touch on this; obviously we've got to talk about Chablis and everything else, but kind of how we actually choose the time between incrementing these epochs.
Maybe we can touch on that when we get to the results later on.
Yeah, I think that's a really good question.
And in Chardonnay, it's not super sensitive.
You just want it to be large enough
compared to the transaction execution time,
but small enough to not have the users wait for too long
to get the external consistency.
So we set it to like 10 milliseconds or even five.
We found it works well enough.
Nice. Cool.
So now, talking about Chablis,
there were some challenges about taking this concept of Chardonnay
in sort of a single data center
and then kind of going to distribute this.
So can you tell us what these challenges were?
And then we're going to talk about how you overcame them.
Yes.
So basically, the problem here is that
every committing transaction has to read the epoch, which is fine if all the nodes are in the
same data center.
But once you go geo, the question is, where do you put the epoch service?
If you put it in one region, then all the transactions from the other regions have to make a cross-region RPC to read the epoch, which is bad because that would slow down the transactions in the other regions. And if you try to replicate it across regions, reading from the epoch service physically has to be a consensus read, because you have to get the absolute latest value. So then you're also running cross-region RPCs, which also slows down all your transactions. But remember, in Chablis,
we want transactions that are local to a region
to have Chardonnay regional latencies.
We are talking 100 microseconds.
Yeah.
So that's the conundrum.
Where do you put the Epoch service?
And it turns out the solution is fairly simple.
Because of the way the Epoch service works, it turns out you can basically break it into
two.
In every region, there is a local epoch component.
We call it the Epoch Publisher.
And this looks to the nodes in that region like the Epoch service looked before.
It is the thing that they talk to to read the Epoch.
And then there is one central thing that can live anywhere that's actually
responsible for advancing the epoch. So the way this works is we have this global epoch service.
It would go and say, okay, now the epoch is five. It goes and pushes that value to all the publishers that are local to every region.
And it does not advance the epoch again until all the publishers have gotten the new value.
Now, the publishers are designed to be like replicated, highly available services.
So you're going to ask, what happens if we lose one of these publishers, right? A node going down, that's fine
because it's a replicated thing.
Okay, cool.
But what this implies is that the value
at every publisher can either be equal
to the true epoch or one less, right?
Yeah.
But it turns out it's really easy to fix the algorithms
to take account for that fact.
And I don't want to go into too much detail,
but the trick here is that when in doubt,
you just need for the epoch to advance,
and there will be no doubt.

Okay, so we just wait for it to get one more ahead, and then we know for sure that we're all good. Because, I guess, everyone can only be at the latest one or the one before, so if you then get to the latest one, nobody can be at the one before that. So then it's all good.

Absolutely.

I see, I see. Nice.
And once you do that, suddenly you can read lock-free in a geo-distributed fashion, all the while read-write transactions just read the epoch from the local publisher using fast RPCs. They don't block, they don't wait on the epoch to advance, nothing. They keep the exact same latencies as before. You didn't need any fancy clock hardware, you didn't need GPS clocks or anything like that, you didn't make any assumptions about the maximum clock skew. None of that. You just need one number. Just one number.
And it turns out it's easy to scale and maintain that one number.
Like I said, you have to make sure
that that one number is replicated.
Managed correctly, yeah.
It's durable and all that,
but it is not a scalability bottleneck
because you can have as many of these publishers as you want, basically. And it's not a latency bottleneck, because you read it using kernel-bypass RPCs; even on public cloud, not bare metal, you get today 20-microsecond latencies or so.
Yeah.
So this is something that's even like on cloud,
you can have today, like nothing fancy,
no clock, you know, hardware, nothing like that.
So fast and general, which was the goal.
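Here is a rough sketch of the split being described, with hypothetical names (illustration only, not the Chablis code): one global service bumps the epoch and pushes it to a per-region publisher, and it only advances again once every publisher has acknowledged, so any publisher holds either the true epoch or the value one below it.

```python
# Toy global epoch service plus per-region publishers (invented names, sketch only).

class EpochPublisher:
    """Per-region (replicated and highly available in practice). Local nodes read the
    epoch from here over fast intra-region RPCs; they never cross regions for it."""
    def __init__(self):
        self.epoch = 1

    def push(self, new_epoch):    # called by the global service; returning acts as the ack
        self.epoch = new_epoch

    def read(self):
        return self.epoch


class GlobalEpochService:
    def __init__(self, publishers):
        self.epoch = 1
        self.publishers = publishers

    def advance(self):
        self.epoch += 1
        for p in self.publishers:  # in reality cross-region RPCs, retried until acked
            p.push(self.epoch)
        # Invariant: no publisher is ever more than one behind self.epoch, which is the
        # fact the read algorithms are adjusted to tolerate ("when in doubt, wait for
        # one more advance").


regions = [EpochPublisher() for _ in range(3)]
global_svc = GlobalEpochService(regions)
global_svc.advance()
print([p.read() for p in regions])   # [2, 2, 2]
```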
Yeah, mission complete. So let's talk some numbers then, I guess. Can you tell us about your experiments with Chablis and what the results were?

Yeah, we ran a very simple experiment: we ran the YCSB benchmark to measure regional latencies. And, I mean, memory doesn't exactly serve me right, but I think a single write would take below 100 microseconds, compared to Cloud Spanner where it would take many milliseconds. And then we ran the snapshot read, the lock-free snapshot read with external consistency and so on. And, to clarify, this is a deployment across the US, so we have a region in central, in east, and in west US. And you can take a snapshot across those,
I think, with external consistency and everything.
And if memory serves right,
I think like 80 milliseconds,
like something like that,
compared to Spanner, which is like 60 milliseconds.
So it's a bit slower than Spanner
because you have to wait for the epoch to advance, but it's comparable, while our write latency is an order of magnitude faster.

Nice. So yeah, I'd count that as a win. I hope people will agree with that. Yeah. Awesome stuff.
So I always like to ask this question as well.
So this is kind of: are there any sort of situations? Obviously we've ticked these two boxes of fast and general. But are there any cases, or any use cases, where the performance of Chablis may be sort of suboptimal?
I guess I'm asking here kind of what are the limitations of the system?
Sure. Compared to something like Spanner, I don't think so, to be honest. But I think compared to something like Calvin, or its geo-distributed successor called SLOG, they're able to handle cross-region writes more efficiently. If the system has lots of cross-region writes, the performance of Chablis will be fairly worse than SLOG.

It can be sensitive to that aspect of workloads, I guess.

Yeah, so basically we make a kind of strongish assumption that you don't do cross-region writes that often, but you do care about cross-region reads. If that's not really true, I think compared to something like SLOG the system would perform worse.
Okay.
But obviously, these are the different trade-offs with SLOG, right?
As you alluded to earlier on, it falls into that fast category, but then we're probably giving up some flexibility.
You will lose, like, you know, the...
Generality of...
Programmability.
Yeah.
Yeah.
Nice.
Cool.
So, I guess, yes.
So where do you go next with Chablis then?
So what is the next step?
Is there a next step?
There is.
There is a next step.
And I think so far, in the paper, we kind of assume that the geo-partitioning of the data is static and known, but we want the user to be able to run a transaction that includes: move this piece of data from Europe to the US. Because say you are a user of Facebook, and you were living in Europe and then you moved to the US; the app will want to move the data with you, right? So working on geo-partitioning and making that a first-class part of the system, while supporting fast and general transactions, I think that's where we are going next.
Nice.
This kind of concept of ownership, I guess, of having the data move with you, reminds me, just on a tangent, of a system called Zeus that did something similar maybe. I don't know if it's on your radar, or maybe I'm getting it confused, because it was something we were kind of reading about among all the systems.

Yeah, so there are problems like, okay, how do you figure out where the data item is without running a geo-distributed query? There are all sorts of fun problems here that we are thinking about, so that's certainly one area that we are going for next with Chablis.

Yeah, I'd be really interested to see that, because there's a whole interesting space, I imagine, of things to tackle. So that's cool. And yeah, so my next sort of set of questions are more sort of higher level and general.
So yeah, the first question is sort of
kind of what impact do you think your work on Chablis can have?
And as sort of a software engineer, developer,
data engineer, et cetera,
how do you think I can kind of leverage
those sort of findings of your work?
So I think there are two things here. One is the epoch versioning, and having the service and so on; I think it can be useful in a lot more contexts, you know, ordering events and stuff like that. It's something that comes up in a lot of contexts. So I think people can look at this and realize, this is exactly the level of ordering that I need: I don't have to order every single transaction or event in the system with regard to each other, but if I can group them after the fact, opportunistically, using epochs, and then use the epoch boundaries to read, I think this is an idea that probably has broader applicability.
So that's one aspect.
The other is, I really do think that the databases we have right now were designed in some way for a different platform, different hardware, a different era, and there is potential to just have something that's straight-up better. I'm hoping that that kind of research encourages people to revisit the assumptions, you know, something like, oh, two-phase commit is this horrible bottleneck that we can't solve, or, oh, we need these fancy clocks to do geo lock-free reads, that kind of thing. I'm hoping that people can look at this and realize, oh, these assumptions no longer hold, and we can build next-generation systems that are better. And, you know, to facilitate that, I think we are working on releasing Chablis as an open-source thing. There's still some work there to move it from research-quality code to an actual production-grade thing people would want to use, but it's something that will hopefully happen soon.

Awesome, that'd be great to see how that progresses. Well, good luck with that; I hope it's a successful endeavor.
Cool, awesome. So I guess, how long have you been working on this? When did the Chardonnay project start?

I think Chardonnay started in late 2021 or early 2022, something like that. It got rejected once before it got accepted, so you know.

That's always a rite of passage. I mean, someone told me once, I'm sure it was the Raft consensus paper, which obviously has been extremely influential, I'm sure that got rejected a couple of times before it was finally accepted. So yeah, I mean, it's a lottery sometimes, right?

Yeah, the new system, thankfully, was well received from the start. But certainly, yeah.

So the question I'm leading up to is, across this journey of working on these systems, what's the most interesting lesson that you've learned?

I mean, I really
just think that there is a lot of potential in looking at the conventional wisdom and not accepting it as true. Now, it's not useful to just be contrarian; I think that's just not a good thing. I think it's important to understand why the conventional wisdom is the conventional wisdom. One thing that really drove me mad for a bit was people saying, oh, transactions, they don't scale. Okay, but why? And really, they didn't have an answer. How have you arrived at that conclusion? What's your thought process? Sure, you tried running, I don't know, MySQL and it didn't scale, but that really just doesn't necessarily mean that. So what I would say to people who are thinking about starting their own research journey is that it's really important to understand why the decisions were made in certain ways, because even though they may have been the best decision at the time, if you don't understand what went into them, you can miss out on opportunities where it's no longer the case.
Like, you know, two-phase commit was genuinely a problem. The people who tried to avoid it weren't misinformed or stupid.
It was really, really a problem, right?
So it just so happens that I happened to stumble upon the fact that it's no longer an issue
and I can build a system
based on that. The other lesson that's kind of more practical, or more concrete, is that when you are dealing with on-disk systems, multi-core designs and single-data-center distributed designs can look surprisingly similar, and you should look for ideas in one to apply to the other and vice versa. The whole epoch thing was in many ways an extension, or an application, of the multi-core Silo epoch design.
And it's very counterintuitive.
Like you would think, okay, like this can only like work in a multi-core single node thing.
But no, because the latency of disk is much higher than the network.
Suddenly, a lot of the things that made sense in multi-core can make sense in distributed systems, and vice versa, I suspect. So that's just a thing that wasn't at all obvious to me when I started, but it's something that's always in my mind now: whenever I'm designing something, I'm like, let me see what the multi-core people did about that, and maybe I can steal an idea or two.

Yeah, I like that. That's an interesting observation; I'm definitely going to keep that in mind as well. And I liked what you were saying about this conventional wisdom, and challenging the assumptions, or reviewing those assumptions that were made to get to that conventional wisdom, because the world changed, right? Things change, so maybe those assumptions are no longer valid anymore; the ball game's changed. But yeah, no, I really like that. I'm going to jump over the next two questions, just in the interest of time, to get to my favorite question, which is about the creative process.
And yes, I kind of want to know here, Tamer,
how you go about generating ideas.
And then once you've sort of generated a whole bunch of ideas,
like selecting things to dedicate two to three years
or however long to working on.
I don't know.
I'm very chaotic here; I wouldn't call it a process. But I think, to me, the way I try to think about problems is to really try to decompose them and figure out why they are problems. Is it a problem because we do it that way? Okay, fine, but if we try to change this very small thing, how can I take that, and does it even make sense, and so on. So, for example, I was trying to figure out how to do lock-free reads in Chardonnay, and I kept banging my head against the wall for a few weeks trying to change the write algorithm to make it work in such a way, and it felt like nothing really worked. But then I remembered the Silo paper, and I read it, and I was like, you know what, this epoch thing is interesting; why can't it work in a distributed environment? There were two challenges that we solved, but there was nothing fundamental; I was just trying to kind of prove by contradiction that it cannot work in a distributed environment, and then I couldn't prove it, right? So at this point I was like, maybe it can be solved. And then you sleep on it for a while, and one day you get the epiphany. So that's kind of how I like to do things: really think about the problem, decompose it, read papers. I think in some ways there are very few new ideas under the sun; it's really just being able to recognize something from a different context and apply it in a new context, which can be very valuable. So that's really part of my creative process: see if someone smarter than me figured something out, and see if I can cleverly apply it and make it work in a way that it didn't work before.

So definitely standing on the shoulders of giants. That's the exact phrase I had in my head when you were talking then as well.

Yeah, that's funny. We like trying to reinvent the wheel, which, I mean, to be fair, research is in many ways rewarded on very strong novelty, so that's kind of how the incentives are structured. But to me, I really like solving an important problem a lot more than coming up with a cool
technique that may or may not be that useful.

Yeah, no, it's a fantastic answer to that question, and it's another one for my collection, so thank you for that. It's great, I love to see how everyone has a different answer to that question. It's always interesting to see what people say and get an insight into how you work, how your brain ticks, and what makes you tick. So yeah, awesome stuff. Cool. So it's then the last question now: what's the one takeaway you want a listener to get from this podcast episode today?

Transactions: they can scale, they can be fast, they can be general. You should expect more from your DBMS, because it can do more for you.

I love that, what a great message to end on. Thank you so much, it's been a fantastic chat.

Thank you so much for having me.

Brilliant.
Where can we find you on socials? Are you on any of the platforms where the listener can go and connect with you?

I'm not really. I think LinkedIn is the best way.

Okay, cool. We'll drop that in the show notes, and we'll put links to all of the work we spoke about today in the show notes as well, so the listeners can go and find that.
But yeah, thanks again.
And just a quick reminder to the listeners,
if you do enjoy the show,
please consider supporting us
through Buy Me A Coffee.
It helps us to keep making the show.
And yeah, we'll see you all next time
for some more awesome
computer science research.