Disseminate: The Computer Science Research Podcast - Yang Wang | Rethinking Concurrency Control in Databases | #25
Episode Date: March 6, 2023. Summary: Many database applications execute transactions under a weaker isolation level, such as READ COMMITTED. This often leads to concurrency bugs that look like race conditions in multi-threaded programs. While this problem is well known, philosophies of how to address this problem vary a lot, ranging from making a SERIALIZABLE database faster to living with weaker isolation and the consequences of concurrency bugs. In this episode, Yang talks about the consequences of these bugs, the root causes, and how developers have fixed 93 real-world concurrency bugs in database applications. Whose responsibility is it to prevent these bugs from happening? The database or the developer? Listen to find out more! Links: Paper, Homepage.
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host, Jack Wardby.
Today we've got another installment of our CIDR series and I'm really happy to say that I'm joined today by Yang Wang,
who will be talking about his paper, Developers' Responsibility or Databases' Responsibility, Rethinking Concurrency Control in Databases.
Yang is an assistant professor at The Ohio State University, and he's currently on
sabbatical at Meta. His research interests are distributed systems, in particular fault tolerance,
scalability, and performance optimizations. And obviously, he is interested in databases as well,
hence the paper we are going to be talking about today. So Yang, thank you very much for joining
us on the show. Thank you for organizing this. I'm very happy to have this opportunity and have the chance to
share my research with others.
Fantastic. It's our pleasure to have you. So let's jump right in then. So can you maybe tell
us a little bit about your background, about yourself and how you became interested in
researching distributed systems and databases?
Okay. So I got my bachelor's degree in computer science
from Tsinghua University in China
and also master's degrees from there.
Then I joined the University of Texas at Austin
for my PhD degree
under the advising of Dr. Mike Dahlin and Dr. Lorenzo Alvisi.
Then after I graduated,
I joined the Ohio State University as an assistant professor.
Regarding why I'm interested
in distributed systems
and databases,
I think usually that's a hard question
because sometimes you just
cannot explain your interest.
So if you're interested in something,
then yeah, you're interested.
I would like to say that I like those things because,
so first of all, they involve challenging problems.
If you look at the concurrency control, if you look at distributed protocols,
some of them are probably among the most intellectually challenging problems,
in my opinion.
And also databases and distributed systems serve as the, you know,
foundations of today's, you know, IT infrastructure in both, you know,
big IT companies and even those university infrastructure and something like that.
Okay. So they are challenging and important.
I think those are the major reasons why I chose to, you know,
pursue research in these directions.
Fantastic. Yeah, I just echo that sentiment.
They're the sort of things that attracted me to these areas as well.
I mean, like you said, there's so many interesting problems in this space.
And there's enough interesting stuff there to keep you occupied for a lifetime, right?
So yeah, definitely agree with you on that one.
Cool. So let's talk about the paper then.
So developer's responsibility or the database's responsibility,
rethinking Concurrency Control. So maybe you can just give us the elevator pitch for this work, and maybe highlight the research questions you were trying to answer with this work.
Yeah, okay. So I can start from the background, especially, you know, if the audience is not very familiar. I know you are very familiar with this field, but let's assume some of the audience are probably not very familiar with this. So I think a CS student probably starts to, you know,
get exposed to a SQL database
probably in undergraduate courses, okay?
So when you get into that course,
you will learn that you can write a transaction
which involves multiple SQL statements, okay?
And if you go a little deeper into that,
you know that databases can execute
multiple transactions in parallel
to get a better performance.
Now, of course, the question is
if different concurrent transactions
interleave, what is going to happen?
Especially if you have learned
multithreaded programming and locking before,
you know that multiple concurrent things
executed in parallel is a painful problem.
But during your undergraduate course, your instructor probably will tell you that
database provides this very nice property called serializable. It basically says that if you execute
multiple concurrent transactions, their behaviors will be equivalent to some serial execution of them.
So in other words, you don't need to worry about concurrent transactions at all.
It will be equivalent to some serial execution.
This is called serializable. That's a very nice property.
And you know that when you are writing one transaction,
you're not going to worry about what other transactions are doing in the meantime.
Very nice property. I'm very glad to hear that.
At that moment,
I believe that addressed all my problems.
And then even during my PhD,
I work on topics about how to
improve the performance, how to improve
various things of
executing
serializable transactions.
Even to today, that's still a pretty
hot research topic.
But then,
once the multiple papers,
some kind of studies,
they kind of tell me that,
you know, in reality,
many people are not using
serializable transactions.
Okay?
So they are using something weaker
called,
the typical examples include read committed and snapshot isolation. Those are the two most common, you know, what we call weaker isolation levels provided by popular databases. Okay. I don't want to go into the details about the differences between them and
so on. So in short, when you are using weaker
isolation levels, then you are going back to the
old days that you have to worry about interleavings.
Some kind of interleavings will give you unexpected results.
We probably will go through some examples.
Then you need to go back to all the painful problems of race conditions,
whether I need to add a locking somewhere,
to put something in a critical section or something like that.
Then all the problems come back.
Then the question to me is that, why?
We have this really nice world with serializable transactions.
You don't need to worry about anything,
but there's a lot of research on that.
But why are people not using that?
And if that's a fundamental reason,
then maybe that has an impact on our research, right?
So if people don't like serializable transactions,
then maybe that's something we can address.
Or if it's really not addressable,
then maybe we need to research something else, right?
So that's, I think, the big motivation for this problem.
Okay, so basically I want to understand that, is it really
possible to let a database handle everything
about concurrency?
Or is it like what happens in the real world, that developers still have to worry about concurrency? And then, what is the separation between the database and the developers? Okay, I think those are the big questions we want to answer in this work. Okay.
Fascinating stuff. Yeah, really interesting questions you're trying to answer. So, I mean, to answer and explore these questions, you did a study, right?
So can you maybe tell us a little bit about the study and the way you designed the study?
Okay, okay.
So in this work, the first part of this paper is to study what really happens in real-world open source database applications. Basically,
what kind of
isolation levels are people using?
And if they are using weaker
isolation levels, or some of them don't even use
transactions, they use something called
ad hoc transactions.
If they are using those
weaker or ad hoc transactions,
are they meeting any
correctness issues caused by the weaker isolation level?
And if they meet those problems, how do they fix that?
And even more interestingly, why are they reluctant to use a serializable database?
So those are the questions we listed as a detailed question.
So what we really do is that we go through the issue and bug reports
of open source repositories.
Okay.
We search for issues related to concurrency control in databases.
Well, that's actually a painful procedure.
We can go to more details if you want.
Then we...
We can maybe talk about that later on, yeah,
when we speak about some of the difficult aspects of this work.
But yeah, for sure.
Yes, yes.
That's always the painful part of, you know, study work.
Okay.
So after we found them,
we manually read them to
understand what are
the root causes of the problem,
how people fix that problem,
and if they leave any comments about
why they prefer
weak isolation or ad hoc
synchronization, ad hoc transactions, and so on.
Okay? Yeah,
that's basically the methodology we use in this work.
Cool.
Yeah.
So, I mean,
are there any limitations
to this approach,
I mean,
were there alternative ways
you considered
when designing this
or was this sort of
the obvious
and best path,
you think?
As I said,
it's a painful process.
Okay.
So,
I would say
there are two limitations
of the work.
Okay.
So, first,
obviously, we cannot, you know, read everything.
Okay.
There are so many bug reports about open source applications.
So we cannot read everything.
So we try to do some search based on keywords.
Okay.
But it turns out that it's really hard to find the right keyword because, you know, people sometimes just call the problem a race condition, okay, which is indistinguishable from a multi-threaded race condition. And multi-threaded race conditions happen much more frequently than database issues. It's really hard to find the right keyword to search. So our final strategy is that we search for issue reports which include keywords from two fields.
One field is regarding concurrency, with keywords like race condition, locking, concurrency, and something like that.
Another field includes keywords from databases, like a database, SQL, transaction, something like that.
So finally, we basically only look at bug reports which include at least one keyword from each group. Since this is a heuristic solution, we may miss
something.
So that's one limitation of our work.
The other limitation,
and that's something we really like to do,
is to see what happens in
industry. Like what happens
in, for example, Oracle
or Microsoft SQL Server, okay?
We tried, but that's really a hard job
because there are always privacy concerns, right?
So, you know, Oracle has users using it. We cannot do a study about users' transactions; there are privacy concerns, okay? That's a much more painful job.
Even during some informal conversations with Microsoft people, we found the quick answer is that nobody knows the exact answer. That's something we would really like to do. Okay, if anyone in the industry has the resources or the interest to do that, we would really like to see a similar study.
Okay, that's a second limitation.
Or you could say it's a future direction.
Cool, cool.
That's interesting.
If anyone's listening and has access
to concurrency bug reports from their systems,
please do reach out, right?
Great stuff.
Yeah, it's interesting — tell me, these applications that you surveyed, how many different databases did they use? And also, what were the domains of these open source applications? Was it very wide-reaching?
Most, I would say a majority of those applications
are what we call e-commerce applications, right?
Online shopping, accounting, or something like that.
But there are a few exceptions.
I remember one of them is about a gaming server, okay?
And there are some about social networks or something like a Wiki.
They use database as back-ends as well. But I think the majority of them are like e-commerce.
Okay, cool. So let's talk about the consequences and the things you found from this study. So
can you maybe first tell us about some of the data inconsistency issues you found?
Yes, data inconsistency is the most common issue we find and that's not surprising.
So maybe let's go back to the weaker isolation levels to see why these things happen. I think you can then have a better understanding of what they really cause.
So let's give a very simple example.
Let's say you have a transaction that first reads some amount — say, how much money you have. Or let's say you want to buy something online, and your transaction will first check whether there's still a sufficient amount in stock.
You probably will do it with a select and then do an if.
And then if the amount is sufficient,
usually you deduct some number from the account
and that's it.
That's a very simplified purchase transaction.
So when you execute them,
let's say you have multiple people
trying to buy the same thing at the same time.
So if you have several liable transactions,
then there's no problem.
Some of them will get what they want. Some of them probably will not, but they will get an honest answer.
But let's say you move to another very popular isolation level called read committed. Read committed will allow your selects, your checks, to execute in parallel first. So that leads to the classic race condition phenomenon.
Basically, everyone checks at the same time.
They all find there are sufficient amount of items
in the stock.
Then when they do the deduction, you
will find probably that's not enough.
Then the final count will become negative.
That's a very classic race condition problem.
But now if you use read committed, it could also happen in your databases.
So let's say if I'm a company like Amazon, I sell things and I use a read committed database,
what could happen is that I may oversell my items. I may have 10 items in stock,
but if everyone checks in parallel, I may sell to 100 customers.
Okay, so let's call it overselling.
Okay, or let's say if you, you know, save some money in the account, it could also cause a problem called double spending.
Basically, I can try to spend my money at the same time to buy two different things, and the system will let me pass.
Okay, and leaving me with a negative account number.
Okay, this is called double spending.
Or if your system has something like coupons,
or vouchers,
and if the customer is smart enough,
they can use it more than once, okay?
By issuing concurrent transactions at the same time.
So double spending, overselling — those are classic problems caused by data inconsistency issues.
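The check-then-deduct race Yang describes can be sketched in a few lines. This is a toy, hand-interleaved illustration — the account name and amounts are our own, not from the paper, and the dict merely stands in for a committed database row:

```python
# Toy model of two transactions under READ COMMITTED: both SELECT-style
# checks run before either UPDATE commits, so both checks pass.
balance = {"acct": 10}          # the committed row

def check(db):
    return db["acct"]           # like: SELECT balance FROM accounts WHERE ...

def deduct(db, amount):
    db["acct"] -= amount        # like: UPDATE accounts SET balance = balance - ?

# Interleaving allowed by READ COMMITTED: both reads happen first.
ok_t1 = check(balance) >= 8     # T1 sees 10, its check passes
ok_t2 = check(balance) >= 8     # T2 also sees 10, its check also passes
if ok_t1:
    deduct(balance, 8)          # T1 commits: balance 10 -> 2
if ok_t2:
    deduct(balance, 8)          # T2 commits: balance 2 -> -6

print(balance["acct"])          # -6: the account went negative
```

Because both checks read the committed value before either write lands, both pass, and the second deduction drives the balance negative — exactly the double-spending outcome described above.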
Another very classic one is ID problems.
So most of these companies want to ensure
that when you have an order,
you have an order ID.
And the order ID should be unique
so that we can track what happens
with each order.
But again, if you use a read committed database, it is possible that two orders will get the same ID.
And then everything will mess up and you may get things from another order or your item may go to another order.
So it's hard to predict what could happen.
So those are, I think, the most common, you know, data inconsistency issues we have observed.
Nice. So out of — I think you say in the paper that you found 93 bugs, if I'm remembering correctly — of those, how many were related to data inconsistency?
78.
Okay, so that's the lion's share of the issues, around inconsistencies.
Because you also define another category of issues that you've identified as unavailability.
They usually cause the process or the database to crash, but don't cause your data to be wrong.
So common issues like if you have
two delete transactions deleting the same row, one of them will throw an exception.
And if you don't catch that,
then your process will crash.
And sometimes deadlock
is another possible reason.
Those I would say
they are not specific
to weak isolation levels.
Even if it was a serializable database,
some of those issues could
happen as well. That just says
even with a serializable
database, a small amount of these problems
will still happen.
Developers still need to be careful about
some aspects of concurrency control.
But some of them are related.
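The duplicate-delete crash is easy to mimic in plain Python. The `RowNotFound` exception and dict-backed "table" here are our own stand-ins for a real database error, but the shape of the fix is the same: catch the exception and tolerate the benign race instead of letting it kill the process.

```python
# Toy sketch of the unavailability pattern: a second concurrent delete of the
# same row raises, and an uncaught exception would crash the worker process.
class RowNotFound(Exception):
    pass

def delete_row(table, key):
    # like: DELETE FROM orders WHERE id = ? — errors if the row is already gone
    if key not in table:
        raise RowNotFound(key)
    del table[key]

table = {"order-1": "pending"}
delete_row(table, "order-1")              # first delete succeeds

# A concurrent transaction already deleted the row; without the try/except,
# this exception would propagate and take the process down.
try:
    delete_row(table, "order-1")
    outcome = "deleted"
except RowNotFound:
    outcome = "already deleted, ignore"   # tolerate the benign race

print(outcome)
```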
I guess the question that follows from this is then,
these issues that you identified, how severe were they?
How bad were they?
Okay, that's actually the most interesting question we find.
Okay.
So maybe still, let's go back to the history a little bit.
So, I mean, the problem with weaker isolation levels
are pretty well known.
People know that they could cause data inconsistency, right?
So, you know, it's a risk condition.
It could happen.
Okay, so far.
But many people are still willing to use that.
So, one question there is that,
do those issues really happen, okay?
Because maybe, you know maybe most open source users
probably don't really have a very high throughput.
So you probably get a few orders every day
if you are just a startup.
So maybe those things,
race conditions, don't happen at all.
So that's a guess.
But at least from our study,
we find that, at least for some people, that's not true. We see comments about this.
One guy said this problem happened five times over the past few months,
in a total of 10,000 orders.
So if you have 10,000 orders in five months, that's really not a lot.
So your system is idle most of the time.
But those issues still happen.
Our guess is that you may have some kind of a burst of incoming orders at the same moment.
A short spike in a short amount of time.
So the short answer to your question is that it is more severe than we expected. So at least I would expect that
for a person with tens of orders every day, this should not be an issue. But it seems that
it could still be an issue sometimes. But if you ask me what are the real financial costs to them,
none of them provide that data, okay? So I cannot answer that directly, sorry. But since they complain to the developers, it must be an issue for them, okay?
Yeah, that's true. Yeah, that's interesting, that even with what we'd think of as being, well, not low, but — you can't really imagine 10,000 orders, for example, generating a sufficient level of contention. I mean, like you said, if the workload is quite bursty, then I guess that could be an explanation. But yeah, that's a really good point. So, in your paper you break down the causes of these bugs. You categorized them into six different categories.
Maybe we could run through these categories and you could tell us about what they are.
Right.
Okay.
The first one is similar to the example I just described.
Actually, that's the most common issue.
Among all the 93 issues, 52 of them are caused by that.
So basically, that's a very common pattern.
As I said, you first have a select statement to check something, and if the check passes, you do an update. Okay, so we call that read followed by write. Okay, and if you do those kinds of transactions on a read-committed database,
you are almost guaranteed to have some issues if you have a high level of concurrency.
So it can be further broken down into
two types, but I think I will not go to details here.
Okay, if you are interested, you can always read the paper.
Okay, but that's the most common pattern to cause issues, especially when using read-committed.
So the second category is inappropriate error handling.
That's mostly related to the unavailability issues we have discussed.
So if concurrent transaction could throw an exception and you don't handle that, that's
an issue.
The third issue is lock timeout.
That's usually caused by deadlocks or too many concurrent transactions.
Again, that's not something specific to concurrency control, but the most interesting
thing we find for this problem is that sometimes people argue that to reduce the timeout or to
reduce the locking time, we should use a weaker isolation level. Even weaker.
One common example is saying the default isolation level is a repeatable read.
Then the developers argue that we should move it to read committed to reduce
the locking time to remove this lock timeout problem. That's pretty
interesting. The other issues I think are pretty minor, so I maybe not go through them here because they only happen to two or three lines.
But the last category is a big one because there are some issues, 22 issues we have not been able to identify the exact root cause.
Because some reports are still open. And sometimes people fix problems in an ad hoc way.
It's just to say, I disable that, disable that.
I change a little bit here, then problem disappears.
Okay, fine, good.
But then you don't know the exact root cause.
So, yeah, some issues fall into this category.
So we don't know the exact answer yet.
So in short, read followed by write
is still the most common pattern to
cause problems.
Cool, yeah, so that's the majority cause for most of them. So that's interesting.
You alluded to it a little bit there
when you discussed
the other category, the unresolved
issues you found.
But what sort of, I guess,
how do developers go about fixing
these bugs that they found? How did you observe
the approaches that they used to fix these
bugs? Other than just, like, turning something off.
That's actually
I think how
to fix the issue, the answer
to that from our study is the most
surprising part of our study.
The short answer to that question is that
people don't have a
unified solution to that. Different people use different solutions.
Many of them are very ad hoc and
some of them even contradict each other. They follow different philosophies.
Maybe again, let's
start from history a little bit. So as we said, it's not a new problem. People are aware that
there could be consistency issues caused by weaker isolation levels.
In the early days, some people's argument is that it's fine. We can let them happen.
Then we fix them manually.
If you think about the overselling issues,
it could happen to Amazon.
It could happen to an airline.
An airline actually does that deliberately — oversells seats.
But then when problems really happen,
what do they do?
Amazon will send you an apology email saying,
sorry, it runs out of stock. They may give you some compensation, like a voucher or something.
The airline will say, okay, so if you're willing to give up your seat, we will give you a couple of hundred dollars. So those problems can be fixed manually. And it doesn't have to
have a technical solution. So those are the arguments from early days.
I mean, that makes sense to me.
That makes sense to me.
But in our study, we find the answer is very different.
For the 93 issues, only one issue discussed this what we call business solution.
And it's not clear whether they finally adopted it.
So in the open source field,
people are,
at least from our study,
we conclude that people are not happy
with those business solution.
Okay.
We don't know the exact answer yet,
but our guess or from what we look,
okay,
the open source community is quite different from, you know, those closed industries.
If you think about Amazon, Amazon's database team and whatever storage team and their product selling team,
they all fund, they are all under the same organization.
They are all under Amazon.
Okay, so they can coordinate their responsibility. Okay, you could say,
who is responsible for fixing the correctness issues caused by weaker isolation?
They can coordinate that.
But in open source community, it's very different, right?
So the database provider
and the database application
are two separate organizations, okay?
If you use MySQL, right?
So MySQL now belongs to Oracle,
but it used to be open source.
Postgres, right?
And for those people who use them,
they are in different organizations
from the creator of the database.
Okay?
So now the question is that,
let's say if I'm a user of MySQL,
let's say I write a program or I use some online shopping software on it,
and I find my software oversells.
So am I going to fix my problem in my way,
or am I going to blame MySQL, or blame those who provided the shopping software?
So that's why the story in the open source community might be very different.
But that's our guess.
But that's from our observation.
So that's why we're particularly interested to see what happens in industry,
like in Microsoft and Oracle, or in Amazon.
The situation there might be very different.
Okay, but let's go back to the
question, how do we fix it? So the first answer is that in the open source community, people are not
very happy with business solutions. They want a technical solution to fix the issues. Again,
it's not a new problem. So the database community, I would say, is aware of these issues, so they provide some help for that. One specific semantic is called select for update. Basically, as we said, if you do a select then an update and you do it under read committed,
then the database may execute the select in parallel.
Now, select for update is designed to prevent this case.
If you say I declare a select
with additional keywords for update,
then the database knows that,
okay, so I'm going to do an update after the select.
So I should not execute two of those selects in parallel.
I should just treat the select as some kind of update.
So if you imagine this,
this is an exact solution designed
to address the problem we have just discussed.
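One way to picture what FOR UPDATE buys you: the row lock is acquired at the SELECT, so the whole check-then-update runs atomically. Here is a minimal sketch using a Python `threading.Lock` as a stand-in for the database's row lock; the SQL in the comments is illustrative, not executed:

```python
import threading

# The lock below plays the role of the row lock that SELECT ... FOR UPDATE
# acquires; holding it across the check and the write makes them atomic.
row_lock = threading.Lock()
stock = {"quantity": 1}         # one unit left
sold = 0

def purchase():
    global sold
    with row_lock:                    # SELECT quantity FROM stock ... FOR UPDATE
        if stock["quantity"] >= 1:    # the check
            stock["quantity"] -= 1    # UPDATE stock SET quantity = quantity - 1
            sold += 1                 # COMMIT releases the row lock

threads = [threading.Thread(target=purchase) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(stock["quantity"], sold)
```

With the plain select of the earlier example, several concurrent buyers could pass the check; here at most one does, so with one unit in stock exactly one sale succeeds and the quantity never goes negative.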
And actually, in our study, we have searched for this FOR UPDATE keyword specifically. So our study is a little bit biased toward FOR UPDATE. But despite such bias, among the 93
issues, only 22 of them use this solution. So a small percentage of them use this solution.
And if we look at all the other solutions, all of them are very diverse. So for example,
some of them decide to just add a lock, saying, if I do a select, I lock before the select and unlock after doing the update. But in databases, this is not very simple, because a database is distributed and your web server might also be distributed. So you cannot just use a POSIX lock or Java synchronized.
So some of them actually use an external locking service like Redis. So you need to grab a lock in
the service, then you need to send a locking request to Redis
before executing the transaction
and release the lock afterwards.
Okay, so this is called the additional locking pattern.
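A sketch of that external-locking pattern follows. The in-process `FakeRedis` class is our own stand-in so the example is self-contained; real code would use a Redis client's `SET key value NX` and a matching delete, and the lock key name and token scheme here are purely illustrative:

```python
import uuid

class FakeRedis:
    """In-process stand-in for a Redis client (real code would use redis-py)."""
    def __init__(self):
        self.store = {}
    def set(self, key, value, nx=False):
        if nx and key in self.store:
            return None            # NX: fail if the lock key already exists
        self.store[key] = value
        return True
    def delete(self, key):
        self.store.pop(key, None)

def with_lock(r, key, fn):
    token = str(uuid.uuid4())      # unique token identifies our lock
    if not r.set(key, token, nx=True):   # like: SET lock:item:42 <token> NX
        raise RuntimeError("lock busy")  # someone else holds the lock
    try:
        return fn()                # run the transaction while holding the lock
    finally:
        if r.store.get(key) == token:    # release only a lock we still own
            r.delete(key)

r = FakeRedis()
result = with_lock(r, "lock:item:42", lambda: "checked and updated")
print(result)
```

The token check before release matters: it prevents one client from deleting a lock that has since been taken over by another, which is a classic pitfall of this pattern.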
Now, some people even do it in a more crazy way.
They implement,
if you are familiar with database concurrency control, you probably have heard about optimistic concurrency control, called OCC.
So what they do is that they don't use lock.
They add a version to each row.
Then at the end of the transaction, they check whether your version matches.
I don't want to go into the detail,
but that's another way to do concurrency control in database.
But traditionally, this is implemented by the database itself.
But now we find at least eight issues.
In eight issues, the developers implement them
by themselves in the application layer.
So basically, for their table,
they add a column called version.
And when they do the select,
they read the version.
And when they do the update, they put the version in that column.
And at the end, they check the versions are matched.
So basically, they re-implement OCC by themselves.
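The application-level OCC scheme described here — add a version column, read it along with the row, and make the UPDATE conditional on the version being unchanged — can be sketched as below. The table layout, names, and values are our own illustration:

```python
# The dict stands in for a row with the developer-added `version` column.
row = {"quantity": 1, "version": 7}

def read_row(db):
    # like: SELECT quantity, version FROM stock WHERE ...
    return db["quantity"], db["version"]

def cas_update(db, new_quantity, expected_version):
    # like: UPDATE stock SET quantity = ?, version = version + 1
    #        WHERE ... AND version = ?   -- matches 0 rows if someone raced us
    if db["version"] != expected_version:
        return False               # conflict detected: caller must retry
    db["quantity"] = new_quantity
    db["version"] += 1
    return True

q1, v1 = read_row(row)             # T1 reads (1, 7)
q2, v2 = read_row(row)             # T2 reads (1, 7) concurrently
first = cas_update(row, q1 - 1, v1)    # T1 wins: quantity -> 0, version -> 8
second = cas_update(row, q2 - 1, v2)   # T2's version check fails

print(first, second, row["quantity"])
```

Only one of the two concurrent writers succeeds; the loser sees the version mismatch and must re-read and retry, which is exactly the conflict detection a database's own OCC would have provided.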
Okay, so the question is, what are we doing?
Those are supposed to be the job of a database,
not supposed to be the job of developers.
And there are many other answers.
For example, some of them say,
let's give up concurrency. I just execute those transactions in serial.
For conflicting transactions, I just execute them in serial.
One thing in particular is that, actually, one original question we want to ask is that why are people not willing to use the serializable?
We find that only one of the cases decided to use serializable ultimately.
So they said, okay, let's upgrade to serializable. Six other issues decided to upgrade the isolation level to some extent. But altogether, in only seven issues did people decide that
using a stronger isolation level
is the solution. In all the other cases,
they move to other solutions.
So in short, the solutions
or the fixes are very
ad hoc and very diversified.
And some of them are very
complicated.
So that's why we raised our
original question.
So do we really, is database really
able to handle everything by itself or do we want to push the responsibility to the
developer? And if we push, if we really want to do the push, then can we make their job
easier? Instead of re-implementing OCC just at the application level. There are a few
other categories. So if you are interested, I still encourage you to read the paper.
Okay, sure. I found that absolutely wild. Well, there's two things that jumped out there. First of all, that people basically re-implement OCC is crazy — it feels like such a wild solution to the problem. And then that they use an additional locking service as well. It's crazy that some people reach for that sort of solution, which a database should provide for you, in my opinion anyway. It's really interesting to see, and the additional locking feels like massive overkill almost. But anyway, that's really interesting — that's the kind of solution people reach for with these sorts of problems. That's cool. So you touched on it a little bit there, about how these findings helped answer the questions you asked prior to the study. So maybe you could elaborate on that a little bit, and look back on the questions you asked and how you can answer them now that you have this new knowledge.
Okay, okay. So here I'll just list a few questions people have asked in the past — so we're not the first ones. So the first question we have discussed: do weak isolation levels really cause any issues? Okay, because
you know people's guess is that maybe you don't have a high contention level, so maybe that's fine.
So our quick answer is yes, they still cause issues. Our guess is that maybe there's a burst of traffic to some extent, okay — the applications do not seem to have a sustained high throughput. The second question is how much effort it requires to handle those correctness issues caused by weaker isolation. Our quick answer is: it's a lot. Okay. So our general finding is that it's not hard to diagnose or reproduce the problem.
So this is quite different from traditional bug detection.
Usually it's hard to diagnose a bug, but after you know the reason, it's easy to fix that.
Okay.
But here it's the opposite.
Okay.
It's pretty easy to diagnose or even reproduce them.
Because as we said, the pattern is pretty straightforward.
Most of the issues are caused by read followed by write.
But fixing them is a lot harder.
That's where, again, we return to our question. The problem is that the database may still work under the assumption that it knows everything about concurrency control, but in reality it doesn't. And when it doesn't, it also doesn't provide additional support for developers to customize the concurrency control. That's why they have to rely on additional locking or implement OCC by themselves. So to me, this gap, where the database wants to do everything but actually doesn't, makes concurrency control even harder than just handing everything to the developer. Compared to saying "the database does nothing, you do the locking for everything yourself," things become even harder than that, to some extent. So the short answer to "how much effort does it require" is that it takes a lot of effort to fix these issues. Some fixes even need to rely on external systems, and some rely on multiple external services.
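To make the read-followed-by-write pattern concrete, here is a minimal, database-free Python sketch of the lost-update anomaly being described; the account name and amounts are purely illustrative.

```python
# Two "transactions" under a weak isolation level: each reads a balance,
# then writes back a value computed from that (by now stale) read.
balance = {"acct": 100}

def read(key):
    return balance[key]          # step 1: the read (no lock is held afterwards)

def write(key, value):
    balance[key] = value         # step 2: the write, based on the stale read

# Interleaving: T1 reads, T2 reads, T1 writes, T2 writes.
t1_seen = read("acct")           # T1 sees 100
t2_seen = read("acct")           # T2 also sees 100
write("acct", t1_seen + 50)      # T1 deposits 50 -> 150
write("acct", t2_seen + 30)      # T2 deposits 30 -> 130; T1's deposit is lost

print(balance["acct"])           # 130, not the expected 180
```

Under SERIALIZABLE, or with the read and write covered by a single SELECT FOR UPDATE transaction, the second transaction would be forced to observe the first one's write.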
So the last question, actually the most interesting one to us: why are people not using SERIALIZABLE? Why are they not even using SELECT FOR UPDATE? SELECT FOR UPDATE, if you think about it, is somewhat similar in spirit to SERIALIZABLE: basically you hold a lock for a longer time. So I think in the past, people's belief, or at least my feeling, was that serializable databases are not going to provide enough throughput compared to weak isolation levels. That is true, but the thing about the open-source community is that, as we said, most of these applications don't require sustained high throughput, so throughput is probably not the major concern here.
So then we went through the discussions to see whether people talk about this. Most people don't, but from a few discussions we found there's a general concern about locking. Basically, if you use a serializable database, the database will hold locks, especially write locks, for a longer time compared to READ COMMITTED or snapshot isolation, or to using no transactions at all. This leads to longer locking times, and it also leads to a higher chance of deadlocks.
Now, in today's databases, to detect deadlocks, most of them use a simple strategy: you set a timeout, saying that if a transaction does not return within, say, a minute, we kill it and retry. In one discussion, we found that people are really struggling to set the right timeout interval for deadlock detection. If you set it too short, heavy transactions will always be killed: if the timeout is one second and a transaction takes ten seconds, it will always get killed. If you set it long, like an hour, then it slows down recovery: if a transaction really gets into a deadlock, you have to wait an hour before it is resolved.
I'm quoting one developer here, and I totally agree: "It is a poor approach to architect software based on guessing deadlock timeouts." Actually, in my own research, I've run into the same problem in many other fields. So basically, our answer to "why not SERIALIZABLE, why not SELECT FOR UPDATE" is that they may introduce longer lock hold times and a higher chance of deadlocks, and if we just use timeouts to detect deadlocks, people find it a pain to set the right timeout interval. As a result, some developers are even willing to go to all those complicated, ad hoc solutions. But again, this is from a limited sample, probably four or five comments. So if the audience has any other answers, we'd be very happy to continue the discussion.
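A rough sketch of the timeout-based deadlock handling criticized here, using a plain Python lock in place of a real database; the timeout and retry numbers are invented for illustration.

```python
import threading
import time

lock = threading.Lock()

def run_transaction(work, timeout, retries):
    """Try to acquire the lock within `timeout` seconds; kill and retry on failure."""
    for _ in range(retries):
        if lock.acquire(timeout=timeout):
            try:
                return work()
            finally:
                lock.release()
        # Timed out: treat the wait as a possible deadlock, kill the attempt, retry.
    raise RuntimeError("gave up: real deadlock, or timeout guessed too short")

def heavy_transaction():
    # A competing holder keeps the lock for 0.3 s, mimicking a long transaction.
    with lock:
        time.sleep(0.3)

holder = threading.Thread(target=heavy_transaction)
holder.start()
time.sleep(0.05)  # let the holder grab the lock first
result = run_transaction(lambda: "committed", timeout=0.2, retries=5)
holder.join()
print(result)  # "committed", possibly after a killed first attempt
```

Shrink the timeout and retry budget below the holder's runtime and the same code gives up and raises instead: exactly the guessing game the developer quote complains about.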
Cool, great stuff. It's mad that just because it's difficult to determine what the optimal lock timeout should be, people are driven to what seem like quite extreme solutions. It's really interesting to see that's what people do. So I guess, building off of this, what can we do then? What can we as researchers, and also software developers, do to alleviate the problems associated with these concurrency bugs?
That's a question we discussed in the paper, but honestly, we don't know the absolute right answer. That's why we wrote it; this sort of paper usually motivates more discussion. Here I'll list our opinions. In the short term, let's say you don't want to do anything crazy that takes three to five years. First, between the two most popular isolation levels, snapshot isolation and READ COMMITTED, snapshot isolation provides much better correctness guarantees, which, again, is probably not surprising; people know it's stronger. Actually, snapshot isolation would remove the majority of the issues we found in our study.
Another major help comes from databases that provide what they call unique IDs. Basically, you can declare a column as unique, and the database will generate unique or auto-incremented IDs for you. This feature removes most of the duplicate-ID problems we discussed. I think by combining snapshot isolation with these unique-ID features, we can address most of the issues we found in our study. Whether snapshot isolation will itself cause other issues, we don't know yet, but at least that's the result from our study.
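The unique-ID feature mentioned here can be sketched with the stdlib sqlite3 module; the table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY AUTOINCREMENT, item TEXT)")

# Racy, application-side pattern (two clients can both read the same max id):
#   next_id = SELECT MAX(id) + 1 FROM orders;  INSERT ... VALUES (next_id, ...)
# Database-side pattern: omit the id and let the engine assign it atomically.
cur = conn.execute("INSERT INTO orders (item) VALUES (?)", ("book",))
first = cur.lastrowid
cur = conn.execute("INSERT INTO orders (item) VALUES (?)", ("pen",))
second = cur.lastrowid
print(first, second)  # distinct, monotonically assigned IDs: 1 2
```

Because the assignment happens inside the engine, two concurrent inserts can never observe the same "next" ID, which is what eliminates the duplicate-ID bug class.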
From the research side, we discuss two directions. First, could we push all the responsibility to the database? Is it possible to improve database design so that it can be serializable and not have any of the issues reported by developers? As I said, we don't know the exact answer; we discussed a few possible ways to improve things. But my personal feeling is that the database is not going to be able to handle everything, because there is always knowledge that only a human being has. A typical example: a developer may say, "one of my transactions doesn't need a serializable result; I'm just reading some statistics, and I'm fine with some error." That kind of thing is really hard for the database to figure out by itself.
So my opinion is that some kind of developer help is going to be useful. Although AI is popular today, we can't predict what AI will be able to do in, say, ten years; maybe with a very smart AI the machine could do the job fully. But so far, that's my opinion. So then the last question is: if we need the user's help, how do we get it? We don't want the user to implement OCC by themselves; that sounds ridiculous. Is there a better way for the database to expose something to the developer? For example, if we use locking, the database could expose locking to the developer to some extent; I know some databases can already do that. If we're using a versioning database, can we expose the versioning to the user, so that they can have some control themselves? That's the last question we discussed. Again, we don't have a concrete answer, but I personally think that's a direction to explore.
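One way to picture the "expose versioning to the developer" idea is an application-level optimistic check on an explicit version column. This is a speculative sketch, not a feature of any particular database, and the schema is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100, 0)")

def read_row():
    return conn.execute(
        "SELECT balance, version FROM accounts WHERE id = 1").fetchone()

def try_deposit(amount, seen_balance, seen_version):
    # The UPDATE succeeds only if the version is unchanged since the read.
    cur = conn.execute(
        "UPDATE accounts SET balance = ?, version = ? WHERE id = 1 AND version = ?",
        (seen_balance + amount, seen_version + 1, seen_version))
    return cur.rowcount == 1   # False means another writer got there first

# Two "transactions" both read version 0.
b1, v1 = read_row()
b2, v2 = read_row()
ok1 = try_deposit(50, b1, v1)   # first writer wins: balance 150, version 1
ok2 = try_deposit(30, b2, v2)   # second writer's version check fails loudly
print(ok1, ok2)                 # True False -> conflict detected, no silent lost update
```

The point is that the version check turns a silent lost update into an explicit failure the application can retry, which is the kind of control a version-exposing database could hand to developers.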
Nice. Plenty of interesting directions for us to explore then; that really sets the scene for your future research. So maybe we can touch on where you go next: what's next on your research agenda with this work?
So I think we are going in both directions. We are seeing whether we can develop an automatic analysis method to help people figure out what could go wrong if they're using a weak isolation level, and maybe automatically fix some of the issues.
And also we are thinking about what the right abstraction is between the database's concurrency control and the user. Should we expose anything? What do we expose? The challenging part is that if you expose something, maybe it will break other things, right? Databases also have this nice rollback support, atomicity support. If you expose locking to users, those properties may easily be broken. How to expose something through a new abstraction without breaking existing properties, I think that's the challenging question we're thinking about.
Nice, cool. Yeah, really interesting direction there. It'd be really nice to see your future work on that. I think you alluded to this a little bit earlier in the chat, but I would love to get your thoughts on what was maybe the most unexpected thing you found during this work.
I had been working on serializable databases from my undergraduate studies until the end of my PhD. But then I realized that most people are not using them. So that's the most unexpected part.
That's wild.
Yeah. You think, "what the heck have I been doing all these years," right?
Yeah. That's cool.
I'd like to hear a little bit more about the origin story of this paper as well. From your initial conception of the idea, first of all, how did the idea come about? And then tell me a little more about the journey from the idea through to publication.
Right, that goes back to my own experience. My colleagues and I had collaborated on serializable databases for a while, but then we saw those studies. After we saw them, we looked at real applications ourselves, at their source code, and many of them were not using serializable databases. So why? If there's a reason we can address, that could be nice research. If there's some fundamental reason we cannot address, then it could motivate research in the other direction. Yeah, that motivated this work and potentially all the follow-up work.
Amazing, that's really cool. While we've got you here, can you tell us maybe a little bit more about your other research? I don't know whether you're able to elaborate on what you're working on at Meta at the moment, but in your wider research, what are you looking into at the moment?
So, as I said, I'm on sabbatical at Meta. Here, instead of doing research per se, I think I'm in a learning phase; I'm trying to learn what industry really cares about. In the last half year I mainly worked on software deployment. Basically, they have a new version almost every day, and the old version is already deployed to millions of machines. How do you deploy the new version, especially without stopping the whole service to upgrade? So that's the topic I worked on last year. This year I'm working on something related to scheduling AI training jobs, which is a pretty hot topic; I wouldn't say it's new, but it's hot. At my university, as I said, we still have some follow-up work on transaction processing. And in the meantime, I'm also interested in seeing how to apply some of the new AI techniques to help with systems design.
Amazing stuff. Do you find, having the experience of being in industry versus in an academic environment, that the things academics think are important diverge from what industry thinks is important in terms of what needs researching? Or do you feel like they're quite aligned?
That's a very big question. First, there's a big difference for sure, and they differ in many ways. First, you could say research is usually ahead of industry by maybe ten years. Many problems were discussed in research papers years ago but have not been applied in industry. It's not because they aren't good ideas; I've talked to people who think they are good ideas, but they just haven't had the time to explore them yet. So research is ahead of industry. That's one difference.
The other difference is that there are many more constraints in industry on whether your research can be applied. For example, in our research we sometimes do programming-language analysis, right? We analyze your code to see whether there's a bug or whatever. At Meta, for example, not everyone is using the same programming language; you see Python, Java, PHP, whatever. You could have ten very different languages, so if you want to do programming-language analysis, you need to implement it for ten very different languages. That's very hard to do, so there are a lot of additional constraints. Also, industry is much more careful about overhead. In a research prototype, I'd say 5% overhead is pretty good; here, if it's over 5%, it will be rejected directly. So there are many differences. We probably don't have time to go through all of them today, but those are the major ones in my mind.
That's really cool. I guess, along a similar sort of line, how do you approach determining what is worth working on, and generating ideas, right? How do you determine "this is the thing that needs working on"? What's your process for that?
Okay, so first, to be honest, I've often been wrong; there are multiple topics I pursued in the past that turned out to be not so good. I don't think there's a unified answer. What we do is usually learn by reading other people's publications and discussing with other people, both in industry and academia, so that we can see what the important problems are. Then, of course, I cannot work on everything, so I'll choose topics based on my interest and expertise. I think that's how I usually end up working on a topic. But again, as I said, I'm often wrong.
No, that's cool, that's a great answer. I guess we've mentioned this at various points through the interview, but what do you think is the biggest challenge in database research right now, or in distributed systems even, or just in our field in general?
Oh, okay. Yeah, that's a question probably too big for me to answer, but I will try my best.
Yeah, speculate!
Yeah, yeah.
Okay, so regarding what problems I think are interesting or important for databases: one thing I'd point to, in short, is new applications. If we go back in history a little, in the old days there were only databases; everyone was using databases. Then, starting maybe from Amazon's Dynamo, there was a trend called NoSQL. Basically, it says: if you have a very simple application that only needs something like a key-value store, then let's just build a key-value store with a key-value interface; we don't need a full transactional database for that. It gives you better performance, but of course you lose other things. There's some controversy about which way to go.
But these days we've seen many more application classes: graph databases, analytics databases, time-series databases, and also array databases; I have worked on some of them. So I think the big question is: do we still want to put everything under the hood of a database and handle everything with some kind of SQL transactions, or do we want to build specialized systems for them? A specialized system usually gives you better performance and lower cost, but many people still prefer to store everything in a database because it's mature, it gives you a query interface, and everyone is familiar with SQL. I don't know the answer; I see research on both sides. So the question I'm interested in is which way is the right way to go: should we still put everything in a SQL database, or should we build a specific system for each application? That's one question I find interesting.
The other one, which I'm not very familiar with but see a lot of work on, is how databases can handle what we call the disaggregated mode. In the old days, your CPUs, your memory, and your storage were on one machine, so everything was close together. But now there's a trend in industry toward putting all your memory on remote machines, and your storage on another set of remote machines. This gives you a lot of flexibility for resource provisioning, but of course you get longer latency. So can you design your database to run efficiently in such a mode? That's another question I find interesting, although personally I haven't worked in this area yet.
Cool. We should check back in in like five years, right, and see if these predictions and questions have come to fruition. That's cool. So it's time for the last word now. What's the one thing you want the listener to take away from this podcast today and from your research?
I would say: if you're a database user and you have not heard about weak isolation levels, you should definitely learn about them. I actually did an informal survey in my department; just in class, I asked students how many of them had heard about isolation levels, and most answered no. And imagine, those are the people who are going to go into industry and use databases. That's a pretty risky scenario, in my opinion. I'm trying to incorporate this content into courses, but so far it's only done at the graduate level, in the graduate operating systems course. But yeah, I just think it's a problem: weak isolation levels are so popular, but most people who are using them are probably not even aware of it. That's a warning sign, in my opinion.
Yeah, agreed. Especially, I guess,
as a lot of the most popular systems ship with their default isolation level set to a weak one, right? I mean, most systems ship with READ COMMITTED by default, I think.
Yeah, that's a really good point. Hopefully the listeners take note, and now more people know about weak isolation. So, fantastic, Yang. Thank you very much for coming on the show. It was a great conversation, and we'll drop links to all the relevant materials we've discussed throughout the episode in the show notes so the listener can go and check those out. Thanks very much, and we'll see you all again for some more awesome computer science research.
Thank you.