Disseminate: The Computer Science Research Podcast - Jian Zhang | VIPER: A Fast Snapshot Isolation Checker | #34
Episode Date: June 9, 2023. Summary: Snapshot isolation is supported by most commercial databases and is widely used by applications. However, checking whether a database ensures snapshot isolation for a given set of transactions is either slow or gives up soundness. In this episode, Jian Zhang tells us about Viper, an SI checker that is sound, complete, and fast. Tune in to learn more! Links: Paper, GitHub repo, Jian's homepage. Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast.
I'm your host, Jack Wardby.
A reminder that if you enjoy the show, please consider supporting us through Buy Me a Coffee
so we can keep making the show you all love.
Today I'm joined by Jian Zhang, who will be telling us everything we need to know about Viper,
a fast snapshot isolation checker.
Jian is a PhD student at
Northeastern University. Jian, welcome to the show. Thank you, Jack. Thanks for the invitation.
I'm really happy to share my research. Fantastic. So obviously, I've given you a very short introduction there, but maybe you could tell the listener a little bit more about yourself and how you became interested in systems and database research?
Okay, I'm currently a second-year PhD student at Northeastern University. My research interests mainly lie in verifying the correctness of computer systems and building efficient distributed systems. My most recent project, Viper, focuses on verifying the isolation levels of databases.
Before joining Northeastern,
I got my bachelor's degree from Nanjing University, China,
and a master's degree from the Chinese University of Hong Kong.
How did I become interested in systems research? Actually, as an undergraduate, I hadn't decided to do systems research at that point. I just wanted to become an expert in computer science, and I hadn't decided which area to focus on, like algorithms or systems.
But I was always curious about the world, and about the infrastructure of computer science, especially how it works. I wanted to figure out how these systems were built, how they solve the problems they target, and how to combine the techniques to solve new problems we encounter.
Out of that curiosity, after my undergraduate degree, I joined a group at CUHK. That group did both algorithms and systems research, so I could switch across different projects to find out where my interests really lay. At CUHK I tried some graph neural network research and some distributed graph analytics systems research. In the end, I found that personally I was more interested in building systems, especially elegant and efficient systems, and that it would be even better if they could have some theoretical guarantees. That's why I chose my advisor at Northeastern, since we share similar research interests. As you can see, Viper is a combination of both theory and systems.
Amazing. That's a really good backstory there. That's how you became interested in that topic.
Cool. So today we're going to be talking about snapshot isolation. But before we kind of dig
into that a little bit more, can you maybe explain to the listener what concurrency control is and maybe why it's actually important in databases?
Sure. Maybe I should start with transactions.
In databases, a transaction is a group of operations such that either all of the operations are executed or none of them are. In SQL, a transaction usually starts with a BEGIN statement and ends with a COMMIT or ABORT statement. A database usually manages data that is shared by many users, so the database should provide concurrent access to the shared data. It should guarantee that users from different places can correctly access the shared data.
So the database must provide some mechanisms
that regulate the interaction between transactions,
which means only certain valid transaction interleavings are allowed. It also regulates which values reads return.
For example, assume a transaction T1 that consists of two identical point queries, something like SELECT value FROM table WHERE id = 1. The first query gets the value 1. Then, before the second query is executed, another transaction T2 can concurrently update the value to 2. You may not wish the second query in T1 to get the latest value 2, because if that's the case, the same queries in the single transaction T1 get different values, which may not be expected for some applications built on top of the database.
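To make that interleaving concrete, here is a toy Python sketch of the scenario Jian describes (illustrative only; the in-memory dict stands in for the table, and none of this is Viper code):

```python
# A toy illustration of the interleaving described above: T1 issues the same point
# query twice while T2 updates the row in between. Without concurrency control,
# T1 sees two different values for the same query.

db = {1: 1}  # id -> value, a stand-in for "SELECT value FROM table WHERE id = 1"

def t1_read():
    return db[1]

def t2_update():
    db[1] = 2

first = t1_read()     # T1's first query returns 1
t2_update()           # T2 commits an update concurrently
second = t1_read()    # T1's second query now returns 2
print(first, second)  # 1 2 -- the same query in one transaction saw two values
```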
So to summarize, concurrency control is a kind of procedure of managing the execution of transactions while ensuring some properties
like atomicity, consistency, isolation, and durability.
Atomicity means, as we just mentioned, that all the operations take effect together or none of them do. Consistency means a transaction transitions the database from one consistent state to another consistent state. By consistent, I mean all the invariants hold. An application usually has some invariants; for example, for a bank transfer application, the invariant can be that the total balance of all accounts doesn't change.
And isolation regulates what kind of effects of a transaction are visible to another transaction.
What kind of interleavings or scheduling of operations are allowed.
Durability means that once a transaction has committed, its effects will never disappear; they will always remain in the database.
So without concurrency control,
the atomicity, consistency,
isolation, durability
cannot be ensured.
For example, if a race condition occurs, some invariants may be broken and consistency may be violated.
Awesome. Yes. So that emphasizes how important transactions are and how important concurrency control is in a database.
So you touched on one of the four components of the ACID acronym there, isolation. Obviously a lot of your work here has been about checking and verifying the isolation levels of a database.
Can you elaborate on why we need to check these
and maybe talk a little bit more about the different levels?
Okay, sure.
Maybe we can imagine a scenario. Consider a company that has developed several web applications running within the trusted domain of the company, while the data of the applications is remotely stored on some cloud databases in an untrusted domain. So there is a contract between the web applications and the cloud databases. The contract says what kind of guarantees the database should provide. Isolation levels are a kind of contract.
For example, the contract can be that the execution of transactions must comply with some specific isolation level, like serializability or snapshot isolation. Maybe I should explain what serializability is. Serializability is the strictest isolation level in transactional theory. If we say an execution of transactions is serializable, it means the execution is equivalent to some serial scheduling of the transactions, which means it appears to execute the transactions one by one. Then those web application developers, namely the database clients, may want to know whether the database they are using indeed provides the correct isolation level. Isolation levels and concurrency control are really important to them, since they build their applications on top of the promised isolation level.
And even a minor violation of isolation can lead to severe consequences or financial losses.
So far, clients can never know the exact answer. They can only run some tests, but testing is never exhaustive. They have to assume the answer to the question of whether the database provides the correct isolation level is yes. Testing is never guaranteed to cover all the anomalies in the execution. This is one motivation of our work.
And the second scenario is database developers: with each new version of a database, developers keep adding new features and improving performance. Before releasing a new version, they may want to test it to guarantee correctness. So it may be useful for database developers to be able to check the isolation level before they release the new version.
And actually, based on our knowledge, I think these concerns are reasonable, since it's really hard to implement isolation levels efficiently and correctly.
So it's reasonable for clients not to trust the database completely
and have that kind of concern.
For example, one database, whose name I won't mention here, was found to violate SI three times, from its very initial version up to version 20. And another database supported both snapshot isolation and serializability in the very beginning; later, they admitted their snapshot isolation implementation was buggy and just gave it up. Now serializability is the only isolation level it supports.
Wow.
That's, yeah, it's quite surprising.
Cool.
So, yeah, plenty of good reasons there as to why it's important to check the isolation level of a database. So you mentioned it again there, snapshot isolation. Maybe you can tell the listener how it differs from serializability and what potential anomalies could happen under snapshot isolation?
Okay.
Recall that serializability is the strictest isolation level.
Snapshot isolation is a weaker isolation level
than serializability.
Intuitively, but not very precisely, snapshot isolation can be regarded as follows: at the beginning of its execution, each transaction captures a snapshot of the current database state and then executes its operations based on that snapshot. Finally, after the transaction has executed all its operations and tries to commit, it needs to detect conflicts before committing. If there are no conflicts, the snapshot can be merged back into the current database state. Otherwise, it should abort.
And for example, assume we have a simple table consisting of only two rows, x and y, with values 1 and 2, respectively.
When transaction T1 starts, it captures a snapshot of the database state, x equals 1, y equals 2.
And then T1 updates x to 3. Before it commits, another transaction T2 also starts. Since T1 hasn't committed, T2 captures the same snapshot, and it updates y to 7. Each transaction has the illusion that it's the only process running in the database, without being affected by the other. After T2 makes its modification, T1 tries to commit and detects conflicts. In its lifecycle, namely between capturing its snapshot and committing, there is no other transaction that wrote the same key x, so there is no conflict. So its snapshot, x equals 3 and y equals 2, is merged back into the initial database state.
So x is updated successfully.
Similarly, T2 tries to commit and finds that the concurrent transaction T1 didn't update Y,
so there is no conflict in writes.
So it also commits successfully, and Y is written back to the database state.
Finally, both transactions commit, and the two snapshots are merged to get a new current database state,
X equals to 3 and Y equals to 7.
This kind of scheduling is not allowed in serializability.
In serializability, you cannot modify the database concurrently.
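For readers who like code, here is a minimal Python sketch of the snapshot-and-merge behaviour Jian walks through (my own simplification, using a first-committer-wins style conflict check on write sets; it is not how any real engine, or Viper, is implemented):

```python
# Each transaction works on its own snapshot and, at commit time, aborts only if a
# concurrent transaction wrote the same key. This reproduces the x/y example above.

db = {"x": 1, "y": 2}

class Txn:
    def __init__(self, db):
        self.snapshot = dict(db)   # capture a snapshot at begin
        self.writes = {}

    def write(self, key, value):
        self.snapshot[key] = value
        self.writes[key] = value

    def commit(self, db, concurrent_writes):
        # conflict check: abort if the write sets overlap
        if self.writes.keys() & concurrent_writes:
            return False           # abort
        db.update(self.writes)     # merge this snapshot's writes back
        return True

t1, t2 = Txn(db), Txn(db)          # both see x = 1, y = 2
t1.write("x", 3)
t2.write("y", 7)
assert t1.commit(db, concurrent_writes=t2.writes.keys())
assert t2.commit(db, concurrent_writes=t1.writes.keys())
print(db)                          # {'x': 3, 'y': 7} -- both commit under SI
```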
To be formal, write skew is the only kind of anomaly that is allowed in SI but not allowed in serializability. Adya gave a formal, graph-based definition of SI. It defines a start-ordered serialization graph, which we call an SSG for short. Adya uses one node to represent each transaction and one edge to represent a dependency between two transactions. Actually, Adya defined four different types of dependencies: read dependency, write dependency, start dependency, and end dependency. Here, for simplicity, we will just combine these four types into two types of dependencies: anti-dependency and non-anti-dependency. Adya defined SI as: all the cycles in this graph, the SSG, should have at least two adjacent anti-dependency edges, which we call CAD for short. If a cycle has at least two consecutive anti-dependency edges, we say it is a CAD cycle, and it is allowed by SI.
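As a small illustration of the CAD condition, here is a sketch of the per-cycle check (my own code, not Adya's or the paper's; the edge labels are assumed to be given in cycle order):

```python
# A cycle is allowed under SI only if it contains at least two *adjacent*
# anti-dependency ("anti") edges. Adjacency wraps around, since it is a cycle.

def is_cad_cycle(edge_types):
    """edge_types: list of 'anti' / 'non-anti' labels along a cycle, in order."""
    n = len(edge_types)
    return any(
        edge_types[i] == "anti" and edge_types[(i + 1) % n] == "anti"
        for i in range(n)
    )

# A write-skew cycle T1 -rw-> T2 -rw-> T1 has two adjacent anti-dependency edges,
# so it is a CAD cycle and the history can still be SI:
print(is_cad_cycle(["anti", "anti"]))                         # True  -> allowed by SI
print(is_cad_cycle(["anti", "non-anti", "anti", "non-anti"])) # False -> no adjacent pair
```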
Cool. Yeah, okay, just to recap that: we have this graph-based formalization of isolation levels from Adya, and you can characterize an anomaly that is allowed under snapshot isolation, like write skew, by the fact that its cycle has two adjacent anti-dependency edges. Right, cool, that's great stuff. So there's obviously been some work previously on verifying isolation levels in databases, and it's looked at serializability more than snapshot isolation. But what are the additional challenges that come with trying to verify and check for snapshot isolation versus serializability?
Okay, good question. Actually, both verifying serializability and verifying snapshot isolation have been proven to be NP-complete problems.
So both of these problems are computationally challenging. But the additional challenge of checking SI is that serializability is widely agreed upon: almost everybody knows the definition of serializability, and they agree on the same definition. For snapshot isolation, different people have different understandings. There are multiple SI variants: the original definition of SI from the paper A Critique of ANSI SQL Isolation Levels, Strong Session SI, Generalized SI, and many others. So how to propose a unified approach that can be applied to different SI variants is one of our main challenges.
For serializability, the definition is naturally whether you can find a serial execution of transactions that is equivalent to the concurrent scheduling, right? So it's natural to model the history with a graph where each node is a transaction and each edge is a dependency. If you can find a topological sort of the graph, then you have already found a serial execution that's equivalent to the original concurrent scheduling. But for SI, so far we don't know whether we can model it as a graph where, similarly, each topological sort corresponds to one valid scheduling. We don't know that.
Great stuff.
With all the challenges in mind that come with checking snapshot isolation,
tell me how Viper works and specifically I'd like to know more about this concept of a BC polygraph.
Okay. This is going to be a long story.
Okay, I'll strap myself in. Go for it.
Okay, firstly, maybe I should start with the architecture of Viper.
I'll just briefly introduce the architecture. Viper has two main components: one is the history collector and the other is the checker. Originally, without Viper, clients keep issuing transactions to the database and receiving responses from the database. The history collector is a library that logs these transactions and the returned responses. When we use Viper, we put the history collector in between the clients and the database. The history collectors work as a shim layer between the clients and the database and are transparent to applications, because they provide the same key-value semantics.
After running the history collector for a while, we get a series of transactions and their execution results, which is called a history. Then Viper passes the history to the checker. The checker is responsible for answering this question: does the history satisfy snapshot isolation?
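To picture the collector, here is a toy Python shim in the spirit of what Jian describes (the wrapped client interface and the log format are my own assumptions, not Viper's actual API):

```python
# A sketch of a "history collector" shim: it wraps a key-value client, forwards
# every operation, and logs each transaction along with the values it observed.

import json

class CollectingClient:
    def __init__(self, inner, log_path):
        self.inner = inner               # the real database client being wrapped
        self.log = open(log_path, "a")
        self.current = None

    def begin(self):
        self.inner.begin()
        self.current = []                # start recording one transaction

    def get(self, key):
        value = self.inner.get(key)
        self.current.append(("r", key, value))   # log the read and its result
        return value

    def put(self, key, value):
        self.inner.put(key, value)
        self.current.append(("w", key, value))   # log the write
        return value

    def commit(self):
        ok = self.inner.commit()
        self.log.write(json.dumps({"ops": self.current, "committed": ok}) + "\n")
        return ok
```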
Okay, next let's switch to the concept of the BC graph first. The BC graph targets the white-box setting, while the BC polygraph targets the black-box setting. Actually, with Adya's formulation of SI, there is already a straightforward approach to checking SI: you search for all the cycles in the SSG and, for each cycle, check the CAD property, that is, whether the cycle has at least two adjacent anti-dependency edges.
However, this approach's runtime relies on the input SSG. If the SSG is very dense, which means that for a fixed number of nodes there can be a large number of edges, then potentially there can be a huge number of cycles. So it's not very efficient to enumerate all the cycles and check the CAD property for each one. So our thought is: can we check SI without enumerating all the cycles and without checking the CAD properties?
Our idea is to design a new graph data structure called a BC graph, which has a one-to-one correspondence to the original SSG. Crucially, a BC graph has cycles if and only if there are non-CAD cycles in the corresponding SSG. That means the non-CAD cycles, which would be rejected in the SSG, remain in the BC graph, while the other cycles disappear in the BC graph. Then we only need to check acyclicity, without caring about the cycle types.
Now let me introduce how to construct a BC graph given an SSG. Imagine an SSG on the left. Any transaction Ti, that is, any node in the SSG, is split into two nodes, Bi and Ci, representing the begin and commit of transaction Ti respectively. Then we add an internal edge from Bi to Ci to enforce the requirement that the begin should happen before the commit for each transaction. For any non-anti-dependency edge from Ti to Tj in the SSG, we add an edge from Ci to Bj, from the commit of Ti to the begin of Tj, in the BC graph, indicating that a transaction like Tj is only allowed to observe transactions that logically happened previously. For any anti-dependency edge from Ti to Tj, we add an edge from Bi to Cj, which is in the reverse direction compared to a non-anti-dependency edge. This requires that the begin of Ti happens before the commit of Tj, so that Ti didn't see the write by Tj.
After that, we get a complete BC graph. In the original SSG the edges have types, but in the BC graph the edges don't have types; all the edges are equivalent.
So far, we have illustrated how to construct a BC graph. Then, if the BC graph is acyclic, the history is SI. Otherwise, it's not SI.
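Here is a short sketch of that construction and the acyclicity check (my own encoding of the idea, using networkx purely for the acyclicity test; Viper's actual implementation differs):

```python
# Build a BC graph from an SSG given as lists of typed edges between transaction ids.

import networkx as nx  # used only for the acyclicity check

def build_bc_graph(txns, non_anti_edges, anti_edges):
    g = nx.DiGraph()
    for t in txns:
        g.add_edge(("B", t), ("C", t))        # begin happens before commit
    for ti, tj in non_anti_edges:             # Ti -> Tj (read/write dependency)
        g.add_edge(("C", ti), ("B", tj))      # Tj only observes committed Ti
    for ti, tj in anti_edges:                 # Ti -> Tj (anti-dependency)
        g.add_edge(("B", ti), ("C", tj))      # Ti began before Tj committed
    return g

def check_si(txns, non_anti_edges, anti_edges):
    g = build_bc_graph(txns, non_anti_edges, anti_edges)
    return nx.is_directed_acyclic_graph(g)    # acyclic BC graph <=> history is SI

# Write skew: T1 and T2 anti-depend on each other; the BC graph stays acyclic, so SI.
print(check_si([1, 2], non_anti_edges=[], anti_edges=[(1, 2), (2, 1)]))  # True
```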
But so far we have assumed that all the dependencies are already available, so that we can build a BC graph. However, this assumption may not always hold, since clients may use black-box databases like cloud databases. Then clients have no access to the internal scheduling of transactions; the only available information is which transactions they issue to the database and what responses they get back. Then how do we check SI? For example, imagine two transactions, T1 and T2. T1 updates x to 1 and T2 updates x to 2, but they are concurrent. We don't know which one happens first, so the dependencies between them are unclear: it could be that there is a write dependency from T1 to T2, or that there is a write dependency from T2 to T1. So how do we check SI without some of the dependencies?
A naive approach is to try all the possible schedules of the writes and build a BC graph for each possibility. For example, here we can guess the dependencies between T1 and T2: for one possibility, if T2 overwrites T1, we can build one BC graph; for the other, if T1 overwrites T2, we can build another BC graph. Then we can answer the question: does there exist a BC graph that is acyclic? If yes, we have already found a valid schedule and the history is SI. Otherwise, there is no valid schedule and the history isn't SI. However, due to the factorial number of possible write orders, the number of derived BC graphs can be huge and hard to check in a reasonable time.
Our idea is to use another graph data structure, called a BC polygraph, to model all the possible BC graphs in a single data structure. One way to think of the BC polygraph is that it's a superposition of many valid BC graphs that share the same history. I will briefly introduce how to construct a BC polygraph. First, the BC polygraph shares the same vertex set with the BC graphs. Second, the edges that exist in all the BC graphs are still included in the BC polygraph. But what about the edges that only exist in some of the BC graphs? Recall that we have two BC graphs for the example with the concurrent T1 and T2, so an edge may exist in one possible BC graph and not in another. Recall also that each BC graph corresponds to one possible schedule of writes. Different write orders are not compatible with each other, so the edges from different BC graphs cannot coexist in a single BC graph. In the BC polygraph, such groups of edges become constraints: for each uncertain write order, the checker must choose one group of edges or the other, but not both.
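A simplified way to picture the polygraph plus its constraints is the brute-force checker below (my own illustration; Viper encodes the constraints for a solver rather than enumerating every combination):

```python
# Known edges are fixed; each constraint is an either/or choice between two edge
# sets, one per possible write order of a pair of concurrent writes.

from itertools import product
import networkx as nx

def is_si(vertices, known_edges, constraints):
    """constraints: list of (edge_set_a, edge_set_b); exactly one set is chosen."""
    for choice in product(*constraints):          # try every combination of choices
        g = nx.DiGraph()
        g.add_nodes_from(vertices)
        g.add_edges_from(known_edges)
        for edge_set in choice:
            g.add_edges_from(edge_set)
        if nx.is_directed_acyclic_graph(g):       # one acyclic BC graph is enough
            return True
    return False

# T1 and T2 both write x: either T2 overwrites T1 (edge C1 -> B2) or the reverse.
vertices = [("B", 1), ("C", 1), ("B", 2), ("C", 2)]
known = [(("B", 1), ("C", 1)), (("B", 2), ("C", 2))]
constraints = [([(("C", 1), ("B", 2))], [(("C", 2), ("B", 1))])]
print(is_si(vertices, known, constraints))        # True
```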
So if I'm just going to repeat this back to you to clarify my own understanding: if we assume that an SSG has all the information of a given execution, then a BC graph is essentially like a compressed version of the SSG, right? Is that the correct way to think about it, in the sense that you're removing the types from the edges, and certain edges have been removed from the graph, essentially the ones that you don't necessarily care about? And then a polygraph, on top of that, factors in the fact that in a black-box database scenario we don't actually have visibility into the internal mechanism, so we have to account for the possible different states that could have happened within the database. Is that the right way of thinking about it?
I think your answer is mostly correct, actually.
Okay, cool, great stuff.
So Viper also has another optimization in it that you use, called heuristic pruning. Can you maybe explain, or give the listener an intuition for, what this is and how it works?
Okay, this is a very simple optimization. We observe that even though snapshot isolation allows a transaction to fetch and read from a really old snapshot, that's not common practice in production databases. This is our observation: a snapshot isolation database implementation rarely delays a write for a long time. So our idea is to heuristically assign orders
to those transactions that are far from each other
because databases are less likely to reorder them.
So by adding those heuristic edges, Viper can prune the constraint options that violate those edges, and hence reduce the number of constraint decisions to make and accelerate the SI checking. For example, if we have two concurrent transactions and we don't know which one happens first, there is a constraint: either an edge from C1 to B2 or an edge from C2 to B1. A constraint is similar to the semantics of exclusive OR. If we add a heuristic edge from C1 to B2, then the constraint is reduced to a known edge without any uncertainty.
After adding heuristic edges, if the checker returns accept, then we're already done, because we have already found a valid schedule. Otherwise, if it returns reject, that doesn't necessarily mean a real rejection: the cycle may have been introduced by the heuristic edges we added. So we need to relax the definition of "far" and repeat the procedure. Specifically, we use a constant k to measure how far two transactions are from each other, and we double k each time. Finally, when k equals n minus 1, where n is the number of transactions, if the checker still rejects, then Viper rejects the history: the history is not SI.
So this kind of optimization only accelerates the positive histories, meaning those that are snapshot isolation. For histories that are not snapshot isolation, this kind of optimization may bring some negative effects.
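The pruning loop, as I understand it from this description, looks roughly like the sketch below (check_si and add_heuristic_edges are assumed helpers, not real Viper functions):

```python
# Assume transactions far apart in the (rough) time order of the history are not
# reordered, check SI under that assumption, and only relax it if the check rejects.

def check_with_pruning(history, n_txns, check_si, add_heuristic_edges):
    """check_si runs the solver-backed SI check; add_heuristic_edges(history, k)
    adds C_i -> B_j edges for transaction pairs whose positions in the history
    differ by more than k (so larger k means fewer heuristic assumptions)."""
    k = 1
    while True:
        constrained = add_heuristic_edges(history, k)
        if check_si(constrained):
            return True                      # accept: a valid schedule was found
        if k >= n_txns - 1:
            return False                     # no heuristic edges left to blame: reject
        k = min(k * 2, n_txns - 1)           # the rejection may be an artifact of the
                                             # heuristic edges, so relax "far" and retry
```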
Cool.
So with all this in mind, let's talk about the results.
Can you tell me about the experiments you ran?
I mean, how do you even go about evaluating a tool like Viper?
And then, yeah, let's start off with that.
Then we can talk some numbers afterwards.
Tell me about the experiments you ran first.
Okay.
Our evaluation plan mainly answers three questions. The first question is: what is Viper's performance, and how does it compare to existing checkers and to natural baselines? The second question is: how do Viper's components contribute on different workloads and setups? The third question: can Viper detect real-world SI violations?
For question one, to compare with the natural baselines that we built, we evaluated Viper on five benchmarks, which are a combination of micro- and macro-benchmarks. Some of the benchmarks are random transactions consisting of read-only transactions and write-only transactions, and some benchmarks also have range query operations. Some other benchmarks are simulations of real-world workloads, like C-RUBiS; RUBiS is a bidding system like eBay. And C-Twitter is a simulation of a tiny Twitter, where we use a configuration of 1,000 users. To generate histories, we ran our benchmarks on three databases: TiDB, SQL Server, and YugabyteDB. All the databases were configured to run under SI. We used 24 concurrent clients for all experiments, and we recorded the total runtime for the performance comparison.
And to compare with an existing SI checker, Elle: if we run the BlindW workload, Elle can run fast, but it can miss some anomalies, so it's not solving exactly the same problem as Viper. To make the comparison fair, we need to let Elle run in a sound mode, just like Viper. So we configure both Viper and Elle to run the append benchmark from Jepsen, which is a distributed database testing tool. In the append benchmark, clients issue atomic append operations to some keyed lists. Each list is a list of integers, so it contains all the appended values, and by reading the lists, clients know the write orders of the appends. Given the version order of all the writes, Elle is also sound. So we compare Viper with Elle in this mode.
For the second question, how do Viper's components contribute? We ran Viper on the five benchmarks with 5,000 transactions, measured the wall-clock running time, and broke the runtime down into different phases like parsing, graph construction, encoding, and solving. For the third question, can Viper detect real-world SI violations? We downloaded public reports of real-world SI violations and tested whether Viper rejects them. We also manually injected three types of anomalies into histories to see how Viper performs
on different types of anomalies. Cool, so I
guess, what's the
headline here? How much faster
is Viper over the state of
the art? Firstly,
the performance comparison mainly consists of two figures. In the first, we compare Viper with the natural baselines that we built. Why did we build natural baselines? Since the other checkers are usually not sound, we built some natural baselines ourselves; the core SMT encodings of the baselines are borrowed from existing work, and we implemented them ourselves. All checkers are run on BlindW histories of various sizes. Our evaluation results show that Viper outperforms all the baselines consistently and can handle a 15 times larger history. That's the comparison with the natural baselines. We also compared Viper with Elle in Elle's sound mode. Our evaluation results show that for both Viper and Elle the runtime scales linearly when checking the append benchmark, which means there is no fundamental performance gap. This is because the write order has been revealed, so checking SI is linear for both checkers; in particular, with the write order known, the BC polygraph has no constraints. The different slopes of the lines in that figure are just due to the difference in technical stacks: Viper is implemented in Python, while Elle uses Clojure, which is a JVM-based language.
So my next question is, Viper is an amazing tool. How do you think software developers,
database administrators can leverage the findings in your research?
My question here is more, what impact do you think your work can have?
Or maybe already has.
As we mentioned before, the first application scenario is: if your database supports SI, then as a database developer or engineer, you can integrate Viper into your existing testing pipeline to test correctness. Actually, I know some production databases have such a Jepsen-like testing pipeline, and I think our checker can be integrated into that kind of pipeline to help improve reliability. And I think our method has some advantages over existing methods. Existing methods like formal verification, as we know, the cost is too high. And testing tools are the common approach we use to check correctness, but testing is not sound. That's why we developed this method. It has an acceptable runtime cost, you don't need to make any modifications across different database versions, and it's implementation-independent. Yes, it has a higher runtime cost than testing, but it can actually downgrade to testing if the time budget is limited: you just exclude all the constraints and only consider the known edges, just like Jepsen does.
Nice, cool. Yeah, I think it's really impactful work. I guess on this, just for a quick side question: are you aware of any commercial systems that are thinking of integrating Viper into their current workflow at the moment? I see that one of your co-authors worked at Cockroach, so I'm thinking maybe CockroachDB might be using it internally.
Yeah, actually an employee from CockroachDB contacted us before to seek potential opportunities for collaboration. They have such a testing pipeline in their company; I heard of that.
Cool, cool.
Great stuff. I guess, whilst you were working on Viper, what was maybe the most interesting thing that you learned on this project?
The most unexpected part of Viper is perhaps the new characterization of snapshot isolation. In the very beginning, we didn't expect it to be as simple as acyclicity of a graph. I think this is a kind of bonus; this is the most interesting part. Before, only serializability could be characterized via a dependency graph that has no cycles, so that you can find a serial scheduling of transactions that's equivalent to the concurrent scheduling.
It's very easy to understand
serializability. In the past,
if you ask somebody what SI is, it can be really hard for them to tell you, since they may have forgotten the read, write, and commit rules of SI if they're not working on a database implementation. However, if you ask them what serializability is, it's highly possible they can tell you something, like "equivalent to a serial execution". So I think the bonus is that our characterization, just from the perspective of clients, has a great parallel to serializability. That's what we learned, and I think it's the most interesting part I have learned.
Yeah, fascinating. I agree, that's a really, really interesting finding for sure.
Cool.
So it's time for the last question now, Jian.
The last word. So what's the one thing you want the listener to take away from this podcast today
and from your work with Viper and verifying isolation levels?
Okay. I think
we
developed
a new graph data structure
to characterize SI
which can help people to
understand snapshot
isolation from the perspective
of clients.
Based on that,
we developed Viper, which is the first sound and complete SI checker.
Well, yeah, that's a wrap then. Thank you so much, Jian, for coming on the show and talking to us about Viper today. If the listener is interested in learning more about Viper and all of Jian's really cool work, we'll put links to everything in the show notes, and you can go and find it.
And another reminder again,
that if you enjoy the show,
please consider supporting us through Buy Me a Coffee.
And yeah, we'll see you next time
for some more awesome computer science research.