Disseminate: The Computer Science Research Podcast - Audrey Cheng | TAOBench: An End-to-End Benchmark for Social Network Workloads | #15
Episode Date: December 12, 2022. Summary: This episode features Audrey Cheng talking about TAOBench, a new benchmark that captures the social graph workload at Meta. Audrey tells us about the features of the workload, how it compares with other benchmarks, and how it fills a gap in the existing space of benchmarks. Also, we hear all about the fantastic real-world impact the benchmark has already had across a range of companies. Links: Paper | Personal website | Meta blog post | GitHub repo. Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate, the computer science research podcast. I'm your host, Jack Wardby.
I'm delighted to say I'm joined today by Audrey Cheng, who will be talking about her VLDB 2022 paper, TAOBench: An End-to-End Benchmark for Social Network Workloads.
Audrey is a PhD student in the Sky Computing Lab at UC Berkeley, and her research focuses primarily on transaction
processing for database systems. Audrey, thanks for joining us on the show.
Thank you so much for having me, Jack.
It's a pleasure. Let's dive right in. So how did you end up researching transaction processing?
What was your journey to this point?
Yeah, so now I am a third year PhD student at Berkeley, and I ended up here mainly because of my undergrad advisor,
Professor Wyatt Lloyd at Princeton. In my junior year of college, I was interested in doing research, and he had done a lot of work on transaction processing. So I reached out to him, and I ended up working on a project that extended Peter Bailis's Bolt-on Causal Consistency paper. So in that paper,
Peter looked at how you can go from eventual to causal consistency. And I sort of extended that
hierarchy. So how we can go from like causal to sequential, and then from sequential to
linearizability, and so forth. And I felt that was a lot of fun, just thinking of protocols and coming up with new techniques to ensure stronger guarantees. So that's how I sort of fell into thinking about strong guarantees. And then that naturally led to transaction processing.
So let's talk about TAOBench then. Can you
start off by telling us exactly what TAOBench is?
Yeah, of course. Happy to. So TAOBench is an open source benchmark based on Meta's production workloads. Specifically, we looked to capture the sort of request patterns, including transactional ones, that you'll find on the social graph. This is the first time that a social networking company of this scale has open sourced their workloads,
especially transactional workloads so we hope it can definitely be a very helpful resource for the
database community, and we've already seen impact both within Meta and outside of Meta for other databases.
You've touched on it a little bit there, but what is the motivation for developing a benchmark specifically for social networks?
So I've been interning at Meta for a couple of summers. One of my first projects was looking at adding stronger guarantees to their social graph data store. This was the RAMP-TAO paper, which is in VLDB 2021, for anyone interested in that. But as
I was doing that, one of the things I was trying to look at was how these transactional workloads
operated at large scale and thinking about the needs of applications using transactions. And
that's what helped me decide that I wanted to take a more holistic view of sort of what the social graph workload looks like. And because I had access to data, I thought it was
really exciting. So I spent a bunch of time looking at different sorts of skew, and also how data is
stored on different shards, because explicit colocation is actually quite important, we found,
for the social network workload. And as I was looking at these things, I realized that none of this was captured
in other existing benchmarks, especially social network benchmarks. There are a couple, the
prominent ones I'd say are LinkBench, which was released by Meta in 2014, but that was only
focusing on the workloads of a single MySQL instance. So you lose a lot of interesting
patterns you see with geo-distribution and also at the cache level,
which serves the majority of the workload.
And also there's LDBC, which is a very important workload,
but that targets mainly graph databases.
So the focus here was more complex graph processing queries
rather than the many serving requests we saw on TAO.
So the majority of requests we have are like point reads and some point writes,
but then you also do get some very complex queries and longer transactions,
and especially figuring out the interaction between these and also on shared data
made me really excited to, one, study this workload
and then try to open source it in a benchmark.
You mentioned the other benchmarks in this space.
Are there any more than LinkBench and LDBC? How does it compare to something like TPC-C or YCSB or something like that, right?
Yeah, great question.
So in our paper, we identify five properties that are important for social network benchmarks.
And I think TPC-C and YCSB are definitely important OLTP workloads, but they don't necessarily capture some of the properties we've seen on social graphs, including being very read heavy; TPC-C, for example, is notably update heavy.
And we also have extreme skew. We have very, very severe hotspots in some cases,
like let's say Lady Gaga posts something,
then a ton of people are trying to look at that post and like that post.
And there's also the correlation between requests, and we also have different applications interacting on the same data, whereas most benchmarks, I think, try to capture the properties of one application.
For a lot of different reasons,
we felt there was a need for a new benchmark.
It's not that TPC-C and YCSB are not important.
They're very important points in sort of the workload space,
but we felt within the social graph workload space,
there wasn't a representative benchmark
and that's what we sought to capture with TAOBench.
Awesome, it's funny. Whenever I'm talking about supernodes in a graph setting, I always use Lady Gaga as my example as well. I don't know what it is about Lady Gaga, but I always use that too. Cool. So you mentioned there are five properties that you think should be captured by a social network benchmark. Can you maybe dig into these a little bit more, walk us through what they are, and say specifically why they are important?
So these properties were ones we came up with after talking to a bunch of different engineers, industry practitioners and also looking at the workloads and trying to identify why those are important for performance or other aspects of the system. So the first one is that we want to make sure we're accurately emulating
social network requests. As I mentioned, like things like TPCC and YCSB, they're designed for
a specific purpose, but they're not designed to capture the social graph. And because we have
things like extreme skew, high correlation, and read dominance, it's important to design them specifically for the social network, which is a very important application domain and has had enduring popularity for maybe several decades now, we can even say. So things like LinkBench and LDBC are indeed designed for this, but not all benchmarks satisfy property one.
And the second property is that we want to make sure we capture transactional requirements,
and these can vary. So LDBC, for example, has fully interactive read-write transactions, while on TAO, the system has chosen to only support sort of one-shot read-only and write-only transactions, because we found these have improved performance and scalability, given the challenging workload that at the moment is like tens of billions of requests per second. So serializability, like full coordination, is not yet feasible at this scale. And the third property is about
colocation preferences and constraints. And we found that no existing benchmark actually satisfies
this property. So what we mean here is like information about sharding and how data is stored.
And we found within a real-world system
that data placement isn't actually
just an implementation detail,
but it can reflect user intent, privacy constraints,
and/or regulatory compliance,
especially with the growing concerns
for data privacy and stuff.
Sometimes it has to be stored in certain regions
and that can definitely impact how the workload is served
and also have concurrency control and performance consequences.
So, to the best of our knowledge, no social network benchmark, and none of the popular OLTP workloads, captures this data, and it's something we wanted to make sure that we supported.
And the fourth property is about how we model these requests. And I think it also
illustrates a trade-off in representing workloads. So most standard benchmarks take the approach of
they want to represent specific behavior. So for example, they want to capture, I think one of the
queries in LDBC is like you want the friends of friends with maybe some property, like a very complex query.
And you want to sort of capture the action of, OK, this is what would happen.
So you have these distinct query types. LDBC has 29; TPC-C, for example, has five.
We took an approach on the other side of the spectrum where we didn't attempt to capture any of these types.
That's because on TAO, there's like tens of thousands of distinct query types running per day. So to replicate all of those and replicate the code would be impractical.
So instead we decided to model these workloads
using probability distributions
without these prescriptive query types,
which allows us to capture a much broader workload.
The benefit of these distributions is that if the workload changes, we're able to very easily capture that without needing to modify specific queries.
But the trade-off here is understandability.
We're not able to say,
okay, these requests came from this particular application,
doing this.
But that actually also helps us hide
some of those privacy concerns
so we're not sort of highlighting exact user actions.
So it definitely is a trade-off,
but I think an interesting one.
And for the most part, I think YCSB, and to some degree LinkBench, also take an approach of using request distributions rather than query types, but it's not the most common approach, I'd say, that benchmarks have taken.
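(For illustration, here's a minimal sketch of what distribution-driven request generation looks like in general. The operation names and probability values are hypothetical and are not TAOBench's actual configuration.)

```python
import random

# Hypothetical operation mix and transaction-size distributions, not TAOBench's real numbers.
OP_MIX = {"read": 0.90, "write": 0.07, "read_txn": 0.02, "write_txn": 0.01}
TXN_SIZES = {2: 0.6, 4: 0.3, 8: 0.1}  # transaction size -> probability

def sample(dist):
    """Sample a key from a {value: probability} distribution."""
    r, total = random.random(), 0.0
    for value, p in dist.items():
        total += p
        if r <= total:
            return value
    return value  # guard against floating-point rounding

def next_request():
    """Generate one request purely from distributions, with no fixed query types."""
    op = sample(OP_MIX)
    if op.endswith("_txn"):
        return (op, sample(TXN_SIZES))  # operation plus number of keys it touches
    return (op, 1)

if __name__ == "__main__":
    print([next_request() for _ in range(5)])
```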
Yeah, I know YCSB kind of has that skew factor parameter, right?
Which you can vary to kind of, I guess, simulate that to some extent.
But yeah, anyway, sorry, continue.
Property five.
Yeah, and the final property is we wanted to make sure that the benchmark workload exhibited multi-tenant behavior on shared data.
This is important because when you think about the social graph, you might think, oh, it's
like the Facebook application or something like that.
But actually, there's a lot of different services and products running on the same
data.
So there's Messenger, there's WhatsApp, there's also ads, for example, there's Marketplace, there's also now the Metaverse.
And these are all based on the same users
accessing these different applications.
So thinking about the complex interactions
between how one application can affect the behavior of another
is really important.
And I think the LDBC benchmark paper acknowledges this,
that you do need to capture this,
but their approach was just to run mixed workloads.
And this doesn't necessarily reflect
how different applications actually interact
in the real world, in real world cases.
So it was exciting that, because we had this data at hand, we were able to study these and also look specifically at product groups, which are sets of applications that share the same data and use the same infrastructure, and we do deep dives into how this behavior can vary both within product groups and across product groups.
To satisfy these properties, you then profiled TAO, which is obviously Facebook's graph system. So can you maybe tell the listeners more about
this system and, I guess, why you chose it? Well, you chose it because that is their system, right, and you were inside there, so that kind of answers itself. But yeah, can you just explain to us a little bit more about the system, what its architecture is, and some of the background on that?
Yeah, happy to. And I think this is important as well, why this particular system and its workload were of interest. So TAO, as I mentioned, is the social graph data store at Meta. It stores and serves the vast majority of requests to the social graph. There are certain other stores, some like blob storage, for example, that will store very large videos or things like that. But for the most part, whenever you're accessing one of Facebook's applications, it's pulling the vast majority of requests from, and also serving those writes within, TAO.
So TAO has been around for around 10 years now.
I want to say there's an ATC paper on the system published in 2013.
And since then, it's grown a lot bigger.
So currently, it serves over 10 billion reads and tens of millions of writes per second
on a changing dataset of many petabytes.
So it's operating at a very, very large scale.
It's distributed and runs in many different regions across the globe.
And as such,
I think it offers unique insights into the modern social network workload
because we sort of have the majority of the social network requests like on
this one system.
And we're able to like look at the system and understand these request
patterns in terms of how it actually
models data, it has a graph data model but not a graph data API. So it has objects, which are called nodes, and associations, which are edges in the social graph. And it was originally designed to operate at large scale, so it didn't have any consistency guarantees originally; the designers focused on high availability and low latency. And as such, they also imposed a very simple API. So it's similar to a key-value store, I'd say, but it does have range queries. So it has point get, range, and count queries, and then insert, update, and delete for writes.
And then it's since added write transactions, and also prototype read transactions; more detail can be found in the RAMP-TAO paper about how those were added.
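(As a rough illustration of the shape of that kind of API, here is a hypothetical Python sketch; these names and signatures are invented for the discussion and are not TAO's or TAOBench's actual interface.)

```python
from typing import Any, Optional

class SocialGraphStore:
    """A hypothetical sketch of a TAO-style API: objects (nodes) and
    associations (edges), with simple key-value-like operations."""

    # Reads
    def get_object(self, obj_id: int) -> Optional[dict]: ...
    def get_assoc(self, src: int, assoc_type: str, dst: int) -> Optional[dict]: ...
    def assoc_range(self, src: int, assoc_type: str, limit: int) -> list[dict]: ...
    def assoc_count(self, src: int, assoc_type: str) -> int: ...

    # Writes
    def insert_object(self, obj_id: int, data: dict[str, Any]) -> None: ...
    def update_object(self, obj_id: int, data: dict[str, Any]) -> None: ...
    def delete_object(self, obj_id: int) -> None: ...
    def insert_assoc(self, src: int, assoc_type: str, dst: int, data: dict[str, Any]) -> None: ...
    def delete_assoc(self, src: int, assoc_type: str, dst: int) -> None: ...

    # One-shot transactions only (no interactive read-write transactions)
    def read_txn(self, keys: list[int]) -> list[Optional[dict]]: ...
    def write_txn(self, writes: list[tuple]) -> bool: ...
```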
But it definitely has, I'd say, some quirks as being such a large-scale system,
but I think because it has continued to run and continue to serve the growing workload for over 10 years,
it sort of speaks to this approach towards serving the social graph being successful. And it has a sharded, tiered architecture. So TAO itself is actually two graph-aware cache layers; underneath TAO there's a statically sharded MySQL database, and above TAO there is a client-side cache. There are actually many layers of caches, but the main one that we looked at was the client-side cache, because a lot of requests, the hotkeys, end up being served from that cache. I think we say in the paper it captures like 14% of the workload, so a significant portion.
What do TAO's request patterns look like? How do these actually manifest themselves? I know you said there's a large variety, but what were the sort of general patterns?
I would say at a high level the requests are very read dominant. For the purposes of the benchmark, we looked at traces over three days, and in those traces, we found like 99.7% are reads. So for the most part,
people are sort of fetching data rather than updating data. And this is not surprising, because, let's say you're loading posts in your feed: every post you load has a lot of different components, and you have to read in all those components. And you don't like, or at least most people don't like, every single post they're viewing. So you end up looking at a lot of things before you send updates to the system, and that's mostly why it's so read heavy. Also, one thing we've noted is the extreme skew,
and this isn't unexpected because there are certain users in the social graph that sort of are much more popular
than others. But even in the skew, we found that there's a lot of diversity that hasn't necessarily
been captured in previous benchmarks. So one example of this is looking at hotspots. So we
specifically studied read and write hotspots for both normal writes and transactional writes. And what we found is that there are some hotspots that are read and written to frequently. So something like, let's say Lady Gaga actually posted something, that would probably get a lot of engagement. But there are also other hotspots that are write-only hotspots. So there's a ton of writes going to these keys, but very few reads.
And this could result from asynchronous jobs running in the background, or migration tasks, or maybe an AI or ML model that needs training data. So they're getting this data and updating it in some sense, but not reading from it very frequently. These are reasons why we see these write hotspots, and they suggest that there's potential to further optimize how we provide guarantees
there.
And is there a temporal aspect to things as well? Because obviously I've some experience with the LDBC stuff, and we kind of have this notion of spiking trends, where, I don't know, someone comes in and posts an outrageous post on Facebook and then they get cancelled, and then there's a massive drive of activity around people either liking that or commenting on it or unfriending them or whatever. Does that ever manifest itself in the request patterns? Or I guess it's only over three days, so maybe you didn't capture that.
That's a great question and something we definitely looked at. So at Meta, at least, the most prominent trend is
there's a daily spike in the workload that's very noticeable when you look at these traces. It's around, I can't remember what time exactly, maybe around noon, but basically a lot of people are accessing social network applications at that time, so we see that spike. And we debated whether to capture that or not, and how to think about periodicity and the temporal part. So we found that while there are those daily spikes, and we also measured different periods, specifically we looked at July 4th versus a normal week, like three days around July 4th and a normal week. We found that while the volume changes, what we cared about is those probability distributions.
And those distributions, including the tails, don't actually change that much across those different periods.
So that's why we decided to aggregate it across three days, because we wanted to focus on capturing those probability distributions. But definitely one of the future works I want to look at is how we can think about the temporal aspects and how that
matters for performance.
What are the parameters I have available to me when I want to go and run TAOBench? What parameters can I tune to change things?
Yeah, so we actually have a small set of parameters, and we found this small set was sufficient to fully characterize the social network workload. So it's nice because there aren't too many knobs you have to tune if you want to run the benchmark. I would break down the parameters into two broad categories. The first ones are pretty general and apply to most systems. For the second, the way that group of parameters is captured is unique to TAO, but the properties they're capturing are represented, I think, in most social graphs. So I'll explain those.
But the first group are very general, like transaction size.
So we have distributions for both the read and write-only transactions.
And those can be tuned.
We have different workload configuration files.
So they've already been tuned for you if you want to run a particular workload.
But if you wanted to try something else, you could change those. We also have
sharding. So this determines what shards the key should go on for both objects and edges in the
social graph. And this is sort of information that's available, but not always used by different
systems.
So some of the databases we ran against have their own sharding strategies. And there it's interesting to see how a very skewed workload can be balanced, or maybe not balanced, across different shards.
We also have operation types.
So for the benchmark, we basically have reads and writes, where writes consist of inserts, updates, and deletes, and then read transactions and write transactions, and the proportions for those.
We also have request sizes.
So how much data we're reading or writing.
This is also a distribution.
And so that's like the first category.
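(To make that first category concrete, here is a hypothetical example of what a workload configuration with these knobs could look like. The field names and numbers are made up for illustration and are not one of TAOBench's shipped configuration files.)

```python
# A made-up workload configuration illustrating the general parameters:
# operation mix, transaction size distributions, request sizes, and sharding.
workload_config = {
    "operation_mix": {
        "read": 0.907,
        "write": 0.083,        # inserts, updates, and deletes combined
        "read_txn": 0.008,
        "write_txn": 0.002,
    },
    "read_txn_sizes":  {2: 0.50, 4: 0.35, 16: 0.15},   # size -> probability
    "write_txn_sizes": {2: 0.70, 4: 0.25, 8: 0.05},
    "request_size_bytes": {128: 0.6, 512: 0.3, 4096: 0.1},
    "num_shards": 64,
    "shard_of": lambda key: hash(key) % 64,  # which shard an object or edge lands on
}
```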
And then the second category,
the first is association or edge types.
These basically capture particular integrity constraints
on the social graph. So there are some cases where you always want like paired edges between two
nodes. You don't want just one edge. So that's one property we capture. Another one is uniqueness.
So between two nodes, you should only have one edge of a particular type. So these are important for things like,
let's say you have a Facebook account, Instagram account, and you want to link those. You don't
want to be linking like one Facebook account to like multiple Instagram accounts or something
like that. And then there's also something called preconditions. These are essentially correctness constraints on operations. So while TAO doesn't currently support fully interactive read-write transactions, you can still do something similar to a compare-and-swap. If you first send your read, and let's say you read x and the value of x is one, then you can condition your write on the value of x. So you can say, only allow my write to succeed if x is still 1. So it's, in effect, basically an interactive read-write transaction, but the read and write are not both within a transactional API, because if you do that, then your read and subsequent write will block everyone else. And it's really important, especially for reads, that those have high performance, which is why we currently have this format where the condition only applies on writes, not the reads, so we're not holding up resources for too long.
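(A minimal sketch of that compare-and-swap-style pattern, assuming a hypothetical conditional-write call rather than TAO's actual interface.)

```python
def increment_if_unchanged(store, key: int) -> bool:
    """Read outside of any transaction, then condition the write on the
    value observed: the write succeeds only if the key is still unchanged."""
    observed = store.get_object(key)                    # plain, non-blocking read
    new_value = observed["counter"] + 1
    # Hypothetical API: apply the write only if the precondition still holds.
    return store.conditional_write(
        key,
        precondition={"counter": observed["counter"]},  # compare-and-swap style check
        update={"counter": new_value},
    )
```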
And then the final parameter is the read tier. So as I mentioned, TAO sort of has three tiers: it has the client-side cache, TAO itself, and the database. And depending on which tier you're getting the request from, this can have a very high impact on latency, at least an order of magnitude, is what we showed in the paper, I think. And depending on the underlying database, it might already have its own caching layer, in which case this wouldn't apply, but we needed this parameter to capture it on TAO, to basically make sure that our workload was representative.
What's the schema of the
social graph in Facebook? Is it kind of very similar to a traditional labeled property graph, where the edges have properties on them as well, or is that stored elsewhere?
Basically, as I mentioned, TAO is served by MySQL under the hood; MySQL is the bottom-most layer. And since MySQL is a relational database, basically what they did is they converted both objects and edges in the social graph into rows. So let's say an edge has a particular type: that's just stored in a column, so you basically condition on that column, like only read that edge if it has a certain type. So it's much closer to a key-value or relational model than a graph database.
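(For a concrete feel of that relational mapping, here is a tiny, self-contained sketch using SQLite. The table and column names are invented for illustration and do not reflect TAO's actual MySQL schema.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Objects (nodes) and associations (edges) stored as plain rows.
    CREATE TABLE objects (id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE assocs  (src INTEGER, assoc_type TEXT, dst INTEGER, data TEXT,
                          PRIMARY KEY (src, assoc_type, dst));
""")
conn.execute("INSERT INTO objects VALUES (1, 'alice'), (2, 'bob')")
conn.execute("INSERT INTO assocs VALUES (1, 'friend', 2, '{}')")

# Reading an edge conditions on the type column, as described above.
rows = conn.execute(
    "SELECT dst FROM assocs WHERE src = ? AND assoc_type = ?", (1, "friend")
).fetchall()
print(rows)  # [(2,)]
```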
Yeah. My next question is: other benchmarks, like YCSB, for example, have standardized workloads. People can say I ran
YCSB workload A. Have you
standardized workloads for TAOBench? Currently, we have
three open source workloads with plans to release more.
Our workloads are T, A, O, spelling out TAO. The first one, T, is the transactional workload. This basically captures the current transactional workload on TAO and sort of all requests that are under the transaction API. And then the A workload is the application workload.
This represents the speculative transactional workload. So right now, we have a lot of
applications that are sort of sending
simultaneous reads and writes together, and they're basically sending these requests with transactional intent. But because the rollout of these APIs on the production system takes years, we're still in the process of making sure everything is rolled out. So this workload is interesting because it has a longer tail than the T workload: there are some very large read and write applications, and it has heavier skew in some cases. And we've actually found this workload can be quite challenging for some database systems, so it's been able to identify some interesting performance opportunities. And finally, workload O is the overall TAO workload. So here we're focusing also on the single point reads and writes, and this workload is notably read heavy, as expected. I think with these three we sort of capture a range of interesting patterns on TAO, and we're hoping we can also release more.
Fantastic. So what metrics does the system compete on then?
Is it just throughput and latency, or is there some sort of price performance metric that you've formulated?
Yeah, at the moment, we are just focusing on throughput and latency.
Okay, cool. As a user then, what is the benchmarking workflow? What do I need to implement to get a valid run of TAOBench?
So we definitely tried to simplify TAOBench as much as possible, and we made it very similar to YCSB in a lot of aspects. YCSB has a lot of different drivers at this point, and we actually already support a lot of different systems.
So if you want to run either like Postgres or MySQL compatible systems,
there's already drivers for those.
Or if it happens to be Cloud Spanner, CockroachDB, TiDB, YugabyteDB, or PlanetScale, those are already all supported.
If you want to support a new system, basically what you have to do is convert our requests, so reads, writes, and read transactions, to the equivalent request on your system. It might be a different syntax, like Spanner has its own sort of query language. You have to do that conversion because that's not present in our benchmark framework, but that process is pretty straightforward.
So we have a website, taobench.org,
and there's documentation there on how to do that conversion.
It's usually pretty straightforward because our API is simple.
This can usually be done in like an afternoon or two.
It's not like a heavy process; you basically just want to make sure you're using the right syntax for your system and converting our queries into its equivalents.
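(As a sketch of what that conversion layer amounts to, here is a hypothetical driver interface; the method names and the SQL translation are illustrative and are not TAOBench's actual driver code.)

```python
from abc import ABC, abstractmethod

class BenchmarkDriver(ABC):
    """A hypothetical driver shim: a new backend only has to translate the
    benchmark's small set of operations into its own query syntax."""

    @abstractmethod
    def read(self, key: int) -> bytes: ...
    @abstractmethod
    def write(self, key: int, value: bytes) -> None: ...
    @abstractmethod
    def read_txn(self, keys: list[int]) -> list[bytes]: ...
    @abstractmethod
    def write_txn(self, writes: dict[int, bytes]) -> bool: ...

class SQLDriver(BenchmarkDriver):
    """Example translation for a SQL-speaking system (schema names invented,
    assuming a DB-API connection with sqlite3-style placeholders)."""
    def __init__(self, conn):
        self.conn = conn
    def read(self, key):
        return self.conn.execute("SELECT data FROM objects WHERE id = ?", (key,)).fetchone()
    def write(self, key, value):
        self.conn.execute("INSERT OR REPLACE INTO objects VALUES (?, ?)", (key, value))
    def read_txn(self, keys):
        return [self.read(k) for k in keys]
    def write_txn(self, writes):
        with self.conn:  # one-shot, write-only transaction
            for k, v in writes.items():
                self.write(k, v)
        return True
```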
Yeah, awesome. I will drop a link to the website in the show notes so the listeners can go and find that. How does it work with data generation? Do you have a data generator that produces all the data necessary for a benchmark run?
For us, the benchmark runs in two phases.
There's a workload generation phase, in which we're basically preparing the database, and then
the actual request generation. And we only do measurements during the request generation,
though we've actually helped various databases identify things in the workload generation phase
because we can run at large scale
and we sort of found some weaknesses in their systems
during the workload generation phase,
even though that wasn't intentional.
What the workload generation does is generate a baseline social graph on which subsequent requests can operate. And here we basically send create requests, which, if it's a SQL database, are insert requests into the system. We create all the nodes, and we pre-allocate all the edges between the nodes, but we don't write all those edges, because sometimes you need to test uniqueness and other constraints. So if you pre-write those edges, then you'll never satisfy uniqueness, because the edge already exists. And you can also warm up the cache during this phase, because we basically send out the create requests and then we batch
read back into the benchmark all the objects and edges we want to operate on, so that's in memory. And you only need to do this once for all the experiments you want to run on that particular system for a particular workload. So once you've done the generation, then you can run multiple experiments on that same data.
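(Putting the two phases together, a minimal sketch of that load-then-run flow, again with hypothetical names built on the driver interface sketched earlier.)

```python
import time

def load_phase(driver, num_objects: int):
    """Phase 1: build the baseline social graph and warm the cache.
    Edges are pre-allocated (planned) but deliberately not all written."""
    for obj_id in range(num_objects):
        driver.write(obj_id, b"initial")
    planned_edges = [(i, i + 1) for i in range(num_objects - 1)]
    driver.read_txn(list(range(num_objects)))  # batch read back to warm caches
    return planned_edges

def run_phase(driver, request_stream, num_requests: int):
    """Phase 2: the only part that is measured (latencies collected here)."""
    latencies = []
    for _ in range(num_requests):
        op, key = next(request_stream)
        start = time.perf_counter()
        driver.read(key) if op == "read" else driver.write(key, b"x")
        latencies.append(time.perf_counter() - start)
    return latencies
```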
Cool. So how is TAOBench used internally within Meta?
Before TAOBench, the only way engineers could test different things was a very limited stress testing tool that wasn't realistic: it could produce very basic patterns, basically to validate something was able to serve traffic, with no guarantees on the type of traffic. Or they could do shadow runs, where they basically try something out and then run it on existing production data. So they were really limited, especially in being able to test new workloads, because there was no way you could test a new workload unless you changed something in production, but once you did that, you could cause issues in production. So they were very, very cautious about what they were releasing to production, and this slowed a lot of things down. So
we talk about several different use cases in the paper, for example looking at new transactional use cases, looking at contention under longer lock hold times if there are network delays and things like that, evaluating new APIs, and also quantifying the performance of high fan-out transactions. But basically, the general gist is that TAOBench has now enabled us to test new features, optimizations, and reliability. These might sound like standard things, but at scale it's actually quite complicated, and you still want to test realistic things without affecting production workloads. So the most powerful thing I think TAOBench enabled Meta engineers to do is that now we can try out new workloads very quickly, just by changing the probability distributions, and at large scale as well. And then from that you can predict errors without having to wait until a very late stage in pre-production, or even encountering these errors in
production. So one example is we had this new transactional use case that was, I think, creating an object and adding an association, an edge in the graph. And we basically replicated the workload with TAOBench and ran it, and we found there were some contention errors. I think it had higher object contention than edge contention.
It was still like relatively low,
but we were able to sort of measure these errors.
And then when they actually rolled out
this use case in production,
they found the same breakdown of errors.
So that was really exciting
because it was able to validate both
that TAOBench could capture these workloads
in a representative manner,
but also we were able to do this testing much earlier
on, and sort of foresee errors before it's months down the line and then you have to figure out, oh, how do we fix that, how do we know we fixed it?
Is there any sort of continuous tracking of the type of workloads? I guess what I'm asking is: Meta's workload has, I'm guessing, changed a lot over the last 10 years. Are they noticing changes now that you can then, I guess, model with TAOBench and say, okay, look, we're seeing things heading in this direction; if this trend continues, these are the kinds of things we need to fix, or this is a potential optimization? Is there anything like that you can do with it?
Yeah,
definitely. So one of the recent things I'm supposed to do: the workloads we have right now were captured last year, when we were working on the paper and on the benchmark, and since then there's been a bunch of rollouts of different transactional use cases. So I'm hoping to capture the workload now, which will probably have very different properties than the previous one from a year ago. Also, capturing those snapshots and seeing how performance changes, and whether there are certain issues we should address now given the new workload, is definitely something we can do with TAOBench, and it's exciting.
So you said earlier on as well that TAOBench has reference implementations for a number of cloud-native systems. I guess the question here is, what has the impact of TAOBench been outside of Meta, and how did you go about evaluating TAOBench's suitability to these systems, helping those systems study performance trade-offs and identify potential optimization opportunities?
As I mentioned, we currently support drivers for five different databases: Cloud Spanner, CockroachDB, PlanetScale, TiDB, and YugabyteDB. And when we were working on the paper, we basically wanted to show that our benchmark could
have interesting results on systems outside of meta so we worked with engineers at each of these
companies to run the benchmarks and we were running on a cloud server so hosted servers
that they could monitor so they sort of had full visibility into
the results and also helped us in some cases, like tune their system to get the maximum performance
we could see. And I'd say there were two main trends in the ways we were able to help the systems. One is that some systems did not benchmark
as much as they should just in general.
So running a benchmark,
especially a challenging benchmark on them,
was able to quickly reveal a lot of performance bugs
or opportunities that they were able to plug pretty easily.
But also I think the unique aspect of our workload
is the skew, and the asymmetric skew that we capture, which is not necessarily represented even in something like YCSB, where if you add skew, for the default case you basically just increase the reads and the writes proportionally. For us, we have hotkeys that are hot for certain operations and not for others.
And because it's very extremely skewed, we're able to help systems identify interesting trends.
So I can talk about some examples of those.
So for CockroachDB, they found one.
So we're running our benchmark on their Kubernetes cloud cluster.
But then when they ran it themselves against their bare metal machines,
they found like a 30% performance improvement. That was because something about Kubernetes wasn't configured as well as it should have been.
Oh, wow. Jesus.
So that's something they're fixing now.
Something else we noticed is that, because it's so skewed, they do have an auto-sharding strategy to try to balance out that skew. But the default threshold for when you actually auto-shard is quite high, and it looks at the overall workload rather than at particular hotkeys. In our workload, and what we've noticed in the social graph workload, most keys can be pretty cold, but there are a couple that are sometimes several orders of magnitude hotter than other keys. Since they weren't looking at individual keys, they ended up having, even if it's just a couple of hotkeys on one shard,
that shard got a lot more traffic.
Those hotkeys should have been spread out more evenly.
So now they're exploring sort of better auto-sharding strategies
that are more sensitive to the particular keys,
the workloads running on.
For TiDB, we have now been integrated into their daily benchmarking suite, so they run this benchmark every day, and they're using TAOBench to help them catch any performance regressions as their system is still under active development. For PlanetScale, I think they're still building up their system, so they had some limitations in certain queries. There was an insert-and-select style query that wasn't supported, but we need this to enforce uniqueness constraints. So they prioritized that, and I think it's now available, actually; because of our benchmark, they added that to their schedule.
For YugabyteDB, we found a couple of issues; we helped them identify a couple of bottlenecks. One of them is that when we started running our benchmark on the system, their performance was unexpectedly slow. So for all the systems except Spanner, we tried to run with the same number of cores to get an estimate of relative performance. Spanner doesn't really publicly say the number of cores we have, so that comparison is less equal. But we found the performance was really slow for YugabyteDB, and we worked with them, and actually there was a performance bottleneck in their system they hadn't been aware of. So basically, they use Postgres under the hood, and they'd been using a monitoring extension that was taking exclusive locks, which limited the overall performance of their system. So they were able to get rid of that. And then they were also doing something weird with scans, where, because we were batch reading, we had a filter and a limit on the number of rows we were scanning, but they were actually pulling all rows in a table into memory on every scan. So that quickly fell over and led to out-of-memory errors. I think they've fixed that already, and it improved their scan performance by a lot.
Nice. So I guess, are there any situations then, though,
when using TAOBench might be, like, the wrong decision
for a user or for a database system?
Yeah, I would say there isn't necessarily, like,
when it's wrong, but it depends on sort of what...
Yeah, I should have phrased that as suboptimal, not wrong.
I'd say it really depends on how you want to be stressing your system, because benchmarks are tools that we use to better understand the performance of our systems, and I think TAOBench can look at those aspects in a very interesting way. But, for example, we are much more read heavy than write heavy, so there are certain workloads, ones that TPC-C is very good for, that are very update heavy. You have a lot of read-modify-writes, and those can have significant performance opportunities. It's not to say that TAOBench cannot capture those workloads.
Actually, if you like modify those distributions yourself,
and let's say like focus very heavily on like writes and write transactions,
you can capture them, but they're not like the default use case,
I would say, of TAOBench.
And that's the reason I feel like we have so many different benchmarks. It's because people found,
oh, like this benchmark captures one aspect of certain workloads well, but it doesn't capture
other aspects. So let me go and create a new one. With TAOBench, we tried to be as general as possible so that hopefully, if we can get interest from other social network companies in the future as well, they can add their workloads to our framework, because it's very general. But we are sort of targeting, I'd say, the social graph workload space.
Cool, yeah. And I was going to ask what the limitations of TAOBench are, but I guess maybe that's been captured by that.
Yeah, I think that mostly answers it. The one thing is we currently do not have
interactive read write transactions
because those are not,
those requests are not present on TAO's workload,
but those would be very easy to support on the benchmark
if like those are added in the future
or those are something that other users wanted.
But yeah, that's not currently available because that's not captured in Meta's workloads.
And our starting goal was to be representative
of Meta's production workloads.
What is the most interesting
and perhaps unexpected lesson that you have learned
while working on TAOBench?
Yeah, I would say there were a couple of interesting things. One is definitely the diversity of request patterns and skew and things like that that I observed. Looking at these real-world workloads, people can use, I would even say abuse, the system in very non-standard ways. Because as an academic, usually we're taught these are the best practices and this is how you should use databases, and that's very much not true. So one
of the things we observed, for example: we looked at contention, specifically write-write contention, because there are no blocking reads on TAO currently, so there's no read-write contention. And we found, I think it was more than 97 percent of write-write contention is intentional. Basically, the application knows what it's doing is going to cause a conflict but continues to do so, because it knows the system will handle it: let's say it sends a bunch of concurrent requests and it knows only one of them will succeed, and it can rely on the system to do that. So it will sort of offload this work to the system because it knows that it's safe to do so. I think that was very surprising. And there was a point when I started researching transaction processing when I felt
this is like such a well-studied field
in the 1980s
but I think
now looking at these different workloads
and these extreme cases
there are a lot of different problems
that people haven't considered
simply because they didn't have access
to like the data
and the workload pattern
so now I'm really excited to
leverage TAOBench to sort
of hopefully inspire new research problems.
So, yeah, I guess progress in research is very non-linear, right? There's lots of ups and downs. So from the initial conception of the idea for TAOBench to the final publication, were there things that you tried in that process, and kind of dead ends that you went down, things that failed, that maybe the listener might find interesting?
I think one thing is that figuring out that set of parameters took a lot of trial and error. Starting out, we had two approaches we could take. One is we start simple, with a list of things that we know would probably make sense to include, and then build it up from there by trial and error. The other was we come up with an exhaustive list of features, use an ML model, train it, and see if it will fit. We decided to go with the first one because it was more understandable and explainable, and, yeah, in some cases the ML model doesn't always capture things like extreme skew, so it might not always be accurate. But the first approach required us to manually try out new features, build them out, and see what
worked. So we did actually look at temporal patterns
and some things we discussed earlier,
and we thought that would make a difference,
but it didn't actually impact the access distribution,
only the volume.
So that's something we decided to put off for the paper.
And then, yeah, I think that was definitely
a lot of trial and error.
And then also, when we tried to get TAOBench running on other systems, at that point the code was already pretty mature because we'd run it internally, but there were still some kinks we had to iron out. And as we worked with the database engineers from the various companies, they also gave us feedback, like, oh, this isn't as clear as it could be. So we also ended up making a lot of changes, not modifications to the core logic, but changes to make the code easier to use. And also, running the benchmark on each of those systems was
definitely a process because even though the code development didn't take that long it took many
iterations for almost all the systems to figure out
if we were actually getting representative performance,
because the first run was like never good.
So then you have to keep iterating and seeing,
and then you get to a place where you feel it's pretty reasonable,
but we wanted to make sure we captured as representative results as possible.
So there was a lot of back and forth with each of the five databases. But as you'll see on the paper, there are a lot of authors; I fortunately had a lot of undergraduates at Berkeley who are very passionate about research, and they helped a lot with that sort of development process.
And so what's next for TAOBench? You've spoken about it briefly across the course of the interview, but what are the highlights? What are the initial next steps and the big-picture
goal for it?
Yeah, so we're open source now, which is exciting; we jumped through all the legal hurdles and stuff. We're definitely trying to do more PR for it and get more people to use it.
it's exciting that like the databases we've worked with have already like
seen value in the benchmark and are starting to integrate it into their
systems and actively use it as a part of their development process.
And we're hoping to like build out the community there.
I think another aspect is like what workloads are available.
So as I mentioned, the workloads we have currently are from 2021, and we're hoping to do a refresh on those and release more workloads.
And as a part of this, because I've seen different patterns, both in terms of the skew and also the long tail, where sometimes you have very large transaction use cases, I feel there are a lot of interesting research questions relating to
concurrency control. Anyway, I'm excited to pursue those going forward.
Can you briefly tell the listeners about your other research, about things like RAMP-TAO?
Yeah, yeah, happy to.
So, I'd say broadly my research is focused on transaction processing, with some focus, I would say, on large-scale systems. So the previous project to TAOBench that I worked on at Meta was called RAMP-TAO. This was ensuring read atomicity for this large-scale system, so basically how we can add a stronger guarantee without impacting current applications, and with low performance and storage overheads. I have a paper and a talk available on this; I think they're both linked on my website. And then I also have another project that we're
in the process of submitting right now on looking at how transactions and caching interact as well.
And that project actually uses TAOBench workloads. But yeah, we're hopeful that with TAOBench, we can inspire some of my own research, but also help people in both academia and industry highlight new problems and ways to address them. I think the goal of the benchmark, at the end of the day, is to help you understand your system and what its limitations are, so that you can improve it.
If you could kind of say something about how you approach
idea generation and selecting new projects, and what your process is for doing that? How do you choose what to work on?
I'd say, as a current PhD student, I'm definitely still figuring that out. But I would say what's worked really well for me, and what I prefer, is
i think being inspired by real world data and problems is great because you have sort of the grounding that this is a problem to at least like some real users.
Not necessarily to say that should always be the case or you should only look at the problems that industry often sees,
but sort of looking at that to understand the general trend of why current systems cannot perform to that degree.
And then using that to guide my goals, to try to understand that problem. And then once I have a sense of that, to think about different ways to solve it. And I think being an academic gives me the freedom to think outside of the box of, oh, we need to implement this right away. Maybe something would be harder to implement in a short amount of time, but it's a good long-term solution. So, yeah, I think for me, letting the real-world aspect inspire you but not necessarily limit you has worked well.
So what do you think is the biggest challenge in transaction processing now? What do you think is the biggest challenge facing us?
Yeah, I think that
while there's been a lot of academic work on how we can deal with certain high contention cases and things like that, a lot of those techniques, you find, are not applied in practice. On TAO, for example, they use two-phase locking and two-phase commit, which have been around for decades.
Exactly.
There have been hundreds, thousands of papers on things that perform better theoretically than two-phase locking and two-phase commit, and I think one of the reasons they're not applied is because 2PL and 2PC are so simple and understandable. So I think there's a need for better performance,
but also keeping in mind the simplicity
and what's actually feasible to implement in practice.
So last question now,
what is the one key thing you want listeners to take away
from this episode and from your research?
I'd say the key thing is that with this new benchmark, TAOBench, which captures Meta's production social graph workloads, we've been able to have impact both within Meta and outside of Meta, and it provides a stepping stone to a lot more exciting research on transactions and concurrency control. And I would encourage you to check out the benchmark.
Brilliant. Well, let's end it there. Thank you so much, Audrey. If the listeners are interested to know more about Audrey's work, then we'll put all the links to everything in the show notes, and we will see you next time for some more awesome computer science research.