Disseminate: The Computer Science Research Podcast - Audrey Cheng | TAOBench: An End-to-End Benchmark for Social Network Workloads | #15

Episode Date: December 12, 2022

Summary: This episode features Audrey Cheng talking about TAOBench, a new benchmark that captures the social graph workload at Meta. Audrey tells us about the features of the workload, how it compares with other benchmarks, and how it fills a gap in the existing space of benchmarks. We also hear all about the fantastic real-world impact the benchmark has already had across a range of companies. Links: Paper, Personal website, Meta blog post, GitHub repo. Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, the computer science research podcast. I'm your host, Jack Waudby. I'm delighted to say I'm joined today by Audrey Cheng, who will be talking about her VLDB 2022 paper, TAOBench: An End-to-End Benchmark for Social Network Workloads. Audrey is a PhD student in the Sky Computing Lab at UC Berkeley, and her research focuses primarily on transaction processing for database systems. Audrey, thanks for joining us on the show. Thank you so much for having me, Jack. It's a pleasure. Let's dive right in. So how did you end up researching transaction processing? What was your journey to this point? Yeah, so now I am a third-year PhD student at Berkeley, and I ended up here mainly because of my undergrad advisor,

Starting point is 00:01:07 Professor Wyatt Lloyd at Princeton. In my junior year of college I was interested in doing research, and he has done a lot of work on transaction processing, so I reached out to him and ended up working on a project that extended Peter Bailis's Bolt-on Causal Consistency paper. In that paper, Peter looked at how you can go from eventual to causal consistency, and I extended that hierarchy: how you can go from causal to sequential, and then from sequential to linearizability, and so forth. I felt that was a lot of fun, just thinking through protocols and coming up with new techniques to ensure stronger guarantees. So that's how I fell into thinking about strong guarantees, and then
Starting point is 00:01:51 that naturally led to transaction processing. So let's talk about Tilebench then. Can you start off by telling us exactly what Tilebench is? Yeah, of course. Happy to. So Talvent is an open source benchmark based on Meta's production workloads. Specifically, we look to capture the sort of request patterns, including transactional ones that you'll find on the social graph. is the first time that a social networking company of this scale has open sourced their workloads especially transactional workloads so we hope it can definitely be a very helpful resource for the database community and we've already seen impact both within meta and outside of meta for other databases. You've touched on it a little bit there but what is the the motivation for developing a benchmark specifically for social networks? So I've been interning at Meta for a couple of summers. One of my first projects was looking at adding stronger guarantees to the to their like social graph data store. This was a ramp tab paper, which is in BLDB 2021, for anyone interested in that. But as I was doing that, one of the things I was trying to look at was how these transactional workloads
Starting point is 00:03:13 operated at large scale and thinking about the needs of applications using transactions. And that's what helped me decide that I wanted to take a more holistic view of sort of what the social graph workload looks like. And because I had access to data, I thought it was really exciting. So I spent a bunch of time looking at different sorts of skew, and also how data is stored on different shards, because explicit collocation is actually quite important, we found, for the social network workload. And as I was looking at these things, I realized that none of this was captured in other existing benchmarks, especially social network benchmarks. There are a couple, the prominent ones I'd say are LinkBench, which was released by Meta in 2014, but that was only focusing on the workloads of a single MySQL instance. So you lose a lot of interesting
Starting point is 00:04:03 patterns you see with geo-dist distribution and also at the cache level, which serves the majority of the workload. And also there's LBBC, which is a very important workload, but that targets mainly graph databases. So the focus here was more complex graph processing queries rather than the many serving requests we saw on TAO. So most, the majority of requests we have are like point reads and some point writes, but then you also do get some very complex queries and longer transactions,
Starting point is 00:04:33 and especially figuring out the interaction between these and also on shared data made me really excited to, one, study this workload and then try to open source it in a benchmark. You mentioned the other benchmarks in this space. Are there any more than LinkBench and LDBG? How does it compare to something like TPCC or YCSB or something like that, right? Yeah, great question. So in our paper, we identify five properties that are important for social network benchmarks.
Starting point is 00:05:04 And I think TPCCCC and YCC are definitely important, like OLTP workloads, but they don't necessarily capture some of the properties we've seen on social graphs, including being very read heavy, like TPCC is notably update heavy. And we also have extreme skew. We have very, very severe hotspots in some cases, like let's say Lady Gaga posts something, then a ton of people are trying to look at that post and like that post. So, and also the correlation between requests, and we also have different applications interacting in the same data
Starting point is 00:05:38 versus most benchmarks, I think, if you'll try to capture the properties of one application. For a lot of different reasons, we felt there was a need for a new benchmark. It was not that TPC0, YSP are not important. They're very important points in sort of the workload space, but we felt within the social graph workload space, there wasn't a representative benchmark
Starting point is 00:05:58 and that's what we sought to capture with TabBench. Awesome, it's funny. I always, when I'm ever talking about like supernodes in a graph setting, I always use Lady Gaga for my example as well. with tab edge awesome it's funny i always when i'm ever talking about like super nodes in in in a graph saying i always use lady gaga for my example as well i don't know what it is about lady gaga but i always use that as well cool yeah so you mentioned there's five properties there like that you think should be captured by a social network benchmark can you maybe dig into these a little bit more walk us through them what they are and say specifically why they are important?
Starting point is 00:06:26 So these properties were ones we came up with after talking to a bunch of different engineers, industry practitioners and also looking at the workloads and trying to identify why those are important for performance or other aspects of the system. So the first one is that we want to make sure we're accurately emulating social network requests. As I mentioned, like things like TPCC and YCSB, they're designed for a specific purpose, but they're not designed to capture the social graph. And because we have things like extreme fear, high correlation, redominance, it's important to design them specifically for the social network, which is a very important application domain and has had enduring popularity for maybe several decades now, we can even say. So things like link bench and LDBC are indeed designed, but not all benchmarks are satisfied. Property one. And the second property is that we want to make sure we capture transactional requirements,
Starting point is 00:07:20 and these can vary. So LDBC, for example, has full interactive rewrite transactions, while on TAO, the system has chosen to only support sort of one-shot read-only and write-only transactions, because we found these were able to perform, have improved performance and scalability, given the challenging workload that at the moment is like tens of billions of requests per second. So serializability, like full coordination is not yet feasible at this scale. And the third property is about co-location preferences and constraints. And we found that no listing benchmark actually satisfies this property. So what we mean here is like information about sharding and how data is stored. And we found within a real-world system
Starting point is 00:08:05 that data placement isn't actually just an implementation detail, but it can reflect user intent, privacy constraints, and or regulatory compliance, especially with the growing concerns for data privacy and stuff. Sometimes it has to be stored in certain regions and that can definitely impact how the workload is served
Starting point is 00:08:24 and also have concurrency control and performance consequences. So we thought this was, to the best of our knowledge, no social network benchmark or most of the popular like LTP workloads capture this data and something we wanted to make sure that we supported. And the fourth property is about how we model these requests. And I think it also illustrates a trade-off in representing workloads. So most standard benchmarks take the approach of they want to represent specific behavior. So for example, they want to capture, I think one of the queries in LDC is like you want the friends of friends with maybe some property and like a very complex query. And you want to sort of capture the action of, OK, this is what would happen. So you have like these distinct query types. LDBG has 29, TBCC, for example, has five.
Starting point is 00:09:17 We took an approach on the other side of the spectrum where we didn't attempt to capture any of these types. That's because on TAL, there's like tens of thousands of distinct query types running per day. So to replicate all of those and replicate the code would be impeacable. So instead we decided to model these workloads using probability distributions without these prescriptive query types,
Starting point is 00:09:40 which allows us to capture a much broader workload. But there's a trade-off of with these distributions, if the workload changes, we're able to very easily capture that without needing to modify specific queries. But the trade-off here is understandability. We're not able to say, okay, these requests came from this particular application,
Starting point is 00:10:02 doing this. But that actually also helps us hide some of those privacy concerns so we're not sort of highlighting exact user actions. So it definitely is a trade-off, but I think an interesting one. And for the most part, I think YCSB and to some degree, LinkBench
Starting point is 00:10:19 sort of also takes an approach of like using request distributions rather than query types, but not the most common, I'd say, approach that most benchmarks have taken. Yeah, I know YCSB kind of has that skew factor parameter, right? Which you can vary to kind of, I guess, simulate that to some extent. But yeah, anyway, sorry, continue. Property five.
Starting point is 00:10:41 Yeah, and the final property is we wanted to make sure that the benchmark workload exhibited in multi-tenant behavior on shared data. This is important because when you think about the social graph, you might think, oh, it's like the Facebook application or something like that. But actually, there's a lot of different services and products running on the same data. So there's like Messenger. There's like WhatsApp. There's also like ads, for example. There's Marketplace. and products running on the same data. So there's Messenger, there's WhatsApp,
Starting point is 00:11:05 there's also ads, for example, there's Marketplace, there's also now the Metaverse. And these are all based on the same users accessing these different applications. So thinking about the complex interactions between how one application can affect the behavior of another is really important. And I think the LBBC benchmark paper acknowledges this,
Starting point is 00:11:26 that you do need to capture this, but their approach was just to run mixed workloads. And this doesn't necessarily reflect how different applications actually interact in the real world, in real world cases. So it was exciting that we're able, we had this data at hand, we're able to study these
Starting point is 00:11:41 and then also look specifically at like product groups, which are like sets of applications that have that share the same data use the same infrastructure and we do deep dives into how this behavior can vary both within product groups and across product groups to then satisfy these these properties you then profiled and tau which is obviously facebook's graph system so can you maybe tell the listener more about this system and i guess why you chose it but you chose it right because that is their system right and you were you were inside in there so that kind of answers itself but yeah can you just explain to us a little bit more about the system what its architecture is and some of the background
Starting point is 00:12:21 on that yeah happy to and i think this is important as well as why like this the workload this particular system was of interest so tau as i mentioned is the social graph data store at meta so it stores and serves the vast majority requests the social graph there are certain other stores um some like blob storage, for example, that will store like very large videos or things like that. For the most part, whenever you're accessing one of Facebook applications, it's pulling the vast majority of requests and also doing like serving those rights and stuff within TAO. So TAO has been around for around 10 years now. I want to say there's an ATC paper on the system published in 2013. And since then, it's grown a lot bigger.
Starting point is 00:13:12 So currently, it serves over 10 billion reads and tens of millions of writes per second on a changing dataset of many petted bites. So it's operating at a very, very large scale. It's distributed and runs in many different regions across the globe. And as such, I think it offers unique insights into the modern social network workload because we sort of have the majority of the social network requests like on this one system.
Starting point is 00:13:39 And we're able to like look at the system and understand these request patterns in terms of how it actually models data so it has a graph data model but not a graph data api so it has objects and which are called nodes and associations which are edges in the social graph um and it was originally designed to sort of operate at large scale so it it didn't have any consistency guarantees originally. And it was sort of the designers focused on like high availability and low latency. And as such, they also imposed a very simple API. So it's similar to a key value store, I'd say. But it does have range queries. So it has like point, get, range, and count queries, and then insert, update, and delete for writes.
Starting point is 00:14:30 And then it's since added write transactions, also prototype read transactions, more detail can be found in the Ramp paper about how those were added. But it definitely has, I'd say, some quirks as being such a large-scale system, but I think because it has continued to run and continue to serve the growing workload for over 10 years, it sort of speaks to this approach towards serving the social graph was successful, and it has a sharded tiered architecture so tau itself is actually like
Starting point is 00:15:06 two graph aware cache layers and underneath tau there's a statically sharded mysql database and above tau there is a client-side cache also there's actually many layers of caches but the main one that we looked at was a client-side cache because that can a lot of like request the hotkeys end up being served from that cash so it actually I think it captures we say in the paper like 14% of the workload so a significant portion. What do TOW's request patterns look like? How do these actually manifest themselves? I know you said there's a large variety in both what was the sort of general patterns i would say at a high level it is very the requests are very read dominant so i believe we say yeah that so we looked for the purposes of the benchmark we looked at traces over three days and in trace, we found like 99.7% are reads. So for the most part,
Starting point is 00:16:08 people are sort of fetching data rather than updating data. And this is not surprising, because like, let's say you're like loading posts in your feed, you once every post you load has a lot of different components, you have to like read in all those components. And you don't like, like, at least most people don't like every single post they're viewing so you end up sort of looking at a lot of things before you send updates to the system so that's mostly why the it's so like read heavy and also one thing we've noted is like the extreme skew and this isn't unexpected because there are certain users in the social graph that sort of are much more popular than others. But even in the SKU, we found that there's a lot of diversity that hasn't necessarily been captured in previous benchmarks. So one example of this is looking at hotspots. So we
Starting point is 00:16:58 specifically studied read and write hotspots for both like normal rights and transactional rights. And what we found is that there are some hotspots that are like read to and written to frequently. So something like, let's say they got actually posted something that would probably get a lot of engagement, but there are also other hotspots that are like write only hotspots. So there's like a ton of rights going to these keys, but very few reads. And this could result from, there's often like asynchronous jobs running in the background,
Starting point is 00:17:28 like maybe, or like migration tasks, or maybe like an AI or ML model like needs training data. So they're like getting this data and updating it in some sense, but not reading from it very frequently. These are reasons why we see these right hot spots and they suggest that um there's potential to further optimize sort of providing guarantees there and also like is this is there a temporal aspect of things as well because i know because obviously i've some experience the ldbc stuff we kind of have this notion of them
Starting point is 00:18:02 like spiking trends where you have like uh i don't know someone comes in and posts an outrageous post on facebook and then they get cancelled and then there's a massive drive of activity around people either uh liking that or commenting on it or unfriending them or whatever does that ever manifest itself in the request patterns or i guess it's only over three days so maybe you didn't maybe capture that that's a great question and something we definitely looked at so at meta at least the most prominent trend is there's like a daily spike in the workload that's like very noticeable um when you look at these and it's sort of around i can't remember what time around it is maybe around noon but basically a lot of people are accessing um like social
Starting point is 00:18:45 network applications at that time so we see that spike and we debated like whether to capture that or not and how to think about like periodicity and the temporal part um so we found that while there are sort of those daily spikes and also we measured different periods so specifically we looked at like july 4th versus a normal week, like three days around July 4th and normal week. We found while like the volume changes, what we cared about is those probability distributions. And those distributions, including the tails, don't actually change that much across those different periods. So that's why we decided to sort of aggregate it across three days, because we wanted to focus of aggregate it across three days um because we
Starting point is 00:19:26 wanted to focus on capturing those probability distributions but definitely one of the feature works i want to look at is like how can we think about looking at the temporal aspects and how that matters for like performance what are the the parameters i have available to me to me when i'm uh i want to go and run tile bench what parameters can i tune to to change things yeah so we have actually a small set of parameters and we found like this small set um was sufficient to fully characterize the social network workload so it's nice because there's not too many knobs you have to tune if it's some, if you want to run the benchmark. So I would break down the parameters into two broad categories. One is like, they're pretty general and they apply most systems. And the
Starting point is 00:20:16 second is there, the way that the second group of parameters are captured is unique to tau, but the properties they're capturing are represented, I think, in most social graphs. So I'll explain those. But the first group are very general, like transaction size. So we have distributions for both the read and write-only transactions. And those can be tuned. We have different workload configuration files. So they've already been tuned for you if you want to run a particular workload.
Starting point is 00:20:44 But if you wanted to try something else, you could change those. We also have sharding. So this determines what shards the key should go on for both objects and edges in the social graph. And this is sort of information that's available, but not always used by different systems. So some of the systems, some of the databases you ran against have their own charting strategies. And there it's interesting to see how like a very skewed workload can be balanced or maybe not balanced across a different chart. We also have operation types. So for the benchmark, we basically have like reads, writes, reads, writes,
Starting point is 00:21:23 consists of like inserts, updates, and deletes, and then read transactions and write transactions and the proportions for those. We also have request sizes. So how much data we're reading or writing. This is also distribution. And so that's like the first category. And then the second category,
Starting point is 00:21:39 the first is association or edge types. These basically capture particular integrity constraints on the social graph. So there are some cases where you always want like paired edges between two nodes. You don't want just one edge. So that's one property we capture. Another one is uniqueness. So between two nodes, you should only have one edge of a particular type. So these are important for things like, let's say you have a Facebook account, Instagram account, and you want to link those. You don't want to be linking like one Facebook account to like multiple Instagram accounts or something like that. And then there's also something called preconditions. These are essentially
Starting point is 00:22:20 correctness constraints on operations. So while Tau doesn't support currently like fully interactive read write transactions, you can still do something similar to like a compare and swap. If you first send your read and let's say, read, you read x and the value of x is one, and then you can condition your write on the value of x. So you can say, only allow my write to succeed if x is still 1. So it's, in fact, basically, an interactive read write transaction, but they're not both within a transactional API, because sort of if you do that, then your read and then subsequent write will block everyone else. And it's really important, especially for reads that those have high
Starting point is 00:23:01 performance, which is why we currently have this format where the condition only applies on writes, but not the reads. So we're not holding up resources for too long. And then the final parameter is the read tier. So as I mentioned, TAO sort of has three tiers. Probably it has the client-side cache, TATSOF, and the database. And depending on which tier you're getting the request from this can have very high impact on latency like at least in order of magnitude is what we showed on the paper i think um and this is this is information that also like depending on the underlying database it might already have its own caching layer in this case it like wouldn't apply but we needed this parameter to capture it on tau to basically make sure that our workload was representative what's the um the schema of the of the social graph in facebook is it kind of very similar to uh like a traditional labeled
Starting point is 00:23:57 property graph where these the edges have properties on them as well or is the or is that stored elsewhere basically as i mentioned tau is served by mysql under the hood like mysql is the third the bottom bottom most layer and since mysql is a relational database basically what they did is they converted both objects and edges in the social graph into rows so let's say like an edge has a particular type that's just stored in a column so you basically condition on that column like only read that edge if it has a certain type yeah so it's much closer to like a key value or like a relational all than the graph database yeah my next question is uh other benchmarks like yCSB, for example, have standardized workloads. People can say I ran YCSB workload A. Have you
Starting point is 00:24:51 standardized workloads for TauBench? Currently, we have three open source workloads with plans to release more. Our workloads are TAO, spelling out Tau. The first one, T is the transactional workload. So our workload, the TAO spelling out TAO. So the first one, T is the transactional workload. This basically captures the current transactional workload on TAO and sort of all requests that are under the transaction API. And then the A workload is the application workload. This represents the speculative transactional workload. So right now, we have a lot of applications that are sort of sending
Starting point is 00:25:25 simultaneous reads and writes together. And they basically sending these requests with transactional intent. But because the rollout on the projection system to using these APIs takes years, we're still in the process of sort of making sure everything is rolled out. So this workload is interesting because it has a longer tail than the T workload. So there's some like very large read and write applications that has heavier skew in some cases. And we've actually found this workload can be quite challenging from database for some systems. So it's been able to identify some interesting like performance opportunities and finally workload o is the overall tau workload so here we're focusing also on like the single point reads and writes
Starting point is 00:26:14 and this workload is notably read heavy as expected yeah i think these with these three we sort of capture a range of interesting patterns on Tau, and we're hoping we can also release more. Fantastic. So what metrics does the system compete on then? Is it just throughput and latency, or is there some sort of price performance metric that you've formulated? Yeah, at the moment, we are just focusing on throughput latency. Okay, cool. insane okay yeah cool as a as a user then what is the the benchmarking workflow what do i need to implement implement to to get a valid run of of tau bench so we definitely tried to simplify tau bench as much as possible and we made it very similar to ycsb in a lot of aspects these ycsb
Starting point is 00:27:00 have a lot of different drivers at this point and we actually already currently has a lot of different drivers at this point. And we actually already currently support a lot of different systems. So if you want to run either like Postgres or MySQL compatible systems, there's already drivers for those. Or if it happens to be like Spanner, CloudSpanish, CockroachDB, and TidyB or UgoByteDB or PlanetScale, those are already all supported. If you want to support a new system basically what you have to do is you need to convert our requests so like reads writes read transactions to the equivalent request on your system so it's like a different syntax like spanner has its own sort of query language. You sort of have to do that conversion because that's not present in our benchmark framework,
Starting point is 00:27:49 but that process is pretty straightforward. So we have a website, taobenz.org, and there's documentation there on how to do that conversion. It's usually pretty straightforward because our API is simple. This can usually be done in like an afternoon or two. It's not like a heavy process just basically you want to make sure your system where you're using like the right syntax for your system and like converting our queries into yeah your yeah awesome i will drop a link to the to
Starting point is 00:28:17 the website in the show notes so the listeners can go and find that how does it work with data generation and do you have a data generator that produces all the data necessary for a benchmark run? For us, the benchmark runs two phases. There's a workload generation phase, which we're basically preparing the database, and then the actual request generation. And we only do measurements during the request generation, though we've actually helped various databases identify things in the workload generation phase because we can run at large scale and we sort of found some weaknesses in their systems during the workload generation phase,
Starting point is 00:28:55 even though that wasn't intentional. What the workload generation does is we generate a baseline social graph on which subsequent requests can operate on. And here we basically send, create basically insert requests. And if it's like a SQL database, insert requests into the system
Starting point is 00:29:18 and we create all nodes and we pre-allocate all the edges between the nodes, but we don't write all those edges because sometimes you need to test uniqueness and other constraints. So if you like pre-write those edges, then you'll never satisfy uniqueness because the edge like already exists. And you can also warm up the cache during this phase because we basically like send out create requests and then we batch read back into the benchmark all the all the objects and edges we want to operate on so that's in memory and once that's done then you
Starting point is 00:29:52 can sort of run as you only need to do this once for all the experiments you want to run on that particular system for a particular workload so once you've done the generation then you can run multiple experiments on that same data cool so how is taubens used internally within meta before taubens the only way engineers could test different things was they had a very limited stress testing tool that wasn't realistic it was like they could produce very basic patterns basically to validate something was like able to serve traffic. There's no guarantee on that type of traffic or they could do shadow run. So they basically like try something out and then run it on existing production data.
Starting point is 00:30:37 So they're really limited, especially in being able to test new workloads, because there is no way you could test a new workload unless you change something in production but once you did that you could cause issues in production so they're like very very cautious about what they were releasing to production this slowed a lot of things down so we talk about several different use cases in the paper before we discuss sort of like looking at new international use cases looking at contention under longer lock hold times, if there's like network delays and things like that, evaluating new APIs and also quantifying the performance of high fan out transactions. But basically the general just to do, TauBench has now enabled us to test new features, optimizations and reliability, which they might, these might these might sound like standard things but at
Starting point is 00:31:25 scale it's actually quite complicated and without affecting production workloads you still want to test like realistic things so the the most powerful thing i think top bench enabled meta engineers to do is now we can sort of try out new workloads very quickly just by like changing the probability distributions and um at large scale as well and then from that you can predict errors without having to wait to like very late stage in like pre-production or um even encountering encountering these errors in production so one example is we have this like new transactional use case that was, I think, creating an object and adding an association, an edge in the graph. And we basically replicated the workload with TauBench and ran it and we found there were some contention errors. I think it had a higher object contention
Starting point is 00:32:23 than edge contention. It was still like relatively low, but we were able to sort of measure these errors. And then when they actually rolled out this use case in production, they found the same breakdown of errors. So that was really exciting because it was able to validate both
Starting point is 00:32:37 that TopBench could capture these workloads in a representative manner, but also we were able to do this testing much earlier on and sort of foresee errors before it's like months down the line and then you have to sort of figure out oh how do we fix that how do we know we fix that. Do they have is there any sort of continuous tracking of the the type of workloads like well I guess what I'm asking is has Meta's workload changed over the last way I'm guessing has changed a lot over the last 10 years but are they noticing changes is not changes now
Starting point is 00:33:12 that you can then I guess model with Taubens and say okay well look we're seeing things heading this direction if this trend continues this is kind of things we need to fix or this is a potential optimization is there anything like that you can do with it yeah definitely so one of the recent things i am supposed to do is so we kept when the workloads we have right now were captured last year when you're sort of working on the paper and on the benchmark and since then there's been a bunch of rollout of different transactional use cases so i'm hoping to like capture that the workload now and probably have very different properties than the previous one a year ago so also like capturing those snapshots
Starting point is 00:33:51 and seeing how performance changes and if there are certain issues we should address now like given the new workload um it's definitely something we can do with tail bench and it's exciting so you said earlier on as well that TauBench has been, has referenced implementations for a number of cloud native systems. I guess the question here is, what is the impact of TauBench being outside
Starting point is 00:34:16 of meta and what did, how did you go about sort of evaluating TauBench's suitability to these systems and to help those systems study performance trade-offs and identify potential uh optimization opportunities as i mentioned we currently support jarvis for five different databases um quad spanner cracker cb plant scale tidy b and you might db um and we when we were like working on the paper we basically wanted to show that our benchmark could
Starting point is 00:34:48 have interesting results on systems outside of meta so we worked with engineers at each of these companies to run the benchmarks and we were running on a cloud server so hosted servers that they could monitor so they sort of had full visibility into the results and also helped us in some cases, like tune their system to get the maximum performance we could see. And the, the, I'd say there was like two main trends. We were in, in ways we were able to help the system. One is that some systems did not benchmark as much as they should just in general. So running a benchmark,
Starting point is 00:35:30 especially a challenging benchmark on them, was able to quickly reveal a lot of performance bugs or opportunities that they were able to plug pretty easily. But also I think the unique aspect of our workload is the skew and the asymmetric skew that we capture that is not necessarily represent even something like YCSB, where if you skew it, you basically for like the default case, you just increase the reads and the writes like proportionally. For us, we have hotkeys that are hot for certain operations and not for others. And because it's very extremely skewed, we're able to help systems identify interesting trends. So I can talk about some examples of those. So for CockroachDB, they found one.
Starting point is 00:36:20 So we're running our benchmark on their Kubernetes cloud cluster. But then when they ran it themselves against their bare metal machines, they found like a 30% performance improvement. That was because something about Kubernetes was misconfigured. Oh, wow. 50% Jesus. As well as they should have. So that's something they're fixing now.
Starting point is 00:36:46 Something else I noticed is that because it's so skewed, they do have like an auto sharding strategy to try to balance out that skew. But because the default of that, like at which point do you actually auto shard is quite high. And it looks like at the overall workload rather than looking at particular hotkeys because in our workload and what we've noticed in
Starting point is 00:37:11 the social graph workload is like most keys can be pretty cold but there are a couple of ones that are like several orders of magnitude sometimes even hotter than other keys they weren't looking at individual keys so they ended up having even if it's just a couple of hotkeys on one shard, that shard got a lot more traffic. Those hotkeys should have been spread out more evenly. So now they're exploring sort of better auto-sharding strategies that are more sensitive to the particular keys, the workloads running on.
Starting point is 00:37:37 For TidyB, we have now been integrated into their daily benchmarking sort of suite so they run this benchmark every day and they're using tablents to help help them capture there's any performance regressions as their system's still under active development uh for planet scale i think they they're still building up their system so they had some like limitations in certain queries. There's like insert and select query wasn't supported, but we need this to enforce uniqueness constraints. So they're like prioritizing that.
Starting point is 00:38:17 And I think it's now available actually because of our benchmark, they added that kind of schedule. For YugoByteDB, we found a couple of issues. We helped them identify a couple of bottlenecks. One of them is when we started running our benchmark on the system, their performance was unexpectedly slow. So for all the systems except Spanner, we tried to run with the same number of cores to sort of get an estimate of like relative performance. Spanner doesn't really publicly save the number of cores we have. So it's sort of that comparison is less, less equal. But for we found
Starting point is 00:38:52 the performance was really slow for you to write DB and we worked them and actually there was a performance bottleneck in their system they hadn't been aware about. So basically, they use Postgres on the other hood and they've been using like a monitoring extension that was taking exclusive blocks. So limited like overall performance of their system. So they were able to get rid of that. And then they also were doing something weird with scans
Starting point is 00:39:19 where we have like, because we were batch reading, we had a filter, we had a limit on sort of the number of rows we were scanning reading we had a filter we had a limit on sort of the number of rows we were scanning but they were actually pulling all rows in a table into memory on every scan so that quickly um fell over and led to out of memory error so they um they are now i think they fixed that already that improved their scan performance by a lot nice Nice. So I guess, are there any situations then, though, when using Tilebench might be, like, the wrong decision for a user or for a database system?
Starting point is 00:39:51 Yeah, I would say there isn't necessarily, like, when it's wrong, but it depends on sort of what... Yeah, I should have phrased that as suboptimal, not wrong. I'd say it really depends on how you want to be stressing your system because benchmarks that are tools that we are using to better understand the performance of our system and i think tau bench can look at those aspects in a very interesting way but example we are much more read heavy than write heavy so there are certain workloads um ones that tpcc is very good at for, that are very update heavy. You have a lot of like read, modify, write,
Starting point is 00:40:27 and those can have significant performance opportunities. It's not to say that TabBench cannot capture those workloads. Actually, if you like modify those distributions yourself, and let's say like focus very heavily on like writes and write transactions, you can capture them, but they're not like the default use case, I would say, of TabBench. And that's the reason I feel like we have so many different benchmarks. It's because people found, oh, like this benchmark captures one aspect of certain workloads well, but it doesn't capture
Starting point is 00:40:54 other aspects. So let me like go and create a new one. With Taobench, we tried to be as general as possible so that hopefully if we can get interest from like other social network companies in the future as well, they can like add their workloads to our framework because it's very general but we are sort of targeting i'd say the social graph workload space cool yeah and i was gonna ask what are the limitations of tile bench but i guess maybe it's been captured by that yeah i think most answer the one thing is we currently do not have interactive read write transactions because those are not,
Starting point is 00:41:30 those requests are not present on TAO's workload, but those would be very easy to support on the benchmark if like those are added in the future or those are something that other users wanted. But yeah, that's not currently available because that's not captured in Meta's workloads. And our starting goal was to be representative of Meta's production workloads. What is the most interesting
Starting point is 00:41:56 and perhaps unexpected lesson that you have learned while working on TowerBench? Yeah, I would say that a couple of interesting things. So one is definitely the diversity of request patterns and SKU and things like that, that I observed, like looking at these road work clubs, people can use, I would say even like abuse the system in very not standard ways. Because as an academic, usually we're taught like these are the best practices and this is how you should like use babies and that's very not standard ways because as an academic usually we're taught like these are the best practices and this is how you should like use babies and that's very not true yeah so like one
Starting point is 00:42:30 one of the things we observed for example we looked at like contention specifically like write write and read write read contention because there's no blocking reads on tau currently so there's no like read write contention but we found like i think it was like 97 more than 97 percent of write write contention is intentional basically the application knows what it's doing is going to cause a conflict but continues to choose it continues to do so because it knows the system will like let's say it like sends a bunch of concurrent requests and it knows only one of them will succeed um and it can rely on the system to do that so it will sort of write this as a expensive code for the system because it knows that it's safe to do so so i think that was very surprising and i i there's been a point when i started like
Starting point is 00:43:21 researching termic and processing and being an APT which I felt this is like such a well-studied field in the 1980s but I think now looking at these different workloads and these extreme cases there's a lot a lot of different problems
Starting point is 00:43:37 that people haven't considered simply because they didn't have access to like the data and the workload pattern so now I'm really excited to leverage Tau tau bench to sort of hopefully inspire new research problems so yeah i guess from the progress in research is very non-linear right there's lots of ups and downs so from the initial um conception of the idea for
Starting point is 00:43:59 tau bench to the final publication and were the things that you tried in that process and kind of dead ends that you went down so things that you failed that maybe the listener might find interesting I think one thing is like figuring out that set of parameters took a lot of trial and error we sort of was starting out we had two approaches that we could tackle from one is um we start sort of with a list we start simple we start with things that we know would probably make sense to include and then build it up from there and go travel air or the other thing was like we come up with an exhaustive list of features and like use an ml model and train it and see if it will fit um we decided to go with the first one because it was like more understandable and explainable and um yeah i think even in some cases the ml model doesn't always capture especially like extreme skew and things like that it might not always be accurate which is why we but the first
Starting point is 00:44:57 approach required us to like manually sort of try out new features build it out and see what worked so we did actually look at like temporal patterns and some things we discussed earlier, and we thought that would make a difference, but it didn't actually impact the access distribution, only the volume. So that's something we decided to put off for the paper. And then, yeah, I think that was definitely
Starting point is 00:45:20 a lot of trial and error. And then also when we tried to get taubens running on other systems for them at that point the code was already pretty mature because we'd run it internally but there were still some kinks we had to iron out and as we like worked with the database engineers from the various companies they also gave us feedback they're like oh this isn't as clear as it could be so we also ended up making a lot of they they're more like um not modification modifications to the core logic but sort of the to make the code easier to use there are a lot of changes there and also running the benchmark on each of those systems was
Starting point is 00:45:58 definitely a process because even though the code development didn't take that long it took many iterations for almost all the systems to figure out if we were actually getting representative performance, because the first run was like never good. So then you have to keep iterating and seeing, and then you get to a place where you feel it's pretty reasonable, but we wanted to make sure we captured as representative results as possible. So there's a lot of back and forth, which with each of the five databases. So it was definitely a lot of back and forth which with each of the five so it was definitely but as
Starting point is 00:46:28 you'll see on the paper there's a lot of authors I had fortunately a lot of undergraduates at Berkeley who are very passionate about research and they helped a lot with that sort of development process and so what's next for Talbots and they've spoke about it briefly in across the across the course of the interview but what are the highlight things where what's the initial next steps and the big picture goal for it yeah i think we're open source now which is exciting we like jump through all the legal hurdles and stuff um we're definitely trying to do more pr for it and get more people to use it it's exciting that like the databases we've worked with have already like seen value in the benchmark and are starting to integrate it into their
Starting point is 00:47:09 systems and make it like, like actively use it as a part of their development process. And we're hoping to like build out the community there. I think another aspect is like what workloads are available. So as I mentioned, the workloads we have currently are from like 2021 and we're hoping to like do a refresh sort of on those release more workloads and as a part of this um because i've seen different patterns both in terms of like the skew and also like the long tail sometimes you have like very large transaction use cases i feel there's a lot of interesting research questions relating to
Starting point is 00:47:45 concurrency control anyway i'm excited to pursue those going forward can you briefly tell the listeners about your other research about things like the ramp ramp tower yeah yeah yeah happy to so i'd say yeah broadly my research is focused on like transaction processing with with some focus i would say like large scale systems so the of the previous project to this top bench that i worked on at meta was called ramped out this was ensuring a read atomicity for the large scale system so basically how we can add a stronger guarantee without impacting current applications and with like low performance and storage overheads. This, I have a paper and a talk available on this on my website. I think they're both linked on my website. And then I also have another project that we're
Starting point is 00:48:40 in the process of submitting right now on looking at how transactions and caching interact as well. And that project actually uses TauBench workloads. But yeah, we're hopeful that with TauBench, we can hopefully inspire some of my own research, but also help other both people in academia and industry highlight sort of new problems and ways to address them but I think the goal of the benchmark at the end of the day is to sort of help you understand your system and where what its limitations are so that you can improve it if you could kind of say um something about how you approach idea generation and selecting new projects and what your process is for doing that how do you choose what to work on um i'd say as a current phd student i'm definitely still figuring that out yeah but i would say what's worked really well for me and what um what that's what i prefer is
Starting point is 00:49:39 i think being inspired by real world data and problems is great because you have sort of the grounding that this is a problem to at least like some real users. Not necessarily to say that should always be the case or you should only look at the problems that industry often sees, but sort of looking at that to understand the general trend of why current systems cannot perform to that degree. And then using that to my goals, like to try to understand that problem. And then once I have a sense of that, to think about different ways to solve it. And I think being an academic gives me the freedom to sort of think outside of the box of, oh, we need to like implement this right away. But this may be would be like hard, harder to implement a short amount of time but it's a good like long-term solution so um yeah i think for me like looking at the real world aspect and just inspire you but not necessarily
Starting point is 00:50:34 limit you has worked well so what do you think is the biggest challenge in your recent in transaction processing now what do you think is the biggest challenge facing facing us yeah i think that um while there's been a lot of academic work on sort of how we can deal with certain like high contention cases or things like that a lot of those you find are not applied in practice like on tau for example they use two-phase locking and two-phase commit which has been around for decades exactly like there have been so there were hundreds and thousands of papers on like things that perform better theoretically than two-phase locking and two-phase commit and i think one of the reasons they're not applied is because 2pl and 2pc are like so simple and understandable. So I think there's a need for better performance, but also keeping in mind the simplicity
Starting point is 00:51:29 and what's actually feasible to implement in practice. So last question now, what is the one key thing you want listeners to take away from this episode and from your research? I'd say the key thing is that with this new benchmark cow bench um which has had which captures meta's production and social graph workloads we've been able to have impact both within meta and outside of meta and it provides sort of a stepping stone simply a lot more exciting research on transactions and concurrency control and it provides sort of a stepping stone simply a lot more exciting
Starting point is 00:52:05 research on transactions and concurrency control and i would encourage you to check out adventure brilliant well let's end it there thank you so much audrey if the listeners are interested to know more about audrey's work then we'll put all the links to everything in the show notes and we will see you next time for some more awesome computer science research
