Disseminate: The Computer Science Research Podcast - Subhadeep Sarkar | Log-structured Merge Trees | #32
Episode Date: May 11, 2023. Summary: Log-structured merge (LSM) trees have emerged as one of the most commonly used storage-based data structures in modern data systems, as they offer high throughput for writes and good utilization of storage space. In this episode, Subhadeep Sarkar presents the fundamental principles of the LSM paradigm. He tells us about recent research on improving write performance and the various optimization techniques and hybrid designs adopted by LSM engines to accelerate reads. Tune in to find out more! Links: Personal website, ICDE'23 tutorial, LinkedIn. Hosted on Acast. See acast.com/privacy for more information.
Transcript
Welcome to Disseminate, the computer science research Podcast. I'm your host, Jack Wardby.
Today, I'm joined by Subhadeep Sarkar,
who will be telling us everything we need to know
about log-structured merge trees.
Subhadeep is going to be, I should say, in July,
an Assistant Professor of Computer Science
at Brandeis University.
So, Subhadeep, welcome to the show
and congratulations on your new appointment.
So, thank you, thank you.
I love being here.
It's great of you to invite me.
And I'm very excited to start my new life at Brandeis.
And this is essentially the first podcast that I'm doing as an assistant professor.
So I'm super excited.
Brilliant stuff.
Brilliant stuff.
Cool. So
obviously I've given you a very brief introduction there but maybe you can tell the listener
a little bit more about yourself and how you became interested in databases, data management
and all the cool things that we all love. Of course. So I did my PhD on something that is
completely different; it was more theoretical research in computer science and
cloud computing. But as I was doing so, I realized that I wanted to do more systems stuff, both research-wise and development-wise. So after my PhD, I moved to INRIA in France, and it is there that I was
exposed to large-scale systems and the multitude of data management challenges that you face when you
operate on data systems and operate on large scales of data.
So instantaneously, I was hooked.
So specifically, my journey started with the problem on how we can delete data efficiently
from large scale data stores.
So if I remember correctly, GDPR was about to be enforced
at that time, and it was really in the interests
of many companies that they want to delete user data
in a scalable manner.
And it was fascinating for me how a seemingly simple
task of data deletion can become so tricky
when we operate at scale. So this is what made me
really interested in data management research. And then after my time at INRIA, I moved to
Boston University, spent about four and a half years working on the same problem and some
other problems as well, and now I am truly in love with this new domain. Amazing. A great origin story of how you became hooked on this really cool research area.
So yeah, so today we're going to be talking about log-structured merge trees.
So maybe for the uninitiated, you can kind of start off by explaining what the hell this really cool data structure is.
Well, truth be told, I had not even heard about LSM trees five years back.
But the truth of the story is, today, LSM trees are sitting at the heart of many data systems,
relational and non-relational alike. So if I have to take a few names, we have all heard about
RocksDB at Meta, Bigtable and LevelDB at Google, X-Engine at Alibaba, Cassandra and HBase at Apache.
All these big databases, which are primarily used as key-value data stores, have LSM trees as
the underlying data structure. So what is an LSM tree? Essentially, it is yet another data
structure. It is a highly write-optimized data structure.
It is a data structure that was built to support data when the data does not fit in memory and the data is storage resident.
So with the exponential data growth that we have seen over the past couple of decades, where data really does not fit in memory anymore, we have seen more and more systems adopt the LSM paradigm. So several systems
have either moved away from B plus trees to have LSM-based indexes, or they have extended
support to LSMs while retaining their B plus tree implementation. So LSM trees, in a single
sentence: it is a write-optimized, storage-based data structure that can also help us index data.
Amazing. So you touched on it a little bit there about the history of LSM trees,
but maybe you can dive into this a little bit more.
When was this data structure first proposed?
And you mentioned a few of the systems there that are using it.
So maybe we can talk a little bit more about why they chose to use that data structure as well.
Awesome.
This is an interesting question. So LSM trees were first proposed by Patrick O'Neil and his colleagues
here in Boston, in 1996.
And for the next 10 years, there were literally only a couple of papers
that talked about this data structure.
No one knew about this data structure.
No one really cared about them.
There was not a single database that used LSMs underneath.
So in 2006, when the big data boom was just starting up,
Google wanted to build a new data system that would be based on the key value paradigm.
And they wanted to use commodity hardware
to build these large-scale data systems. So they kind of moved away from the array-of-disks
paradigm that we had seen in the 80s. And when they wanted to move to commodity hardware,
LSM trees were really a data structure that gained their attention.
So it was, as I mentioned, storage-based.
So essentially, it does not really care whether the data fits in memory or not.
Rather, it is super suitable when the data is large in size.
And second, it is highly write-optimized.
So in an era of data boom, where you have a lot of data and you want to ingest the data into your data store really fast, LSM trees were really the data structure of choice.
And in the following years, starting from 2007, several companies understood the utility of LSM trees as a storage-layer data structure. And as I mentioned, a bunch of NoSQL data stores adopted LSM trees to
be their data structure of interest.
Not only that, there are several relational databases, like MyRocks at Meta and SQLite, which
have LSM trees running underneath.
There are a few time series databases, such as InfluxDB and QuasarDB, which are also based
on LSM trees. So this is how LSM trees have gone
from not being a data structure of interest to a data structure of
super interest in less than three decades.
Yeah, that's amazing. I mean, it's such a rarity, because the mid-90s is really recent in database history.
We always have this joke with a lot of my guests that everything was already done in System R
and all the cool ideas are taken, so it's really interesting to see
something that arrived quite late on the scene and is quite revolutionary
as well, and that it's been used by so many data systems today. So yeah, that's fantastic.
Let's dive into these in a little bit more depth
then. So I know you've done several tutorials on LSM trees over the last couple of years at
SIGMOD and ICDE, and you always kind of start off by talking about the key operating principles
of LSM-based storage engines. So could you maybe tell us what these key operating principles are?
Absolutely, absolutely. That's a great question. And another way to think about this question is: how are LSM trees fundamentally different from classical index structures,
say B plus trees? So essentially, if we can compare and contrast these two data structures
side by side, we might be able to understand
the key operating principles, as you mentioned, of LSM trees.
The first thing that comes to my mind is that an LSM tree is an ingestion-friendly data structure,
and it has an in-memory component that we often refer to as the LSM buffer.
So as incoming data arrives,
we do not really move it to the storage aggressively.
So if you think about a B plus tree,
if you have an insert operation,
you will have to traverse through the root,
move to the leaf node,
access all the intermediate pages in between,
and then write your data to the leaf node.
In LSM trees, we try to amortize the cost of ingestion.
So instead of paying a logarithmic cost,
we buffer a bunch of inserts in memory,
with no storage access whatsoever.
And then when the in-memory buffer is full,
we lazily flush it to the storage.
So this is how we really achieve a superior ingestion
performance in LSM trees.
The second thing that comes into mind is
the out-of-place update and delete paradigm that LSM trees follow.
So again, going back to B plus trees,
if you want to update or delete an entry,
it is in place.
What do I mean by in place?
You will literally have to start from the root,
go to the last leaf where the data is contained,
then you have to rewrite the data or delete the data and do all the necessary actions required.
So you pay a very high cost here as well because you are doing a lookup to find the key
and then modifying the key in place. So, LSM trees do not follow this paradigm at all. So, if you have to update
an entry, you simply insert the new version of it. So, you do not even look for or touch the
existing entry. The only thing you have to make sure is that whenever you have a query,
the query accesses the recent version first before it accesses the older version. And once you access
the recent version, your query terminates.
So there is no way you will be returning an invalid entry.
Deletes happen the same way.
So you do not go in place and delete an entry.
Rather, you insert something that we call a tombstone,
which logically invalidates the entry.
So essentially, these are the two fundamental ways
how LSM trees are different from
B plus trees.
And one other thing that comes to my mind: if you think of a B plus tree, as it supports
in-place updates, the leaf nodes are almost always not completely full.
So in a stable situation, they are probably 67% full.
LSM trees, on the other hand, operate on the notion of immutable files. So what
does that mean? It means that once you move something from the memory to the storage,
you can never go and make in-place edits in the file. This allows LSM trees to achieve superior
space efficiency. So they can really use the device space, the storage space, in a very efficient way.
So fundamentally, these are the three axes that come to my mind for how LSMs are different
from B plus trees.
There are, of course, several similarities.
So both of them facilitate indexing.
In a B plus tree, the data is probably in the last level and all the intermediate levels
are used for indexing. LSM trees in that sense are probably closer to B-trees or B-
epsilon trees, because there is an indexing structure that we have, but the
data and the key are always together even in the intermediate levels of the
tree. Another similarity is that as you know,
what we call the B plus tree fanout
essentially becomes the size ratio in an LSM tree.
So every level has an exponentially larger capacity.
This way we can really limit the number of
levels that we end up having in an LSM tree.
Finally, just as we have node splits and merges in a B plus tree, which I refer to as data reorganization, we have something called compactions in LSM trees.
So every time a level size goes beyond the threshold, we move some part of the data from that level and merge it or compact it with the overlapping part of the next level. So, yeah, essentially it is, once again,
it's still an index structure, but it has certain similarities and certain
differences when it comes to classical indexes.
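To put rough numbers on that size ratio idea, here is a tiny back-of-envelope sketch in Python. It simply grows each level's capacity by the size ratio until the data fits; the buffer size, size ratio, and data size are illustrative values, not figures from the episode.

```python
def level_capacities(buffer_entries, size_ratio, n_entries):
    """Capacities grow exponentially with the size ratio, so the number of levels stays small."""
    levels = []
    cap = buffer_entries
    while sum(levels) < n_entries:
        cap *= size_ratio          # each level is size_ratio times larger than the previous one
        levels.append(cap)
    return levels

# e.g. a one-million-entry buffer and size ratio 10 can hold a billion entries in just 3 levels
print(level_capacities(1_000_000, 10, 1_000_000_000))   # [10000000, 100000000, 1000000000]
```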
I love the positioning against a B-tree, because
having that as sort of a yardstick to measure against and see
how it differs is really, really nice. It's a really nice way to kind of compare these two data structures.
So that's great.
So let's talk about this some more then.
So you break down the operations we can then do on log-structured merge trees
into internal operations like flushing and compaction
and some of the external operations like updating and reading
and stuff like that.
So can you maybe, first of all, work us through how the
internal operations like flushing and compaction actually work in practice? Of course, of course. Yeah, so internal operations are operations that are really
not triggered from the user side. I mean, they are probably triggered, but not explicitly,
right? So an internal operation is, when the LSM buffer gets full, we need to move all
the data from the buffer, which is in memory, to the storage, right? So this is the operation that we
often refer to as a flush in LSM terminology, right? So a flush is nothing but: after the buffer
is full, we sort all the entries in the buffer based on the key. So we have a sorted component.
And then we write all of them to
the slower storage as an immutable run.
So flush is essentially moving data from the memory
to the storage in the form of immutable sorted runs.
And the second thing that you mentioned is compaction.
I briefly touched upon this before.
Compaction is, like in any index data structure,
as you have more and more data, you have to reorganize your index structure. In B plus trees,
we do this by splitting the nodes or merging the nodes based on the workload that we have. In LSM trees,
we do the same. We just call them compactions. So for a compaction: every level, as you can imagine, has a capacity which is exponentially growing as we are moving
down in the tree but once the size of a level goes beyond the capacity we can do
two things. One, we can either move all the data from that level and sort merge
it with the data from the next level. This sort merge operation is essentially the compaction.
Or two, what we can do is that we can just move
some part of the data from the level that is full
and merge it with the overlapping part
in terms of the key ranges from the next level.
So these are classically the two ways
how we implement compactions.
But as you can imagine, both flush and compactions are something that LSM engines have to do in order to make sure that they retain the tree structure.
So this is why we call them internal operations.
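To make the flush and compaction mechanics concrete, here is a minimal, illustrative Python sketch of a leveled LSM tree: the buffer is sorted and flushed as a run, and when a level exceeds its exponentially growing capacity it is sort-merged into the next level. This is not the code of any real engine; the class name, capacities, and size ratio are made up for the example.

```python
def merge_runs(newer, older):
    """Sort-merge two sorted runs (lists of (key, value)); newer entries win on key collisions."""
    merged, i, j = [], 0, 0
    while i < len(newer) and j < len(older):
        if newer[i][0] < older[j][0]:
            merged.append(newer[i]); i += 1
        elif newer[i][0] > older[j][0]:
            merged.append(older[j]); j += 1
        else:                                              # same key: keep only the newer version
            merged.append(newer[i]); i += 1; j += 1
    return merged + newer[i:] + older[j:]

class LeveledLSM:
    """Buffer plus levels with exponentially growing capacities; one sorted run per level."""
    def __init__(self, buffer_capacity=4, size_ratio=2, num_levels=4):
        self.buffer, self.cap, self.T = {}, buffer_capacity, size_ratio
        self.levels = [[] for _ in range(num_levels)]      # each level holds one sorted run

    def put(self, key, value):
        self.buffer[key] = value                           # in-memory, no storage access
        if len(self.buffer) >= self.cap:
            self._flush()

    def _flush(self):
        run = sorted(self.buffer.items())                  # sort the buffer on the key
        self.buffer = {}
        self.levels[0] = merge_runs(run, self.levels[0])   # leveled: eager merge on every flush
        self._maybe_compact()

    def _maybe_compact(self):
        for i in range(len(self.levels) - 1):
            capacity = self.cap * (self.T ** (i + 1))      # level i holds cap * T^(i+1) entries
            if len(self.levels[i]) > capacity:             # level overflows: compact into next level
                self.levels[i + 1] = merge_runs(self.levels[i], self.levels[i + 1])
                self.levels[i] = []

lsm = LeveledLSM()
for i in range(40):
    lsm.put(i % 16, f"v{i}")                               # updates are simply re-inserted, not edited
print([len(run) for run in lsm.levels])                    # occupancy of each level after the workload
```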
I see. I see. Cool.
So let's then flip this over and look at the other side, from the user level and the external operations, the gets and the puts and the scans.
So how do these things work, and how do they then trigger these internal
operations?
Absolutely.
So as the name suggests, external operations are really what we see in
the workload, right?
So an LSM tree, as I mentioned before, is a key-value storage engine,
right?
So every entry that you can see, it has a key based on which we organize the entries
in the tree and all the other attributes, whatever you have, they are clubbed together and they
constitute the value field. So we have these key-value entries that come into an LSM tree.
And we support the different operations through what we call a put-get API. So what is a put?
Put is nothing but an insert. So we call it put because in LSM trees,
we do not really differentiate updates from inserts.
So we do not really care whether the entry already
exists in the database and the new entry is an update,
or this is a completely new entry.
We simply use the put API to make sure
that we are writing the new entry to this memory
buffer.
And when the memory buffer is full, the entry will be sorted and flushed to the storage
eventually.
But the put-get API is very simple.
We have to make sure that whenever we have a get, we never access older data before we
access the newer data. This is the invariant, as we call it in
LSM trees, that really allows us to make sure that a get operation,
which is essentially a query operation,
a point query, will never return something that is invalid.
This is how the put-get API works.
Apart from this, LSM trees also support range scans.
Range scans are essentially range queries where you want
to fetch all entries between two given keys,
let's say K1 and K2.
In a leveled LSM tree,
I'll talk more about what's leveling and what's
tiering later on.
But the idea is that you have to really figure out all the sorted components
that qualify for a range scan operation, a range query, and then, because there can be overlaps of
the keys, there can be the same keys appearing as different versions across several runs, you have to
merge the content of all these qualifying sorted runs.
While you do the merge,
you make sure that you are only retaining or
returning the latest entry for every key.
This is how LSM trees, through the process of
sort merging, respond to range queries.
Finally, we talked a little bit about deletes.
Deletes, as I mentioned,
are also a put operation,
but instead of pushing or putting in a completely new entry,
we do put in a special entry that we refer to as tombstone.
A tombstone is nothing but, well, it's still a key-value pair.
It has the key for the entry that
you want to delete, but the value field is typically one byte long, which just says that this is a
tombstone. So that's it. The tombstone flag is what we call it, and this is what helps us
differentiate the tombstone from a general key-value entry. So the whole put, get, delete, range scan API in LSM trees is very simple, in fact.
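Here is a small, self-contained Python sketch of that put/get/delete/range-scan behaviour: updates and deletes are just new puts (a delete puts a tombstone), and reads probe newer components before older ones, stopping at the first hit. It is purely illustrative; the dict-based "runs" stand in for sorted runs on storage.

```python
TOMBSTONE = object()               # sentinel marking a logical delete

class TinyLSM:
    """Newest-first lookups over an in-memory buffer plus immutable runs on 'storage'."""
    def __init__(self, buffer_capacity=4):
        self.buffer = {}           # in-memory buffer (newest data)
        self.runs = []             # list of dicts, newest first; stand-ins for sorted runs
        self.capacity = buffer_capacity

    def put(self, key, value):     # inserts and updates look identical: out-of-place writes
        self.buffer[key] = value
        if len(self.buffer) >= self.capacity:
            self.runs.insert(0, dict(sorted(self.buffer.items())))   # flush as an immutable run
            self.buffer = {}

    def delete(self, key):         # a delete is just a put of a tombstone
        self.put(key, TOMBSTONE)

    def get(self, key):            # probe newer components before older ones, stop at first hit
        for component in [self.buffer] + self.runs:
            if key in component:
                value = component[key]
                return None if value is TOMBSTONE else value
        return None

    def range_scan(self, lo, hi):  # merge all components, keeping only the newest version per key
        result = {}
        for component in [self.buffer] + self.runs:
            for k, v in component.items():
                if lo <= k <= hi and k not in result:
                    result[k] = v
        return sorted((k, v) for k, v in result.items() if v is not TOMBSTONE)

db = TinyLSM()
for i in range(10):
    db.put(i, f"v{i}")
db.put(3, "v3-new")                # update: the old version of key 3 is never touched
db.delete(5)                       # delete: key 5 is masked by a tombstone
print(db.get(3), db.get(5))        # v3-new None
print(db.range_scan(2, 6))         # [(2, 'v2'), (3, 'v3-new'), (4, 'v4'), (6, 'v6')]
```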
Is there a way to have additional, I mean, I don't know what these would be.
I mean, it's like one of the selling points of this data structure
that it's very simple, right, a simple API.
But are there any other possible types of operations you can kind of define
on an LSM tree that would be useful?
Absolutely.
So as I talked about, this put-get API is simple.
It works.
But whenever it comes to complex operations like range queries,
LSM trees do not perform as well as B plus trees do.
Because if you think about a B plus tree, everything is sorted in the leaves.
You can literally do a scan and be done with it.
But in an LSM tree, you will have to do a merge operation.
It's quite expensive. It's a two-way merge that will happen in general.
If you want to delete, for example, a range of values,
which we call range deletes, it is super tricky in LSM trees
because now you have to think about how you are going to delete all the values
because you do not really know how many valid entries are there within a given
range. So there is some work on range deletes in LSMs. We are also working on
it in one of our projects. The put-get API is very good when you have a very
simple workload that has a bunch of point lookups and a bunch of inserts or updates.
But when you have more complex workloads, the API does not really allow us to be more flexible.
Rather, we have to write our own API, own routines that can facilitate that.
Cool. I mean, I guess it's all about trade-offs, right?
There's a trade-off you've got to make somewhere, right?
So I guess one of the benefits is having this really nice,
simple data structure that's really good at ingestion and things like this.
So, I mean, I guess this is part of the trade-off, right?
But anyway, cool.
So let's talk more about writes then,
because obviously this is kind of one of the big selling points
of this data structure and probably one of the main determinants
why it's become really popular.
Like you said earlier on, the advent of big data, et cetera, et cetera.
We have these large volumes of data to deal with.
So can you maybe give the listeners an overview of the recent research on how
we've tried to improve ingestion performance in log-structured merge-based
storage engines over the past sort of, I don't know, five to 10 years?
Absolutely, absolutely. And you correctly point out that
LSMs are designed to facilitate ingestion in a better way.
But that does not mean that we cannot make it better. So we have been trying
putting in our hours to make LSMs
even better for ingestion. So the two primary
things that come to my mind
is the implementation of the memory buffer, right?
So I can give you an example
and it might help you understand
why the implementation of memory buffer
can even become a performance bottleneck.
So if I have a bunch of inserts coming,
I will simply append them to my buffer.
And this is super fast.
This is like literally
big O of one, the time you take to perform this operation. But now you have a point query coming
in. And as I mentioned, you will have to search the newer data first. So you have to look into
the buffer if the entry is there. And because you simply appended, now your simple point query,
in order to see whether the entry is there in the buffer or not,
you have to scan it.
So you end up paying a big O of n cost, where n
is the number of entries in the buffer.
And your whole workload suddenly slows down.
What was constant now becomes linear.
But if you know that your workload does not
have any lookup operations at all,
I would rather
have a vector implementation where I will simply keep on appending my entries in the
buffer.
It will be super fast for ingestion.
But now if I have a mixed workload where I know that there will be a bunch of inserts
which are interleaved with a bunch of point queries, I have to rethink my design of the
buffer.
A skip list based buffer, for example, might be a better choice.
You can even have a hash map based buffer.
So the implementation of the buffer,
we have seen in some of our experiments,
depending on the workload, there can
be a difference of performance of about two orders
of magnitude.
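A toy Python illustration of that buffer trade-off: an append-only buffer gives constant-time inserts but linear-time point lookups, while a sorted buffer (a stand-in here for a skip list, using the standard bisect module) gives logarithmic lookups at the price of keeping entries ordered on every insert. The class names are invented for the example.

```python
from bisect import bisect_left

class VectorBuffer:
    """Append-only buffer: O(1) inserts, but a point lookup has to scan all n entries."""
    def __init__(self):
        self.entries = []
    def put(self, key, value):
        self.entries.append((key, value))
    def get(self, key):
        for k, v in reversed(self.entries):    # newest occurrence of the key wins
            if k == key:
                return v
        return None

class SortedBuffer:
    """Keeps entries sorted on insert (stand-in for a skip-list memtable): O(log n) lookups."""
    def __init__(self):
        self.keys, self.values = [], []
    def put(self, key, value):
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value             # overwrite the version already in the buffer
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)
    def get(self, key):
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

vb, sb = VectorBuffer(), SortedBuffer()
for k in (5, 1, 3, 1):
    vb.put(k, f"v{k}"); sb.put(k, f"v{k}")
print(vb.get(1), sb.get(1))                    # v1 v1
```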
The buffer size also really matters.
So if you have a very small buffer size,
it will be full sooner,
you will be doing a lot of flush operations.
So your workloads will see smaller stalls,
but frequent stalls.
A larger buffer size will hold a lot of entries,
which is good because some of the entries can be,
some point lookups may be served from the buffer itself.
But at the same time, when you are moving all the entries
in the buffer to the storage, there will be a latency spike.
So there are all these trade-offs that we talked about.
Another thing that is huge in LSMs
is how we are reorganizing or compacting
the data on the storage.
So the buffer is really the smallest component of an
LSM tree. Most of the data, like typically 99% of the data in an LSM
tree, is storage resident. And it really, really matters how we are storing the
data or writing the data on the device itself. So how frequently we are
compacting data, whether we are compacting data eagerly or lazily,
these design choices really affect the LSM performance. So classically, I briefly touched
upon this before, there are two LSM designs. One is what we call a leveled LSM design, and the other
is a tiered LSM design. In a leveled LSM design, every time you flush a buffer,
you merge eagerly all the entries that you just flushed
with all the entries that were there
in the first level of the tree.
And this way, you make sure that in every level of the tree,
you always have only one sorted component.
Having one sorted component per level
is actually superb when it comes to point query performance.
I will talk about it later on a bit as well.
But because you are merging every time you
are performing a flush, the ingestion performance
is not super good.
It is good, but it is not super good.
On the other side of the extreme, you have the tiered LSM design.
So every time you flush the buffer, you do not do any sort merge.
You just write the newly inserted buffer next to whatever was existing in the first level,
and you are done.
So this way, your flush does not require a sort merge, so you have a very good ingestion performance. But when you
are performing a lookup, you will have to check all these sorted components within
a level. So tiering is great when it comes to ingestion, but when it
comes to query performance, leveling is more optimized. And the LSM is a very flexible data structure. You
can have a bunch of intermediate hybrid designs between the leveling and tiering extremes. You
can have the first few levels implemented as tiered, the larger levels below the tree are
leveled. So it really, you can play around with the best design that you might want
to have, based on the workload that you have and the performance target that you have set.
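As a rough back-of-envelope comparison of the two extremes, the usual analysis goes like this: with L levels and size ratio T, leveling keeps one run per level, so a point lookup without any filters probes about L runs, but every entry gets re-merged roughly T times per level; tiering keeps up to T runs per level, so about L times T runs to probe, but each entry is written only about once per level. A small illustrative calculation in Python, with made-up parameter values:

```python
import math

def lsm_costs(n_entries, buffer_entries, size_ratio):
    """Back-of-envelope run counts and write amplification for leveling vs. tiering."""
    T = size_ratio
    L = math.ceil(math.log(n_entries / buffer_entries, T))   # approximate number of levels
    return {
        "levels": L,
        "runs_probed_leveled": L,        # one sorted run per level
        "runs_probed_tiered": L * T,     # up to T runs per level
        "write_amp_leveled": L * T,      # each entry re-merged roughly T times per level
        "write_amp_tiered": L,           # each entry written roughly once per level
    }

print(lsm_costs(n_entries=1_000_000_000, buffer_entries=1_000_000, size_ratio=10))
# roughly: 3 levels; leveling probes ~3 runs but rewrites each entry ~30 times,
# while tiering probes ~30 runs but rewrites each entry only ~3 times
```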
Yeah, you preempted my question there. I was just going to say, is there any way we can sort of mix and match these
two different designs and kind of explore the space between the two? But yeah, that answers that.
So, I guess going a level further with that,
rather than making that decision of,
I'm going to kind of have this hybrid structure this way,
has there been research on that?
Maybe we might touch on this later on, I might be getting ahead of myself,
about adaptively changing the data structure
and the balance between these two extremes of leveled and tiered
based on the workload at runtime.
Is that possible or does it have to be a static decision up front?
Great, great, great question.
So this is what we have been trying to do for the last five,
10 years, I guess, right?
So the idea is that exactly what you mentioned.
So if I know something about my workload,
if my workload changes on the fly, can I switch from, let's say,
in a very simple case, from a leveled LSM
design to a tiered LSM design?
Because suddenly I do not have any point queries in my workload anymore, so I really just want
to make ingestion faster.
So the answer is yes and no.
So we have been doing a lot of modeling and a lot of analysis about what design would be optimized or optimal for a given workload.
So we kind of have an idea about if I have certain knowledge about the workload, what would be my
near-optimal, let's say, LSM tree design. So this part is good, and this part is the yes.
The no part comes from the fact that there is really no LSM design in practice that can
on the fly change the data layout, that is on the fly
change a leveled LSM to a tiered LSM. And this is
because the implementation is tricky.
So the transformation of one LSM design
to another really depends on how you are compacting data.
And compaction, until one of our works from two years back, was literally a black box in LSM design space.
So people knew that compactions go on. There are a bunch of compaction routines that LSM trees have underneath.
But no one really knew what the
different design choices are, so it was increasingly hard for people to understand
how they can really change the compaction routine based on the workload and the kind of performance
that they are looking at. So this is essentially one of the natural next steps for us in the community.
I'm sure in the next few years, we'll have a few endeavors that will attempt to do this.
But to date, we still do not have an LSM engine that can transform on the fly.
Cool. So we've been speaking about the write half of things so far.
So let's switch back over to reads.
So let's talk about the techniques and approaches we have for optimizing things like
point queries and range lookups in LSM trees, because it feels like so far we've kind
of done everything to make the write path a lot faster. How can we then go about sort of
pulling that back a little bit and getting ourselves some good read performance as well? Yeah, absolutely.
Well, as you know, there is no free lunch, right? So if you have a super write-
optimized data structure, you will take a hit when it comes to reads.
And LSM trees are no exception. So if we do not have
any auxiliary data structures that can help us for
queries,
reads are quite expensive in LSM trees.
So if you think about the point query,
I know that it might be in one of the LSM levels.
So if I have one sorted run per level, which
is the best design for reads, I still
have to do a binary search to see whether the entry is there
in level one.
If it is not in level one, it can be in level two.
So once again, I do another binary search.
So I end up doing L times log N I/Os, right?
And I/Os, as we know, are super expensive.
So L is essentially the number of levels
that I'm talking about, and N is the data size that is on the storage. So it will be super, super bad. And
this is the most read-optimized LSM variant we are talking about. For a tiered
LSM, depending on how many tiers we have in every level, it can be even more
expensive. So how do state-of-the-art LSM engines work?
Well, they trade more memory to improve reads.
So what they do is that, because memory is cheap nowadays,
they have what we call fence pointers in memory.
So fence pointers retain the maximum and minimum key for every storage page. So you really know whether or not
you should be doing an I/O if you have the fence pointers in memory. So this way you can really
limit the number of I/Os that you do for every level to one, because you do not have to do an
expensive binary search on the data on the disk anymore. You do the binary search on the fence pointers, which are in
memory, and the fence pointer tells you which file may contain the entry, and
then you go there. So yes, it is a huge win. We come down from L times log N to L.
But again, if I have to do L I/Os for one point lookup, it is still super expensive.
What we do is that we take more memory
to improve the performance further.
So I use this memory to have bloom filters for all
the entries in the LSM tree.
And now, before I probe a particular level,
I consult the bloom filter, which is, of course, in memory.
And it tells me with a very high probability if the entry is there or not.
So this is how we really cut down on the cost of point lookups.
But again, nothing is free here.
The cost that we pay is in terms of memory.
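Here is a small illustrative Python sketch of those two in-memory helpers for a single sorted run: fence pointers (the min and max key of every page) pick at most one page to read, and a toy bloom filter lets a point lookup skip the run entirely when the key is almost certainly not there. The page size, filter parameters, and class names are all invented for the example.

```python
import hashlib
from bisect import bisect_right

PAGE_SIZE = 4                       # entries per "page"; tiny on purpose, just for illustration

class BloomFilter:
    """Toy bloom filter: k hash probes into an m-bit array; false positives possible, false negatives not."""
    def __init__(self, m_bits=1024, k=3):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)
    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m
    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)
    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

class SortedRun:
    """One immutable sorted run, split into pages, with fence pointers and a bloom filter."""
    def __init__(self, entries):                    # entries: sorted list of (key, value)
        self.pages = [entries[i:i + PAGE_SIZE] for i in range(0, len(entries), PAGE_SIZE)]
        self.fences = [(p[0][0], p[-1][0]) for p in self.pages]   # (min_key, max_key) per page
        self.bloom = BloomFilter()
        for k, _ in entries:
            self.bloom.add(k)
    def get(self, key):
        if not self.bloom.might_contain(key):       # skip the whole run with no "I/O" at all
            return None
        idx = bisect_right([mn for mn, _ in self.fences], key) - 1   # binary search on fences
        if idx < 0 or key > self.fences[idx][1]:
            return None
        for k, v in self.pages[idx]:                # the single page "I/O"
            if k == key:
                return v
        return None

run = SortedRun([(i, f"v{i}") for i in range(0, 100, 2)])
print(run.get(42), run.get(43))                     # v42 None
```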
Cool.
So something that jumped out to me there whilst you were talking about that,
and we've not really touched on hardware
so far, other than the fact that we talked about how we can kind of improve things by giving
ourselves some more memory and stuff. But has there been any research on exploiting sort of modern
hardware with respect to LSM trees and LSM-based storage engines? Like, are there certain
types of hardware that are more suitable for the characteristics of LSM trees? Great question again. So there have been several ongoing efforts that
are trying to answer this exact question. So the LSM operating principle, which is
out-of-place, as I often refer to it, which is you do not really update things in place or delete things in place, is very
similar to how modern SSDs work. So if you think about SSD operating principles, you have small
blocks called erase blocks in the device, and you could write to these erase blocks, but if you want
to update something, you cannot really go in place and update it.
So you have this garbage collection routine that has to kick off.
But before that, you write the new entry, the updated entry in a different erase block.
And LSM trees are very much similar.
So you have to update an entry.
You do not go in place and change it.
You just insert a new entry, which is somewhere up top in the tree. So these two operating principles are very
similar, and one of the key lines of research that many groups, and even we,
are focusing on at this point is how we can take advantage of these
device principles, like garbage collection and writing in an out-of-place manner, and make use of
that in LSM itself so that we can reduce the overall write amplification of the system
that is the LSM running on SSD because write amplification is caused at both layers.
Every time you do not make an in-place change, you are inducing write amplification.
The device is doing it, the LSM is doing it.
But can we combine these two and make something better?
So one of the key lines of research is how we are trying to exploit the device's capabilities
and make LSM-based storage engines even better.
Amazing. Some more interesting research that's going to happen there for us all
to catch up on and read when that comes out.
Brilliant stuff.
So, cool. So let's now talk about the performance trade-off between reads and writes.
We've been treating them somewhat in isolation so far.
So how do I go about, I've got some workload and I don't know,
it's got some proportion of reads and proportion of writes.
How do I go about picking the optimal design for my LSM tree, my LSM-based storage engine?
What do I, how do I go about doing this?
So like any data structure, LSM trees are bound by the RUM conjecture.
So what is the RUM conjecture?
R-U-M stands for read, update, and memory, respectively.
So the idea is very simple.
You cannot have it all.
So you can try to improve two of the three axes,
performance-wise, and then you must take a hit
on the third axis, and LSM trees are no exception.
LSM trees are great on the U axis,
which is update, which is write. But if you want to make them better for reads,
that is the R-axis, you must take a hit along the M-axis,
which is memory.
And this is why I keep on mentioning,
we do make reads better, but we pay a cost in terms of memory.
But the good thing is that the whole Ellison design
is quite flexible.
And between the read-optimized leveled design and the write-optimized tiered design, LSM trees offer a wide range of hybrid designs, which I talked about.
So in my work, Compactionary, I identified four key design questions that we always end up answering within the compaction routine.
And depending on how we answer these design questions, we might end up with thousands or even tens of thousands of different LSM designs.
So this goes on to show the rich design space that LSM trees offer.
But at the same time, it also shows that it's a vast design space.
And it is very tricky to navigate
when the design space is so vast.
So the next step in my work, what we did
is that we took an approach that was based on modeling
and extensive experimentation.
So we ended up running more than 2,000 experiments
by varying the workload composition, the workload
distribution, and the LSM tuning.
And we tried to understand the impact of each
of these four primitives, the four compaction primitives,
in isolation and when they are put in together.
So this really helped us understand
which design questions are more important when I'm looking at certain
performance metrics.
So this work we published in VLDB 2021, and it's a great experimental work that I'm very
proud of.
And this really goes on to give the researchers and practitioners the idea about how we can navigate this rich yet complex LSM design space.
So the goal is not always to come up with the best compaction algorithm or the best design, so to say,
but rather with some knowledge about the workload, we can definitely avoid the worst designs.
So we can definitely choose something
that is in the top 10 percentile performance-wise.
And this is what we did in this work.
Amazing.
Yeah, you just kind of said,
these are things you definitely don't want to do.
And this stuff is probably going to be good enough for you
like 99% of the time, right?
Like, yeah, just avoid these terrible things and you're going to be good, right? Yeah, that's
awesome, and we can link that work in the show notes as well, so the interested listener can
go and check that out. So we've touched on various different
possible research directions for LSM trees over the
course of the podcast.
So maybe you could just summarize now kind of what's next on your research
agenda for LSMs.
Sure, sure.
So LSMs are still a green field.
So we have been doing several lines of research here.
But even if I talk about deletes, which
was one of the starting points for me in LSM research, how we can efficiently delete data
from LSM trees, we have done so much work on that, but still, we have a very long way
to go before we can build an LSM-based storage engine that is completely deletion compliant.
So it's still significantly green.
One big problem that I kind of touched upon before
is the write amplification in LSM trees,
or any out-of-place data structure, so to speak.
So if you are not going in and making the change in place
when you have updates or deletes,
you are writing the same data over and over again with this expectation that eventually you will merge the different versions of the data and retain only the latest one.
But in this process, you are actually rewriting invalid data that was not supposed to be in the database in the first place.
But because you did
not delete them, you keep on writing them over and over again. And like any out-of-place data
structure, LSM trees suffer from write amplification. Some of the research
shows that LSM trees end up exhibiting 30x write amplification in practice. For a particular design,
if I'm ingesting one gigabyte of data into my database,
because of the database's internal operations, namely compactions,
it will move 30 gigabytes worth of data, which is huge.
This is one of the key problems that we want to
solve as a community when it comes to LSM-based storage engines.
And of course, another thing that we briefly touched upon is the hardware-software co-design front.
So because the out-of-place data structure is common in the LSM design paradigm,
as well as how the SSDs operate, how can we avoid the multiplicative write amplification from
the application, that is the LSM, and the device, and piggyback both these garbage
collection routines together to bring down the multiplicative cost factor to
something additive? This will be super important in the
future if you want to make LSMs scalable for the large amount of data that we are seeing today.
Fascinating. Loads of interesting directions there.
And I guess there's enough work there to keep you occupied for a lifetime.
Cool. So as kind of a software developer or as a data engineer, how can I leverage your findings about LSM
trees in my day-to-day life?
Absolutely.
It's a great question.
And I believe that impactful research is what helps the developers, the engineers, and the
community as a whole build better and more efficient systems.
So understanding the LSM design space is the first step if I want as an engineer
or as a developer to make my system run faster or perform better, whatever the metrics that
I have at hand. And as I mentioned, in our work we really break down all the
black boxes that we have around the whole LSM design space.
And breaking the black boxes allows us to think more fundamentally.
It allows us, it exposes the fundamental design questions in front of us.
And if we know how each of these fundamental questions, the answer to these questions,
affects the whole performance space, I am really one step closer to my goal of improving performance. So as I mentioned, we are still not there yet
where I can give you the optimal compaction strategy
or optimal LSM design, so to say,
given a workload or performance,
but we can definitely identify the top K best designs
given a workload and target performance. And this is, I believe, the first major step that we have made along
this line of research. And the whole idea is simply, if
we can help the developers or engineers avoid the worst designs, we are already
somewhat there.
And everything else is the last 10% that we have to improve on.
Awesome stuff.
Awesome stuff.
So whilst you've been working on LSMs,
I mean, how long did you say,
when was your first sort of interaction with LSMs?
You said it was around 10 years ago, did you say?
Or have you been on this like five years?
Less than five years ago.
Less than five years.
Okay, cool.
So over that five-year period,
what has been the most sort of like unexpected lesson
that you've learned while working with them?
Absolutely.
I think the key thing that I observed is that
when you are working with large-scale data,
simple tasks can become extremely tricky.
So when I first heard about the problem of deleting data from the storage engine, first of all, I did not know much about LSMs.
But my initial reaction was, how can deleting data be hard?
You just go in there and delete data.
Why should I even care about
making it efficient or whatever? And then you start to understand more about how modern systems
operate, more about the whole out-of-place update and delete paradigm, more about the LSM
paradigm. And you start to think that, yes, it is not at all a difficult task
if you do not have any performance benchmark to meet.
You can literally find your data, go there,
delete it, everything is fine,
but this will be super, super slow.
So the simple operation like deleting data
in an efficient way when your database is super large
can be super tricky.
And once you know that, you smile
internally, but you know that this is going to be quite a journey. Yeah, nice, that's
interesting. Speaking of journeys, over this, of course,
five-year period as well, have there been numerous things that you've kind of tried
that have failed?
Like I guess I'd like to know more about the war stories of like,
oh, man, we went down this dead end for ages
and this thing didn't work.
Maybe the listener might be interested to kind of know about.
Absolutely, absolutely.
So I started working on this efficient data deletion project
when I was in France.
And that was more on the privacy policy level.
We tried to understand what the regulatory requirements were,
what we need as system designers to do
in order to facilitate all these requirements.
And then after I moved on to the US in Boston University,
we really worked on the same project, but now instead of
thinking of it at a high level, we really went down into the systems and we tried to
understand how the systems operate, how we can make deletes efficient. And in the
initial days, I remember traveling to SIGMOD 2019, and before that we had this
idea about building deletion-compliant data
systems. We sent our drafts to so many people from academia for their feedback,
and after we got the feedback, we were very close, really, really close, to giving
up on the project, because people did not really care about deletes whatsoever.
So people cared about ingestion, people cared about queries; deletes, not so much.
And then again, we traveled to SIGMOD, we were on the verge of giving up on the project,
and we spoke with a bunch of people from different companies who are from the industry.
And for them, this was a huge, huge problem that they were facing at that point.
So, because of the regulatory requirements, companies were really struggling to delete data in an efficient manner. And in having these discussions, we
really found quite a few people who were very interested in the project. We understood
ourselves that this is indeed a problem that is worth pursuing. And fast forward four or five years, we have done so
much work on deletes; a version of some of our work is actually running in production RocksDB,
and this is something that makes me super proud. The work has gained significant visibility,
people do care about deletes now, and it all started five years back. But it was not a very smooth journey; it was rather a bumpy one.
Yeah, so SIGMOD 2019 was like a sliding doors moment then, where you actually
realized, I guess it's a prime example of the disconnect that can sometimes happen between
academia and industry, where academia says, okay, this deletes thing is interesting, but then
in the real world, this is a really big problem. So as well as working on LSMs, what other research topics do you have?
Are LSMs the primary vehicle for your research at the moment,
or are there other things you dabble your toes in as well?
It has been one of the primary lines of research that I have been doing.
I mean, in my previous life, I worked on wireless networks, cloud computing and edge computing
and all this stuff, but I won't bore you with that. So, apart from LSMs, in the data
systems community, I have been working... One of the main projects that I have been working
on is how we can index near-sorted data. So to give you a very high-level idea about what this project is about, so if the data
that I have is completely sorted, so I do not have to pay a lot of cost when I construct
the index, because the data is sorted. I might simply not build an index and do a binary
search, which will be logarithmic of N base 2 cost-wise, or I can build a B plus tree on the sorted
data, and this will be still logarithmic of N, but with base F, which is the fanout of the
B plus tree, so it will be significantly cheaper than the binary search.
And because my data was sorted to begin with,
I did not have to pay any cost in sorting the data.
On the other side of the extreme,
if the data is completely scrambled
and I have to insert it on top of my B plus tree index,
I will have to pay a remarkably high cost,
again logarithmic of N per entry, because I
have to insert every entry to my index data structure, which
is fine, because this is the read-write trade-off
that we always play around with.
If my data is sorted, then I do not
pay any cost, which is understandable.
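To put rough numbers on those costs, here is a tiny calculation with assumed values of N equal to one billion entries and a B plus tree fanout of F equal to 256 (both just illustrative):

```python
import math

N = 1_000_000_000   # one billion entries (assumed)
F = 256             # an assumed B+ tree fanout

print(math.ceil(math.log2(N)))         # ~30 comparisons per binary search over the raw sorted data
print(math.ceil(math.log(N, F)))       # ~4 node accesses per lookup in a B+ tree over the same data
print(N * math.ceil(math.log(N, F)))   # ~4 billion node accesses to insert scrambled data one by one
```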
But what happens if my data is nearly sorted but not completely sorted?
If I forget about how modern systems work and if I am in an ideal
world, if my data is almost sorted, I should pay
almost no cost in sorting the data.
The index construction cost should be a function of
how ordered or how sorted my data is.
But if you look at any index data structure, B plus trees, LSM trees, any index data structure for that matter,
the cost of building the index is never a function of sortedness.
There is a binary knob, sure: if the data is completely sorted, there is no cost in terms of constructing the index.
But anything that is not completely sorted, no matter whether it is nearly sorted or completely
scrambled, the cost is always the same.
In this work, we try to build an index data structure, an indexing paradigm, so to say,
that can exploit the sortedness as a resource, especially when we try to constitute or construct indexes.
So if my data is sorted or nearly sorted,
I will end up paying proportionally low cost
in constructing the indexes.
So we published this work in,
I think this is where I met you first, at VLDB in Sydney last year.
And the full version of this work was published in ICDE this year.
And the whole idea is that, using sortedness as a resource, we end up paying less cost when the data has some inherent ordered organization.
Amazing. We can maybe do a podcast on that one, though, because that's another fascinating topic, how we can exploit that additional information to improve
index creation and things. That's fascinating, that's really cool. So
this kind of leads quite nicely into my next question, really, in that how do you go
about sort of generating ideas? What's your creative process? And then how do you decide this is something that we want to work on
for the next five years or however long?
So if you are jumping domains from networks to databases like I did,
the place to start is to talk to people.
Talk to people around you.
Talk to people from industry.
Talk to people from academia.
Understand the practical problems that they are facing and try to work on them. So this is a great starting point.
It has been a great starting point for me. And then with experience, you get to see newer problems
as you work on the existing problems. So this is how you branch off from one of your main projects that you are working on.
And I think selecting projects is always a tricky thing to do.
But in my head, I do believe that there is no failure in research.
So there are several projects that we end up investing some amount of time, maybe a lot of time in some cases,
and we did not see the expected results that we had in mind when we began.
But essentially, there is no failure in research. So if you fail to achieve a target, essentially,
you find different ways about how you should not approach a particular problem. And this is learning.
And if you learn, there is really no failure.
So if it is definitive, then it is not research, in other words.
So there is always this risk, but with the experience, with your knowledge,
you can always make wiser decisions.
But again, if I'm selecting a project that I'm really passionate
about and it does not really work out, it's fine, it's still learning. I absolutely love that
perspective on it. That is a great way to think about it, because I think a lot of the time
it's so easy to get hung up on, I need to publish, publish, publish, and it has to be the new cool
sort of idea, and all those kinds of external pressures. But
viewing it that way, for me, is the way science should be conducted, right? That's
very fundamental, right? If it fails, you've learned something, right? Yeah, so that's
a really nice answer to that question. Cool. So I've just two more questions now.
The penultimate one is quite a big-picture one. It's, what do you think is the
biggest challenge in database slash data management research today? Sure. So in my head, we are time
traveling back to the 80s, right? So we had data, but more than that, we had very small devices. It was very important for us to think about ways how we can manage the small amount of
data that we had, but with the even smaller amount of compute or storage resources that we had
available.
So now, fast forward 45 years, we have remarkably efficient storage and compute units, but
the data has grown exponentially.
We have still the exact same problems.
We have a lot of data and we have to think about every passing day how we can manage
efficiently this huge, huge amount of data that we have, how we can process data at large.
And along those lines, it has become significantly important to understand which data is useful.
So we have a glut of data, but that does not mean that every data object is equally
valuable.
So we have to really find out ways how we can shed off data that is not really useful for our
analyses. This way we can really reduce the load of managing and processing data.
Another problem that we have is protecting the privacy of data, and this
is something that I have been working on for about five to seven years now. And the
idea is that there is no way for an end user today
to really be able to understand where the stream
of his or her personal data has flown.
So who has their personal data and contact tracing,
data tracing are some of the lines
that people have been working on to find this out.
But again, overall, as I mentioned,
simply deleting the user data on the user request
in an efficient manner is a huge challenge. And I do believe, fundamentally, there is a
trade-off between privacy protection and performance. So if you are really willing to
build a system that can offer all the privacy guarantees that you are looking at, then you have
to take a hit on performance.
So how we navigate this performance privacy trade-off is one of the biggest challenges
that we have today in the community.
Finally, I think what also comes to my mind is moving toward greener compute frameworks.
So in the era of ChatGPT and deep learning, the amount of data that we are processing every second is remarkable.
And with so much energy, so many resources being burnt every single second, surely as the data grows, we have to think about greener compute frameworks. So I really want to invest a significant amount of my time
in this particular line of research,
because this is, this is not sustainable.
So we have to really move to something
that is more sustainable.
Yeah, I totally agree.
All those dimensions there for sure.
The privacy aspect of it,
the energy efficiency aspect of it as well.
Cool.
So yeah, it's time for the last word now.
So what's the one takeaway
you would like the listener
to take away from this podcast today?
Sure.
I'll keep this one short.
I think the key takeaway would be
at the end of the day,
the performance of all these large data systems that process petabytes of data, it really boils down to the data structures and access methods that are used in the storage layer.
And the key to improving performance and building efficient data systems is to really understand how the underlying data structures operate.
And once you have a solid knowledge of that, you can really build a storage engine that
serves your workload the best.
So going back to the fundamentals, getting the basics correct, as we say in cricket.
Yeah, that's a great way to end it. So, fantastic. Thank you so much for coming on. It's been fantastic to talk to you. If the listener is interested in knowing more about Subhadeep's work, we'll put links in the show notes. And if you enjoyed the show, please consider supporting the podcast through Buy Me a Coffee. It really helps us cover all of our costs, and we'll see you all next time for some more awesome computer science research.