Disseminate: The Computer Science Research Podcast - Subhadeep Sarkar | Log-structured Merge Trees | #32
Episode Date: May 11, 2023. Summary: Log-structured merge (LSM) trees have emerged as one of the most commonly used storage-based data structures in modern data systems, as they offer high throughput for writes and good utilization of storage space. In this episode, Subhadeep Sarkar presents the fundamental principles of the LSM paradigm. He tells us about recent research on improving write performance and the various optimization techniques and hybrid designs adopted by LSM engines to accelerate reads. Tune in to find out more! Links: Personal website, ICDE'23 tutorial, LinkedIn. Hosted on Acast. See acast.com/privacy for more information.
Transcript
Welcome to Disseminate, the computer science research Podcast. I'm your host, Jack Wardby.
Today, I'm joined by Subhadeep Sarkar,
who will be telling us everything we need to know
about log-structured merge trees.
Subhadeep is going to be, I should say, in July,
an Assistant Professor of Computer Science
at Brandeis University.
So, Subhadeep, welcome to the show
and congratulations on your new appointment.
So, thank you, thank you.
I love being here.
It's great of you to invite me.
And I'm very excited to start my new life at Brandeis.
And this is essentially the first podcast that I'm doing as an assistant professor.
So I'm super excited.
Brilliant stuff.
Brilliant stuff.
Cool. So
obviously I've given you a very brief introduction there but maybe you can tell the listener
a little bit more about yourself and how you became interested in databases, data management
and all the cool things that we all love. Of course. So I did my PhD on something that is
completely different; it was more theoretical research in computer science and
cloud computing. But as I was doing so, I realized that I wanted to do more systems stuff, both research-wise and development-wise. So after my PhD, I moved to INRIA in France, and it is there that I was
exposed to large-scale systems and the multitude of data management challenges that you face when you
operate on data systems and operate on large scales of data.
So instantaneously, I was hooked.
So specifically, my journey started with the problem on how we can delete data efficiently
from large scale data stores.
So if I remember correctly, GDPR was about to be enforced
at that time, and it was really in the interests
of many companies that they want to delete user data
in a scalable manner.
And it was fascinating for me how a seemingly simple
task of data deletion can become so tricky
when we operate at scale. So this is what made me
really interested in data management research. And then after my time at INRIA, I moved to
Boston University, spent about four and a half years working on the same problem and some
other problems as well, and now I am truly in love with this new domain. Amazing. A great origin story of how you became hooked on this really cool research area.
So yeah, so today we're going to be talking about log-structured merge trees.
So maybe for the uninitiated, you can kind of start off by explaining what the hell this really cool data structure is.
Well, truth be told, I had not even heard about LSM trees five years back.
But the truth of the story is, today, LSM trees are sitting at the heart of many data systems,
relational and non-relational alike. So if I have to take a few names, we have all heard about
RocksDB at Meta, Bigtable and LevelDB at Google, X-Engine at Alibaba, Cassandra and HBase at Apache.
All these big databases, which are primarily used as key-value data stores, have LSM trees as
the underlying data structure. So what is an LSM tree? Essentially, it is yet another data
structure. It is a highly write-optimized data structure.
It is a data structure that was built to support data when the data does not fit in memory and the data is storage resident.
So with the exponential data growth that we have seen over the past couple of decades, where data really does not fit in memory anymore, we have seen more and more systems adopt the LSM paradigm. So several systems
have either moved away from B plus trees to have LSM-based indexes, or they have extended
support to LSMs while retaining their B plus tree implementation. So LSM trees, in a single
sentence: it is a write-optimized, storage-based data structure that can also help us index data.
Amazing. So you touched on it a little bit there about the history of LSM trees,
but maybe you can dive into this a little bit more.
When was this data structure first proposed?
And you mentioned a few of the systems there that are using it.
So maybe we can talk a little bit more about why they chose to use that data structure as well.
Awesome.
This is an interesting question. So LSM trees were first proposed by Patrick O'Neil and his colleagues
here in Boston, in 1996.
And for the next 10 years, there were literally only a couple of papers
that talked about this data structure.
No one knew about this data structure.
No one really cared about them.
There was not a single database that used LSMs underneath.
So in 2006, when the big data boom was just starting up,
Google wanted to build a new data system that would be based on the key value paradigm.
And they wanted to use commodity hardware
to build these large-scale data systems. So they kind of moved away from the array-of-disks
paradigm that we had seen in the 80s. And when they wanted to move to commodity hardware,
LSM trees were really a data structure that gained their attention.
So it was, as I mentioned, storage-based.
So essentially, it does not really care whether the data fits in memory or not.
Rather, it is super suitable when the data is large in size.
And second, it is highly write-optimized.
So in an era of data boom, where you have a lot of data and you want to ingest the data into your data store really fast, LSM trees were really the data structure of choice.
And in the following years, starting from 2007, several companies understood the utility of LSM trees as a storage-layer data structure. And as I mentioned, a bunch of NoSQL data stores adopted LSM trees to
be their data structure of interest.
Not only that, there are several relational databases, like MyRocks at Meta and SQLite, which
have LSM trees running underneath.
There are a few time series databases, such as InfluxDB and QuasarDB, which are also based
on LSM trees. So this is how LSM trees have gone
from not being a data structure of interest to a data structure of
super interest in less than three decades.
Yeah, that's amazing. I mean, it's such a rarity, because the mid-90s is really recent in database history.
We always have this joke with a lot of my guests that everything was already done in System R
and all the cool ideas are taken, so it's really interesting to see
something that arrived quite late on the scene and is quite revolutionary
as well, and that it's been used by so many data systems today. So yeah, that's fantastic.
Let's dive into these in a little bit more depth
then. So I know you've done several tutorials on LSM trees over the last couple of years at
SIGMOD and ICDE, and you always kind of start off by talking about the key operating principles
of LSM-based storage engines. So could you maybe tell us what these key operating principles are?
Absolutely, absolutely. That's a great question. And another way to think about this question is: how are LSM trees fundamentally different from classical index structures,
say B plus trees? So essentially, if we can compare and contrast these two data structures
side by side, we might be able to understand
the key operating principles, as you mentioned, of LSM trees.
The first thing that comes to my mind is that an LSM tree is an ingestion-friendly data structure,
and it has an in-memory component that we often refer to as the LSM buffer.
So as incoming data arrives,
we do not really move it to the storage aggressively.
So if you think about a B plus tree,
if you have an insert operation,
you will have to traverse through the root,
move to the leaf node,
access all the intermediate pages in between,
and then write your data to the leaf node.
In LSM trees, we try to amortize the cost of ingestion.
So instead of paying a logarithmic cost,
we buffer a bunch of inserts in memory,
with no storage access whatsoever.
And then when the in-memory buffer is full,
we lazily flush it to the storage.
So this is how we really achieve a superior ingestion
performance in LSM trees.
The second thing that comes into mind is
the out-of-place update and delete paradigm that LSM trees follow.
So again, going back to B plus trees,
if you want to update or delete an entry,
it is in place.
What do I mean by in place?
You will literally have to start from the root,
go to the last leaf where the data is contained,
then you have to rewrite the data or delete the data and do all the necessary actions required.
So you pay a very high cost here as well because you are doing a lookup to find the key
and then modifying the key in place. So, LSM trees do not follow this paradigm at all. So, if you have to update
an entry, you simply insert the new version of it. So, you do not even look for or touch the
existing entry. The only thing you have to make sure is that whenever you have a query,
the query accesses the recent version first before it accesses the older version. And once you access
the recent version, your query terminates.
So there is no way you will be returning an invalid entry.
Deletes happen the same way.
So you do not go in place and delete an entry.
Rather, you insert something that we call a tombstone,
which logically invalidates the entry.
So essentially, these are the two fundamental ways
how LSM trees are different from
B plus trees.
And one other thing that comes to my mind: if you think of a B plus tree, as it supports
in-place updates, the leaf nodes are almost always not completely full.
So in a stable situation, they are probably 67% full.
LSM trees, on the other hand, operate on the notion of immutable files. So what
does that mean? It means that once you move something from the memory to the storage,
you can never go and make in-place edits in the file. This allows LSM trees to achieve superior
space efficiency. So they can really use the device space, the storage space, in a very efficient way.
So fundamentally, these are the three axes that come to my mind for how LSMs are different
from B plus trees.
There are, of course, several similarities.
So both of them facilitate indexing.
In a B plus tree, the data is probably in the last level and all the intermediate levels
are used for indexing. LSM trees in that sense are probably closer to B-trees or B-
epsilon trees, because there is an indexing structure that we have, but the
data and the key are always together even in the intermediate levels of the
tree. Another similarity is that as you know,
what we call the B plus tree fanout
essentially becomes the size ratio in an LSM tree.
So every level has an exponentially larger capacity.
This way we can really limit the number of
levels that we end up having in an LSM tree.
Finally, just as we have node splits and merges in a B plus tree, which I refer to as data reorganization, we have something called compactions in LSM trees.
So every time a level size goes beyond the threshold, we move some part of the data from that level and merge it or compact it with the overlapping part of the next level. So, yeah, essentially it is, once again,
it's still an index structure, but it has certain similarities and certain
differences when it comes to classical indexes.
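To put rough numbers on that size ratio idea, here is a tiny back-of-envelope sketch in Python. It simply grows each level's capacity by the size ratio until the data fits; the buffer size, size ratio, and data size are illustrative values, not figures from the episode.

```python
def level_capacities(buffer_entries, size_ratio, n_entries):
    """Capacities grow exponentially with the size ratio, so the number of levels stays small."""
    levels = []
    cap = buffer_entries
    while sum(levels) < n_entries:
        cap *= size_ratio          # each level is size_ratio times larger than the previous one
        levels.append(cap)
    return levels

# e.g. a one-million-entry buffer and size ratio 10 can hold a billion entries in just 3 levels
print(level_capacities(1_000_000, 10, 1_000_000_000))   # [10000000, 100000000, 1000000000]
```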
I love the positioning against a B-tree, because
having that as sort of a yardstick to measure against and see
how it differs is really, really nice. It's a really nice way to kind of compare these two data structures.
So that's great.
So let's talk about this some more then.
So you break down the operations we can then do on log-structured merge trees
into internal operations like flushing and compaction
and some of the external operations like updating and reading
and stuff like that.
So can you maybe, first of all, work us through how the
internal operations like flushing and compaction actually work in practice? Of course, of course. Yeah, so internal operations are operations that are really
not triggered from the user side. I mean, they are probably triggered, but not explicitly,
right? So an internal operation is, when the LSM buffer gets full, we need to move all
the data from the buffer, which is in memory, to the storage, right? So this is the operation that we
often refer to as a flush in LSM terminology, right? So a flush is nothing but: after the buffer
is full, we sort all the entries in the buffer based on the key. So we have a sorted component.
And then we write all of them to
the slower storage as an immutable run.
So flush is essentially moving data from the memory
to the storage in the form of immutable sorted runs.
And the second thing that you mentioned is compaction.
I briefly touched upon this before.
Compaction is, like in any index data structure,
as you have more and more data, you have to reorganize your index structure. In B plus trees,
we do this by splitting the nodes or merging the nodes based on the workload that we have. In LSM trees,
we do the same. We just call them compactions. So for a compaction: every level, as you can imagine, has a capacity which is exponentially growing as we are moving
down in the tree but once the size of a level goes beyond the capacity we can do
two things. One, we can either move all the data from that level and sort merge
it with the data from the next level. This sort merge operation is essentially the compaction.
Or two, what we can do is that we can just move
some part of the data from the level that is full
and merge it with the overlapping part
in terms of the key ranges from the next level.
So these are classically the two ways
how we implement compactions.
But as you can imagine, both flush and compactions are something that LSM engines have to do in order to make sure that they retain the tree structure.
So this is why we call them internal operations.
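To make the flush and compaction mechanics concrete, here is a minimal, illustrative Python sketch of a leveled LSM tree: the buffer is sorted and flushed as a run, and when a level exceeds its exponentially growing capacity it is sort-merged into the next level. This is not the code of any real engine; the class name, capacities, and size ratio are made up for the example.

```python
def merge_runs(newer, older):
    """Sort-merge two sorted runs (lists of (key, value)); newer entries win on key collisions."""
    merged, i, j = [], 0, 0
    while i < len(newer) and j < len(older):
        if newer[i][0] < older[j][0]:
            merged.append(newer[i]); i += 1
        elif newer[i][0] > older[j][0]:
            merged.append(older[j]); j += 1
        else:                                              # same key: keep only the newer version
            merged.append(newer[i]); i += 1; j += 1
    return merged + newer[i:] + older[j:]

class LeveledLSM:
    """Buffer plus levels with exponentially growing capacities; one sorted run per level."""
    def __init__(self, buffer_capacity=4, size_ratio=2, num_levels=4):
        self.buffer, self.cap, self.T = {}, buffer_capacity, size_ratio
        self.levels = [[] for _ in range(num_levels)]      # each level holds one sorted run

    def put(self, key, value):
        self.buffer[key] = value                           # in-memory, no storage access
        if len(self.buffer) >= self.cap:
            self._flush()

    def _flush(self):
        run = sorted(self.buffer.items())                  # sort the buffer on the key
        self.buffer = {}
        self.levels[0] = merge_runs(run, self.levels[0])   # leveled: eager merge on every flush
        self._maybe_compact()

    def _maybe_compact(self):
        for i in range(len(self.levels) - 1):
            capacity = self.cap * (self.T ** (i + 1))      # level i holds cap * T^(i+1) entries
            if len(self.levels[i]) > capacity:             # level overflows: compact into next level
                self.levels[i + 1] = merge_runs(self.levels[i], self.levels[i + 1])
                self.levels[i] = []

lsm = LeveledLSM()
for i in range(40):
    lsm.put(i % 16, f"v{i}")                               # updates are simply re-inserted, not edited
print([len(run) for run in lsm.levels])                    # occupancy of each level after the workload
```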
I see. I see. Cool.
So let's then flip this over and look at the other side, from the user level and the external operations, the gets and the puts and the scans.
So how do these things work, and how do they then trigger these internal
operations?
Absolutely.
So as the name suggests, external operations are really what we see in
the workload, right?
So an LSM tree, as I mentioned before, is a key-value storage engine,
right?
So every entry that you can see, it has a key based on which we organize the entries
in the tree and all the other attributes, whatever you have, they are clubbed together and they
constitute the value field. So we have these key-value entries that come into an LSM tree.
And we support the different operations through what we call a put-get API. So what is a put?
Put is nothing but an insert. So we call it put because in LSM trees,
we do not really differentiate updates from inserts.
So we do not really care whether the entry already
exists in the database and the new entry is an update,
or this is a completely new entry.
We simply use the put API to make sure
that we are writing the new entry to this memory
buffer.
And when the memory buffer is full, the entry will be sorted and flushed to the storage
eventually.
But the put-get API is very simple.
We have to make sure that whenever we have a get, we never access older data before we
access the newer data. This is the invariant, as we call it in
LSM trees, that really allows us to make sure that a get operation,
which is essentially a query operation,
a point query, will never return something that is invalid.
This is how the put-get API works.
Apart from this, LSM trees also support range scans.
Range scans are essentially range queries where you want
to fetch all entries between two given keys,
let's say K1 and K2.
In a leveled LSM tree,
I'll talk more about what's leveling and what's
tiering later on.
But the idea is that you have to really figure out all the sorted components
that qualify for a range scan operation, a range query, and then, because there can be overlaps of
the keys, there can be the same keys appearing as different versions across several runs, you have to
merge the content of all these qualifying sorted runs.
While you do the merge,
you make sure that you are only retaining or
returning the latest entry for every key.
This is how LSM trees, through the process of
sort merging, respond to range queries.
Finally, we talked a little bit about deletes.
Deletes, as I mentioned,
are also a put operation,
but instead of pushing or putting in a completely new entry,
we do put in a special entry that we refer to as tombstone.
A tombstone is nothing but, well, it's still a key-value pair.
It has the key for the entry that
you want to delete, but the value field is typically one byte long, which just says that this is a
tombstone. So that's it. The tombstone flag is what we call it, and this is what helps us
differentiate the tombstone from a general key-value entry. So the whole put, get, delete, range scan API in LSM trees is very simple, in fact.
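Here is a small, self-contained Python sketch of that put/get/delete/range-scan behaviour: updates and deletes are just new puts (a delete puts a tombstone), and reads probe newer components before older ones, stopping at the first hit. It is purely illustrative; the dict-based "runs" stand in for sorted runs on storage.

```python
TOMBSTONE = object()               # sentinel marking a logical delete

class TinyLSM:
    """Newest-first lookups over an in-memory buffer plus immutable runs on 'storage'."""
    def __init__(self, buffer_capacity=4):
        self.buffer = {}           # in-memory buffer (newest data)
        self.runs = []             # list of dicts, newest first; stand-ins for sorted runs
        self.capacity = buffer_capacity

    def put(self, key, value):     # inserts and updates look identical: out-of-place writes
        self.buffer[key] = value
        if len(self.buffer) >= self.capacity:
            self.runs.insert(0, dict(sorted(self.buffer.items())))   # flush as an immutable run
            self.buffer = {}

    def delete(self, key):         # a delete is just a put of a tombstone
        self.put(key, TOMBSTONE)

    def get(self, key):            # probe newer components before older ones, stop at first hit
        for component in [self.buffer] + self.runs:
            if key in component:
                value = component[key]
                return None if value is TOMBSTONE else value
        return None

    def range_scan(self, lo, hi):  # merge all components, keeping only the newest version per key
        result = {}
        for component in [self.buffer] + self.runs:
            for k, v in component.items():
                if lo <= k <= hi and k not in result:
                    result[k] = v
        return sorted((k, v) for k, v in result.items() if v is not TOMBSTONE)

db = TinyLSM()
for i in range(10):
    db.put(i, f"v{i}")
db.put(3, "v3-new")                # update: the old version of key 3 is never touched
db.delete(5)                       # delete: key 5 is masked by a tombstone
print(db.get(3), db.get(5))        # v3-new None
print(db.range_scan(2, 6))         # [(2, 'v2'), (3, 'v3-new'), (4, 'v4'), (6, 'v6')]
```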
Is there a way to have additional, I mean, I don't know what these would be.
I mean, it's like one of the selling points of this data structure
that it's very simple, right, a simple API.
But are there any other possible types of operations you can kind of define
on an LSM tree that would be useful?
Absolutely.
So as I talked about, this put-get API is simple.
It works.
But whenever it comes to complex operations like range queries,
LSM trees do not perform as well as B plus trees do.
Because if you think about a B plus tree, everything is sorted in the leaves.
You can literally do a scan and be done with it.
But in an LSM tree, you will have to do a merge operation.
It's quite expensive. It's a two-way merge that will happen in general.
If you want to delete, for example, a range of values,
which we call range deletes, it is super tricky in LSM trees
because now you have to think about how you are going to delete all the values
because you do not really know how many valid entries are there within a given
range. So there is some work on range deletes in LSMs. We are also working on
it in one of our projects. The put-get API is very good when you have a very
simple workload that has a bunch of point lookups and a bunch of inserts or updates.
But when you have more complex workloads, the API does not really allow us to be more flexible.
Rather, we have to write our own API, own routines that can facilitate that.
Cool. I mean, I guess it's all about trade-offs, right?
There's a trade-off you've got to make somewhere, right?
So I guess one of the benefits is having this really nice,
simple data structure that's really good at ingestion and things like this.
So, I mean, I guess this is part of the trade-off, right?
But anyway, cool.
So let's talk more about writes then,
because obviously this is kind of one of the big selling points
of this data structure and probably one of the main determinants
why it's become really popular.
Like you said earlier on, the advent of big data, et cetera, et cetera.
We have these large volumes of data to deal with.
So can you maybe give the listeners an overview of the recent research on how
we've tried to improve ingestion performance in log-structured merge-based
storage engines over the past sort of, I don't know, five to 10 years?
Absolutely, absolutely. And you correctly point out that
LSMs are designed to facilitate ingestion in a better way.
But that does not mean that we cannot make it better. So we have been trying
putting in our hours to make LSMs
even better for ingestion. So the two primary
things that come to my mind
is the implementation of the memory buffer, right?
So I can give you an example
and it might help you understand
why the implementation of memory buffer
can even become a performance bottleneck.
So if I have a bunch of inserts coming,
I will simply append them to my buffer.
And this is super fast.
This is like literally
big O of one, the time you take to perform this operation. But now you have a point query coming
in. And as I mentioned, you will have to search the newer data first. So you have to look into
the buffer if the entry is there. And because you simply appended, now your simple point query,
in order to see whether the entry is there in the buffer or not,
you have to scan it.
So you end up paying a big O of n cost, where n
is the number of entries in the buffer.
And your whole workload suddenly slows down.
What was constant now becomes linear.
But if you know that your workload does not
have any lookup operations at all,
I would rather
have a vector implementation where I will simply keep on appending my entries in the
buffer.
It will be super fast for ingestion.
But now if I have a mixed workload where I know that there will be a bunch of inserts
which are interleaved with a bunch of point queries, I have to rethink my design of the
buffer.
A skip list based buffer, for example, might be a better choice.
You can even have a hash map based buffer.
So the implementation of the buffer,
we have seen in some of our experiments,
depending on the workload, there can
be a difference of performance of about two orders
of magnitude.
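A toy Python illustration of that buffer trade-off: an append-only buffer gives constant-time inserts but linear-time point lookups, while a sorted buffer (a stand-in here for a skip list, using the standard bisect module) gives logarithmic lookups at the price of keeping entries ordered on every insert. The class names are invented for the example.

```python
from bisect import bisect_left

class VectorBuffer:
    """Append-only buffer: O(1) inserts, but a point lookup has to scan all n entries."""
    def __init__(self):
        self.entries = []
    def put(self, key, value):
        self.entries.append((key, value))
    def get(self, key):
        for k, v in reversed(self.entries):    # newest occurrence of the key wins
            if k == key:
                return v
        return None

class SortedBuffer:
    """Keeps entries sorted on insert (stand-in for a skip-list memtable): O(log n) lookups."""
    def __init__(self):
        self.keys, self.values = [], []
    def put(self, key, value):
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value             # overwrite the version already in the buffer
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)
    def get(self, key):
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

vb, sb = VectorBuffer(), SortedBuffer()
for k in (5, 1, 3, 1):
    vb.put(k, f"v{k}"); sb.put(k, f"v{k}")
print(vb.get(1), sb.get(1))                    # v1 v1
```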
The buffer size also really matters.
So if you have a very small buffer size,
it will be full sooner,
you will be doing a lot of flush operations.
So your workloads will see smaller stalls,
but frequent stalls.
A larger buffer size will hold a lot of entries,
which is good because some of the entries can be,
some point lookups may be served from the buffer itself.
But at the same time, when you are moving all the entries
in the buffer to the storage, there will be a latency spike.
So there are all these trade-offs that we talked about.
Another thing that is huge in LSMs
is how we are reorganizing or compacting
the data on the storage.
So the buffer is really the smallest component of an
LSM tree. Most of the data, like typically 99% of the data in an LSM
tree, is storage resident. And it really, really matters how we are storing the
data or writing the data on the device itself. So how frequently we are
compacting data, whether we are compacting data eagerly or lazily,
these design choices really affect the LSM performance. So classically, I briefly touched
upon this before, there are two LSM designs. One is what we call a leveled LSM design, and the other
is a tiered LSM design. In a leveled LSM design, every time you flush a buffer,
you merge eagerly all the entries that you just flushed
with all the entries that were there
in the first level of the tree.
And this way, you make sure that in every level of the tree,
you always have only one sorted component.
Having one sorted component per level
is actually superb when it comes to point query performance.
I will talk about it later on a bit as well.
But because you are merging every time you
are performing a flush, the ingestion performance
is not super good.
It is good, but it is not super good.
On the other side of the extreme, you have the tiered LSM design.
So every time you flush the buffer, you do not do any sort merge.
You just write the newly inserted buffer next to whatever was existing in the first level,
and you are done.
So this way, your flush does not require a sort merge, so you have a very good ingestion performance. But when you
are performing a lookup, you will have to check all these sorted components within
a level. So tiering is great when it comes to ingestion, but when it
comes to query performance, leveling is more optimized. And the LSM is a very flexible data structure. You
can have a bunch of intermediate hybrid designs between the leveling and tiering extremes. You
can have the first few levels implemented as tiered, the larger levels below the tree are
leveled. So it really, you can play around with the best design that you might want
to have, based on the workload that you have and the performance target that you have set.
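As a rough back-of-envelope comparison of the two extremes, the usual analysis goes like this: with L levels and size ratio T, leveling keeps one run per level, so a point lookup without any filters probes about L runs, but every entry gets re-merged roughly T times per level; tiering keeps up to T runs per level, so about L times T runs to probe, but each entry is written only about once per level. A small illustrative calculation in Python, with made-up parameter values:

```python
import math

def lsm_costs(n_entries, buffer_entries, size_ratio):
    """Back-of-envelope run counts and write amplification for leveling vs. tiering."""
    T = size_ratio
    L = math.ceil(math.log(n_entries / buffer_entries, T))   # approximate number of levels
    return {
        "levels": L,
        "runs_probed_leveled": L,        # one sorted run per level
        "runs_probed_tiered": L * T,     # up to T runs per level
        "write_amp_leveled": L * T,      # each entry re-merged roughly T times per level
        "write_amp_tiered": L,           # each entry written roughly once per level
    }

print(lsm_costs(n_entries=1_000_000_000, buffer_entries=1_000_000, size_ratio=10))
# roughly: 3 levels; leveling probes ~3 runs but rewrites each entry ~30 times,
# while tiering probes ~30 runs but rewrites each entry only ~3 times
```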
Yeah, you preempted my question there. I was just going to say, is there any way we can sort of mix and match these
two different designs and kind of explore the space between the two? But yeah, that answers that.
So, I guess going a level further with that,
rather than making that decision of,
I'm going to kind of have this hybrid structure this way,
has there been research on that?
Maybe we might touch on this later on, I might be getting ahead of myself,
about adaptively changing the data structure
and the balance between these two extremes of leveled and tiered
based on the workload at runtime.
Is that possible or does it have to be a static decision up front?
Great, great, great question.
So this is what we have been trying to do for the last five,
10 years, I guess, right?
So the idea is that exactly what you mentioned.
So if I know something about my workload,
if my workload changes on the fly, can I switch from, let's say,
in a very simple case, from a leveled LSM
design to a tiered LSM design?
Because suddenly I do not have any point queries in my workload anymore, so I really just want
to make ingestion faster.
So the answer is yes and no.
So we have been doing a lot of modeling and a lot of analysis about what design would be optimized or optimal for a given workload.
So we kind of have an idea about if I have certain knowledge about the workload, what would be my
near-optimal, let's say, LSM tree design. So this part is good, and this part is the yes.
The no part comes from the fact that there is really no LSM design in practice that can
on the fly change the data layout, that is on the fly
change a leveled LSM to a tiered LSM. And this is
because the implementation is tricky.
So the transformation of one LSM design
to another really depends on how you are compacting data.
And compaction, until one of our works from two years back, was literally a black box in LSM design space.
So people knew that compactions go on. There are a bunch of compaction routines that LSM trees have underneath.
But no one really knew what the
different design choices are, so it was increasingly hard for people to understand
how they can really change the compaction routine based on the workload and the kind of performance
that they are looking at. So this is essentially one of the natural next steps for us in the community.
I'm sure in the next few years, we'll have a few endeavors that will attempt to do this.
But to date, we still do not have an LSM engine that can transform on the fly.
Cool. So we've been speaking about the write half of things so far.
So let's switch back over to reads.
So let's talk about the techniques and approaches we have for optimizing things like
point queries and range lookups in LSM trees, because it feels like so far we've kind
of done everything to make the write path a lot faster. How can we then go about sort of
pulling that back a little bit and getting ourselves some good read performance as well? Yeah, absolutely.
Well, as you know, there is no free lunch, right? So if you have a super write-
optimized data structure, you will take a hit when it comes to reads.
And LSM trees are no exception. So if we do not have
any auxiliary data structures that can help us for
queries,
reads are quite expensive in LSM trees.
So if you think about the point query,
I know that it might be in one of the LSM levels.
So if I have one sorted run per level, which
is the best design for reads, I still
have to do a binary search to see whether the entry is there
in level one.
If it is not in level one, it can be in level two.
So once again, I do another binary search.
So I end up doing L times log N I/Os, right?
And I/Os, as we know, are super expensive.
So L is essentially the number of levels
that I'm talking about, and N is the data size that is on the storage. So it will be super, super bad. And
this is the most read-optimized LSM variant we are talking about. For a tiered
LSM, depending on how many tiers we have in every level, it can be even more
expensive. So how do state-of-the-art LSM engines work?
Well, they trade more memory to improve reads.
So what they do is that, because memory is cheap nowadays,
they have what we call fence pointers in memory.
So fence pointers retain the maximum and minimum key for every storage page. So you really know whether or not
you should be doing an I/O if you have the fence pointers in memory. So this way you can really
limit the number of I/Os that you do for every level to one, because you do not have to do an
expensive binary search on the data on the disk anymore. You do the binary search on the fence pointers, which are in
memory, and the fence pointer tells you which file may contain the entry, and
then you go there. So yes, it is a huge win. We come down from L times log N to L.
But again, if I have to do L I/Os for one point lookup, it is still super expensive.
What we do is that we take more memory
to improve the performance further.
So I use this memory to have bloom filters for all
the entries in the LSM tree.
And now, before I probe a particular level,
I consult the bloom filter, which is, of course, in memory.
And it tells me with a very high probability if the entry is there or not.
So this is how we really cut down on the cost of point lookups.
But again, nothing is free here.
The cost that we pay is in terms of memory.
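Here is a small illustrative Python sketch of those two in-memory helpers for a single sorted run: fence pointers (the min and max key of every page) pick at most one page to read, and a toy bloom filter lets a point lookup skip the run entirely when the key is almost certainly not there. The page size, filter parameters, and class names are all invented for the example.

```python
import hashlib
from bisect import bisect_right

PAGE_SIZE = 4                       # entries per "page"; tiny on purpose, just for illustration

class BloomFilter:
    """Toy bloom filter: k hash probes into an m-bit array; false positives possible, false negatives not."""
    def __init__(self, m_bits=1024, k=3):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)
    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m
    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)
    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

class SortedRun:
    """One immutable sorted run, split into pages, with fence pointers and a bloom filter."""
    def __init__(self, entries):                    # entries: sorted list of (key, value)
        self.pages = [entries[i:i + PAGE_SIZE] for i in range(0, len(entries), PAGE_SIZE)]
        self.fences = [(p[0][0], p[-1][0]) for p in self.pages]   # (min_key, max_key) per page
        self.bloom = BloomFilter()
        for k, _ in entries:
            self.bloom.add(k)
    def get(self, key):
        if not self.bloom.might_contain(key):       # skip the whole run with no "I/O" at all
            return None
        idx = bisect_right([mn for mn, _ in self.fences], key) - 1   # binary search on fences
        if idx < 0 or key > self.fences[idx][1]:
            return None
        for k, v in self.pages[idx]:                # the single page "I/O"
            if k == key:
                return v
        return None

run = SortedRun([(i, f"v{i}") for i in range(0, 100, 2)])
print(run.get(42), run.get(43))                     # v42 None
```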
Cool.
So something that jumped out to me there whilst you were talking about that,
and we've not really touched on hardware
so far, other than the fact that we talked about how we can kind of improve things by giving
ourselves some more memory and stuff. But has there been any research on exploiting sort of modern
hardware with respect to LSM trees and LSM-based storage engines? Like, are there certain
types of hardware that are more suitable for the characteristics of LSM trees? Great question again. So there have been several ongoing efforts that
are trying to answer this exact question. So the LSM operating principle, which is
out-of-place, as I often refer to it, which is you do not really update things in place or delete things in place, is very
similar to how modern SSDs work. So if you think about SSD operating principles, you have small
blocks called erase blocks in the device, and you could write to these erase blocks, but if you want
to update something, you cannot really go in place and update it.
So you have this garbage collection routine that has to kick off.
But before that, you write the new entry, the updated entry in a different erase block.
And LSM trees are very much similar.
So you have to update an entry.
You do not go in place and change it.
You just insert a new entry, which is somewhere up top in the tree. So these two operating principles are very
similar, and one of the key lines of research that many groups, and even we,
are focusing on at this point is how we can take advantage of these
device principles, like garbage collection and writing in an out-of-place manner, and make use of
that in LSM itself so that we can reduce the overall write amplification of the system
that is the LSM running on SSD because write amplification is caused at both layers.
Every time you do not make an in-place change, you are inducing write amplification.
The device is doing it, the LSM is doing it.
But can we combine these two and make something better?
So one of the key lines of research is how we are trying to exploit the device's capabilities
and make LSM-based storage engines even better.
Amazing. Some more interesting research that's going to happen there for us all
to catch up on and read when that comes out.
Brilliant stuff.
So, cool. So let's now talk about the performance trade-off between reads and writes.
We've been treating them somewhat in isolation so far.
So how do I go about, I've got some workload and I don't know,
it's got some proportion of reads and proportion of writes.
How do I go about picking the optimal design for my LSM tree, my LSM-based storage engine?
What do I, how do I go about doing this?
So like any data structure, LSM trees are bound by the RUM conjecture.
So what is the RUM conjecture?
R-U-M stands for read, update, and memory, respectively.
So the idea is very simple.
You cannot have it all.
So you can try to improve two of the three axes,
performance-wise, and then you must take a hit
on the third axis, and LSM trees are no exception.
LSM trees are great on the U axis,
which is update, which is write. But if you want to make them better for reads,
that is the R-axis, you must take a hit along the M-axis,
which is memory.
And this is why I keep on mentioning,
we do make reads better, but we pay a cost in terms of memory.
But the good thing is that the whole Ellison design
is quite flexible.
And between the read-optimized leveled design and the write-optimized tiered design, LSM trees offer a wide range of hybrid designs, which I talked about.
So in my work, Compactionary, I identified four key design questions that we always end up answering within the compaction routine.
And depending on how we answer these design questions, we might end up with thousands or even tens of thousands of different LSM designs.
So this goes on to show the rich design space that LSM trees offer.
But at the same time, it also shows that it's a vast design space.
And it is very tricky to navigate
when the design space is so vast.
So the next step in my work, what we did
is that we took an approach that was based on modeling
and extensive experimentation.
So we ended up running more than 2,000 experiments
by varying the workload composition, the workload
distribution, and the LSM tuning.
And we tried to understand the impact of each
of these four primitives, the four compaction primitives,
in isolation and when they are put in together.
So this really helped us understand
which design questions are more important when I'm looking at certain
performance metrics.
So this work we published in VLDB 2021, and it's a great experimental work that I'm very
proud of.
And this really goes on to give the researchers and practitioners the idea about how we can navigate this rich yet complex LSM design space.
So the goal is not always to come up with the best compaction algorithm or the best design, so to say,
but rather with some knowledge about the workload, we can definitely avoid the worst designs.
So we can definitely choose something
that is in the top 10 percentile performance-wise.
And this is what we did in this work.
Amazing.
Yeah, you just kind of said,
these are things you definitely don't want to do.
And this stuff is probably going to be good enough for you
like 99% of the time, right?
Like, yeah, just avoid these terrible things and you're going to be good, right? Yeah, that's
awesome, and we can link that work in the show notes as well, so the interested listener can
go and check that out. So we've touched on various different
possible research directions for LSM trees over the
course of the podcast.
So maybe you could just summarize now kind of what's next on your research
agenda for LSMs.
Sure, sure.
So LSMs are still a green field.
So we have been doing several lines of research here.
But even if I talk about deletes, which
was one of the starting points for me in LSM research, how we can efficiently delete data
from LSM trees, we have done so much work on that, but still, we have a very long way
to go before we can build an LSM-based storage engine that is completely deletion compliant.
So it's still significantly green.
One big problem that I kind of touched upon before
is the write amplification in LSM trees,
or any out-of-place data structure, so to speak.
So if you are not going in and making the change in place
when you have updates or deletes,
you are writing the same data over and over again with this expectation that eventually you will merge the different versions of the data and retain only the latest one.
But in this process, you are actually rewriting invalid data that was not supposed to be in the database in the first place.
But because you did
not delete them, you keep on writing them over and over again. And like any out-of-place data
structure, LSM trees suffer from write amplification. Some of the research
shows that LSM trees end up exhibiting 30x write amplification in practice. For a particular design,
if I'm ingesting one gigabyte of data into my database,
because of the database's internal operations, namely compactions,
it will move 30 gigabytes worth of data, which is huge.
This is one of the key problems that we want to
solve as a community when it comes to LSM-based storage engines.
And of course, another thing that we briefly touched upon is the hardware-software co-design front.
So because the out-of-place data structure is common in the LSM design paradigm,
as well as how the SSDs operate, how can we avoid the multiplicative write amplification from
the application, that is the LSM, and the device, and piggyback both these garbage
collection routines together to bring down the multiplicative cost factor to
something additive? This will be super important in the
future if you want to make LSMs scalable for the large amount of data that we are seeing today.
Fascinating. Loads of interesting directions there.
And I guess there's enough work there to keep you occupied for a lifetime.
Cool. So as kind of a software developer or as a data engineer, how can I leverage your findings about LSM
trees in my day-to-day life?
Absolutely.
It's a great question.
And I believe that impactful research is what helps the developers, the engineers, and the
community as a whole build better and more efficient systems.
So understanding the LSM design space is the first step if I want as an engineer
or as a developer to make my system run faster or perform better, whatever the metrics that
I have at hand. And as I mentioned, in our work we really break down all the
black boxes that we have around the whole LSM design space.
And breaking the black boxes allows us to think more fundamentally.
It allows us, it exposes the fundamental design questions in front of us.
And if we know how each of these fundamental questions, the answer to these questions,
affects the whole performance space, I am really one step closer to my goal of improving performance. So as I mentioned, we are still not there yet
where I can give you the optimal compaction strategy
or optimal LSM design, so to say,
given a workload or performance,
but we can definitely identify the top K best designs
given a workload and target performance. And this is, I believe, the first major step that we have made along
this line of research. And the whole idea is simply, if
we can help the developers or engineers avoid the worst designs, we are already
somewhat there.
And everything else is the last 10% that we have to improve on.
Awesome stuff.
Awesome stuff.
So whilst you've been working on LSMs,
I mean, how long did you say,
when was your first sort of interaction with LSMs?
You said it was around 10 years ago, did you say?
Or have you been on this like five years?
Less than five years ago.
Less than five years.
Okay, cool.
So over that five-year period,
what has been the most sort of like unexpected lesson
that you've learned while working with them?
Absolutely.
I think the key thing that I observed is that
when you are working with large-scale data,
simple tasks can become extremely tricky.
So when I first heard about the problem of deleting data from the storage engine, first of all, I did not know much about LSMs.
But my initial reaction was, how can deleting data be hard?
You just go in there and delete data.
Why should I even care about
making it efficient or whatever? And then you start to understand more about how modern systems
operate, more about the whole out-of-place update and delete paradigm, more about the LSM
paradigm. And you start to think that, yes, it is not at all a difficult task
if you do not have any performance benchmark to meet.
You can literally find your data, go there,
delete it, everything is fine,
but this will be super, super slow.
So the simple operation like deleting data
in an efficient way when your database is super large
can be super tricky.
And once you know that, you smile
internally, but you know that this is going to be quite a journey. Yeah, nice, that's
interesting. Speaking of journeys, over this, of course,
five-year period as well, have there been numerous things that you've kind of tried
that have failed?
Like I guess I'd like to know more about the war stories of like,
oh, man, we went down this dead end for ages
and this thing didn't work.
Maybe the listener might be interested to kind of know about.
Absolutely, absolutely.
So I started working on this efficient data deletion project
when I was in France.
And that was more on the privacy policy level.
We tried to understand what the regulatory requirements were,
what we need as system designers to do
in order to facilitate all these requirements.
And then after I moved on to the US in Boston University,
we really worked on the same project, but now instead of
thinking of it at a high level, we really went down into the systems and we tried to
understand how the systems operate, how we can make deletes efficient. And in the
initial days, I remember traveling to SIGMOD 2019, and before that we had this
idea about building deletion-compliant data
systems. We sent our drafts to so many people from academia for their feedback,
and after we got the feedback, we were very close, really, really close, to giving
up on the project, because people did not really care about deletes whatsoever.
So people cared about ingestion, people cared about queries; deletes, not so much.
And then again, we traveled to SIGMOD, we were on the verge of giving up on the project,
and we spoke with a bunch of people from different companies who are from the industry.
And for them, this was a huge, huge problem that they were facing at that point.
So, because of the regulatory requirements, companies were really struggling to delete data in an efficient manner. And in having these discussions, we
really found quite a few people who were very interested in the project. We understood
ourselves that this is indeed a problem that is worth pursuing. And fast forward four or five years, we have done so
much work on deletes; a version of some of our work is actually running in production RocksDB,
and this is something that makes me super proud. The work has gained significant visibility,
people do care about deletes now, and it all started five years back. But it was not a very smooth journey; it was rather a bumpy one.
Yeah, so SIGMOD 2019 was like a sliding doors moment then, where you actually
realized, I guess it's a prime example of the disconnect that can sometimes happen between
academia and industry, where academia says, okay, this deletes thing is interesting, but then
in the real world, this is a really big problem. So as well as working on LSMs, what other research topics do you have?
Are LSMs the primary vehicle for your research at the moment,
or are there other things you dabble your toes in as well?
It has been one of the primary lines of research that I have been doing.
I mean, in my previous life, I worked on wireless networks, cloud computing and edge computing
and all this stuff, but I won't bore you with that. So, apart from LSMs, in the data
systems community, I have been working... One of the main projects that I have been working
on is how we can index near-sorted data. So to give you a very high-level idea about what this project is about, so if the data
that I have is completely sorted, so I do not have to pay a lot of cost when I construct
the index, because the data is sorted. I might simply not build an index and do a binary
search, which will be logarithmic of N base 2 cost-wise, or I can build a B plus tree on the sorted
data, and this will be still logarithmic of N, but with base F, which is the fanout of the
B plus tree, so it will be significantly cheaper than the binary search.
And because my data was sorted to begin with,
I did not have to pay any cost in sorting the data.
On the other side of the extreme,
if the data is completely scrambled
and I have to insert it on top of my B plus tree index,
I will have to pay a remarkably high cost,
again logarithmic of N per entry, because I
have to insert every entry to my index data structure, which
is fine, because this is the read-write trade-off
that we always play around with.
If my data is sorted, then I do not
pay any cost, which is understandable.
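To put rough numbers on those costs, here is a tiny calculation with assumed values of N equal to one billion entries and a B plus tree fanout of F equal to 256 (both just illustrative):

```python
import math

N = 1_000_000_000   # one billion entries (assumed)
F = 256             # an assumed B+ tree fanout

print(math.ceil(math.log2(N)))         # ~30 comparisons per binary search over the raw sorted data
print(math.ceil(math.log(N, F)))       # ~4 node accesses per lookup in a B+ tree over the same data
print(N * math.ceil(math.log(N, F)))   # ~4 billion node accesses to insert scrambled data one by one
```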
But what happens if my data is nearly sorted but not completely sorted?
If I forget about how modern systems work and if I am in an ideal
world, if my data is almost sorted, I should pay
almost no cost in sorting the data.
The index construction cost should be a function of
how ordered or how sorted my data is.
But if you look at any index data structure, B plus trees, LSM trees, any index data structure for that matter,
the cost of building the index is never a function of sortedness.
There is a binary knob, sure: if the data is completely sorted, there is no cost in terms of constructing the index.
But anything that is not completely sorted, no matter whether it is nearly sorted or completely
scrambled, the cost is always the same.
In this work, we try to build an index data structure, an indexing paradigm, so to say,
that can exploit the sortedness as a resource, especially when we try to constitute or construct indexes.
So if my data is sorted or nearly sorted,
I will end up paying proportionally low cost
in constructing the indexes.
So we published this work in,
I think this is where I met you first, at VLDB in Sydney last year.
And the full version of this work was published in ICDE this year.
And the whole idea is that, using sortedness as a resource, we end up paying less cost when the data has some inherent ordered organization.
Amazing. We can maybe do a podcast on that one, though, because that's another fascinating topic, how we can exploit that additional information to improve
index creation and things. That's fascinating, that's really cool. So
this kind of leads quite nicely into my next question, really, in that how do you go
about sort of generating ideas? What's your creative process? And then how do you decide this is something that we want to work on
for the next five years or however long?
So if you are jumping domains from networks to databases like I did,
the place to start is to talk to people.
Talk to people around you.
Talk to people from industry.
Talk to people from academia.
Understand the practical problems that they are facing and try to work on them. So this is a great starting point.
It has been a great starting point for me. And then with experience, you get to see newer problems
as you work on the existing problems. So this is how you branch off from one of your main projects that you are working on.
And I think selecting projects is always a tricky thing to do.
But in my head, I do believe that there is no failure in research.
So there are several projects that we end up investing some amount of time, maybe a lot of time in some cases,
and we did not see the expected results that we had in mind when we began.
But essentially, there is no failure in research. So if you fail to achieve a target, essentially,
you find different ways about how you should not approach a particular problem. And this is learning.
And if you learn, there is really no failure.
So if it is definitive, then it is not research, in other words.
So there is always this risk, but with the experience, with your knowledge,
you can always make wiser decisions.
But again, if I'm selecting a project that I'm really passionate
about and it does not really work out, it's fine, it's still learning. I absolutely love that
perspective on it. That is a great way to think about it, because I think a lot of the time
it's so easy to get hung up on, I need to publish, publish, publish, and it has to be the new cool
sort of idea, and all those kinds of external pressures. But
viewing it that way, for me, is the way science should be conducted, right? That's
very fundamental, right? If it fails, you've learned something, right? Yeah, so that's
a really nice answer to that question. Cool. So I've just two more questions now.
The penultimate one is quite a big-picture one. It's, what do you think is the
biggest challenge in database slash data management research today? Sure. So in my head, we are time
traveling back to the 80s, right? So we had data, but more than that, we had very small devices. It was very important for us to think about ways how we can manage the small amount of
data that we had, but with the even smaller amount of compute or storage resources that we had
available.
So now, fast forward 45 years, we have remarkably efficient storage and compute units, but
the data has grown exponentially.
We have still the exact same problems.
We have a lot of data and we have to think about every passing day how we can manage
efficiently this huge, huge amount of data that we have, how we can process data at large.
And along those lines, it has become significantly important to understand which data is useful.
So we have a glut of data, but that does not mean that every data object is equally
valuable.
So we have to really find out ways how we can shed off data that is not really useful for our
analyses. This way we can really reduce the load of managing and processing data.
Another problem that we have is protecting the privacy of data, and this
is something that I have been working on for about five to seven years now. And the
idea is that there is no way for an end user today
to really be able to understand where the stream
of his or her personal data has flown.
So who has their personal data and contact tracing,
data tracing are some of the lines
that people have been working on to find this out.
But again, overall, as I mentioned,
simply deleting the user data on the user request
in an efficient manner is a huge challenge. And I do believe, fundamentally, there is a
trade-off between privacy protection and performance. So if you are really willing to
build a system that can offer all the privacy guarantees that you are looking at, then you have
to take a hit on performance.
So how we navigate this performance privacy trade-off is one of the biggest challenges
that we have today in the community.
Finally, I think what also comes to my mind is moving toward greener compute frameworks.
So in the era of ChatGPT and deep learning, the amount of data that we are processing every second is remarkable.
And with so much energy, so many resources being burnt every single second, surely as the data grows, we have to think about greener compute frameworks. So I really want to invest a significant amount of my time
in this particular line of research,
because this is, this is not sustainable.
So we have to really move to something
that is more sustainable.
Yeah, I totally agree.
All those dimensions there for sure.
The privacy aspect of it,
the energy efficiency aspect of it as well.
Cool.
So yeah, it's time for the last word now.
So what's the one takeaway
you would like the listener
to take away from this podcast today?
Sure.
I'll keep this one short.
I think the key takeaway would be
at the end of the day,
the performance of all these large data systems that process petabytes of data, it really boils down to the data structures and access methods that are used in the storage layer.
And the key to improving performance and building efficient data systems is to really understand how the underlying data structures operate.
And once you have a solid knowledge of that, you can really build a storage engine that
serves your workload the best.
So going back to the fundamentals, getting the basics correct, as we say in cricket.
Yeah, that's a great way to end it. So, fantastic. Thank you so much for coming on. It's been fantastic to talk to you. If the listener is interested in knowing more about Subhadeep's work, we'll put links in the show notes. And if you enjoyed the show, please consider supporting the podcast through Buy Me a Coffee. It really helps us cover all of our costs, and we'll see you all next time for some more awesome computer science research.