Disseminate: The Computer Science Research Podcast - Raunak Shah | R2D2: Reducing Redundancy and Duplication in Data Lakes | #59
Episode Date: October 28, 2024

In this episode, Raunak Shah joins us to discuss the critical issue of data redundancy in enterprise data lakes, which can lead to soaring storage and maintenance costs. Raunak highlights how large-scale data environments, ranging from terabytes to petabytes, often contain duplicate and redundant datasets that are difficult to manage. He introduces the concept of "dataset containment" and explains its significance in identifying and reducing redundancy at the table level in these massive data lakes, an area where there has been little prior work.

Raunak then dives into the details of R2D2, a novel three-step hierarchical pipeline designed to efficiently tackle dataset containment. By utilizing schema containment graphs, statistical min-max pruning, and content-level pruning, R2D2 progressively reduces the search space to pinpoint redundant data. Raunak also discusses how the system, implemented on platforms like Azure Databricks and AWS, offers significant improvements over existing methods, processing TB-scale data lakes in just a few hours with high accuracy. He concludes with a discussion of how R2D2 optimally balances storage savings and performance by identifying datasets that can be deleted and reconstructed on demand, providing valuable insights for enterprises aiming to streamline their data management strategies.

Materials:
SIGMOD'24 Paper - R2D2: Reducing Redundancy and Duplication in Data Lakes
ICDE'24 - Towards Optimizing Storage Costs in the Cloud
Transcript
Hello and welcome to Disseminate, the computer science research podcast.
As usual, Jack here.
Today is another episode of our ongoing Cutting Edge series.
And today I'll be talking to Raunak Shah about his paper recently published at SIGMOD,
R2D2: Reducing Redundancy and Duplication in Data Lakes.
Raunak is currently doing his master's at the University of Illinois
Urbana-Champaign.
Cool. Anyway, so let's dive straight in then. So it's always customary on the podcast for the
guests to tell us a little bit more about themselves in their own words and how they
got interested in research and yes, specifically database research. So yeah, what's your story,
Raunak? Yeah. So I mean, my first time getting into research was during my undergrad. At that time, I was more into machine learning research and perception, that sort of thing.
And later on, like, I worked at Adobe for about two years.
And then I wanted to try something more deterministic.
So then I moved sort of into data and systems.
And since then, I've sort of been more
in that area even now, at UIUC and in my current internship at LanceDB.
Nice, cool. What's the master plan then? Are we gonna, is it PhD on the horizon, or are you going to go back into
industry, go work at LanceDB maybe? We'll see.
Yeah, I think so. It's a thesis master's, but I'll probably go back into the industry. That's the idea at the moment.
That sounds cool. So let's talk about R2D2 then, the paper you recently published at SIGMOD. But to do that,
let's set a little bit of context for the chat for the listeners. So data lakes are really important for this paper, obviously.
So can you maybe start us off by telling kind of what the heck even is a data lake?
Yeah, so I mean, basically, you know, very large data stores.
Nowadays, you have huge companies processing huge amounts of data.
And specifically in data lakes, you can store both structured and unstructured data. And maybe for the context of
like what we're talking about, like the work that I presented at SIGMOD. So most of the data was
stored in like the enterprise Azure data lake, and it was in parquet format. So Adobe gets like
hundreds of customers, all of this data, petabyte scale. And typically the access points were like through
cloud. So Apache Spark, that sort of thing. And there's huge scope for optimization in such a
landscape. Yeah. So kind of going off that then, I guess you said there's plenty of room for
optimization. So kind of what are those areas for optimization? I guess kind of the question I'm
getting at is what are the problem or the problems that kind of come with data of optimization? I guess kind of the question I'm getting at is what are the problems
that kind of come with data lakes?
And I guess that kind of would then lead naturally
into what the goal of your work was with R2D2.
Yeah, so of course there's a lot of room for improvement.
In fact, we had presented another paper at ICDE
the previous year that was focused on optimizing storage costs.
This is more focused on like reducing
the redundancy and duplication. But basically, like, I can maybe talk about two problems that
this work focuses on in more detail. So one is, you know, because you have such large amounts
of data, then data regulation becomes very important. So you have things like GDPR policies and all. So if you
don't manage your data properly, then you get fined. And that can be like millions or billions
of dollars. And because of this, companies try to ensure compliance. And because of that, you can
end up with higher data maintenance costs. For example, you might have to do a full table scan
to delete a row from a table or to find some user's data in the entire table.
And full table scans are very expensive, especially at such a large scale.
Another thing is you can have a lot of redundant and untracked data.
In many cases, which is similar to what we were dealing with,
you can have a lack of data lineage information.
So you don't know how,
you might have a lot of duplicates,
but you don't know where those duplicates came from
or what the root data set was.
And that makes it even harder to, you know,
maintain compliance with regulations
and hard for users to figure out
what data should be deleted after a certain point.
Nice, yeah, so there's two of them.
So there's the regulation aspect
and yeah,
adhering to GDPR
because those fines,
like you said,
are quite large, right?
And they're willing
to give those fines out as well,
which is good,
which means the fines
are not just a paper tiger, right?
So, yeah.
And obviously,
there's this possible
untracked data,
lack of lineage.
Yeah, and where the heck
did the data come from?
So R2D2 is going to help us
solve these problems,
right?
So walk us through, I guess, what R2D2 is then.
And at some point, we're going to need to get into the name
and how it ended up being called R2D2 as well.
But yeah, walk us through the high level and the key components
of what you've devised to sort of help this problem of redundancy
and duplication in data lakes?
Yeah, so there's many ways you could approach the problem. One thing
we did is model redundancy as tables that are completely
contained within another. So that means both a subset of the schema and a subset of the rows.
So if you have a table which is a subset in both these categories,
then it's completely contained in a bigger table.
And that's a clear case of redundancy.
Of course, you can also have overlapping tables,
and we also address that case in the paper.
But most of the paper focuses on this case of
complete containment. And the main naive solution you would think of when you want
to identify duplicates is hashing the
rows, or something like that, and then comparing the hashed rows of multiple tables,
right? But the problem with that is it gets very expensive,
especially when you have tables at such large scales.
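To make that cost concrete, here is a minimal sketch of the naive hashing idea in plain Python. The toy tables and helper names are illustrative, not from the paper; the point is that every row of both tables has to be hashed and compared, which is what breaks down at terabyte scale.

```python
import hashlib

def row_hashes(rows, columns):
    """Hash every row (a dict), projected onto the shared columns."""
    hashes = set()
    for row in rows:
        key = "|".join(str(row[c]) for c in columns)
        hashes.add(hashlib.sha256(key.encode()).hexdigest())
    return hashes

def naively_contained(small_rows, big_rows, columns):
    """True if every projected row of the smaller table also appears in the bigger one.
    The cost grows with the total number of rows, which is what makes this
    impractical at data lake scale."""
    return row_hashes(small_rows, columns) <= row_hashes(big_rows, columns)

# Toy illustration: table_b is completely contained in table_a.
table_a = [{"id": 1, "country": "US"}, {"id": 2, "country": "IN"}, {"id": 3, "country": "FR"}]
table_b = [{"id": 1, "country": "US"}, {"id": 3, "country": "FR"}]
print(naively_contained(table_b, table_a, ["id", "country"]))  # True
```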
So we brainstormed about several approaches
that we could use to do this in a more scalable way.
And we wanted this to work at an enterprise level.
We wanted it to work in the real world.
So that did limit some of the things that we could do. But finally, we came up with an
approach that focused more on identifying which tables are not duplicates, and sort of
pruning the space of possibilities in that way. So at a high level, we divided our pipeline into
four stages. So first, we identified schema level containment. So now you know, you have a graph of
tables, where an edge from table A to table B means that the schema of table B is contained within the schema of table A.
But of course, there's a lot of false cases in this.
You need to eliminate many cases to identify the real duplicates.
So then we used statistical pruning to eliminate more edges from this graph.
And finally, we looked at the actual contents in content-level pruning,
which is the third step.
And that really reduced the number of edges in the graph.
And the final step is,
like you recommend data sets for deletion
from the graph that's left over.
And that's the optimization step.
So yeah, that's at a high level.
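To make the shape of that pipeline concrete, here is a rough, self-contained Python sketch. The toy table representation, function names, and sample size are illustrative choices, not the paper's Scala/Spark implementation; it only shows how each stage takes the edges left by the previous one and prunes them before the final recommendation step.

```python
import random

def make_table(name, rows):
    """Toy table representation: name, column set, per-column (min, max), and rows."""
    columns = set(rows[0])
    stats = {c: (min(r[c] for r in rows), max(r[c] for r in rows)) for c in columns}
    return {"name": name, "columns": columns, "stats": stats, "rows": rows}

def schema_containment_edges(tables):
    """Stage 1: edge (A, B) means B's columns are a subset of A's columns."""
    return {(a["name"], b["name"])
            for a in tables for b in tables
            if a is not b and b["columns"] <= a["columns"]}

def statistical_pruning(edges, by_name):
    """Stage 2: drop edges where per-column min/max metadata already rules out containment."""
    def possible(a, b):
        return all(by_name[a]["stats"][c][0] <= by_name[b]["stats"][c][0]
                   and by_name[b]["stats"][c][1] <= by_name[a]["stats"][c][1]
                   for c in by_name[b]["columns"])
    return {(a, b) for a, b in edges if possible(a, b)}

def content_pruning(edges, by_name, k=100):
    """Stage 3: sample rows of the candidate contained table and look them up in the other.
    (A real system would probe an index instead of materialising the big table's rows.)"""
    def survives(a, b):
        cols = sorted(by_name[b]["columns"])
        big = {tuple(r[c] for c in cols) for r in by_name[a]["rows"]}
        sample = random.sample(by_name[b]["rows"], min(k, len(by_name[b]["rows"])))
        return all(tuple(r[c] for c in cols) in big for r in sample)
    return {(a, b) for a, b in edges if survives(a, b)}

def recommend_deletions(edges):
    """Stage 4: surviving edges are containment candidates; the contained table
    (the edge target) is a candidate to delete and reconstruct on demand."""
    return sorted({b for _, b in edges})

# Tiny usage example with made-up tables:
tables = [make_table("users_all", [{"id": 1, "age": 30}, {"id": 2, "age": 41}]),
          make_table("users_copy", [{"id": 1, "age": 30}])]
by_name = {t["name"]: t for t in tables}
edges = content_pruning(statistical_pruning(schema_containment_edges(tables), by_name), by_name)
print(recommend_deletions(edges))  # ['users_copy']
```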
Nice, cool.
Yeah, so to run through those real quick.
So the first one is sort of the schema level, where you're building this graph with edges between the two tables.
How much of that is sort of a manual process, and how much are you able to automate?
Because, I don't know, table A may be contained in table B, for example,
but how easy is it to deduce that when a column
might have a different name, for example, but actually be the same thing? There's sort of a
semantic mapping you need to make there as well, right?
Yeah, so this is more going
into the case of approximate containment. So if you know that two tables are talking about the
same content, if you have that domain knowledge, then you can make that kind of mapping. But it's
important that the tables are talking about the same thing. So if you have the domain knowledge,
then you can do that. But for the way we do the schema step, it's not enough to just
look at each pair of tables one by one, because that could also take time. So we use a clustering type of algorithm
to create an approximate graph, by processing multiple tables sequentially.
Yeah, so after that you get the statistical pruning as number two, and then the content level
is the last step. When we go into the evaluation here,
I'm really interested to see how many edges you were able to prune at each of these steps. So
before we do go into that, set the scene for us in terms of how you went about actually
evaluating R2D2. What was the setup, what data did you use, and what
experiments did you run to evaluate the performance of this nice pipeline you developed?
So we had a bunch of customer data at the scale
of terabytes, and we also used some synthetic data, so we got a good balance of data across
megabytes to gigabytes to terabytes, and we wanted to, you know, evaluate our pipeline on each of these
cases. So one thing you could easily use for evaluation is figure out the number of edges in
the containment graph after each step, and see how the number of edges is reducing and
how much you're able to prune the graph. So that's one metric.
You can look at the number of incorrect edges.
And the way we've designed the pipeline is such that you'll always capture all of the
correct edges so you won't miss any cases of containment.
So we're more focused on pruning the wrong edges than figuring out the right ones because
all of them are right. And another thing that's very important is the time the pipeline takes at each step
and the number of row level operations that happen at each step.
So we also compare on those metrics.
And finally, we also run scale tests.
And finally, for the optimization, we check the number of GDPR row scans that we can save
and approximately the amount of dollar cost savings,
and how that varies with the percentage of containment.
So yeah, of course, not exact numbers, but just projected.
Yeah, yeah, cool.
So yeah, tell us what the results were then.
How good did R2D2 perform then?
Run through these metrics you tracked here.
I think, in terms of the steps, the schema step is the first step.
So of course you have a lot of edges left after it,
but that sets the stage for the rest of it.
And then I think the statistical pruning
is extremely effective
because you don't have to process
any rows.
You can just read the metadata and use that to eliminate more than half of the edges in
the graph.
And finally, when you do look at the content, even that is relatively cheap because all
you're doing is sampling.
And if you have an index database, then even the sampling is a lot faster.
So in this way, after all of that,
you're left with very few edges. For example, in a terabyte dataset, you might have
a hundred incorrect edges left. And after that, when you recommend
datasets for deletion, it's a visual thing, right? So the user can look at
the data that's recommended for deletion.
If they don't want to delete it,
it's feasible enough for them to, you know, decide not to.
And there are more ways you can optimize this.
Like, we've discussed some of that in the paper.
We didn't implement, like, a completely optimized version.
But, yeah.
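As a concrete illustration of why the statistical step is so cheap: Parquet files already carry per-column min/max statistics in their footers, so an edge can often be discarded without reading a single row. Here is a small sketch using pyarrow; the choice of library and the file paths are illustrative, while the paper's system does the equivalent at scale in Scala/Spark.

```python
import pyarrow.parquet as pq

def column_min_max(path):
    """Collect per-column (min, max) from Parquet footer metadata only -- no row data is read."""
    meta = pq.ParquetFile(path).metadata
    stats = {}
    for rg in range(meta.num_row_groups):
        for ci in range(meta.num_columns):
            col = meta.row_group(rg).column(ci)
            if col.statistics is None:
                continue
            name = col.path_in_schema
            lo, hi = col.statistics.min, col.statistics.max
            if name in stats:
                stats[name] = (min(stats[name][0], lo), max(stats[name][1], hi))
            else:
                stats[name] = (lo, hi)
    return stats

def minmax_rules_out_containment(small_path, big_path):
    """If any shared column of the smaller table has values outside the bigger
    table's [min, max] range, the containment edge can be pruned immediately.
    Assumes the shared columns have comparable types."""
    small, big = column_min_max(small_path), column_min_max(big_path)
    return any(name in big and (lo < big[name][0] or hi > big[name][1])
               for name, (lo, hi) in small.items())

# Usage (illustrative paths): prune the edge big -> small if this returns True.
# minmax_rules_out_containment("/lake/users_subset.parquet", "/lake/users_full.parquet")
```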
Yeah, you've always got, like, a human in the loop at the end as well
who's actually making the final decision
to be like, okay, yeah, I want to delete this or not, right?
So there's always that safety there.
It's only a recommendation, I guess, in that sense.
Like it's not sort of making decisions and deleting data without sort of giving the human sign off, right?
Yeah, I mean, that could be dangerous, right?
Deleting customer data without their approval.
Of course, I mean, at max, you can go and archive the data
and move it to a lower tier of
storage, but even that, you need permission for. So yeah.
So, out of these different steps of the pipeline, the statistical pruning was the one that had
the most bang for the buck then? That was like 50% of the edges, you said, that's
what could get removed on a terabyte? I think statistical and then the content level,
but everything depends on everything.
So I mean, you can't have, like, for example,
if you didn't do the content level,
then the optimization wouldn't make any sense.
Yeah, so you hinted there that you've identified
this isn't like a complete solution.
It's not the final sort of package.
There's a few areas for future research
so, kind of taking that question then, where do you go next with R2D2? What are those directions that
you think would be worthwhile exploring?
So, um, one is the approximate containment case, so you have
tables which are overlapping. And in that case,
statistical pruning will not work. But you can still apply content level
pruning because you can still use sampling. But you need to design solutions that still keep the
complexity low. And it's an interesting problem, but it's hard. So that's one potential direction
for sure. Yeah. Otherwise, I think this is less of a
research direction, more like engineering optimizations in terms of how you do the
sampling. So I talked about this actually in our talk at SIGMOD that, I mean, you could choose to
do sampling from both tables in intelligent ways instead of just sampling from one table.
So that would really help. And if you have
an indexed database, which is the case for most enterprises, then it really reduces the overhead,
because you might have billions of rows, but you're only comparing hundreds
of rows with each other, and you can easily use that to eliminate an edge.
Yeah, for sure, leverage that information. That's definitely a promising direction to go in.
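A tiny sketch of why only a handful of row comparisons are needed once an index is available. The dictionary-style index here just stands in for the database's own index, and the two-sided sampling Raunak mentions would build on the same kind of check.

```python
import random

def build_index(big_rows, columns):
    """Stand-in for a database index over the bigger table, keyed on the shared columns."""
    return {tuple(r[c] for c in columns) for r in big_rows}

def edge_survives(small_rows, big_index, columns, k=100, seed=0):
    """Probe k sampled rows of the candidate contained table against the index.
    A single missing row is enough to eliminate the containment edge, so even
    billion-row tables only cost about k lookups here."""
    rng = random.Random(seed)
    sample = rng.sample(small_rows, min(k, len(small_rows)))
    return all(tuple(r[c] for c in columns) in big_index for r in sample)
```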
I guess let's talk about impact
for a second,
because you said there
that a lot of the work going forward
would be sort of getting
this production grade
and getting it kind of to a level
where it could be deployed
in a production level system.
And so obviously,
this is kind of work
you've done with colleagues at Adobe.
Has this work already found some impact at Adobe internally?
Has it been used in any sort of production system yet?
Like what's the impact been like?
I'm not sure if it's been like fully deployed in production.
Like that does take a few years and it can be a long process.
But I had implemented this in Scala and Spark, so the whole pipeline
can be run as a Spark job. So, I mean, in terms of actually getting it production ready, I
don't think that's difficult. I think more of the difficulty is in getting, you know, buy-in from users
and customers, to get them to agree to process their data in this way, and then for them to agree
to, you know, this recommendation for deletion. I mean, it can be hard to pitch that to
customers in a bigger company. But yeah, I'm sure that this definitely does have a lot of
scope across the industry, and I'm sure there are people who would be interested.
Yeah, for sure. How easy do you think it would be now
to sort of like,
as a, I don't know,
someone who's got some data stored in,
I don't know, Azure, whatever, and I want to
basically run your tool, how easy
would it be to take what's there now? I mean, I presume
it's kind of open source, available on GitHub.
Could I take that, what you,
the code you've written, and use it? Would that
be pretty easy to do, or would it be a bit painful maybe? How easy would it be to use it?
So, I mean, the access point for data was Azure Databricks, and we use Spark, so I don't think it's very
difficult. You just load the data in, and after that you just run the Spark job on it. I mean, there's no other dependency, really.
Maybe the data format, depending on that,
you might have to change a few fields in the schema and all.
But yeah, unfortunately, though, the code is not open source
because it was written while I was working at Adobe,
so that becomes proprietary.
But all of the algorithms that I described,
they're all described in the paper in a lot of detail.
And it's like nothing is left out.
So it's easy to implement.
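Since the original code is proprietary, here is a hypothetical sketch of what the glue could look like when, as described above, the data is Parquet and Spark is the access point. The paths and table names are made up, and only the first (schema-level) stage is shown; a re-implementation of the later stages would run as further Spark jobs.

```python
from itertools import permutations
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("r2d2-style-scan").getOrCreate()

# Illustrative table names and paths; in the paper's setting these were Parquet
# tables in an enterprise Azure data lake, accessed through Azure Databricks.
paths = {
    "profiles_full": "/mnt/lake/profiles_full",
    "profiles_eu": "/mnt/lake/profiles_eu",
}
tables = {name: spark.read.parquet(p) for name, p in paths.items()}

# Stage 1 (schema containment) as a taste of how little glue is needed; the
# statistical and content-level stages would follow the same pattern.
for a, b in permutations(tables, 2):
    if set(tables[b].columns) <= set(tables[a].columns):
        print(f"schema of {b} is contained in schema of {a}: candidate edge {a} -> {b}")
```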
Yeah, that's cool.
That's really nice that we can recreate it if we need to, right?
How long was the implementation effort then?
Was it quite a challenging thing to implement?
I think getting the scalability stuff in was a bit of a challenge, because you need to think about what kind of operations you're doing and optimize the performance at each step.
So a lot of it was an engineering challenge and a lot of it was iterative.
So it wasn't like we came up with the whole pipeline in the first place.
We had to come up with a solution that's simple and scalable, and it was a product of several
months to a year of effort, of course going on in parallel with other projects.
Nice, cool. So I always like to ask where the idea for this
project or paper actually came from originally. So yeah, what's the story behind this?
Who had the idea to be like, yeah, this is something that we should do and we should work
on? What was the story there?
Yeah, so like I said, right, we had this paper at ICDE, and there we
realized the storage and compute costs for these data lakes are really high. So there we focused more
on multi-tiering and compression and partitioning to optimize those aspects.
But naturally, once that was done, we thought, what other problems can we address?
We had discussions with a lot of engineers and product managers, and duplication was one thing customers had brought up.
And data regulation, for sure, was a real issue. So we sort of put these two use cases together
and thought, like, what if we could...
And lineage was not, like, there at the time.
Nowadays, you have...
Many times you have, like, automatic lineage detection as well.
But at the time, like, Databricks wasn't offering that either.
So, yeah, I guess that was the root of the idea.
Nice, that's really cool.
You also mentioned earlier on, when we were chatting, that you didn't arrive at this sort of three-step pipeline immediately.
So there was some iteration there.
So maybe you can kind of tell us a little bit about some of the things you tried that were sort of unsuccessful.
And you're like, oh, OK, that's not the best idea.
I'm going to do it a different way. Yeah, tell us more about those explored paths of the journey,
because the journey is never linear, right? There's always ups and downs. So yeah, tell us about some of
the downs.
I mean, okay, I have to remember, but I think one of the first things we
did was trying to find duplicates, and we very quickly realized that that's probably
not going to work that well. And then we moved to a pruning approach, and even there, I think the first
schema algorithm we had was more inefficient. And with the content-level pruning as well, we were
unable to figure out how you can do this without processing a large number of rows. So a lot
of the initial approaches were not that scalable. Many times it's like, is this even possible?
Because you don't know if it is. But that's research, and in the end we built
something that I think can be used in a real-world system as well. But that's the thing,
right? As software engineers, you might not be willing to spend that much time
exploring the area, because there's no guaranteed return. But as a researcher, I was part
of the research team at the time, so it was easier for us to, you know, keep going at it, and eventually we got something
which was good.
Yeah, for sure. I mean, you're right there. When you're in
a research sort of domain, in a research context, you've got a lot more
scope to try things and keep failing until maybe you get something that works eventually.
But when you're a software engineer working on a production system, the incentives
and the objectives are a little bit different, right? So you don't have as much
freedom: okay, if it didn't work the first time, forget it, we're going to do something else, right?
Whereas it's nice in the research world that you can iterate, like, there's definitely
something here, it might not work, but it feels promising, and you can sort of keep at it
and get there in the end. Yeah, I guess whilst we're talking about that journey from start to finish,
how long, we touched on the implementation effort,
it's been like kind of a couple of months to sort of do.
Between the conception of the original idea,
how long did it take to arrive at the sort of the SIGMOD paper?
I think about a year, yeah.
And of course, it goes through like a round of reviews
and you have to rebut it.
And that also helps improve.
But in the duration, like between the review and the rebuttal,
there is another two months.
So, yeah.
Okay.
So it's like kind of like a 12-month time line.
That's pretty cool.
So, yeah, I mean, I always like to find out as well,
like if there are any sort of surprises that you've sort of found on this journey of working on this project was
there something that sort of jumped out and you're like, huh, did not expect to learn that whilst
working on this? Or, this was something really cool I didn't really think was going to be a thing, and
it was a thing?
Yeah, I guess you do have this idea sometimes that the things researchers work on are, you know,
hard to implement in actual production, and a lot of it, many times, it's a trade-off.
One thing I realized is that just because a solution is simple, it doesn't mean that it's not
good. And there is a trade-off between triviality and practicality in some sense.
It's difficult to find something that's practical and also novel.
But I think we were able to do that.
And I think that's, I felt good about that.
And I think I like that nowadays, like we can see more research, especially in data
and systems that is like this.
I hope that continues.
Yeah, I completely agree with you. You know, that's
a common theme that a lot of people I've been chatting to recently have
had as well: their solution, whatever they've come up with,
has been, sort of, not trivial, but simple, right? It's been a simple,
well-formed idea that solves the problem clearly.
And those ideas are so much easier to sort of,
well, to sell to industries,
well, and get people to actually go and implement
because they can reason about them, right?
They're very explainable, they're very understandable.
It's like, okay, I can reason about this in my brain.
Okay, I've got some confidence in this.
Whereas a big complex sort of fancy solution,
maybe, yeah, people sometimes conflate complexity
with quality.
And those two things don't correlate, in my view.
They shouldn't do anyway.
But so I completely agree with you there.
And so, yeah, that's a really, really good point.
And cool, yeah.
So, speaking of ideas, simple ones and complex ones, how do you go about
generating ideas? Do you have an approach for it? And then once you've got those, how do
you go, okay, that's the one that I'm gonna bang my head against the wall for for the next 12 months
to get to work? So yeah, it's about the creative process. What's your approach to
it?
I mean, I think it's first about finding a pain
point. Where there's enough of a pain point, usually people
are willing to invest resources, and resources are important whenever you work on a problem, because
you need resources both in terms of people as well as, you know, people willing to invest money and time. And so once you have that,
then you can think about
what scope of the problem you want to tackle.
And usually, it's hard to tackle the whole problem at once.
So if you can figure out a certain niche
that's feasible enough
and seems practical at the same time,
given your timelines and everything,
then you can get into the actual technicalities,
like approaching it from different angles,
figuring out, you know, the problem-solving process.
And once you've, I think after a bit of exploration,
you get the idea of how much scope this has,
how much time it's worth investing into,
like you want to put more into it or not
and after that, I suppose, yeah, you can choose to invest more resources, spend more time, and
eventually it either works or it doesn't. But that's the process.
Yeah, that's a nice approach to the process. It says a lot.
So getting incentives aligned, I guess, would be a way of phrasing it
and kind of getting that buy-in, solving a pain point,
and then that normally resources follow
and you need resources to solve a problem, right?
So yeah, it's a practical way of viewing that, sure.
Cool, yeah.
I also kind of want to ask you about what you're working on at the moment.
So obviously R2D2 is done and dusted, the paper's published, best paper award maybe in 10
years, we don't know, right? Fingers crossed. But yeah, what's the next thing? Have
we got another Star Wars themed project on the go?
Well, not Star Wars, but, um, so I guess R2D2 was higher up in the stack.
I was working more with Python, Spark, PySpark, Scala. And I guess now at UIUC, I'm trying to understand more about different parts of databases, distributed systems and all.
And also exploring lower in the stack.
So at UIUC, I was working more on like file formats and that
also aligns with my current internship at LanceDB. So over the summer, I'm trying to develop
and implement new encodings, so that's more the work, basically speeding up compression and
random access. So the work is more around Rust, Python, that sort of
thing. And I think it's very interesting, also exploring a different part of the same space, but
I'm learning a lot, and it's very exciting because it's all, like, the future.
Nice, yeah. So I'm
not really familiar at all with LanceDB, so maybe you can just give me the rundown
on LanceDB, and yeah, what's the problem it's trying to solve?
I mean, basically, you have
AI/ML workloads, right? So it's sort of like a database, and there's a file format
underneath which is better designed to support these multi-modal AI workloads. And now you have so much data processing that you have to do
with these LLMs and all. So in terms of vector storage, in terms of speeding up retrieval of
the data as well as processing, this becomes very important. So they're designing like a better data
format, and trying to beat Parquet on many metrics. It's already beating Parquet on many metrics,
actually. But yeah, it's a work in progress, and it's pretty exciting.
That's awesome, yeah, that's really cool. Before we do move on to the last word and wrap up the
chat: the name R2D2. Who decided on the name and why? We need to get that explained.
R2D2, I think, so we generally, we like to have,
I think my senior came up with it,
but we like to come up with names for our models,
which are memorable.
And so that's why we try to choose the words in a way that you can abbreviate it in a cool way.
So Reducing Redundancy and
Duplication in Data Lakes, you know, natural extension.
Yeah, no, that's awesome. Honestly, naming things is
a hard thing, but I like it when papers have a cool name as the title. It makes it
miles more memorable, right? So yeah, that's awesome. And I guess, with that,
it's time for the last word. What's
the one takeaway you want the listener to get from this chat today?
I think just that, right? That, um,
it's important to focus on a general idea, and simplicity is often something which is very
underrated, but it can be very useful for solving difficult problems as well,
and simple solutions are often easy to implement in the real world. And I think this is something to note
both for researchers as well as for people in the industry, because on both sides there's a
different perspective as to what's easy to put into the product, or whether it's worth
investing in research to make it, you know, good enough to go in the product.
But I think, yeah, basically that's the message I would say.
Yeah, I mean, that's a great message, Raunak. So yeah, thank you very much for chatting with me today. It's been a
great chat.
I'm sure the listener will have enjoyed it as well.
And where can we find you on social media, anywhere like that?
Where can people reach out to you?
What's the best way?
I'm on LinkedIn.
Cool.
Well, drop your LinkedIn in the show notes,
and we'll put all the papers we've talked about today and everything,
all the materials in the show notes as well.
And, yeah, the listener can reach out to you through that.
And, yeah, I guess we'll see you all next time for some more awesome computer science research