Disseminate: The Computer Science Research Podcast - Raunak Shah | R2D2: Reducing Redundancy and Duplication in Data Lakes | #59

Episode Date: October 28, 2024

In this episode, Raunak Shah joins us to discuss the critical issue of data redundancy in enterprise data lakes, which can lead to soaring storage and maintenance costs. Raunak highlights how large-scale data environments, ranging from terabytes to petabytes, often contain duplicate and redundant datasets that are difficult to manage. He introduces the concept of "dataset containment" and explains its significance in identifying and reducing redundancy at the table level in these massive data lakes, an area where there has been little prior work.

Raunak then dives into the details of R2D2, a novel three-step hierarchical pipeline designed to efficiently tackle dataset containment. By utilizing schema containment graphs, statistical min-max pruning, and content-level pruning, R2D2 progressively reduces the search space to pinpoint redundant data. Raunak also discusses how the system, implemented on platforms like Azure Databricks and AWS, offers significant improvements over existing methods, processing TB-scale data lakes in just a few hours with high accuracy. He concludes with a discussion on how R2D2 optimally balances storage savings and performance by identifying datasets that can be deleted and reconstructed on demand, providing valuable insights for enterprises aiming to streamline their data management strategies.

Materials:
SIGMOD'24 Paper - R2D2: Reducing Redundancy and Duplication in Data Lakes
ICDE'24 - Towards Optimizing Storage Costs in the Cloud

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, the computer science research podcast. As usual, Jack here. Today is another episode of our ongoing Cutting Edge series. And today I'll be talking to Raunak Shah about his recent SIGMOD paper, R2D2: Reducing Redundancy and Duplication in Data Lakes. Raunak is currently doing his master's at the University of Illinois Urbana-Champaign. Cool. Anyway, so let's dive straight in then. So it's always customary on the podcast for the
Starting point is 00:00:50 guests to tell us a little bit more about themselves in their own words and how they got interested in research and, yes, specifically database research. So yeah, what's your story, Raunak? Yeah. So I mean, my first time getting into research was during my undergrad. So then I, at that time, I was more into machine learning research and perception, that sort of thing. And later on, like, I worked at Adobe for about two years. And then I wanted to try something more deterministic. So then I moved sort of into data and systems. And since then, I've sort of been more in that area, even now at UIUC and in my current internship at LanceDB. Nice, cool. What's
Starting point is 00:01:36 the master plan then? Are we gonna, is it a PhD on the horizon, or are you going to go back into industry, go work at LanceDB? Maybe, we'll see. Yeah, I think so. It's a thesis master's, but I'll probably go back into the industry, that's the idea at the moment. That sounds cool. So let's talk about R2D2 then, and the paper you recently published at SIGMOD. But to do that, let's set a little bit of context for the chat, for the listeners. So data lakes are really important for this paper, obviously. So can you maybe start us off by telling us kind of what the heck even is a data lake? Yeah, so I mean, basically, you know, very large data stores. Nowadays, you have huge companies processing huge amounts of data.
Starting point is 00:02:19 And specifically in data lakes, you can store both structured and unstructured data. And maybe for the context of, like, what we're talking about, like the work that I presented at SIGMOD, most of the data was stored in, like, the enterprise Azure data lake, and it was in Parquet format. So Adobe gets, like, hundreds of customers, all of this data, petabyte scale. And typically the access points were, like, through the cloud. So Apache Spark, that sort of thing. And there's huge scope for optimization in such a landscape. Yeah. So kind of going off that then, I guess you said there's plenty of room for optimization. So kind of what are those areas for optimization? I guess kind of the question I'm getting at is what are the problems
Starting point is 00:03:05 that kind of come with data lakes? And I guess that kind of would then lead naturally into what the goal of your work was with R2D2. Yeah, so of course there's a lot of room for improvement. In fact, we had presented another paper at ICDE previous year that was focused on optimizing storage costs. This is more focused on like reducing the redundancy and duplication. But basically, like, I can maybe talk about two problems that
Starting point is 00:03:33 this work focuses on in more detail. So one is, you know, because you have such large amounts of data, data regulation becomes very important. So you have things like GDPR policies and all. So if you don't manage your data properly, then you get fined. And that can be, like, millions or billions of dollars. And because of this, companies try to ensure compliance. And because of that, you can end up with higher data maintenance costs. For example, you might have to do a full table scan to delete a row from a table or to find some user's data in the entire table. And full table scans are very expensive, especially at such a large scale. Another thing is you can have a lot of redundant and untracked data.
Starting point is 00:04:17 In many cases, which is similar to what we were dealing with, you can have a lack of data lineage information. So you don't know how... you might have a lot of duplicates, but you don't know where those duplicates came from or what the root data set was. And that makes it even harder to, you know, maintain compliance with regulations
Starting point is 00:04:36 and hard for users to figure out what data should be deleted after a certain point. Nice, yeah, so there's two of them. So there's the regulation aspect and yeah, adhering to GDPR because those fines, like you said,
Starting point is 00:04:47 are quite large, right? And they're willing to give those fines out as well, which is good, so it's not a paper tiger, right? So, yeah. And obviously,
Starting point is 00:04:56 there's this possible untracked data, lack of lineage. Yeah, and where the heck did the data come from? So R2D2 is going to help us solve these problems, right?
Starting point is 00:05:04 So walk us through, I guess, what R2D2 is then. And at some point, we're going to need to get into the name and how it ended up being called R2D2 as well. But yeah, walk us through the high level and the key components of what you've devised to sort of help this problem of redundancy and duplication in data lakes? Yeah, so we focused on, we tried to, there's many ways you could approach the problem. So what we, one thing we did is sort of, you can model redundancy as like tables that are completely
Starting point is 00:05:37 contained within another. So that means both, so that means like a subset of the schema and a subset of the rows. So if you have a table which is a subset in both these categories, then it's completely contained in a bigger table. And that's a clear case of redundancy. Of course, you can also have overlapping tables, and we also address that case in the paper. But most of the paper focuses on this case of complete containment and the main like naive solution you would think of when you think of
Starting point is 00:06:11 like, I want to identify duplicates, is you would just, like, hash the rows or do something like that, and then compare the hashed rows of multiple tables, right? But the problem with that is it gets very expensive, especially when you have tables at such large scales. So we brainstormed about several approaches that we could use to do this in a more scalable way. And we wanted this to work at an enterprise level.
Starting point is 00:06:43 We wanted to work in the real world. So that did limit like some of the things that we could do. But finally, we came up with an approach that was like, focus more on identifying which tables are not duplicates, and sort of pruning the space of possibilities in that way. So at a high level, we divided our pipeline into four stages. So first, we identified schema level containment. So now you know, you have a graph of tables, an edge from table A to table B means that table B is contained, the schema of table B is contained within the schema of table A. But of course, there's a lot of false cases in this. You need to eliminate many cases to identify the real duplicates.
Starting point is 00:07:34 So then we used statistical pruning to eliminate more edges from this graph. And finally, we looked at the actual contents in content-level pruning, which is the third step. And that really reduced the number of edges in the graph. And the final step is, like you recommend data sets for deletion from the graph that's left over. And that's the optimization step.
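To make the first two steps concrete, here is a minimal Python sketch of the idea just described (an illustration only, not the actual R2D2 code; the tables, columns, and statistics below are invented): build an edge from table A to table B whenever B's columns are a subset of A's, then drop any edge where a shared column's min/max range in B falls outside A's range, since in that case B's rows cannot all be contained in A.

```python
# Toy illustration of the first two stages: (1) schema-level containment
# edges, (2) statistical min/max pruning. The metadata layout is invented;
# in practice these statistics would come from table/Parquet metadata.

tables = {
    "orders_full":   {"cols": {"id", "user", "amount", "ts"},
                      "stats": {"amount": (1.0, 950.0), "ts": (100, 999)}},
    "orders_subset": {"cols": {"id", "amount"},
                      "stats": {"amount": (10.0, 500.0)}},
    "payments":      {"cols": {"id", "amount"},
                      "stats": {"amount": (2000.0, 9000.0)}},
}

# Step 1: schema containment graph. Edge (a, b) means b's schema is a
# subset of a's schema, so b *could* be contained in a.
edges = [(a, b) for a in tables for b in tables
         if a != b and tables[b]["cols"] <= tables[a]["cols"]]

# Step 2: statistical pruning. If any shared column of b has a min/max
# range that falls outside a's range, b cannot be fully contained in a,
# and the edge is dropped without reading a single row.
def stats_ok(a, b):
    for col, (b_min, b_max) in tables[b]["stats"].items():
        if col in tables[a]["stats"]:
            a_min, a_max = tables[a]["stats"][col]
            if b_min < a_min or b_max > a_max:
                return False
    return True

pruned = [(a, b) for (a, b) in edges if stats_ok(a, b)]
print(pruned)  # only ('orders_full', 'orders_subset') survives
```

Only the edges that survive both checks move on to the more expensive content-level sampling and, finally, the deletion-recommendation step.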
Starting point is 00:07:59 So yeah, that's at a high level. Nice, cool. Yeah, so to run through those real quick. So the first one is sort of the schema level, uh, you're building this graph where, yeah, you've got the edge between the two tables. How much of that is sort of a manual process, how much are you able to automate that? Because, I don't know, table A might be contained in table B, for example, but how easy is it to sort of deduce that, given that a column
Starting point is 00:08:25 might have a different name, for example, but it might be the same thing? There's sort of a semantic mapping you need to make there as well, right? Yeah, so that's more going into the case of approximate containment. So if you know that two tables are talking about the same content, if you have that domain knowledge, then you can make that kind of mapping. But it's important that the tables are talking about the same thing. So if you have the domain knowledge, then you can do that. But the way we do the schema step is, like, it's not enough to just look at each pair one by one, like each pair of tables, because that could also take time. So we use, like, a clustering type of algorithm to create an approximate graph, um, like by processing multiple tables sequentially.
Starting point is 00:09:13 Yeah, so after that then you get the statistical pruning, number two, and then the content level is the last step. And then, so I'm going to go into the evaluation here, I'm really interested to see how many edges you were able to prune at each of these steps. So before we do go into that, set the scene for us in terms of how you went about actually evaluating R2D2. What was the setup? Tell us what data you used and the experiments you ran to evaluate the performance of this nice pipeline you developed. So we had a bunch of, like, customer data at the scale of terabytes, and we also used some synthetic data, so we got a good balance of data across, like, megabytes to gigabytes to terabytes, and we wanted to, you know, evaluate our pipeline on each of these
Starting point is 00:10:07 cases. So one thing you could easily use for evaluation is to figure out the number of edges in the containment graph after each step, and see how the number of edges is reducing and how much you're able to prune the graph. So that's one metric. You can look at the number of incorrect edges. And the way we've designed the pipeline is such that you'll always capture all of the correct edges, so you won't miss any cases of containment. So we're more focused on pruning the wrong edges than figuring out the right ones, because all of them are right. And another thing that's very important is, like, the time the pipeline takes at each step
Starting point is 00:10:49 and the number of row-level operations that happen at each step. So we also compare on those metrics. And we also run scale tests. And finally, for the optimization, we check the number of GDPR row scans that we can save, and the approximate amount of dollar cost savings, and how that varies with the percentage of containment. So yeah, of course, not exact numbers, but just projected. Yeah, yeah, cool.
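As a toy illustration of the kind of projection being described here, with entirely made-up numbers rather than figures from the paper: once a redundant, contained table is deleted, compliance scans no longer have to touch its rows, and its storage is no longer billed.

```python
# Back-of-the-envelope projection; every number below is invented.
deleted_tables = [
    {"rows": 2_000_000_000, "size_gb": 400},   # a redundant, contained table
    {"rows":   500_000_000, "size_gb": 120},
]

gdpr_scans_per_month = 50          # e.g. deletion / access requests
storage_cost_per_gb_month = 0.02   # rough object-storage price, $/GB-month

row_scans_saved = sum(t["rows"] for t in deleted_tables) * gdpr_scans_per_month
storage_saved = sum(t["size_gb"] for t in deleted_tables) * storage_cost_per_gb_month

print(f"row scans avoided per month: {row_scans_saved:,}")
print(f"storage savings per month:  ${storage_saved:,.2f}")
```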
Starting point is 00:11:21 So yeah, tell us what the results were then. How good did R2D2 perform then? Run through these metrics you tracked here. I think in terms of the schema step is the first step. So of course you have a lot of edges left after, but that sets the stage for the rest of it. And then I think the statistical pruning is extremely effective
Starting point is 00:11:43 because you don't have to process any rows. You can just read the metadata and use that to eliminate more than half of the edges in the graph. And finally, when you do look at the content, even that is relatively cheap because all you're doing is sampling. And if you have an index database, then even the sampling is a lot faster. So in this way, like, and after that,
Starting point is 00:12:06 you're left with very few edges. Like if, for example, in a terabyte dataset, you might have a hundred incorrect edges. And after that, when you recommend datasets for deletion, the user can look at it, it's a visual thing, right? So they can look at the data that's recommended for deletion, and if they don't want to delete it, it's feasible enough for them to, you know, decide not to. And there are more ways you can optimize this. Like, we've discussed some of that in the paper.
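For a rough sense of what that content-level sampling check could look like, here is a hedged PySpark sketch (not the proprietary implementation; the paths, the sample fraction, and the way shared columns are picked are placeholders): sample a small fraction of the candidate contained table and verify that every sampled row also appears in the bigger table.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("containment-sample-check").getOrCreate()

# Placeholder paths; the real pipeline ran over enterprise Parquet data.
big = spark.read.parquet("/data/table_a")     # candidate containing table
small = spark.read.parquet("/data/table_b")   # candidate contained table
shared_cols = [c for c in small.columns if c in big.columns]

# Sample a small fraction of the smaller table's rows...
sample = small.select(shared_cols).sample(fraction=0.01, seed=42)

# ...and check whether any sampled row is missing from the bigger table.
# A left-anti join keeps only the rows of `sample` with no match in `big`.
missing = sample.join(big.select(shared_cols), on=shared_cols, how="left_anti")

if missing.limit(1).count() > 0:
    print("edge pruned: some sampled rows of B are not present in A")
else:
    print("edge kept: no counterexample found in the sample")
```

A single missing row is enough to prune the edge, while a clean sample keeps the edge alive for the final deletion-recommendation step.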
Starting point is 00:12:35 We didn't implement, like, a completely optimized version. But, yeah. Yeah, you've always got, like, a human in the loop at the end as well who's actually making the decision, the final decision, to be like, okay, yeah, I want to delete this or not, right? So you've always got that safety there. It's only a recommendation, I guess, in that sense.
Starting point is 00:12:53 Like, it's not sort of making decisions and deleting data without sort of giving the human sign-off, right? Yeah, I mean, that could be dangerous, right? Deleting customer data without their approval. Of course, I mean, at max, you can go and archive the data and move it to a lower tier of storage, but even that, you need permission. So yeah. So, out of these different steps of the pipeline, the statistical pruning was the one that sort of had the most bang for the buck then? That was, like, 50% of the edges, you said, that's
Starting point is 00:13:20 what kind of could get removed on a terabyte. I think statistical and then the content level, but everything depends on everything. So I mean, you can't have, like, for example, if you didn't do the content level, then the optimization wouldn't make any sense. Yeah, so you hinted out there that you've identified this isn't like a complete solution. It's not the final sort of package.
Starting point is 00:13:44 There's a few areas for future research. So kind of taking that question then, where do you go next with R2D2? What are those directions that you think would be worthwhile exploring? So, um, one is the approximate containment case. So you have tables which are overlapping, and in that case statistical pruning will not work. But you can still apply content-level pruning, because you can still use sampling. But you need to design solutions that still keep the complexity low. And it's an interesting problem, but it's hard. So that's one potential direction for sure. Yeah. Otherwise, I think this is less of a
Starting point is 00:14:26 research direction, more like engineering optimizations in terms of how you do the sampling. So I talked about this actually in our talk at SIGMOD, that, I mean, you could choose to do sampling from both tables in intelligent ways instead of just sampling from one table. So that would really help. And if you have an indexed database, which is the case for most enterprises, then that really reduces the overhead, because you might have billions of rows but you're only comparing, like, hundreds of rows with each other, and you can easily use that to eliminate an edge. Yeah, for sure, leverage that information, and that's, yeah, definitely a promising direction to go in.
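One way to picture that "only comparing hundreds of rows" point, purely as an illustrative sketch (the set below stands in for a real index on the bigger table, and the data is synthetic):

```python
import random

# A set stands in for an index over the bigger table's shared columns;
# in a real system this would be a database or lakehouse index.
big_table_index = {(i, i * 10) for i in range(1_000_000)}    # rows of A
small_table = [(i, i * 10) for i in range(500, 1_500)]       # rows of B

sampled = random.sample(small_table, k=200)   # only a few hundred lookups

contained_so_far = all(row in big_table_index for row in sampled)
print("edge kept" if contained_so_far else "edge pruned")
```

Each sampled row becomes a cheap point lookup, so the check stays inexpensive even when the underlying tables have billions of rows.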
Starting point is 00:15:06 I guess let's talk about impact for a second, because you said there that a lot of the work going forward would be sort of getting this production grade and getting it kind of to a level where it could be deployed
Starting point is 00:15:17 in a production level system. And so obviously, this is kind of work you've done with colleagues at Adobe. Has this work already found some impact at Adobe internally? Has it been used in any sort of production system yet? Like what's the impact been like? I'm not sure if it's been like fully deployed in production.
Starting point is 00:15:37 Like, that does take a few years and it can be a long process. But I had implemented this, like, in Scala and Spark, so the whole pipeline can be run as a Spark job. So, I mean, in terms of actually getting it, like, production ready, I don't think that's difficult. I think more of the difficulty is in getting, you know, buy-in from users and customers, to get them to agree to process their data in this way, and then for them to agree to, you know, this recommendation for deletion. I mean, it can be hard to, you know, pitch that to customers in a bigger company. But yeah, I'm sure that this definitely does have a lot of scope across the industry, and I'm sure there are people who would be interested. Yeah, for sure. How easy do you think it would be now
Starting point is 00:16:26 to sort of like, as a, I don't know, someone who's got some data installed in, I don't know, Azure, whatever, and I want to basically run your tool, how easy would it be to take what's there now? I mean, I presume it's kind of open source, available on GitHub. Could I take that, what you,
Starting point is 00:16:42 the code you've written, and use it? Would that be pretty easy to do, or would it be a bit painful maybe? How easy would it be to use it? So, I mean, the access point for the data was Azure Databricks, and we used Spark. So, like, I don't think it's very difficult. You just load the data in, and, uh, after that you just run the Spark job on it. I mean, there's no other dependency, really. Maybe the data format, depending on that, you might have to change a few fields in the schema and all. But yeah, unfortunately, though, the code is not open source, because it was written while I was working at Adobe,
Starting point is 00:17:18 so that becomes proprietary. But all of the algorithms that I described, they're all described in the paper in a lot of detail. And it's like nothing is left out. So it's easy to implement. Yeah, that's cool. That's really nice that we can recreate if we need to, right? How long was the implementation effort then?
Starting point is 00:17:38 Was it quite a challenging thing to implement? I think getting the scalability stuff in was a bit of a challenge, because you need to think about what kind of operations you're doing and optimize the performance at each step. So a lot of it was an engineering challenge, and a lot of it was iterative. So it wasn't like we came up with the whole pipeline in the first place. You had to come up with a solution that's simple, that's scalable, and it was a product of several months to a year, of course going on in parallel with other projects, of effort. That's nice, cool. So yeah, I always like to sort of ask where the idea for this project, or a paper, actually came from originally. So yeah, what's the story behind this
Starting point is 00:18:25 and who had the idea to be like, yeah, this is something that we should do and we should work on? What was the story there? Yeah, so like I said, right, we had this paper in ICDE, and there we realized the storage and compute costs for these data lakes are really high. So there we focused more on multi-tiering and compression and partitioning to optimize those aspects. But naturally, once that was done, we thought, what other problems can we address? And duplication, we had discussions with a lot of engineers, product managers, and duplication was one thing customers had brought up. And data regulation, for sure, was a real issue. So we sort of put these two use cases together and thought, like, what if we could...
Starting point is 00:19:09 And lineage was not, like, there at the time. Nowadays, you have... many times you have, like, automatic lineage detection as well. But at the time, like, Databricks wasn't offering that either. So, yeah, I guess that was the root of the idea. Nice, that's really cool. You also mentioned earlier on, when we were chatting, that you didn't arrive at this sort of three-step pipeline immediately. So there was some iteration there.
Starting point is 00:19:36 So maybe you can kind of tell us a little bit about some of the things you tried that were sort of unsuccessful. And you're like, oh, OK, that's not the best idea, I'm going to do it a different way. Yeah, tell us more about those parts of the journey, because the journey is never linear, right? There's always ups and downs. So yeah, tell us about some of the downs. I mean, okay, I have to remember, but I think one of the first things we did was trying to find duplicates, and we very quickly realized that that's probably not gonna work that well. And then we moved to a pruning approach, and even there, I think the first schema algorithm we had was more inefficient. And the content-level pruning as well, like, we were
Starting point is 00:20:19 unable to figure out how you can do this without processing, like, a large number of rows. So a lot of the initial approaches were not that scalable. Many times it's like, is this even possible? Because you don't know if it is. But that's research, and in the end we built something that, I think, can be used in, like, a real-world system as well. But that's the thing, right? Like, as software engineers, you might not be willing to spend that much time exploring the area, because there's no guaranteed return. But as a researcher, like, I was part of the research team at the time, so it was easier for us to, you know, keep going at it, and eventually we got something which was good. Yeah, for sure. I mean, you're right there: when you're
Starting point is 00:21:11 in a research sort of domain, in a research context, you've got a lot more scope to try things and keep failing until maybe you get something that works eventually. But when you're kind of, maybe, a software engineer working on a production system, the incentives and the objectives are a little bit different, right? So you don't have as much freedom; it's like, okay, if it didn't work the first time, forget it, we're going to do something else, right? Whereas it's nice in the research world that you can iterate and keep at it; there's definitely something here, it might not work, but it feels promising, and you can sort of keep at it and get there in the end. Yeah, I guess sort of whilst we're talking about that journey from sort of start to finish,
Starting point is 00:21:47 how long, we touched on the implementation effort, it's been like kind of a couple of months to sort of do. Between the conception of the original idea, how long did it take to arrive at the sort of the SIGMOD paper? I think about a year, yeah. And of course, it goes through like a round of reviews and you have to rebut it. And that also helps improve.
Starting point is 00:22:09 But in the duration, like between the review and the rebuttal, there is another two months. So, yeah. Okay. So it's like kind of like a 12-month time line. That's pretty cool. So, yeah, I mean, I always like to find out as well, like if there are any sort of surprises that you've sort of found on this journey of working on this project was
Starting point is 00:22:30 there something that sort of jumped out and you're like, huh, did not expect to learn that whilst working on this? Or this was something really cool I didn't really think was going to be a thing, and that was a thing? But yeah, I guess, like, you do have this idea sometimes that the things that researchers work on are, like, you know, hard to implement in actual production. And a lot of it, like, many times it's a trade-off. One thing I realized is, just because the solution is simple, it doesn't mean that it's not good. And there is a trade-off between, like, triviality and practicality in some sense. It's difficult to find something that's practical and also novel. But I think we were able to do that.
Starting point is 00:23:13 And I think that's, I felt good about that. And I think I like that nowadays, like, we can see more research, especially in data and systems, that is like this. I hope that continues. Yeah, I completely agree with you. You know, that's sort of a common theme that a lot of people who I've been chatting to recently have sort of had as well: their solution, whatever they've come up with, maybe hasn't been, it's been sort of not trivial, but, like, it's been simple, right? It's been a simple, well-formed idea that solves the problem clearly.
Starting point is 00:23:46 And those ideas are so much more easy to sort of, well, to sell to industries, well, and get people to actually go and implement because they can reason about them, right? They're very explainable, they're very understandable. It's like, okay, I can reason about this in my brain. Okay, I've got some confidence in this. Whereas a big complex sort of fancy solution,
Starting point is 00:24:02 maybe, yeah, people sometimes conflate complexity with quality. And those two things don't correlate in my mind, they shouldn't do anyway. But so I completely agree with you there. And so, yeah, that's a really, really good point. And cool, yeah. So speaking of ideas, simple ones and complex ones, how do you go about sort of, like,
Starting point is 00:24:27 generating ideas? Do you have, like, an approach for it? And then, like, yeah, once you've got those, how do you go, okay, that's the one that I'm gonna bang my head against the wall for, for the next 12 months, to get to work? So yeah, it's about the creative process, I guess. What's your approach to it? I mean, I think it's first about finding a pain point. Where there's enough of a pain point, usually people are willing to invest resources, and resources are important whenever you work on a problem, because you need resources both in terms of people as well as, you know, people willing to invest money and time. And so once you have that, then you can think about
Starting point is 00:25:09 what scope of the problem you want to tackle. And usually, it's hard to tackle the whole problem at once. So if you can figure out a certain niche that's feasible enough and seems practical at the same time, given your timelines and everything, then you can get into the actual technicalities, like approaching it from different angles,
Starting point is 00:25:32 figuring out, you know, the problem-solving process. And once you've, I think after a bit of exploration, you get the idea of how much scope this has, how much time it's worth investing into, like, whether you want to put more into it or not. And after that, I suppose, yeah, you can, like, choose to invest more resources, spend more time, and eventually, like, it either works or it doesn't. But that's the process. Yeah, that's a nice approach to the process. That says a lot.
Starting point is 00:26:05 So getting incentives aligned, I guess, would be a way of phrasing it, and kind of getting that buy-in, solving a pain point, and then normally resources follow, and you need resources to solve a problem, right? So yeah, it's a practical way of viewing that, sure. Cool, yeah. I also kind of want to ask you about what you're working on at the moment. So obviously R2D2 is done and dusted, the paper's published, best paper award maybe in 10
Starting point is 00:26:31 years, we don't know, right? Maybe, fingers crossed. But yeah, what's the next thing? Have we got another Star Wars themed project on the go? Well, not Star Wars, but, um, so I guess R2D2 was higher up in the stack. I was working more like Python, Spark, PySpark, Scala. And I guess now at UIUC, I'm trying, like, understanding more about different parts of databases, distributed systems and all. And also exploring lower in the stack. So at UIUC, I was working more on, like, file formats, and that also aligns with my current internship at LanceDB. So over the summer, like, trying to develop better and implement new encodings. So that's more work on, basically, speeding up compression and random access. So the work is more around, like, Rust and Python, that sort of
Starting point is 00:27:25 thing. And I think it's very interesting, also exploring a different part of the same space, but learning a lot, and it's very exciting because it's all, like, the future. Nice, yeah. So I'm not really familiar at all with LanceDB, so maybe you can kind of just give me the rundown on LanceDB, and, yeah, like, what's the problem it's trying to solve? I mean, basically you have AI/ML workloads, right? So it's sort of like a database, and there's a file format underneath which is better designed to support, like, these multi-modal AI workloads. And now you have so much data processing that you have to do with these LLMs and all. So in terms of vector storage, in terms of speeding up retrieval of
Starting point is 00:28:12 the data as well as processing, this becomes very important. So they're designing, like, a better data format, and, like, trying to beat Parquet as well on many metrics. Already beating Parquet on many metrics, actually. But yeah, it's a work in progress, but it's pretty exciting. That's awesome. Yeah, that's really cool. Yeah, before we do sort of move on to the last word and sort of wrap up the chat, the name R2D2: who decided on the name and why? We need to get that explained. R2D2, I think, so we generally, we like to have, I think my senior came up with it, but we like to come up with names for our models,
Starting point is 00:28:57 which are memorable. And so that's why we try to choose the words in a way that you can abbreviate it in a cool way. So Reducing Redundancy and Duplication in Data Lakes, R2D2, you know, natural extension. Yeah, no, that's awesome. Honestly, naming things is a hard thing, but, like, I like it when papers have, like, a cool name as the title. It makes it miles more memorable, right? Uh, so yeah, that's awesome. And yeah, so I guess kind of with that then, time for the last word. What's the one takeaway you want the listener to get from this chat today? I think just that, right, that, um,
Starting point is 00:29:33 it's important to focus on a general idea, and simplicity is often something which is very underrated, but it can be very useful for solving difficult problems as well, which are often easy to implement in the real world. And I think this is something to note both for researchers as well as for people in the industry, because on both sides there's a different perspective as to, like, what's easy to go into the product, or whether it's worth investing into research to make it, you know, good enough to go in the product. But I think it's, like, yeah, basically that's the message I would say. Yeah, yeah, I mean, that's a great message, Raunak. So yeah, I guess thank you very much for chatting with me today. It's been a
Starting point is 00:30:24 great chat. I'm sure the listener will have enjoyed it as well. And where can we find you on social media, anywhere like that? Where can people reach out to you? What's the best way? I'm on LinkedIn. Cool. Well, drop your LinkedIn in the show notes,
Starting point is 00:30:38 and we'll put all the papers we've talked about today and everything, all the materials in the show notes as well. And, yeah, the listener can reach out to you through that. And, yeah, I guess we'll see you all next time for some more awesome computer science research
