Disseminate: The Computer Science Research Podcast - Rui Liu | Towards Resource-adaptive Query Execution in Cloud Native Databases | #49

Episode Date: April 1, 2024

In this episode, we talk to Rui Liu and explore the transformative potential of Ratchet, a groundbreaking resource-adaptive query execution framework. We delve into the challenges posed by ephemeral resources in modern cloud environments and the innovative solutions offered by Ratchet. Rui guides us through the intricacies of Ratchet's design, highlighting its ability to enable adaptive query suspension and resumption, sophisticated resource arbitration for diverse workloads, and a fine-grained pricing model to navigate fluctuating resource availability. Join us as we uncover the future of cloud-native databases and workloads, and discover how Ratchet is poised to revolutionize the way we harness the power of dynamic cloud resources. Links: CIDR'24 Paper | Rui's LinkedIn | Rui's Twitter/X | Rui's Homepage. You can find links to all Rui's work from his Google Scholar profile. Hosted on Acast. See acast.com/privacy for more information.

Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast. As usual, I'm your host, Jack Wardby. Today, it's my pleasure to say that I'm joined by Rui Liu, who will be telling us everything we need to know about his paper Towards Resource-adaptive Query Execution in Cloud Native Databases. Rui recently received his PhD in computer science from the University of Chicago. Before we do start, a few quick announcements. Remember, if you do enjoy the show, please consider supporting us through Buy Me A
Coffee. And we do have a listener survey out at the moment. So please go and check that out. Anyway, on to today's show. Welcome, Rui. Hi, Jack. Thank you so much for having me. I'm really excited to be here. Fantastic. Let's get started then. So I like to start off with my guests by getting them to tell us their story. So maybe you can tell us more about yourself and how you became interested in research and databases. Yeah, sure. So for myself, like I just mentioned, I recently received my PhD degree in computer science from the University of Chicago, where I was co-advised by Professors Aaron Elmore and Mike Franklin.
I also worked very closely with Professor Sanjay Krishnan. So my PhD research is about building resource-efficient data-intensive systems. During my PhD, I was also a research intern at a research lab at Microsoft, a data science intern on the AI engineering team at DocuSign, and a visiting student at Argonne National Lab. But the question about my research path
Starting point is 00:01:50 towards data management research is like, if you take a peek at my resume or my CV, you probably will see a very diverse or zigzag background. So unlike many of my peers, I didn't get a chance to do some basic research when I was an undergraduate student. My research process started by my undergraduate final year project,
which is about mobile computing. The professor who advised me on that project taught me some very basic research methodologies. And then after that, when I graduated, a lot of my classmates became software engineers, and I felt like I wanted to do some research rather than, you know, make my contribution or impact in industry. So then I decided to pursue a master's degree. And then I finally found that the Hong Kong Polytechnic University would be one who would
accept me as an MPhil student. I think that's a research-oriented and fully funded master's scholarship. You can consider it as a mini version of a PhD. We still took several classes, but we spent a lot of time on doing research. And we got a salary every month, and then we needed to defend our thesis at the end of the programme so that we could get the degree. But still, at that time, my research focused mainly on mobile computing, though I did begin to explore some areas like ubiquitous computing and cyber-physical systems. Once I finished that, I also got a chance to join a systems security group at the Chinese University of Hong Kong
because the host there was my co-author, for six months, basically. And that was a research assistantship. During that time, I got a chance to explore some privacy and security projects. So far, you can tell I do have some experience in different areas. They're pretty diverse. But one thing that has not changed is that I have to manage a lot of data, including user information, like sensing data from sensors or mobile devices,
Starting point is 00:04:02 and some location or timestamp data. So there's a lot of them. So you need to be efficient in managing, you need to compute them, get whatever you want. Those challenges of data management fascinated me. Yeah, probably. That's the cool thing if we can, you know, explore some research ideas in that area. And then I found this whole data management area. And almost at the same time, I decided to go to US for a PhD degree.
Starting point is 00:04:31 So I said, why don't I just combine it together, get a PhD in database area? And that's pretty good, right? But the one issue I'm facing at that time is like pretty difficult to get in, you know, get accepted by a decent PhD program if you cannot show some solid research experience and skills. Then somehow I managed to go to the National University of Singapore
to join their database research group, working with Professor Beng Chin Ooi. I think at that time, I did explore some database or data system research, like blockchain, memory management, data cleaning. There were a lot of projects in that group. And I think I did a good job. And eventually, he wrote me a recommendation letter so that I could get into the PhD program at UChicago. From that time, I started my data management research path.
So I think that's my whole story. I'm probably a little wordy, but yeah, that's a long path. That's awesome. Yeah, I mean, there's a few things you said there that resonated with me as well. Not taking a linear path to databases and a PhD in databases is something that happened with me too. I mean, I started out doing mathematics and economics as my undergraduate. I had this dream of being the next Wolf of Wall Street, and then slowly transitioned into statistics, and you can see how computers came in a little bit more with more computational statistics, and then I finally settled on a PhD in databases. But one thing I will say, and I don't know if this is true for you, is that it sounds like you've been through a lot of different areas and got exposure to a lot of different areas of computer science and various different topics.
Do you think that served you well for your PhD, that exposure and having that broad understanding of the wider area, rather than being laser-focused on one specific topic for such a long time? Yeah, I think so. Being exposed to different areas and then finally focusing on one gives me a stronger motivation to be here, because I already know what the other areas are like, and I chose this area because I wanted to. And I do use some of my previous experience from other areas in my research. Yeah, you kind of have that cross-pollination of ideas, and you never know when something you learned five years ago might all of a sudden come back to you, and you think, oh, that's brilliant, I remember this and I can apply it here. Yeah, exactly. Cool. So let's talk about the paper that you published at CIDR recently. It's called Towards Resource Adaptive Query Execution in Cloud Native Databases. So there's a few things in that title, some background we need to set up here so the listener can get on board with what we're going to be talking about for the next hour or so. So let's start things off with: tell us what a cloud-native database is. Sure, yeah. I mean, there's a lot of definitions of cloud-native databases if you Google it.
But my definition is like, so cloud-native databases are designed to exploit the benefits of cloud computing environments. So unlike traditional databases that are often deployed on specific hardware or within fixed environments, cloud-native databases are built from scratch for cloud environments. Some key features or advantages of them are scalability, multi-tenancy, high availability, a pay-as-you-go pricing model, and automated database maintenance; you name it, there's a lot of different things. Some examples of cloud-native databases are Amazon Redshift, Google Spanner, and Microsoft Cosmos DB. Of course, there are a lot of other products in the world. But I think those databases are optimized for performance, availability, and scalability in the cloud, making them very, very powerful for a wide range of applications like web or mobile applications, microservices, IoT systems, or even machine learning and training today's large language models. Yeah, nice.
That's a really nice definition of cloud-native databases. I mean, yeah, you go on Google and you stick that in and there's like a thousand different definitions of it, right? And it's kind of hard to say, okay, well, this person's saying this thing, this person's saying that. So that's a really nice definition of cloud-native databases. So we're in this world then, with everything sort of shifting towards using all the primitives that these cloud environments give us. What are some of the key factors then that we need to think about that make us have to reconsider the architecture for the way we would develop a system in a cloud environment, I guess? Yeah, kind of what I'm hitting at here is: give us the elevator pitch for this paper, really. Yeah, sure. So I think, yeah, that's a very interesting question. So I think one observation that we have is, we argue that ephemeral cloud resources are becoming prevalent, you know, because their prices are really, really attractive
compared with the traditional on-premise or on-demand cloud resources. There was a report, I remember, which said that the peak-time price of regular cloud resources can be like 200 times higher than the ephemeral cloud resources. So there's a huge difference here. But there are two unique traits of these ephemeral cloud resources. The first is that the ephemeral cloud resources are pretty dynamic in availability. The cloud resource provider may take them back for some reason. So that means the resource usage can be terminated. Second is that even though their prices are very attractive,
Starting point is 00:10:22 usually the prices are fluctuating over time. So their peak time prices will be much higher than their off-peak time prices. And the fluctuation time period is not days or weeks. It could be hours. So that's really, really dynamic. So based on these two observations, we feel like it's time to propose a new or re-imagine
what the cloud-native database looks like. So this is just the overall picture. I can give you some concrete examples. If you Google Amazon spot instances, Amazon provides some short-lived but really good deals on cloud resources for you. So if you just want to use them for like one hour, two hours, you can probably find a very good deal there. And also there's another cloud paradigm called zero-carbon cloud. It's proposed by a pretty famous professor, Andrew Chien, at the University of Chicago. In this zero-carbon cloud paradigm,
the entire data center is driven by renewable energy like wind and solar; you know, there's no cost for that. It's pretty good, right? But the problem is that we cannot control wind, we cannot control water, so the resources, or the energy, will sometimes be terminated. So you can see, once you imagine those applications or the big picture we have, you want to consider: what if the resources are not stable? How should we still use them to build our systems or build our cloud-native databases? So this paper shows our vision of the best practice for building and deploying these cloud-native databases on ephemeral cloud resources, especially from the perspective of the cloud service provider. Nice. Yeah, there were some interesting things there that we were talking about, the fluctuation over time of these spot prices and that it can change on the order of hours. I mean, that's kind of crazy, right? And I also really like the sound of these zero-carbon data centers, the zero-carbon clouds, that's a fascinating thing. Anyway, cool, let's talk about these primitives that you've proposed then, that we need to be aware of when we're building a cloud-native database on such an environment. So yeah, tell us about these primitives you've identified in your vision. Yeah, so I think we provide three primitives in this paper. One is called query preemption. That describes the ability to preempt queries that are consuming resources. So once we have very limited resources, we know the resources will be terminated.
Starting point is 00:13:11 It's very dynamic. And it's not reasonable to keep allocating resources to some long-running queries. So some other short-running queries have to wait until the long-running queries finish. That will significantly increase the latency of the query, right? So for those cases, I think the one important point we have is that we should allow the
queries to be, you know, suspended, adaptively paused, when there's a need or a benefit to doing so. I think that's primitive one, query preemption. But that's for single queries, or once we have one query, right? The second primitive we have is called resource arbitration. Let's move to the scenario where we have, you know, multiple users; there's a multi-tenant environment with a workload. Under the same assumptions about the ephemeral resources, you know, fluctuation in availability and cost, how much resource a query needs is already answered by existing resource reservation or scheduling methods and whatnot. The question we want to ask is:
Starting point is 00:14:28 is it worth allocating resources to a particular query or job? This is because we have limited resources, but we have a large amount of workloads. How should we allocate the resources to that? The thing that makes this worse or more important is like once we look at those query, the progress curve of each query, it's like they increase
Starting point is 00:14:49 at the very beginning significantly or quickly because probably what, you know, a lot of the modern data processing jobs are, you know, iterative. You know, we process data batch by batch, right? The first batch may already give you very rough idea of your final result.
Starting point is 00:15:06 But your final batch just pushes your final result from 90% accuracy to 91% accuracy. So you can imagine you combine everything together. We have limited resources, we have a large workload, and each workload
can get results that are pretty good from the very first batches. But they will waste their resources on the final batches. How should we allocate it? Those things, once we put everything together, give us this primitive two, resource arbitration. The third primitive, we call it cost tolerance. It's about the pricing model. Like I mentioned before, right now the state-of-the-art pricing model is pay-as-you-go: how much resource you use, how much money you pay, right? Then our vision is, if we allow the service provider, or the users allow the service provider, to suspend their job, you know, for some reason, then I can give you more fine-grained pricing control, or there are more options you can choose. Then some users may take it, because not everyone cares about low latency, high availability, things like that. So I think for those users who prioritize cost efficiency over speed, they may prefer more options. Like, even if you suspend my job, I'm totally fine; I just want you to give me a good deal.
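To make that cost-tolerance idea a little more concrete, here is a minimal sketch of the kind of suspension-aware quote Rui is hinting at, where the discount grows with the amount of interruption a user says they will tolerate. The function, its parameters, and the discount numbers are illustrative assumptions, not the pricing model from the paper:

```python
def quoted_price(base_rate_per_hour: float,
                 estimated_hours: float,
                 max_suspensions_tolerated: int,
                 max_suspended_hours: float) -> float:
    """Hypothetical suspension-aware quote: the more interruption a user
    will tolerate, the larger the discount off the pay-as-you-go baseline."""
    baseline = base_rate_per_hour * estimated_hours    # plain pay-as-you-go
    # Toy discount schedule, capped at 60% for very interruption-tolerant jobs.
    discount = min(0.60,
                   0.05 * max_suspensions_tolerated + 0.02 * max_suspended_hours)
    return round(baseline * (1.0 - discount), 2)

# A latency-sensitive user: no suspensions allowed, pays the full rate.
print(quoted_price(0.50, 10, max_suspensions_tolerated=0, max_suspended_hours=0))  # 5.0
# A cost-tolerant user: "suspend me whenever you like, just give me a good deal."
print(quoted_price(0.50, 10, max_suspensions_tolerated=5, max_suspended_hours=8))  # 2.95
```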
Starting point is 00:16:30 Yeah, awesome. If I just repeat those back to you then. So the first one is our query preemption. So that's basically saying that we can adaptively, given users of cloud environments, the ability to adaptively pause and suspend their queries.
Starting point is 00:16:45 Then we move on to our resource arbitration. And that is where we kind of want to ask the question, is it worth allocating something to a job? Because like you said, in the case of iterative queries, there's this sort of diminishing return of sort of every, not all iterations are equal. So the earlier ones are more valuable and then they are, because like you said going from
98 percent to 99. Yeah, sometimes you want to give resources to the more promising jobs, right? And sometimes you want to keep allocating resources to one job until that job has pushed its result really far along. So it really depends on your goal, your objective. But yeah. But it's all about flexibility, right? That's the thing. Having a more flexible setup, and then this kind of comes into the pricing model, right? And the cost tolerance, and just exposing more options to users, that is better for the users in terms of cost, right? They maybe don't spend money where they otherwise would need to. And it also, I guess, frees up resources for the cloud provider to then sell to somebody else, right? So basically everybody wins, right? That's sort of the goal. Exactly, a better pricing model. Yeah, all boats rise. So, we've got our three primitives then, and to realize these primitives you've developed a framework called Ratchet. Tell us more about Ratchet. Give us the high-level overview of how you put these primitives
Starting point is 00:18:08 into practice, shall we say. Yeah, definitely. By the way, you have a better summary than me. I think that's... Sorry, you just told... I'm just repeating what you just told me, so it's fine. Yeah. So for Ratchet,
essentially, it's a novel, a new, resource-adaptive query execution framework, to realize, like we said, these three primitives. It's a framework, so there are different ways to implement it, but in our paper we propose three components, or three pillars, in this framework. One is that we design an adaptive query execution framework, which enables query suspension and resumption using different strategies. The second component is a resource arbitration mechanism. It's responsible for determining resource allocation for suspension and resumption during runtime. The third one is a cost model that provides users a more fine-grained set of resource and price options.
Starting point is 00:19:18 So I think overall that's the framework for Ratchet.
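At the risk of over-simplifying, those three pillars can be pictured as three cooperating components. The skeleton below is an editorial sketch of that structure only; every class, method, and policy in it is invented for illustration and is not code from Ratchet:

```python
from dataclasses import dataclass

@dataclass
class Query:
    query_id: str
    progress: float = 0.0          # fraction of work completed, 0.0 .. 1.0

class SuspensionEngine:
    """Pillar 1: adaptive query suspension and resumption."""
    def suspend(self, query: Query, strategy: str) -> dict:
        # Persist whatever the chosen strategy requires: nothing for "redo",
        # finished pipelines' intermediates for "pipeline-level",
        # the whole process image for "process-level".
        return {"query_id": query.query_id, "strategy": strategy,
                "progress": query.progress}

    def resume(self, snapshot: dict) -> Query:
        return Query(snapshot["query_id"], snapshot["progress"])

class ResourceArbiter:
    """Pillar 2: decide whether a query is worth more resources right now."""
    def pick(self, queries: list[Query]) -> Query:
        # Placeholder policy: favour the query that is furthest behind
        # (a crude fairness proxy; see the arbitration sketch later on).
        return min(queries, key=lambda q: q.progress)

class CostModel:
    """Pillar 3: expose fine-grained, suspension-aware price options."""
    def quote(self, base_rate: float, suspension_tolerance: int) -> float:
        # Toy schedule: a bigger discount for more interruption tolerance.
        return base_rate * (1.0 - min(0.5, 0.1 * suspension_tolerance))
```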
Nice, that's the high-level overview. So let's dig into these one by one and do a little bit of a deep dive. Kick us off with the first pillar, the adaptive query execution framework. You mentioned there are a few different strategies there for how we go about suspending these jobs, so yeah, give us a bit of a rundown of how that works. Sure, definitely. So before we get to that question, one thing I want to announce is that the paper describing this first primitive just got accepted at ICDE 2024. If any audience member wants to look at more details, you can find it there. It's probably not public right now because we didn't put it on our website, but I believe it will be published very soon. And yeah, let's go back to that question. So for primitive one, the adaptive query suspension and resumption framework: in that paper we propose this framework, and this framework consists of different query suspension and resumption strategies.
Starting point is 00:20:26 Those strategies are, you know, at different levels. So, for example, we can say the most naive one is called
the redo query. So once your resources have been terminated, you don't do anything. Your query will stop there, and then once we want to resume it, we rerun it. That's the most naive one. Another strategy is what we call the operator-level suspension strategy. It actually originates from a previous work called query suspend and resume; I think that's a work proposed by Badrish from Microsoft Research, a pretty big name there. And then another strategy is what we call the pipeline-level suspension strategy. That's something we developed for morsel-driven or pipeline-driven query execution. Unlike operator-level suspension, which suspends the query at the operator that has the lowest memory usage, because we don't want to persist too much data when we suspend a query, the pipeline-level strategy works like this: pipeline-driven query execution splits the query into different pipelines. And then our strategy, our method, is that we can suspend the query once a pipeline is finished, and then all the intermediate data of this pipeline will be persisted. And once we want to resume the query, we can rebuild the query plan, gather the persisted intermediate data of all the processed pipelines, and then go from there. Another strategy proposed in this framework is what we call the data batch-level suspension strategy. Here, we don't care about the query execution part, but we split the input data into different batches, and we can suspend once one or multiple batches are finished. And then it's very easy to, you know, keep the progress and the intermediate data, right? And there's another strategy that we also developed. We call it the process-level suspension strategy.
Starting point is 00:22:31 It's like we consider, so we don't care about what happened within the databases. We suspend the whole process where the database applications are. So that means if the database has multiple process to handle multiple queries, and once we want to suspend one of them, we suspend the entire process. And then we keep everything within this process into disk. Yeah, and then once we want to resume it,
Starting point is 00:23:01 we resume the process first, and reload everything within this process and then process it from there. Those things, you can see in this framework, we have very diverse or different strategies, but there are some trade-offs between them. It's like you can imagine once we want to redo a query, there's no cost here.
It's like we just let it stop, and we do not persist anything. And then once we want to resume, we rerun it. But the thing is, we lose all the progress; there's nothing kept. But for the process-level suspension strategy, we keep everything. Once we know the resources will be terminated, we suspend the query and we keep everything. But the downside is that the intermediate data, the state, could be really, really large, because we keep not only the database's data but the process data, almost everything.
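As a concrete, heavily simplified picture of the pipeline-level idea Rui describes, the sketch below persists the intermediate output of each finished pipeline and, on resume, replays only what had not finished. The pipeline functions, file layout, and suspension signal are stand-ins, not Ratchet's implementation; which strategy is actually worth using depends on the trade-offs Rui goes on to discuss:

```python
import os
import pickle

def run_with_pipeline_suspension(pipelines, state_dir, should_suspend):
    """pipelines: ordered list of (name, fn), where fn(prev_output) -> output.
    After each finished pipeline its output is persisted; if a suspension is
    requested we stop, and a later call resumes from the persisted outputs."""
    os.makedirs(state_dir, exist_ok=True)
    output = None
    for i, (name, fn) in enumerate(pipelines):
        marker = os.path.join(state_dir, f"{i}_{name}.pkl")
        if os.path.exists(marker):                  # finished before a suspension
            with open(marker, "rb") as f:
                output = pickle.load(f)
            continue
        output = fn(output)                         # execute this pipeline
        with open(marker, "wb") as f:               # persist its intermediate result
            pickle.dump(output, f)
        if should_suspend():                        # e.g. a termination notice arrived
            return None                             # suspended; call again to resume
    return output                                   # query finished

# Hypothetical two-pipeline query: scan + filter, then aggregate.
plan = [("scan_filter", lambda _: [x for x in range(100) if x % 2 == 0]),
        ("aggregate",   lambda rows: sum(rows))]
print(run_with_pipeline_suspension(plan, "/tmp/q42_state", should_suspend=lambda: False))
```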
So I think the question is, how should we select the most appropriate one once we have different queries, different user requirements, and different environments? That's something we want to answer. That's primitive one. Sorry, a quick question on that. It's a very flexible sort of framework there, the different levels, from process-level all the way down to redo, where you can lose everything but obviously there's no preservation of state, so there's very low overhead in that respect. Is that something, and maybe we'll cover this later on, I'm not quite sure, but is that something you want to surface to the user of these systems, so they can actually declare it? What does the user interface look like? Can I say, my query is a redo query, or I want it to use the data batch-level sort of computation? How do I express that as a user? Is the system intelligent, can the system work it out, or is it something that's surfaced to the user? Right, yeah. So this, again, like I mentioned earlier, those systems are from the service provider perspective. Okay, right.
Yeah, so that means the system will make the decision, but it will consider user requirements. For example, like I said, if the user tells us they really, really want to get their results faster, in a really quick way, and they don't care about money, they'll spend all their money on that, then I won't even suspend their query; I will try my best to let it run until the end. But if the user says, okay, you can suspend my query, I don't really care, and tells us how much suspension they can afford, then, in our vision, we will give them fine-grained options, like one or two or three suspensions, or some other way to let users indicate their expectation. Based on all these things, our system makes the decision about which suspension strategy should be used. Gotcha, so there's a mapping there between the high level of what the user wants, and that then gets converted down into these different strategies you use. Cool, so I guess with that we're on to pillar number two. Yeah, so pillar number two is resource arbitration. We already have a paper about that. For that, once we have multiple queries, multi-tenancy, we consider that users have very diverse requirements. So users define completion criteria: they will say when they think their query can be considered finished, or when the result will be acceptable. So, like,
Starting point is 00:26:27 accuracy will be like 90%, or I already scanned like more than 90% of the data, or whatever. So our system will estimate how much progress a query can achieve if I give you a specific amount of resources. So we
will estimate that thing. And then we will go through a selection process. We say, if we allocated those resources to these jobs, we estimate how much progress we would get. And then what we do depends on different goals. If our goal is to keep the resources on the most promising ones, I will give the resources to the job that can achieve the most progress. If I really care about fairness, I don't want any one job to fall behind; I want everyone to achieve some reasonable progress, so I will probably keep allocating resources to the one that has achieved the least progress. So this framework, this primitive two, gives us the ability to adapt the allocation of resources to different jobs and to achieve some greater good, like fairness or efficiency, right? So yeah, that's primitive two. So on that primitive two, sorry, really quick. Yeah, sure.
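Here is a tiny sketch of the arbitration step just described: estimate how much progress each job would gain from the next slice of resources, then pick a winner either to maximise total progress or to keep stragglers moving. The progress numbers and the two objectives are placeholders for illustration, not the actual estimators in Rotary or Ratchet:

```python
def arbitrate(jobs, objective="max_progress"):
    """jobs: dict of job_id -> (current_progress, estimated_gain_if_granted).
    Returns the job_id that should receive the next slice of resources."""
    if objective == "max_progress":
        # Greedy: resources go to the job expected to gain the most.
        return max(jobs, key=lambda j: jobs[j][1])
    if objective == "fairness":
        # Keep everyone moving: resources go to the job furthest behind.
        return min(jobs, key=lambda j: jobs[j][0])
    raise ValueError(f"unknown objective: {objective}")

jobs = {"q1": (0.92, 0.01),   # nearly done, diminishing returns per batch
        "q2": (0.30, 0.15),   # early batches still buy a lot of progress
        "q3": (0.10, 0.12)}
print(arbitrate(jobs, "max_progress"))  # -> q2
print(arbitrate(jobs, "fairness"))      # -> q3
```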
Starting point is 00:27:54 I mean, I'm kind of thinking, because fairness is not really a well-defined thing, right? Or it's very ambiguous and you can depend on your perspective and that can be applied, I guess, to the same thing, to just jobs, to query execution, execution right so is there a global setting there or is it is that pillar adaptive so like it can be like the fairness is almost query level and that's then used to come up with a global sort of um uh decision that's fair i guess respect with respective to everyone's preferences i'm sorry if that doesn't make much sense, but yeah.
I mean, I'm kind of thinking, because fairness is not really a well-defined thing, right? It's very ambiguous, and it can depend on your perspective, and that can be applied, I guess, to the same thing, to jobs, to query execution. So is there a global setting there, or is that pillar adaptive, so that fairness is almost at the query level and that's then used to come up with a global decision that's fair, I guess, with respect to everyone's preferences? I'm sorry if that doesn't make much sense, but yeah.
No, it does. I think that's a good question. Essentially, we support both of them. It depends on the workload, and like I said, we support both. So, for example, if all these workloads are from the same user, what they want to do is find the best one: they just give different configurations to different data processing jobs, or whatever, and they want to find the best one, the best result. Then this overall workload may have one global objective: find the best one, right? But if these queries are from different users, they may have different requirements, different goals, and it makes no sense to set a global objective for them, right? Because everyone is individual. So in that case, we will consider: what's your objective? What are your completion criteria? How much money do you want to spend? And then we can adapt, find some way to allocate the resources to these jobs and try to keep everything happy. Yeah, cool. So then number three, the cost model. Tell us about pillar three. Yeah, the cost model, this is a vision as well; we haven't published any long paper about that. But in our paper, the vision is, like I mentioned, that we try to provide a more flexible, suspension-aware pricing model rather than the plain pay-as-you-go model. It's like I've mentioned
several times already: if you allow me to suspend your query, or you give me some tolerance, then I can probably give you a better choice, better price options. And then you'll be happy, and I will be happy, because I can reallocate the resources which would have been allocated to you to some others and make more money. And then, yeah, everyone's happy. And I think in that case we somehow achieve better utilization of these limited resources, in terms of money, in terms of how much money the service provider makes. So I guess, in that case, cost tolerance is a fine-grained pricing model, rather than the standard pay-as-you-go model. I guess that sort of brings the vision together. We've got our three pillars there, each component. So with this obviously being a vision paper, there's a nice section in your paper about the future directions for Ratchet.
So tell us a little bit about where we go from this, now we've got this framework. What are the next steps? Yeah. So, since I just graduated, I want to say a little more about our UChicago group. This project is a long-term project, under the supervision of the professors, and there's a bunch of talented people working on it. Our next step is that we have actually already started building a suspension-oriented database system. We consider query suspension as a first-class capability: for this database, we think the query can be suspended, or should be suspended. And then, in that case, what would a modern database look like? How should we redesign each component to support that requirement? In that case, there's a very important difference: we cannot assume a query runs from beginning to end. And then, once we have multiple of them and we suspend some, you can imagine some challenging issues, like how should we keep consistency? How should we keep isolation? Or, you know, once we have multiple versions, what if the data changed between when we suspended the query and when we resume it? How should we guarantee that the result will be the same? Or which one do you want? Probably the user wants, like, when we suspend the query, the data is the old version, and when they resume, it could be the newer version. How should we do that? Which one do you want?
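One way to picture that versioning question: record which snapshot of the data the query had seen when it was suspended, and choose at resume time between pinning to that snapshot or re-reading the latest data. This is only a sketch of the open question as Rui poses it, not the group's answer to it, and all the names in it are made up:

```python
from dataclasses import dataclass

@dataclass
class SuspendedQuery:
    query_id: str
    snapshot_version: int       # version of the data the query had seen so far
    partial_state: dict

def resume(suspended: SuspendedQuery, current_version: int, policy: str) -> dict:
    """policy='pin'     -> keep reading the snapshot seen before suspension
       policy='refresh' -> restart against the newest data for full freshness"""
    if policy == "pin" or suspended.snapshot_version == current_version:
        return {"read_version": suspended.snapshot_version,
                "reuse_partial_state": True}
    # Data changed while we were suspended and the user wants fresh results:
    # the partial state may no longer be consistent, so it has to be discarded.
    return {"read_version": current_version, "reuse_partial_state": False}

print(resume(SuspendedQuery("q7", snapshot_version=41, partial_state={}), 44, "pin"))
print(resume(SuspendedQuery("q7", snapshot_version=41, partial_state={}), 44, "refresh"))
```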
So, yeah, there are some promising research directions there, I think. And also, you can imagine that this kind of system is not only good for resource-limited settings; it also provides flexibility to almost every system. If the cloud infrastructure needs to upgrade regularly, or the system needs to shut down for some time, then without this suspension-oriented database everyone loses their progress, or we need to find some other way to handle that. But now, as long as you can tell me when you want to shut down the infrastructure, or upgrade, or scale your resources up or down, I can find a way to keep everything happy, right? So you can imagine there are a lot of application cases where we can use these suspension-oriented databases. And also another thing
is that I think we want to spend more time on this pillar three. I think it's not only about database research or computer science research; it may need some economics, like how should we provide comprehensive pricing models to give users more options
Starting point is 00:33:53 and then we can have a different mechanism to handle each option and make
Starting point is 00:33:59 more money specifically. I think that's probably the future work of the near future. For the 10 years or 20 years, I really have no idea.
Okay, so we're at base camp at the moment for the research group's agenda for the next 10 years, and we'll see where it goes. But it sounds like there's plenty of interesting directions for it to be going in, for sure. Yeah, thanks. Yeah, so keep a lookout for the papers coming out of your group, because I'm sure they'll be very interesting. The next question I like to ask some of my guests about the work is: if you put your reviewer hat on for a moment, what are the limitations of your work? Obviously this is a vision, so it's kind of hard to know the concrete limitations of the work at the moment. But what limitations or what challenges do you see in taking this vision and actually making something realistic, concrete, and practical out of it? What are the big challenges, the limitations, that you may hit in the long term or even in the short term? Yeah, I think we did have some discussions or brainstorms when we explored this project. A few things we discussed before. Actually, we made some assumptions here: we know, or we somehow know, when the termination will happen.
If termination happens suddenly, like in the next second, with no warning, there's no way to preserve the progress. So that means one limitation, one case that we cannot handle, is when the termination is completely unexpected. If no one tells us, we will probably lose everything, or we will behave very, very badly. That's one thing. But if you can give us some hint about when the termination could happen, and it does not necessarily need to be an exact time point, you can give us a range, an interval: the termination could happen within this
Starting point is 00:36:07 time, like five minutes or 30 minutes or whatever. So it's like on Tuesday. Yeah,
Starting point is 00:36:11 exactly. Yeah, we can find a way. So I just wanted to give us something.
Starting point is 00:36:17 We will try our best to develop a system, you know, to keep everything in
Starting point is 00:36:24 good shape. Yeah, I think that's the one thing we haven't decided yet. Second thing, I think probably will change this project significantly. It's like, what if we will have a
really good world in the future, where the resources are not limited? Even for the zero-carbon cloud infrastructure, maybe some very, very good engineers somehow manage to make this energy really stable. Even if it's driven by a sustainable energy source, they can somehow get those resources stable.
Then there's no need to consider that the resources will be terminated. I wish they could, and I really want to live in a world like that, where we have unlimited resources. Yeah, it sounds a bit utopian, this glorious future that will probably never happen, right? It sounds great, but I don't know how realistic it is. But in that case, yeah, in that case our project probably won't play a significant role, because we have two assumptions. One is resource scarcity: we consider this will be the trend, considering the global situation. And the other is that we will have some hint about the termination. As long as you can give us these two assumptions, then our system, our project, will give you good answers. Nice, cool. So yeah, I guess we've kind of spoken about the vision there a little bit, and I guess, I mean, what impact do you think Ratchet can have longer term? How revolutionary do you think this project can be going forward? I mean, how do you think it could affect things?
Just two questions here, sorry. So there's the general impact of the project. No, no, that's fine. And then there's the impact on the day-to-day of software engineers and data engineers. Yeah, if we're putting our prophecy hat on now, we're going to say, okay, this is what's going to happen in the future: what impact do you think it can have? The finding, or the impact, of Ratchet, I think, and I've already said this a lot of times, is that we want to argue that we should try our best to use the ephemeral cloud resources. We should even accept the termination, and consider termination as a regular thing rather than a rare exception. Once we consider that resources will be very dynamic and can be terminated sometimes, somehow, then we can design a system to handle them, or even exploit them. I think that's the biggest argument or finding in our research that we want people to consider. And I think for software engineers or data engineers, my guess is that when they run or design a system, they never consider that the resources can be terminated; I don't think they consider that the resources will be ephemeral. They will consider how to maximize resource utilization, but they won't consider: what if, at some point, I suddenly have no resources, what happens then? Yeah, I think that links to some existing areas
Starting point is 00:39:49 like fault tolerance, recovery, recovery mechanism, or checkpointing, some, yeah, you name it.
There are some existing topics in that area. But with our research, we provide another way to reconsider those things: can we just try to keep as much progress as possible, or keep just the necessary progress, and make people think, okay, that's good enough for us if the termination happens? I think that's something different from the previous research work.
Yeah, I guess when you were talking about the first pillar, the framework and the different strategies, I was thinking checkpointing, I was thinking save points and those sorts of things. But I guess what's happening here is this has now been surfaced as an interface that people can interact with, rather than it being hidden within the system, within the database somewhere doing its checkpointing; now we're actually able to exploit it and use it in our applications and whatever. So yeah, I can definitely see the parallels and how you are basically taking those and putting them in a different light. Right, yes, exactly. So the next one that I like to ask a lot is: what's the most interesting thing you've learned while working on the project, in this case on Ratchet? What's been the best insight you've got from it, or the most surprising? I think the most interesting lesson I learned from that links to some backstory, the original story of this project. This project actually came from an earlier one. In a previous project I had, Rotary, which I mentioned, we want to schedule the different jobs during runtime, and then we want to give them the most appropriate resources.
And then we found that during runtime we want to preempt those queries; during runtime we want to reallocate the resources, and we need to suspend them. And at that time, we used a checkpoint mechanism that supports not only the data processing jobs but also the machine learning jobs. So anyway, we rely on the checkpoint mechanism to suspend jobs. And sometimes we found that checkpointing can be a time-consuming process. It may generate a lot of data, and if we keep doing that, sometimes it gives us unexpectedly long latency. At that time, I kept going deeper and deeper to find out how they checkpoint the query process. We were using Spark Structured Streaming as the platform, and I went inside their source code to see how they checkpoint things. And we figured out, okay, they did do some pretty good work, but they definitely do a lot of extra work to maintain this checkpoint. So I think we wanted to keep exploring: is there a better way to do that, is there any other method we can use to suspend it? So I think that's the most interesting thing: the current project actually grew out of a failure point of the previous project. If you keep going and keep going, you will find, okay, sometimes I can do more work on this failure point, and then I can publish a new paper. I think that's a pretty interesting lesson that I learned. First, never overlook any failure points. Secondly, keep going and dig into the details; probably you will get something.
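For listeners who have not seen it, this is roughly the mechanism Rui dug into: Spark Structured Streaming persists stream offsets and operator state under a checkpoint location on every micro-batch, and that bookkeeping is what can get expensive. A minimal sketch, assuming PySpark is installed; the paths and the toy rate source are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

# Toy streaming source: the built-in "rate" source emits (timestamp, value) rows.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Structured Streaming writes offsets and state under checkpointLocation on every
# micro-batch; this is the bookkeeping that can add unexpected latency.
query = (stream.writeStream
         .format("parquet")
         .option("path", "/tmp/suspend_demo/output")              # hypothetical path
         .option("checkpointLocation", "/tmp/suspend_demo/ckpt")  # hypothetical path
         .start())

query.awaitTermination(30)   # let it run for roughly 30 seconds
query.stop()
```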
Yeah, keep iterating, right? You never know, just keep going and trying new things. Awesome, that's really cool. And whilst we've got you here, it'd be nice to hear about the other work from your PhD as well. So maybe you can give the listener a quick breakdown of some of the other things you've worked on across your studies. Yeah, definitely. So besides the research I just described, I also have a strong interest in the machine learning area, so I did some projects at the intersection between databases and machine learning. One is, when I was a research intern at Microsoft, I optimized the feature store. One of the most important operations in feature stores is called the point-in-time join; it's essentially an operator to generate different features, and we optimized it to make it faster to get the features for training. And also, when I was on the DocuSign AI engineering team, I optimized their container-based hyperparameter optimization workload. So, I'm not sure if you are familiar with hyperparameter optimization, but essentially it's like this: I have a lot of hyperparameter configurations, and I want to find the best one to get better training results. In that infrastructure, they were using a container to run each hyperparameter configuration. I optimized it to try to put different containers together and execute them in parallel, to save time and get better results for the application. And I think one of my early works in my PhD studies looked at, once we have multiple models, whether it is possible to combine or pack them together as a bigger one on a single GPU and train them simultaneously, without hurting the final result of each model. I think, yeah, those are some other research directions I explored during my PhD, and I think I will keep doing some of this machine learning work. Cool, if any of those topics sounded interesting to the listener, we'll put links to those in the show notes as well, so you can go and follow up and have a look at some of the other awesome
work that Rui's done. So I've just got two more questions for you now, Rui, and the first one you maybe touched on a little bit a second ago, when you were talking about iterating and following through with ideas and not giving up on them too easily. It's about your creative process: how you approach research, how you approach idea generation, and then, once you've got an idea or a set of ideas, how you select which ones to dedicate time to and pursue. So yeah, how do you go about that, what's your process? Right, yeah. So I think, first things first, as a PhD student you cannot ignore the role of your advisors. Your advisors, yeah, they help in a lot of ways once we want to get the idea for or select a project. But for myself, I think I consider ideas as two kinds of things: one is the original idea, the other one is the optimization idea. For the original idea, it's like you create something that never existed.
It's really totally new. But those original ideas need inspiration, patience, and a lot of other things; that is, you know, something that is really difficult to measure and follow, right? But according to my experience, I think I have some experience with the optimization ideas, something others can probably follow too. First, find an area you have interest in. And then you keep exploring, and you will find some projects, find some papers, find some important contributions. And then the thing I like to look for is their assumptions, their scenarios, their applications. And then I will see: can I break one or multiple of their assumptions? That may give us, you know, a totally different thing, right? Has the application or the environment changed since the paper or the project was proposed? For example, a couple of years ago there were no GPUs, none of this modern powerful hardware, no NVMe, you name it, and now we have all of them. So can we add those new things to change some existing work? And then, of course, I want to see: is there anything missing? Is there any chance I can optimize or add more functionality to improve the existing works overall? That's something I usually will do
once I want to propose an optimization idea, you know, trying to improve the existing works. So that's my idea generation. And for project selection, I think I'm a little bit pragmatic. Once we select projects, of course, interest is one thing, but we also need to consider the deliverable results: how much time do you have, what result are you expecting? Do you want to have a paper in three months? Do you want to have a paper in one year? Or do you want to have a long-term project, so that at the end of your PhD journey you can announce, everyone, I made a really, really cool thing, everyone should look at it? What's your goal? So yeah, sometimes I figure out, okay, this is something I want to achieve, and then I will see, okay, what's the best way, or what's the best project, to achieve that goal. So I think, yeah, that's my way to select projects. But ideally, definitely, it would be much, much better if I could select a project I really
love. But that's usually not the case. Yeah, that's true. But what I like about what you said, just going back to the idea generation part of your answer, is the idea of understanding all the variables, all the assumptions, and then thinking: how can I change these variables, how can I break this assumption, and then seeing what happens and following those threads. That was a really nice part of that answer. And yeah, it's always good if you have something that you're passionate about, right? It makes it so much easier to work on something and pursue something if you are passionate about it and you're really interested in it. So yeah, that's a really nice answer to that question. Really cool. So we're at the end now, so it's time for the last word: what's the one thing you want the listener to take away from this podcast episode today? I think, yeah, I want to say, you know, I consider computing resources, or energy-backed resources, to be scarcer compared with, you know, times before. And you can imagine how complicated the modern data processing jobs are.
And then how large the large language models are. And you can see they are resource-consuming jobs, resource-consuming tasks, right? So in that case, I think we need to consider whether there's any better way to fully utilize the resources. Even if the resources are not stable, even if the resources are, you know, ephemeral, we really want to use them, because, yeah, resources will be scarce in the future, and you can imagine a lot of current model computation jobs consume a lot of resources. So I think that's the thing. Yeah, I think that's one thing to keep the audience thinking about: is there any other way, or is there a better way, to build a better system?
I think that's the one message I want to deliver. Great, a good message to end on for sure. So thank you so much, Rui, it's been a pleasure to talk to you today. If the listener wants to find any of Rui's work, we'll put links to everything in the show notes, so you can go and check those out. Whereabouts can we find you on social media, Rui? Are you on any of the platforms, LinkedIn, Twitter, or sorry, X, should I say now? Can we find you anywhere? I have a LinkedIn account; I think people can find me on that platform. For the others, I'm not a very active social media guy, so I guess that's probably it, yeah. Email might then be the best way to contact you if they want to talk about the work we've been speaking about today, I guess, so we can put your email in there as well. Yeah, or you can send me a message on LinkedIn; that's another way to reach out to me. Fantastic stuff. Yeah, and a reminder again, if you enjoy the show, please consider supporting us through Buy Me a Coffee, and we'll see you all next time for some more awesome computer science research. Thank you.
