Disseminate: The Computer Science Research Podcast - Xiangyao Yu | Disaggregation: A New Architecture for Cloud Databases | #68

Episode Date: November 27, 2025

In this episode of Disseminate: The Computer Science Research Podcast, host Jack Waudby sits down with Xiangyao Yu (UW–Madison), one of the leading voices shaping the next generation of cloud-native databases. We dive deep into disaggregation — the architectural shift transforming how modern data systems are built. Xiangyao breaks down:

- Why traditional shared-nothing databases struggle in cloud environments
- How separating compute and storage unlocks elasticity, scalability, and cost efficiency
- The evolution of disaggregated systems, from Aurora and Snowflake through to advanced pushdown processing and new modular services
- His team's research on reinventing core protocols like two-phase commit for cloud-native environments
- Real-time analytics, HTAP challenges, and the Hermes architecture
- Where disaggregation goes next — indexing, query optimizers, materialized views, multi-cloud architectures, and more

Whether you're a database engineer, researcher, or a practitioner building scalable cloud systems, this episode gives a clear, accessible look into the architecture that's rapidly becoming the default for modern data platforms.

Links:
- Xiangyao Yu's Homepage
- Disaggregation: A New Architecture for Cloud Databases [VLDB'25]

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Disseminate, the Computer Science Research Podcast. Hello and welcome to Disseminate, the Computer Science Research Podcast. As usual, I'm your host, Jack Waudby. I'd like to start out with a little bit of an apology today. I'm not in my usual setup, so if the audio is not fantastic, I can only apologize. It does seem a bit echoey, but hopefully it comes across okay. But anyway, we'll crack on. So, yeah, Disseminate, this is the podcast where we interview computer science researchers
Starting point is 00:00:25 about their latest work. We dig into the problems they've tackled, how they've solved those problems, and how the findings from that research can be applied in practice. The overall goal here is to try and narrow the gap between research and practice and make computer science research more accessible. So yeah, if you're an industry practitioner, researcher or student, then this is the podcast for you. You can listen to us on Apple, Spotify, and you can watch along on YouTube. So yeah, if you do enjoy the show, please like, follow, subscribe and tell a friend about us. So yeah, on to today's show.
Starting point is 00:00:58 And today we're going to be talking with Xiangyao Yu, who is an assistant professor in the database group at the University of Wisconsin-Madison. And, yeah, Xiangyao's research focuses on three areas: cloud-native databases, new hardware for databases, and core database techniques. We're going to be focusing primarily today on the first of those, and specifically disaggregation, which is the separation of database components into independently managed, scalable services. So, yeah, Xiangyao was awarded the 2025 VLDB Early Career Research
Starting point is 00:01:33 Contribution Award for his work on this topic. I'm really excited to have him on the podcast today. Welcome to the show, Xiangyao. Yeah, thank you. Cool. Thank you for the great introduction. Oh, yeah, that's good. Cool.
Starting point is 00:01:44 So let's get stuck in with some background then, and sort of set the scene for the listener. For those who are maybe new to the topic of disaggregation, can you explain to us what that means and how it differs from the traditional, classic database architecture? Yeah, for sure. So for the classic database architecture, maybe the most famous one is shared-nothing. And shared-nothing is like they have multiple servers connected by a network, but within each server, they have computation, they have storage, they have logging, all the functionalities. It kind of replicates that across different servers.
Starting point is 00:02:20 And that is your one cluster. So disaggregation is this different approach where they put different database components into different clusters. So now I have multiple clusters, and different clusters focus on different things: maybe one cluster focuses on computation, the other cluster focuses on storage. So in short, you can think of the conventional architecture as a tightly coupled single cluster,
Starting point is 00:02:43 but disaggregation as loosely coupled multiple clusters. Nice, cool, yeah. That's a great description of it, the idea that you decompose all these different database functions into their own services. Cool. So, yeah, I guess what motivated you to explore disaggregation amid this sort of shift in database architectures?
Starting point is 00:03:03 That's a great question. So I guess it started when I was a postdoc in 2018 at MIT. At that time, people didn't really use the word disaggregation yet, so people called it different names. But it was pretty clear to me that this architecture has great potential, and it seemed to me, oh, this seems to be the future, separating the functionalities. So I guess that was a vague thought, saying, okay, this is still not certain, but I want to explore this. I started with a first project on computation pushdown
Starting point is 00:03:37 for analytical processing and went from there, expanded to many other projects. That's cool. Yeah, you definitely had some great foresight there, because given where things are going, you've definitely been kind of ahead of the curve there. I think, yeah, that's so cool. Yeah, let's talk about the shift a little bit more then. So we spoke about the traditional architecture a second ago, but what are the limitations of that architecture that disaggregation directly addresses? And I guess this is where we kind of need to start speaking about the cloud and cloud environments, and then how did that sort of make this architectural shift kind of inevitable, I guess? So you're totally right. It has to be, the discussion has to be about cloud,
Starting point is 00:04:16 because in the traditional environment, the traditional architecture was fine. For an on-premises environment, for example, shared-nothing was a great architecture. It's actually perfect. But when we go to the cloud, the environment is different. It's not on-premises anymore. And in particular,
Starting point is 00:04:34 cloud has this very salient feature called on-demand scalability, which is if you ask for more computation resources, you can get them immediately. You can shrink computation resources, and you only need to pay for what you use. And this kind of elasticity, or on-demand scalability, did not exist
Starting point is 00:04:51 in the shared-nothing architecture, and did not exist in the on-premises environment. Okay, so it's more like, okay, when we move to the cloud, we really want our database to also have on-demand scalability, like it's super cool, super cost-efficient. But because shared-nothing was not designed
Starting point is 00:05:09 for this environment, it cannot fully exploit such scalability. Yeah, so basically there's one thing: when we want to have on-demand scalability, we truly want scalability on the compute side, but not the storage side. The compute demand can change drastically over time,
Starting point is 00:05:34 but the storage demand does not change very much over time. And compute is stateless, so it's easy to auto-scale. Storage is fundamentally difficult to auto-scale. In the shared-nothing architecture, they always couple compute and storage into a single cluster, so that would make it very difficult to auto-scale. Like, if I auto-scale compute and storage together,
Starting point is 00:05:53 and storage is very difficult to scale, as we know, then it becomes very difficult. Therefore, when people say, okay, we want to have on-demand scalability for compute, so in the cloud, what should we do? Let's separate compute and storage, so that, okay, we know we don't need to auto-scale storage too much. So now we can scale computation
Starting point is 00:06:13 because it's separated as two services. I think it's not a limitation of shared-nothing, I would say, but it's more like it failed to exploit the new opportunity in the cloud. So in order to embrace this new opportunity, we need to go to this disaggregated architecture. Yeah, nice. I guess, yeah, like you say, it's nothing kind of against the sort of design of shared-nothing, because the rules of the game have sort of changed, right?
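A toy cost model can make the elasticity argument above concrete. All prices and demand numbers here are invented purely for illustration:

```python
# Toy cost model: compute demand changes hour to hour while storage demand
# stays flat. In a coupled shared-nothing cluster, a node bundles one unit
# of compute with one unit of storage, so you provision for whichever
# demand is higher (and repartitioning data makes scaling down painful).
# With disaggregation, compute and storage are billed independently.

def coupled_node_hours(compute_demand, storage_units):
    # Shared-nothing: node count must cover both compute and storage peaks;
    # since repartitioning is expensive, assume we provision for the peak.
    peak = max(max(compute_demand), storage_units)
    return peak * len(compute_demand)

def disaggregated_unit_hours(compute_demand, storage_units):
    # Disaggregated: compute auto-scales per hour, storage is a flat service.
    compute_hours = sum(compute_demand)
    storage_hours = storage_units * len(compute_demand)
    return compute_hours, storage_hours

demand = [2, 2, 10, 3, 2, 2]   # bursty compute demand over six hours
storage = 4                    # constant storage demand

print(coupled_node_hours(demand, storage))        # 60 node-hours
print(disaggregated_unit_hours(demand, storage))  # (21, 24) unit-hours
```

The coupled cluster pays for the burst the whole time; the disaggregated one pays for compute only when the burst happens.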
Starting point is 00:06:41 And it's made these new architectures possible. Let's talk about the sort of evolution of disaggregation since maybe like 2018, when you sort of had this great foresight of seeing this being a sort of trend. So things sort of initially started off with separating storage and compute, right? With, like, Snowflake and Aurora sort of pioneers in that. But things have kind of started to go beyond that. So can you maybe take us on that journey from sort of day one, hey, we can split storage
Starting point is 00:07:12 and compute, because it's great, because storage has got different characteristics to compute and we can actually pool it, we can scale things differently just for compute, to sort of like where we are today. So I think, okay, I think a lot of cloud databases are separating compute and storage, and some databases are going further. I think there is consensus on separating compute and storage. And for the extra separation, the extra disaggregation, every system has a little bit of a different exploration, right?
Starting point is 00:07:46 So this is a very active area right now. At a super high level, it's basically like, well, for a subset of database functions, we can disaggregate that into a separate cluster. And there are so many database functionalities, so you can disaggregate in many, many different ways. So just to give you some examples: storage, okay, it can be further disaggregated. In Socrates, for example, from Microsoft, they further disaggregated storage into a logging service,
Starting point is 00:08:19 a page cache, and a durable page store. So think about the logging service, for example. For the logging service, the performance is really critical because it's on the critical path. But the log size is usually much smaller than the data, the page data. And because you care about performance and it's very small, you can afford to use some more advanced technology,
Starting point is 00:08:41 maybe more expensive storage, to cut the latency there. But you don't want to use that expensive storage for the page store. Yeah, it's too expensive. By disaggregating these two, now you can use expensive storage for the logging, and that will improve the overall performance. So that's one example where, like, okay, because different functionality has different performance requirements,
Starting point is 00:09:04 by disaggregating them, now you can customize the implementation. So some other examples: okay, we talked about storage, further disaggregation. And you can also disaggregate execution, like, oh, we don't have to do all the execution in the compute layer. We can push some of the execution to, well, what I call a push-down layer,
Starting point is 00:09:25 a layer closer to the storage. It doesn't have to be within storage. It can be within storage, but it can also just be close to storage. Some other systems, like Snowflake, have this caching for intermediate data. They just say intermediate data has to be flushed from a compute node because it's too big. But you don't have to write it back to the storage layer.
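The Socrates-style split described a moment ago, a small latency-critical log on a fast expensive tier and a large page store on a cheap durable tier, can be sketched roughly like this; the class and method names are invented for illustration and are not the actual Socrates API:

```python
# Sketch: commit latency only pays for an append to the fast (expensive)
# log tier; the bulky pages live on the cheap tier and are updated off
# the critical path during checkpointing.

class TieredStorage:
    def __init__(self):
        self.fast_log = []      # small, low-latency, high cost per GB
        self.cheap_pages = {}   # large, higher-latency, low cost per GB

    def commit(self, txn_id, log_record):
        # Critical path: just a fast log append, then the commit is durable.
        self.fast_log.append((txn_id, log_record))

    def checkpoint(self):
        # Background work: apply log records to the durable page store.
        for txn_id, (page_id, value) in self.fast_log:
            self.cheap_pages[page_id] = value
        self.fast_log.clear()

store = TieredStorage()
store.commit(1, ("page_7", "hello"))
store.checkpoint()
print(store.cheap_pages)   # {'page_7': 'hello'}
```

Because the log is tiny relative to the pages, paying a premium per gigabyte for the log tier barely moves the total bill while cutting commit latency.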
Starting point is 00:09:46 You can write it back to this intermediate data caching layer that has lower costs than a storage layer like S3. So some other examples are a metadata layer, query optimization as a service, or memory disaggregation. So a lot of exploration, a lot of ideas floating around here. Yeah, it's great. There's a lot of possibilities there. You kind of feel like a kid in the sweet shop, right? There's so many different database functions. Like, ooh, can we disaggregate this bit? Can we disaggregate that? But I guess there is probably a tendency to want to do that, and I guess there is a limit to how far we can take it, right? Because if we start jamming the network in between everything, eventually we're probably going to sort of lose all the benefits of the disaggregation, right? And I guess, possibly in some scenarios, I think you actually reference this in your paper,
Starting point is 00:10:35 that there are some situations where, if you go too far with this, the traditional shared-nothing architecture can actually be better. Yeah, so that's actually a potential pitfall of disaggregation. It's not the case that disaggregation is always good. So, for example, you don't want to disaggregate everything. I think it's like what the microservices are doing. Everything is a separate service. For every application you have hundreds of services.
Starting point is 00:11:05 you probably don't want so many services for databases. I mean, maybe, but I think that's probably pushing it too far. So disaggregation is a trade-off, actually, between performance and maybe separation of concerns. So the more you disaggregate, the more the performance tends to suffer. So if you want the best performance, you know what? Go shared-nothing. That will give you the best performance.
Starting point is 00:11:35 So people want disaggregation not for performance, clearly. It's for elasticity, auto-scaling, cost-efficiency, on-demand scalability, maybe separation of concerns. But you sacrifice performance in order to get these features. So when designing systems, I guess, you should be very careful. If elasticity or disaggregation is a property that's really important for you, go for it. But be aware you're giving up some
Starting point is 00:12:10 performance. So you can probably apply some optimizations to get some performance back, but it's probably very hard to get all the way back to shared-nothing. But if, for a particular use case, elasticity is not that important, well, maybe you don't want to disaggregate aggressively. It's really interesting, that, because, I mean, there's these two sort of things bouncing around my mind. And I know when I was very much in research, actually, I was at the time maybe very focused on performance, right, because you want to make throughput higher and latency lower, right? And that's how you get your paper accepted, sort of thing. But when you start to set that
Starting point is 00:12:45 against the actual business concerns, and the people who run these systems in production, like the things they care about, it's not just performance, right? There's many dimensions to this, and this can sort of satisfy those goals and those desires for those customers. And I think it's interesting as well, because you often see people say, oh yeah, disaggregation is the future, and that means it's going to replace everything, and in ten years' time that's what we're all going to be doing. Which, as you say, is not true, right? Both types of systems can coexist, and it all
Starting point is 00:13:20 depends on what the actual customer wants. If you want performance, go this way; if you care about elasticity, you've got this. So it actually increases the offerings for customers, I guess, and what's available. So that's, yeah, really cool. Brilliant. So yeah, let's talk about, so obviously you and your team at the University of Wisconsin-Madison
Starting point is 00:13:41 in the database group there, you've been exploring this design space and seeing what's possible within this paradigm. So let's switch our focus and talk about some of these projects that you've been working on. In your paper you classify them into sort of three categories, so yeah, tell us about the stuff you've been doing
Starting point is 00:13:56 to sort of reinvent or reimagine some core protocols in a disaggregated world. Yeah. I kind of classify the work we have been doing into three categories, but this is by no means exhaustive. I think other people would have different dimensions, because there's a huge design space,
Starting point is 00:14:17 a lot of opportunities. So the three categories we did in our lab are the core protocols, like fundamental core database protocols, the query engine optimizations, and also some new capabilities that you can enable using disaggregation, capabilities that didn't exist in the traditional shared-nothing architecture, for example. So I would just go through them one by one.
Starting point is 00:14:44 So for the core protocols, one work we did was to revisit two-phase commit. Two-phase commit is this classic protocol that is very well designed, and I think originally it was assuming shared-nothing. So there is one problem with 2PC. If you're familiar with 2PC, you must know this: there is a blocking problem, where in certain conditions 2PC protocols can get stuck and there's no progress. Well, at least certain data pieces happen to be locked indefinitely and cannot be released
Starting point is 00:15:21 because the state of a particular transaction cannot be determined. I will not talk through the protocol overall, but fundamentally, you have this blocking because some node has failed and the state on that node is not accessible, because when a node fails, the compute and storage fail together. That's the fundamental reason.
Starting point is 00:15:43 Other nodes cannot see this failed node's disk, so we cannot make forward progress. But this is no longer true in the disaggregated architecture, because the storage is separate from compute. And you have to assume your storage doesn't fail, because if your storage fails, you lose everything in your database. And then the concern is not just about blocking anymore.
Starting point is 00:16:02 You lose the whole database. So if your storage doesn't fail, even if your compute fails, other compute nodes can still access all the state in storage. So it can never be the case where a compute node fails and suddenly you lose access to your storage state
Starting point is 00:16:23 and cannot make forward progress. To the other compute nodes, the log file is just sitting right there, accessible. By just leveraging that, you can address this blocking issue. Okay, the details are a little bit complicated, so I will not go through that.
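As a rough illustration of that point, here is a toy sketch, not the actual protocol from the paper, of why shared storage removes the blocking: the coordinator records the commit decision in storage that is assumed not to fail, so any surviving compute node can resolve an in-doubt transaction instead of waiting for a dead node:

```python
# Classic 2PC blocks when the outcome lives only on a failed node's disk.
# With disaggregated storage (assumed highly available), the decision is
# durable in shared storage, so recovery never has to wait.

shared_storage = {}   # survives compute-node failures

class Coordinator:
    def __init__(self, txn_id):
        self.txn_id = txn_id
        self.alive = True

    def commit(self, votes):
        decision = "commit" if all(votes) else "abort"
        # The decision is durable in shared storage before anyone acts on it.
        shared_storage[self.txn_id] = decision
        return decision

def recover(txn_id):
    # Any compute node can resolve an in-doubt transaction by reading the
    # decision from shared storage; no record means no decision was ever
    # made durable, so it is safe to abort.
    return shared_storage.get(txn_id, "abort")

coord = Coordinator(txn_id=42)
coord.commit(votes=[True, True])
coord.alive = False        # coordinator's compute node fails
print(recover(42))         # 'commit', with no blocking
```

The real protocol has to handle prepare records, locks, and races between recovering nodes; the sketch only shows the changed failure assumption.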
Starting point is 00:16:40 That's the insight. It's an architectural change that changes the fundamental assumptions, and that allows you to design protocols differently. We also did something for the control plane; I'll talk about that briefly. The disaggregation work today, a lot of that mainly focuses on the data plane. But the control plane, think of Zookeeper, that is mostly still a centralized service.
Starting point is 00:17:05 That is not disaggregated. So I think the work here is similar to, like, oh, let's disaggregate the Zookeeper so that it can also auto-scale, and it can also overlay on top of your database cluster, so you don't need to have a separate cluster. Yeah. So the second category of work we did is for query engines, and it's called push-down processing. Nowadays, I think push-down is adopted in many cloud databases, cloud production databases, already.
Starting point is 00:17:37 So it's like, oh, okay, traditionally, in shared-nothing, every node reads from its local disk, does some local computation, and then does some data exchange. But in disaggregation, you cannot read your local disk; the data is sitting in the remote storage. So every time you read from the remote storage, you have to transfer data over the network. So it's actually much more data transfer over the network
Starting point is 00:18:01 compared to a shared-nothing architecture. But that's definitely overhead. So one way to mitigate that is you actually push certain query processing down to the storage layer, or close to the storage layer. So that means even if you only have four servers in your compute cluster, when you push down the sort of operations like selection or aggregation, you can leverage a serverless pushdown layer,
Starting point is 00:18:29 which may be able to use hundreds of servers to process something in parallel, reduce the data, and return it to the compute layer. So the data received by the compute layer is significantly reduced, so that the performance can be improved. So there are a lot of details about pushdown. We actually have a series of papers, where you can push down simple operators, and you can also push down a little bit more advanced operators.
Starting point is 00:18:58 Like shuffle, even shuffle can be partially pushed down. Like Bitmap operators can be pushed down. And also, what about caching? How do you do hybrid pushdown and caching? Because intuitively, if you push down, it's very difficult to cache the result of push-down computation, because that can be an arbitrary predicate. How can you leverage that?
Starting point is 00:19:23 So there are some questions like, how do you do hybrid pushdown and caching to get the benefit of both worlds? And also, what if the pushdown layer is saturated? So you want to push down something, but other people have already pushed down a lot of computation. So if you push down further, it would actually be slower than not pushing down. Yeah, because the pushdown layer's computational power
Starting point is 00:19:44 is not that great. So what do you do in that case? Our solution is to push back. So you push down, but the pushdown layer says, I don't have CPU power, sorry, and just returns that to the compute layer. So you say, okay, now I have to load the original data from storage.
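The pushdown-with-pushback behaviour just described can be sketched as follows; the function names and the saturation flag are invented for illustration:

```python
# Sketch: the compute layer asks the shared pushdown layer to run a filter
# near storage; if that layer is saturated, it refuses (push-back), and
# the compute node fetches the raw data and filters locally instead.

def pushdown_scan(storage_rows, predicate, pushdown_busy):
    if not pushdown_busy:
        # Filter near storage: only matching rows cross the network.
        shipped = [r for r in storage_rows if predicate(r)]
        return shipped, len(shipped)          # (result rows, rows shipped)
    # Push-back: pushdown layer is saturated, so ship everything and
    # filter at the compute layer instead.
    shipped = list(storage_rows)
    return [r for r in shipped if predicate(r)], len(shipped)

rows = list(range(100))
pred = lambda r: r % 10 == 0

result, transferred = pushdown_scan(rows, pred, pushdown_busy=False)
print(len(result), transferred)   # 10 10   (pushdown: 10x less transfer)

result, transferred = pushdown_scan(rows, pred, pushdown_busy=True)
print(len(result), transferred)   # 10 100  (push-back: full scan shipped)
```

Either path returns the same answer; the choice only trades network transfer against load on the shared pushdown layer.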
Starting point is 00:20:01 So I can do some of those designs to get overall better performance. Yeah. So the third category of work is to enable new capabilities using disaggregation. And the project here is called Hermes. We want to achieve real-time analytics, and a lot of people have been working on this topic. One solution is to use a hybrid transactional analytical database, like HTAP. But that requires migrating to a new database.
Starting point is 00:20:35 Like, I have been using Postgres and maybe an analytical database, I don't know, Presto, happily. Now, because I want real-time analytics, you ask me to migrate to a new database. That can be a little painful. Yeah, people don't like doing that, right. They like to stay where they are, if it's possible, yeah. Exactly. And if you migrate your database, you may lose features.
Starting point is 00:20:55 And maybe what you could do before, you cannot do now. So there are more headaches. And we were thinking, well, can we leverage disaggregation? We'll just introduce a new module that can help you achieve real-time analytics. And basically it's a layer sitting, okay, both AP and TP, we assume, are disaggregated. So the storage is remote, and we insert a new layer in between. Okay, so we capture the transactional log, and we take that log tail and merge that with the analytical read, right, so that we can achieve real-time analytics using existing engines, and don't need to modify these engines, and also we can get pretty good freshness. So that's, we call it off-the-shelf real-time analytics, because you don't have to
Starting point is 00:21:45 switch your database. Yeah, I mean, that solves the problem of usability, right, and, like you say, the operational concerns of having to migrate data around between systems, which people don't like doing. So yeah, actually, when I was looking through the projects you've been working on, obviously I really liked the 2PC one, because, I mean, I think it's my favorite algorithm, 2PC, because it's just, I mean, it's obviously got its problems, but it's a really nice, easy to understand algorithm. That was a really nice sort of, yeah, take on it, I guess, and sort of modernising it anyway. So that was definitely one of my favorites, but I really liked the Hermes stuff
Starting point is 00:22:19 as well, and sort of, kind of, it feels like, because a lot of people have tried to do HTAP right over the years, and nothing's ever really stuck, right? No one's ever really solved it, in my opinion, from the things I've come across, in a brilliant, nice way. And this, I feel, is definitely getting close to that, sort of, like, the best we can do with HTAP, so I really did like that as well. Cool, yeah, I think there's potential, yeah, definitely. Sure. I mean, I'm kind of going off-piste here a little bit, but do you have any sort of plans to take these research prototypes
Starting point is 00:22:52 into a more production environment, or sort of get them out in the wild and get people using them? Or is the plan very much to keep these things research projects for the time being? And that's a great question, and that's a question I've asked myself
Starting point is 00:23:17 many times in the past several years. I really hope to get something out there that people can use. For systems researchers, that's always the dream: you want people to use your system. I tried that, actually. But there is a challenge for cloud databases: it's about the cloud, and it has to be at big scale. So anything about cloud databases, you want it to be
Starting point is 00:23:34 practical, it's probably going to be something pretty big, especially since we're talking about infrastructure. So at this point, it seems to me, maybe, okay, it's hard for us to build something ourselves, but we can put the idea out there. We can try to prove it using some prototype. But maybe it's easier if, okay,
Starting point is 00:23:55 a big company or someone else, a startup, I don't know, wants to pick up this idea, and they incorporate some part of it or the entire idea into their system. That's probably an easier path. Yeah, because the sheer scale of these systems makes it a little challenging to do that in the lab. Yeah, that's very true. You sort of need a whole engineering department, right, which is not something you kind of have on hand, right, when you're working in academia, right?
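Going back to the Hermes idea from a few minutes ago, capturing the transactional log tail and merging it into analytical reads, a toy version of that merge looks like this; the data structures are invented for illustration:

```python
# Sketch: the analytical store holds a slightly stale checkpoint; Hermes-style
# freshness comes from overlaying the committed-but-not-yet-checkpointed log
# tail on top of it at read time, so an unmodified analytical engine sees
# fresh data without migrating to a new HTAP database.

checkpoint = {"k1": 10, "k2": 20}      # what the analytical store has
log_tail = [("k2", 25), ("k3", 5)]     # committed writes not yet checkpointed

def fresh_read():
    # Merge the base snapshot with the log tail: log entries win because
    # they are newer. Real systems do this per partition, ordered by LSN.
    view = dict(checkpoint)
    for key, value in log_tail:
        view[key] = value
    return view

print(fresh_read())   # {'k1': 10, 'k2': 25, 'k3': 5}
```

The point of the architecture is that this merge layer sits between disaggregated TP and AP engines, so neither engine has to change.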
Starting point is 00:24:26 It's cool. Yeah, so I guess kind of following on from this last thing we just chatted about there, speaking about practical implications and things: if I was a company now, sort of designing a system from scratch, or maybe taking an existing system and trying to migrate that to a more cloud-native system, what would your advice be? And what would you say are the most common mistakes to try and avoid for these people trying to adopt disaggregation? I think there have been several quite successful formulas out there, like commercial systems doing disaggregation, and many of these systems do disaggregation in similar ways. I think one potential mistake, I kind of mentioned this earlier, is over-disaggregation.
Starting point is 00:25:18 Before you are certain that a particular disaggregation would give you benefits, be a little careful before you decide to disaggregate that particular component, because the more you disaggregate, the more you tend to suffer from the performance issue. So be careful with what you want to disaggregate. Otherwise, I think, oh, if you just disaggregate compute and storage, it seems pretty natural, and there are a lot of successful stories. But if you want to go further, just be careful; disaggregating everything may not be a good idea.
Starting point is 00:25:53 Yeah, that's sage advice. The separation of storage and compute seems a pretty tested formula now, right? That's definitely, yeah, pretty safe, I think. And, yeah, that's cool. So, yeah, let's switch our focus and talk about the future then. So for researchers and system implementers, what do you think, obviously we've got a smorgasbord of possibilities and there's so many things we can disaggregate, but yeah, what do you think are the most exciting open problems in this space?
Starting point is 00:26:20 I think, as I said, there's just a lot of opportunities. When you think about it, how often do you see a new architecture for a distributed database? That doesn't happen very often, right? Maybe every decade, every few decades,
Starting point is 00:26:38 you see such an opportunity. And now it's like, this is a great opportunity. And also there's a lot of richness in this design space. Like, you can disaggregate a lot of things. And, of course,
Starting point is 00:26:49 disaggregating everything is not a good idea. So you can ask, okay, what makes sense to be disaggregated in what context, maybe for certain workloads? And if you just, okay,
Starting point is 00:27:01 if you disaggregate a lot, then performance may suffer, but can you come up with some mechanisms or algorithms so that you can get most of the performance back, like making the network not a key bottleneck, for example. So we have been thinking about some modules, for example, indexing.
Starting point is 00:27:22 We haven't seen a lot of work disaggregating indexing. There is concurrency control, the query optimizer; you can disaggregate the query optimizer. You can disaggregate materialized views. It's like caching, but more advanced caching. And because this disaggregated service can be shared by many database instances, you can get some
Starting point is 00:27:43 common knowledge, like for the query optimizer. Different databases may have different knowledge for the query optimizer; if you put it together, it can make a better decision, right? So there are a lot of opportunities. We are exploring a subset of these things as well in our lab, but I just feel there's just a huge space with a lot of opportunity. Another thing I think is pretty interesting is we haven't talked beyond a single cloud so far. Everything is in a single cloud. That's great. But there is also a need to go to multiple clouds.
Starting point is 00:28:16 And not necessarily multiple public clouds. It can be, I have a private cloud with some of my maybe privacy-sensitive data, and also I store a lot of data in S3, and now I want to run a query between these two sets of data. How can I do that? So there's an even bigger design space. It's not a simple disaggregation architecture; it's not like storage is one layer and compute is another layer.
Starting point is 00:28:42 It's maybe a multi-level disaggregated architecture. Locally you are maybe disaggregated, and in the cloud it's also disaggregated. And now you want to go run a query. It's like, where do I run which part of the query? And how much computation do I need? I can even allocate that elastically. So I think that's also a fascinating problem space.
Starting point is 00:29:04 It's an even bigger design space. And I think there's definitely a need for such a system, and a lot of optimization opportunities. Like, if we do the query in a certain way, it's probably going to be 10 times more cost-efficient than doing it another, naive way, for example. Cool. Yeah.
Starting point is 00:29:22 Another thing, I mean, it's amazing we've made it this far through a podcast without mentioning AI, because it's everywhere, right? It's hard to avoid it these days. But the way I want to approach this question is: there is a hell of a lot of investment in AI, which isn't necessarily directly related to today's topic, though databases obviously get caught up in it. Yeah, people aren't directly talking about databases
Starting point is 00:29:59 but like databases are still part of the ecosystem is probably how I would phrase it but because there's such a huge investment in it and the kind of that the things have been developed all the time what do you think of the implications of that or the sort of the side effects or the knock on effects that it might have on the cloud environment that might change what's possible
Starting point is 00:30:18 with disaggregation and databases? Or what might directly impact databases in the sense of the functionality we need to add, and how that might play into disaggregation as well? Sorry if that's a bit of a long-winded question. That's a good question. I need to be careful here because I don't know too much about the AI part of this space. The way I think about it is there is the mechanism part of the system
Starting point is 00:30:48 and there is this policy part of the system. AI is really great at the policy part. Like, for example, we say disaggregate. Okay, great. That's separated. That's what we make the decision. But then you say, oh, you can auto scale. Awesome.
Starting point is 00:31:04 But how many servers do you want at a given time? How do you want to scale it? When do you want to scale it? AI is great for that. Right. Okay. Another way to look at this is: okay, we enable auto-scaling. Perfect.
Starting point is 00:31:19 But how do you maximize the benefit from auto-scaling? I think you cannot have a human making the decision a lot of the time. AI can play a very critical role here because that's its domain. So I think that's part of it. You have a lot of parameters.
Starting point is 00:31:38 You can team, like, AI can help you tune those parameters. And I think, AI will, okay, database will be a very important component in overall systems, including AI and another component. So I think there's got to be a lot of interesting trends. I don't know exactly what would happen, though.
Starting point is 00:32:03 I think maybe in the future, most of the queries will be generated by AI. But I don't know what that means, because AI may access the database one way today, and then two days later the model completely changes. I mean, that can happen.
Starting point is 00:32:19 So I think it's a little hard to predict exactly what will happen, but exciting things will happen. That is for sure. Yeah, that's very true. We can agree on that for sure. Cool. Keeping on the theme of the future, if we were to have this conversation again in 10 years' time, what impact do you think disaggregation will have had on databases and the database community? I think it's probably just like
Starting point is 00:32:49 what share nothing had like the impact share nothing had like maybe 30 years ago I hope I hope the segregation it's probably already happening all cloud databases are kind of
Starting point is 00:33:02 adopting disaggregation today and I think it will go deeper and disaggregated even more and the system will become more composable more modular like modular disaggregation is kind of similar concept.
Starting point is 00:33:17 This aggregation is on a physical level, like physically, physically separated clusters, and composability is more about, well, it can be logical or physical modules, like software modules. So I think these two things, these two concepts will go hand-in-hand, so system will become more composable as well.
Starting point is 00:33:36 But you can, so now I want to build a database, but I don't have to build everything from scratch. I can reuse this cry optimizer, which is probably a, service provided by someone else and I can use the storage and I can just build my engine. Even building the engine can use something like Arrow for the internal data structure. So I think that will be like it's much easier to customize and build the database systems. And I think maybe the biggest impact is that this become common sense.
Starting point is 00:34:11 Hopefully one day it has become common sense. And people would not even ask, oh, are you using this aggregation or not? Like, oh, isn't that the only way to build systems or like for a large-scale cloud database systems? But I think if it becomes a common sense, people don't even realize this is the concept. And I think that's probably the biggest success. Yeah, definitely. It was really interesting there what you're saying about the almost having this disability where you've got all these services.
Starting point is 00:34:38 You can almost like pick and mix and build your own database. And you can even imagine that has. happening almost somewhat on demand, basically. A query comes in and there's some sort of optimizer. Yeah, yeah. And it builds your database on the fly and uses the logging service from there, the query optimized from there because of the type of query it's one to ask you. So, yeah, the possibilities really are sort of endless and quite it.
Starting point is 00:34:58 I'm sure it'll be the next 10 years, I think it's going to be very interesting as well. So, yeah, cool. Let's do a bit more, a bit of a reflection now because I know you've worked with some very, very influential, people in databases over your career so far. There's some big names in there, like Michael Stonebreaker, for example. So how would you say those relationships have shaped your approach to first research and then second system building?
Starting point is 00:35:28 I know those two things out mutually exclusive. But yeah, how have they shaped your approach to those? Yeah. So, yeah, I have a work with a lot of great people. And I learn a lot from them to the extent sometimes I don't remember which part. What I learned from individual, just a lot of them tell me things has shaped my research taste.
Starting point is 00:35:52 I think maybe there are two things I want to say. One thing I learned early in my career was to try to do research that is demand-driven. At some point in my career, I was like, okay, I want to work on really cool ideas, things that are intellectually interesting. But I didn't really think about, oh, but who needs it?
Starting point is 00:36:17 Yeah Then I try to change the mindset Like based on what they told me And maybe it should be demand driven Let's understand what do people need I mean sometimes they understand They are needing this They will tell you
Starting point is 00:36:31 Sometimes they don't know they need this But they actually need it So you at first understand What people actually need And you probably need to observe some transient industry, like what's going on. We'll talk to engineers, like, what are working on. And they will tell you engineering challenges,
Starting point is 00:36:48 which may not be research challenges. So you need to extract the research question out of these engineering challenges. So it's probably actually a research challenge. They may not even realize it. Okay, and for example, this aggregation is kind of the idea that way. So you get a research question,
Starting point is 00:37:06 you abstract it and say, okay, this is the research question, you define it, and then you develop solutions for the research question, then that the solution will also be hopefully useful for the real problems that people are facing the industry, because it's based on the demand there. It's demand-driven. Another approach, which is kind of similar to this approach, is like, I try to follow the trend. So the trend may not be demand It may be, okay, you can say AI is a trend because a lot of people are doing research there. What I mean, the trend is not what the papers are about, what the researchers are working on.
Starting point is 00:37:52 It's more about what's happening in the industry. Clearly you'll see something is going to be the future. And for example, back then, I clearly thought, okay, cloud is the future. And cloud database is going to be the future. And if that happens, we can innovate one way or another. There's just a lot of problems waiting for you to work on. But you can believe this is the trend. You jump on it, and then 10 years later,
Starting point is 00:38:19 it would be the infinite problems for you to work on. And the potential impact also be big. Some other trends I'm following right now is like, and we're not talking about this today, but like GPU databases, there's another trend we are betting on because GPU seems to be, have a, like, a great trajectory going forward so yeah so i think those are maybe two um general approaches to research yeah and i saw i've been been demand driven which i think yeah exactly like there's no
Starting point is 00:38:51 point in in creating anything if if someone's not going to use it right you want to solve a solve someone's problem right that's the way to sort of be successful in anything right in business and whatever if you can solve someone's problem and it also satisfies some demand then then great right you're on to a winner. And same with the trend driven, right? Like, yeah, I guess it is sort of sometimes hard to get the signal from the noise, especially with stuff like AI. But if you can kind of peel back that noise and sort of look at the structural trends
Starting point is 00:39:22 and sort of jump on one of those, then, yeah, again, it's a good way to sort of set yourself up for success, right, longer time. So, yeah, definitely agree with those insights. Great. So we've kind of arrived at the end. of the podcast now and we're on to the last word.
Starting point is 00:39:39 So two things here. First of all, what would you like practitioners to have taken away from this podcast today? And secondly, what would you like researchers to have taken away
Starting point is 00:39:49 from our chat today? I guess I'm probably having the same message for both people. I think this aggregation architecture is this is the new architecture for the case is finally we have a fundamentally new architecture
Starting point is 00:40:12 and if you if it satisfies your need like scalability elasticity this is the way to go and it opens a vast design space but this is maybe more for researchers this vast design space but it's not it's it's largely an export we are still in the early stage like look at how many papers are about segregated database versus share nothing databases. The design is way more complex, but there was so much fewer papers on it. But just don't fully understand it. There's a lot of research that can be done
Starting point is 00:40:47 and a lot of potential impact that can be made in this topic. Yeah, fantastic. That's a great message to end on. So yeah, thank you for saying. Thank you so much. It's been a lovely chat today. And also, before we do finish, we should give a shout out to,
Starting point is 00:41:00 we've had two of your students on the podcast before. So we had Yifi who came on and it was episode number. If I'm remembering this correctly, I didn't write this down. So I did get two years ago. Yeah, two years ago. Episode 48. And yeah, if you're interested in how you can optimize joint performance, go check that episode out.
Starting point is 00:41:17 And we also had Abigail came on recently. She was in our season two of our DuckDB in Research series, which is episode two in season two, sorry, I should say. And that was all about DBMSX sensibility. And she was called it anarchy in data faces. So yeah, it's a great episode to do go, go check that out as well. Yeah, thank you so much. It's been a lovely talk today.
Starting point is 00:41:40 And where can we find you on social media or anything like that? Where do you tweet at or it's not tweeting anymore, right? It's just X-y, I guess, or any of those. Yeah, I'm not super active on social media, but I have my personal website. Just search my name, I think. Yeah. And I do use Axel a little bit, but not too much. Cool. Well, we'll end things there. Thank you so much.
Starting point is 00:42:04 It's been a really, really insightful chat today. And yeah, great. And we'll see you all next time for some more awesome computer science research.
