Disseminate: The Computer Science Research Podcast - Xiangyao Yu | Disaggregation: A New Architecture for Cloud Databases | #68
Episode Date: November 27, 2025

In this episode of Disseminate: The Computer Science Research Podcast, host Jack Waudby sits down with Xiangyao Yu (UW–Madison), one of the leading voices shaping the next generation of cloud-native databases. We dive deep into disaggregation — the architectural shift transforming how modern data systems are built. Xiangyao breaks down:

- Why traditional shared-nothing databases struggle in cloud environments
- How separating compute and storage unlocks elasticity, scalability, and cost efficiency
- The evolution of disaggregated systems, from Aurora and Snowflake through to advanced pushdown processing and new modular services
- His team's research on reinventing core protocols like 2-phase commit for cloud-native environments
- Real-time analytics, HTAP challenges, and the Hermes architecture
- Where disaggregation goes next — indexing, query optimizers, materialized views, multi-cloud architectures, and more

Whether you're a database engineer, researcher, or a practitioner building scalable cloud systems, this episode gives a clear, accessible look into the architecture that's rapidly becoming the default for modern data platforms.

Links:
- Xiangyao Yu's Homepage
- Disaggregation: A New Architecture for Cloud Databases [VLDB'25]

Hosted on Acast. See acast.com/privacy for more information.
Transcript
Disseminate the Computer Science Research Podcast.
Hello and welcome to Disseminate the Computer Science Research Podcast.
As usual, I'm your host, Jack Waudby.
I'd like to start out with a little bit of an apology today.
I'm not in my usual setup, so if the audio is not fantastic, I can only apologize.
It does seem a bit echoy, but hopefully it comes across.
It's okay, but anyway, we'll crack on.
So, yeah, Disseminate, this is the podcast where we interview computer science researchers
about their latest work, we dig into the problems they've tackled and how they've solved
those problems and how those findings that they've got from that research can then be applied
in practice. The overall goal here is to sort of try and further narrow that gap between
research and practice and make computer science research more accessible. So yeah, if you're
an industry practitioner, researcher or student, then this is the podcast for you. You can listen to
us on Apple, Spotify and you can watch us on YouTube. So yeah, if you do enjoy the show,
please like, follow, subscribe and tell a friend about us.
So yeah, on to today's show.
And today we're going to be talking with Xiangyao Yu, who is an assistant professor
in the database group at the University of Wisconsin-Madison.
And, yeah, Xiangyao's research focuses on three areas, that is cloud-native databases,
new hardware for databases and core database techniques.
I want to be focusing probably primarily today on the first of those, and that's specifically
disaggregation, which is the separation of database components
into independently managed and scalable services.
So, yeah, Xiangyao was awarded the 2025 VLDB Early Career Research
Contribution Award for his work on this topic.
I'm really excited to have him on the podcast today.
Welcome to the show, Xiangyao.
Yeah, thank you.
Cool.
Thank you for a great introduction.
Oh, yeah, that's good.
Cool.
So let's get stuck in with some background then, sort of set the scene for the listener.
And for those who are maybe new to the topic of disaggregation,
can you explain to us what that means and how it differs from the traditional, classic database architecture?
Yeah, for sure.
So the classic database architecture, like, maybe the most famous one is shared-nothing.
And shared-nothing is like they have multiple servers connected by a network,
but within each server, they have computation, they have the storage, the logging, like, all the functionalities.
You kind of replicate that across different servers.
And that is your one cluster.
So disaggregation is like this different approach
where they put different database components into different clusters.
So now I have multiple clusters.
And different clusters focus on, maybe one cluster focuses on computation,
another cluster focuses on storage.
So in short, you can think of this as: the conventional architecture
is a tightly coupled single cluster,
but disaggregation is loosely coupled multiple clusters.
Nice, cool, yeah.
It's a great description of that,
the stuff where you decompose all these different database functions
into their own services.
Cool.
So, yeah, I guess what motivated you to sort of explore disaggregation as this sort of shifting
database architectures?
That's a great question.
So I guess it started when I was a postdoc in 2018 at MIT.
At that time, people didn't really use the word disaggregation yet.
So people call it different names.
But it was pretty clear to me that this architecture has
great potential, and it seemed to me, oh, this seems to be the future, the separating of the
functionalities. So I guess that was a vague thought and saying, okay, this is still not certain
but I want to explore this. I started with the first project like a computation pushdown
for analytical processing and went from there, expanded to many other projects.
That's cool. Yeah, you definitely had some great foresight there, because given where things are going,
you've definitely been kind of ahead of the curve there. So, cool. Yeah, let's talk
about the shift a little bit more then. So we spoke about the traditional architecture a second
ago, but what are the limitations of that architecture, that disaggregation, like directly
addresses? And I guess this is where we kind of need to start speaking about the cloud and the cloud
environments, and then how did that sort of make this architectural shift kind of inevitable, I guess?
So you're totally right. It has to be, it has to be, the discussion has to be about cloud,
because in the traditional environment, the traditional architecture was fine.
For on-premises environment, for example,
share nothing was a great architecture.
It's actually perfect.
But when we go to the cloud,
the environment is different.
It's not on-premises anymore.
And in particular,
cloud has this very salient feature
called on-demand scalability,
which is, if you ask for more computation resources,
you can get it immediately.
You can shrink computation resources
and you only need to pay for what you use.
And this kind of elasticity or on-demand
scalability did not exist
in the shared-nothing architecture, did not exist
in the on-premises environment.
Okay, so
it's more like, okay, when we move to the cloud,
we really want our database
to also have on-demand scalability,
like it's super cool, super cost-efficient.
But because Share Nothing was not designed
for this environment, it cannot fully
exploit such scalability.
Yeah, so
So basically, one thing is,
when we want to have on-demand scalability,
we truly want scalability on the compute side,
but not the storage side.
The compute demand can change drastically over time,
but the storage demand does not change very much over time.
And compute is stateless, it's easy to auto-scale.
Storage is fundamentally difficult to auto-scale.
In Share Nothing architecture,
they always couple compute and storage
into a single cluster.
So that would make it very difficult to auto-scale.
Like, if I auto-scale compute and storage together,
and storage is very difficult to scale, as we know.
So it becomes very difficult.
Therefore, when people say, okay,
we want to have on-demand scalability for compute,
So in the cloud, what should we do?
Let's separate compute and storage.
So that, okay, we know we don't need to auto-scale storage too much.
So now we can scale computation,
because they're separated into two services.
I think it's not a limitation of shared-nothing, I would say,
but it's more like it failed to exploit the new opportunity in the cloud.
So in order to embrace this new opportunity,
we need to go to this disaggregated architecture.
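To make that cost intuition concrete, here is a toy sketch (all prices and the demand curve are made-up numbers, not from the episode): a shared-nothing cluster must stay provisioned for peak compute, while a disaggregated one lets stateless compute follow demand hour by hour.

```python
# Toy cost model comparing shared-nothing (compute and storage scale
# together) with disaggregation (only compute follows demand).
# All prices and the demand curve are made-up illustrative numbers.

COMPUTE_COST = 1.0   # per compute unit per hour (hypothetical)
STORAGE_COST = 0.2   # per storage unit per hour (hypothetical)
STORAGE_UNITS = 10   # storage demand stays flat over time

# Hourly compute demand: mostly idle, with a short burst.
demand = [1, 1, 1, 8, 8, 1, 1, 1]

def shared_nothing_cost(demand):
    # Repartitioning data to resize the cluster is slow, so the cluster
    # stays provisioned for peak compute the whole time.
    peak = max(demand)
    return sum(peak * COMPUTE_COST + STORAGE_UNITS * STORAGE_COST
               for _ in demand)

def disaggregated_cost(demand):
    # Stateless compute follows demand hour by hour; storage is a
    # separate service with a flat cost.
    return sum(h * COMPUTE_COST + STORAGE_UNITS * STORAGE_COST
               for h in demand)

print(shared_nothing_cost(demand))   # 80.0
print(disaggregated_cost(demand))    # 38.0
```

The gap grows with how bursty the compute demand is, which is exactly the "pay for what you use" point made above.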
Yeah, nice.
I guess, yeah, like you say, it's nothing kind of against the sort of the design of shared-nothing,
because the rules of the game have sort of changed, right?
And it's made these new architectures possible.
Let's talk about the sort of evolution of disaggregation since, sort of,
maybe like 2018, when you sort of had this great foresight of seeing
this being sort of a trend.
And so things sort of initially started off with separating storage and compute, right?
And, like, systems like Snowflake and Aurora were sort of pioneers in that.
But things have kind of started to go beyond that.
So can you maybe take us on that journey from sort of day one (hey, we can split storage
and compute, which is great because storage has got different
characteristics to compute, and we can actually pull it apart,
we can scale things differently just for compute) to sort of like where we are today?
So I think, okay, I think a lot of cloud databases are separating compute and storage
and some databases are going further.
I think there is consensus on separating compute and storage.
And for the extra separation, the extra disaggregation, like, every system has a little bit
of a different exploration, right?
So this is a very active area right now.
At a super high level, it's basically like, well, for a subset of database functions,
we can disaggregate that into a separate cluster.
And there's so many database functionalities.
So you can disaggregate in many, many different ways.
So just to give you some example, for example, storage, okay, it can be further disaggregated
in Socrates, for example, from Microsoft,
they further disaggregated storage into a logging service,
a page cache, and a durable page store.
So we think about the logging service, for example.
Logging service, the performance is really critical
because it's on the critical path.
But the log size is usually much smaller than the data,
the page data.
And because you care about performance and it's very small,
you can't afford using some more advanced technology,
maybe more expensive storage to cut the latency there.
But you don't want to use that expensive storage for the page store.
Yeah, it's too expensive.
But by disaggregating these two,
now you can use the expensive storage for the logging,
and that will improve the overall performance.
So that's one example where, like,
okay, because different functionality has different performance requirements,
by disaggregating them,
And now you can customize the implementation.
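A minimal sketch of that idea (tier names, latencies, and prices are hypothetical, not Socrates' actual numbers): disaggregating storage into services lets each service pick its own medium, routing the small, latency-critical log to a fast, expensive tier and the bulky page data to a cheap one.

```python
# Sketch: disaggregated storage services each pick their own medium.
# Tier latencies and prices below are illustrative, not real numbers.

TIERS = {
    "fast_nvme": {"latency_us": 20,   "cost_per_gb": 1.00},
    "cheap_hdd": {"latency_us": 4000, "cost_per_gb": 0.02},
}

# Map each disaggregated storage service to a tier.
PLACEMENT = {
    "log_service": "fast_nvme",   # tiny, but on the commit critical path
    "page_store":  "cheap_hdd",   # large; latency hidden by caches above
}

def write_latency_us(service):
    return TIERS[PLACEMENT[service]]["latency_us"]

def monthly_cost(service, size_gb):
    return size_gb * TIERS[PLACEMENT[service]]["cost_per_gb"]

# The log is small, so the expensive tier is affordable...
print(monthly_cost("log_service", 5))      # 5.0
# ...while the bulky page store stays cheap...
print(monthly_cost("page_store", 10_000))  # 200.0
# ...and commits see the fast tier's latency.
print(write_latency_us("log_service"))     # 20
```

Coupled together in one box, both would have to share a single medium; separated, each gets the cost/latency trade-off that fits it.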
So some other examples are:
okay, we talked about storage, further disaggregation.
And you can also disaggregate execution, like,
oh, we don't have to do all the execution in the compute layer.
We can push some of the execution to,
well, what I call a push-down layer,
a layer closer to the storage.
It doesn't have to be within storage.
It can be within storage,
but it can be close to storage.
Some other systems, like
Snowflake, have this caching for intermediate data.
They just say intermediate data has to be flushed from a compute node because it's too big.
But you don't have to write it back to the storage layer.
You can write it back to this intermediate data caching layer that has lower costs than the storage layer like S3.
So some other examples like metadata layer, query optimization as a service, or memory disaggregation.
So a lot of exploration, a lot of ideas floating around here.
Yeah, it's great. There's a lot of possibilities that. It's kind of almost, you kind of feel like a kid in the sweet shop, right?
There's so many different database functions. Like, ooh, can we disaggregate this bit? Can we disaggregate that?
But I guess there is a, there is probably a tendency to want to do that. But I guess there is a limit to how far we can take that, right?
So, because if we start jamming the network in between everything, eventually we're probably going to sort of lose all the benefits of the disaggregation, right?
And I guess possibly in some scenarios, I think you actually reference this in your paper,
there are some situations where, if you go too far with this,
the traditional shared-nothing architecture can actually be better.
Yeah, so that's actually a potential pitfall of disaggregation.
It's not the case that disaggregation is always good.
So, for example, you don't want to disaggregate everything.
I think it's like what the microservices are doing.
Everything is a separate service.
Every application, you have hundreds of services.
You probably don't want so many services for databases.
I mean, maybe, but I think that's probably pushing it too far.
So disaggregation is a trade-off, actually, between performance
and maybe separation of concerns.
So the more you disaggregate, the performance tends to suffer.
So if you want the best performance, you know what?
Go share nothing.
That will give you the best performance.
So people want disaggregation not for performance, clearly.
It's for elasticity, auto-scaling, like this cost-efficiency,
on-demand scalability, maybe separation of concerns.
But you sacrifice performance in order to get these features.
So when designing a system, I guess,
you should be very careful.
If this elasticity or disaggregation is a
property that's really important for you, go for it. But be aware you're giving up some
performance. So you probably can get some optimizations to get some performance back. But it's
probably very hard to get all the way back to shared-nothing. But if, for a particular use
case, elasticity is not that important, well, maybe you don't want to disaggregate aggressively.
It's really interesting, that, because, I mean, there's these two sort of things sort of
bouncing around my mind. And I know sort of when I was very much in research,
I was at the time maybe very focused on
performance, right, because you want to make the throughput higher and latency lower, right?
And that's how you get your paper accepted, sort of thing. But when you start to set that
against the actual business concerns, and the people who actually run these
systems in production, the things they care about, it's not just performance, right?
There's many dimensions to this. And disaggregation can sort of satisfy
those goals and those desires for those customers. And I think it's interesting
as well because you often see people say, oh yeah,
disaggregation is the future, and that means it's going to replace everything,
and in 10 years' time that's what we're all going to be doing, which,
as you say, is not true, right? Like, both types of systems can coexist, and it all
depends on what the customer wants: if you want performance, go this
way; if you care about elasticity, you've got this. So it actually
increases the offerings for customers, I guess, and what's available.
And that's, yeah,
really cool.
brilliant so yeah let's talk about
so obviously you and your team
at the University of Wisconsin-Madison
in the database group there you've been exploring
this design space and seeing what's possible
within this paradigm so let's
let's switch our focus and talk about
some of these projects that you've been working on
so you classify in your paper
you classify them into sort of three categories
so yeah tell us about the stuff you've been doing
to sort of reinvent or reimagine
some core protocols
in a disaggregation, in a disaggregated world.
Yeah.
I kind of classify the work we have been doing
into three categories, but this is by no means exhaustive.
I think other people have different dimensions
because there's huge design space,
a lot of opportunities.
So the three categories we did in our lab
are the core protocols,
like fundamental core database protocols,
and the query engine optimizations.
and also some new capabilities that you can enable using disaggregation,
like capabilities that didn't exist in traditional Shared Nothing architecture, for example.
So I would just go through one by one.
So for the core protocols, like, one work we did was to revisit two-phase commit.
Like, two-phase commit, this classic protocol, is very well designed.
And I think originally it assumed shared-nothing.
So there was one problem with 2PC.
If you're familiar with 2PC, you must know this.
There is a blocking problem where in certain conditions 2PC protocols can get stuck.
And there's no progress.
Well, at least for certain data pieces, they happen to be locked indefinitely, cannot be released
because the state of a particular transaction cannot be determined.
And fundamentally, I will not talk about the protocol overall,
But fundamentally, you have this blocking
because some node has failed
and the state on that node is not accessible
because when a node fails,
the compute and storage fail together.
That's the fundamental reason.
Other nodes cannot see this failed node's disk,
so we cannot make forward progress.
But this is no longer true in the disaggregated architecture,
because the storage is separate from compute.
And you have to assume your storage doesn't fail,
because if your storage fails,
you lose everything in your database.
And then the concern is not just about blocking anymore.
You lose the whole database.
So if your storage doesn't fail,
even if your compute fails,
other nodes, other compute nodes
can still access all the state in storage.
So it can never be the case where a node fails,
a compute node fails,
and suddenly you lose access to your storage state
and you cannot make forward progress.
You can see it.
For the other compute nodes,
the log is just sitting right there, accessible.
By just leveraging that, you can
address this blocking issue.
Okay, the details are a little bit complicated,
so I will not go through that.
But that's the insight.
It's an architectural change that changes the fundamental
assumptions and allows you to design protocols differently.
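A hedged sketch of that insight (an illustration only, not the actual protocol from the paper): because the commit decision is durably recorded in shared storage that survives compute failures, any surviving compute node can read it and resolve an in-doubt transaction instead of blocking.

```python
# Sketch: in-doubt transaction resolution when 2PC state lives in
# shared, disaggregated storage. Illustrative only; not the actual
# protocol from the paper.

# Shared storage survives compute-node failures by assumption.
shared_storage = {}

def coordinator_decide(txn_id, decision):
    # The coordinator durably records its decision in shared storage
    # before telling participants (as in classic 2PC decision logging).
    shared_storage[txn_id] = decision

def resolve_in_doubt(txn_id):
    # In shared-nothing 2PC, if the coordinator's node is down, its log
    # is unreachable and a prepared participant must block. With
    # disaggregation, any compute node can read the persisted decision:
    decision = shared_storage.get(txn_id)
    if decision is not None:
        return decision   # commit or abort, no blocking
    return "abort"        # no decision was ever persisted: safe to abort

coordinator_decide("t1", "commit")
# ... imagine the coordinator's compute node crashes here ...
print(resolve_in_doubt("t1"))   # commit, recovered from shared storage
print(resolve_in_doubt("t2"))   # abort, never decided
```

The "no record means abort" branch mirrors the presumed-abort convention; the key change is that the decision record is never trapped on a failed node's local disk.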
We also did something for the control plane.
I probably talk about that briefly.
The disaggregation work today,
a lot of that mainly focuses on the data plane.
But the control plane, think of ZooKeeper, is mostly still a centralized service.
That is not disaggregated.
So I think the work here is similar to, like, oh, let's disaggregate ZooKeeper so that it can also auto-scale, and it can also overlay on top of your database cluster.
So you don't need to have a separate cluster.
Yeah.
So the second category of work we did is for query engines.
and it's called push-down processing.
Nowadays, I think push-down is adopted in many cloud databases,
cloud production databases already.
So it's like, oh, okay, traditionally, in shared-nothing,
every node reads from its local disk, does some local computation,
and then does some data exchange.
But in disaggregation, you cannot read your local disk;
the data is sitting in the remote storage.
So every time you read from a remote storage,
you have to transfer data over the network.
So it's actually much more data transfer over the network
compared to a share-nothing architecture.
But that's definitely overhead.
So one way to mitigate that is you actually push these,
push certain query processing down to the storage layer
or close to the storage layer.
So that means even if you only have four servers in
your compute cluster, when you push down operations like selection or aggregation,
you can leverage a serverless pushdown layer,
which may be able to use hundreds of servers to process something in parallel
and reduce the data returned to the compute layer.
So the data received by the compute layer is significantly reduced,
so that the performance can be improved.
So there are a lot of details about pushdown.
We actually have a series of papers,
where you can push down simple operators.
You can also push down a little bit advanced operators.
Like shuffle, even shuffle can be partially pushed down.
Like Bitmap operators can be pushed down.
And also, what about caching?
How do you hybrid push down in caching?
Because intuitively, if you push down,
it's very difficult to cache the result of pushdown computation,
because that can use an arbitrary predicate.
How can you leverage that?
So there are some questions like how do you do hybrid push down
and caching to get the benefit of both worlds?
And also, what if the pushdown layer is saturated?
So you want to push down something,
but other people have already pushed down a lot of computation.
So if you push down further,
it would actually be slower than not pushing down.
Yeah, because the pushdown layer's computational power
is maybe not that powerful.
So what do you do in that case?
Like, our solution is to push back.
So you push down, but the pushdown layer says,
I don't have CPU power.
Sorry.
So it just returns that to the compute layer.
And the compute layer says, okay, now I have to load the original data from storage.
So you can do some of those designs to get overall better performance.
Yeah.
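The pushdown-with-push-back flow can be sketched like this (an illustrative toy, not the exact design from the papers): the near-storage layer either evaluates a predicate itself, or, when saturated, pushes the work back and the compute layer scans the raw data instead.

```python
# Sketch of pushdown with push-back (illustrative toy, not the papers'
# exact design). The near-storage layer filters rows when it has spare
# CPU; when saturated, it refuses and the compute layer falls back to
# scanning the full table itself.

def storage_layer_scan():
    # Raw table rows as stored; in reality this crosses the network.
    return [1, 5, 10, 15, 20]

def pushdown_filter(predicate, busy):
    """Near-storage layer: evaluate the predicate and return only
    matching rows, or signal push-back when there is no spare CPU."""
    if busy:
        return None   # push-back: "sorry, no CPU"
    return [r for r in storage_layer_scan() if predicate(r)]

def run_query(predicate, layer_busy):
    rows = pushdown_filter(predicate, layer_busy)
    if rows is None:
        # Fall back: transfer the whole table and filter on compute.
        rows = [r for r in storage_layer_scan() if predicate(r)]
    return rows

print(run_query(lambda r: r > 8, layer_busy=False))  # [10, 15, 20]
print(run_query(lambda r: r > 8, layer_busy=True))   # [10, 15, 20]
```

Both paths return the same answer; the difference is how much data crosses the network and where the CPU cycles are spent, which is the trade-off being described.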
So the third category of work is to enable new capabilities using disaggregation.
And the project here is called Hermes.
We want to achieve real-time analytics.
And a lot of people have been working on this topic.
And one solution is to use a hybrid transactional/analytical database, like HTAP.
But that requires migrating to a new database.
Like, I have been using Postgres, and maybe an analytical database,
I don't know, Presto, happily.
Now, because I want real-time analytics, you ask me to migrate to a new database.
That can be a little painful.
Yeah, people don't like doing that, right?
They like to stay on those databases, if it's possible, yeah.
Exactly.
And if you migrate your database, you may lose features.
And maybe what you could do before, you cannot do now.
So there are more headaches.
And we were thinking, well, can we leverage disaggregation?
We just introduce a new module that can help you achieve real-time analytics.
And basically it's a layer sitting, okay, both the AP and TP databases, we assume, are disaggregated.
So the storage is remote, and we insert a new layer in between.
Okay, so we capture the transactional log, and we take that log tail and merge that with the analytical read, right, so that we can achieve real-time analytics using existing engines, and don't need to modify these engines, and also we can get pretty good freshness.
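The log-tail merge can be sketched like this (a toy illustration of the idea as described, not the actual Hermes implementation): analytical reads overlay the not-yet-checkpointed transactional log tail on the base snapshot in remote storage, so they see fresh data without modifying either engine.

```python
# Sketch of merging a transactional log tail with base storage for
# fresh analytical reads. Toy illustration of the idea as described
# in the episode, not the actual Hermes implementation.

# Base data already checkpointed to remote storage (e.g. S3).
base_store = {"alice": 100, "bob": 50}

# Log tail: committed transactional writes not yet merged into base.
log_tail = [("bob", 70), ("carol", 25)]

def analytical_read(key):
    """Serve a read by overlaying the log tail on the base snapshot.
    The newest log entry for a key wins over the checkpointed value."""
    value = base_store.get(key)
    for k, v in log_tail:   # replay the tail in commit order
        if k == key:
            value = v
    return value

print(analytical_read("alice"))   # 100 (only in base storage)
print(analytical_read("bob"))     # 70  (fresh value from the log tail)
print(analytical_read("carol"))   # 25  (not yet checkpointed at all)
```

The freshness comes from the overlay: the analytical engine never waits for a checkpoint cycle, and neither engine's internals change.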
So that's, we call it, off-the-shelf real-time analytics, because you don't have to
switch your database.
Yeah, like, I mean, that solves that problem of usability, right? And, like you say,
the operational concerns of having to migrate data around between systems, which people
don't like doing. So, yeah, actually, when I was looking through the sort
of projects you've been working on, obviously I really liked the 2PC one, because, I mean,
I think 2PC is my favorite algorithm, because, I mean, obviously it's got its
problems, but it's a really nice, easy-to-understand algorithm, and that was a really
nice sort of, yeah, take on it, I guess, sort of modernising it anyway.
So that was definitely one of my favorites, but I really liked the Hermes stuff
as well. Because a lot of people have tried to do HTAP,
right, over the years, and nothing's ever really stuck, right?
No one's ever really solved it, in my opinion, from the things I've come across,
in a brilliant, nice way.
And this, I feel, is definitely getting close to sort of, like,
the best we can do with HTAP, so I really did like that as well.
Cool. Yeah, I think that has potential, yeah, definitely.
Sure. I mean, I'm kind of going off-piste here a little bit, but do you have any sort of plans to take these research prototypes
into a more production environment, or sort of get them out in the wild and get people using them?
Or is the plan very much to keep these things research projects for the time being?
That's a great question, and that's the question I've asked myself many times in the past several
years. I really
hope to get something out there
that people can use. That's kind of, I guess,
for systems researchers,
that's always the dream.
You want people to use your system.
I think I tried that,
actually. But there is a challenge
for cloud databases.
It's about the cloud, and
it has to be big scale.
So anything
about cloud database, you want it to be
practical, it's probably
going to be something pretty big.
Especially since we're talking about infrastructure.
So at this point, it seems to me maybe, okay,
it's hard for us to build something ourselves,
but we can put the idea out there.
We can prove it, try to prove it using some prototype.
But maybe it's easier if, okay,
a big company or someone else, a startup, I don't know,
they want to pick up this idea,
and they want to incorporate some part of it
or the entire idea into their system.
And that's probably an easier path.
Yeah, because the sheer scale of these systems is a little challenging to do that in the lab.
Yeah, that's very true.
You sort of need a whole engineering department, right, which is not something you kind of have on hand when you're working in academia, right?
It's cool.
Yeah, so I guess kind of following on from the thing we just chatted about there, speaking about practical implications and things: if I was a company now sort of
designing a system from scratch or maybe taking an existing system and trying to migrate that
to a more cloud-native system, what would your advice be? And what would you say are
the most common mistakes to try and avoid for people adopting disaggregation?
I think there have been several quite successful formulas out there, like commercial systems
doing disaggregation. Many of these systems do disaggregation in similar ways. I think
one potential mistake, I kind of mentioned this earlier, is over-disaggregation.
Before you are certain that a particular disaggregation would give you benefits, be a little
careful before you decide to disaggregate that particular component, because the more
you disaggregate, the more you tend to suffer performance issues.
So be careful with what you want to disaggregate.
Otherwise, I think, oh, if you just disaggregate compute and storage,
it seems pretty natural; there are a lot of successful stories.
But if you want to go further, just be careful:
disaggregating everything may not be a good idea.
Yeah, that makes sense.
The separation of storage and compute seems a pretty tested formula now, right?
That's definitely, yeah, pretty safe, I think.
And, yeah, that's cool.
So, yeah, let's switch our focus and talk about the future then.
So for researchers and system implementers, obviously we've got a smorgasbord
of possibilities and there's so many things we can disaggregate, but, yeah, what do you
think are the most exciting open problems in this space?
I think, as I said, there are just a lot of opportunities.
When you think about it, how often do you see a new architecture for a
distributed database?
That doesn't happen
very often,
right?
Maybe every decade,
every few decades,
you see such an opportunity.
And now it's like,
this is a great opportunity.
And also there's a lot of
richness in this design.
Like,
you can disaggregate a lot of things.
And, of course,
disaggregating everything is not a good idea.
So you can try,
okay,
what makes sense to be disaggregated
in what context,
maybe for certain workloads?
And if you just,
okay,
if you disaggregate a lot,
and then performance may suffer,
but can you come up with some mechanisms
or algorithms so that
you can get most of the performance back,
like making the network not a key bottleneck, for example.
So we have been thinking about some modules,
for example, indexing.
We haven't seen a lot of work
deserogating indexing.
There is con-conspirate control,
the core optimizer.
You can disaggregate the car optimizer.
You can disaggregate material.
visualize views. It's like caching, but more advanced caching.
And because a disaggregated service can be shared by many database instances, you can get some
common knowledge, like with the query optimizer. Different databases may have different
knowledge in their query optimizers; if you put it together, it can make a better decision, right?
So there are a lot of opportunities. Like, we are exploring a subset of these things as well
in our lab, but I just feel there's just a huge space with a lot of opportunity.
Another thing I think is pretty interesting: we have only talked about a single cloud so far.
Everything is in a single cloud.
That's great.
But there is also a need to go to multiple clouds.
And then not necessarily multiple public clouds.
It can be, I have a private cloud with some of my maybe privacy sensitive data.
And also I store a lot of data in S3.
And now I want to run a query
between these two sets of data. How can I do that?
So there's an even bigger design space.
It's not a simple disaggregation architecture.
It's not like storage is one layer, compute is another layer.
It's like maybe a multi-disaggregated architecture.
Locally you maybe have disaggregation,
and in the cloud it's also disaggregated.
And now you want to go run a query.
It's like, where do I run which part of the query?
And how much computation do I need?
I can even allocate that elastically.
So I think that's also a fascinating problem space.
It's an even bigger design space.
And I think there's definitely a need for such a system.
And a lot of optimization opportunities.
Like, if we do the query in a certain way,
it's probably going to be 10 times more cost-efficient than the other way,
than a naive way, for example.
Cool.
Yeah.
Another thing, I mean, it's amazing we've kind of
gone through a podcast so far without mentioning,
you guessed it, AI, because it's everywhere, right?
And it's hard to kind of get away from it.
You can't avoid it these days; yeah, it's everywhere.
But the way I want to kind of approach this question is: there is a hell of a lot of investment in AI, which isn't necessarily directly related to today's space.
Obviously, databases sort of get caught up in it. The way I'd frame this question is:
yeah, people aren't directly talking about databases,
but, like, databases are still part of the ecosystem,
is probably how I would phrase it.
But because there's such a huge investment
in it, and things are being developed
all the time, what do you think are the implications of that,
or the sort of the side effects or the knock on effects
that it might have on the cloud environment
that might change what's possible
with disaggregation and databases
or what directly might impact databases
in the sense of like the functionality that we need to add
and how that might play into disaggregation as well.
Sorry, if that's a bit of a long-winded question.
And that's a good question.
I need to be careful here because I don't know too much about the AI part of this space.
I think, okay, the way I think about it is: there is this mechanism part of the system
and there is this policy part of the system.
AI is really great at the policy part.
Like, for example, we say disaggregate.
Okay, great.
That's separated.
That's where we make the decision.
But then you say, oh, you can auto scale.
Awesome.
But how many servers you want at a given time?
And how do you want to scale it?
When do you want to scale it?
Like, AI is great for that.
Right.
Okay.
Another way to look at this is, okay, we enable auto-scaling.
Perfect.
But how do you maximize the benefit from auto-scaling?
I think you cannot have a human making that decision
a lot of the time.
I think AI can play a very critical role there,
because that's its domain.
So I think that's part of it.
Like, you have a lot of parameters,
and AI can help you tune those parameters.
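The mechanism-versus-policy split being described can be sketched in a few lines. This is a hypothetical illustration, not any real system's API: `scale_to` stands in for the scaling mechanism that a disaggregated architecture provides, and `threshold_policy` is the hand-written policy that an AI-driven controller could replace without touching the mechanism.

```python
# Hypothetical sketch of the mechanism/policy split.
# The *mechanism* (resize the compute tier) is fixed by the architecture;
# the *policy* (how many nodes, and when) is the part a learned model
# could replace. All names here are illustrative.

def scale_to(n_nodes: int) -> None:
    """Mechanism: disaggregated compute lets us resize the cluster."""
    print(f"resizing compute tier to {n_nodes} node(s)")

def threshold_policy(cpu_util: float, current: int,
                     low: float = 0.3, high: float = 0.8) -> int:
    """A simple hand-written policy: scale out when busy, in when idle.
    An AI-driven policy would replace this function, not scale_to()."""
    if cpu_util > high:
        return current + 1
    if cpu_util < low and current > 1:
        return current - 1
    return current

current = 2
for util in [0.9, 0.85, 0.5, 0.2]:  # observed CPU utilization over time
    target = threshold_policy(util, current)
    if target != current:
        scale_to(target)
        current = target
```

The point of the split is that swapping the `threshold_policy` function for a learned one changes nothing about the scaling mechanism underneath.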
And I think databases will be
a very important component
in overall systems
that include AI and other components.
So I think there's got to be a lot of interesting trends.
I don't know exactly what would happen, though.
I think maybe in the future,
most of the queries will be generated by AI.
But I don't know what that means,
because maybe AI will access the database one way today,
but then, like, two days later,
completely change the model.
I mean, that can happen.
So I think it's a little hard to predict exactly what will happen, but exciting things will happen.
That is for sure.
Yeah, that's very true.
We can agree on that for sure.
Cool.
Yeah, kind of keeping on the theme of the future, if we were to have this conversation again in 10 years' time, what impact do you think disaggregation will have had on databases and the database community?
I think
it will probably have the impact
that shared-nothing had,
like, maybe 30 years ago.
I hope.
Disaggregation
is probably already happening;
all cloud databases are kind of
adopting disaggregation today.
And I think it will go deeper
and disaggregate even more,
and systems will become
more composable,
more modular.
Modularity and disaggregation
are kind of similar concepts.
Disaggregation is at the physical level,
like physically separated clusters,
and composability is more about modules,
which can be logical or physical,
like software modules.
So I think these two concepts
will go hand in hand,
and systems will become more composable as well.
So now I want to build a database,
but I don't have to build everything from scratch.
I can reuse this query optimizer,
which is probably a service provided by someone else, and I can use that storage, and I can just build my engine.
Even building the engine, I can use something like Arrow for the internal data structure.
So I think it will become much easier to customize and build database systems.
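The pick-and-mix idea being described can be sketched with a toy example. All the interfaces and class names here are invented for illustration; a real composable system would wire together an external optimizer service, a storage service, and something like Arrow for the in-memory format.

```python
# Hypothetical sketch of a composable database: the engine is the only
# part we write ourselves; the optimizer and storage are swappable
# modules behind minimal interfaces. Everything here is illustrative.
from typing import Protocol

class Optimizer(Protocol):
    def plan(self, query: str) -> str: ...

class Storage(Protocol):
    def scan(self, table: str) -> list[dict]: ...

class NaiveOptimizer:
    """Stand-in for an optimizer service provided by someone else."""
    def plan(self, query: str) -> str:
        return f"scan-plan({query})"

class InMemoryStorage:
    """Stand-in for a disaggregated storage service."""
    def __init__(self, tables: dict[str, list[dict]]):
        self.tables = tables
    def scan(self, table: str) -> list[dict]:
        return self.tables[table]

class Engine:
    """Our engine just composes the modules it is handed."""
    def __init__(self, optimizer: Optimizer, storage: Storage):
        self.optimizer, self.storage = optimizer, storage
    def execute(self, table: str) -> int:
        plan = self.optimizer.plan(f"SELECT count(*) FROM {table}")
        rows = self.storage.scan(table)  # in reality the plan drives this
        return len(rows)

db = Engine(NaiveOptimizer(), InMemoryStorage({"t": [{"id": 1}, {"id": 2}]}))
print(db.execute("t"))  # prints 2
```

Swapping in a different `Optimizer` or `Storage` implementation requires no change to `Engine`, which is the composability point being made.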
And I think maybe the biggest impact is that this becomes common sense.
Hopefully one day it becomes common sense,
and people will not even ask, oh, are you using disaggregation or not?
Like, oh, isn't that the only way to build large-scale cloud database systems?
If it becomes common sense, people won't even realize it's a distinct concept.
And I think that's probably the biggest success.
Yeah, definitely.
It was really interesting there what you were saying about having this composability,
where you've got all these services.
You can almost pick and mix and build your own database.
And you can even imagine that
happening almost on demand, basically.
A query comes in and there's some sort of optimizer.
Yeah, yeah.
And it builds your database on the fly and uses the logging service from there,
the query optimizer from there, because of the type of query it wants to answer.
So, yeah, the possibilities really are sort of endless.
I'm sure the next 10 years are going to be very interesting as well.
So, yeah, cool.
Let's do a bit of a reflection now, because I know you've worked with
some very, very influential
people in databases over your career so far.
There are some big names in there, like Michael Stonebraker, for example.
So how would you say those relationships have shaped your approach to, first, research,
and then, second, system building?
I know those two things aren't mutually exclusive.
But yeah, how have they shaped your approach to those?
Yeah.
So, yeah, I have worked with a lot of great people.
And I learned a lot from them, to the extent that sometimes I don't remember
which part I learned from which individual.
Just a lot of the things they told me
have shaped my research taste.
I think maybe there are two things I want to say.
One is I think one thing I learned earlier in my career
was to try to do research that is demand-driven.
At some point in my career, I was like,
okay, I want to work on really cool ideas.
So it's more like intellectually interesting
But I didn't really think about, oh, but who needs it?
Yeah.
Then I tried to change the mindset,
based on what they told me,
that maybe it should be demand-driven.
Let's understand what people need.
I mean, sometimes they understand
they are needing this,
and they will tell you.
Sometimes they don't know they need this,
but they actually need it.
So you first understand
what people actually need.
And you probably need to observe
trends in industry, like what's going on.
Talk to engineers, like, what are they working on?
And they will tell you engineering challenges,
which may not be research challenges.
So you need to extract the research question
out of these engineering challenges.
So there's probably actually a research challenge there;
they may not even realize it.
And for example,
disaggregation is kind of an idea that came about that way.
So you get a research question,
you abstract it and say,
okay, this is the research question,
you define it, and then you develop solutions for the research question, and then the solution
will also hopefully be useful for the real problems that people are facing in industry,
because it's based on the demand there. It's demand-driven. Another approach, which is kind
of similar to this one, is that I try to follow the trend. Now, the trend may not be demand.
It may be, okay, you can say AI is a trend because a lot of people are doing research there.
But what I mean by the trend is not what the papers are about, what the researchers are working on.
It's more about what's happening in the industry.
And for example, back then, I clearly thought, okay, cloud is the future.
And cloud database is going to be the future.
And if that happens, we can innovate one way or another.
There's just a lot of problems waiting for you to work on.
But you can believe this is the trend.
You jump on it, and then 10 years later,
it would be the infinite problems for you to work on.
And the potential impact also be big.
Some other trends I'm following right now is like,
and we're not talking about this today,
but like GPU databases, there's another trend we are betting on
because GPU seems to be, have a, like,
a great trajectory going forward so yeah so i think those are maybe two um general approaches to
research yeah and i saw i've been been demand driven which i think yeah exactly like there's no
point in in creating anything if if someone's not going to use it right you want to solve a
solve someone's problem right that's the way to sort of be successful in anything right in business
and whatever if you can solve someone's problem and it also satisfies some demand then then great right
you're on to a winner.
And same with the trend driven, right?
Like, yeah, I guess it is sort of sometimes hard to get the signal from the noise,
especially with stuff like AI.
But if you can kind of peel back that noise and sort of look at the structural trends
and sort of jump on one of those, then, yeah, again,
it's a good way to sort of set yourself up for success, right, longer time.
So, yeah, definitely agree with those insights.
Great.
So we've kind of arrived at the end.
of the podcast now
and we're on to the
last word.
So two things here.
First of all,
what would you like
practitioners to have taken
away from this podcast today?
And secondly,
what would you like researchers
to have taken away
from our chat today?
I guess I probably have
the same message
for both groups.
I think
the disaggregation architecture
is the new architecture for cloud databases. Finally, we have a fundamentally new architecture,
and if it satisfies your needs, like scalability and elasticity, this is the way to go. And it opens
a vast design space. This is maybe more for researchers: there is this vast design space, but it's
largely unexplored. We are still in the early stage. Like, look at how many papers are
about disaggregated databases versus shared-nothing databases.
The design is way more complex,
but there are so many fewer papers on it.
We just don't fully understand it.
There's a lot of research that can be done
and a lot of potential impact that can be made on this topic.
Yeah, fantastic.
That's a great message to end on.
So yeah, thank you for sharing.
Thank you so much.
It's been a lovely chat today.
And also, before we do finish,
we should give a shout out to,
we've had two of your students on the podcast before.
So we had Yifi, who came on, and it was episode number...
if I'm remembering this correctly, I didn't write this down.
It was, I think, two years ago.
Yeah, two years ago.
Episode 48.
And yeah, if you're interested in how you can optimize join performance, go check
that episode out.
And we also had Abigail, who came on recently.
She was in season two of our DuckDB in Research series,
episode two in season two, I should say.
And that was all about DBMS extensibility.
She called it anarchy in databases.
So yeah, it's a great episode; do go check that out as well.
Yeah, thank you so much.
It's been a lovely talk today.
And where can we find you, on social media or anything like that?
Where do you tweet, or, it's not tweeting anymore, right?
It's just X-ing, I guess, or any of those.
Yeah, I'm not super active on social media, but I have my personal website.
Just search my name, I think.
Yeah.
And I do use X a little bit, but not too much.
Cool. Well, we'll end things there. Thank you so much.
It's been a really, really insightful chat today.
And yeah, great. And we'll see you all next time for some more awesome computer science research.
