Disseminate: The Computer Science Research Podcast - Rui Liu | Towards Resource-adaptive Query Execution in Cloud Native Databases | #49
Episode Date: April 1, 2024

In this episode, we talk to Rui Liu and explore the transformative potential of Ratchet, a groundbreaking resource-adaptive query execution framework. We delve into the challenges posed by ephemeral resources in modern cloud environments and the innovative solutions offered by Ratchet. Rui guides us through the intricacies of Ratchet's design, highlighting its ability to enable adaptive query suspension and resumption, sophisticated resource arbitration for diverse workloads, and a fine-grained pricing model to navigate fluctuating resource availability. Join us as we uncover the future of cloud-native databases and workloads, and discover how Ratchet is poised to revolutionize the way we harness the power of dynamic cloud resources.

Links: CIDR'24 Paper, Rui's LinkedIn, Rui's Twitter/X, Rui's Homepage. You can find links to all Rui's work from his Google Scholar profile.

Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast.
As usual, I'm your host, Jack Wardby.
Today, it's my pleasure to say that I'm joined by Rui Liu,
who will be telling us everything we need to know about his paper
"Towards Resource-adaptive Query Execution in Cloud-Native Databases."
Rui recently received his PhD in computer science from the University of Chicago.
Before we do start, a few quick announcements. Remember, if you do enjoy the show, please consider supporting us through Buy Me A
Coffee. And we do have a listener survey out at the moment. So please go and check that out. Anyway,
on to today's show. Welcome, Rui. Hi, Jack. Thank you so much for having me. I'm really excited to
be here. Fantastic. Let's get started then. So I like to start off with my
guests and getting them to tell us their story. So maybe you can tell us more about yourself and
how you became interested in research and databases. Yeah, sure. So for myself, like I just
mentioned, I recently received my PhD degree in computer science from the University of Chicago, where I was co-advised by Professor Aaron Elmore and Mike Franklin. I also worked very closely with Professor Sanjay Krishnan.
So my PhD research is about building
resource-efficient data-intensive systems.
So during my PhD, I was also a research intern
at the Gray Systems Lab at Microsoft, a data science intern on the AI engineering team at DocuSign, and a visiting student at Argonne National Lab.
But as for my research path towards data management research:
if you take a peek at my resume or my CV,
you probably will see a very diverse
or zigzag background.
So unlike many of my peers,
I didn't get a chance to do some basic research
when I was an undergraduate student.
My research journey started with my undergraduate final-year project,
which is about mobile computing.
The professor who advised me on that project
taught me some very basic research methodologies.
And then after that, when I graduated, a lot of my classmates became software engineers, and I felt like I probably wanted to do some research rather than, you know, making my contribution or impact in industry.
So then I decided to pursue a master's degree.
And then I finally found that the Hong Kong Polytechnic University would be the one to accept me as an MPhil student. I think that's a research-oriented and fully funded master's scholarship; you can consider it a mini version of a PhD. We still took several classes, but we spent a lot of time doing research. And we got a salary every month, and then we needed to defend our thesis at the end so that we could get the degree.
But still, at that time, my research focused mainly on mobile computing, though I did begin to explore some areas like ubiquitous computing and cyber-physical systems.
Once I finished that,
I also got a chance to join a systems security group at the Chinese University of Hong Kong, because the host there was my co-author, for six months basically. That was a research assistantship.
During that time, I still got a chance
to explore some privacy and security projects or whatever.
But so far, you can tell I do have some experience on different areas.
They're pretty diverse.
But one thing that has not changed is that I have had to manage a lot of data, including user information, sensing data from sensors or mobile devices, and some location or timestamp data. There's a lot of it, so you need to be efficient in managing it; you need to compute on it and get whatever you want.
Those challenges of data management fascinated me.
Yeah, probably.
That's the cool thing if we can, you know, explore some research ideas in that area.
And then I found this whole data management area. And almost at the same time,
I decided to go to US for a PhD degree.
So I said, why don't I just combine it together,
get a PhD in database area?
And that's pretty good, right?
But the one issue I was facing at that time is that it's pretty difficult to get accepted by a decent PhD program if you cannot show some solid research experience and skills.
Then somehow I managed to go to the National University of Singapore
to join their database research group,
working with Professor Beng Chin Ooi.
I think at that time, I did explore some database or data system research,
like blockchain, memory management, data cleaning.
There's a lot of projects in that group.
And I think I did a good job.
And eventually, he wrote me a recommendation letter so that I could get into UChicago.
From that time, I started my data management research path.
So I think that's my whole story.
I'm probably a little wordy, but yeah, that's a long path.
That's awesome.
Yeah, I mean, there's a few things in what you were saying there that sort of resonated with me as well.
So not taking a linear path to databases and a PhD in databases is something that sort of happened with me as well. I mean, I started out doing mathematics and economics as my undergraduate. I had this dream of being the next Wolf of Wall Street, and then slowly transitioned away into statistics, and you can kind of see how computers came in a little bit more with more computational statistics, and then I finally settled on a PhD in databases. But one thing
I will say that I think,
I don't know if this is true with you,
that it sounds like you've been through a lot of different areas
and got exposure to a lot of different areas of computer science
and various different topics.
Do you think that served you well for your PhD,
that exposure and having that broad understanding of the wider area
rather than being sort of laser focused on one specific topic
for such a long time
Yeah, I think so. Being exposed to different areas and then finally focusing on one gives me a stronger motivation to be here. Because I already knew what the other areas were like, and I finally chose this area because I wanted to, and I do use some previous experience from other areas in my research. Yeah, you kind of have that cross-pollination of ideas and stuff, and you never know when something you learned five years ago might all of a sudden come back to you, and you might think, oh, that's brilliant, I remember this and I can apply it to this. Yeah, exactly. Yeah, it's like that for me. Cool. So let's talk about the paper that you published at CIDR recently.
And it's called Towards Resource Adaptive Query Execution in Cloud Native Databases.
So there's a few things in that title there, some background we need to set up here
so the listener can kind of get on board with what we're going to be talking about for the next hour or so.
So let's kind of start things off with, tell us what a cloud-native database is.
Sure, yeah. I mean, there's a lot of definitions of cloud-native databases if you Google it.
But my definition is: cloud-native databases are designed to exploit the benefits of cloud computing environments. So unlike traditional databases that are often deployed on specific hardware
or within fixed environments,
the cloud-native databases are built from scratch
for cloud environments.
Some key features or advantages of them
are scalability, multi-tenancy, high availability, a pay-as-you-go pricing model, and automated database maintenance; you name it, there's a lot of different things. Some examples of cloud-native databases are Amazon Redshift, Google Spanner, and Microsoft's Cosmos DB. Of course, there are a lot of other products
in the world. But those databases are optimized for performance, availability, and scalability in the cloud, making them very powerful for a wide range of applications, like web or mobile applications, microservices, IoT systems, or even machine learning and training the current large language models.
Yeah, nice.
That's a really nice definition of cloud native databases.
I mean, yeah, you go on Google and you stick that in and there's like a thousand different definitions of it, right? And it's kind of hard to say, okay, well, this person's saying this thing, this person's saying that. Yeah, that's a really nice definition of cloud-native databases. So we're in this world then, where everything is sort of shifting towards using all the primitives that these cloud environments give us. What are some of the key factors that make us have to reconsider the architecture for the way we would develop a system in a cloud environment, I guess? What I'm hitting at here is: give us the elevator pitch for this paper, really. Yeah, sure. So I think, yeah, that's a very interesting question.
So one observation we have is, we argue that ephemeral cloud resources are becoming prevalent, you know, because their prices are really, really attractive compared with traditional on-premise or on-demand cloud resources. There was a report, I remember, saying the peak-time price of the regular cloud resources is like 200 times higher than that of the ephemeral cloud resources. So there's a huge difference here. But there are two unique traits of these ephemeral cloud resources. First, the ephemeral cloud resources are pretty dynamic in availability; the cloud resource provider may even take them back for some reason. So that means the resource usage can be terminated. Second, even though their prices are very attractive, usually the prices fluctuate over time, so their peak-time prices will be much higher than their off-peak-time prices. And the fluctuation time period is not days or weeks. It could be hours. So that's really, really dynamic. Based on these two observations, we feel like it's time to propose something new, or re-imagine what the cloud-native database looks like. So that's the overall picture. I can give you some concrete examples.
If you Google Amazon spot instances, Amazon provides some short-lived but really good deals on cloud resources for you. So if you just want to use, like, one hour, two hours, you can probably find a very good deal there. And there's also another cloud paradigm called zero-carbon cloud, proposed by a pretty famous professor, Andrew Chien, at the University of Chicago.
In this zero-carbon cloud paradigm, the entire data center is driven by renewable energy, like wind or solar; you know, there's no cost for that. It's pretty good, right? But the problem is, we cannot control wind, we cannot control water. So the resources, or the energy, will sometimes be terminated. So you can see, once you imagine those applications, or the big picture we have, you want to consider: what if the resources are not stable? How should we still use them to build our systems, or build our cloud-native databases?
So this paper shows our vision of the best practices for building and deploying these cloud-native databases on such ephemeral cloud resources, especially from the perspective of the cloud service provider.
Nice. Yeah, there's some interesting things there
that we were talking about,
the fluctuation over time of these spot prices
and that it can be in the magnitude of hours for it to change.
I mean, that's kind of crazy, right? And I also really like the sound of these zero-carbon data centers. Yeah, zero-carbon clouds, that's a fascinating thing. Anyway, cool, let's talk then about these primitives that you've proposed, which we need to be aware of when we're building a cloud-native database on such an environment.
identified in your vision. Yeah, so I think we
provide three primitives in this paper. One is
called query preemption. That describes the ability to preempt queries that are continuously consuming resources. Once we have very limited resources, and we know the resources can be terminated, it's very dynamic, and it's not reasonable to keep allocating resources to some long-running queries, so that other short-running queries have to wait until the long-running queries finish. That would significantly increase the latency of those queries, right? So for those cases, the important point we make is that we should allow queries to be, you know, suspended, adaptively paused, when there's a need, or a benefit, to doing so. I think that's primitive one, query preemption.
But that's for a single query, or once we have one query, right? The second primitive we have is called resource arbitration. Let's move to the scenario where we have, you know, multiple users: a multi-tenant environment or workload. Under the same assumptions about the ephemeral resources, you know, the fluctuation in availability and cost, how much resources a query needs is already answered by existing resource reservation or scheduling mechanisms and whatnot.
The question we want to ask is like,
is it worth allocating resources to a particular query or job?
This is because we have limited resources,
but we have a large amount of workloads.
How should we allocate the resources to that?
The thing that makes this worse, or more important, is that once we look at the progress curve of each query, it increases significantly, quickly, at the very beginning, probably because a lot of modern data processing jobs are, you know, iterative. We process data batch by batch, right? The first batch may already give you a very rough idea of your final result, but your final batch just pushes your final result from 90% accuracy to 91% accuracy. So you can imagine, combining everything together: we have limited resources, we have a large workload, and each job can get a pretty good result from the very first batches but will waste resources on the final batches. How should we allocate for that?
Those things, once we put everything together, give us this primitive two, the resource arbitration.
The third primitive, we call it cost tolerance. It's about the pricing model. Like I mentioned before, right now the state-of-the-art pricing model is pay-as-you-go: however much resources you use, that's how much money you pay, right? Our vision is, if the users allow the service provider to suspend their jobs, you know, for some reason, then I can give you more fine-grained pricing control, or more options you can choose from. Then some users may take it, because not everyone cares about low latency, high availability, things like that. So for those users who prioritize cost efficiency over speed, they may prefer more options. It's like the user saying: even if you suspend my job, I'm totally fine; I just want you to give me a good deal.
Yeah, awesome.
If I just repeat those back to you then.
So the first one is our query preemption.
So that's basically giving users of cloud environments the ability to adaptively pause and suspend their queries.
Then we move on to our resource arbitration.
And that is where we kind of want to ask the question,
is it worth allocating something to a job?
Because like you said, in the case of iterative queries,
there's this sort of diminishing return, where not all iterations are equal. So the earlier ones are more valuable, because like you said, going from 98 percent to 99... Yeah, sometimes you want to give resources to the more promising jobs, right? And sometimes you want to keep allocating resources to one job until this job pushes its progress really far ahead of the rest. So it really depends on your goal, your objective.
But yeah.
But it's all about flexibility, right?
That's the thing.
Having something more flexible, and then this kind of comes into the pricing model, right, and the cost tolerance: just exposing more options to users, that is better for the users in terms of cost, right? They maybe don't spend money when they don't need to. And it also, I guess, frees up resources for the cloud provider to then sell to somebody else, right? So basically everybody wins, right? That's sort of the goal. Exactly, a better price model. Yeah, all boats rise. So we've got our three primitives then, and to realize these primitives you've developed a framework called Ratchet. So tell us more about Ratchet. Give us the high-level overview
of how you put these primitives
into practice, shall we say.
Yeah, definitely.
By the way, you have a better summary than me.
I think that's...
Sorry, you just told...
I'm just repeating what you just told me,
so it's fine.
Yeah. So for Ratchet,
essentially, it's a novel or it's a it's a new resource
adaptive query execution framework so um to real like we said to realize these three uh primitives
so it's it's a framework so we have so we have a different way to to implement it right but in our
paper we propose three um component or or three pillars in this framework.
One is we design an adaptive query execution framework,
which enables the query suspension and redemption using different strategies.
The second component is a resources arbitration mechanism.
It's responsible for determining resource allocation for suspension
and
the
resumption
during
runtime.
The
third one
is like
a
cost
model
like
provide
users
more
fine
grand
set
of
the
results
and
price
options.
So I
think
overall
that's
the
framework
for
Ratchet.
Nice,
that's the
high level
overview.
So let's dig into these one by one; let's do a little bit of a deep dive. So kick us off with the first pillar, the adaptive query execution framework. You mentioned there are a few different strategies for how we go about suspending these jobs and things, so yeah, give us a bit of a rundown of how that works. Sure, definitely. So yeah, before we get to that
question, one thing I want to announce is that the paper describing this first primitive just got accepted by ICDE 2024. If any of the audience wants to look at more details, you can find it. It's probably not public right now, because we didn't put it on our website yet, but I believe it will be published very soon, and then you can find it. So, let's go back to that question. For primitive one, the adaptive query suspension and resumption framework: in that paper, we propose this framework, and it consists of different query suspension and resumption strategies. Those strategies are, you know, at different levels.
So, for example, the most naive one is called the redo-query strategy. Once your resources have been terminated, you don't do anything; your query just stops there, and once we want to resume it, we rerun it. That's the most naive one. Another strategy is what we call the operator-level suspension strategy. It actually originates from a previous work called Query Suspend and Resume; I think that's a work proposed by Badrish Chandramouli from Microsoft Research. That's a pretty big name here. And then another strategy
is what we call the pipeline-level suspension strategy. That's something we developed for morsel-driven, or pipeline-driven, query execution. Unlike pipeline-level suspension, the operator-level strategy suspends the query at the operator which has the lowest memory usage, because we don't want to persist too much data when we suspend a query. The pipeline-level strategy, for pipeline-driven query execution, splits the query into different pipelines. And then our method is that we can suspend the query once a pipeline is finished, and all the intermediate data of that pipeline will be persisted. Once we want to resume the query, we can rebuild the query plan, gather the persisted intermediate data of all the processed pipelines, and go from there.
Some other strategies proposed in this framework include what we call the data-batch-level suspension strategy. Here we don't care about the query execution part; instead, we split the input data into different batches, and we can suspend once one or multiple batches are finished. Then it's very easy to, you know, keep the progress and the intermediate data, right? And there's another strategy that we also developed.
We call it the process-level suspension strategy. Here we don't care about what happens within the database; we suspend the whole process where the database application is. So that means, if the database has multiple processes handling multiple queries, and we want to suspend one of them, we suspend that entire process and keep everything within the process on disk. Then, once we want to resume it, we resume the process first, reload everything within the process, and continue from there.
So you can see that in this framework we have very diverse, different strategies, but there are some trade-offs between them. You can imagine, once we want to redo a query, there's no cost here: we just let it stop and we do not persist anything, and once we want to resume, we rerun it. But the thing is, we lose all the progress; there's nothing kept. For the process-level suspension strategy, on the other hand, we keep everything: once we know the resources will be terminated, we suspend the query and keep everything. But the downside is that the intermediate data, or the state, could be really, really large, because we keep not only the database's data but the process data, almost everything. So I think the question is, how should we select the most appropriate one, once we have different queries, different user requirements, you know, different environments? That's something we want to answer. That's primitive one.
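To make that trade-off concrete, here is a toy, runnable Python sketch of the data-batch-level strategy Rui describes: process input batch by batch, persist progress after every batch, and resume from the persisted state instead of redoing the whole query. The function names and the pickle-on-disk state store are invented for illustration; this is not Ratchet's actual code.

```python
# Toy data-batch-level suspension: sum a dataset batch by batch, persisting
# (next_offset, partial_sum) after each batch so a suspended query resumes
# without losing finished batches. The on-disk state store is illustrative.
import os
import pickle

STATE_PATH = "/tmp/query_state.pkl"  # invented state store, not Ratchet's

def run_query(data, batch_size=100, interrupt_after=None):
    offset, partial = 0, 0
    if os.path.exists(STATE_PATH):             # resume from persisted progress
        with open(STATE_PATH, "rb") as f:
            offset, partial = pickle.load(f)
    done = 0
    while offset < len(data):
        partial += sum(data[offset:offset + batch_size])
        offset += batch_size
        with open(STATE_PATH, "wb") as f:       # persist after each batch
            pickle.dump((offset, partial), f)
        done += 1
        if interrupt_after is not None and done >= interrupt_after:
            return None                         # "suspended": resources reclaimed
    os.remove(STATE_PATH)                       # query finished; clear state
    return partial

data = list(range(1000))
assert run_query(data, interrupt_after=3) is None   # suspended after 3 batches
assert run_query(data) == sum(data)                 # resumes and completes
```

In these terms, the redo strategy would be deleting the state file and rerunning from scratch, while the process-level strategy would dump the entire process image, trading much larger persisted state for zero lost progress.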
Sorry, a quick question on that real quick. That flexibility, it's a very flexible framework, with the different levels going all the way down to redo, where you can lose everything, but obviously then there's no preservation of state, so there are very low overheads in that respect. Is that something you want to surface to the user of these systems, so they can actually declare it? Maybe we'll cover this later on, I'm not quite sure, but what does the user interface look like? Can I say, yeah, my query is a redo query, or I want the data-batch-level sort of computation? How do I express that as a user? Is the system intelligent, can the system work it out, or is it something that's surfaced to the user?
Right, yeah.
So this, again, like I mentioned earlier: those systems are from the service provider's perspective. Okay, right. Yeah, so that means the system will make the decision, but it will consider user requirements. For example, like I said, if the user really, really wants to get their data faster, in a really quick way, saying "I don't really care about money, I'll spend all my money on that," then I won't ever suspend your query; I will try my best to let the query run until the end. But if the user says, "okay, if you suspend it, I don't really care," then based on how much suspension they can afford, in our vision, we will give them fine-grained options, like one or two or three, or some other way to let users indicate their expectations. Based on all these things, our system makes the decision about which suspension strategy should be used. Gotcha, there's a mapping there
between the high level of what the user wants and how that gets converted down into these different sorts of strategies. Cool, so I guess with that we're on to pillar number two now. So yeah, take it away. Yeah, so pillar number two is the resource arbitration. We already have a paper about that. For that, once we have multiple queries, multiple tenants, we consider that users have very diverse requirements. So we let users define completion criteria: they will say when they think their query can be finished, or when they think the result will be
acceptable. So, like, accuracy will be 90%, or "I already scanned more than 90% of the data," or whatever. Our system will estimate how much progress a query can achieve if we give it a specific amount of resources. And then we do something like a bandit-style process: we say, if we allocated those resources to these jobs, we estimate how much progress we'd get. And then it depends on different goals. If our goal is to give the resources to the most promising ones, I will give the resources to the job that can achieve the most progress. If I really care about fairness, where I don't want one job to fall behind and I want everyone to achieve some reasonable progress, I will probably keep allocating resources to the one that has achieved the least progress. So this framework, or this primitive two, gives us the ability to adaptively allocate resources to different jobs and achieve some greater good, like fairness or efficiency, right? So yeah, that's primitive two.
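As a rough illustration of this arbitration idea (not the actual mechanism from the paper), the sketch below estimates the marginal progress each job would gain from the next slice of resources under an invented diminishing-returns model, then allocates either to the most promising job or to the job that is furthest behind:

```python
# Toy progress-based arbitration. The gain model and job numbers are
# invented; they only illustrate the efficiency-vs-fairness policies.

def marginal_gain(job):
    # Diminishing returns: early batches help a lot, late ones barely
    # (e.g., the last batch pushes 90% accuracy to 91%).
    return job["rate"] * (1.0 - job["progress"])

def arbitrate(jobs, slices, policy="efficiency"):
    """Hand out `slices` units of resource one at a time."""
    for _ in range(slices):
        if policy == "efficiency":   # back the job with the best estimated gain
            pick = max(jobs, key=lambda name: marginal_gain(jobs[name]))
        else:                        # fairness: help whoever is furthest behind
            pick = min(jobs, key=lambda name: jobs[name]["progress"])
        jobs[pick]["progress"] += marginal_gain(jobs[pick])
    return {name: round(j["progress"], 3) for name, j in jobs.items()}

make_jobs = lambda: {"q1": {"progress": 0.0, "rate": 0.1},   # behind, slow gains
                     "q2": {"progress": 0.5, "rate": 0.6}}   # ahead, fast gains
print(arbitrate(make_jobs(), 3, policy="efficiency"))  # most slices go to q2
print(arbitrate(make_jobs(), 3, policy="fairness"))    # all slices go to q1
```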
So on that primitive two, sorry, really quick. Yeah, sure. So fairness for one job is maybe not the same fairness as the fairness for another job. I mean, I'm kind of thinking, because fairness is not really a well-defined thing, right? It's very ambiguous, it can depend on your perspective, and that can be applied, I guess, to the same thing, to jobs, to query execution. So is there a global setting there, or is that pillar adaptive? Like, the fairness is almost query-level, and that's then used to come up with a global decision that's fair, I guess, with respect to everyone's preferences. I'm sorry if that doesn't make much sense, but yeah.
No, it does. I think that's a good question. So essentially, we support both of them; it depends on the workload. For example, if all these workloads are from the same user, what they want to do is, say, find something: they just randomly give different configurations to different data processing jobs, or whatever, and they want to find the best one, the best result. Then this overall workload may have one global objective: I want to find the best one, right? But if the queries are from different users, they may have different requirements, different goals, and it makes no sense to set a global objective for them, right? Because everyone is an individual. So in that case, we will consider: what's your objective? What's your completion criteria? How much money do you want to spend? And then we can, you know, adapt, find some way to allocate the resources to these jobs, and try to keep everyone happy.
Yeah, cool.
I guess, yeah.
So then number three, the cost model.
So tell us about pillar three.
Yeah, the cost model, this is a vision as well; we haven't published any long paper about that. But in our paper, the vision, like I mentioned, is that we try to provide a more fine-grained pricing model rather than the standard pay-as-you-go service model. Like I've already said several times: if you allow me to suspend your query, or you give me some tolerance you can accept, then I can probably give you better choices, better price options. And then you'll be happy, and I will be happy, because I can reallocate the resources that would have been allocated to you to some others, to make more money. And then, yeah, everyone's happy. And I think in that case we somehow achieve better utilization of these limited resources, in terms of money, or in terms of how much money the service provider makes. So I guess, in that case, cost tolerance is a fine-grained pricing model, in contrast to the standard pay-as-you-go model.
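A back-of-the-envelope sketch of what such a cost-tolerance pricing model could look like; the tiers and discount numbers are entirely invented, since the paper only outlines the vision:

```python
# Hypothetical suspension-tolerance pricing: the more suspension a user is
# willing to tolerate, the cheaper the resources. All numbers are made up.

PAY_AS_YOU_GO = 1.00          # baseline $ per resource-hour
DISCOUNT = {0: 0.00,          # tier 0: never suspend me (on-demand price)
            1: 0.35,          # tier 1: suspensions OK with advance notice
            2: 0.70}          # tier 2: suspend whenever, I just want a deal

def price(hours, tolerance_tier):
    return hours * PAY_AS_YOU_GO * (1.0 - DISCOUNT[tolerance_tier])

for tier in DISCOUNT:
    print(f"tier {tier}: 10 hours cost ${price(10, tier):.2f}")
```

The provider wins too: hours reclaimed from tier-2 users can be resold, which is the "everyone's happy" utilization argument Rui makes above.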
I guess that sort of kind of
brings the vision sort of together.
We've kind of got our sort of three
pillars there, each component.
So with this obviously being a vision paper,
there's a nice section in your paper
about the future directions for Ratchet.
So tell us a little bit about where we go from this, since we've got this framework. What are the next steps?
Yeah. So, since I just graduated, I want to say a little more on behalf of our UChicago group. This project is a long-term project, under the supervision of the professors, and there's a bunch of talented people working on it. Our next step is, actually, we have already started building a suspension-oriented database system. We consider query suspension as a first-class capability: for this database, we think the query can be suspended, or should be suspendable. And then, in that case, what would the modern database look like?
How should we redesign each component to support that requirement?
In that case, there's a very important difference.
We cannot assume the query can run from beginning to end. And once we have multiple of them, once we suspend them, you can imagine some challenging issues: how should we keep consistency? How should we keep the old versions? Once we have multiple versions, you know, what if the data changed between when we suspend the query and when we resume the query? How should we guarantee that the result will be the same? Or which one does the user want? Probably, when we suspend the query, the data is the old version, and when they resume, it could be the newer version. How should we do that? Which one do you want? You know?
So, yeah, there are some, I think, promising research directions there. And also, you can imagine this system is not only good for resource-limited settings; it also provides flexibility to almost every system. If the cloud infrastructure needs to be upgraded regularly, or the system needs to shut down for some time, without this suspend-oriented database everyone loses their progress, or we need to find a way to handle that. But now, as long as you can tell me when you want to shut down the infrastructure, or upgrade, or scale your resources up or down, I can find a way myself to keep everything happy, right? So you can imagine there are a lot of ways, or application cases, where we can use these suspension-oriented databases.
And another thing is, I think we want to spend some more time on this pillar three. I think it's not only about database research or computer science research; it may need some economics, like: how should we provide comprehensive pricing models that give users more options, and then have a different mechanism to handle each option and make more money, specifically? I think that's probably the future work for the near future. For 10 or 20 years out, I really have no idea.
Okay, so we're at base camp at the moment for the research group's agenda for the next 10 years.
And we'll see where it goes.
But it sounds like there's plenty of interesting directions for it to be going in for sure.
Yeah, thanks.
Yeah, so keep a lookout for the papers coming out of your group, because I'm sure they'll be very interesting. The question I like to ask some of my guests about the work is: if you put your reviewer hat on for a moment, what are the limitations of your work? Obviously this is a vision paper, so it's kind of hard to know the concrete limitations of the work at the moment.
But I guess what limitations or what challenges do you see in taking this vision and actually making something realistic and concrete and practical out of it?
What are the big challenges or limitations that you may run into in the long term, or even in the short term?
Yeah, we did have some discussions, or brainstorms, when we explored this area and this project; there are several things we discussed before. Actually, we made some assumptions here. We know, or somehow know, when the termination will happen. If termination happens suddenly, like the next second, and we have no idea, there's no way to save the progress. So that means one limitation, or one case that we cannot handle, is when the termination is unexpected. If no one tells us, we will probably lose everything, or we will behave very, very badly. That's one thing. But if you can give us some hint about when the termination could happen, it does not necessarily need to be an exact time point; you can give us a range, an interval.
The termination could happen within this time, like five minutes, or 30 minutes, or whatever. So it's like, on Tuesday. Yeah, exactly. We can find a way. So just give us something, and we will try our best to develop a system, you know, to keep everything in good shape. Yeah, I think that's the one thing we haven't solved yet. The second thing, which I think would probably change this project significantly, is: what if we have a really good world in the future, where the resources are not limited? Even the zero-carbon cloud infrastructure, somehow, thanks to very, very good people and engineers, they somehow manage to get this energy really stable. Even if it's driven by a sustainable energy source, they can somehow keep those resources stable, and there's no need to consider that the resources will be terminated. I wish they could, and I really want to live in a world like that, where we have unlimited resources.
yeah
it sounds
a bit
sort of
utopia
this glorious
future
that will
probably never
happen right
it sounds
great
yeah
I don't know how
realistic but in that case yeah if in that case well our project probably won't play a significant
role there because one we have two assumptions one is like resolve this identity we consider
this will be the trend consider the global situation here so like you know everything
and then we consider we we will have something about
termination as long as you can give us these two assumptions and then we yeah our our system our
project will give you good answers nice cool so yeah i guess we've kind of spoke with the vision
there a little bit. And I guess, I mean, what impact do you think Ratchet can have longer term? How revolutionary do you think this project can be going forward? There are two questions here, sorry: there's the general impact of the project (no, no, that's fine) and then there's the impact on the day-to-day of software engineers and data engineers. Yeah, if we're putting our prophecy hat on now, we're going to say, okay, this is what's going to happen in the future. What impact do you think it
can have? For the findings, or the impact, of Ratchet, I think I've already said this a lot of times: we want to argue that we should try our best to use the ephemeral cloud resources. We should even accept the termination, and consider termination as a regular thing rather than an exceptional thing. Once we accept that resources will be very dynamic and can be terminated sometimes, or somehow, then we can possibly design a system to handle them, or even utilize them. I think that's the biggest argument or finding in our research. So for software engineers or data engineers, my guess is that once they run or design a system, they never consider that the resources can be terminated. I don't think they consider that the resources will be preempted. They will consider how to maximize resource utilization, but they won't consider: what if, sometimes, I have no resources? What can happen?
Yeah, I think that links to some existing areas like fault tolerance, recovery mechanisms, or checkpointing; yeah, you name it, there are existing topics in that area. But with our research, we provide another way to reconsider those things: can we just try to keep as much progress as possible, or keep the necessary progress, and make people think, okay, that's good enough for us, if the termination could happen? I think that's something different from the previous research work.
Yeah, I guess when you were talking about the first pillar in the framework and the different strategies, as you were talking about those, I was thinking checkpointing, I was thinking those sorts of things, like savepoints and whatnot. But I guess what's happening here is this has now been surfaced as an interface, basically, that people can interact with, rather than it being hidden within the system, within the database somewhere doing checkpoints. Now we're actually able to exploit it and use it in our applications and whatever. So yeah, I can definitely see the parallels and how you are basically taking those and putting them in a different light.
Right, yes, exactly. So the next one that I like to ask a lot is: what's the most interesting thing you've learned while working on the project, in this case on Ratchet? What's been the best insight you've got from it, or most surprising, maybe? I think the most interesting lesson I learned links to some backstory, you know, the origin story of this project. This project actually came from a previous one. In a previous project I had, I mentioned Rotary, we wanted to schedule different jobs during runtime, and we wanted to give them the most appropriate resources.
And then we found, during runtime, we want to preempt those queries: we want to, you know, reclaim the resources, and we need to suspend them. At that time, we used a checkpoint mechanism that supports not only the data processing jobs but also the machine learning jobs. So anyway, we relied on the checkpoint mechanism to suspend jobs. And sometimes we found the checkpoint would be a time-consuming process. It may generate a lot of data, and if we keep doing that, sometimes it gives us unexpectedly long latency.
So at that time, I kept going deeper and deeper to find out, okay, how do they checkpoint the query process? We used Spark Structured Streaming as the platform, and I went inside their source code to see, okay, how they checkpoint things. And we figured out, okay, yeah, they did do some pretty good work, but they definitely do a lot of processing to maintain this checkpoint. So we wanted to keep exploring: is there a better way to do that?
If we could, what can we do? Is there any other method we can use to suspend it? So I think that's the most interesting thing: the current project actually comes from a failure point of the previous project. If you keep going and keep going, you will find that sometimes you can do more work on a failure point, and then you can publish a new paper. I think that's a pretty interesting lesson that I learned. First, never overlook any failure points. Secondly, keep going and dig into the details; probably you'll get something.
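For context on the mechanism Rui dug into: Spark Structured Streaming persists a query's offsets and operator state under a checkpoint directory, and that bookkeeping is exactly what can grow large and slow. A minimal, standard example (the synthetic rate source and paths are just for illustration):

```python
# Minimal Spark Structured Streaming job with checkpointing enabled.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

# A synthetic source that emits (timestamp, value) rows.
rows = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
counts = rows.groupBy((col("value") % 10).alias("bucket")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .option("checkpointLocation", "/tmp/ckpt")  # offsets + operator state
         .start())
# If the job is stopped and restarted with the same checkpointLocation,
# Spark reloads offsets and state and resumes instead of recomputing.
query.awaitTermination(30)
```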
Yeah, keep iterating, right? And you never know. Right, just keep going, trying new things. Yeah, awesome, that's really cool. And whilst we've
got you here, it'd be nice for us all to hear about the other work from your PhD as well. So maybe you can give the listener a quick breakdown of some of the other things you've worked on across your studies. Yeah, definitely. So besides the research I just described, my main research, I also have a strong interest in the machine learning area. So I do some projects at the intersection of databases and machine learning.
One is, when I was a research intern at Microsoft, I optimized the feature store. One of the most important operations in feature stores is called the point-in-time join; it's essentially an operator to generate different features, and we optimized it to make it faster to get the features for training.
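To illustrate the point-in-time join Rui mentions, here is a small pandas sketch (the tables are made up): for each training label, it fetches the latest feature value at or before the label's timestamp, so no future information leaks into the features.

```python
# Point-in-time join with pandas.merge_asof: per label row, take the most
# recent feature row with feature.ts <= label.ts for the same user.
import pandas as pd

features = pd.DataFrame({
    "user": ["a", "a", "b"],
    "ts":   pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-02"]),
    "clicks_7d": [3, 8, 1],
}).sort_values("ts")  # merge_asof requires sorting on the join key

labels = pd.DataFrame({
    "user": ["a", "b"],
    "ts":   pd.to_datetime(["2024-01-06", "2024-01-03"]),
    "label": [1, 0],
}).sort_values("ts")

training = pd.merge_asof(labels, features, on="ts", by="user")
print(training)  # each label paired with its as-of feature value
```

Feature-store systems optimize exactly this operator because it runs over very large label and feature tables when materializing training data.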
Also, once I was on the DocuSign AI engineering team, I optimized their container-based hyperparameter optimization workloads.
I'm not sure if you are familiar with hyperparameter optimization, but essentially it's like: I have a lot of hyperparameter configurations, and I want to find the best one to get better training results. In that infrastructure, they use a container to run each hyperparameter configuration. I optimized it, trying to put different containers together and train them, or execute them, in parallel, to save time and get better results in the application.
I think in one of my early works from my PhD studies, I looked at, once we have multiple models, whether it's possible to combine or pack them together as a bigger one on the same single GPU and train them simultaneously, without hurting the final result of each model.
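A toy sketch of that packing idea, assuming a PyTorch setup (this is an illustration, not the actual system from the paper): two small models co-located on one device and stepped in the same training loop, each with its own optimizer, so neither model's result is affected by the other.

```python
# Co-locating two small models on one device and training them in a shared
# loop, so the GPU is shared instead of running two separate jobs.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
models = [nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
          for _ in range(2)]
opts = [torch.optim.SGD(m.parameters(), lr=0.01) for m in models]
loss_fn = nn.MSELoss()

x = torch.randn(64, 16, device=device)  # shared synthetic batch
y = torch.randn(64, 1, device=device)

for step in range(100):
    for model, opt in zip(models, opts):  # both models share the device/batch
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                   # each model keeps its own gradients,
        opt.step()                        # so packing doesn't change its result
```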
I think, yeah, those are some of the other research projects I explored or did during my PhD, but I think I will keep doing some of this machine learning work.
Cool, if any of those topics sounded
interesting to the listener, we'll put links to
those in the show notes as well so you can go and follow up
and have a look at some of the other awesome
work that Rui's done. So I've just got
two more questions for you now
Rui, and the first one
you maybe touched on a little bit a second ago, actually, when you were talking about iterating and following through with ideas and not giving up on them that easily. It's all about your creative process: how you approach research, how you approach idea generation, and then, once you've got an idea or a set of ideas, selecting which ones to dedicate time to and pursue. So yeah, how do you go about that? What's your process?
Right, yeah. So I think, first things first: as a PhD student, you cannot ignore the role of your advisors. They carry a lot of weight once we want to get ideas or select projects.
But for myself, I think I consider ideas as two kinds of things.
Like one is the original idea.
The other one is the optimization idea.
So for the original idea, it's like you create something that never existed before. It's really totally new.
But those original ideas, they need inspiration, patience, and a lot of other work. That is, you know, something that is really difficult to measure and follow, right? But according to my experience, I think I have some experience with the optimization ideas, something others can probably follow.
First, find an area you have interest in.
And then you keep exploring.
And you will find some projects, find some papers,
find some important contributions.
And then the thing I would like to,
you know, find is their assumption,
their scenario, their applications.
And then I will see,
can I break one or multiple assumptions of them?
That may give us, you know,
a totally different thing, right?
Has the application or environment changed since the paper or the project was proposed? For example, you know, a couple of years ago there was no GPU, no modern powerful hardware, no NVMe, you name it, and now we have all of them. So can we, you know, add those new things
to change some existing work.
And then, of course, I want to see
is there anything missing?
Is there any, you know, chance I can optimize
or I can add more function, you know,
to overall improve the existing works?
That's something I usually will do
once I want to propose an optimization idea,
you know, try to improve the existing works.
So that's my idea generation.
And for project selection, I think I'm a little bit pragmatic. Once we select projects, of course, interest is one thing, but we need to consider, you know, the deliverable results: how much time do you have, and what result are you expecting? Do you want to have a paper in three months? Do you want to have a paper in one year? Or do you want to have a long-term project,
so that at the end of your PhD journey you can announce: everyone, I made a really, really cool thing, everyone should look at it? What's your goal? So yeah, sometimes I figure out, okay, this is something I want to achieve, and then I will see, okay, what's the best way, or the best project, to achieve that goal. So I think, yeah, that's my way to select projects. But ideally, it would definitely be much, much better if I could select a project I really love. But that's usually not the case.
Yeah, that's true. But I liked what you said, just going back to the idea generation part of your answer there: understanding all the variables, all the assumptions, and then thinking, how can I change these variables, how can I break this assumption, and then seeing what happens and following those threads. That was a really nice part of that answer. And yeah, it's always good if you have something that you're passionate about, right? It makes it so much easier to work on something and pursue something if you're passionate about it and really interested in it. So yeah, that's a really nice answer to that question. Really cool. So we're at the end now, so it's time for the last word. What's the one thing you want the listener to take away from this podcast episode today?
I think, yeah, I want to say, you know, I consider computing resources, or energy-based resources, to be more scarce compared with, you know, times before. And you can imagine how complicated modern data processing jobs are, and how large the large language models are. You can see they are resource-consuming jobs, right, those tasks. So in that case, I think we need to consider whether there's any better way to, you know, fully utilize the resources. Even if the resources are not stable, even if the resources are, you know, ephemeral, we still really want to use them. Because, yeah, resources will be scarce in the future, and you can imagine a lot of current model computation jobs consume a lot of resources. So I think that's the thing. Yeah, I think that's one way to keep the audience thinking: is there any other way, or a better way, to build a better system? I think that's the one message I want to deliver.
Great. Good message to end on for sure.
So thank you so much, Rui.
It's been an absolute pleasure to talk to you today.
If the listener wants to find any of Rui's work, we'll put links to everything in the show notes. You can go and check those out. Whereabouts can we find you on social media, Rui? Are you on any of the platforms, LinkedIn, Twitter, or, I'm sorry, X, should I say now? Can we find you anywhere?
I have a LinkedIn account.
I think people can find me
on that platform.
For others, I'm not a very active social media guy, so I guess, probably, yeah. Email maybe might be the best way to contact you then, if people want to talk about the work we've been speaking about today? I guess so; we can put your email in there. Yeah, or you can send me a message on LinkedIn; I guess that's a good way to reach out to me. Fantastic stuff. Yeah, and a reminder again: if you enjoy the show, please consider supporting us through Buy Me a Coffee, and we'll see you all next time for some more awesome computer science research. Thank you.