Disseminate: The Computer Science Research Podcast - Lexiang Huang | Metastable Failures in the Wild | #17
Episode Date: January 9, 2023
Summary: In this episode Lexiang Huang talks about a framework for understanding a class of failures in distributed systems called metastable failures. Lexiang tells us about his study on the prevalence of such failures in the wild and how he and his colleagues scoured publicly available incident reports from many organizations, ranging from hyperscalers to small companies. Listen to the episode to find out about his main findings and gain a deeper understanding of metastable failures and how you can identify, prevent, and mitigate against them!
Links: OSDI paper and talk | Personal website | Twitter | LinkedIn
Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate, a computer science research podcast. I'm your host, Jack Wardby.
I'm delighted to say I'm joined today by Lexiang Huang, who will be talking about his OSDI '22 paper, Metastable Failures in the Wild.
Lexiang is a PhD student at Pennsylvania State University. His research focuses on performance debugging for complex distributed systems, and he's actively collaborating with large-scale companies to debug performance issues and improve cloud efficiency.
Lexiang, welcome to the show.
Hi Jack, thanks for having me.
Let's dive straight in. So can you tell us how you became interested in research and
how you became a PhD student and specifically how did you get interested in distributed systems?
Oh yeah, that's a very good question. So, for doing research, well, I like to explore the fundamental ideas behind phenomena. Take metastable failures: before we proposed this idea, there were many forms of it; people called it persistent overload, death spiral, and so on. But what is fundamentally similar among those things is something we can dig deeper into, so that we can provide a general framework for understanding them and solving them. So that's why I'm interested in research.
And you were asking about distributed systems as well. Yeah, distributed systems are popular, but I also just want faster computer systems in general. And I'm interested in distributed systems because there are many hard problems, of course, but also because distributed systems run at scale. So, as you say, when you improve the performance of a system even by a little bit, it can have a very large impact. That's why I like to do research in this area.
Fantastic, yeah, for sure, there are a lot of interesting problems in distributed systems. Right, so you mentioned them earlier on, and the star of the show today, I guess, are these metastable failures. So can you tell the listeners what metastable failures are and maybe give us some illustrative examples?
I know you mentioned a few before, but let's dig into that a little bit more.
Oh, yeah, yeah. So, metastable failures. Well, speaking of that, you must have heard of retry storms, right? So for a retry storm, let's give an example.
Like suppose you have a database with the original capacity of 1,000 requests per second,
and you're running it at a load of 700 requests per second.
And it's running fine, but all of a sudden there is a temporary trigger which introduces background interference, such as limpware (degraded, slow hardware) or garbage collection, and that decreases your capacity temporarily, say, to 600 requests per second. And now, because your capacity is a little bit lower than the load, you're
overloaded. And then the application starts to time out, and then the retries start. Let's say that, at maximum, each request will retry one time. So in the end, your load can increase to 1,400 requests per second. And at that point, even after you remove the trigger and your capacity has recovered to 1,000 requests per second, your system is still overloaded, and it can never get out of that state without human intervention. We call this permanent overload. So, to summarize, a metastable failure is a case where there is permanent overload even after the trigger is removed.
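To make the arithmetic in that example concrete, here is a minimal simulation sketch of the retry-storm feedback loop (my own illustration, not code from the paper; the discrete time steps and the assumption that every offered request times out and is retried once while the system is overloaded are simplifications):

```python
# Toy model of the retry storm described above (illustrative only, not the paper's artifact).
# While load exceeds capacity, requests time out and each is retried once, which is the
# sustaining effect that keeps the system overloaded.

def simulate(steps=12, capacity=1000, offered=700, degraded=600,
             trigger=range(3, 6), max_retries=1):
    retries = 0
    for t in range(steps):
        cap = degraded if t in trigger else capacity       # temporary capacity dip
        load = offered + retries                           # original requests + retries
        overloaded = load > cap
        # Assumption: when overloaded, roughly every offered request times out and is
        # retried up to max_retries times in the next step.
        retries = offered * max_retries if overloaded else 0
        print(f"t={t:2d}  capacity={cap:4d}  load={load:4d}  "
              f"{'OVERLOADED' if overloaded else 'ok'}")

simulate()
# Once the trigger window (t=3..5) tips the system over, the load stays at 1,400 req/s
# even after capacity returns to 1,000 req/s: the metastable failure state.
```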
I know a lot of your prior work, and this paper as well, has been building up a framework to reason about this class of failures, these metastable failures. So can you tell us more about the framework that you've developed, and also about the various properties of these metastable failures?
Yeah. So we definitely built models to help us understand why this happens.
But I would say one of the distinguishing properties of metastable failure
is that there's a sustaining effect that keeps the system in a metastable failure state
even after the overloading trigger is removed.
For example, in the retry storm case, the retries themselves serve as the sustaining effect. Even after the trigger is removed, because the retries have already pushed the load above the capacity, retries lead to more retries and keep the system stuck. This is how metastable failures are distinguished from other types of failures.
So, speaking of how a normal system transitions into a metastable failure state: suppose a system is stable and everything is working fine. Once there's a load increase or a capacity decrease, that can render the system vulnerable. Vulnerable is a state in which the system is still running fine for now; there's nothing wrong with it, there's no problem. But once there's an overloading trigger that overloads the system, it can push the system from the vulnerable state into the metastable failure state. And then, because there is a sustaining effect, as we have been talking about, that keeps the system stuck in the metastable failure state, it is not until people take some drastic measures to recover the system, such as rebooting it or reducing its load to a very low level, that the system can recover from the metastable failure.
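Here is a rough sketch of that stable / vulnerable / metastable lifecycle as described above (my own illustration with invented event names, not code or terminology taken verbatim from the paper):

```python
# Illustrative state machine for the lifecycle described above (event names invented).

STABLE, VULNERABLE, METASTABLE = "stable", "vulnerable", "metastable"

def next_state(state, event):
    if state == STABLE and event in ("load_increase", "capacity_decrease"):
        return VULNERABLE                 # load moves close to capacity
    if state == VULNERABLE and event == "overloading_trigger":
        return METASTABLE                 # a trigger tips the system over
    if state == METASTABLE and event == "trigger_removed":
        return METASTABLE                 # sustaining effect keeps it stuck
    if state == METASTABLE and event in ("reboot", "drastic_load_shedding"):
        return STABLE                     # only drastic action recovers it
    return state

state = STABLE
for event in ("load_increase", "overloading_trigger",
              "trigger_removed", "drastic_load_shedding"):
    state = next_state(state, event)
    print(f"{event:22s} -> {state}")
```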
Oh yeah, so it almost needs some human intervention to go in there and resolve the issue.
Yeah, and that has been the case in most of the incidents we have observed so far.
Cool. So in this paper, what was the main research goal, and what were you trying to achieve?
Yeah. So first, we want to raise people's attention to this type of failure. Because, by nature, metastable failures are rare; they don't happen that frequently. But once one happens, from our study, it can lead to catastrophic results. From our survey, we find that when a metastable failure happens, it most commonly leads to major outages lasting from four to ten hours. And we found that sometimes people didn't understand the reason behind metastable failures correctly, so they tried recovery methods that actually amplified the failure. That's why we want to propose the idea of metastable failures and provide a framework for understanding them. And after that, we can think about what the solutions to these problems are and how we can resolve them. So those are our main research goals.
Fantastic. So you mentioned it a second ago: there's an amazing study in the paper of various different metastable failures in the wild. How did you go about tackling that research goal? How did you design this study, and what was your approach to doing it?
So you were asking about how we define metastability, or...?
More about how you designed the study: you decided you wanted to get a better understanding of metastable failures, so how did you approach answering that question?
Yeah. So, first of all, when studying a problem, we want to make sure it really exists in the wild, that it's really something worth studying. That's why the first thing we designed was a survey: we wanted to study the prevalence of metastable failures in the wild. How we did that is that we studied public post-mortem incident reports. We started with those because, usually, when a company takes the trouble to write an incident report, that incident is already big enough to warrant public awareness. So we started by analyzing on the order of 100 to 1,000 public incident reports. We also tried to cover a diversity of companies and see whether these failures exist across them. For example, we looked through reports from large cloud infrastructure providers like AWS, Microsoft Azure, and Google Cloud, as well as from smaller companies and open-source projects like Apache Cassandra, etc. So that's how we designed the study.
Fantastic. Cool. So can you summarize for the listeners what the findings were of this study?
Yeah, well, there are definitely a lot of findings.
Well, let's dive in. Let's go through them.
You can look at the paper for all the details, but I would say the most important one is that metastable failures really can be catastrophic. From our study, we find that at least four out of 15 major outages in the last decade at AWS were due to metastable failures. And people have diagnosed this type of failure under different names, in a very ad hoc fashion: persistent congestion, persistent overload, retry storms, death spirals, etc. And people have applied ad hoc recoveries as well,
like load shedding, rebooting, adding more resources,
or even tweaking configurations.
So the insight from our study was that these different-looking failures can be characterized under one taxonomy. Based on that, we went back through the post-mortem incident reports to identify the metastable failures we've been talking about, and we were able to identify 21 of them, ranging from large companies to small companies and projects. And the most important thing we found, as we talked about, is that they can cause major outages, most commonly lasting four to ten hours, and that incorrect handling really leads to future incidents. I can give you an example of this. In one of the incidents we found at Spotify, there was a metastable failure that happened due to retries, a retry storm. The engineers at Spotify wanted to identify why the retries were happening, so they added even more extensive logging around the retries. But because of the excessive logging, the penalty of each retry became even higher, which exacerbated the metastability and led to further incidents. So this shows that, without properly understanding that something is a metastable failure, and the mechanism behind it, you can actually end up with the wrong error handling or recovery methods. All of this makes metastable failures an important class of failures to study.
For sure, yeah, definitely. So, of the metastable failures you identified across these different companies, what was the diversity? Once you had developed the taxonomy and the framework, did you find that certain types happened with a lot more frequency than other types of metastable failures?
Well, yeah. To put it another way: in this study we want to convince people that, besides the common patterns of metastable failures, there are also many other types that share the same characteristic, which is permanent overload even after the trigger is removed, and they exist as well.
The most common one, as you've been asking, is due to retries, or retry storms. That's the classic example. But besides that, we also find that a load spike can be the trigger, and, on the other hand, a capacity-decreasing event can be another trigger. Because what really matters is the relationship between the load and the capacity. When your load is below the capacity, everything is working fine. But when your load is above the capacity, and there's a sustaining effect, an amplifying mechanism sitting right there, it can keep your system stuck. And on the other dimension, besides the workload amplification that exists in the retry storm sort of case, there's also capacity degradation amplification, which exists in many other scenarios as well.
Fantastic. That's fascinating. I know you've been working on this line of work for a while, and this paper introduced three new extensions to the framework for reasoning about metastable failures. Can you tell us a little bit about what these additions were and why they were made?
So, you're asking about the different types of metastable failures, or...?
In the paper you talk about how you took the original framework, the existing framework you had, and made some additions to it. I just wanted to get a description of what the key changes were over the initial framework and why you made those specific changes.
Yeah. So the initial framework and our framework don't conflict with each other; actually, that earlier paper was also our paper. We just present a deeper understanding of the metastable failure framework. From the beginning, as we have been talking about, there's the idea of how the system transitions from stable to vulnerable to metastable failure; that's still there. But what we find is that a load increase is not the only cause, because, symmetrically, a capacity decrease can also render the system vulnerable. Both adjust the ratio between the load and the capacity, and that is the key factor in whether metastable failures can happen.
As well as that, we find that the vulnerable state is not a binary state. You cannot just say a system is vulnerable or not vulnerable; actually, systems have different degrees of vulnerability. That's another thing we find exists in practice. And people in industry might want to pay attention to how vulnerable their system is, and keep that in mind, so that they can better maintain it and run it without running into metastability or other issues. So that's another thing.
And the other thing we try to provide in this OSDI paper is real examples of metastable failures. Even in our previous HotOS paper, we said that replicating metastable failures is very hard, and we gave some ideas and simplified incidents of how it could happen, but we didn't provide any real data on that. In this paper we have more space, so we want to provide more examples, as well as release the code for people to really test it out and see, oh, this is how a metastable failure happens. And if they can come up with any solutions, they are very welcome; they can just test them on our testbed. Yeah.
Fascinating.
Yeah, it just comes to mind: obviously you have this framework, and you can taxonomize all of the existing failures you've observed in the wild under it. Are there any failure modes that are possible under the framework that you did not observe in any real incident? Can it almost act as a warning sign: be careful, don't do this, because this bad thing could happen? If that question makes any sense.
Yeah. So we definitely thought about the completeness of the model, although we didn't prove it, because it's very hard to prove. But we came up with the scenarios, the taxonomy, by factoring out the basic components that really matter, which are also what people in industry actually measure about their systems: one is the load, the other is the capacity, and then how the relationship between those two metrics changes over time. That's how we think about how a system could possibly run into metastable failures. And in our model, or taxonomy, we factored in the different combinations of two types of triggers and two types of amplification mechanisms, although there could be multiple triggers and multiple amplifications, so there could be superpositions of those four scenarios as well.
Okay, cool. So how do you go about untangling things if you've got multiple of these happening at the same time? How do you work out which one is, I mean, they're all problems, right, but which is the root cause, I guess?
Yeah, yeah. So definitely there can be a lot of misleading signals.
Especially when I was working at Twitter, trying to find the metastable failures there, there were many things that could be going on: there's a load spike, and there's capacity decreasing, maybe due to garbage collection, maybe due to some hanging node that's not doing well, in an unhealthy state, something like that. And when we say there's some amplification we want to find, there are different signals for that as well. So, to characterize this: definitely, multiple things can happen together, but we want to provide a simple taxonomy for people to understand, and that's why we want to decouple them. Although in real incidents multiple things can happen at the same time, for this paper, because we want to demonstrate the fundamentals, when replicating those incidents or scenarios we introduce one trigger at a time, as well as one amplification mechanism at a time. And in this paper, in total, we actually introduced three real examples which, together with the retry storm that is well known everywhere, form the four metastability scenarios.
Okay, cool. Yeah, you mentioned you were working at Twitter a second ago, so it's
a nice segue into what I wanted to ask next. You also mentioned the metastable failure you identified there, so let's pull on that thread a little bit more, because it's a fascinating case study that you go through in the paper. Can we go through it in a bit more depth? Can you tell us what the failure was, the process of identifying it, and how you went about figuring out that this was the problem?
Yeah, yeah. So that's a very good question, and I'd like to share it. When I was working at Twitter, we were wondering why, after rigorous
testing, the system could sometimes still get stuck in a very low-throughput state. That's what the engineers were always wondering, and so did I. Then, when looking through some of the incident reports (they're not public, they're internal to the company), we found there was something that seemed to be keeping the system in a very low-throughput state through some amplification mechanism, and we tried to find out why. In one particular instance, we found that on that day there was a peak load test, and the initial load spike from the peak load test introduced high queue lengths in the system, because of the jobs arriving and the memory allocation they cause. Because of this memory allocation, there are more live objects for the GC to process, because at Twitter we use a language that has garbage collection, and there's also higher memory pressure, which causes more GC cycles; this pushes GC activity up. But, as we all know, garbage collection consumes CPU resources and memory resources, and GC causes the application to pause and slow down, naturally, because it's in contention with the normal requests; it's consuming resources. Because of this, the jobs are slowed down, and because the jobs are slowed down, the queue length gets even higher. This completes a loop, you see: the load spike introduces high queue length, high queue length introduces more GC activity, GC activity slows the jobs down, and that feeds back into even higher queue length. So this forms the mechanism of capacity degradation amplification, and that's how we found the sustaining effect at Twitter specifically. To summarize, the sustaining effect is the contention between the arriving traffic and the GC consuming resources.
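As a rough illustration of that queue-length and GC feedback loop, here is a toy model with invented constants (not Twitter's system and not the paper's replication code):

```python
# Toy model of the capacity-degradation loop described above (constants invented).
# More queued work -> more live objects and memory pressure -> more GC -> less CPU
# left for serving -> lower effective capacity -> the queue grows further.

def gc_loop(steps=10, arrival=900, base_capacity=1000, spike=600, spike_at=3):
    queue = 0
    for t in range(steps):
        arrivals = arrival + (spike if t == spike_at else 0)   # one-off peak load test
        gc_overhead = min(0.9, queue / 2000)                   # assumed GC cost of the backlog
        capacity = int(base_capacity * (1 - gc_overhead))      # effective serving capacity
        queue = max(0, queue + arrivals - capacity)
        print(f"t={t:2d}  queue={queue:5d}  effective_capacity={capacity:4d}")

gc_loop()
# The load spike at t=3 is the trigger; afterwards the GC overhead keeps effective
# capacity below the arrival rate, so the queue keeps growing even though the spike is gone.
```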
After that, we also tried to see what the engineers did about it. People did aggressive load shedding; they tried to decrease the requests per second by a lot. In our own replication, we tried decreasing the load by 30%, but the GC still doesn't come down; it doesn't help lower the GC, so the system gets stuck right there. The GC is high, the queue length is high, and even after reducing the load it doesn't get out of it. And at Twitter, we found that only after rebooting all the machines and instances could we eventually recover from it. So that's a metastable failure, and you can see that this particular metastable failure is not really caused by retries: it's caused by garbage collection consuming CPU resources in contention with normal request serving.
My next question, and you've mentioned this throughout the chat so far, is about replicating metastable failures.
Yes.
Obviously it's a very difficult thing, and you mentioned that in your HotOS paper it wasn't really there yet, that you found it really hard to do. I guess the first thing I want to ask is: with your replication of these metastable failures, what were the questions you were trying to answer with this set of experiments?
Yeah, so we do have different types of experiments in our paper, but, as we have discussed, each of them represents a different type of metastable failure. So first of all, we wanted a proof of concept: metastable failures are not only one type. It's not just the retry storm; they can form differently, with different types of triggers and different types of amplification mechanisms.
So each of the experiments has a different purpose. For example, for the garbage collection case, as we talked about in Twitter's case, we tried to replicate it in a local environment just to make sure that it is a metastable failure, and yes, it is. We also confirmed that the trigger was the load spike and that the amplification, the sustaining effect, was due to capacity degradation. But there's also another type of trigger that can render the system vulnerable and eventually metastable, which is a capacity-decreasing trigger. For example, we introduced a slowdown into a replicated state machine, and that slowdown serves as a capacity-decreasing trigger. Eventually it slows down the system, then the application starts to time out, and then the retries start. That's the main message from that experiment.
We also built a three-tier cache example. In that system we have the web server, the look-aside cache, and a back-end database. At some point there's a cache hit-rate drop at the look-aside cache, and because the cache can no longer serve many of the requests, those requests fall back to the back-end database. And because the database was not designed to handle that many requests, it gets overloaded and many requests start to time out. Because the application requests time out, they cannot refill the cache; and because the cache isn't refilled, the cache hit rate doesn't recover. So the capacity degradation persists and the system never gets out of it. That's another type of metastable failure, where a capacity-decreasing event serves as the trigger and capacity degradation amplification serves as the sustaining effect.
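Here is a similarly rough sketch of that cache and database feedback loop (a toy model with invented numbers, not the paper's testbed):

```python
# Toy model of the look-aside-cache example described above (numbers invented).
# A hit-rate drop overloads the database; timed-out misses never refill the cache,
# so the hit rate, and therefore the overload, never recovers on its own.

def cache_loop(steps=10, requests=1000, db_capacity=300, hit_rate=0.9, drop_at=3):
    for t in range(steps):
        if t == drop_at:
            hit_rate = 0.3                           # trigger: sudden cache hit-rate drop
        misses = int(requests * (1 - hit_rate))      # traffic falling through to the DB
        overloaded = misses > db_capacity
        if overloaded:
            hit_rate = max(0.0, hit_rate - 0.05)     # timeouts: entries expire, no refill
        else:
            hit_rate = min(0.9, hit_rate + 0.2)      # served misses repopulate the cache
        print(f"t={t:2d}  hit_rate={hit_rate:.2f}  db_load={misses:4d}  "
              f"{'DB OVERLOADED' if overloaded else 'ok'}")

cache_loop()
```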
Interesting.
So given the current state of the framework you use for reproducing these metastable failures, how easy is it to add new ones? Are there various components that are composable, so you can say, oh, I want to try this with that? Or are they very independent, so you have to construct each one by hand? How easy is it to make new ones?
Well, I would say that because metastable failures are rare even in real life, reproducing them is definitely hard, and we have said that in both the HotOS paper and the OSDI paper. Reproducing them is hard because you need to understand where the boundary is, what the boundary is between vulnerability and metastability, and it changes according to many other things. If your system is running at a higher load, then your system is more vulnerable, so a less intense trigger can push it into a metastable failure state, while if your system is running at a lower load, then it's harder.
Yeah. A question that actually comes to mind
is this: at the moment the direction of flow is, hey, we've identified some problem and we want to go and reproduce it, right? So we use the reproduction framework for that. Would you be able to reverse that and say, when I'm designing some system, could we use this framework to inform our design decisions? You could build something, run various workloads, and ask, okay, what are the points at which my system topples over, at which these metastable failures emerge?
Yeah, that's actually a very good question, because one of the takeaways from our work is that the sustaining effect is the key component, the key property, of metastable failures. If we can eliminate the sustaining effect, that's going to be great. But many times we actually cannot eliminate it, because these effects naturally arise from common optimizations. Retries arise because we want to handle transient failures automatically, and for garbage collection, needless to say. So sometimes we just cannot eliminate the sustaining effects, and that's why we want to minimize them. For example, for retries, we want to see what the proper configuration is, how many retries we want to have. The more retries we allow, the more likely, in some scenarios, we are to overload the system; but if we retry less, we might not get the intended benefit of retrying at all. The same goes for garbage collection: we want to know which GC configuration really fits our scenario without introducing metastable failures. So that's what people can take away from our work about designing their systems.
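One common way to bound how much extra load retries can add, in the spirit of minimizing the sustaining effect as described above, is a client-side retry budget. This is a generic pattern, not something prescribed by the paper, and the class and parameter names here are made up:

```python
# Minimal sketch of a client-side retry budget (generic pattern, not from the paper).
# The budget caps retries to a fraction of recent successful requests, so a burst of
# timeouts cannot multiply the offered load by more than roughly (1 + ratio).

class RetryBudget:
    def __init__(self, ratio=0.1, reserve=10.0):
        self.ratio = ratio        # at most ~10% extra load from retries
        self.tokens = reserve     # small allowance so isolated failures can still retry

    def on_success(self):
        # Each successful response earns a fraction of a retry token.
        self.tokens = min(self.tokens + self.ratio, 100.0)

    def can_retry(self):
        # Spend one token per retry; refuse the retry once the budget is exhausted.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

budget = RetryBudget(ratio=0.1)
# On each successful response: budget.on_success()
# On a timeout: re-send the request only if budget.can_retry() returns True.
```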
Yeah. And then, after designing the system, people are operating their systems, and they want to really understand the vulnerability of their system. Because the other thing, for system operators, and, say, for this podcast, where we want to bridge the gap between academia and industry, one takeaway for people in industry is that the vulnerability of a system is impacted by two main things we found in our work. One is the system load. The system load determines vulnerability: the higher your load is, the more vulnerable you are, which means a less intense trigger is needed to push the system into a metastable failure state. So, fundamentally, there's a trade-off between efficiency and vulnerability: the more efficiency you want, the more vulnerability you have to bear as well. That's one key thing system practitioners should keep in mind. On the other hand, system configurations also impact vulnerability. Let's stick with the GC example: in our experience, the larger the memory, the lower the vulnerability, because larger memory relieves the garbage collection pressure to some degree. So, for the different types of metastable failures that can happen, you want to really know which system configurations are related to them, how each of those configurations impacts vulnerability, and then tune them in the right direction, in the right order, and in the right amount.
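To put a number on that load-versus-vulnerability trade-off, one crude back-of-the-envelope way to think about it is in terms of headroom: the smallest temporary capacity dip that could start a self-sustaining overload. This little helper is my own illustration, not a metric from the paper:

```python
# Back-of-the-envelope vulnerability check (my illustration, not a metric from the paper).
# Assumes a workload-amplification mechanism (e.g., one retry per request) that can
# multiply the offered load by `amplification` once the system becomes overloaded.

def min_tipping_dip(capacity: float, load: float, amplification: float = 2.0) -> float:
    """Smallest temporary capacity drop that could start a self-sustaining overload."""
    dip_to_overload = max(0.0, capacity - load)        # headroom before overload begins
    sustainable = load * amplification > capacity      # can the amplified load sustain it?
    return dip_to_overload if sustainable else float("inf")

print(min_tipping_dip(capacity=1000, load=700))   # 300.0 -> a 30% capacity dip can tip it
print(min_tipping_dip(capacity=1000, load=400))   # inf   -> one retry each cannot sustain overload
```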
Is the
framework publicly available?
Can a listener go and
clone the Git repo and start playing around with it?
Yeah, the replication examples are open-sourced, and we welcome people to give them a try, to see how metastable failures happen, and to propose solutions for them.
Yeah, I can definitely see this being integrated as another tool to complement people's pipelines.
For sure, yeah, definitely.
So I know you conclude your paper with a really
nice discussion section, and it'd be great if you could pull out the key findings and observations from that discussion at the end and tell the listeners about those.
Many of them, I would say, are details you can look at in the paper if you're interested. But one of the important things we want to highlight is fix-to-break: when you try to fix the system, as we discussed, sometimes you will break it further if you don't understand the reason it broke in the first place. Besides the incident we talked about at the beginning of the podcast, we also have some other similar incidents that went down that route, so take a look if you're interested.
We also think prevention and mitigation might be the way to go. Currently we don't think there are existing solutions, but we foresee that in the future, to prevent this from happening, the first thing we can do is detect and react to the trigger quickly enough to avoid the metastable failure. This is because the sustaining effect may not be immediate; it needs some time to be triggered, to start amplifying. And on the other hand, the sustaining effect also takes time to amplify the overload. If, after reducing the load back to the normal level, the system is no longer overloaded, then it's not a metastable failure. But if we didn't catch the sustaining effect fast enough, then it has already overloaded the system too much, and even after we lower the load it doesn't help; it's still overloaded. Then we were too slow. So: detect and react to triggers quickly. How to do that, probably automatically, is a question we want to give more thought to in the future.
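A very rough sketch of what detecting and reacting to a trigger quickly could look like in practice, as a generic overload guard of my own devising (the thresholds and the reaction are invented, not a mechanism from the paper):

```python
# Generic sketch of an early overload guard (thresholds and actions invented).
# React to the trigger before the sustaining effect (e.g., retries) has time to
# amplify the overload past the point where lowering the load no longer helps.

def overload_guard(samples, threshold=0.9, patience=2):
    """samples: (load, capacity) measurements, one per monitoring interval."""
    consecutive = 0
    for i, (load, capacity) in enumerate(samples):
        utilization = load / capacity
        consecutive = consecutive + 1 if utilization > threshold else 0
        if consecutive >= patience:
            print(f"interval {i}: utilization {utilization:.2f} -> shed load / pause retries")
            consecutive = 0                      # assume the reaction relieves the pressure
        elif consecutive > 0:
            print(f"interval {i}: utilization {utilization:.2f} (watching)")
        else:
            print(f"interval {i}: utilization {utilization:.2f} ok")

overload_guard([(700, 1000), (700, 600), (700, 600), (700, 1000)])
```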
Yeah, fascinating. Yes, how do you detect where the tipping point is, right? Because there will be a point at which you've entered that state and there's no way you can get out, and now you've got to, say, restart the system or do some extreme intervention to resolve the issue.
Yeah, I suppose it's about how much time we have in that window. We want to catch up with the load increase as fast as possible, to end the trigger before it pushes the system into metastable failure, before the overload goes too far.
Fantastic.
Yeah, so I mean, this next question
is something I ask all of my interviewees on the podcast.
And I'm sure there were a lot of things on this.
What were the most interesting and maybe unexpected lessons that you have learned while working on metastable failures?
And I imagine there's quite a few.
But yeah, just give us your highlights.
Yeah, one highlight is again fix-to-break. We didn't expect this, because when people try to recover from an incident, they usually go down the right route: if they're not making things better, at least they're not making things worse. But for metastable failures, because of the nature of the sustaining effect, if you are not aware of it, you might be amplifying it in some way. That was an unexpected thing we found during our research. The other thing we found, in general, is that replication is hard. How do you push a system into a metastable failure? From our previous research in the HotOS paper, we found that a system might be vulnerable, but the transition from vulnerable to metastable failure was unclear at that time. Later, from industry experience, we found that the higher the load you run at, the easier it seems for the system to go bad. So we realized, oh, there are actually varying degrees of vulnerability there. And if we want to push a system into a metastable failure state, we first want to make sure that the sustaining effect exists in the first place, and then we want to find the right point at which to tip it into the metastable failure state.
Yeah. When did this idea arise? When was the idea of metastable failures conceived? And, I guess, let's dig a little bit more into that journey, from the initial conception of, hey, this looks like something we can taxonomize, metastable failures, to the OSDI paper. What was that journey like? Were there things along the way that you tried that failed? What are the war stories, I guess, of that journey with metastable failures?
Yeah, yeah. So, actually, metastable failures already existed; we didn't create this type of failure. It already exists.
But you gave it a name.
Yeah, but people understood and analyzed them in very different forms, as we talked about: ad hoc analysis, ad hoc terminology, ad hoc recovery, all of these things. People wrote different blog posts on it and proposed different so-called lessons from it. But we found that we could actually generalize this type of incident under one mechanism, under one taxonomy, and analyze them. This originally came from Bronson, who was a software engineer at Facebook, and he noticed that, oh, these things happen. Then, after the HotOS paper was published, the idea occurred to many other engineers in other companies: oh, this thing really happens in my company as well, same problem. Let's take a deeper look at it and dig deeper into how to reason about metastable failures. And that's where I was also lucky to have the chance to do the internship at Twitter, and we found that the engineers there were also wondering: why is my system slow, and why does it get stuck being slow, for no obvious reason? Then we looked through the incident reports and all that sort of stuff, and in the end, combining the survey with that insider view of how metastable failures happen, we found there are multiple types of triggers and multiple types of amplifications, and they form a nice picture of how this really happens in the wild.
The work has already had some great real-world impact. I don't know if you are at liberty to disclose how it's being leveraged inside Twitter.
Yeah. So, impact-wise, one of the lessons we have learned is that not all metastable failures are catastrophic. There are also mild metastable failures, like the one we had at Twitter. We also wrote in the paper that, although this instance is a metastable failure, it doesn't really result in user-facing failures. Internally it raises some alarms that need to be fixed and could eventually lead to bad outages, but the engineers reacted quickly enough, so they were able to stop it before it became a very bad metastable failure. Which also confirms our claim that if you can catch a metastable failure in time, then you can mitigate it before it fully develops.
So where do you go next then
with metastable failures? What do you have planned for future
research with them?
So, for metastable failures in general, I think there are multiple ways to go. One thing, as we have mentioned, is to detect and react to triggers automatically. The other thing we also mentioned is to design systems to eliminate or minimize the sustaining effect if possible. And there are ways to do that: if you're doing networking sorts of things, you want to consider the slow path, not just the fast path, because the slow path sometimes leads to metastable failures. On the other hand, how to automatically understand the degree of vulnerability of the system, in order to control risk, is also something very interesting. Maybe it's not as researchy, but for people in industry it might be really helpful. People want to think about what system load they should run at and what capacity they should allocate to the system, and to measure those to determine their vulnerability. Load testing can help reveal issues, and adding capacity can help lower vulnerability. And on the other hand, system configurations also affect vulnerability: what are the relevant configs, and how do we control them to lower vulnerability? Those are also open questions.
Yeah, and in the lifetime of an incident there is also recovery. Once you are, unfortunately, in a metastable failure and you want to recover from it, there are multiple things you need to do. The first is to fix the trigger to prevent a recurrence, and there are multiple ways to do that: you can negate a load spike with load shedding, you can roll back hot deployments, and you can also hot-fix software bugs on the fly. After you have done all of this, you want to make sure you end the overload to break the sustaining-effect cycle: load shedding, increasing the capacity, changing policies to reduce the amplification factors, and so on. Currently, much of this is done by engineers: once there's an alarm, a detection that a metastable failure might be happening, you go in and find the right knobs to tune. But in the future, if we can make this more automatic, it's going to be even better; it will speed up the recovery process.
Definitely, yeah.
I look forward to all of your future research.
That sounds fantastic.
Cool.
So, are you working on any other projects in this area, or is it all related to metastable failures at the moment? Are there any other projects you're working on that the listener may be interested in?
Oh yeah, yeah. So metastable failures are only one of the projects I'm working on, and that one is about throughput-related failures, or throughput-related performance problems. Because I'm working on performance debugging of distributed systems, there's also another type of performance bug, which is latency bugs. I'm trying to profile and debug latency issues in distributed systems using distributed tracing. Maybe a little bit more background: currently, there are actually a lot of distributed system traces being generated. For example, at Meta, at Facebook, they generate billions of traces in a single day.
There are so many traces right there, and the current practice is that people look at them individually. Sometimes they use some basic aggregation metrics just to filter out some of the interesting traces, but in the end it boils down to people looking at individual traces and figuring out where the latency is long and where the optimization opportunities are. But we find that looking at individual traces fundamentally has the problem of being biased by each trace. That's why, in my research, I proposed a tool called tprof to aggregate the distributed system traces together, to expose more of an overview of the system and to see where the latency is high, as well as where you want to spend most of your debugging time so that you get the most improvement in end-to-end latency. So that's the type of thing I'm doing.
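To give a flavor of what aggregating traces, rather than eyeballing them one by one, looks like, here is a tiny sketch that groups spans by operation and ranks where the time goes. This is a generic illustration of the idea, not tprof's actual algorithm or data model:

```python
# Tiny sketch of trace aggregation (generic illustration, not tprof's implementation).
# Instead of reading traces one by one, group spans by (service, operation) and rank
# where the latency budget is actually spent across many requests.

from collections import defaultdict
from statistics import mean

# Hypothetical span records: (trace_id, service, operation, duration_ms)
spans = [
    ("t1", "frontend", "GET /home", 120.0),
    ("t1", "cache",    "lookup",      3.0),
    ("t1", "database", "query",      95.0),
    ("t2", "frontend", "GET /home", 140.0),
    ("t2", "cache",    "lookup",      2.5),
    ("t2", "database", "query",     110.0),
]

groups = defaultdict(list)
for _, service, operation, duration in spans:
    groups[(service, operation)].append(duration)

# Rank operations by total time: the biggest bucket is usually where debugging effort
# pays off most for end-to-end latency.
for (service, op), durations in sorted(groups.items(), key=lambda kv: -sum(kv[1])):
    print(f"{service:9s} {op:10s} count={len(durations)}  "
          f"total={sum(durations):7.1f}ms  avg={mean(durations):6.1f}ms")
```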
And on the other hand, I'm also interested in improving the efficiency of distributed systems, or current cloud systems, because performance and efficiency always go hand in hand: you can always throw in more machines to improve performance, but your utilization might suffer. So, to help broaden my horizons in that area, I was also collaborating and doing internships at Microsoft Research, working on utilizing workload characteristics to try to improve Azure cloud utilization.
Amazing. That's great.
This has been a recent interest of mine: trying to understand how people approach idea generation in this area. How do you decide what to work on? What's your process for that? Because you work on some amazing, really interesting, cool stuff. How do you arrive at the point of saying, this is what I want to do, this is really cool?
Yeah, so I would say different people have different approaches to this, but for me I would say it's trial and error. For example, for the distributed tracing project I was doing, I was interested in finding performance issues. Then I was thinking about the state of the art: oh, people are using distributed tracing, which provides fine-grained details of where the latencies are. So I started using that, and after using it I said, oh, this works well, I can find something. But there's also a lot of complexity getting in the way: there are just so many traces. Which one should I look at? Which one can I believe? We need an overview of all of this, and that's where the aggregation comes in. Then I started trying different aggregation methods. Each of them has pros and cons, they have constraints, but in the end I tried to sort them out and provide multiple levels of granularity of aggregation. And I found, oh, it works well: I could use the tool I designed to find many interesting performance bugs, and in the end it was published at the ACM Symposium on Cloud Computing. And yeah, that's how a paper was generated.
Okay, yeah, yeah.
We'll put a link to that paper in the show notes
if anyone wants to go and check it out.
Thanks, thanks.
I mean, yeah, that's that paper. But for this paper it was also a little bit different, because after doing that I found, oh, I could use the open-source benchmarks and try to find performance issues with them. But, you know, for distributed systems in the open-source world, you don't have a truly large-scale distributed system, and you always wonder how you can do more impactful research. That's why I tried to intern at big companies, either with an on-prem cloud or with a big public cloud. At Twitter, I was able to hear engineers' opinions about some of the papers they had found, for example the HotOS paper about metastable failures. It resonated with them: we always wonder about the same problem, why the throughput is just low, so let's dig into it and see whether some of the incidents that happen at Twitter are metastable failures. Well, I would say, adding to your previous question, many of the incidents at Twitter were not metastable failures; they were just regular overloading issues, and once the trigger was fixed, the system wasn't overloaded anymore. Engineers still had to apply some recovery method to prevent the overloading event from happening again in the future, but those weren't metastable failures. Metastable failures are rare and persistent; that's why it was hard. But I was lucky, I would say: at Twitter I was able to find one instance, and I was able to find well-documented data that I could analyze and show in the paper to the public. So that's how that helps. For grad students, I would say: if you really want to find interesting problems, you can go to industry and see; sometimes things pop up, and if you're interested, there can be collaborations.
Amazing. Yeah, it's really nice to see the breadth of everyone's different approaches to answering this question, and that's another really fascinating answer to it as well.
Brilliant.
So I've got just two more questions now.
Yeah, yeah.
And the first one is,
what do you think is the biggest challenge now
in your research area?
So in distributed systems, in metastable failures
and all the other cool things you work on,
what do you think is the biggest challenge
that's facing us now?
Yeah, I think I've mentioned it a little bit already. Because I'm working on performance debugging, I've found that there are still tons of human effort spent on performance debugging. In large companies, like Google, like Facebook, like Meta, all these large companies, and even smaller companies, people hire performance engineers or capacity engineers specifically to deal with performance problems. The tools they're currently using are good, but we can do even more than that, and I think one of the keys is to do more automation. For example, for the metastable failure incidents, if we can somehow find an automatic way to detect them quickly enough, that can help prevent them from happening in the first place. And also, when the system is already in a metastable failure state, how to automatically recover from it: maybe we can auto-tune the configuration somehow to get out of the metastable failure state. That's another thing we can do.
And in general, for debugging performance issues, there are different stages. You first detect that there's a problem, then you try to diagnose it to find its root cause, and after that you try to recover from it and prevent it from happening in the future. In the detection stage, I think there have recently been multiple papers on this, which is good; people are heading in the direction of more automation there. But for diagnosis, I think there's still a lot we can do. Especially if you go to industry, you will find there is a lot of data right there, like telemetry, all these tables, and you want to join this table with that table and see what really happens. But there's just too much data and too few people.
Right, okay. Too much data, not enough people.
Yeah, yeah. And there's a big opportunity for people to sit down and build tools to help automatically analyze that telemetry, to find the signals in the haystack, to give signals like, for example, oh, a metastable failure is going to happen, let's catch it; or, a metastable failure has already happened, so a human doesn't need to look at it. And the recovery phase can be done with some automation as well. If anything, at an early stage it can generate some incident report for the performance engineers to help them better look for root causes, or provide suggestions to them for doing the recovery, before going into a fully automatic mode.
Yeah, we should definitely get that on a t-shirt: too much data, not enough people, right?
Maybe I should make one.
Yeah, that's awesome. All right, so time for the last word now. What's the one key thing you want the listener to take away from your research and from this podcast today?
Yeah, so I would say we have talked about a lot of things in the big picture, but specifically for metastable failures, the takeaway is that they're really prevalent and they can cause major outages; that's why they're important. And as for how to fix them: understand the sustaining effect first, and then understand the degree of vulnerability, to prevent this thing from happening in the future.
Amazing. Let's end it there. Thanks so much for coming on the show, it's been a great conversation. If the listeners are interested in knowing more about Lexiang's work, we'll put links to all of the relevant materials in the show notes, and we will see you next time for some more awesome computer science research.
Thanks, Jack. Thanks, everybody, for listening.