Disseminate: The Computer Science Research Podcast - Lexiang Huang | Metastable Failures in the Wild | #17

Episode Date: January 9, 2023

Summary: In this episode Lexiang Huang talks about a framework for understanding a class of failures in distributed systems called metastable failures. Lexiang tells us about his study on the prevalence of such failures in the wild and how he and his colleagues scoured publicly available incident reports from many organizations, ranging from hyperscalers to small companies. Listen to the episode to find out about his main findings and gain a deeper understanding of metastable failures and how you can identify, prevent, and mitigate against them!

Links: OSDI paper and talk, personal website, Twitter, LinkedIn.

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Hello and welcome to Disseminate, a computer science research podcast. I'm your host, Jack Waudby. I'm delighted to say I'm joined today by Lexiang Huang, who will be talking about his OSDI '22 paper, Metastable Failures in the Wild. Lexiang is a PhD student at Pennsylvania State University, and his research focuses on performance debugging for complex distributed systems; he's actively collaborating with large-scale companies to debug performance issues and improve cloud efficiency. Lexiang, welcome to the show. Hi Jack, thanks for having me. Let's dive straight in. So can you tell us how you became interested in research, how you became a PhD student, and specifically how you got interested in distributed systems?

Oh, yeah, that's a very good question. So, for doing research, well, I like to explore the fundamental ideas behind phenomena. Take metastable failures: before we proposed this idea, there were many forms of it; people called it persistent overload, death spirals, etc. But what's fundamentally similar among those things is that they are something we can dig deeper into, to provide a general framework for understanding them and solving them. So that's why I'm interested in research. And you were asking about distributed systems as well, yeah. Distributed systems are popular, but also, I want faster computer systems in general. And I'm interested in distributed systems because there are many hard problems, of course, but also because distributed systems run at scale. So, like you say, improving the performance of a system even by a little bit can have a very large impact. That's why I like to do research in this area.

Fantastic, yeah, for sure, there are a lot of interesting problems in distributed systems, right? Yeah, right. So you mentioned them earlier on, and the star of the show today, I guess, is metastable failures. So can you tell the listeners what
metastable failures are and maybe give us some illustrative examples? I know you mentioned a few before, but let's dig into that a little bit more. Oh, yeah, yeah. So, metastable failures. Well, speaking of that, you must have heard of retry storms, right? So, a retry storm: let's give an example. Suppose you have a database with an original capacity of 1,000 requests per second, and you're running it at a load of 700 requests per second. It's running fine, but all of a sudden there is a temporary trigger, which introduces background interference, such as limping hardware or garbage collection, that decreases your capacity temporarily, say, to 600 requests per second. And now, because your capacity is a little bit lower than the load, you're overloaded. Then the application starts to time out, and the retries start. And let's say, at maximum, each of the requests will retry one time; so in the end, your load can increase to 1,400 requests per second. At that point, even after you remove the trigger and your capacity has recovered to 1,000 requests per second, your system is still overloaded, and it can never get out of that without human intervention. We call this permanent overload. So, to summarize, a metastable failure is a case where there's a permanent overload even after the trigger is removed.

I know a lot of your prior work, and this paper as well, has been building up a framework to reason about this class of failures, these metastable failures. So can you tell us more about this framework that you've developed, and also about the various properties of metastable failures? Yeah, so we definitely need models to help us understand why this happens. But I would say one of the distinguishing properties of metastable failures is that there's a sustaining effect that keeps the system in the metastable failure state even after the overloading trigger is removed. For example, in the retry storm example, the retries themselves serve as the sustaining effect even after the trigger is removed: because the retries have already pushed the load over the capacity, retries lead to more retries and keep the system stuck. That is what separates metastable failures from other types of failures.
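To make that concrete, here is a minimal, illustrative Python sketch (my own toy model, not code from the paper) of the numbers Lexiang walks through: 1,000 requests per second of capacity, 700 of load, a trigger that temporarily cuts capacity to 600, and a retry budget of one per request. The queueing and timeout mechanics are simplifying assumptions made just to show the sustaining effect.

```python
def retry_storm(base_load=700, capacity=1000, timeout=0.5,
                trigger=(5, 10, 600), steps=20):
    """Toy discrete-time retry storm.

    Work queues up when offered load exceeds capacity.  If the queueing
    delay exceeds the client timeout, clients assume failure and retry
    once, so offered load approaches base_load * 2 = 1,400 rps -- above
    the restored capacity of 1,000 rps, which is the permanent overload.
    """
    backlog = 0.0   # queued work, in requests
    retries = 0
    for t in range(steps):
        cap = trigger[2] if trigger[0] <= t < trigger[1] else capacity
        offered = base_load + retries
        backlog = max(0.0, backlog + offered - cap)  # grows while overloaded
        delay = backlog / cap                        # rough queueing delay (s)
        # First-attempt requests whose responses arrive after the timeout
        # are retried once in the next step (retry budget of one).
        retries = base_load if delay > timeout else 0
        print(f"t={t:2d} cap={cap:4d} offered={offered:4d} "
              f"backlog={backlog:6.0f} delay={delay:5.2f}s")

retry_storm()
```

In this toy run the backlog keeps growing even after the trigger window ends, because the retried load of 1,400 rps exceeds the restored capacity; shedding load or clearing the backlog (a reboot) is what breaks the cycle.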
So, speaking of how a normal system transitions to a metastable failure state: suppose a system is stable and everything is working fine. Once there's a load increase or a capacity decrease, that can render the system vulnerable. Vulnerable is a state where the system is running fine for now; there's nothing wrong with it, there's no problem. But once there's an overloading trigger that overloads the system, it can push the system from the vulnerable state into the metastable failure state. And then, because there is a sustaining effect, as we have been talking about, that keeps the system stuck in the metastable failure state, it is not until people take some drastic measures to recover the system, such as rebooting it or reducing its load to a very low level, that the system can recover from the metastable failure. Oh yeah, so it almost needs some human intervention to go in there and resolve the issue. Yeah, and that is most of the cases we have observed so far.
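A compact way to write down the transitions Lexiang has just described, stable to vulnerable to metastable, is as a small state machine. This is my own encoding of the idea for illustration, not a formalism from the paper:

```python
from enum import Enum, auto

class State(Enum):
    STABLE = auto()      # load comfortably below capacity
    VULNERABLE = auto()  # still healthy, but a trigger could tip it over
    METASTABLE = auto()  # overloaded, and kept there by a sustaining effect

# Transitions as described above (hypothetical event names).
TRANSITIONS = {
    (State.STABLE, "load increase / capacity decrease"): State.VULNERABLE,
    (State.VULNERABLE, "load decrease / capacity increase"): State.STABLE,
    (State.VULNERABLE, "overloading trigger"): State.METASTABLE,
    # Removing the trigger is deliberately NOT a way out of METASTABLE:
    # the sustaining effect keeps the system stuck until a drastic action.
    (State.METASTABLE, "reboot / drastic load shedding"): State.STABLE,
}

def step(state, event):
    """Return the next state; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)

s = State.STABLE
for event in ["load increase / capacity decrease", "overloading trigger",
              "trigger removed",                    # no effect: still stuck
              "reboot / drastic load shedding"]:
    s = step(s, event)
    print(f"{event:38} -> {s.name}")
```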
Cool. So in this paper, what was the main research goal, and what were you trying to achieve? Yeah. So first, we want to raise people's attention to this type of failure. Because by nature, metastable failures are rare; they don't happen that frequently. But once they happen, from our study, they can lead to catastrophic results. From our survey, we find that when a metastable failure happens, it most commonly leads to major outages of four to ten hours. And we found that sometimes people didn't understand the reason behind metastable failures correctly, so they tried recovery methods that even amplified the failure. That's why we want to propose the idea of metastable failures and provide a framework to understand them. And then, after that, we can think about what the solutions to these problems are and how we can resolve them. So those are our main research goals.

Fantastic. So you mentioned it a second ago: there's an amazing study in the paper of various different metastable failures in the wild. How did you go about tackling that research goal? How did you design this study? What was your approach? So, are you asking about how we define metastability, or more about how we designed the study? More how you designed the study, and how you approached the question of, okay, we want to get a better understanding of metastable failures. Yeah, so first of all, when studying a problem, we want to make sure it really exists in the wild. That's why the first thing we designed is a survey: we wanted to study the prevalence of metastable failures in the wild. How we did that is we studied public post-mortem incident reports. We started with those because usually, when a company writes a public incident report, that incident is already big enough to warrant public awareness. So we started by analyzing 100 to 1,000 public incident reports. We also tried to cover a diversity of companies to see whether these failures exist across them. For example, we looked through reports from large cloud infrastructure providers like AWS, Microsoft Azure, and Google Cloud, as well as some smaller companies and projects, like Apache Cassandra, etc. Yeah, so that's how we designed it.

Fantastic. Cool. So can you summarize for the listeners what the findings of this study were? Yeah, well, there are definitely a lot of findings. Well, let's dive in, let's go through them all. You can look at the paper for all the details, but I would say the most important is that metastable failures really can be catastrophic. From our study, we found that at least four out of 15 major outages in the last decade at AWS were due to metastable failures.
And people have diagnosed this type of failure under different names in a very ad hoc fashion, like persistent congestion, persistent overload, retry storms, death spirals, etc. And people have done ad hoc recoveries as well, like load shedding, rebooting, adding more resources, or even tweaking configurations. So our insight from the study was that these different-looking failures can be characterized under one taxonomy. And based on that, we tried to identify metastable failures, as we talked about, in the post-mortem incident reports, and we were able to identify 21 metastable failures among them, ranging from large companies to small companies and projects. And the most important thing we found, as we talked about, is that they can cause major outages, most commonly of four to ten hours, and that incorrect handling really leads to future incidents. I can give you an example of this. In one of the incidents we found at Spotify, there was a metastable failure that happened due to retries; it was a retry storm. And then the engineers at Spotify wanted to identify why the retries were happening, so they added even more, quite excessive, logging to the retries. But because of the excessive logging, the penalty of each retry became even higher, which exacerbated the metastability and led to further incidents. So, yeah, this shows that without properly understanding metastable failures and the mechanisms behind them, you can actually end up with the wrong error handling or recovery methods. So all of this makes metastable failures an important class of failures to study. For sure, yeah, definitely.
So, of the metastable failures you identified across these different companies, what was the diversity? Once you had developed the taxonomy and the framework, did you find that certain things happened with a lot more frequency than other types of metastable failures? Well, yeah. So I guess, to put it another way, in this study we actually want to convince people that, besides the common patterns of metastable failures, there are many other types of metastable failures that have the same characteristic, which is permanent overload even after the trigger is removed. The most common one, as you've been asking, is due to retries, or retry storms; that's a classic example. But besides that, we also find that a load spike can be the trigger, and, on the other hand, a capacity-decreasing event can be another trigger, because what really matters is the ratio between the load and the capacity. When your load is below the capacity, everything is working fine. But when your load is above the capacity, and there's a sustaining effect, an amplifying mechanism sitting right there, it can keep your system stuck. Yeah, so on the other dimension, besides the workload amplification that exists in the retry storm sort of case, there's also capacity degradation amplification, which exists in many other scenarios as well.

Fantastic, that's fascinating. I know you've been working on this line of work for a while, and this paper introduced three new extensions to the framework for reasoning about metastable failures. Can you tell us a little bit about what these additions were and why they were made? So, you were talking about the different types of metastable failures, or...? Yeah, so in the paper you talk about how you took
the original framework, the existing framework you had, and then made some additions to it. I just wanted to get a description of what the key changes were over the initial framework and why you made those specific changes. Yeah, so the initial framework and our framework don't conflict with each other; actually, that paper was also our paper. We just present a deeper understanding of the metastable failure framework. So, from the beginning, as we have been talking about, how the system transitions from stable to vulnerable to metastable failure is still right there. But what we find is that a load increase is not the only cause: symmetrically, a capacity decrease can also render the system vulnerable, because it too changes the ratio between the load and the capacity. And that ratio is the key factor in whether metastable failures can happen.
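One way to see why that ratio (together with the amplification) captures how close a system is to trouble is a quick back-of-the-envelope calculation. This is my own illustration, reusing the one-retry amplification assumption from the retry storm sketch above, not a formula from the paper:

```python
def headroom_to_tip(load, capacity, retry_factor=1.0):
    """Toy estimate of how small a capacity dip could tip the system.

    Worst case under the one-retry assumption: once mass timeouts start,
    offered load approaches load * (1 + retry_factor).  If that amplified
    load still fits under the restored capacity, the system digs itself
    out; otherwise any trigger that erases the headroom (capacity - load)
    for long enough can tip it into a metastable failure.
    """
    if load * (1 + retry_factor) <= capacity:
        return None                     # not vulnerable even to full retries
    return capacity - load              # the margin a trigger must erase

for load in (400, 600, 700, 900):
    margin = headroom_to_tip(load, capacity=1000)
    print(f"load={load:4d} rps: "
          + ("not vulnerable" if margin is None
             else f"vulnerable -- a dip of ~{margin} rps can tip it"))
```

The higher the load, the smaller the dip needed, which is exactly the degrees-of-vulnerability point that comes next.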
We also find that the vulnerable state is not a binary state. You cannot just say a system is vulnerable or not vulnerable; systems actually have different degrees of vulnerability. That's another thing we find exists in practice. And people in industry might want to pay attention to how vulnerable their system is, and keep that in mind, so that they can better maintain it and run it without running into metastability or other issues. So that's another thing. And the other thing we try to provide in this OSDI paper is real examples of metastable failures. Even in our previous HotOS paper, we said that replicating metastable failures is very hard, and we gave some ideas and simplified incidents of how it could happen, but we didn't provide any real data on that. In this paper, we have more space, so we want to provide more examples, as well as release the code for people to really test it out and see, oh, this is how a metastable failure happens. And if they can come up with any solutions, they are very welcome to; you can just test them on our testbed.

Fascinating. Something that comes to mind is that obviously you have this framework, and you can taxonomize all of the existing failures you've observed in the wild under it. Are there any failures that are possible under the framework that you did not observe in any real system? Can it almost act as a warning sign, like, be careful, don't do this because this bad thing could happen? If that question makes any sense. Yeah. So we definitely think about the completeness of the model, although we didn't prove it, because it's very hard to prove. But we came up with the scenarios, the taxonomy, by factoring out the basic components that really matter, which are also what people in industry really measure about a system: one is the load, the other is the capacity, and how the relationship between those two metrics changes over time. That's how we think about how a system could possibly run into metastable failures.
And in our model, or taxonomy, we factored in the different combinations of two types of triggers and two types of amplification mechanisms, although there could be multiple triggers and multiple amplifications, so there could be superpositions of those four scenarios as well.

Okay, cool. So how do you go about untangling things if you've got multiple of these happening at the same time? How do you untangle which is the root cause, I guess? I mean, they're all problems, right? Yeah, so definitely there can be a lot of misleading signals, you know. Especially when I was working at Twitter, I was working on trying to find the metastable failures right there. There are many things that can happen: there's a load spike, and there's capacity decreasing, maybe due to garbage collection, maybe due to some hanging node that's in an unhealthy state, something like that. And when there's some amplification we want to find, there are different signals for that as well. So, to characterize this, definitely multiple things can happen together, but we want to provide a simple taxonomy for people to understand, and that's why we want to decouple them. Although in real incidents multiple things can happen at the same time, for this paper we want to demonstrate the fundamentals, so when replicating those incidents or scenarios, we introduce one trigger at a time, as well as one amplification mechanism at a time. And in this paper, in total, we actually introduced three real examples, plus the retry storm, which is well known everywhere, and together they cover the four metastability scenarios.

Okay, cool. You mentioned working at Twitter a second ago, which is a nice segue: you also mentioned the metastable failure you identified there, but let's pull on that thread a little bit more, because it's a fascinating case study that you go through in the paper. Can we go through it in a little bit more depth? Can you tell us what the failure was, the process of identifying it, and how you went about figuring out that this was the problem?

Yeah, so that's a very good question, and I would like to share. When I was working at Twitter, we were wondering why, after rigorous testing, the system could sometimes still get stuck in a very low-throughput state; that's what the engineers were always wondering, and so did I. And then, looking through some of the incident reports (they're not public; they're internal to the company), we found there was something that seemed to be keeping the system in a very low-throughput state, with some amplification mechanism, and we tried to find out why. In one particular incident, we found that on that day there was a peak load test, and the initial load spike from the peak load test
introduced a high queue length to the system, because of the jobs arriving and the memory allocation involved. And because of this memory allocation, there are more active objects for the GC to process (at Twitter we use a language that has garbage collection), and there's also higher memory pressure, which causes more garbage collection cycles. So this pushes GC activity up. But, as we all know, garbage collection takes CPU, it consumes CPU resources, it consumes memory resources, and the GC causes the application to pause and slow down, naturally, because it's in contention with the normal requests; it's consuming resources. And because of this, the jobs are slowed down, and because the jobs are slowing down, the queue length gets even higher. This completes a loop, you see: the load spike introduces a high queue length, the high queue length introduces more GC activity, the GC activity slows the jobs down, and that feeds back into an even higher queue length. So this forms the mechanism of capacity degradation amplification, and that's how we found the sustaining effect at Twitter specifically. So, to summarize, the sustaining effect is the contention between the arriving traffic and the GC consuming resources.

After that, we also tried to see what engineers do about it, right? People have done aggressive load shedding, you know; they have tried to decrease the requests per second by a lot. And in our own replication, we tried to decrease the load by 30%, but still the GC doesn't come down. It doesn't help lower the GC, so the system gets stuck right there: the GC is high, the queue length is high, and even after reducing the load, it doesn't get out of it. At Twitter, we found that only after rebooting all the machines and instances could we eventually recover from it. So that's a metastable failure, and you can see that this metastable failure is not really caused by retries; it's caused by garbage collection consuming CPU resources, in contention with normal request serving.
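A toy model of that feedback loop (my own construction with made-up numbers, not Twitter's actual system or the paper's artifact) shows the capacity degradation amplification: a bigger queue means more GC, GC steals CPU, so capacity drops and the queue grows further.

```python
def gc_amplification(base_load=800, base_capacity=1000, spike=(5, 8, 1600),
                     gc_per_queued=0.0004, steps=20):
    """Toy capacity-degradation loop driven by garbage collection.

    We assume (simplistically) that the share of CPU spent on GC grows
    with the queue length, and that whatever CPU goes to GC is no longer
    available for serving requests.
    """
    queue = 0.0
    for t in range(steps):
        load = spike[2] if spike[0] <= t < spike[1] else base_load
        gc_fraction = min(0.9, gc_per_queued * queue)   # more backlog -> more GC
        capacity = base_capacity * (1.0 - gc_fraction)  # GC contends for CPU
        served = min(queue + load, capacity)
        queue = max(0.0, queue + load - served)
        print(f"t={t:2d} load={load:5.0f} gc={gc_fraction:4.0%} "
              f"capacity={capacity:6.0f} queue={queue:7.0f}")

gc_amplification()
```

In this toy run the queue keeps growing after the spike ends, and even cutting the post-spike load by 30% leaves it above the GC-degraded capacity, echoing the replication Lexiang describes; only clearing the queue (effectively a reboot) resets it.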
My next question is about something you've mentioned throughout the chat so far: replicating metastable failures. Yes. Obviously it's a very difficult thing, and you mentioned that in your initial HotOS paper it wasn't really there yet; you found it really hard to do. I guess the first thing I want to ask is, with your replication of these metastable failures, what were the questions you were trying to answer with this set of experiments?

Yeah, so we do have different types of experiments in our paper, and, as we have discussed, each of them represents a different type of metastable failure. First of all, we want to do a proof of concept: metastable failures are not only one type. It's not just the retry storm; they can form differently, with different types of triggers and different types of amplification mechanisms. So each of the experiments has a different purpose. For example, for the garbage collection, as we talked about in the Twitter case, we tried to replicate it in a local environment just to make sure: is it a metastable failure? Yes, it is. And we also confirmed that the trigger was the load spike, and the amplification, the sustaining effect, was the capacity degradation. But there's also another type of trigger that can render the system vulnerable and eventually metastable, which is a capacity-decreasing trigger. For example, we introduced a slowdown to a replicated state machine, and the slowdown serves as a capacity-decreasing trigger; eventually it slows down the system, the application starts to time out, and the retries start. That's the main message from that experiment.

We also built a three-tier cache example. In that system, we have the web server, we have a look-aside cache, as well as a back-end database system. At some point, there's a cache hit rate drop in the look-aside cache. And then, because the cache cannot serve many of the requests, the requests fall back to the back-end database system. And because the database was not designed to handle this many requests, it gets overloaded, and many requests start to time out. Because the application times out, it cannot really refill the cache; and because it doesn't refill the cache, the cache hit rate doesn't really recover. So that's where the capacity degradation persists and the system never gets out of it. So that's another type of metastable failure, where the capacity-decreasing event is the trigger, and the capacity degradation amplification serves as the sustaining effect.
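Here is a similarly minimal sketch of that cache scenario (my own toy with invented numbers, not the paper's testbed code): once the hit rate drops, misses overwhelm the database, the resulting timeouts mean nothing is written back to the cache, and the hit rate can never climb back on its own.

```python
def cache_collapse(load=10_000, db_capacity=3_000, normal_hit_rate=0.9,
                   drop_at=5, steps=15):
    """Toy look-aside-cache metastable failure.

    Misses fall through to the database.  When misses exceed what the
    database can serve, requests time out before completing, so results
    are never written back into the cache and the hit rate stays low.
    """
    hit_rate = normal_hit_rate
    for t in range(steps):
        if t == drop_at:
            hit_rate = 0.0                   # trigger: sudden hit-rate drop
        misses = load * (1.0 - hit_rate)
        if misses > db_capacity:
            refill = 0.0                     # overloaded DB -> timeouts -> no refill
        else:
            refill = misses / load           # served misses repopulate the cache
        hit_rate = min(normal_hit_rate, hit_rate + refill)
        print(f"t={t:2d} hit_rate={hit_rate:5.1%} misses={misses:7.0f} "
              f"db_capacity={db_capacity}")

cache_collapse()
```

Getting out requires shedding enough load (or warming the cache out of band) that the misses fit within the database's capacity again, which matches the capacity-degradation-amplification story.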
Interesting. So, given the current state of the framework you use for reproducing these metastable failures, how easy is it to add new ones? Are there various components that are composable, so you can say, oh, I want to try this and try that? Or are they very independent, so you have to construct each one by hand? How easy is it to make new ones?

Well, I would say that because even in real life metastable failures are rare, if you want to reproduce one it's definitely hard, and we have said that in both the HotOS paper and the OSDI paper. Reproducing it is hard because you need to understand where the boundary is, the boundary between vulnerability and metastability, and it changes according to many other things. If your system is at a higher load, then your system is more vulnerable, so a less intense trigger can push your system into a metastable failure state; while if your system is running at a lower load, then it's harder.

A question that comes to mind is that, at the moment, the direction of flow is kind of: hey, we've identified some problem, and we want to go and reproduce it, right? So we use the reproduction framework for that. Would you be able to reverse that and say, when I'm designing some system, could we use this framework to inform our design decisions? You know, you could build something, run various workloads, and ask, okay, what are the points at which my system topples over, where these metastable failures emerge?
Yeah, so that's actually a very good question, because one of the takeaways from our work is that the sustaining effect is the key component, the key property, of metastable failures. If we can eliminate the sustaining effect, that's going to be great. But many times we actually cannot eliminate the sustaining effect, because these effects naturally arise from common optimizations. The retries arise because we want to handle transient failures automatically, and for garbage collection, needless to say. So sometimes we just cannot eliminate the sustaining effects, and that's why we want to minimize them. For example, for retries, we want to work out the proper configuration: how many retries to allow. The more retries we allow, in some scenarios, the more likely we are to overload the system; but if we retry less, we might not get the intended benefit of retries. Same with garbage collection: we want to know which garbage collection configuration really fits my scenario without introducing metastable failures. So that's one thing people can take away from our work about designing their systems.

Then, after designing the system, when people are operating their systems, they want to really understand the vulnerability of their system. Because the other thing, say for this podcast, where we want to bridge the gap between academia and industry, is that one takeaway for industry people is that the vulnerability of a system is impacted by two main things we find in our work. One is the system load: the system load determines vulnerability. The higher your load is, the more vulnerable you are; that means the less intense a trigger needs to be to push the system into a metastable failure state. So, fundamentally, there's a trade-off between efficiency and vulnerability: the more efficiency you want, the more vulnerability you have to bear as well. That's one key thing system practitioners want to keep in mind. On the other hand, system configurations also impact vulnerability. Let's stick with the GC example: the larger the memory we have, in our experience, the lower the vulnerability, because a larger memory relieves the garbage collection pressure to some degree. So, for the different types of metastable failures that happen, you want to really know which system configurations are related to them, and you want to know how each of those configurations impacts the vulnerability, and then tune them in the right direction, in the right order, and in the right amount.

Is the framework publicly available? Can a listener go and clone the Git repo and start playing around with it? Yeah, the replication examples are open-sourced, and we welcome people to give them a try, to see how metastable failures happen, and perhaps propose solutions to them.
Yeah, I can definitely see this being integrated as another tool to complement people's pipelines. For sure, yeah, definitely.

So I know you conclude your paper with a really nice discussion section, and it'd be great if you could pull out the key findings and observations from that discussion at the end and tell the listeners about them. There are many of them; those details you can look at in the paper if you're interested. But I would say one of the important things we want to get across is fixes that break: when you try to fix the system, as we discussed, sometimes you'll break it further if you don't understand the reason it broke in the first place. Besides the incident we talked about at the beginning of the podcast, we also have some other similar incidents that went that route as well, so give it a look if you're interested.

We also think prevention and mitigation might be the way to go. Although currently we don't think there are existing solutions, we foresee that in the future, to prevent this from happening, one of the first things we can do is detect and react to the trigger quickly enough to avoid metastable failures. This is because the sustaining effect may not be immediate; it needs some time to be triggered, to start amplifying. And on the other hand, the sustaining effect also takes time to amplify the overload. If, after reducing the load back to the normal level, the overload is gone and the system is not overloaded, then it's not a metastable failure. But if we didn't catch the sustaining effect fast enough, then it has already overloaded the system too much; then, even after we lower the load, it doesn't help, it's still overloaded; then we were too slow. So, detect and react to triggers quickly. How to do that? We probably can't do it automatically yet; that's a question we want to give more thought to in the future.

Fascinating, yes: how do you detect where the tipping point is, right? Because there will be a point at which you've entered that state and there's no way you can get out, and now you've got to either restart the system or do some extreme intervention to resolve the issue. Yeah, I suppose it's about how much time we have in that window; we want to catch up with the load increase as fast as possible, to make sure we end the trigger before it pushes the system into metastable failure, before it overloads it too much.

Fantastic. Yeah, so this next question is something I ask all of my interviewees on the podcast, and I'm sure there were a lot of things here: what were the most interesting and maybe unexpected lessons that you learned while working on metastable failures? I imagine there are quite a few, but just give us your highlights.
Yeah, the highlight is also the fixes that break. We didn't expect this to happen, because when people try to recover from an incident, they usually go down the right route: if they're not making things better, at least they're not making things worse. But for metastable failures, because of the nature of the sustaining effect, if you are not aware of it, you might be amplifying it in some way. So that was an unexpected thing we found during our research.

The other thing we found, in general, is that replication is hard. How do you push the system into a metastable failure? From our previous research in the HotOS paper, we found that the system might be vulnerable, but the transition from vulnerable to metastable was unclear at that time. Later we found, from industry experience, that the higher the load you run at, the easier it seems for the system to go bad. So later we realized, oh, there's actually a varying degree of vulnerability there. And if we want to push a system into a metastable failure state, we first want to make sure that the sustaining effect exists in the first place, and then find the right point at which to tip it into the metastable failure state.

When did this idea arise? When was the idea of metastable failures conceived? And, I guess, let's dig a little bit more into that journey, from the initial conception of, hey, this thing looks like we can taxonomize it, metastable failures, to the OSDI paper. What was that journey like? Were there things along the way that you tried that failed? What are the war stories, I guess, of that journey with metastable failures? Yeah. So actually, metastable failures already existed; we didn't create
this type of failure, you know, it already existed. Right, but you gave it a name. But people analyzed these failures in very different forms, as we talked about: ad hoc analysis, ad hoc terminology, ad hoc recovery, all of these things. People wrote different blog posts on them and proposed different so-called lessons from them. But we found that we can actually generalize this type of incident under one mechanism, under one taxonomy, and analyze them. And this originally came from Bronson; he was a software engineer at Facebook, and he noticed that, oh, these things happen. And then, after the HotOS paper was published, the idea resonated with many other engineers at other companies: oh, this thing really happens in my company as well, same problem. Let's take a deeper look at it, let's dig deeper into how to reason about metastable failures. And that's where I was also lucky to have the chance to do the internship at Twitter, and we found that the engineers there were also wondering, why is my system slow, why does it get stuck being slow, for no obvious reason? And then we looked through the incident reports and all that sort of stuff. In the end, we found that, combining the survey as well as the insider view of how metastable failures happen, there are multiple types of triggers and multiple types of amplification, and they form a nice picture of how this really happens in the wild.

The work has already had some great real-world impact. I don't know if you are limited in what you can disclose about how it's being leveraged inside Twitter.
Yeah, so impact-wise, one of the lessons we have learned is that not all metastable failures are catastrophic; there are still mild metastable failures, like the one we had at Twitter. We also wrote in the paper that, although this incident is a metastable failure, it didn't really result in user-facing failures. Internally, it raised some alarms that needed to be fixed and that might eventually have led to bad outages, but the engineers reacted quickly enough, so they were able to stop it before it became a very bad metastable failure. Which also confirms our claim that if you can catch up with a metastable failure in time, then you can mitigate it before it fully happens.

So where do you go next with metastable failures? What do you have planned for future research on them? For metastable failures in general, I think there are multiple directions to go. One thing,
as we have mentioned, is to detect and react to triggers automatically. The other thing we also mentioned is to design systems to eliminate or minimize the sustaining effect if possible. And there are ways to do that; like, if you're doing network sorts of things, you want to consider the slow path, not just the fast path, because the slow path sometimes leads to metastable failures. And, on the other hand, how to automatically understand the degree of vulnerability of the system, to control risk, is also something that's very interesting. Maybe it's not as researchy, but for people in industry, that might be really helpful. People want to think about what system load you should run and what capacity you should allocate to the system, and to measure those to determine your vulnerability. Load testing can help reveal issues, and adding capacity can help lower vulnerability. And, on the other hand, system configurations also affect vulnerability; what the relevant configs are, and how to control them to lower vulnerability, are also open questions.

And in the lifetime of an incident, there is also recovery. Once, unfortunately, you are in a metastable failure and you want to recover from it, there are also multiple things you need to do. The first one is to fix the trigger to prevent a recurrence, and there are multiple ways you can do that: you can negate a load spike by load shedding, you can roll back recent deployments, you can also hot-fix software bugs on the fly. After you have done all of this, you want to make sure you end the overload, to break the sustaining effect cycle: load shedding, increasing the capacity, changing policies to reduce the amplification factors, and so on. Currently, a lot of this has to be done by engineers: once there's an alarm, a detection that a metastable failure might be happening, you want to go there and find the right knobs to tune. But in the future, if we can make this more automatic, then it's going to be even better.
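As a sketch of what "more automatic" might look like, here is a hypothetical watchdog loop (my own illustration, not an existing tool or anything Lexiang describes building): it watches offered load against estimated capacity and sheds load once overload persists, trying to break a sustaining effect before it locks in, then ramps back when there is headroom.

```python
def watchdog(offered, capacity, overload_window=3, shed_fraction=0.5):
    """Hypothetical detect-and-react loop over load/capacity time series."""
    overloaded_for = 0
    shedding = False
    for t, (load, cap) in enumerate(zip(offered, capacity)):
        admitted = load * (1 - shed_fraction) if shedding else load
        if admitted > cap:
            overloaded_for += 1
        else:
            overloaded_for = 0
            shedding = False                 # headroom again: stop shedding
        if overloaded_for >= overload_window:
            shedding = True                  # react before amplification wins
        print(f"t={t:2d} load={load:5.0f} admitted={admitted:6.0f} "
              f"cap={cap:5.0f} shedding={shedding}")

# Made-up traces: a load spike hits while capacity is briefly degraded.
watchdog(offered=[700] * 4 + [1400] * 6 + [900] * 6,
         capacity=[1000] * 5 + [600] * 4 + [1000] * 7)
```

The hard part, as discussed above, is picking the window and thresholds so the reaction lands before the sustaining effect has already locked the system into overload.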
It would speed up the recovery process. Definitely, yeah. I look forward to all of your future research; that sounds fantastic. Cool. So, obviously, are you working on any other projects in this area, or is it all related to metastable failures at the moment? Are there any other projects you're working on that the listener may be interested in?

Oh yeah, so metastable failures are only one of the projects I'm working on, and they're about throughput-related failures, or throughput-related performance problems. Because I'm working on performance debugging of distributed systems, there's also another type of performance bug, which is latency bugs, and I'm working on profiling as well as debugging latency issues in distributed systems using distributed tracing. Maybe a little bit more background: currently, there are actually a lot of distributed system traces being generated. For example, at Meta, at Facebook, in one day they generate billions of traces. And the current practice is that people look at them individually; sometimes they use some basic aggregation metrics just to filter out some of the interesting traces, but in the end it boils down to people looking at individual traces and figuring out where the latency is long and where the optimization opportunities are. But we find that looking at individual traces fundamentally has the problem of being biased by each trace. So that's why, in my research, I proposed a tool called tprof to aggregate distributed system traces together, to expose more of an overview of the system, and to see where the latency is high, as well as where you want to spend most of your debugging time so that you get the most improvement in end-to-end latency. So that's the type of thing I'm doing.
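To make the aggregation idea concrete, here is a minimal sketch (my own simplification; tprof's real structural aggregation and the span fields shown here are richer than this): group spans from many traces by operation and aggregate their latencies, so hot spots show up without eyeballing traces one at a time.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical spans pulled from a few distributed traces:
# (trace_id, operation, duration_ms)
spans = [
    ("t1", "gateway", 120), ("t1", "auth", 15), ("t1", "db.query", 95),
    ("t2", "gateway", 480), ("t2", "auth", 14), ("t2", "db.query", 450),
    ("t3", "gateway", 130), ("t3", "auth", 16), ("t3", "db.query", 100),
]

by_op = defaultdict(list)
for _, op, ms in spans:
    by_op[op].append(ms)

# Aggregate across traces instead of reading them one by one; the
# operations with the largest aggregate latency are where debugging
# time buys the most end-to-end improvement.
for op, durations in sorted(by_op.items(), key=lambda kv: -sum(kv[1])):
    print(f"{op:10} n={len(durations)} total={sum(durations):5d}ms "
          f"mean={mean(durations):6.1f}ms max={max(durations):4d}ms")
```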
And on the other hand, I'm also interested in improving the efficiency of distributed systems, or today's cloud systems, because performance and efficiency always go hand in hand: you can always throw in more machines to improve performance, but your utilization might suffer. So, to help broaden my horizons in that area, I was also collaborating and doing internships at Microsoft Research, working on using workload characteristics to try to improve Azure cloud utilization.

Amazing. That's great. And this has been a recent interest of mine, trying to understand how people approach idea generation in this area. How do you decide what to work on? What's your process for that? Because you work on some amazing, really interesting, cool stuff. How do you arrive at, this is what I want to do, this is really cool?

Yeah, so I would say different people have different approaches to this, but for me, I would say it's trial and error. For example, for the distributed tracing project I was doing, I was interested in finding performance issues. Then I was thinking about, what's the state of the art? Oh, people are using distributed tracing; it provides fine-grained details of where the latencies are. Then I started using that, and after using it, I said, oh, this works well, I can find something. But there's also a lot of complexity getting in the way; there are just so many traces. Which one should I look at? Which one can I believe? We need an overview of all of this. That's where the aggregation comes in. Then I started trying different aggregation methods; they each have pros and cons, they have constraints, but in the end I tried to sort them out and provide multiple levels of granularity of aggregation. And I found, oh, it works well; I can use the tool I designed to find many interesting performance bugs. In the end, it was published at the ACM Symposium on Cloud Computing. And yeah, that's how a paper gets generated. Okay, yeah. We'll put a link to that paper in the show notes if anyone wants to go and check it out. Thanks. I mean, yeah, so that's for that paper.
But this paper was also a little bit different, because after doing that, I found, oh, I could use the open-source benchmarks and try to find performance issues with them. But, you know, for distributed systems in the open-source world, you don't have that large-scale distributed system, and you always wonder, how can you do even more impactful research? So that's why I tried to intern at big companies, either with an on-premises cloud or with a big public cloud. At Twitter, I was able to hear engineers' opinions about some of the papers they'd found, for example the HotOS paper about metastable failures. It resonated with them: we always wonder about the same problem, why throughput is just low; let's dig into it and see whether some of the incidents that happen at Twitter are metastable failures. Well, I would say, adding to your previous question, many of the incidents at Twitter were not metastable failures; they were just regular overloading issues, and once the trigger is fixed, the system isn't overloaded anymore, although engineers still have to do some recovery to prevent the overloading event from happening again in the future. But those aren't metastable failures, you know. Metastable failures are rare and persistent; that's why it was hard. But I was lucky, I would say: at Twitter I was able to find one instance, with well-documented data that I could analyze and show in the paper to the public. So that's how that helps. For a grad student, I would say, if you really want to look for interesting problems, you can go to industry and see; sometimes things pop up, and if you're interested, there can be collaborations.

Amazing. Yeah, it's really nice to see the breadth of everyone's different approaches to answering this
question; it's really interesting, and that's another really fascinating answer to it as well. Brilliant. So I've got just two more questions now. The first one is, what do you think is the biggest challenge now in your research area? So, in distributed systems, in metastable failures, and all the other cool things you work on, what do you think is the biggest challenge facing us now?

Yeah, so I think I've mentioned it a little bit, but because I'm working on performance debugging, I find that there are still tons of human effort being spent on performance debugging. In large companies like Google, Facebook, Meta, all these large companies, or even smaller companies, people hire performance engineers or capacity engineers specifically to deal with performance problems. The tools they're currently using are good, but we can do even more than that, and I think one of the keys is to do more automation. For example, for metastable failure incidents, if we can somehow find an automatic way to detect them quickly enough, then that can help prevent them from happening in the first place. And also, when the system is already in a metastable failure state, how do we automatically recover from it? Maybe we can auto-tune the configuration somehow to get out of the metastable failure state. That's another thing we can do. And in general, for debugging performance issues, there are different stages: you first detect there's a problem, then you try to diagnose it to find the root cause, and then after that, you try to recover from it and prevent it from happening in the future. In the detection stage, I think recently there have been multiple papers on this, which is good; people are heading in that direction and doing more automation there. But for diagnosis, I think there are still a lot of things we can do.
Especially if you go to industry, you will find there is a lot of data right there, like telemetry, all these tables, and you want to join this table with that table and see what really happens. But there's just too much data and too few people. Right, okay: too much data, not enough people. Yeah. And there's a big opportunity for people to sit down and build tools to help automatically analyze that telemetry, to find the signals in the haystack, to give signals like, for example, oh, a metastable failure is about to happen, let's catch up with it, or, a metastable failure has already happened, so you don't need a human to spot it. And the recovery phase can be done with some automation as well. If anything, in the early stage it can generate a report for the performance engineers to help them better look for the root causes, or provide suggestions to them for doing the recovery, before going into a fully automatic mode, you know.
Yeah, we should definitely get that on a t-shirt: too much data, not enough people. Right, maybe I should make one. Yeah, that's awesome. All right, so time for the last word now. What's the one key thing you want the listener to take away from your research and from this podcast today? Yeah, so I would say we have talked about a lot of things, but still, specifically for metastable failures, the takeaway is that they're really prevalent and they can cause major outages; that's why they're important. And as for how to fix them: understand the sustaining effect first, and then understand the degree of vulnerability, to prevent this thing from happening in the future.

Amazing. Let's end it there. Thanks so much for coming on the show; it's been a great conversation. If the listeners are interested in knowing more about Lexiang's work, we'll put links to all of the relevant materials in the show notes, and we will see you next time for some more awesome computer science research. Thanks, Jack. Thanks, everybody, for listening.
