Disseminate: The Computer Science Research Podcast - Thaleia Doudali | Is Machine Learning Necessary for Cloud Resource Usage Forecasting? | #43

Episode Date: November 20, 2023

Summary: In this week's episode, we talk with Thaleia Doudali and explore the realm of cloud resource forecasting, focusing on the use of Long Short-Term Memory (LSTM) neural networks, a popular machine learning model. Drawing from her research, Thaleia discusses the surprising discovery that, despite the complexity of ML models, accurate predictions often boil down to a simple shift of values by one time step. The discussion explores the nuances of time series data, encompassing resource metrics like CPU, memory, network, and disk I/O across different cloud providers and levels. Thaleia highlights the minimal variations observed in consecutive time steps, prompting a critical question: Do we really need complex machine learning models for effective forecasting? The episode concludes with Thaleia's vision for practical resource management systems, advocating for a thoughtful balance between simple solutions, such as data shifts, and the application of machine learning. Tune in as we unravel the layers of cloud resource forecasting with Thaleia Doudali.

Links: SoCC'23 Paper | Thaleia's Homepage | IMDEA Software Homepage | GitHub Repo

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, the computer science research podcast. I'm your host, Jack Wardby. A reminder that if you do enjoy the show, please do consider supporting us through Buy Me A Coffee. It really helps us to keep making the show. Today, I'm joined by Thaleia Doudali, who'll be telling us everything we need to know about: is machine learning necessary for cloud resource usage forecasting? Thaleia is an assistant professor at the IMDEA Software Institute in Madrid, and there she leads the MUSE Research Lab, where she works at the intersection of operating systems, machine learning, and computer vision. Thaleia, welcome to the show.
Starting point is 00:00:56 Great to be here. Thank you for having me. The pleasure is all ours. So let's jump straight in. Can you tell us a little bit more about yourself? I gave a very brief introduction there, but yeah, tell us more about yourself and how you became interested in research and in this space. Yeah, absolutely. Back in the undergrad days, I was always fascinated by coding. I was one of the few girls to take those operating systems classes, and I just always liked the feeling of sitting in front of a dark screen and coding. And that led me to more of the operating systems, computer architecture side. I really enjoyed these things. And I really liked the environment in the lab where I did my diploma
Starting point is 00:01:38 thesis at the National Technical University of Athens. And at that point of life, I was not ready to go into industry. So I really wanted to continue being a student. And I guess that's why I pursued a PhD. And during the PhD, I really enjoyed the research process, you know, the process of finding new problems, thinking outside the box, trying out different things. And research really is a place where you can try crazy things. And even if they don't work, it's fine. And I felt that I
Starting point is 00:02:11 could not do that so easily in industry. So that's why I continued on the academic path. And it's something I really enjoy so far as a junior faculty. And yeah, the journey is still long, but I'm very excited. Awesome yeah I mean you're really nice to have that creative freedom right to kind of pursue things that maybe sound a little bit crazy and if you was in industry the higher-ups wouldn't sign off on it right but you have that space to sort of express yourself which is which is very very appealing cool so obviously today we're going to be talking about machine learning and about cloud computing environment. So and obviously resource efficiency as well.
Starting point is 00:02:49 So can you maybe give the listener a little bit of background information and sort of set the scene for us for this for this chat today? Yeah, absolutely. Let's give a little bit of context to cloud computing. Any company like Google, Amazon, Microsoft, all of these big companies are essentially cloud providers. They allow you to access massive amounts of hardware just using your computer online. And the actual hardware is sitting in some data center in the world, somewhere where you don't even know. So now any user, you, me, businesses, we can actually give some money and create virtual machines, run our applications on the cloud and all of that. Now, from the point of these companies, these cloud providers, they have a very big challenge and they've had this problem for almost a decade now. And the problem is that
Starting point is 00:03:45 they cannot achieve high resource efficiency. And what we mean by that is that for all the amount of hardware they put and they pay to have, they don't get their money's worth, let's say. So this means low cost efficiency. And this happens for many reasons. The simplest example I can give you is the following. You are a user, you go online to AWS and you want to create a virtual machine or you want to run a serverless function. The first thing you need to do is to configure how many resources you want to use, how much memory you want to use, how many cores, what type of storage, networks, all these different choices, and you have to put a number there. And I think most of the times the users don't really know what number to put. Maybe
Starting point is 00:04:37 they put the one that makes it for the cheaper option or what sounds as a reasonable number, like, you know, maybe four gigabytes sounds good or not. So this is fine for the user. But then for the cloud provider, this creates a problem because many times the user is asking more hardware than they're actually using. So this difference between how much the user is asking and how much they're actually using creates this problem for the cloud provider where the hardware is not utilized to the maximum potential and this creates a cost loss efficiency loss and this creates this problem so obviously they're leaving
Starting point is 00:05:20 some money on the table and i guess that's why the cloud providers don't like it right because they want to maximize it squeeze every little bit they can out of out of leaving some money on the table there. I guess that's why the cloud providers don't like it, right? Because they want to maximize it, squeeze every little bit they can out of there. But it's not just money. It is money. At the same time, when we talk about resource efficiency, you may also think about it as energy efficiency because the more you don't utilize to the maximum these machines, you don't utilize to the maximum the energy,
Starting point is 00:05:44 the hardware that you invested in. Okay, yeah. So you're maximum the energy the hardware you get that you invested in okay yeah so you're not actually using the hardware you've got there as efficiently which then is obviously having a drag on sort of a financial drag or whatever on on on the interesting cool yeah it's another good angle to it so you said that they've been they've been tight trying to tackle this problem for the last sort of 10 years yeah so what are the sort of current production sort of grade approaches to sort of helping like it so what are the sort of current production sort of grade approaches to sort of helping like kind of improve the situation a little bit towards getting better efficiency so um the approach that the cloud providers are taking is the following they
Starting point is 00:06:17 all of the cloud providers have created their own specialized software that we typically name as a cloud resource management system. And what does this system does? What essentially the cloud provider can do now is no matter what resources the user is asking, essentially the cloud provider will give the user the amount they're actually using. So let's say that you are a user and you ask for 10 gigabytes of memory. But in reality, you're using only 5 gigabytes of memory. So the cloud provider creates software that monitors your usage and your behavior and actually gives you dynamically only the amount that you're actually using.
Starting point is 00:07:06 So the way to improve efficiency as a cloud provider is to dynamically adjust the amount of hardware that you distribute across the users to match their actual usage and not what they asked originally. And being able to really follow how much usage the users are using is a hard problem because now you need to do that proactively. You need to be able to provide the user the resources that they want before they're actually using them so that they can use them. So now this is where the problem of prediction comes in. Like for these systems to be useful and efficient, they need a way to be able to predict the number of resources the user will need in the future. So now we go from a resource management problem, from an efficiency problem, at the end of the day, it boils down to a prediction problem. This is the core component of all these systems, having a model to be able to predict future
Starting point is 00:08:12 resource usage. And this is where really the story of this paper starts. Okay, cool. So yeah, I know that the star of your paper was this one approach called, and we can maybe go through a few of the other approaches, like the types of models that the star of your paper was this one approach called, and we can maybe go through a few of the other approaches, like the types of models that the CRMS systems, let me get the acronym correct, there we go, the Cloud Resource Management Systems, that they use as well. But I know that the star of the show is this long, short-term memory. And when I first read that, I was a bit like, hang on a minute,
Starting point is 00:08:40 long, short? This is confusing. There's something going wrong here. Am I reading this wrong? But yeah, so tell us about the Long, short? This is confusing. There's something going wrong here. Am I reading this wrong? But yeah. Tell us about the long, short-term memory and the types of models these systems use at the moment. Yeah, this is a type of machine learning models that falls under the category of recurrent neural networks. So you may have heard of convolutional neural networks. These are very good for classifying things, separating cats and dogs in images. But when it comes to time series, we will use recurrent neural networks.
Starting point is 00:09:15 These are very good at finding things that are predicting things over time, patterns that are repeating. So they have this notion of time inside them. And long-short memory networks are a category, let's say, of recurring neural networks that are able to kind of remember things from the past. That's where the term memory comes. And I think long-short, I don't know why it's called like that, probably because they keep a lot of memory. But at the end of the day, they have proven to be very, very accurate for many tasks. Like, for example, predicting the weather, predicting how the stock markets will vary, predicting very, very different things in various domains from finance, health, weather, whatever. And recently, because in systems we see a lot of effort in exploring machine learning solutions,
Starting point is 00:10:13 recently we saw a lot of papers that also try to integrate these models in system level problems in general, resource management being one of them. What is good about these models is that they seem to be accurate across domain. They are relatively easy to configure. I mean, of course, they have a lot of hyper parameters, but you spend some time tuning them and at the end of the day, you can use them. In our work and in many of these works in systems, we treat them as a black box. So we don't necessarily care to know what's inside. We care to tune it very well. And we care about the input and the output, let's say. So in our case, the input will be the resource
Starting point is 00:10:59 usage in the past, let's say in the past one hour. And the prediction, the output will be the predicted usage in, let's say, in the next five minutes. So that's how we use those models in the context of this work. Okay, amazing. We've got a few questions that are jumping around my mind. I'm going to hold off on them now because I think we're probably going to touch on them a little bit later on about the actual metrics and the time frame you predict over you mentioned
Starting point is 00:11:26 five minutes there which was interesting but anyway it's okay so we're taking these uh lstms i'll use the acronym from for them from now on and and you start off in your paper by exploring their predictive capabilities in and can you tell us more about this sort of these experiments you did and how you ran and what your findings were and how you achieved this exploration of seeing how good they actually were at predicting things? Absolutely. Yeah, so we followed the trend of everybody using LSTMs, so we also wanted to use them. For this part of the experiments, we used a public data set from Google that captures how much resources is being used across time for applications. Some of Google's applications, of course, you don't know exactly what they are, but for some applications that were running for many, many hours. So they provide data publicly
Starting point is 00:12:19 on how many resources these applications were using, like cores, memory, disk, etc. So what we do in our approach is we train a different LSTM model per one of these applications. And this is something that is done, is a common approach to train a different model per application because all these applications can have very different patterns, behaviors, etc. You can imagine, for example, that I can give a very simple example. If you have a social media application, Facebook, Instagram, a user may use it more differently than an alarm clock application. So the alarm clock, you always use it in the morning. It's quite lightweight, fast. It doesn't consume a lot of resources.
Starting point is 00:13:05 But when you go and log into Facebook, probably you are doing that at night or throughout the day. So you use it more frequently and it takes more resources to run it. It needs memory, it needs compute, all of that. So this is a simplified example of different applications and that they will use resources very differently. So we train a different LSTM model per application. And now we want to test how good they are. We want to test their accuracy. So to do that, we create a set of different test sets.
Starting point is 00:13:39 So in the first experiment, we test against the data that the model was trained, the data that the model has already seen. And we get very, very good results, great accuracy. Good. This means that the model learned. Next, moving on, we tested against the data of applications that were similar, that exhibited similar patterns. And again, we see quite good accuracy, good levels of accuracy. We say, great, this is what machine learning should do, should be able to learn similar patterns. And then we say, okay, what if we test the models against random data? Like, let's test it again, a completely different application. What is going to happen?
Starting point is 00:14:29 Probably we were expecting something bad to happen because we wouldn't expect this model to learn something that they have never seen, some data. So surprisingly, we see again very good accuracy. And this is where we start getting suspicious, very suspicious. So what we did next, and this is where the story becomes very interesting, is that we said, okay, let's look what the data looks like. Let's visualize this time series and this prediction.
Starting point is 00:15:01 So we simply created a plot of the real data, the real time series and this prediction. So we simply created a plot of the real data, the real time series and the predicted time series. And so what we were expecting to see is the two lines overlapping, being very close to each other. But what we saw is that the two lines, the LSTM line was always shifted to the right. So it was as if it was predicting the previous value in time. And this is where we started getting suspicious and say, we consistently were seeing this shift, no matter the test data, no matter the
Starting point is 00:15:40 LSTM model. We did this extensively across many, many experiments. So we would always see that the LSTM's predictions were shifted in time to the right. We searched a lot in Google. We found various blog posts. We looked at papers that have used LSTMs. And you know what's funny is that not many people have visualized those predictions. So thankfully in some blog posts, we saw some similar visualizations and we did notice a similar shift. But what happens is that the authors, like everybody that works with LSTMs, always focuses on accuracy. If accuracy is good, if my error is low, great. But when you visualize this, we cannot just simply ignore this consistent shift.
Starting point is 00:16:34 So this is where we started thinking about it. And instead of focusing and saying what's wrong with LSTMs, we said, okay, is there a different way I can produce these shifts? Since this data, this shifted data brings so good accuracy, there must be a much simpler way we can generate such shifted data. And of course you can. You can just simply shift your time series one time. And that is totally legit. And we actually give it a fancy name. We call this the persistent forecast.
Starting point is 00:17:13 And this is a terminology that is used in literature. And we call this a persistent forecast because you make the assumption that the values persist over time. The values will change very little from one time step to the other. And so then the remainder of this paper tries to make a case that by using this forecast,
Starting point is 00:17:35 by assuming that the data values persist over time, you can forecast cloud resources very well and you don't really need those lstms cool yeah so a few questions that kind of i do going through that i didn't want to interrupt you while you was in your flow so um that's a really first of all persistent forecast is a really cool name for something that's kind of i guess relatively straightforward right quite simple but that's really cool but the the so the basic kind of idea here is i guess these lstms are going oh yeah the value was five ten minutes or like some time step ago therefore now it's going to be five again sort of business it's what you're saying it's going okay it's a
Starting point is 00:18:14 forget it's lazy okay i saw this before oh it's probably going to be that now there we go because these values are really highly correlated over time right so that's sort of yeah so the lstm results they were able to follow very nicely the patterns but it was as if they were spitting out the input to the future so right yeah i think they were not were they even learning that's our concern i see i see cool any i've got a few questions about the um about how you actually the sort of the software you use to train these LSTMs and like what how easy was it history, in the past, do you look at? Like an hour ago, two hours ago, how many time steps do you consider in the past? How many time steps you consider in the future? We considered one, which is quite common to do here. There are other parameters with respect to the model itself, how fast is learning, the number of training epochs, the learning rate.
Starting point is 00:19:28 These are very typical parameters that you have to tune in any neural network. And what you usually do is you consider some values for each of these parameters and you do a grid search and you try to find the best combination that worked well. We did that one time and we kept the same combination of good values across all the models that we trained. And we trained a different model per application. Nice, cool. How long does it take to train all these models? These models are relatively fast in the matter of a couple of minutes because the input here is just a time series.
Starting point is 00:20:07 I see, yeah. A different model per time series. And then, of course, if you have a GPU or whatever, you can run them even faster. So they're relatively, you know, they're actually quite fast to train. And that is why they're a good candidate for systems integration because they will not take months and you know days to to train you can train them relatively fast and we train them even faster do inference even faster in an online manner which is good right if you
Starting point is 00:20:40 want a real-time sort of system and you've got some real-time guarantees. Cool. So, cool, let's continue with the story then. So there was, you've had this sort of epiphany of, hang on, something's not right here. Something didn't smell good. Then we figured out that we've got this shift. Yeah. And then we went and sort of said, okay, look, now we've got this sort of simple, persistent forecast
Starting point is 00:20:59 that we can take as an approach. And you went and sort of explored loads of cloud resource usage and types of data and tried to sort of explore and see whether this could be applied to that so yeah like tell us about how you went about finding this i'm sorry figuring this out and what your experiments were and what you found absolutely uh yeah we considered different public data sets from providers such as google microsoft and wabba and bitins. The reason why we consider these different datasets is because they capture information at a different level. So to be more
Starting point is 00:21:34 concrete, for example, some datasets, they capture resource usage at the level of a physical machine. A physical machine is the server with the actual hardware that is sitting in a data center. This is like the base level. Then going at a level deeper or upper, if you want, you have the virtual machines where each user can create a cluster of different virtual machines. So now there can be many virtual machines allocated in a physical machine. So this will change the perspective of the resource usage. And at the end or at the deepest level, you have the workloads, the applications. So you can observe a resource usage at an application level where many applications are running inside a virtual machine.
Starting point is 00:22:25 Many virtual machines are located in a physical machine. So there is this hierarchy or different levels where you can observe resource usage. So the different data sets can give us access to this variety of resource levels. And then what essentially we do is we want to see if in these different levels of resource usage, whether across time resource usage shows high data persistence. This persistence means that from one time step to the next, we want to see whether the values change drastically or not. For example, if from one time to the next five minutes, my memory usage only changes by a few kilobytes, that's very little.
Starting point is 00:23:16 So this shows high data persistence. So to capture this persistence, we just look at the average difference, delta, between two consecutive time steps, two consecutive values in the time series. And in this way, we can conclude, we can do this across any time series and we can observe whether or not we see high data persistence. And the insights, the findings that we came out with are the following. Overall, across datasets, we observed high data persistence. So our takeaway is that really in the cloud, in the context of physical virtual machine applications, we see that resource usage doesn't change drastically every five minutes, let's say. So that's great. That means
Starting point is 00:24:06 small delta. That means great opportunity for this persistent forecast to work well. Now, at the different levels, we saw different things. At the physical machine level, that is the hardware level, where there can be many virtual machines, many applications, we saw the highest data persistence. And this makes sense because at that level, usually those servers are overloaded. They keep the utilization high. So we see very stable use, stable patterns. We don't see a lot of dynamic behaviors.
Starting point is 00:24:41 So things remain relatively the same. They don't change a lot. Now, as you go deeper, like to the virtual machine level, patterns start becoming more periodic. And this makes also sense because it has to do with how the user is using the virtual machine. For example, the user wakes up in the morning, uses their virtual machine, then goes to sleep at night, wakes up. So this creates a periodic, a diurnal pattern, a daily pattern. And then when you go to the application level, this is where things become the most dynamic because an application can do whatever it wants. It can have any pattern possible. So this is where we saw less data persistence, more variability in the values. for simple persistent forecasts to work well. But if you're at the application level, maybe you want to do something more sophisticated
Starting point is 00:25:47 or particular for this application. So these are our insights. And if you want, I can answer the question of the paper. Yeah, answer the question, answer the question. Do the drum roll, whatever. So is machine learning necessary for cloud resource usage forecasting? We answer no, but there is a but.
Starting point is 00:26:11 Always there is a but. We say no for the most part. There will be cases that you will need something more clever. Is it machine learning? Is it not? I leave it up to you. But we see that for majority of the cases, for majority of this data and these patterns that we see, these patterns are quite predictable, stable, periodic.
Starting point is 00:26:34 You can get away with simple solutions, solutions that we have already. And we are using in the systems for years now because they do work well for those cases. I think the new questions that come out of this work and the new challenges really is to how to identify in which cases you need ML and you don't need, how to identify which patterns need machine learning to be predicted well. And this maybe creates a problem of classifying patterns into different categories of stable, dynamic, periodic, other ones possible. And then if you are to use something more clever, something more not so simple, is it going to be machine learning? Maybe yes. We propose probably not to use LSTMs or if you use machine learning, really, really put effort into making
Starting point is 00:27:27 sure that it learns well, it actually learns. And now we see future systems having a combination of both. We see future systems being augmented with machine learning and you have to do that in a practical way. If you are to use machine learning, you have to do that in a practical way if you are to use machine learning you have to make sure it's practical low overheads because at the end of the day it's an online system and things need to be fast nice it's a great summary of the findings i had a quick question on the so you said that as you moved up the stack there from or down the stack depending on your perspective um that that the simple sort of the persistent forecasting sort of didn't perform as well because the pattern become more varied.
Starting point is 00:28:10 Was that also true of LSTMs as well? Did they observe the same thing as you kind of moved up the stack? Did they obviously mimic it in the same way? They got less accurate as you did that? Yeah, absolutely. In the sense that if a pattern was very dynamic, even there the LSTM will struggle. And if a pattern is completely random, then it's very, very hard to predict it, no matter what you use.
Starting point is 00:28:35 Of course. Yeah. Yeah. That's really cool. You listed loads of really nice directions there for future work. So I'm going to dig into that now and say where do you go next then with this with this work then so you listed some great directions what's first on the to-do list exactly exactly uh because at the end of the day this paper is more on the analysis side giving you a lot of insight but we don't actually use this persistent forecast uh somewhere so we are now looking for a real world resource management system where we can actually integrate this persistent forecast. And of course, together with that, we want to identify when to use ML and what kind of ML to use, because I think now in the time series forecasting domain,
Starting point is 00:29:20 we're moving past those LSTMs. There are new techniques out there, more recent state-of-the-arts. So we want to update ourselves there. And what is really, really important now is to see that we don't just care about accuracy anymore. We want accuracy to be good, but now we want to see how the accuracy, how the predictions impact performance, systems performance. So now we introduce new metrics, performance level metrics such as runtime efficiency. We want to see how much we're going to improve those by having better, more accurate predictions. How far along are you with this sort of work? Is this sort of in progress or is it? In progress, definitely. Yeah definitely yeah yeah cool of the of the analysis that you did so far in lstms and
Starting point is 00:30:11 persistent forecasting are there any sort of sort of limitations to this work and scenarios that maybe your analysis has been kind of it's never perfect there's always some holes in it things you wish you had more time to do so like are any sort of like areas where you'd like think ah yeah we're a bit weak on that yeah i think there is definitely room to explore many many more data sets we experimented with some that are public that are across different levels but there are even more for example there are different data sets that capture resource usage for microservice applications. These potentially can have very different behaviors. Same with serverless workloads.
Starting point is 00:30:53 Again, these may exhibit very different behaviors. So we don't know if our insights and observations will generalize to those. But what is good is that the way we explore this data persistence with this very, very simple metric of looking at the difference in the values, this is a thing you can do no matter what data you have. So at least we give you the means to do it very simply. But I'm curious to see about what people observe in those different subdomains of cloud workloads yeah yeah for sure that'd be really interesting to see whether the finding generalizes across different areas as well and i just wanted to just roll back a second to the question you said about deciding when to use various different ml models and like given this scenario how how do
Starting point is 00:31:44 you envisage that happening like how how do you go about like i have a collection of data all of this this smorgasbord of ml models i need to pick one obviously i can just do it one by one and just find the best eventually but that feels very time consuming do you feel like it's going to be a case of learning the properties of the data that suits the model better like how are you going to go about sort of deciding that or how could we go about deciding yeah that's a a very very challenging problem and an open problem so there are many many people have a very different approach so for example there is one approach that says build a global model build a model to rule them all literally and then research they're very promising
Starting point is 00:32:27 research i don't know if it's going to be accurate if it's going to be efficient so one approach is that then the other approach um is to have different models for different classes of patterns, of things, whatever. So kind of understanding and classifying what you see into irregular, normal, stable, periodic. You kind of, I truly believe that we need to first study the data that we are working with. And the data itself will tell us what we need. Of course, there will be cases, I believe there will be cases where things are totally unpredictable.
Starting point is 00:33:11 And in this case, you might just do a simple best effort and you have to tolerate the error, no matter what. And then if there are different models for different classes of patterns, I really don't know if the answer is different ML method per class. Are you going to use the same type? Are you going to use LSTMs and LSTMs or decision trees, whatever, all these different things that are there for different patterns?
Starting point is 00:33:51 This is a very interesting problem to explore, actually. Yeah, it sounds like it could be keeping you busy for a while, that one. Which is a good thing, which is which is cool cool yeah so yeah um i guess my next question sort of is how do you think then so i'll be my day-to-day sort of life as a software developer data engineer i can sort of leverage the things that you found in your paper yeah i think the insights from this paper are very good for production level integration. And it's something that actually agrees with many things that we see in production. There are various papers from Microsoft Research where they actually also use these simple persistent forecast versions of this persistent forecast for their production level
Starting point is 00:34:47 machines. And it really goes to show that in production, in industry, you want simple solutions. And industry is very hesitant to integrate machine learning in their stack. And Microsoft had a great paper on this at Sigmund. They talk about how it's practically hard because you have to integrate this ML code into your existing code base. You need to debug it. You need to be able to run tests against it. And you need to trust it.
Starting point is 00:35:21 You need to be able to explain. For example, if the machine learning model predicts something completely random that's going to make the user pay more money, then you need to be able to explain that. So there is a lot of hesitation in applying ML. And of course, there is practicality, overheads of cost, engineering, all of that so i really think that this concept of simplicity rules is is what industry really likes to follow and and make sense yeah i mean you can totally understand it right if it's simple and it works well enough like yeah we don't want the big
Starting point is 00:36:00 nasty complex thing over there that we can't really understand that's a black box that occasionally tells us to bill this customer a thousand dollars and then the customer's angry that's why and then i don't have a reason for it right so um yeah no that's that's really interesting that's really cool so yeah i guess building on that then so whilst you've been working in this sort of space and on this project what's the most like interesting thing that you've that you've found the most interesting lesson that you've found, the most interesting lesson that you've learned? Yeah, this project was truly eye-opening, literally, because we visualized all these things. And I really think it's those visualizations that generated a lot of insight.
Starting point is 00:36:38 So what my biggest takeaway, especially when working with Amal, and it's something that I have observed also as a PhD student in my thesis, is to not trust just the number of accuracy, just the number of the prediction error accuracy. You need to really dive deeper and see if the model is actually learning. And one way to do it is to visualize those results. You can use explainable ML to help you out. You can use ML that is kind of self-explainable, like decision trees. So you really, really need to validate that it's actually learning. And that means at the same time that you need to truly understand the data that you are working with, like observe patterns and understand what
Starting point is 00:37:26 works well for some and not the others and this will i think really like the visualization is the key and validation is the key yeah yeah no that's that's so true so yeah so now while i was talking about the projects and the origins of it and whatnot and things you learned were there any things that you tried along the way that kind of failed any sort of dead ends you hit and then you're like oh i mean the fact that you found that it was like shifted by one is kind of i guess like a dead end is not a dead end but like an interesting like strange finding anyway but yeah because a what happened is at the end at the beginning of this project we said okay let's build this one model to rule them all. Okay, right. And we seemed to build generalizable models.
Starting point is 00:38:16 And that's the idea behind running these experiments that test against different data sets like seen data, unseen data. And because the results were too good to be true, this is where it started getting suspicious. So the interesting part there is that instead of thinking that, oh, we failed, we took this failure and we turned it into success by saying that, oh, look, instead of using those LSTMs, you might as well do something simple and create the same result. So a failure became an opportunity. And this is something I think that happens a lot during phd yeah very true yeah you guys say it's one door closes another one opens right like uh it's always an opportunity yeah exactly yeah it's all about your perspective right so yeah that's that's so true that's cool um yeah i guess obviously being the being the the the lead of the the muse lab you you do lots of other things as well.
Starting point is 00:39:06 So can you maybe tell us some about some of your other research and things? Absolutely. Yeah, absolutely. The other part of research that really excites me is driven by these visualization insights. So at the end, towards the end of my PhD and now with my students, we have the idea of what if we actually use images in systems, in system level resource management, for example. And this sounds like a very crazy idea, but that's what research and academia is for, to do crazy things that may not work. But the selling point here is that if you visualize things, you yourself can generate insights. And then if you are to use machine learning, you can now use a different class of algorithms that you couldn't before because you didn't have images. And machine learning works
Starting point is 00:39:59 great for images. Everything is done for images and videos and all of that. So can we leverage these different types of algorithms inside now systems research? And this is, I don't have the answer to this. It seems promising. It seems that it can potentially learn those harder patterns that are out there. It has a lot of potential, but then of course, there are many challenges and open questions with respect to, is it practical? Is it lightweight? Is it, what is it? So it's an interesting research line. So that's one of the other things I'm doing. And then because I work at the intersection of machine learning and systems. I do have a research direction in systems for ML. How can we build better system support for supporting those machine learning workloads,
Starting point is 00:40:53 particularly those large language models that are now so hard? Yes. How we can improve their memory management and improve their performance. Cool, yeah. Just going back to the visualization, that was really interesting because I was trying to wrap my brain around the idea while you were explaining it. But can you give an example of how that would work?
Starting point is 00:41:15 Because I'm thinking you would somehow generate an image from some resource metrics or something, and then you would feed that image then into some TNs. The easiest thing you can do is just the graph that you have that visualizes the time series that the single line plot put it through a convolutional neural network and see if it learns that better compared to using lstms on top of numbers that's well so the the like the image representation of the same information that can actually has learned that better than just feeding something that just
Starting point is 00:41:50 reads the pure numbers the answer to that potentially maybe i mean that'd be really cool you start with this simple representation then you can move on to more complex image representation where you extract the frequency this called like a spectrogram or okay here uh image representations and you can get ideas from other domains like earthquake detection they also are related to time series and they have done things there with images so you can now i really like this intersection of different domains because you can apply something from a different domain and maybe it works better or, yeah.
Starting point is 00:42:31 That's interesting. Those images have sort of, can you analyze these images over time as well? Can you kind of come back? No, because that's kind of, we did say in the earthquake, because I'm thinking of the seismograph, right? And was that what it's called? And it changes and yeah. That's really cool. I look forward to reading the paper on that one yeah hopefully
Starting point is 00:42:48 cool um so yeah this the next one's the penultimate question and it's it's my favorite question now it's I must say but yeah it's about the creative process because I love seeing how everyone everyone's everyone's a different answer to this question right and it's great to kind of get a peek inside someone else's mind so yeah so tell us how you go about generating ideas and then once you've done that how you then choose which ones to spend the next six months to two years working on right absolutely yeah what i i really like to do and i think i'm good at doing is uh finding uh different perspectives looking at doing, is finding different perspectives, looking at the problem from a different angle.
Starting point is 00:43:28 When somebody can observe this as a limitation, I observe this as an opportunity, as we said before. So, for example, when I'm reading a paper, I always ask myself, what could I have done differently? So that may have to do with the design. I could have designed it differently, or I could have done differently. So that may have to do with the design. I could have designed it differently or I could have evaluated differently using different metrics because, and this is where a lot of opportunity comes because if you add a different evaluation direction, you come up with
Starting point is 00:44:00 a different work because it's one thing to optimize for runtime, but then maybe it's a different thing to optimize for efficiency or throughput or whatever, like a different metric. And I think this is a good starting point for a student, for a project, because then you build upon something that is there. It's great if it's open source and you can reproduce the results. So first trying to reproduce the results from a paper and then asking this thing of what could I have done differently, different performance metric or different way to view what the problem, different objective. So this is one safe way to generate project ideas. And then I think it's also important to keep up to date with conferences and publications. I mean,
Starting point is 00:44:54 there are numerous, especially in the machine learning domain, it's impossible to keep track, but at least having a sense of what are the hot topics of now. For example, this year, yeah, large language models are skyrocketed. It's definitely a very hot topic. Machine learning has been a very hot topic over the last few years. So these are the more hot upcoming topics. But there's always those very standard problems that we are still not able to solve. Such problems are cloud efficiency, scheduling, resource management. These are very fundamental problems, but there is still room for improvement because the other things change.
Starting point is 00:45:42 The workloads change. The users users change you see different patterns different behaviors so it's still a very very relevant problem to work on it's a fantastic answer i love that um simple question you ask yourself as you're going through a paper of how could i do this differently how did you arrive at sort of that being your sort of mindset when reading that was it sort of that came to you over time or was it like a it came over time by throughout my phd uh this is how i went from one publication to the other i i remember very distinctly my first paper was i had read that other paper and it was anyways on some topic and And then I said like, oh, what if I do this for like shared resources and many users and multi-tenancy? And this really worked.
Starting point is 00:46:34 I remember my advisor, she was very happy. Yeah. And this now also happens because now that I'm reviewing papers in program committees, you kind of have to ask these questions to properly evaluate. Or like, is their approach, is their design justified well, evaluated well? And if not, then it gives room for improvement, right? I think it's great. I'm going to try and take that question and use it when I'm reading things and hopefully it generates me some some good ideas as well
Starting point is 00:47:08 and the second point as well about sort of yeah staying on the hot topics and like trying your best to filter out the noise and stay on the hot topics but then also giving the old sort of classic problems some love as well right because there's there's plenty of those that are not solved so yeah and in systems research you always what is great you always see a recycling of fundamental ideas so there are some systems principles ideas design approaches that always we come back to those because even though they're old they are are robust, efficient, and they work well. So we always come back to those. Yeah, it'd be interesting to see what the cycle is on some of them. Because after 10 years, oh, this is terrible, we should do it this way.
Starting point is 00:47:52 Then someone goes, well, hang on a minute. The old way was actually quite good as well. Well, there's nothing new under the sun, right, they say. So yeah, cool. That's awesome. And yeah, so it's the last question now. So what's the one takeaway you'd like the listeners to get from this podcast today? machine learning machine intelligence is cool but i believe that we need to use our human intelligence to see when to use and when we have to use machine intelligence yes yes that's a great great message to end on so let's let's finish there thank you so much for coming on the show
Starting point is 00:48:37 and if the listeners interested in knowing more about flair's work we'll put every link for i think in the show notes and again if you do enjoy the show please consider supporting us through buy me a coffee it really helps keep the show on the road and yeah we'll see you all next time for some more awesome computer science research Thank you.
