Disseminate: The Computer Science Research Podcast - Hani Al-Sayeh | Juggler: Autonomous Cost Optimization and Performance Prediction of Big Data Applications | #6

Episode Date: July 11, 2022

Summary: Distributed in-memory processing frameworks accelerate iterative workloads by caching suitable datasets in memory rather than recomputing them in each iteration. Selecting appropriate datasets to cache, as well as allocating a suitable cluster configuration for caching these datasets, plays a crucial role in achieving optimal performance. In practice, both are tedious, time-consuming tasks and are often neglected by end users, who are typically not aware of workload semantics, sizes of intermediate data, and cluster specification. To address these problems, Hani and his colleagues developed Juggler, an end-to-end framework which autonomously selects appropriate datasets for caching and recommends a correspondingly suitable cluster configuration to end users, with the aim of achieving optimal execution time and cost.

Questions:
1:02 - Can you introduce your work and describe the current workflow for developing big data applications in the cloud?
2:49 - What is the challenge (maybe hidden challenge) facing application developers in this workflow? What harms performance?
5:36 - How does Juggler solve this problem?
11:55 - As an end user, how do I interact with Juggler?
14:07 - Can you talk us through your evaluation of Juggler? What were the key insights?
16:30 - What other tools are similar to Juggler? How do they compare?
18:17 - What are the limitations of Juggler?
21:57 - Who will find Juggler the most useful? Who is it for?
24:05 - Is Juggler publicly available?
24:23 - What is the most interesting (maybe unexpected) lesson you learned while working on this topic?
27:50 - What is next for Juggler? What do you have planned for future research?
28:49 - What attracted you to this research area?
29:45 - What do you think is the biggest challenge now in this area?

Links:
Juggler: Autonomous Cost Optimization and Performance Prediction of Big Data Applications (SIGMOD 2022 paper)
Juggler SIGMOD 22 presentation
CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics (NSDI 2017 paper)
Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics (NSDI 2016 paper)

Contact:
Email: hani-bassam.al-sayeh@tu-ilmenau.de
LinkedIn
TU Ilmenau Database and Information Systems Group

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, the podcast bringing you the latest computer science research. I'm your host, Jack Waudby. I'm delighted to say I'm joined today by Hani Al-Sayeh, who will be talking about his ACM SIGMOD paper, Juggler: Autonomous Cost Optimization and Performance Prediction of Big Data Applications. Hani is a PhD student in the Databases and Information Systems group at the Technical University of Ilmenau. He's interested in the optimization and performance of big data applications. Hani, thanks for joining us on the show. Yeah, thanks for the invitation. You're welcome. So let's dive straight in. Can you first off introduce your work and describe the current workflow for developing big data applications in the cloud? So nowadays we have distributed processing frameworks like Apache Spark, or even previously the Hadoop MapReduce paradigm, and Flink,

Starting point is 00:01:26 and many others, that allow application developers to write their data flows in a transparent way that can be executed later on in parallel in the cloud. So application developers do not need to care a lot about how this code will be executed in virtual machines, in parallel, and all this other stuff. The one task for application developers, which is a hard task nowadays, is to provide new and novel algorithms. Like nowadays you have plenty of data, many applications are in demand, many are now

Starting point is 00:02:09 in the queue, like nobody yet has time to start working on them. And what I tried to do is, since application developers suffer from many challenges, to just reduce some of these challenges during the development of their applications. They should maybe be more concerned with the accuracy of their models, the correctness of their solutions, bug-free solutions, something like this. Okay, fantastic. So what is the challenge, or the maybe even hidden challenge, facing application developers in this workflow? What harms performance? So as I mentioned,
Starting point is 00:02:53 so the problem for application developers when they do decisions like which data set to cache or not, currently, for example, Apache Spark, allow application developers to cache some data sets that they expect that it will be used later on and persisted later on when it's not needed. Which is a performance aspect that application developer must make this decision because nobody else shall do it. And actually, the application developer shall make this decision while he or she lacks a lot of important information. Try to imagine those who are developing Spark ML they are preparing a library that would be used later on.
Starting point is 00:03:41 We realized in many cases, they avoid caching many important datasets, which might be because they expect it should be cached. So these datasets are not filtered. And so maybe after many, many transformations, it's expected to not be small datasets. They make these decisions. The problem when they make these decisions, they don't give the opportunity afterwards to those who are using this application, even though if they can provide models
Starting point is 00:04:13 to predict the size of these data sets or so, the chance has been already lost because the program is now in a jar. It's a binary, no possibility to update it. And here comes the problem. When an application developer suffers, instead of preparing clean code, solving real world problems, the application developer shall make performance-based decisions that he or she cannot even make. On the other side, the end user, or sometimes online schedulers, has to deal with these jars as they are after the fact
Starting point is 00:04:52 and deal with the decisions of application developers. And we realized in a study we made afterwards, or even during preparing Juggler, we realized for around more than 100 applications from real world applications in the market currently used we realized thousands of what we call them anomalies anomalies in the sense that whether a data set is cached while it should not be cached because it's not reused or a data set that is recomputed a lot of times and it should not be cached because it's not reused, or a dataset that is recomputed a lot of times
Starting point is 00:05:27 and it's not cached at all. How does Juggler solve this problem? So Juggler comes as a framework in between, between the development and also the usage. So Juggler solves the problem by... We applied this solution in Spark itself. The idea comes as follows. It's hard.
Starting point is 00:05:55 Always, we realize many, many other related work also focus on... How to say? Focus on injecting or parsing source code sometimes. It's not sometimes efficient because also sometimes we deal with libraries that are dependent on another libraries. So also the source code might not always be accessible. Therefore, we simplify the problem as follows. Also, the source code might not always be accessible.
Starting point is 00:06:28 Therefore, we simplify the problem as follows. We have a data flow, which is a collection of datasets and dependencies between them, like operators and datasets or RDDs and transformations. So we look at it as only a DAG of transformations. And because we have access to the source code of Spark, we can decide which dataset to be cached and when to un-persist it because in Spark you have the full control.
Starting point is 00:06:55 You can apply your scheduling technique. So, Juggler solves the problem as follows. First of all, we instrumented Spark to get some important metrics with regards to each data set or RDD. Each data set in general has a number of computations and a cost of computation, which is the cost of transformations required to be applied. Here, the cost is the time for computation. And also, it has the size that should be considered because if it is a very big dataset and its computation time is very fast, it's not recommended to cache this dataset.
Starting point is 00:07:48 So the first thing, we get these metrics we instrumented spark to get some some metrics about each data set and our hotspot detector the first component is park starts ranking these data sets and produce schedules. Each schedule recommends various options of caching and unpersisting of datasets. And the output of this first stage is to know which dataset should be cached and when to unpersist it. And as I said, it has multiple what we call schedules, which means that it's not only one option to cache one dataset. Sometimes it's cache two, three, or more, as many as possible. So this is the first. The second stage is we need to predict the size of these datasets with regards to
Starting point is 00:08:39 submitted application parameters. Here, the use case is is because we are aware of many use cases like this application developers prepare the source of the code that serves multiple use cases so it accepts parameters by end-user and then the end-user select these application parameters and on the fly we should predict the size of datasets based on these submitted application parameters. Just simple use case. Try to imagine that it's a predicate of a filter that's submitted by end user. Then, of course, it influences the selectivity of a filter and therefore the size of the data set and the cardinality of these operators. Right. So in the next stage, based on the first one, we know what what are those data sets and then we predict the size of these data sets.
Starting point is 00:09:30 Or our jugular autonomously does all this stuff after predicting these just one stage is conducted called the memory calibration. Just briefly, each application has its own characteristic with regards to memory footprint. Here we mean the execution memory. And there is unified memory shared between execution and caching. So it's important to predict how much the application would utilize with regards to execution because if the application uses a lot of memory for execution it will evict partitions of some cache data sets which is not recommended yeah and and with this we we can know
Starting point is 00:10:18 the exact cluster size or the number of machines that fits running the application without having any evictions. The last one is the performance prediction. Also with regards to selected application parameters, it influences the time, how much it takes. So Juggler conducts multiple experiments of various selected application parameters, train models, multiple models, and using cross-validation and these techniques to select which is the most suitable model that fits well the experiments. Out of many candidate models, the performance or the execution time model is extracted,
Starting point is 00:11:10 and this is the whole life cycle of Juggler. This is conducted in an offline training phase, which requires a lot of experiments. Therefore, it should be noted that Juggler is not recommended for those applications that only run once in the cloud to be able to amortize the cost
Starting point is 00:11:32 of this offline training so it's for recurring workloads or those workloads that are run frequently with different application parameters and different data sets which is, by the way, the majority of use cases in public cloud are somehow like this.
Starting point is 00:11:51 As an end user, how do I interact with the juggler? So for an end user, end user is not aware about the terminologies of schedule, for example. What's the schedule? Okay, it caches many datasets or not. It doesn't matter for an end user who might be, by the way, not a developer. So this end user get each schedule as an offer because we have now runtime prediction. And of course, if we know the optimal cluster size, and we know how much time it
Starting point is 00:12:28 takes, you know how many machines you need and how much time it takes. So there will be each cloud provider provides this kind of calculation. And therefore, you can know the costs. So in the paper, we just simply mentioned, it's just a multiplication between the time and the number of machines. However, a juggler is also modular in the sense like any calculation could be applied. So the end user submit the application parameters. And because we already have the trained model, juggler already trained everything in the in the in the offline training phase then the end user receives the offers each schedule is an offer and each offer comes as follows number of machines and the time and the cost. Of course, for juggler omits offering those schedules or offers
Starting point is 00:13:31 that have others who gives better offers in terms of cost and latencies. Like imagine you have two schedules. Schedule number one takes one hour and two machines, sorry, one hour and, for example, costs $100. And another one takes 50 minutes, which is less, and $20. So in this case, the first schedule is not offered. Can you talk us through your evaluation of Juggler and what were the key insights from your evaluation and your analysis? So during evaluating Juggler, many things we realized. First of all, we realized that predicting the size of sets is easier than predicting the execution time.
Starting point is 00:14:27 I'm not saying it's easy. So cardinality estimation is challenging by itself. But compared to the time, it's easier. Let me give you an example. Sometimes you can conduct the same experiment two times. Always you have, if you have, for example, the same input, the same configuration, the same application, always the data sets in terms of size, they are similar. But the time varies from an experiment to another experiment.
Starting point is 00:14:57 There are many reasons for this. And our experiments are on our own perm cluster. So it's not in public cluster. So in public cluster, you have a lot of interference. You have multi-tenancy and many stuff. I'm talking about a cluster that is not in use. And also we realize the huge variances. Maybe the reason is a lot of uncertainties with regards to garbage collection, operating systems, scheduling, many, many things.
Starting point is 00:15:25 So always, and especially for short-running applications, it differs in percentage of plus 20%, 30% sometimes. That makes it quite challenging. So this is one of the insights we realized during our experiments. And also we realized that also Juggler, since the optimization of cost is the goal of juggler, we compare it with other related work. And sometimes juggler recommends those configurations
Starting point is 00:15:57 that lead to longer runs, but the cheapest. You understand my point? Which means that maybe it might not always like optimization of cost is is is an important goal that fits many use cases but it's not the only one many customers for example would would accept paying double to to to reduce the latency on that you mentioned there's other approaches and tools similar to Juggler. What are those and how do they compare? So actually, there is no tool, an end-to-end tool,
Starting point is 00:16:36 that addresses the problem from A to Z in the same context, I mean. So Juggler is the first one to do it. However, each component of Juggler has a lot of similar components that address similar problems, but for different use cases. So, for example, this hotspot detector that selects datasets or rank datasets based on this matrix has other related work like cash eviction policies. It's a different use case but also cash eviction
Starting point is 00:17:10 policies rank data sets to decide which to evict. So we compare with them. However, this comparison just informs some assumptions to make it as fair as possible.
Starting point is 00:17:27 What I would like to say here, just to be fair, it doesn't mean that the components of Juggler should replace the other components because the use case is different. And for example, in the case of Juggler, the hotspot detector was able to provide better schedules or better decisions because the metrics uh are rich we have three metrics the number of of computations the size and the time for computation while the related work always you find a related work that
Starting point is 00:17:59 has for example two metrics some of them are have only the number of computation uh which makes also for them it's it depends on the use case i mean yeah yeah something like this sure so what are the limitations of juggler you mentioned that for certain workloads ones that you run in room once it isn't worth doing the offline training phase. Are there any other limitations of juggler? Yeah, so the training phase is, as you mentioned, it's limited to recurring workloads. In our experiments, by the way, we realized in an example like PCA,
Starting point is 00:18:41 one run was enough to start gaining benefits because the improvement was huge. But this is not, let's say, it's not expected to be the generic case. In general, and the applications we have already studied, like PCA, SVM, linear regression, like their usage is immense, like a huge amount of runs per day. Therefore, maybe this could be one of the limitations of Juggler, like it's not suitable for non-recurring workloads. And also Juggler is not suitable like other frameworks to answer questions about the performance prediction
Starting point is 00:19:26 with regards to various selected configurations. What I try to say is, Juggler, for example, for this application says, four machines is the optimal, and the time is expected to be 30 minutes. Then the user cannot get any information, what would be the runtime if I run it in 10 machines, for example, which might be important for the
Starting point is 00:19:53 end user. However, the good thing that there are many other frameworks that do this, like for example, one of very important frameworks like Ernest predict these trade or what if analysis with regards to the cluster size. So if we have two machines, three machines, four machines, and so
Starting point is 00:20:15 the limitation of Juggler that they doesn't consider the cash limitation which makes their prediction not accurate if we have cash evictions. So maybe what I can say, combining both solutions or running Ernest on top of Juggler might make a lot of sense. Also, there is another work like CherryPick, for example, can predict the time in different cluster types. And this is also one of the limitations of Juggler.
Starting point is 00:20:50 Juggler, for example, to predict the runtime, the experiments for runtime prediction should be conducted from scratch if you have a new cluster or new cloud installation. While, for example, other studies like CherryPick tries to avoid conducting these experiments from scratch using features of each cluster regarding how many cores per machine or per instance and so on and avoid reconducting the whole offline training phase again for performance prediction
Starting point is 00:21:31 what I'm trying to say like there are limitations of juggler yes with this regards but also the good news that there are some other related work that continues this like juggler can be executed with harmony with other related work on this.
Starting point is 00:21:51 So who do you think will find Juggler the most useful? Who is it for? Well, I think application developers would be very satisfied if they get news like don't care anymore about caching or unpersisting. I think application developers would be very satisfied if they get news like, don't care anymore about caching or unpersisting. There is a framework that would do this for you. This would be great for these who suffer a lot. I don't expect currently.
Starting point is 00:22:18 So it's a step towards such a goal, but we are still far away. Like, just to be frank, it's not expected that today we can get this kind of reliable solution just to say, always developers do good work in general. However, they suffer because you have a lot of complex code that has, if you look at the DAG or the tree of execution,
Starting point is 00:22:47 you find a gigantic tree that has plenty of data sets and transformations between them in form of forks, joints. And it's expected that these application developers make a lot of mistakes. As I mentioned to you that they have, we already made study and there are plenty of mistakes they are doing. I think this would be good for application developers in one hand.
Starting point is 00:23:15 And on the other hand, it will be good for those who suffer from configuration of cluster, like the cluster configuration or cluster administrators, which is a nightmare for them when they have plenty of types of instances. So which type of the instance? And if we change the instance type, what will be the suitable instance size or number of machines?
Starting point is 00:23:42 It's totally tough for them. So juggler can be a step towards make the life easier for both those end users or administrators who select cluster configuration, which is a very hard task, and also for application developers who suffer from making decisions while they lack important information.
Starting point is 00:24:04 Is Juggler publicly available? Where can our listeners go and find it? Currently not, but we are working on making it public in our GitHub the next maybe few months. Fantastic. So listeners, look out for it. So my next question is, what's the most interesting but maybe unexpected lesson that you have learned while working on this topic? Well, good question.
Starting point is 00:24:34 So actually, normally, people try to predict. Then, based on their prediction, they make decisions to optimize. So the prediction is a step towards optimization in many cases. After many tries we did here in our offices in Hilmi now, we realized after years that this is not always the good way to solve the problem because we are trying to solve the easier problem using the harder one. I'm not saying it's easy.
Starting point is 00:25:15 So optimization is not easy, but prediction is harder. And we decide to flip the order by, first of all, we optimize. And then after optimization, you have a more robust performance that could be predicted. And this is one of the messages of Juggler. That's why you see the optimization takes place in the first in four stages the first three stages and the last one in the sequence is comes for uh for performance prediction and and actually the the starting point of juggler was an experiment i i i can share why i was preparing a paper i realized svmM, I was running it,
Starting point is 00:26:05 it took, when I was changing the configuration, it took two hours in one machine and four minutes in two machines. And it was shocking for me because like all the expected stuff, like if it takes, for example, 10 minutes in one machine, it's expected to be six minutes in two machines not five because it's non-linear regarding serial parts and so so the
Starting point is 00:26:31 curve is is expected to be different and this spike model uh show us uh let's say it it was very hard moment for us because my task was for prediction. How can I predict the performance? It is two hours in one machine and four minutes in two machines. And therefore, we changed the order. It's better, first of all, to know what are the poor configurations. Like one machine in this use case is poor, should not be selected. And then from two to three and four, this is where we can make prediction models that gives the tradeoff between cost and latency. While in poor configuration, you lose both cost and also the latency. So it was unexpected in the road. And now during Juggler, one of the messages
Starting point is 00:27:27 is we apply optimization that doesn't need prediction. Something like a rule-based optimization in database, for example, you push filter down, you don't need to predict the performance. Like we can apply a lot of optimization techniques without the need of prediction. And this is what we try to do also here. What's next for Juggler? What do you have planned for future research? So for the future research, actually, we would like to solve the problem,
Starting point is 00:27:59 the same problem, but in a different way. To support recurring workloads, so many techniques could be applied, like sampling and sample runs, and the challenging part, how to make them lightweight, because the sample runs are not reusable. Here, the use case could be that the application runs once. Also, we want to think about how can we improve the caching mechanisms, maybe on the fly during the application run, maybe not to rely on offline training. This is quite important and we are trying to do some research here nowadays on these directions.
Starting point is 00:28:44 Last question from me now. What attracted you to this research area? And on top of that, what do you think is the biggest challenge in this area now? Well, what attracted me actually is the numbers
Starting point is 00:28:59 we realized in our experiments when we saw how much it is severe when end users select wrong cluster configuration. And also when application developers select wrong datasets for being cached. These numbers really shocked us and they are already presented in the paper.
Starting point is 00:29:23 And in following papers, we already have different numbers that are more shocking and and this this draws our attention that there should be something to to do here like to solve this problem what about the second part of the problem question please the second part of the question was what do you think is the biggest challenge in this area now? The biggest challenge is that, first of all, the use case is complicated. You have a variety of configurations of options, for example, with regards to how to solve the problem. So you have to select which instance type. For example, in AWS, it allows you to have more than 100 different options.
Starting point is 00:30:08 Challenging. And also one of the biggest challenges is that you have different actors nowadays. You have an application developer, you have an end user. And also it differs. Like sometimes the dataset, the characteristic of the dataset, not only the size, the dataset itself plays a role. The application parameters, hybrid application parameters. Also, we mentioned hyper parameters, for example, in machine learning. that many models are required to be built on top of juggler just to avoid having wrong decisions and correct prediction of data set.
Starting point is 00:30:52 I mean, the problem is quite complicated. And the problem is few amount of research, not few in numbers, but relatively compared to how much it's serious and it's frequent maybe we need more research on this direction And we'll end it there Thanks so much Hany for coming on the show
Starting point is 00:31:13 If the listener is interested to know more about Hany's work all the links to his papers will be put in the show notes and when Juggler gets released I'll be sure to add it to the show notes and retweet it off gets released i'll be sure to add it to the show notes and retweet it off all of them disseminate social media accounts so thanks again honey yeah and thanks a lot uh for your time and uh thanks for the audience for for their listening awesome great
Starting point is 00:31:39 see you next time see you
