Disseminate: The Computer Science Research Podcast - Ahmed Sayed | REFL: Resource Efficient Federated Learning | #33

Episode Date: May 26, 2023

Summary: Federated Learning (FL) enables distributed training by learners using local data, thereby enhancing privacy and reducing communication. However, it presents numerous challenges relating to the heterogeneity of the data distribution, device capabilities, and participant availability as deployments scale, which can impact both model convergence and bias. Existing FL schemes use random participant selection to improve fairness; however, this can result in inefficient use of resources and lower-quality training. In this episode, Ahmed Sayed talks about how he and his colleagues address the question of resource efficiency in FL. He talks about the benefits of intelligent participant selection, and incorporation of updates from straggling participants. Tune in to learn more!

Links:
- EuroSys'23 Paper
- Ahmed's LinkedIn
- Ahmed's Homepage
- Ahmed's Twitter
- REFL GitHub

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, the Computer Science Research Podcast. I'm your host, Jack Waudby. Today I'm joined by Ahmed Sayed, who will be telling us everything we need to know about resource-efficient federated learning. Ahmed is an assistant professor at Queen Mary University of London. Welcome to the show, Ahmed. Yeah, hi Jack. Thanks for inviting me, and I'm very happy to be on your podcast today. Fantastic. So I've given you a very brief introduction there, but maybe you can tell us more about yourself and how you became interested in systems research. Yeah. So, again, my name is Ahmed Sayed. I'm an assistant professor at Queen Mary University of London, and I lead the SAYED Systems Research Group, where we focus on Scalable, Adaptive, Yet Efficient Distributed systems, hence the name of the group. So in terms of how I became
Starting point is 00:01:14 interested in system research, actually it is kind of in Leon during my undergrad I have been you know fascinated about you know computer systems and their architectures, actually also more on the distributed systems and computer networks. And basically, my final project was something around these topics. And we built communication systems back then to enable communication over the Internet through Bluetooth. So back in the time, it was kind of, there is no wifi, we had only Bluetooth. But in terms of building systems, it's kind of fascinating. I was fascinated by this because it's kind of challenging
Starting point is 00:01:56 and also rewarding. So I think, you know, the various system components and protocols that keep our applications live or every day's application that you use is very essential. And there's something interesting to look into how to make them efficient and scalable as the scale of our applications increase over the days. As you mentioned Bluetooth, I remember being at school and it's like sending like video music videos or music between us using bluetooth our phones and stuff yeah and then before that it was infrared as well we used to use yeah yeah these are golden age you know i kind of uh you you had that very you know very uh challenging task of you know sending even like short music that you know takes a while and and you want to make sure the connection is not lost
Starting point is 00:02:48 and you need it to stay together. Yeah, we'll be stood there all lunchtime just to send a 30 second clip to each other. Cool. Great. So today we're going to be talking about resource efficient federated learning. So maybe you can just start us off by explaining what federated learning is. Right. So in a nutshell, you know, FL has actually emerged recently. I mean, the idea of federation is not kind of new. There are federated databases and it has been there for a while.
Starting point is 00:03:18 But for machine learning, it emerged as a kind of a paradigm for allowing distributed machine learning. And it emerged as a way to address the communication issues initially. So it was introduced by Google. They had distributed learning that existed, but they weren't communication efficient. So they came up with this federated average protocol. And then this federated learning has also started to kick off and become very popular after they were introduced by Google paper, because now the privacy restrictions and regulations have been put in place over the past few years. And they had this limitation of uh you know how to train their
Starting point is 00:04:06 models on the client's data while there is these privacy restrictions so they came up with you know this model where instead of you know collecting your data and into their servers and you know risking uh you know all the lawsuits about you know leakage leakage of their data, of the client's data or user data. Then instead now they ship the model or the compute to you, and the data stays within your device for training their models, and they just send some updates based on this trend model locally. So it's kind of the interest of it has blown away over the past few years. And it's one of the highly researched area nowadays.
Starting point is 00:04:50 And its main use cases to enable privacy preserving machine learning on users decentralized data. Awesome. So you're kind of pushing the compute to the data there rather than pulling all in and doing new computation locally. Cool. So you touched on a little bit there, kind of some examples of where FL is used today. So is there any other sort of, like, can you maybe give us some more examples of companies or applications that are using federated learning today in production? Yeah, that's a good question, actually. So in federated learning, there is actually two different settings. We have a cross-device, which is the most typical setting where
Starting point is 00:05:26 you have, you know, service providers. And these are relying on, you know, user data from their mobile devices or smartware, all these kind of or IT devices. So you have large number of devices that have data and you want to train on them. So for example, in this case, I would mention Google is using this to train their Google keyboard. And they actually highlighted that they improved the model by training on users data instead of the held out data set in their servers. They improved their model by 24%, which actually is kind of significant for them. Also they use it for their Google Assistant, so to train their voice recognition task for
Starting point is 00:06:15 the Google Assistant. Also Apple uses this as well in their Siri. Siri's voice recognition is trained by using federated learning on data from the users. So these are kind of examples that I can, you know, mention in terms of cross-device. So another model is actually more related to organizations or companies, which is cross-silo. They call it cross-silo federated learning. And in and in cross silo you have organizations and each organization can be let's say hospitals or banks or companies and each of them they want to collaborate on uh you know training a global model that is kind of collaborative model that is more
Starting point is 00:06:57 performant than their own local model because usually for example in hospital data the medical images of you know users are quite uh private and and they are not allowed to share these data. So they use federated learning to train their publication, National Communications, that have gained quite some interest and citations where there was a large-scale effort deployment for tumor detection task among medical institutions and hospitals over 71 cities crossing, you know, six continents. Typically, this wouldn't be possible without this concept of federated learning, because normally data are not allowed to be, you know, shared. And, you know, when they deployed it, they actually trained very good models. And it led to, you know, actually discovery of, you know, rare cancer that didn't exist. So they used it to
Starting point is 00:08:09 kind of rare cancer boundary detection task, and they achieved quite good results. Another example would be also IBM. Also, research has recently, you know, announced a solution for, you know, money laundering or fighting money laundering. And they call this solution a private vertical FL for anti-money laundering. And I think these emerging solutions are quite nice and promotes more federated learning and practice. Yeah, it really demonstrates the power of federated learning there. I mean, I guess the ability to kind of go across political boundaries and, like, different continents and everything
Starting point is 00:08:49 and kind of having access to all that and, like, yeah, fascinating. I guess the numbers kind of speak for itself in, like, 24% increases. Yeah, exactly. It's not non-negligible, right? So, yeah, finding things like rare tumors, awesome. Cool. So let's go into the details of like how federated learning works a little bit more so can you maybe walk us through the life cycle of how i
Starting point is 00:09:11 go about um kind of training an fl model right this is actually a good question and it's important for uh the audience to understand the process so um let's say I'm going to use now actually the cross-device setting because this is a more common setting. That's where FL emerged from. So in the lifecycle for federated learning, first you want to train the model in a federated learning setting, and then the trained model is deployed for inference later on the target device. So the main focus of federated learning setting. And then the trained model is deployed for inference later
Starting point is 00:09:45 on the target device, right? So the main focus of federated learning is on the training cycle actually. So in federated learning, there is the server or the service provider who owns a model that is designed for a certain task. Let's say the task is voice recognition or next word prediction for Google keyboard or whatever okay and so they
Starting point is 00:10:05 have designed this model and they want to train it on the user data okay so we think of the users at the clients who own their smartphones or smart watches or whatever and they have data that reside on these devices right so typically what happened is that you know first the users check in with the server saying that I'm available for the training. And we say check in here, meaning that like a kind of a login process is based on the availability of the client. So usually Google doesn't train or involve users unless they are connected to a charger so that they don't drain their power and they are connected to a Wi-Fi and they are idle, not used by the user.
Starting point is 00:10:48 So to not overload the device, okay? So after these devices check in, you may have like actually millions of devices or thousands of them checking at any point, right? So now the server needs to select a subset of this. Usually the subset is not quite large because the aim actually is to train on a subset because when you train on a very large sample, you have this problem of large batch size that actually degrades the model performance. And it's hard to deal with. So they take a sub-sample with a target number of devices, let's say a hundred or maybe a thousand within one round.
Starting point is 00:11:29 And after this selection stage, which can be based on maybe random sampling or other selection algorithm, depending on the their application, they will send the task to the users. And when we say a task here, it consists of the model that they want to train, plus any configuration like the hyper that they want to train, plus any configuration
Starting point is 00:11:45 like the hyperparameters or whatever to this device. They are sent over the network to the devices. Each device start training its model on its local data, and they train it for maybe a number of local epochs. When they finish the training, each device needs to upload the model update. This is an updated model or after fine-tuning it over the local data and it's sent to the server. Then the server goes over an aggregation stage. Typically, the server sets a deadline, maybe 10 minutes, waits for 10 minutes for the clients to finish and after this 10 minutes starts aggregating. So some of these clients are able to finish in time because maybe because we here see you know the clients are you know heterogeneous in terms of you know computation and the network,
Starting point is 00:12:39 the data size or the size of the dead set they have, there are many factors that contribute to the heterogeneity and therefore not all of them are able to finish and submit the update. So some of them become a struggle and are missed out from the aggregation. Then after the server collects these updates, it does kind of apply the federated average algorithm by aggregating these updates to create a new global model and these sounds are you know
Starting point is 00:13:07 repeated over and over until you know The global model reaches a certain objective or target accuracy and by that point the model is say to be trained And it can be deployed and then for the users So as you kind of running through there, because I'm going to ask you next kind of about what are the various challenges that come with Federated Learning? You kind of touched on a few of them there, but I guess
Starting point is 00:13:32 the first thing when we were running through that lifecycle there was kind of the requirement of often being needed to be connected to Wi-Fi and being idle because otherwise if you're on a move, things can be dropping in and out, like the resilience there to sort of the clients dropping in and out, like the resilience there to sort of the kind of the clients dropping in and out, I guess it's kind of quite hard to deal with.
Starting point is 00:13:50 Yes. Yeah. Cause then you end up with lots of stragglers. I guess you kind of have that. I mean, a lot of cases prerequisite of like, yeah, the phone's going to be plugged in.
Starting point is 00:13:56 We're going to do it between midnight and 6am when the user is probably not going to be on the phone. Exactly. So when I put my iPhone in and that's on the night, that's what they're doing, right? They're doing some federated learning and it comes up and says, yeah, exactly. Cool. Yeah. So when I put my iPhone in and that's what they're doing, right? They're doing some federated learning and it comes up and says, yeah. Exactly. Yeah. Cool. Yeah. So maybe we can dive into a little bit more of the challenges in your work.
Starting point is 00:14:14 Right. So one of the major challenges I mentioned is the heterogeneity. Right. So we have various sources of heterogeneity. We have the data. I mean, each client has its own data, and these data distributions vary significantly among all the devices. So we are dealing with non-ID data setup, which is harder to train a model on. We have also the device and network heterogeneity, because, you know, you're using iPhone, I'm using maybe Samsung, we have with different computational configuration and background load. So this varies and what type of network I'm connected with and its quality.
Starting point is 00:14:53 So also we have the behavior heterogeneity, which you just mentioned is related to the availability of the devices, right? So this is very significantly and creates more problem and some cases can create like a concept drift where you are training in the night for some time zone, which is a day time zone for others. Then you are kind of biased or the model keeps shifting between one to another. So these are quite significant challenges that actually leads to the federal learning model not to be able to, until this point, to compete with a centralized trend model if you have all these data. So obviously, a centralized
Starting point is 00:15:39 model is going to be more performant, but actually now the application of centralized model is going to be more performant, but actually now the application of centralized model is prohibited more and more because of these regulations. The other issues is actually also related to, so if we highlight actually the high level issues, we have the problems with efficiency and effectiveness. So efficiency here, meaning the system efficiency and how we efficiently use the system resources or the client resources to achieve the target. And the challenges under this is like the heterogeneity and what type of optimization algorithms and whether we do multi-task learning or personalized FL or meta-learning. So now you
Starting point is 00:16:20 have various factors that leads to or feeds into the challenge of how we can make it efficient and effective training. Another main issue and actually one of the significant one is the privacy and security. So even though we say that FL is proposed for privacy preserving, I mean, by itself, it doesn't have any privacy guarantees, okay? Because there are attacks that can still restore the data of the users, and they have been applied successfully. So there are further privacy enhancing techniques that have been proposed in the literature, and there are a wide kind of research into this, like differential
Starting point is 00:17:01 privacy, homomorphic encryption, multi-party computation are some of these techniques that are being researched to enhance the privacy and security. And also you have also the malicious clients and actors and the adversarial servers. So these are more issues into the problem. Another big issue is, or if we categorize them as a big categories, the third one would be the fairness and bias. As I just mentioned, when you shift between the time zones, you have fairness issues, right?
Starting point is 00:17:36 Because the ones involved in the training are not representative of the global population, right? So there's a fairness and bias issue. And the problem is how to leverage or introduce a kind of diverse techniques that ensures a high level of diversity. The last and also one of the major challenges that are not widely researched,
Starting point is 00:18:01 there are a few research works that are coming in, is the system challenges and how to make the system or platform and deployment friendly and easy to, you know, scale with a large number of, you know, users. Until now, I don't think there is a very large scale deployment other than, you know, the ones, you know, provided by large, you know, scale providers like Google because they have control over the end device in some sense. So for other competitors, it's a bit hard to deploy this in scale.
Starting point is 00:18:35 And also there are also the system parameters tuning how to parameter various system artifacts, like the server side, there are many parameters that can control or affect the global model training, such as how many clients I select per round, how long should I wait, the reporting deadline, and there are so many things that play in and can affect the quality of the final model. So there are system research challenges here also as well. So globally, if we think about them,
Starting point is 00:19:09 we have efficiency and effectiveness, we have privacy and security, we have fairness and bias, and we have system challenges. So I think these are the kind of four categories that I think of in terms of big picture for FL challenges. It sounds a very fertile ground for research and it's so many challenges to kind of get stuck into there. Cool.
Starting point is 00:19:29 So you mentioned there that there's kind of been some of these challenges have kind of prevented FL getting near to like centralized sort of learning deployment at the moment. So you kind of mentioned in your paper that the key metric and key performance metric in FL is this time to accuracy. So can you maybe tell us kind of a little bit what this is and then what the sort of main determinants are of achieving good time to accuracy? Right. So many work actually now focus on this metric.
Starting point is 00:19:57 They announce or pronounce it as time to accuracy, and they come up with this naming. So in simple words, this means how much time you needed in terms of runtime to achieve a certain or your target accuracy. So you remember one of the objective is to train the model until it reaches a target accuracy, right? So now if you think about if you use just rounds, you know, it may not be easy to compare because around can take you know one hour or one minute right okay so to have a fair you know and practical comparison from the system point of view we look at the time because at the time is kind of invariance because i mean
Starting point is 00:20:39 when i compare a time it's a kind of a common ground that i'm comparing everyone with right so they look at minimizing the amount of time needed to achieve the highest accuracy possible or the target accuracy that we are aiming for so that means the time accurate to accuracy metric so this actually depends on two factors one of them is the statistical efficiency of your training. So statistical efficiency means by how much your model has improved within each round, right? Okay, so we are looking here at the quality improvement or the accuracy improvement, the deltas in accuracy. So this is the statistical efficiency of the training. So if you have more statistically efficient kind of algorithms, then your model will go to the
Starting point is 00:21:26 target accuracy faster, right? I mean, in terms because of the delta is higher, so you reach the target faster. The other way in looking into the time is the system efficiency. How efficient is your system? So, and it usually refers to how much each round takes you to finish. And this depends on the various system factors in terms of how fast the training is moving on and how long training takes. And this can depend on the computation of the end devices. So some work actually just select the very fast devices to participate. Or they also apply some compression techniques
Starting point is 00:22:06 to reduce the communication time. They apply many techniques to reduce the length of the round so that you achieve the target faster. Yeah, on the statistical efficiency, I guess you get marginal gains the more rounds you do, depending on what approach you use. But I guess not all rounds are equal is what i'm trying to say right so like exactly you have to fact you have to factor that in because maybe the 10th round you're only getting an incremental gain of like i don't know one percent something right and so i
Starting point is 00:22:35 guess that has to play right as well so yeah that's correct typically you know at the beginning of the training actually you you your margins are higher than the later stages of the training so you see larger improvements at the beginning then the model starts to saturate because maybe it doesn't learn more or you are not changing the learning rate there are like you know your learning schedule rate schedule whatever so there are factors like related to the learning task itself what kind of optimizer, the aggregation process, how many learners you are aggregating, the mini batch size. There are so many factors that play into this efficiency.
Starting point is 00:23:15 Cool, yeah. And another thing that jumped out to me when you were talking about system efficiency was that sometimes people just select the fast devices. But do you get any sort of biases implicitly selecting certain devices and the speed of them? Because even the user data on them may be different, right? So does it kind of not matter? I don't know.
Starting point is 00:23:33 Yeah, so this is actually one of the main motivations for our work in Riffle. Actually, we looked at the works that aim to optimize the system efficiency only without thinking about diversity. And thought about this is not right because you know you are kind of biased in a sense uh you are not improving anymore uh when uh i mean you are keep training on the fast devices and you are leaving out uh you know strugglers or slow devices that they may have valuable data. So maybe these devices, they have only a subset of the classes and you are not seeing the other classes that are of interest from the slower devices. So kind of this bias is actually problematic.
Starting point is 00:24:16 And that was the main trade-off that we looked into in Riffle. We have the system efficiency versus diversity trade-off. Right. OK, nice. Yeah, because you mentioned in the paper that you kind We have the system efficiency versus diversity trade-off. Okay, nice. Yeah, because you mentioned in the paper that you kind of optimized design for this before you call it resource to accuracy instead. So can you explain that kind of, I mean, maybe we've just touched on it there, but can you maybe go into that in a little bit more depth and why you did it? Right. So in the paper, we use a different metric instead of looking at just a time because it may not be the right metric.
Starting point is 00:24:58 So, in fact, in some cases, you want to optimize how much resources, because if you think about these users, they are kind of resources that you are leveraging and using. And it's not free resource, right? It's not owned by the service provider. It's not your resource. So typically, you would like to reduce the resource consumption of these devices as well as reducing the time. So by ignoring the consumption on these devices is not a kind of sustainable solution. Typically, if you are going to consume many resources that I paid for as a mobile user and on my smartphone, you keep consuming my resources for your FL, then I actually wouldn't be willing to participate in this. I may actually opt out because you are harming my device, you are deteriorating its performance, et cetera, et cetera, right? So we looked at the computational resources that are needed, the total, in terms of how much compute and communication time that you are using to achieve the target accuracy. And this actually can extend to the situation where you are training on battery-powered devices instead of, like, Google's assumption
Starting point is 00:26:04 that you need to connect it to a charger, then it's very important now, if you are training on battery-powered devices, to reduce the consumption on these devices in terms of compute and time. So it extends more on a wider, you know, applications and scenarios instead of just focusing on time to accuracy.
Starting point is 00:26:21 Yeah, cool. Let's maybe go on a tangent a little bit here about the opt-in, opt-out sort of thing here. So like when, as an end user, obviously because the companies are doing this, like Google are doing this, like you said, with the keyboard. When do I opt into this?
Starting point is 00:26:37 Is it like when you kind of download the app and whatever and you click terms and conditions, except I'm just signing my life away there? Well, usually there is kind of a panel that appears, usually saying that would you like to be part of the analytics, their own analytics, right? I think these are the kind of when you opt in, and actually you can later opt out if you opted in initially.
Starting point is 00:27:03 So I think in many cases cases people doesn't realize that this is happening because it happens in the background while the device is idle so uh that's why they do it and for purposefully when the device is either so that you don't realize but still they are using your resources right uh i mean uh this is something our source you paid the money for and they are they are consuming it in for their benefit yeah that's probably why the battery life on my iphone sucks um but yeah anyway cool let's let's dig into raffle some more then so maybe you can give us a high level of the overview of how it works and talk us about the architecture of raffle. Yeah, so basically, as we have
Starting point is 00:27:47 said, the main pitch for Ruffle was that heterogeneity is one of the major challenges, as I explained, and it's obvious now. And we find as we said, the state-of-the-art work addresses heterogeneity in a way that introduces a trade-off between system efficiency and the resource, resource diversity also, right? And also we found that these solutions doesn't have, you know, the resource consumption in their mind. And they never, you know, thought about how much resources I'm consuming to reach this. They care about minimizing the time from the point of view of the provider, the service provider,
Starting point is 00:28:23 because the provider wants to train the model fast, to deploy it fast, right? They never thought or took the point of view of the client side and the resources consumed by the client. And that was the main thinking was in Wifl. So we thought about how we can build actually a practical solution that still achieves the good time to accuracy metric, while also reducing the resource to accuracy metric. And that's why we came with this resource efficient federal learning system. So if we dig deep into the main components of Riffle, there are two main components. So one of them is intelligent participant selection component, or we call it IPS.
Starting point is 00:29:11 And this component aims to increase or improve the diversity of the training. So typically, as I said before, some works like ORT, that was one of the set of art that we compared with, they rely on kind of biasing the selection toward the fast clients. And it wasn't Oort only. I mean, there are other works that did this in terms of improving the time to accuracy. However, we found that this is actually harms
Starting point is 00:29:41 and is not useful in the non-ID setting and the model qualities are not good as you know claimed. So we thought about how to overcome this issue and we looked at the availability as one of the factors that have never been used as a kind of ways to you know tune the training. So we have clients that are not always available right so and if we want to diversify our training we should care about or consider how long they will stay for the training right and rationally you would normally prioritize your selection towards that the ones that you know or likely would be not available in the future okay Okay. So, so that you can capture their data and your training.
Starting point is 00:30:28 And then if they leave out the training and never come back, you still, you know, they contributed to your model. Right. So to do this, actually, we needed to introduce actually a availability prediction module on the
Starting point is 00:30:40 devices. Okay. So this is kind of a very simple time series prediction module that, you know, runs on each device. It's not on the devices. So this is kind of a very simple time series prediction module that runs on each device. It's not on the server to avoid any privacy issues. It runs locally on the device and it trains on the device patterns. So it looks at the patterns of your use of the device
Starting point is 00:31:01 and your charging and stuff like that. So we focus on the charging state as kind of prediction of your availability, because we assume that you will train when you are connected to a charger. So, and this module will tell us kind of with, for each next round, okay, if you are online, it will tell us if you're going to be online for the next round or not. Okay. And this will be sent as a probability value to the server and the server gets these values from the clients and selects the least ones as the candidate for this round because they are not likely to be available then in the next round. Right. So that's how we done the intelligent participant selection module.
Starting point is 00:31:47 And this eventually will improve your diversity of the clients and improve the model quality, especially in the non-ID setting that is the typical setting in Fiddle. So we have actually another module called stillness- aggregation so we talked about struggler right so some clients are you know can struggle and usually uh in federated average these clients are left out when you have the deadline reporting deadline if these struggles don't come in time they are left out from the aggregation okay so it's like kind of you think about this kind of wasted resource right so these clients have already, you know, spent some time training, right? And because they were unfortunate to not have a very fast device like others or the network got disconnected or whatever happened, right? Because you don't see
Starting point is 00:32:37 what's happening on their side. They were left out, okay? So normally these updates are lost. It's wasted resource, right? So we introduced a stale aggregation rule. In a sense, we allow these clients to submit their updates even at later rounds. Okay. However, the problem with this is that when you aggregate these stale updates, it can degrade your model quality, because your model has already shifted from the model they started training on and moved on to a different model, okay? So let's say you're at round X and they submit at round X plus 10, ten rounds later. The model has already
Starting point is 00:33:18 changed, right? So you have fresh updates from the clients that are training on that updated model at X plus 10, and you have this stale update from a model that is ten rounds in the past, okay? So we introduced a rule to get better effectiveness, or efficiency, from this stale aggregation, and this rule involves two factors. One of them is a dampening factor, which tries to dampen the effect of the stale update. So when we aggregate, there is a weight that is multiplied with the model update so that it doesn't have the same impact as the fresh models. So we have a dampening factor that depends on how stale you are. So the more stale the updates or the model
Starting point is 00:34:06 are, the more their effect is damped. And we have another factor called the boosting factor. We actually boost your stale update by some value based on how deviant you are from the fresh updates, okay? So if you are different, meaning that you have new knowledge or something new to contribute, okay, in the sense that this is a client that maybe had different data and will contribute something new to the model, then we give you a higher weight
Starting point is 00:34:41 and we take a weighted sum of the dampening and the boosting factors to combine them into a single weight that is applied to the stale updates. Actually, we experimented with this, did a convergence analysis, and ran an ablation study. It was hard to reach the best rule for this, but the one we use in the paper is the one we found gives the best accuracy for the model. Yeah, I have a quick question on that. How do you measure that some stale update is going to be beneficial? How do you measure its difference from the fresh ones, that it's got something good to contribute, so you boost it up? How do you measure that? Right, so we have
Starting point is 00:35:23 the fresh updates, right, or the fresh model updates. We average them, and we have this stale update, so we apply a KL divergence metric to look at their divergence. We also tried other functions, like cosine similarity, these kinds of functions that are used to find the distance between two objects, and we use that as a way to look into how deviant you are from the fresh updates. Okay, cool, cool, that makes sense. More details actually are in the paper. I mean, it's hard to explain this in a podcast, kind of, or virtually, basically.
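The staleness-aware aggregation rule described here, a dampening factor combined with a divergence-based boosting factor, can be sketched as follows. The exact forms of both factors and the balancing knob `beta` are assumptions for illustration; the paper's ablation study settled on its own rule, and the boost here uses cosine distance, one of the measures Ahmed mentions trying.

```python
import math

def dampening(staleness, alpha=0.5):
    # Decays with how many rounds old the stale update is; the
    # 1 / (1 + s)^alpha form is an illustrative choice, not the paper's.
    return 1.0 / (1.0 + staleness) ** alpha

def divergence_boost(stale_update, fresh_avg):
    # Boost grows with how much the stale update deviates from the
    # average of the fresh updates, here via cosine distance (KL
    # divergence was also tried in the paper).
    dot = sum(a * b for a, b in zip(stale_update, fresh_avg))
    na = math.sqrt(sum(a * a for a in stale_update))
    nb = math.sqrt(sum(b * b for b in fresh_avg))
    cos_sim = dot / (na * nb) if na and nb else 1.0
    return 1.0 - cos_sim  # 0 when identical, larger when more deviant

def stale_weight(staleness, stale_update, fresh_avg, beta=0.5):
    # Weighted sum of the two factors, applied to the stale update
    # during aggregation; beta is an assumed balancing knob.
    return (beta * dampening(staleness)
            + (1 - beta) * divergence_boost(stale_update, fresh_avg))

# A ten-round-old update identical to the fresh average gets only the
# damped term; a deviant one is boosted despite its staleness.
print(stale_weight(10, [1.0, 0.0], [1.0, 0.0]))
print(stale_weight(10, [0.0, 1.0], [1.0, 0.0]))
```

The intent the sketch captures is the trade-off Ahmed describes: staleness alone pushes the weight down, while genuine novelty relative to the fresh updates pushes it back up.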
Starting point is 00:36:11 This is true, yeah, yeah. So the interested listeners should go and check out the paper. Yes. For sure. Cool, so let's talk some numbers then. So can you tell us about your experiments and how intelligent participant selection and staleness-aware aggregation actually performed in practice?
Starting point is 00:36:24 Yeah, that's a good one. So in our experiments, actually, we used NVIDIA GPU clusters to train our models. So we train in a simulated environment. We actually didn't train on end devices because it's hard to run this in the wild. So what we do is emulate the clients' training on the GPUs, like V100 or A100 GPUs, and we simulate the training effects, like the client selection and the aggregation. Everything is simulated at runtime. And we use realistic computational profiles for the devices. Like, there is AI Benchmark, which
Starting point is 00:37:09 profiles various devices and gives you the actual runtime for inference and training on these devices. And there are also the MobiPerf traces, which have traces of network speeds from all over the globe, and we use these to profile devices. For availability heterogeneity, there is also a trace that we used for emulating the availability of the devices. So in total, each experiment we repeated three times and took the average,
Starting point is 00:37:38 and this actually amounted to almost 13,000 GPU hours of training, which is significant. So if I talk about numbers in terms of comparing with the state of the art: when we compare with Oort, if we run for a large number of rounds until convergence, we see that REFL achieves a model that is higher in accuracy compared to Oort by 20 points. OK, so this is actually a significant number.
Starting point is 00:38:12 I mean, if we talk about numbers, REFL achieves 60 percent accuracy, while Oort at that point achieves only 40 percent, OK, with the same resources, if we talk about the amount of resources used to reach that accuracy. And REFL also managed to achieve this within reasonable time, even lower time than Oort. Compared to another state of the art, which is SAFA: SAFA is a different kind of algorithm that allows for diversity and stale aggregation by not even having a selection. In SAFA there is no selection, everyone is selected, right? So the problem with SAFA is that the resource wastage is high, because you're selecting everyone,
Starting point is 00:38:57 and then when you do stale updates, you have a threshold. By default they use five rounds: anything staler than five rounds is discarded. So again, there's a problem here with resource wastage. So we found, when we compare with SAFA, to achieve the same target accuracy, let's say 50 percent accuracy on the Google Speech benchmark, we achieve more than a 2x reduction in resource usage, which is significant. At the same time, the runtime didn't vary much. I mean, we took 12 hours, they took 10 hours, which is not that different. But the reduction in resource usage is quite significant. And we did even larger-scale experiments compared to SAFA: when we have 3x the number of clients, the reduction is more than 5x. So when you go to larger and larger scales,
Starting point is 00:39:50 you're going to gain more in terms of reduction in resource usage. Also, we looked at the future-proofness of REFL, and we found that it's future-proof compared to Oort or other mechanisms that are similar to Oort. In this experiment, we looked at doubling the computational speed of the devices. So we took different percentages of the device population and doubled their speed. We took 25% doubling their speed, 50% doubling their speed, because the speed of devices improves over time, right? So in the future, they improve.
Starting point is 00:40:31 And also, we looked at 100%, all of them doubling their speed. And what we found is that while REFL and Oort both gain by reducing the time and the resource consumption from this doubling of the speed, the problem with Oort is that the quality doesn't improve. I mean, the quality stays the same because it's biased: it selects only the fastest devices. If you give it more fast devices, it will only target those fast devices. With REFL, no, we have diverse selection, and that's why the accuracy results keep improving as you double these speeds. Yes, on that, really quick: so when you doubled the speed, did you also then increase the variance? Like, you still have the really slow old devices,
Starting point is 00:41:20 but then you've got some devices that are now twice as fast, or was it just the average speed that increased? No, no. There is still this variance. All devices maintain their own speed. Only 25% just doubled. That makes sense. You can imagine
Starting point is 00:41:37 a future where... I guess, I don't know if this is true, but I'd expect the variance in devices to probably increase over time, maybe. I don't know, I'm just speculating. Yeah, I mean, I guess, if you think about it, some devices' improvements are significant compared to others. The improvements are usually not uniform across companies and devices, because the hardware configurations are different from company to company. Yeah, I mean, I guess not many people now are using an iPhone 3G, right?
Starting point is 00:42:17 They're kind of all gone off the market, I'd say. I guess, yeah, yeah. There's probably a lower bound on the version people are running on, or the phones people are running on. But anyway, yeah, that's right. Cool. So it sounds like an absolute win, a win on all fronts, this REFL.
Starting point is 00:42:34 So let's look at it from the other side for a moment. What are the limitations of REFL? So, yeah, REFL, I mean, like any other federated learning system, has some common limitations. And one of the key limitations is actually dealing with misinformation, when you have malicious or non-faithful clients, right? Because if you think about it, we are relying
Starting point is 00:42:55 on a prediction model that is trained on the device, right? For predicting future availability. So imagine a situation where these clients are sending wrong information. Actually, we did deal with this somewhat in REFL, by prohibiting a client that was selected for one round from participating for five rounds in the future. So in some sense, if you lie about your future availability
Starting point is 00:43:22 and say, I'm not going to be available in the next round, and you were selected, this will not help you, because you won't be selected again in the next five rounds given you were selected for this round. But because you're dealing with clients that exist in the wild, whom you don't know, and communication channels you don't control, this problem of misinformation is actually quite challenging to deal with. Okay, so this simpler solution may not actually be the best solution, but this is one of the limitations that exist. The other limitation that we have so far is actually how to, you know,
Starting point is 00:43:58 automate the fine-tuning of the various knobs that we have in this system. And this is not just for REFL; most FL systems have this problem, because you have a tremendous number of factors in play, and hyperparameters at the server side, at the client side, and in the optimization algorithm and the learning process. The multitude of these knobs that you need to tune makes it very hard compared to training in a centralized setting. So I think these are the main limitations that exist,
Starting point is 00:44:37 not just for REFL, but for other systems as well. Cool. So with those in mind, what's next on the research agenda for REFL? Well, at this point actually we are thinking of expanding REFL, because REFL is limited to only the federated learning setting. We are actually thinking about expanding it to be a major framework that can support various distributed learning techniques, such as transfer learning, multitask learning, personalized FL, and decentralized learning. And I think, until now, there is no such framework that is kind
Starting point is 00:45:17 of inclusive of all these techniques. Mainly, they are federated learning frameworks, and there are other small works on transfer learning, but there is no major framework that captures all of these. And I think this is beneficial, because it's good to have some kind of generic and inclusive framework covering the various techniques. Because in some cases, you can have transfer learning based on federated learning, right? There is some work on federated transfer learning, and multitask learning can also be applied in federated learning, so it's good to have
Starting point is 00:45:51 a framework that has these kinds of techniques. So yeah. So my next question is: as a software developer or data engineer, how do you think I can leverage the findings in your research, and what impact do you think this can have longer term? Yeah, actually, this is a hard one. But one thing that I can say is that current FL systems are not there yet. I mean, there are many frameworks coming into the picture, and they have been evolving over the past few years. There are Flower, FedML, FedScale, and FedAI. There are various frameworks coming from several startups, and I think they are
Starting point is 00:46:33 evolving over time. But I think as systems, we are not there yet. I mean, there are so many challenges, and solutions are still needed to make them practical for wide deployment. And also we have the barrier of large companies or providers having control over the end device, which actually makes deployment even harder. But I think the results we have obtained in FL, we think, are quite a major
Starting point is 00:47:07 step towards having resource-efficient federated learning, because we took the view of the client side in terms of the resources they use. And when you say to the clients, I have a resource-efficient system for training, I'm not going to impact you much, and eventually the trained model will be beneficial to you, I think more users will be willing to opt in. So I think it's one good step towards the adoption of thinking about resource efficiency and building more robust models in a sustainable way over decentralized data. Fantastic. So how long have you been working on REFL?
Starting point is 00:47:52 How long was the project? Well, I think we started early 2020. Okay, just before COVID. Yeah, exactly. I think we started around that time. I mean, my work started with distributed ML in the context of HPC and clusters.
Starting point is 00:48:14 And then we found more problems in federated learning that are interesting and worth looking into and solving. And that's why we started working on that. And I think it took more than two years to come up with the complete framework and all the experiments. You see, the experiments took quite a significant amount of time,
Starting point is 00:48:42 and we tried to look at all aspects for evaluating it. Yes, so across that period then, what's the most interesting, maybe unexpected, lesson that you learned while working on it? What caught you off guard? Yeah, well, I think systems research actually is quite hard. I guess this is my experience. It's very hard work. It takes time and persistence. And it involves many sleepless nights and bashing your head,
Starting point is 00:49:18 scratching your head and stuff like that, over bugs where you don't know what's happening. Because so many things are in play, and sometimes you get non-meaningful results and you need to find exactly where they are coming from. But I think the ultimate reward is actually quite high compared to other work, because you see your work in practice. It's kind of a living creature, something people can relate to or feel, basically, right? Compared to other kinds of work. I'm not, you know, looking down on mathematical or theoretical work; they are important as foundations, but you don't feel them. I mean, and I think it's also the open sourcing
Starting point is 00:50:09 and making your solutions available for others to use in practice. It gives you some joy and a sense of achievement, in that you contributed something to research and to the community, to other researchers as well.
Starting point is 00:50:42 So about the lessons learned from REFL: it was a hard journey, but if it turns out to be something beneficial in the future, in terms of systems that use its concepts or algorithms to make themselves more resource-efficient, that is something that will give us significant joy. Yeah. So what other research have you got going on at the minute? Is REFL sort of your main vehicle for research at the moment?
Starting point is 00:51:30 of the big picture of federated learning, but it's kind of the main theme that I'm focusing on now, yes. But there are other works. In terms of computer networks and the civil system, there are other works that I'm interested in. But the major driver, as you say, is digital learning. Cool, you've got many irons in the fire then. Cool.
Starting point is 00:51:51 So how do you go about... so this is my favorite question, I love this question. How do you go about generating ideas and then selecting which things to work on? I kind of want to know more about your creative process. Yeah, so I think this is actually quite a tough one. I don't think there's a single, universal approach,
Starting point is 00:52:13 and there is no one right approach, to be honest. But I think we usually aim to first identify the problem based on the recent line of work, right, and see what the state of the art has achieved, so that we know the point we need to start from. And we try to critically analyze these state-of-the-art solutions by, you know,
Starting point is 00:52:43 trying to maybe experimenting with them or analyzing their model or the solution they are proposing. Is it practical? Is it efficient? We look at all these, you know, aspects and dissect these solutions to find exactly what they have missed or overlooked in their solution. And before we make any claims, we try to quantify their limitations by doing experiments, so that we can motivate our research area or question that we are, you know, we're going to investigate further. And then after this, after you're identifying your research question, you have a project, then, you know, developing the solution, solution i think it's kind of falls naturally by you know uh thinking about various algorithms and techniques that can work would work out with
Starting point is 00:53:34 you know the final objective that you have identified in mind nice thanks for that it's another another good answer to that question like i said i love that question it's great to see how everyone works and approaches this thing it's cool yeah i agree with it it's tough one and i don't think uh i i my answer may be the the optimal or the best one but uh i think everyone has his own approach and i think if it's working for you then you you go for it exactly yeah there's not a one size fits all sort of approach to that thing right you've got to find what works for you that's that's the yes that's the thing exactly cool so i've just got just got two more questions now and then the penultimate one is what do you think is a very big pitch what do you think is the biggest challenge in kind of systems research federated learning today
Starting point is 00:54:17 Well, yeah, I think first and foremost, the main challenge for us in this research area is actually finding skilled and motivated researchers who work on systems research. OK, and finding the ones with the right background and experience is extremely hard, because in academia you are competing with industry, which pays quite well, and it's a very unfair kind of competition for us. It's very hard to find good researchers, especially in systems research. And the second issue with our research is having enough time, perseverance, and the right resources, because eventually you need the hardware resources to, you know, run
Starting point is 00:55:11 all of these, and you need time and, you know, perseverance. So I think these are kind of personal challenges as well: adapting yourself to the fact that it takes time. You don't want to
Starting point is 00:55:28 just rush things. You need to study the system carefully and run experiments accurately to get the correct results. So I think these are the two biggest challenges: finding the right researchers with a background in systems research, which is very hard, and having this personality, or adapting yourself to the nature of systems research. It's not the typical research where you just build a model, run it in MATLAB, get some results, and publish. It's a totally different kind of research that requires time, resources, and perseverance.
Starting point is 00:56:13 Yeah, that's nice. Cool. So yeah, last question now. What's the one takeaway you want the listener to get from this podcast today? Well, again, I'm going to reiterate that systems research is quite painful. I mean, if you are doing it, I salute you. But it's fun.
Starting point is 00:56:38 I can say it's fun because it allows you to acquire a multitude of skills and experiences. I think, eventually, when you come out of systems research, your opportunities in the market are higher, because it's kind of a unique and rare set of skills. So it's something that is going to be rewarding eventually, if you don't work in academia like myself. The second thing is that the outcomes, like papers or open-source code, that come from systems research usually have a higher impact than other types of work, because they can be adopted by systems used in big companies or by providers. And actually, if you think about it,
Starting point is 00:57:49 most of the great systems, like Hadoop and Spark, actually came out as a result of systems research within these companies. So if you think about it, most of the systems that exist today are actually a result of systems research. So I think this is the main message: systems research is actually quite great, and I hope we can find more people doing it. Fantastic, great, let's finish it there. Thanks so much for coming on the show, and for the listeners interested in knowing more about your work, we'll put links
Starting point is 00:58:24 everything in the show notes and if you enjoy the show please consider supporting the podcast through buy me a coffee and we will see you all next time for some more awesome computer science research Thank you.
