Disseminate: The Computer Science Research Podcast - Ahmed Sayed | REFL: Resource Efficient Federated Learning | #33
Episode Date: May 26, 2023. Summary: Federated Learning (FL) enables distributed training by learners using local data, thereby enhancing privacy and reducing communication. However, it presents numerous challenges relating to the heterogeneity of the data distribution, device capabilities, and participant availability as deployments scale, which can impact both model convergence and bias. Existing FL schemes use random participant selection to improve fairness; however, this can result in inefficient use of resources and lower quality training. In this episode, Ahmed Sayed talks about how he and his colleagues address the question of resource efficiency in FL. He talks about the benefits of intelligent participant selection, and incorporation of updates from straggling participants. Tune in to learn more! Links: EuroSys'23 Paper | Ahmed's LinkedIn | Ahmed's Homepage | Ahmed's Twitter | REFL GitHub. Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host Jack Wardby.
Today I'm joined by Ahmed Sayed, who will be telling us everything we need to know about resource-efficient federated learning.
Ahmed is an assistant professor at the Queen Mary University of London. Welcome to the show Ahmed.
Yeah, hi Jack. Thanks for inviting me, and I'm very happy to be on your podcast today.
Fantastic. So I've given you a very brief introduction there, but maybe you can tell us more about yourself and how you became interested in systems research.
Yeah. So, again, my name is Ahmed Sayed. I'm an assistant professor at the Queen Mary University of London.
And I lead the SAYED Systems Research Group, where we focus on scalable, adaptive,
yet efficient distributed systems, and hence the name of the group. So in terms of how I became
interested in systems research, it actually began during my undergrad. I have been,
you know, fascinated by, you know, computer systems and their architectures, and actually even more by distributed systems and computer networks.
And basically, my final project was something around these topics.
And we built communication systems back then to enable communication over the Internet through Bluetooth.
So back at the time, there was no Wi-Fi,
we had only Bluetooth.
But in terms of building systems, it's kind of fascinating.
I was fascinated by this because it's kind of challenging
and also rewarding.
So I think, you know, the various system components
and protocols that keep our everyday applications alive are very essential.
And it's interesting to look into how to make them efficient and scalable as the scale of our applications increases over time.
As you mentioned Bluetooth, I remember being at school and, like, sending music videos or music between us using Bluetooth on our phones and stuff. Yeah.
And then before that it was infrared as well, we used to use. Yeah, yeah, these were the golden age, you
know. I mean, you had that very, you know, very challenging task of, you know, sending even
a short piece of music that, you know, takes a while, and you want to make sure the connection is not lost
and you need to stay together. Yeah, we'd be stood there all lunchtime
just to send a 30-second clip to each other.
Cool. Great. So today we're going to be talking about
resource efficient federated learning. So maybe you can just start us off by
explaining what federated learning is.
Right. So in a nutshell, you know, FL has actually emerged recently.
I mean, the idea of federation is not kind of new.
There are federated databases, and they have been around for a while.
But for machine learning, it emerged as a kind of a paradigm for allowing distributed machine learning.
And it emerged as a way to address the communication issues initially. So it was introduced by Google. Distributed learning approaches existed, but they weren't
communication efficient. So they came up with this federated averaging protocol. And then this federated learning has also started
to kick off and become very popular
after it was introduced by the Google paper,
because of the privacy restrictions and regulations
that have been put in place over the past few years.
And they had this limitation of, you know, how to train their
models on the clients' data while these privacy restrictions are in place. So they came up with, you know,
this model where, instead of, you know, collecting your data into their servers and, you know,
risking, you know, all the lawsuits about leakage of the clients' data or user data,
now they ship the model, or the compute, to you,
and the data stays within your device for training their models,
and you just send some updates based on this model trained locally.
So interest in it has kind of blown up over the past few years.
And it's one of the most highly researched areas nowadays.
And its main use case is to enable privacy-preserving machine learning on users' decentralized data.
Awesome. So you're kind of pushing the compute to the data there rather than pulling it all in and doing the computation centrally.
Cool. So you touched on a little bit there, kind of some examples of where FL is used today.
So is there any other sort of, like, can you maybe give us some more examples of companies
or applications that are using federated learning today in production?
Yeah, that's a good question, actually.
So in federated learning, there is actually two different settings.
We have a cross-device, which is the most typical setting where
you have, you know, service providers. And these are relying on, you know, user data from their
mobile devices or smartwatches, all these kinds of IoT devices. So you have a large number of devices
that have data and you want to train on them. So for example, in this
case, I would mention Google is using this to train their Google keyboard. And they actually
highlighted that they improved the model by training on users' data instead of the held-out
dataset in their servers. They improved their model by 24%, which actually is kind
of significant for them.
Also they use it for their Google Assistant, so to train their voice recognition task for
the Google Assistant.
Also Apple uses this as well in their Siri.
Siri's voice recognition is trained by using federated learning on data from the users.
So these are kind of examples that I can, you know, mention in terms of cross-device. So another
model is actually more related to organizations or companies, which is cross-silo. They call it
cross-silo federated learning. And in cross-silo, you have organizations, and each
organization can be, let's say, a hospital or a bank or a company, and each of them wants to
collaborate on, you know, training a global, collaborative model that is more
performant than their own local model. Because usually, for example with hospital data, the medical
images of, you know, users are quite private, and they are not allowed to share these data.
So they use federated learning to train their models. There's actually a publication in Nature Communications, which has gained
quite some interest and citations, where there was a large-scale deployment effort for a tumor
detection task among medical institutions and hospitals over 71 sites spanning, you know, six continents. Typically, this wouldn't be possible
without this concept of federated learning, because normally the data are not allowed to be,
you know, shared. And, you know, when they deployed it, they actually trained very good
models. They used it for a rare cancer boundary detection task, and they achieved quite good results. Another example
would also be IBM. IBM Research has recently, you know, announced a solution for, you know,
money laundering or fighting money laundering.
And they call this solution a private vertical FL for anti-money laundering.
And I think these emerging solutions are quite nice and promote federated learning more in practice.
Yeah, it really demonstrates the power of federated learning there.
I mean, I guess the ability to kind of go across political boundaries
and, like, different continents and everything
and kind of having access to all that and, like, yeah, fascinating.
I guess the numbers kind of speak for themselves, like 24% increases.
Yeah, exactly.
It's non-negligible, right?
So, yeah, finding things like rare tumors, awesome.
Cool.
So let's go into the details of like how
federated learning works a little bit more. So can you maybe walk us through the lifecycle of how I
go about, um, kind of training an FL model? Right, this is actually a good question, and it's important
for uh the audience to understand the process so um let's say I'm going to use now actually the cross-device setting
because this is a more common setting.
That's where FL emerged from.
So in the lifecycle for federated learning,
first you want to train the model in a federated learning setting,
and then the trained model is deployed for inference later
on the target device, right?
So the main focus of federated learning is
on the training cycle, actually.
So in federated learning, there is the server
or the service provider who owns a model
that is designed for a certain task.
Let's say the task is voice recognition
or next word prediction for Google keyboard or whatever okay and so they
have designed this model and they want to train it on the user data okay so we think of the users
at the clients who own their smartphones or smart watches or whatever and they have data that reside
on these devices right so typically what happened is that you know first the users check in with the
server saying that I'm available for
the training. And we say check in here, meaning that like a kind of a login process is based on
the availability of the client. So usually Google doesn't train or involve users unless they are
connected to a charger so that they don't drain their power and they are connected to a Wi-Fi
and they are idle, not used by the user.
So to not overload the device, okay?
So after these devices check in,
you may have like actually millions of devices
or thousands of them checking at any point, right?
So now the server needs to select a subset of this.
Usually the subset is not quite large because the aim actually is to train on a subset because when you train on a very large sample, you have this problem of large batch size that actually degrades the model performance.
And it's hard to deal with.
So they take a sub-sample with a target number of devices, let's say a hundred or maybe a thousand within one round.
And after this selection stage,
which can be based on maybe random sampling
or other selection algorithm,
depending on their application,
they will send the task to the users.
And when we say a task here,
it consists of the model that they want to train,
plus any configuration,
like the hyperparameters or whatever, to this device. They are sent over the network to the
devices. Each device starts training its model on its local data, and they train it for maybe a
number of local epochs. When they finish the training, each device needs to upload the model update.
This is the updated model after fine-tuning it over the local data, and it's sent to the server.
Then the server goes over an aggregation stage. Typically, the server sets a deadline, maybe 10
minutes, waits for 10 minutes for the clients to finish and after this 10 minutes starts
aggregating. So some of these clients are able to finish in time, but not all, because, we see here,
you know, the clients are, you know, heterogeneous in terms of, you know, computation and the network,
the data size, or the size of the dataset they have. There are many factors that contribute to the heterogeneity,
and therefore not all of them are able to finish
and submit the update.
So some of them become stragglers
and are missed out from the aggregation.
Then after the server collects these updates,
it does kind of apply the federated averaging algorithm
by aggregating these updates to create a new global model. And these rounds are, you know,
repeated over and over until, you know,
the global model reaches a certain objective or target accuracy, and by that point the model is said to be trained,
and it can be deployed
for the users.
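To make that lifecycle concrete, here is a minimal sketch of one federated averaging round in Python. The Client class is a hypothetical stand-in simulation (not Google's or REFL's actual code); it just mimics random selection, a reporting deadline that drops stragglers, and data-size-weighted aggregation.

```python
# A rough sketch of one cross-device FedAvg round, under the assumptions
# stated above; Client is a simulation stand-in, not a real FL client.
import numpy as np

class Client:
    def __init__(self, n_samples, round_time_s):
        self.n_samples, self.round_time_s = n_samples, round_time_s

    def local_train(self, model, epochs=2):
        # Pretend local SGD: nudge the model; report data size and runtime.
        update = model + 0.01 * np.random.randn(*model.shape)
        return update, self.n_samples, self.round_time_s * epochs

def fedavg_round(global_model, checked_in, target=10, deadline_s=600):
    # 1. Select a subset of the checked-in devices (random sampling here).
    selected = np.random.choice(checked_in, size=target, replace=False)
    # 2. Ship the task and keep only updates that beat the reporting
    #    deadline; stragglers are simply dropped in plain federated averaging.
    results = [c.local_train(global_model) for c in selected]
    kept = [(u, n) for u, n, t in results if t <= deadline_s]
    if not kept:
        return global_model  # everyone straggled this round
    # 3. Aggregate: weight each surviving update by its local data size.
    total = sum(n for _, n in kept)
    return sum((n / total) * u for u, n in kept)

model = np.zeros(4)
clients = [Client(np.random.randint(50, 500), np.random.uniform(60, 900))
           for _ in range(100)]
for _ in range(3):  # repeated until a target accuracy in practice
    model = fedavg_round(model, clients)
```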
So as you were kind of running through there, because I'm going to ask you next kind of about
what are the various challenges that come with
Federated Learning? You kind of touched on a few
of them there, but I guess
the first thing when we were running through that
lifecycle there was kind of the requirement
of often being needed to be
connected to Wi-Fi and being idle because
otherwise if you're on the move, things can be
dropping in and out. Like, the resilience there to
sort of the clients dropping in and
out, I guess, is kind of quite hard to deal with.
Yes.
Yeah.
Cause then you end up with lots of stragglers.
I guess you kind of have that.
I mean,
a lot of cases prerequisite of like,
yeah,
the phone's going to be plugged in.
We're going to do it between midnight and 6am when the user is probably
not going to be on the phone.
Exactly.
So when I put my iPhone on charge at night,
that's what they're doing,
right?
They're doing some federated learning.
Exactly. Yeah. Cool. Yeah. So maybe we can dive into a little bit more of the challenges in your work.
Right. So one of the major challenges I mentioned is the heterogeneity. Right.
So we have various sources of heterogeneity. We have the data. I mean, each client has its own data,
and these data distributions vary significantly among all the devices. So we are dealing with
non-IID data setup, which is harder to train a model on. We have also the device and network
heterogeneity, because, you know, you're using an iPhone, I'm using maybe a Samsung, with different computational configurations
and background load.
So this varies and what type of network I'm connected with
and its quality.
So also we have the behavior heterogeneity,
which you just mentioned is related to
the availability of the devices, right?
So this varies significantly and creates more problems,
and in some cases can create like a concept drift, where you are training in the night for some time zone, which is daytime for another.
Then you are kind of biased, or the model keeps shifting between one and the other. So these are quite significant
challenges that actually lead to the federated learning model not being able, until this point,
to compete with a centrally trained model that has all this data. So obviously, a centralized
model is going to be more performant, but actually now the application of a centralized
model is prohibited more and more because of these regulations.
The other issues, actually, if we highlight the high-level
issues: we have the problems of efficiency and effectiveness.
So efficiency here, meaning the system efficiency and how we efficiently use the system resources
or the client resources to achieve the target.
And the challenges under this is like the heterogeneity and what type of optimization
algorithms and whether we do multi-task learning or personalized FL or meta-learning. So now you
have various factors that lead to or feed into the challenge of how we can
make training efficient and effective.
Another main issue, and actually one of the significant ones, is privacy and security.
So even though we say that FL is proposed for privacy preserving, I mean, by itself,
it doesn't have any privacy guarantees, okay?
Because there are attacks that can still reconstruct the data of the users,
and they have been applied successfully. So there are further privacy enhancing techniques that have
been proposed in the literature, and there is a wide range of research into this, like differential
privacy, homomorphic encryption, multi-party computation are some
of these techniques that are being researched to enhance the privacy and security. And also you
have also the malicious clients and actors and the adversarial servers. So these are more issues into
the problem. Another big issue, or if we categorize them as big categories,
the third one, would be fairness and bias.
As I just mentioned,
when you shift between the time zones,
you have fairness issues, right?
Because the ones involved in the training
are not representative of the global population, right?
So there's a fairness and bias issue.
And the problem is how to leverage
or introduce a kind of diverse techniques
that ensures a high level of diversity.
The last, and also one of the major challenges,
one that is not widely researched,
though there are a few research works coming in,
is the system challenges: how to make the system or platform deployment-friendly and easy to, you know,
scale with a large number of, you know, users. Until now, I don't think there is a very large
scale deployment other than, you know, the ones, you know, provided by large, you know,
scale providers like Google
because they have control over the end device in some sense.
So for other competitors,
it's a bit hard to deploy this in scale.
And there is also system parameter tuning,
how to parameterize various system artifacts,
like the server side, there are many parameters
that can control or affect the global model training, such as how many clients I select
per round, how long should I wait, the reporting deadline, and there are so many things that
play in and can affect the quality of the final model.
So there are system research challenges here also as well.
So globally, if we think about them,
we have efficiency and effectiveness,
we have privacy and security, we have fairness and bias,
and we have system challenges.
So I think these are the kind of four categories
that I think of in terms of big picture for FL challenges.
It sounds like very fertile ground for research, and there are so many challenges to kind of get
stuck into there.
Cool.
So you mentioned there that there's kind of been some of these challenges have kind of
prevented FL getting near to like centralized sort of learning deployment at the moment.
So you kind of mentioned in your paper that the key metric and key performance metric
in FL is this time to accuracy.
So can you maybe tell us kind of a little bit what this is
and then what the sort of main determinants are
of achieving good time to accuracy?
Right. So many works actually now focus on this metric.
They pronounce it as time-to-accuracy;
they came up with this naming.
So in simple words, this means how much time you needed
in terms of runtime to achieve a certain or your target accuracy. So you remember one of the
objective is to train the model until it reaches a target accuracy, right? So now if you think about
if you use just rounds, you know, it may not be easy to compare, because a round can take, you
know, one hour or one minute, right? Okay, so to have a fair, you know, and practical comparison from the
system point of view, we look at the time, because the time is kind of invariant. I mean,
when I compare on time, it's a kind of common ground that I'm comparing everyone on, right? So they look at
minimizing the amount of time needed to achieve the highest accuracy possible, or the target
accuracy that we are aiming for. So that is the time-to-accuracy metric.
so this actually depends on two factors one of them is the statistical efficiency of your training.
So statistical efficiency means by how much your model has improved within each round, right? Okay,
so we are looking here at the quality improvement or the accuracy improvement, the deltas in
accuracy. So this is the statistical efficiency of the training. So if you have more statistically
efficient kind of algorithms, then your model will go to the
target accuracy faster, right? I mean, because the delta is higher, you
reach the target faster. The other way of looking into the time is the system efficiency.
How efficient is your system? It usually refers to how long each round takes to finish.
And this depends on the various system factors in terms of how fast the training is moving
on and how long training takes.
And this can depend on the computation of the end devices.
So some work actually just select the very fast devices to participate.
Or they also apply some compression techniques
to reduce the communication time.
They apply many techniques to reduce the length of the round
so that you achieve the target faster.
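As a concrete reading of the metric, here is a minimal sketch of computing time-to-accuracy from per-round logs; the log format and field names are hypothetical, just for illustration. The same loop also tallies total client compute, which anticipates the resource-to-accuracy variant discussed later in the conversation.

```python
# A sketch of time-to-accuracy (and a resource tally) over hypothetical
# per-round logs; field names are illustrative, not from REFL's code.
def time_to_accuracy(round_logs, target_acc):
    elapsed_s, client_compute_s = 0.0, 0.0
    for r in round_logs:
        elapsed_s += r["duration_s"]               # wall-clock round length
        client_compute_s += r["client_compute_s"]  # client resources consumed
        if r["acc"] >= target_acc:
            return {"time_to_acc_s": elapsed_s,
                    "resource_to_acc_s": client_compute_s}
    return None  # the target accuracy was never reached

logs = [{"duration_s": 60, "acc": 0.41, "client_compute_s": 900},
        {"duration_s": 75, "acc": 0.52, "client_compute_s": 1100}]
print(time_to_accuracy(logs, target_acc=0.5))
```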
Yeah, on the statistical efficiency,
I guess you get marginal gains the more rounds you do,
depending on what approach you use. But I guess not all rounds are equal is what I'm trying
to say, right? So, like, exactly, you have to factor that in, because maybe by the 10th round
you're only getting an incremental gain of, like, I don't know, one percent or something, right? And so I
guess that has to play in as well. So, yeah, that's correct. Typically, you know, at the beginning of the
training, actually, your margins are higher than at the later stages of the training, so you see
larger improvements at the beginning then the model starts to saturate because
maybe it doesn't learn any more, or you are not changing the learning rate. There are, like, you
know, your learning rate schedule, whatever. So there are factors related to the learning
task itself what kind of optimizer, the aggregation process,
how many learners you are aggregating, the mini batch size.
There are so many factors that play into this efficiency.
Cool, yeah.
And another thing that jumped out to me when you were talking about system efficiency
was that sometimes people just select the fast devices.
But do you get any sort of biases implicitly selecting certain devices
and the speed of them?
Because even the user data on them may be different, right?
So does it kind of not matter?
I don't know.
Yeah, so this is actually one of the main motivations for our work in REFL.
Actually, we looked at the works that aim to optimize the system efficiency
only without thinking about diversity.
And we thought this is not right, because, you know, you are kind of biased in a sense. You are not improving anymore
when, I mean, you keep training on the fast devices and you are leaving out, you know,
stragglers or slow devices that may have valuable data. So maybe these devices have only a subset of the classes
and you are not seeing the other classes that are of interest
from the slower devices. So kind of this bias is actually problematic.
And that was the main trade-off that we looked into in REFL.
We have the system efficiency versus diversity trade-off.
Right. Okay, nice.
Yeah, because you mentioned in the paper that you kind of optimized the design for this, but you call it resource-to-accuracy instead.
So can you explain that kind of, I mean, maybe we've just touched on it there, but can you maybe go into that in a little bit more depth and why you did it?
Right.
So in the paper, we use a different metric instead of looking at just the time, because it may not be the right metric.
So, in fact, in some cases, you want to optimize how much resources, because if you think about these users, they are kind of resources that you are leveraging and using.
And it's not free resource, right? It's not owned by the service provider. It's not your resource. So typically, you would like to reduce the resource consumption of these devices as well as reducing the time.
So ignoring the consumption on these devices is not a kind of sustainable solution.
Typically, if you are going to consume many resources that I paid for as a mobile user and on my smartphone, you keep consuming my resources for your FL, then I actually wouldn't be willing to participate in this. I may actually opt out because you are harming my device, you are deteriorating its performance, et cetera, et cetera, right? So we looked at the computational resources that are needed, the total,
in terms of how much compute and communication time that you are using
to achieve the target accuracy.
And this actually can extend to the situation where you are training
on battery-powered devices instead of, like, Google's assumption
that you need to connect it to a charger,
then it's very important now,
if you are training on battery-powered devices,
to reduce the consumption on these devices
in terms of compute and time.
So it extends more on a wider, you know,
applications and scenarios
instead of just focusing on time to accuracy.
Yeah, cool.
Let's maybe go on a tangent a little bit here
about the opt-in, opt-out sort of thing here.
So like when, as an end user,
obviously because the companies are doing this,
like Google are doing this,
like you said, with the keyboard.
When do I opt into this?
Is it like when you kind of download the app
and whatever and you click terms and conditions,
except I'm just signing my life away there?
Well, usually there is kind of a panel that appears,
usually saying that would you like to be part of the analytics,
their own analytics, right?
I think these are the kind of when you opt in,
and actually you can later opt out if you opted in initially.
So I think in many cases people don't realize that
this is happening, because it happens in the background while the device is idle. So that's
why they do it, purposefully when the device is idle, so that you don't realize. But
still they are using your resources, right? I mean, this is a resource
you paid money for, and they are consuming it for their benefit. Yeah, that's
probably why the battery life on my iPhone sucks. Um, but yeah, anyway, cool. Let's dig into
REFL some more then. So maybe you can give us a high-level overview of how it works, and
talk us through the architecture of REFL. Yeah, so basically, as we have
said, the main pitch for REFL was that heterogeneity
is one of the major challenges, as I explained, and it's obvious now. And we found,
as we said, that the state-of-the-art work addresses heterogeneity
in a way that introduces a trade-off between system
efficiency and diversity, right?
And also we found that these solutions don't have, you know, resource consumption in mind.
And they never, you know, thought about how much resources I'm consuming to reach this.
They care about minimizing the time from the point of view of the provider, the service provider,
because the provider wants to train the model fast, to deploy it fast, right?
They never thought or took the point of view of the client side and the resources consumed
by the client.
And that was the main thinking in REFL.
So we thought about how we can build actually a practical solution that still achieves the good time to accuracy metric, while also reducing the
resource to accuracy metric. And that's why we came with this resource efficient federal learning
system. So if we dig deep into the main components of REFL, there are two main components.
So one of them is intelligent participant selection component, or we call it IPS.
And this component aims to increase or improve the diversity of the training.
So typically, as I said before, some works like Oort, which was one of the state of the art that we compared with,
they rely on kind of biasing the selection
toward the fast clients.
And it wasn't Oort only.
I mean, there are other works that did this
in terms of improving the time to accuracy.
However, we found that this actually harms
and is not useful in the non-IID setting,
and the model qualities are not as
good as, you know, claimed. So we thought about how to overcome this issue, and we looked at the
availability as one of the factors that has never been used as a kind of way to, you know, tune the
training. So we have clients that are not always available, right? And if we want
to diversify our training, we should care about or consider how long they will stay for the training,
right? And rationally, you would normally prioritize your selection toward the ones that, you know,
are likely to not be available in the future, okay? So that you can capture their data in your training.
And then if they leave out the training and never come back,
you still,
you know,
they contributed to your model.
Right.
So to do this,
actually,
we needed to introduce an availability prediction module on the
devices.
Okay.
So this is kind of a very simple time series prediction
module that, you know, runs on each device.
It's not on the server to avoid any privacy issues.
It runs locally on the device and it trains
on the device patterns.
So it looks at the patterns of your use of the device
and your charging and stuff like that.
So we focus on the charging state as kind of
prediction of your availability, because we assume that you will train when you are connected to a
charger. And this module will tell us, for each next round, okay, if you are online,
it will tell us if you're going to be online for the next round or not. Okay. And this will be sent as a probability
value to the server, and the server gets these values from the clients and selects the least
available ones as the candidates for this round, because they are not likely to be available in the next
round, right? So that's how we designed the intelligent participant selection module.
And this eventually will improve your diversity of the clients and improve the model quality,
especially in the non-IID setting, which is the typical setting in FL.
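As a rough illustration of that selection rule, here is a minimal sketch in Python. The report structure, the plain sort, and the cooldown bookkeeping are all illustrative assumptions, not REFL's exact logic; the cooldown echoes the five-round re-selection block Ahmed mentions later when discussing limitations.

```python
# A sketch of availability-prioritised participant selection: each client
# reports p_avail, its locally predicted probability of being available
# next round, and the server favours the least-available ones.
def select_participants(reports, current_round, last_selected,
                        target=100, cooldown=5):
    # Skip clients picked within the last `cooldown` rounds, which also
    # limits gaming of the self-reported availability predictions.
    eligible = [r for r in reports
                if current_round - last_selected.get(r["id"], -cooldown)
                >= cooldown]
    # Prioritise clients least likely to show up again, so their data is
    # captured before they disappear; this improves diversity under non-IID.
    eligible.sort(key=lambda r: r["p_avail"])
    chosen = [r["id"] for r in eligible[:target]]
    for cid in chosen:
        last_selected[cid] = current_round
    return chosen

reports = [{"id": i, "p_avail": 0.1 * (i % 10)} for i in range(1000)]
print(select_participants(reports, current_round=0, last_selected={}, target=5))
```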
So we actually have another module called staleness-aware aggregation. So we talked about stragglers, right?
So some clients, you know, can straggle, and usually in federated averaging these clients
are left out. When you have the reporting deadline, if these stragglers don't come in time,
they are left out from the aggregation, okay? So if you think about it, this is kind of
wasted resource, right? These clients have already, you know, spent some time training, right? And because they were unfortunate to not have a very fast device
like others, or the network got disconnected, or whatever happened, right? Because you don't see
what's happening on their side. They were left out, okay? So normally these updates are lost. So it's wasted resource, right?
So we introduced a stale aggregation rule.
In a sense, we allow these clients to submit their updates even at later rounds.
Okay.
However, the problem with this is that, you know, when you aggregate these stale updates,
it will degrade your model quality because, you know, your model already shifted
from the model that they started training on and has moved on to a different model, okay? So let's
say you're at round X and they submit at round X plus 10, 10 rounds later. The model has already
changed, right? So you have fresh updates from the clients that are training on that updated model at X plus 10, and you have
this stale update from a model that is 10 rounds in the past, okay? So we introduce some rules to actually
get better, you know, effectiveness or efficiency from this stale aggregation.
And this rule involves two factors. One of them is a dampening factor, which tries to dampen the effect of this stale update.
So when we aggregate, there are weights that are applied, or multiplied, to the model update
so that it doesn't have the same impact as the fresh models.
So we have a dampening factor that depends on how stale you are.
So the more stale you or the model
are, the more its effect is dampened. And we have another factor, called the boosting factor.
We actually boost your stale update by some value based on how deviant you are from the fresh
updates, okay? So if you are different, meaning that you have new knowledge
or something new to contribute, okay.
In a sense, we look at,
this is a client that had maybe different data
and this will contribute something new to the model.
So we give you a higher kind of weight
and we have the weighted sum between the dampening
and the boosting factor to, you know, combine
them into a single weight that is applied to the stale updates. Actually, we experimented with this
and had a kind of convergence analysis, and we did an ablation study. It was hard to reach
the best rule for this, but the one we're using in the paper we actually found yields the best
accuracy for the model. Yeah, I have a quick question on how you measure that some
stale updates are going to be beneficial, or how do you measure their difference, that they've got
something good to contribute, to boost them up? How do you measure that? Right, so we have
the fresh updates,
right, or the fresh model updates. We average them, right? And we have this stale update. So we
apply a kind of KL divergence metric to look at their divergence. And also we tried other,
you know, functions, like cosine similarity, these kinds of, you know, functions
that are used to find the distance between two objects. And we use that as a way to look into
how deviant you are from the fresh updates. Okay, cool, cool, that makes sense. More details actually
are in the paper i mean it's hard to explain this in a podcast,
I mean, kind of, or virtually, basically.
This is true, yeah, yeah.
So the interested listeners should go and check out the paper.
Yes.
For sure.
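For intuition, here is a minimal sketch of such a staleness-aware weight, with illustrative formulas: an inverse-staleness dampening term and a divergence-based boosting term, where cosine distance stands in for the divergence measures mentioned (KL divergence, cosine similarity). The actual rule, and its convergence analysis, is the one in the paper.

```python
# A sketch of a staleness-aware aggregation weight: dampen by staleness,
# boost by divergence from the fresh updates; formulas are illustrative.
import numpy as np

def stale_weight(staleness, stale_update, fresh_avg, beta=0.5):
    # Dampening factor: the staler the update, the smaller its influence.
    dampen = 1.0 / (1.0 + staleness)
    # Boosting factor: reward divergence from the averaged fresh updates,
    # on the idea that a deviant update carries new knowledge.
    cos = np.dot(stale_update, fresh_avg) / (
        np.linalg.norm(stale_update) * np.linalg.norm(fresh_avg))
    boost = (1.0 - cos) / 2.0  # cosine distance mapped into [0, 1]
    # A weighted sum of the two factors gives the weight applied to the
    # stale update before it joins the fresh ones in the aggregation.
    return beta * dampen + (1.0 - beta) * boost

fresh_avg = np.array([1.0, 0.0, 0.0])   # average of this round's updates
late = np.array([0.2, 0.9, 0.1])        # a deviant update, 10 rounds stale
print(stale_weight(staleness=10, stale_update=late, fresh_avg=fresh_avg))
```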
Cool, so let's talk about, let's talk some numbers then.
So can you tell us about your experiments
and how IPS and staleness-aware aggregation
actually performed in practice?
Yeah, that's a good one.
So in our experiments, actually, we used NVIDIA GPU clusters to train our models.
So we train in a simulated environment.
We actually didn't train on end devices because it's hard to train this in the wild.
So what we do is emulate the client's training on the GPUs, like V100 or A100 GPUs.
And we simulate the training effects, like the client selection, the aggregation.
Everything is kind of simulated at runtime.
And we actually use realistic computational profiles for the devices. Like, there is the AI Benchmark that, you know,
profiles various devices and gives you the actual runtime
for inference and training on these devices.
And there's also the MobiPerf trace that, you know,
has traces of, you know, network speeds from all over the globe.
And we use this to profile devices.
For heterogeneity also,
there is a trace that we have used for emulating the availability of the devices.
So in total also, each experiment we repeated three times to take the average,
and this actually amounted to almost 13,000 GPU hours of training, which is significant.
So if I talk about numbers in terms of,
comparing with the state of the art,
when we compare with Oort,
if we run for a large number of rounds,
until convergence,
we see that REFL achieves a model that is higher in accuracy compared to Oort by 20 points.
OK, so this is actually a significant number.
I mean, if we talk about numbers, REFL achieves 60 percent accuracy, while Oort at that point achieves only 40 percent.
Okay, within the same resources, right, if we
talk about the amount of resources to achieve the target accuracy, okay? And REFL
also managed to achieve this within reasonable time, even lower time than
Oort, okay? Compared to another state of the art, which is SAFA: SAFA is a kind of
different algorithm that, you know, kind of allows for diversity and stale aggregation
by not even having a selection. In SAFA there is no selection, everyone is selected, right?
So the problem with SAFA is that resource wastage is high, because you're selecting everyone.
And then when you do stale updates, you have a threshold. Like, by default they use
five rounds; anything that
is staler than five rounds is discarded. So again, there's a problem here with resource wastage. So we found,
when we compare with SAFA, to achieve the same target accuracy, let's say 50 percent accuracy on the
Google Speech benchmark, we achieve more than a 2x reduction in terms of resource usage, which is significant. At the same time, I mean,
even the time didn't vary much. I mean, we took 12 hours, they took 10 hours, which is not quite
different in terms of runtime. But the reduction in resource usage is quite significant. And if we
look at a larger scale experiment, we did even larger scale experiments compared to SAFA: when we have 3x the number of clients, the reduction is more than 5x. So when you go to larger and larger scale,
you're going to gain more in terms of reduction in terms of resource usage. Also, we looked at
the future-proofness of, you know, REFL, and we found that it's future-proof compared to Oort or
other, you know, mechanisms that are similar to Oort.
And in this experiment, we looked at doubling the computational speed of the devices.
So we took different percentages of the device population and doubled their speed.
So we took 25% doubling their speed, 50% doubling their speed, because the speeds
of devices improve over time, right?
So in the future, they improve.
And also, we looked at 100%, all of them doubling the speed.
And what we found is that while, you know, both
REFL and Oort gain by reducing the time and the resource consumption with
this doubling of the speed, in fact the problem with Oort is the quality doesn't improve. I mean,
the quality stays the same, because it's biased. They select only the fast devices; you give them
more fast devices, they will only target these fast devices. And with REFL, no, we have diverse selection,
and that's why the accuracy results keep improving when doubling these speeds.
Yes, on that really quick, so when you doubled the speed, did you also then increase the variance of like you still have the really slow old devices,
but then you've got some devices that are now twice as fast,
or was it just the average speed that
increased? No, no.
There is still this variance.
All devices, they maintain their own
speed. Only 25%
just doubled.
That makes sense. You can imagine
a future where... I guess
I know this is true, but I'd expect the
variance in devices to probably increase over
time, maybe. I don't know, I'm just kind of... Yeah, I mean, I guess, yeah, if you think about it,
you know, some devices, their improvements are, you know, significant compared to others.
Usually the improvements by, you know, companies and devices are not uniform, because the hardware configurations
are different from company to company.
Yeah, I mean, I guess not many people now are using an iPhone 3G, right?
They're kind of all gone off the market, I'd say.
I guess, yeah, yeah.
There's probably a lower bound on the versions of phones people are running on.
But anyway, yeah.
Cool.
So it sounds an absolute win, a win on all fronts, this REFL.
So let's look at it from the other side for a moment.
What are the limitations of REFL?
So, yeah, REFL, I mean, like any other federated learning system,
has common limitations.
And one of the key limitations is actually dealing with misinformation, when you have malicious
or unfaithful clients, right?
Because if you think about it, we are relying
on a prediction model that is trained on the device, right?
For predicting the future of availability.
So imagine the situation that these clients
are sending wrong information.
So actually we dealt with this somewhat in REFL
by prohibiting the clients that were selected
for one round from participating for five rounds in the future.
So in some sense, if you lie about your future availability
and say, I'm not gonna be available in the next round
and you were selected, this will not help you to be selected in the next five rounds because you
were selected for this round in some sense. But I think this misinformation because you're dealing
with clients that exist in the wild and you don't know and there's communication channels that
exist, this problem of misinformation is actually quite challenging to deal with. Okay, so this
simpler solution may not actually be the best solution, but
this is one of the limitations that exists.
The other limitation that we have so far is actually how to, you know,
automate the fine-tuning of the various knobs that we have in the
system. And this is not just for REFL;
most FL systems have this problem,
because you have a tremendous number of factors
that play in and hyperparameters at the server side,
at the client side, and the optimization algorithm
and the learning process. The multitude of these knobs that you need to tune is very hard compared
to training on a centralized setting. So I think these are the main limitations that exist,
not just for REFL, for other systems as well. Cool. So with those in mind, what's next on the
research agenda for REFL?
Well, at this point actually we are thinking of expanding REFL, because REFL is limited
to only federated learning setting.
We are actually thinking about expanding it to be a major framework that can support various
techniques or distributed learning techniques such as transfer learning, multitask learning, personalized
FL, and decentralized learning kind of techniques.
And I think, I mean, until now, there is no such framework that is, you know, kind
of inclusive of all these techniques.
Mainly they are federated learning, and there are other, you know, small works on transfer
learning.
And there is no major framework that, you know, captures all of these.
And I think this is beneficial because it's good to have some kind of a generic and inclusive framework of various techniques.
Because in some cases, you can have transfer learning based on federated learning, right?
There is some work on federated transfer learning, and multitask learning can also be applied in federated learning, so it's good to have
a framework that has these kinds of techniques. So, yeah. So my next question is, kind of, as a
software developer or data engineer, how do you think I can leverage the findings in your research, and what impact do you think this can have longer term?
Yeah, actually, this is a hard one.
But one thing that I can say actually is that, still, the current FL systems are not there yet.
I mean, there are many frameworks that are coming into picture and they are evolving over the past few years.
Like there are FLOWER, there are FedML, FedScale, and
FedAI.
There are various frameworks that are coming from several startups, and I think they are
evolving over time.
But I think as a system, we are not there yet.
I mean, there are so many challenges, and solutions needed to make it practical for wide deployment.
And also we have the barrier of large companies or providers
having control over the end device,
which actually make this deployment even harder.
But I think the results we have obtained with REFL,
actually, we think, are quite a major
step towards having, you know, resource-efficient federated learning. Because
we took the view of the client side in terms of the resources they use, and when you say to
the clients i have a resource efficient system for training, I'm not going to impact you much in
terms of the training, and eventually the trained model will be beneficial to you, I think more
users will be willing to opt in. So I think it's one good step towards the adoption of thinking
about resource efficiency and building more robust models in a sustainable way over decentralized data.
Fantastic.
So how long have you been working on REFL for?
How long was the project?
Well, I think we started early 2020.
Okay, just before COVID.
Yeah, exactly.
I think we started by that time.
I mean, my work started
with distributed ML in a
context of HPC and clusters.
And then
we found more problems
in federated learning that are interesting
and worth looking into
and solving. And that's why we started
working on that. And I think it took more than two years to come up with a complete framework
and all the experiments.
You see the experiments took quite a significant amount of time,
and we tried to look at all aspects for evaluating it.
Yes, so kind of across that period then, what's the most interesting, maybe unexpected lesson
that you learned while working on it? What caught you off guard?
Yeah, well I think system research actually is quite hard.
I guess this is my experience.
It's very hard work.
It takes some time and persistence.
And it involves many sleepless nights and bashing your head,
scratching your head and stuff like that over bugs
that you don't know what's happening.
And it's because so many things are
in play, and sometimes you get non-meaningful results and you need to find exactly
where this is coming from. But I think the ultimate reward is actually quite, you know, high
compared to other work, because you see your work in practice. It's kind of a living creature, something that, you know, people can relate to or feel, basically,
right? Compared to other kinds of work. I'm not, you know, looking down on the mathematical or
theoretical work; they are important as foundations, but you don't feel them. I mean, and I think it's also the open sourcing
and making your solutions kind of, you know,
available for others to, you know, use and practice.
It gives you some joy and, you know, kind of achievement
in terms of you contributed something to, you know,
the research community,
to other researchers as well.
So I think, on the lessons learned from REFL, it was a hard journey,
but I think now, if it is something that can be beneficial in the future,
in terms of systems that use its concepts
or algorithms to make their systems more resource efficient, this is something that will give us
significant joy. Yeah. So what other research have you kind of got going on at the
minute? Is REFL sort of your main vehicle for research at the moment? Well, so it's not actually just REFL. I mean,
we have other kind of work that's going on, but I think the main theme that we are focusing on
is still federated learning. We are looking at other directions such as battery-powered scenarios,
how to leverage techniques, as I said before, transfer learning or knowledge transfer.
So we are looking into still the main theme
of the big picture of federated learning,
but it's kind of the main theme that I'm focusing on now, yes.
But there are other works.
In terms of computer networks and distributed systems,
there are other works that I'm interested in.
But the major driver, as you say, is federated learning.
Cool, you've got many irons in the fire then.
Cool.
So how do you go about, like, so this is my favorite question.
I love this question.
How do you go about, like, generating ideas
and then selecting which things to work on?
I kind of want to know more about your creative process.
Yeah, so I think this is actually quite a tough one.
I don't think there's a single, you know,
a universal approach,
and there is no right approach, to be honest.
But I think we usually aim to first identify the problem
based on the recent line of work, right,
and see what the state of the art has achieved.
And, you know, so that we know
this is the point we need to start from, right?
And we try to critically analyze these, you know,
state of the art solutions by, you know,
trying to maybe experimenting with them or analyzing
their model or the solution they are proposing. Is it practical? Is it efficient? We look at all these,
you know, aspects and dissect these solutions to find exactly what they have missed or overlooked
in their solution. And before we make any claims, we try to quantify their limitations by doing experiments,
so that we can motivate our research area or question that we are, you know, we're going to
investigate further. And then after this, after you're identifying your research question, you
have a project, then, you know, developing the solution, I think, kind of falls naturally
by, you know, thinking about various algorithms and techniques that would work out with,
you know, the final objective that you have identified in mind. Nice, thanks for that. It's
another good answer to that question. Like I said, I love that question. It's great to see how
everyone works and approaches this thing. It's cool. Yeah, I agree, it's a tough one, and I don't think
my answer is necessarily the optimal or the best one, but I think everyone has their own approach, and
I think if it's working for you, then you go for it. Exactly, yeah, there's not a one-size-fits-all
sort of approach to that thing, right? You've got to find what works for you. Yes, that's the thing exactly. Cool, so I've just got
two more questions now. The penultimate one is a very big-picture one: what
do you think is the biggest challenge in kind of systems research, federated learning, today?
Well, yeah, I think first and foremost, actually, the main challenge for us in this research area is finding skilled and motivated researchers to work on systems research.
Okay, and finding the ones with the right background and experience is extremely hard, because, you know, you are in academia competing with industry, which pays quite well.
And, you know, it's a very unfair, you know, kind of competition for us.
It's very hard to find, you know, good researchers, especially in system research.
And the second issue, actually, you know, with our research, is to have enough time and perseverance and the right resources,
because eventually you need the hardware resources to,
you know,
run all
of these, and you need time and,
you know,
persistence.
So I think these are kind of the challenges in terms of,
you know,
personal challenges to,
you know,
adapt yourself to the fact that it takes time. You don't want to
just rush things. You need to study carefully the system and run experiments accurately to get the
correct results. So I think these are the two biggest challenges, finding the right researchers with the background
in system research is very hard.
The other challenge is having this personality or adapting yourself to the nature of system
research.
It's not the typical research that you just build a model and run it on MATLAB and get some results and publish it.
It's a totally different kind of research
that requires time, resources, and perseverance.
Yeah, that's nice.
Cool.
So yeah, last question now.
What's the one takeaway you want the listener
to get from this podcast today?
Well, again, I'm going to reiterate, you know, that systems research is quite painful.
I mean, if you are doing it, I salute you.
But it's fun.
I can say it's fun because it allows you to acquire a multitude of skills and, you know, experiences.
And, yeah, I think eventually, when you are coming
out of systems research, your opportunities in the market are actually higher, because it's a kind of unique and rare resource and skill.
So it's something that is going to be, you know, rewarding eventually, if you don't work in academia like myself. The second thing I would say is that, you know, the outcomes, like, you know,
when you have outcomes like papers or open source code, the ones that come from systems
research usually have, you know, higher impact than other types of work, because they can be adopted
by, you know, systems used in big companies or providers.
And most, actually, if you think about it,
most of the great systems like Hadoop, Spark,
all these systems actually came out as a result of systems research within these companies.
So if you think about it,
most of our systems that exist today
actually is a result of system research.
So I think this is the main message: that systems research is actually quite great, and I hope, you know, we can find more
people doing it. Fantastic, great, let's finish it there. Thanks so much for coming on
the show. And for the listeners interested in knowing more about your work, we'll put links to
everything in the show notes. And if you enjoy the show, please consider supporting the podcast
through Buy Me a Coffee, and we will see you all next time for some more awesome computer science
research. Thank you.