Disseminate: The Computer Science Research Podcast - Anastasiia Kozar | Fault Tolerance Placement in the Internet of Things | #61
Episode Date: December 16, 2024
In this episode, we chat with Anastasiia Kozar about her research on fault tolerance in resource-constrained environments. As IoT applications leverage sensors, edge devices, and cloud infrastructure, ensuring system reliability at the edge poses unique challenges. Unlike the cloud, edge devices operate without persistent backups or high availability standards, leading to increased vulnerability to failures. Anastasiia explains how traditional methods fall short, as they fail to align resource allocation with fault tolerance needs, often resulting in system underperformance. To address this, Anastasiia introduces a novel resource-aware approach that combines operator placement and fault tolerance into a unified process. By optimizing where and how data is backed up, her solution significantly improves system reliability, especially for low-end edge devices with limited resources. The result? Up to a tenfold increase in throughput compared to existing methods. Tune in to learn more!
Links:
Fault Tolerance Placement in the Internet of Things [SIGMOD'24]
The NebulaStream Platform: Data and Application Management for the Internet of Things [CIDR'20]
nebula.stream
Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to another episode of Disseminate the Computer Science Research Podcast. I'm your
host Jack Wardby. Today's episode we are going to be talking about a few things. We're going to be
talking about stream processing, the internet of things and fault tolerance and joining us in that
chat to tell us all about her awesome research is Anastasiia Kozar, who is a PhD student at the Technical University of Berlin.
Welcome to the show, Anastasia.
Thank you, Jack. It's really nice to be here. Thanks for the invitation.
Cool. Well, let's jump straight in. So can you start off by telling us your backstory,
a little bit more about yourself and how you became interested in data management research?
Yeah. So as you've mentioned, I am a fourth year PhD student
at Technical University Berlin. I completed my master's at Otto-Friedrich University of Bamberg, which is in Bavaria, Germany. And there I got inspired by my professor, Daniela Nicklas.
Big hi to Daniela. She was a professor of databases there, and I somehow got the best grade, and the lectures were so interesting. And then she also asked me if I was interested in a job, so I got invited as a research assistant to her faculty, and I started working on IoT directly, actually. We had a large-scale installation of
cattle-related IoT devices. So we were analyzing behavior of cows in Bavaria and trying to predict
what they are doing at a specific moment of time. For example, drinking, eating, sleeping.
Yeah, so that was my research. And then we had a chat with her. I told her that
after my master's, I would be interested in continuing the academic experience, or working further on that. And she provided me with a list of options of where I could do a PhD, well, specifically in Germany, to be fair. And the first one was Berlin. So I passed a few interviews and got accepted to the Technical University, and now I'm working with Professor Volker Markl. Yeah, and I'm really happy to be part of this faculty.
Awesome. So what do cows spend most of their time doing? That's the first question.
Eating.
That's something me and cows have in common, I guess I spend a lot of my time eating.
We also share something else: my master's project was also on cows. So mine was also, yeah, all kinds of cows and sheep and things, but it was to do with the spatio-temporal analysis of salmonella and anti... I can't remember the title now, antimicrobial resistance, that sort of stuff, and trying to work out how it spread. It was a long time ago now and I've forgotten most of it, but yeah, I also did my master's related to cows, so we've got something else in common there as well.
Cool.
So yeah, as I teased at the top of the show,
today's topic is stream processing, the internet of things, and fault tolerance.
So let's set a little bit more context for the chat so the listener knows what all these things are.
Tell us, yeah, what the heck is the internet of things?
And yeah, what is a stream processing engine and fault tolerance?
Yeah, introduce all these topics for us, Anastasia.
Yeah, sure. Well, so the internet of things, I think it's a really common term nowadays, so there is no really strict definition, but it's basically a bunch of devices that are interconnected by a network. It's not necessarily even the internet; they could be interconnected by some kind of local network. In most cases, they're also connected to a cloud. So it's like a unified sensor-edge-cloud network altogether, where we have really many different types of devices. There could be small
sensor devices, intermediate edge devices, large scale servers like in cloud. So that's what we mean by IoT. And what do we mean
by stream processing engine in relation to the IoT is that the sensors that are normally located
somewhere, for example geo-distributedly located somewhere in the field, they send data, they submit data at really high frequencies, and this data has to travel from the sensor devices through these edge devices to the cloud. And normally, well, in most cases, the users then talk to the cloud and can retrieve the data and work on it, process it, analyze it, and do whatever they want with the data.
Yeah, so this is a context of IoT in relation to the stream processing.
Awesome stuff.
Yeah, it's everywhere these days, IoT, right?
I mean, everything is connected to the internet.
Your fridge, it's crazy. We're getting more and more connected.
There are some crazy numbers in your paper at the very start about the volume of devices that will be connected to the internet of things. I can't remember the exact number. Maybe I've got it next to me, I can pull it up really quick: 27 billion by 2025. So yeah, it's a serious, serious volume there. Cool. So given all this context, then,
obviously you and your colleagues at TU Berlin are working on an IoT data management system that's called NebulaStream. So can you tell us a little bit more about it? This is a collaboration with a few other universities as well, I think, right, the whole project? Yeah. So tell us more about NebulaStream and what's going on there with that.
Well, so the original idea of Nebula Stream project was to process this data that is passing
from sensors towards the cloud on the way. So basically there was a paper on CIDR where we ran a couple of experiments
and found out that sending all these large amounts of data that are produced by the sensors to the cloud somehow introduces a bottleneck at the cloud when it starts processing this amount of data.
So it's sometimes profitable to push down the computation to actual smaller
devices. I can give an example so that it will be easier to get it. For instance, if we have a query
which has to filter on temperature, let's say temperature less than 22, and there are a bunch of sensors deployed, we don't need to send all the temperature values to the cloud to extract all the readings that are less than 22 degrees. We can just extract this data directly at the sensor and send only the data that we actually need. So it will reduce a lot of network usage, and it will also save some resources at the cloud, to actually enable more continuous computation.
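To make the pushdown idea concrete, here is a minimal Python sketch of the same temperature query evaluated at the sensor versus at the cloud. The function and variable names are illustrative assumptions, not NebulaStream's actual API.

```python
# Hedged sketch: a filter pushed down to the sensor, so only matching
# tuples travel over the network. Names are illustrative, not a real API.

SENSOR_READINGS = [18.5, 23.1, 21.9, 25.0, 20.4]  # hypothetical temperature stream

def filter_at_sensor(readings, threshold=22.0):
    """Filter directly on the device: only qualifying tuples are sent upstream."""
    return [r for r in readings if r < threshold]

def filter_at_cloud(readings, threshold=22.0):
    """Naive plan: ship every tuple to the cloud, then filter there."""
    shipped = list(readings)               # all five tuples cross the network
    return [r for r in shipped if r < threshold]

print(filter_at_sensor(SENSOR_READINGS))   # three tuples leave the device
print(filter_at_cloud(SENSOR_READINGS))    # same answer, but five tuples were shipped first
```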
Yeah, so you referenced a CIDR paper there.
Just for the listener, what's that paper called in case they're interested?
And I'll drop a link to it in the show notes as well.
But yeah, I mean, if you can remember what it's called.
It's actually called Nebula Stream.
But for everyone who is interested, we have a website, nebula.stream, and the paper is called The NebulaStream Platform: Data and Application Management for the Internet of Things.
Awesome stuff. So given this NebulaStream context, your research focuses specifically on fault tolerance, right? And you recently published, which is why we're chatting today, a paper at SIGMOD this year. So can you give us the overview, or the problem statement, of your actual research here and what you're actually trying to achieve?
Yeah, yeah. So basically, because we are doing processing on these smaller devices, which are also unreliable, plus the network can be unreliable, it's important
to make sure that we are handling failures in case they happen. So for instance, as you mentioned
before, we have partners. One of our partners is Charité. I don't know if you've heard of it; it's the largest research hospital in Europe. And they have their ICUs, intensive care units, and the patients that lie there are getting constantly analyzed in terms of their heart rate, their temperature, whatever vitals are necessary for the analysis. And the idea there is that if we would deploy NebulaStream there,
it's crucial for the life safety of the patient that no data is lost
and all the data is getting processed.
So my topic is about how to make sure that we don't lose any data
in case the data was set by the user to be really important and necessary. There could be many solutions; one of them I actually discuss in my paper at SIGMOD 2024.
Cool, yeah. So given this,
then tell us a little bit more about what is problematic at the moment and what you're actually trying to achieve with your research. In the sense of: okay, so fault tolerance is really, really important, we need this because obviously it can cost lives if we don't look after our data correctly, right? So how has that problem been approached in stream processing engines at the moment, and what is the problem with the way it's been done at the minute?
Yeah, so the problem is that
most of the fault-tolerance solutions
were developed for the cloud
because mostly stream processing engines
were developed for the cloud.
So they do not have this assumption
of devices being heterogeneous,
meaning that they do not consider
that resources can be limited.
They assume that we have virtually unlimited resources, as the cloud has, and also that the network is reliable, while we are dealing with an unreliable network. Yeah, these are the main restrictions that we are focusing on, and that makes those solutions inapplicable to our use case.
Yeah. Maybe it would be good to illustrate this a little. You give a really nice example at the start of your paper that puts some numbers on just how problematic this can be if we don't factor in fault tolerance. Maybe you can tell us a bit more about that experiment, to illustrate the point.
Yeah. So for the experiment that I highlighted in my paper,
I actually took a Raspberry Pi, which has a really limited amount of memory. It was one
gigabyte available. And I deployed Flink there and I started submitting queries. But first,
I started submitting stateful queries that require intermediate processing. So in terms of Flink, I computed a median, which requires additional memory. And I started submitting these queries to this Raspberry
Pi nonstop until the point where query optimizer basically told me that, okay, we are out of memory, sorry,
you cannot deploy more queries.
This was 35 queries.
And then I said, okay, well, 35 queries is not bad,
but what about if I actually start running fault tolerance in the background? So I deployed everything completely the same, but I also asked Flink to enable fault tolerance. They have a checkpointing algorithm.
And I again started submitting these queries.
And after the 27th query, I had completely killed the device; it became unresponsive.
It took me two hours to bring it back to life.
And of course, Flink hadn't noticed that something was wrong, because nowadays, unfortunately,
query optimizers do not take fault
tolerance into account. They are really well tailored for resource analysis in terms of processing, in terms of different kinds of workloads, but they do not care about fault tolerance, and they kind of think it's something that runs along in a silo and does not require analysis.
Yeah, it's quite a serious drop-off, right, going from 35 down to the mid-20s queries. It's a significant drop-off when you actually need fault tolerance. And from what you said earlier on about the intensive care example, in that case you have to have this property; it's really important, we can't lose our data. There's no point in being fast if we lose all our data, right?
Yeah, exactly, exactly.
So,
given this experiment you ran and then
how did you formulate this problem that you wanted to solve?
Well, basically, the problem
was that, well, there were two sub-problems, actually. The first one is that we deploy fault tolerance by default on every device of the system, independently of the resources, independently of the user preferences. Fault tolerance is run by default everywhere.
And the second problem: even though it's run everywhere, its cost is not getting analyzed and included in any decision making. So we know that every device is participating in an algorithm that requires additional resources, but we never analyze these resources and just assume there are enough of them. So of course it results in a big problem in systems with limited resources, especially with small devices participating.
Cool. So let's talk about the solution then, Anastasia. So yeah, how did you go about solving this problem?
So first of all, I actually look at this solution from the perspective
that we don't need fault tolerance by default being run on all the devices,
especially considering the IoT installations,
which could include thousands or millions of devices,
running fault tolerance on every device
for one query is a bit too much.
So my idea was that we allow the user to choose the importance of the data on a per-query basis. And then once the query request is received at the master node, or the coordinator node, which runs in the cloud, it analyzes all the available devices and first makes a decision on a fault tolerance basis. So before deploying the processing operators, before deploying the query, it first decides where the optimal places for fault tolerance to run are, and then deploys processing operators only on the devices that were chosen to be reliable enough and that were chosen to participate in fault tolerance.
That being said, sometimes, you know, sensors, if we need a specific sensor, like it was an example that I gave to you about the temperature.
Let's say we need a temperature at the sensor number 22. So we cannot choose there a lot.
However, there is still a path between the sensor and the cloud, which includes
possibly multiple edge devices. So not all of them can participate in fault tolerance due to several reasons.
Some of them can be out of resources. And I actually also include in my analysis the individual reliability of the device, based on the manufacturer properties of the device. So sometimes it's more profitable that a more reliable device participates, with fewer resources involved in processing. So we make a kind of smart decision, and then we allow the query optimizer to deploy
processing operators on chosen devices. So overall, we basically change the priority of
how we deploy processing operators, like stressing out that fault tolerance is more important
than the processing.
As you pointed out, that speed is not so important
if the device has failed.
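As a loose illustration of the flow just described, fault tolerance placement decided first and processing operators deployed only on the chosen devices, here is a small Python sketch. The device model, the reliability threshold, and every function name are simplifying assumptions for illustration, not the paper's actual algorithm.

```python
# Loose sketch of the "fault tolerance first, operators second" idea.
# All names, thresholds, and the greedy rule are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Device:
    name: str
    free_slots: int       # resources still available on the device
    reliability: float    # 0..1, e.g. derived from manufacturer properties

def choose_ft_devices(path, ft_slots_needed=1, min_reliability=0.9):
    """Pick devices on the sensor-to-cloud path that are reliable enough
    and still have room to run a fault tolerance task."""
    return [d for d in path
            if d.reliability >= min_reliability and d.free_slots >= ft_slots_needed]

def place_query(path, operators):
    """Decide fault tolerance participation first, then deploy processing
    operators only on the devices chosen to participate."""
    ft_devices = choose_ft_devices(path)
    if not ft_devices:
        raise RuntimeError("no device on the path can host fault tolerance")
    plan = {}
    for op in operators:
        # prefer the eligible device closest to the sensor that still has room
        target = next((d for d in ft_devices if d.free_slots > 0), None)
        if target is None:
            raise RuntimeError("the chosen devices are out of slots")
        target.free_slots -= 1
        plan[op] = target.name
    return plan

path = [Device("sensor-22", 1, 0.80),
        Device("edge-a", 3, 0.95),
        Device("cloud", 100, 0.999)]
print(place_query(path, ["filter", "sink"]))  # {'filter': 'edge-a', 'sink': 'edge-a'}
```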
Yeah, right.
Yeah, so it's kind of moving fault tolerance
up to being a first-class citizen, right,
and the way you think about your data,
thinking about your system, which is really nice.
And kind of having the ability to have users express that and capture it is really nice. And it's quite a complex optimization problem, right? So you approach tackling that optimization, figuring out where to put fault tolerance, in two, maybe three, ways, I think, actually. There's the single-objective optimization and the multi-objective optimization, and there's one that has a dynamic aspect to it as well. So give us a rundown of the different approaches we can take to solve this optimization problem.
Yeah.
So, well, the single-objective optimization is basically easy. We either fix the fault tolerance level or we fix resource utilization: we fix resources and try to maximize the reliability, so the fault tolerance, or we fix fault tolerance and then try to minimize the resource utilization, which is a standard thing. And then multi-objective
optimization is a bit more complicated. There are several approaches developed for this.
Many of them are used by query optimizer nowadays for processing operators.
So basically, my takeaway there is fault tolerance can be represented as an operator,
as just a processing operator, basically, with also its own needs, its own costs.
And then we could use any of the existing multi-objective optimization strategies
or approaches to actually make a placement decision,
which is, well, NP-hard.
So, yeah.
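To picture the multi-objective side, here is a hedged Python sketch that scores candidate placements with a weighted sum, one of the strategies Anastasiia mentions later in the episode. The weights, cost numbers, and placement names are invented for illustration; the paper's actual objective and cost model may differ.

```python
# Hedged sketch: weighted-sum scoring of candidate fault tolerance placements.
# Lower is better for both terms; all numbers are made up for illustration.

candidates = {
    # placement option: (resource_cost, failure_probability)
    "backup-on-sensor": (0.9, 0.20),
    "backup-on-edge":   (0.5, 0.05),
    "backup-on-cloud":  (0.2, 0.01),
}

def weighted_score(resource_cost, failure_prob, w_resource=0.4, w_reliability=0.6):
    # the weights encode whether the deployment is more resource bound
    # or more reliability sensitive
    return w_resource * resource_cost + w_reliability * failure_prob

for name, (cost, p_fail) in candidates.items():
    print(f"{name}: score={weighted_score(cost, p_fail):.3f}")

best = min(candidates, key=lambda name: weighted_score(*candidates[name]))
print("chosen placement:", best)
```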
Cool. I just wanted to dig into a little bit more detail.
I don't think we actually covered it off yet,
but the various different fault tolerance approaches
you actually can have in this system.
I'm not sure if we've gone into them in much detail. There's upstream backup, passive standby, and active standby. So maybe you can tell the listener about those different strategies, so they can visualize how fault tolerance is actually achieved, and make this a little bit clearer.
Yeah, yeah, sure. So basically there are three main classes of approaches, I would say. The easiest one is upstream backup, which is basically buffering: before sending the data to the next device, we just buffer the data.
And in case next device fails, we will just resend the data.
Then there is active standby, which basically sends data along parallel paths, enabling parallel processing. And in case a device on one path fails, we still ensure that data is getting delivered, because there is an alternative path running. And then there is checkpointing, which makes a snapshot of the entire data, including the state of the device, and sends that snapshot somewhere, either remotely or stored persistently; different systems implement it differently.
However, the main point here, what is more important in terms of IoT and IoT-based stream
processing systems, is that all of them need to save the data, obviously, so that it can get replayed or resent in case of failure. But we cannot store all the data because, well, streams are infinite. We have to delete it from time to time. And we cannot just blindly delete it; we cannot say, okay, please delete it every two seconds and it will be fine, because we don't know which data has reached the user.
And the data reaches the user when it reaches the cloud and cloud actually gives it away to the user. So what we need,
we need to communicate with the cloud, with the coordinator or master node, and ask the master node: okay, can you please tell me what data is safe to be deleted? This communication requires some time, there is some delay, and this results in additional costs that fault tolerance requires. So obviously, the more often we ask the master node what we can delete, the less data we store, but the more network we need, because we send this question or acknowledgement message more frequently. So there is a trade-off between resources for every fault tolerance approach that exists right now, just because of the specifics of the IoT.
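Here is a minimal Python sketch of that trade-off for the upstream backup case: the operator buffers tuples for possible replay and only trims the buffer after asking the coordinator what is safe to delete. The simulation, the ack_delay parameter, and the numbers are assumptions for illustration, not how any particular engine implements it.

```python
# Minimal sketch of upstream backup with acknowledgement-based trimming.
# A smaller trim interval holds less memory but sends more ack messages.

from collections import deque

def run_upstream_backup(stream, trim_every, ack_delay):
    """Simulate one operator: buffer each tuple for replay, and every
    `trim_every` buffered tuples ask the coordinator what can be deleted.
    `ack_delay` models tuples arriving while the request is in flight."""
    buffer = deque()
    ack_messages = 0
    peak_buffered = 0
    for seq, _tuple in enumerate(stream):
        buffer.append((seq, _tuple))          # keep for possible replay
        peak_buffered = max(peak_buffered, len(buffer))
        if len(buffer) >= trim_every:
            ack_messages += 1                 # network round trip to the coordinator
            safe_up_to = seq - ack_delay      # the reply lags behind the stream
            while buffer and buffer[0][0] <= safe_up_to:
                buffer.popleft()              # memory reclaimed
    return peak_buffered, ack_messages

stream = range(1_000)
for trim_every in (10, 100, 500):
    mem, msgs = run_upstream_backup(stream, trim_every, ack_delay=5)
    print(f"trim every {trim_every:>3} tuples -> peak buffer {mem:>3}, ack messages {msgs}")
```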
Nice. So going back to the optimization problem: we've now got these different approaches in our heads, and revisiting this resource-reliability trade-off space we want to optimize along, how do we actually go about estimating what fault tolerance costs?
Yeah.
Well, so as I mentioned,
the main hyperparameter here will be the trimming frequency: how often do we delete this data that we actually store? And nowadays, most systems leave it as a hyperparameter. They say, okay, I let the user choose how often, or it's just a constant variable that is set system-wide, which is, well, not really optimal. In my paper, I ran a couple of experiments in the evaluation part showing how important it actually is to set this parameter correctly, how crucial it is for the system, because sometimes just choosing the parameter wrong can result in the system running out of memory because of fault tolerance alone, not because of the processing.
It's crazy, right?
It's kind of like the systems, in a sense, kicking the can down the road and making it the user's problem, right? It's like: you worry about this setting, I'm not going to do that. And then they kind of absolve themselves of any responsibility for setting it properly, which is the easy way out and is not ideal.
Cool. I don't know, do we have anything else we want to dig into a little bit more on the cost estimation angle, or should we talk about the solutions and the actual placement strategies you developed?
I think, yeah, we are done there.
Basically, there are three costs that I considered.
There's processing, network, and memory.
Processing cost does not quite change.
I mean, I ran a couple of experiments,
and it didn't show any real impact there.
However, for example, in terms of active standby, because we have to process on parallel paths, we would have to deploy redundant processing operators, which will basically just double the processing cost for us.
And the network-memory trade-off I already mentioned: it comes from storing data and the requirement to delete that data periodically due to the memory constraints, because we cannot store all data since we have infinite streams.
Yeah, cool. Well, in that case, tell us about naive fault tolerance placement and multi-objective fault tolerance placement, and then the one with the hyperparameters, which I think I liked the most, because I like it when things are dynamic, right? And you can, like... oh, cool.
Sorry.
So, well, the naive one is the naive one.
We say that, well, for example,
I got inspired by the existing query optimizer strategies.
So there are multiple placement strategies there.
The first and simplest ones are heuristics-based, which say: oh, okay, we have a processing operator, it needs one slot, whatever a slot means in the
system. And we say that, well, we have a device and it has, let's say, five slots. So we know
that there are five slots. We know the query. We know the number of operators we want to deploy to the device.
So we kind of have the theoretical estimation of resources on the device. And therefore,
my naive strategy was to say, okay, well, fault tolerance also needs one slot. So here we go.
We don't only need one slot for the processing operator; we also need one slot for the fault tolerance. However, when I ran the experiment with this kind of estimation, well, so first of all,
I repeated the Raspberry Pi and Flink experiment, but with NebulaStream on the Raspberry Pi.
And in terms of NebulaStream, I was able to deploy 64 queries without fault tolerance
and 42 queries with fault tolerance.
And 42 queries again, of course killed the Raspberry Pi.
So I killed it twice.
Yes, I did.
So, with the naive estimation saying, yeah, one slot for processing, one slot for fault tolerance: since we deploy 64 queries without fault tolerance, we were able to deploy 32 queries with fault tolerance, because now instead of one slot we needed two slots every time. So we halved the number of queries to 32, but we did not fail the Raspberry Pi, which was good. However, we knew that we could actually deploy 42, because it was with 42 queries that the Raspberry Pi died the last time.
So we knew that, well, the estimation is not really precise. We could do better than that.
And then I started looking at this memory-network pattern. And I calculated that, for example, we have this hyperparameter, the trimming frequency: how often do we trim? Let's say we trim every 100 tuple buffers, or just buffers.
And let's say we know the size of the buffer
because normally we know it.
It's configuration of the system.
So we can calculate approximately
how much memory we will need at the point of time.
So for fault tolerance we will store 100 buffers. Let's say each buffer is one kilobyte, for simplicity. So we will need 100 kilobytes at any given point.
And then after that, we will send a message to the master node, and the master node will reply to us. And there is still this delay; we don't know exactly how much memory it will cost us to wait for this message to reach the master and come back.
But well, knowing the network delay and approximate number of hops, we can still
predict that and this estimation is kind of safe there. So having this estimation,
we actually reached 42 queries, but without failing the Raspberry Pi, which is good. So it was the originally targeted amount of queries, but we did not even kill the poor device, which was good. But then I still went further and I said, okay, well, actually we could notice that we are getting out of memory at some point, and we could switch to sending these messages for cleaning the memory more often. That way we will utilize more network, but we will save more memory. And having this optimization, I reached 56 queries out of the original 64, which was actually a major result in terms of the number of queries deployed.
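As a back-of-the-envelope version of the estimate she walks through, here is some illustrative Python arithmetic. The trim interval and buffer size are the rough numbers from the conversation; the in-flight allowance and the memory budget are purely hypothetical figures, and the paper's real cost model is more detailed.

```python
# Illustrative arithmetic only: memory one query needs for fault tolerance,
# then how many queries fit in an assumed fault tolerance memory budget.

BUFFER_SIZE_KB = 1          # size of one tuple buffer (from the example above)
TRIM_EVERY_BUFFERS = 100    # trimming frequency chosen for the query
INFLIGHT_BUFFERS = 10       # assumed buffers arriving while the trim
                            # request and acknowledgement are in flight

def ft_memory_per_query_kb(trim_every=TRIM_EVERY_BUFFERS,
                           inflight=INFLIGHT_BUFFERS,
                           buffer_kb=BUFFER_SIZE_KB):
    """Buffers held until the next trim, plus buffers piling up during
    the acknowledgement round trip."""
    return (trim_every + inflight) * buffer_kb

per_query_kb = ft_memory_per_query_kb()
print(f"fault tolerance memory per query: ~{per_query_kb} KB")

FT_BUDGET_KB = 8 * 1024     # hypothetical slice of the Pi's RAM reserved for backups
print("queries that fit in that budget:", FT_BUDGET_KB // per_query_kb)
```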
That's awesome. I mean, as I said at the start, they're going to stop you from buying Raspberry Pis if you keep this up.
i hope they don't have my name on there
But it's really nice to see that iterative process as well: as you went through these experiments, you were kind of like, oh, actually I can maybe tweak it this way, or I can go this way and do these little actions. Because at the end, when you just read the research paper, it reads as if you had idea one, idea two, idea three, and they all just came through at the same time. So it's nice to see that sort of iterative process play out, that you did these experiments, and seeing the gain each time must have been a nice experience for you as well.
Cool. So I guess, given that we touched a little bit there on results and a few numbers, maybe we can talk about your experiments in a little bit more depth. So yeah, overall, how did you go about evaluating all of these different approaches you developed?
Yeah, and what datasets did you use?
Who did you compare against?
Obviously, Flink and NebulaStream were used, right?
But yeah, tell us more about your experiments is the question I'm trying to get to, Anastasia.
Yeah, sure. No worries.
Well, so the first experiment I basically already brought up as an example, in terms of how I came up with the different solutions.
So that's exactly the same measurements, but for four Raspberry Pis, not for one.
So you can just multiply these values by four, and that's how you will get my first estimation.
Then I actually looked at the resource consumption to see that actually, yeah, naive
approach is not really good at resource estimation because it just says, yes, it uses some kind of
slots. Who knows what slots mean? How many slots a Raspberry Pi has? And it's a bit like, yeah,
heuristic approach. Then we actually also, in the resource analysis, I think we see how Flink failed,
why it failed. It went out of memory, which was according to the initial assumption.
Then I ran a scalability analysis. Basically, I actually use Kubernetes for that. I used Google Cloud with up to 256 devices, I think. So I basically allocated
256 Raspberry Pis and used Kubernetes and Docker image to deploy them on every instance.
I wanted actually to run 512, but then Google Cloud wrote me a sad message saying that I'm
already utilizing all the CPU of Western Europe
so they cannot give me more.
So I had to stop at 256.
Yeah.
They need bigger
data centers is what we're saying here. This is a call to
action for Google. Build more data centers.
I'm actually a bit worried
about how many companies have me on their red list of not giving me any more devices or hardware resources.
Yeah.
And, well, the analysis that I find cool is actually this hyperparameter analysis, where I tried different trimming frequencies and saw how they influence the system and what happens. Because, well, what I can say is that if we don't trim the data in time, we will just be running out of buffers,
because normally how a system works,
they allocate a pool of buffers at the launch.
And then these buffers from this pool are getting used inside the system.
So unless the buffers are getting trimmed, they stay stored in the system and cannot be used to receive new data and send it further. So if we don't trim enough in time, we can reach a point where we don't have any more buffers, but we still haven't sent the message to the master node asking to delete the buffers. So we are just stuck, constantly waiting for new buffers but never actually freeing any of them. So I would say it's a crucial parameter, and it's really important to choose it wisely.
Yeah. Did you choose
that parameter? Obviously, how is it best to make that decision? Is there a way we can pick it in a manner that will always guarantee we don't hit this problem of not having any space left to get the message back to us to then be able to trim, right?
Yeah, I actually believe that the optimal value can be calculated
using my formula of memory estimation for fault tolerance. So basically, if we take the amount of buffers plus the delay of both legs of the trimming message, that should be a really valid formula to use if we want to estimate this value.
Yeah, I just wanted to double-check my understanding as well of these fault tolerance placement schemes: is it a deterministic guarantee that we will never, ever end up being under-provisioned again, hitting an OOM or anything like that? Or is it just best effort, and we could possibly still hit an OOM?
Yeah, so there is a basic problem in the IoT
with processing guarantees.
Well, so what you've mentioned right now,
these are processing guarantees and different systems support different guarantees. For example,
the strongest one is exactly once, meaning that we do not lose any data and we do not produce any duplicates in case of a failure. Then there is at least once, which means we don't lose any data, but we might have duplicates. At most once means we may lose data, but we don't have duplicates. And then there is none.
And so the problem of IoT is that because the data is getting produced at the lowest devices,
in case the producer, the sensor device itself, fails,
there is no way independently of my fault tolerance approach that I can reproduce this data because it's actually the producer who failed.
So I say in my papers that we start working from at most once. Unfortunately, at least once
and exactly once are not achievable for us. So we have different flavors of at most once. But answering your
question, no, you cannot be sure that your data will get delivered because in case a producer
will fail, then there won't be any data to deliver. Otherwise, I mean, there will be a probability with which the data will get delivered. I create several classes, just to say there is high probability, medium probability, and low probability, so that the user wouldn't have to use, you know, a number between zero and a hundred, but has these classes instead: like, oh yeah, I really badly want my data to get delivered, so then it's high.
Cool. Yeah, I guess that's really interesting.
I guess kind of leading on from that then,
I always like to ask about limitations of various solutions.
I mean, this is a massive improvement over the state of the art, right, over just not thinking about fault tolerance at all. But I always like to ask and probe a little bit about limitations. Obviously, the fact that we can't have exactly-once semantics is out of our control; it's just a fact of life in a way, so we have to work around that. But are there any situations or scenarios in which the placement strategies end up being suboptimal?
Well, there are multiple. I mean, so basically I am not the
inventor of placement, right? I'm just using the placement algorithm and I also treat fault
tolerance as a black box. So I basically say, yeah, any fault tolerance approach is an operator
and I just deploy it. So I, you know, don't take any responsibility; I'm serving as a bridge between the query optimizer and the fault tolerance guys.
But of course, there are several sub-optimalities there.
I mean, I used in my paper, for example,
a weighted sum algorithm, which is well considered to be sub-optimal in terms of multi-objective optimizations. We have the weights, and these weights are again chosen by either the developer or the user, saying, oh, okay, our system is memory restricted,
or our system is lacking network bandwidth, or whatever. Yeah. So in this sense, we are restricted.
What else? We are also, of course, restricted in that if we just have many unreliable devices, it wouldn't actually matter where we deploy fault tolerance, because the probability will be the same everywhere.
So, yeah, I mean, there could be many examples, but the major idea, I think, is still there.
It's just trying to help avoid the problem of fault tolerance running in silos.
Yeah, for sure. What's the natural next step for your work on fault tolerance in the internet of things and, yeah, in NebulaStream?
Yeah, so as I just mentioned, in the evaluation I highlighted
that there is a problem with scalability. Did I mention that? Maybe I should mention it. I actually found out that there is a small problem with scalability with any of the approaches, just because of this trimming pattern, where we have to ask somebody and we have
to wait. So there are two problems. The first one, if we have really long paths in between the last
device and the master node, there will be really many hops to jump over. And at some point, we cannot find an optimal amount of buffers to store, because the optimal amount is less than zero, which means, obviously, we're just not managing to trim data fast enough.
And the second problem is that the master node is serving as a central unit, which is doing a lot of parallel processing. So if we also have really many paths, then the master node struggles to answer all these questions about which buffers are safe to trim. So these are the two problems, and my future vision for this problem is to
actually move from the centralized architecture in terms of fault tolerance to the decentralized one.
So it's a bit peer-to-peer inspired, where each worker takes responsibility for itself and talks to its neighbors, maybe analyzing neighborhoods. And yeah, this is also the idea of the papers that I'm currently writing.
A teaser! Yeah, for the listener: keep your eyes peeled, there's more work coming soon.
It's interesting how you situate those future directions, because as I was thinking about this, like you said, the network topology is in a way what makes scalability a problem, right? How much control do you have over the network topology and which devices are connected to each other? How much do you have to just assume that's a given, or do you have control over the whole architecture, like, okay, well, if we connect these devices, that's going to make the whole picture a lot nicer; if we can connect A and B, that's going to make all our problems go away? How much control do you have over that, or how much do you just assume that the network, the way the hardware and things are connected, is a given, and you're just going to work on top of that platform?
Yeah, so at
the moment in NebulaStream, we have a coordinator running in the cloud that has an understanding of the global topology.
And we kind of have this assumption of the worker running in isolation without knowing
the global topology. So all the optimization decisions are made by the coordinator.
It then propagates those decisions partially to the workers, and the workers then perform them; well, in our case, we are compiling the query to run it really efficiently, to utilize resources really efficiently. But yeah, so we have this isolation assumption.
Nice. It'll be really interesting to see how your work progresses in this area, and to see how NebulaStream progresses as well. Cool. Yeah, let's talk about that a little
bit more actually then about impact. And obviously, I don't know how widely used Nebula Stream is
outside of sort of as a research vehicle or if it's used at all in any sort of production setting.
But has your kind of work on fault tolerance had any impact at the moment? Or what sort of impact do you think it can maybe have in the future
and maybe be incorporated into systems like Flink?
Yeah, well, so I find NebulaStream conceptually a bit different from Flink.
So it's like one can say a next generation of the system after Flink.
And what I say is that the algorithm I develop is targeted more towards this device
setting with limited resources. So Flink is running in the cloud, so they basically don't care about
this. But potentially, I think it's a major improvement in terms of, well, not only reliability, but also of understanding the nature of fault tolerance,
understanding the costs of fault tolerance. I think that if, for example, an engineer will
have to set up a large installation of IoT devices, my paper will help them understand better how many resources are needed: to estimate, for example, how many devices are needed for the system, how many resources the system has. And knowing also the use case, for example that the installation involves sensitive or important data, we can estimate the overall amount of devices. We don't blindly buy, let's say, thousands of devices and assume that, yeah, it will be enough; we can actually perform some kind of smart analysis for this decision making.
Yeah, nice. Some data-driven decisions, right, I think is the catchphrase for it. Cool. Yeah, so I guess my next few questions are more about the non-technical stuff, the stuff that happens along
the way, the journey when you're working on a paper like this. And the first one is: whilst working, maybe not specifically on this paper but on your PhD studies at large, what's been the most interesting thing that you've learned so far on that journey?
Well, I guess I kind of enjoy working with different types of hardware.
As you might have noticed, you like to destroy different types of hardware too!
Yeah, maybe I enjoyed destroying different types of hardware a bit too much.
That's true.
Including my laptop a couple of times.
Yeah, yeah.
But to be honest, yeah, before that, I didn't have any experience with raw hardware.
I had to manually connect Raspberry Pis and create a local network.
I also got to know like Kubernetes,
Docker and all this kind of infrastructure.
So it's like a first hand experience
that I got there,
which I've never had.
And also what was interesting for me
and is still interesting
is to be part of like a larger team
where we are all researchers contributing to what we are trying to build as an industrial project
so yeah that's also something new for me and something that I experienced during my PhD
And, well, of course, the freedom of decision making. That's exactly why I came here.
And that's exactly why I'm doing my PhD,
to have this freedom and to actually have an impact
in the future somewhere, hopefully.
Yeah, I'm sure you will.
I guess going off that, let me segue in from that last bit
about having that freedom to pursue your own ideas
and to kind of get creative. And maybe you hinted at it earlier; I kind of got an idea of how you work when you said you did this experiment, then tweaked it this way, and ended up developing this solution space. But yeah, how do you go about generating ideas, and what's your creative process?
Yeah, I actually thought about that a bit. And I think that I like visiting different
talks from different people, listening to the current research. We have plenty of visitors that give keynotes or talks, also industrial partners, for example. And somehow during these talks I get extra inspired. So I keep the notes on my iPhone open and just write down some ideas for my own papers. Sometimes I actually even skip part of the talk, just because I'm so into my own ideas that I get distracted. But yeah, I would say this is my major source of motivation.
That's a nice approach.
Yeah, I've had people before who've answered this question and mentioned this idea of having a kind of breadth and borrowing ideas from seeing the way people approach problems in other domains, or even within the same space, something closely related, maybe not as far away as biology, but just a different field within computer science. And there are a lot of parallels that can be drawn between the two, and you can take ideas from those, which gives you a nice way to generate ideas. Which is cool.
And yeah, so I guess we've made our way all the way to the last word.
But before we do that, I don't know if you saw my email about the possibility of giving the listener a recommendation?
Yeah, I saw it.
This is a new feature, Anastasia, so you've got the honor of delivering the listener the first recommendation. The recommendation can be anything tech-related: a blog post, a paper, a new system, a new tool, something you've encountered recently that you thought the listener might enjoy. So yeah, what do you have for us, Anastasia?
Yeah. So first of all, please do not destroy the hardware that you're actually supposed to use for the evaluation; that's just a general recommendation. But I thought about it, and I think I would recommend trying out NebulaStream, of course, because, well, I have first-hand experience here, and we plan an open source release at SIGMOD 2025. So it should be released this year... well, next year, right? And right now it's also available, but only for research. And we also have our small community where we support people that try to do stuff with NebulaStream, so people should not be scared to use it. We will support everyone, and I highly encourage everyone to play around with it.
Fantastic. And correct me if I'm wrong, but that's quite nice, because SIGMOD is in Berlin,
right? Yep, yep that is correct
There we go
perfect alignment there
We are hosting this
year yes. Yeah so it's a celebration
Cool, so yeah, now the last word. What's the one thing you want the listener to take away from this podcast, other than your awesome recommendation? Yeah, one more thing, the last word, to finish us off.
Well, so if you're in the database community, I would highly recommend not ignoring fault tolerance; think of it beforehand, and think of it all the time, as it is really important. And of course, if you're a PhD student, during your work I would generally recommend taking the time to enjoy the research part, enjoy the freedom of the decision making and choices, and try to change something, try to change the world and impact the research community somehow.
That's, I guess, the most
pleasant
part of the PhD, in
my opinion.
Awesome. So that's a great message. I'm plus one with you on the fault tolerance; I agree with you on that one. Very important. We've got to look
after your data. It's been a really nice,
really enjoyable chat, Anastasia. It's been lovely
talking to you, and I'm sure the listener will have really enjoyed
the conversation as well. We'll put links to everything in the show notes, so the listener can go and find and play with NebulaStream and read all of Anastasia's awesome work. We'll put links to that and everything in the notes, and yeah, we'll see you all
next time for some more awesome computer science research.