Disseminate: The Computer Science Research Podcast - Anastasiia Kozar | Fault Tolerance Placement in the Internet of Things | #61
Episode Date: December 16, 2024
In this episode, we chat with Anastasiia Kozar about her research on fault tolerance in resource-constrained environments. As IoT applications leverage sensors, edge devices, and cloud infrastructure, ensuring system reliability at the edge poses unique challenges. Unlike the cloud, edge devices operate without persistent backups or high availability standards, leading to increased vulnerability to failures. Anastasiia explains how traditional methods fall short, as they fail to align resource allocation with fault tolerance needs, often resulting in system underperformance. To address this, Anastasiia introduces a novel resource-aware approach that combines operator placement and fault tolerance into a unified process. By optimizing where and how data is backed up, her solution significantly improves system reliability, especially for low-end edge devices with limited resources. The result? Up to a tenfold increase in throughput compared to existing methods. Tune in to learn more!
Links:
Fault Tolerance Placement in the Internet of Things [SIGMOD'24]
The NebulaStream Platform: Data and Application Management for the Internet of Things [CIDR'20]
nebula.stream
Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to another episode of Disseminate the Computer Science Research Podcast. I'm your
host Jack Wardby. Today's episode we are going to be talking about a few things. We're going to be
talking about stream processing, the internet of things and fault tolerance and joining us in that
chat to tell us all about her awesome research is Anastasiia Kozar, who is a PhD student at the Technical University of Berlin.
Welcome to the show, Anastasia.
Thank you, Jack. It's really nice to be here. Thanks for the invitation.
Cool. Well, let's jump straight in. So can you start off by telling us your backstory,
a little bit more about yourself and how you became interested in data management research?
Yeah. So as you've mentioned, I am a fourth year PhD student
at Technical University Berlin. I completed my master's at Otto-Friedrich University of Bamberg, which is in Bavaria, Germany. And there I got inspired by my professor, Daniela Nicklas.
Big hi to Daniela. She was a professor of databases there, and I somehow got the best grade, and the lectures were so interesting. And then she also asked me if I was interested in a job, so I got invited as a research assistant to her faculty, and I started working on IoT directly, actually. We had a large-scale installation of
cattle-related IoT devices. So we were analyzing behavior of cows in Bavaria and trying to predict
what they are doing at a specific moment of time. For example, drinking, eating, sleeping.
Yeah, so that was my research. And then we had a chat with her. I told her that
after my master's, I would be interested in continuing the academic experience, or working further on that. And she provided me with a list of options of where I could do a PhD, well, specifically in Germany, to be fair. And the first one was Berlin. So I passed a few interviews and got accepted to the Technical University, and now I'm working with Professor Volker Markl. Yeah, and I'm really happy to be part of this faculty.
Awesome. So what do cows spend most of their time doing? That's the first question.
Eating.
That's something me and cows have in common, I guess I spend a lot of my time eating.
We also share something else: my master's project was also on cows. So mine was also, yeah, all kinds of cows and sheep and things, but it was to do with the spatio-temporal analysis of salmonella and anti... I can't remember the title now, antimicrobial resistance, that sort of stuff, and trying to work out how it spread. It was a long time ago now and I've forgotten most of it, but yeah, I also did my master's related to cows, so we've got something else in common there as well.
Cool.
So yeah, as I teased at the top of the show,
today's topic is stream processing, the internet of things, and fault tolerance.
So let's set a little bit more context for the chat so the listener knows what all these things are.
Tell us, yeah, what the heck is the internet of things?
And yeah, what is a stream processing engine and fault tolerance?
Yeah, introduce all these topics for us, Anastasia.
Yeah, sure. Well, so the internet of things, I think it's a really common term nowadays, so there is no really strict definition, but it's basically a bunch of devices that are interconnected by a network. It's not necessarily even the internet; they could be interconnected by some kind of local network. In most cases, they're also connected to a cloud. So it's like a unified sensor-edge-cloud network altogether, where we have really many different types of devices. There could be small
sensor devices, intermediate edge devices, large scale servers like in cloud. So that's what we mean by IoT. And what do we mean
by stream processing engine in relation to the IoT is that the sensors that are normally located
somewhere, for example geo-distributedly located somewhere in the field, they send data, they submit data at really high frequencies, and this data has to travel from the sensor devices through these edge devices to the cloud. And normally, well, in most cases, the users then talk to the cloud and can retrieve the data and work on it, process it, analyze it, and do whatever they want with the data.
Yeah, so this is a context of IoT in relation to the stream processing.
Awesome stuff.
Yeah, it's everywhere these days, IoT, right?
I mean, everything is connected to the internet.
Your fridge, it's crazy. We're getting more and more connected.
There are some crazy numbers in your paper at the very start about the volume of devices that will be connected to the internet of things. I can't remember the exact number. Maybe I've got it next to me, I can pull it up really quick: 27 billion by 2025. So yeah, it's a serious, serious volume there. Cool. So given all this context, then,
obviously you and your colleagues at TU Berlin are working on an IoT data management system that's called NebulaStream. So can you tell us a little bit more about it? This is a collaboration with a few other universities as well, I think, right, the whole project? Yeah. So tell us more about NebulaStream and what's going on there with that.
Well, so the original idea of Nebula Stream project was to process this data that is passing
from sensors towards the cloud on the way. So basically there was a paper on CIDR where we ran a couple of experiments
and found out that sending all these large amounts of data that are produced by the sensors to the cloud somehow introduces a bottleneck at the cloud when it starts processing this amount of data.
So it's sometimes profitable to push down the computation to actual smaller
devices. I can give an example so that it will be easier to get it. For instance, if we have a query
which has to filter on temperature, let's say temperature less than 22, and there are a bunch of sensors deployed, we don't need to send all the temperature values to the cloud to extract all the readings that are less than 22 degrees. We can just extract this data directly at the sensor and send only the data that we actually need. So it will reduce a lot of network usage, and it will also save some resources at the cloud, to actually enable more continuous computation.
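To make the pushdown idea concrete, here is a minimal Python sketch of the same temperature query evaluated at the sensor versus at the cloud. The function and variable names are illustrative assumptions, not NebulaStream's actual API.

```python
# Hedged sketch: a filter pushed down to the sensor, so only matching
# tuples travel over the network. Names are illustrative, not a real API.

SENSOR_READINGS = [18.5, 23.1, 21.9, 25.0, 20.4]  # hypothetical temperature stream

def filter_at_sensor(readings, threshold=22.0):
    """Filter directly on the device: only qualifying tuples are sent upstream."""
    return [r for r in readings if r < threshold]

def filter_at_cloud(readings, threshold=22.0):
    """Naive plan: ship every tuple to the cloud, then filter there."""
    shipped = list(readings)               # all five tuples cross the network
    return [r for r in shipped if r < threshold]

print(filter_at_sensor(SENSOR_READINGS))   # three tuples leave the device
print(filter_at_cloud(SENSOR_READINGS))    # same answer, but five tuples were shipped first
```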
Yeah, so you referenced a CIDR paper there.
Just for the listener, what's that paper called in case they're interested?
And I'll drop a link to it in the show notes as well.
But yeah, I mean, if you can remember what it's called.
It's actually called Nebula Stream.
But for everyone who is interested, we have a website, nebula.stream, and the paper is called The NebulaStream Platform: Data and Application Management for the Internet of Things.
Awesome stuff. So given this NebulaStream context, your research focuses specifically on fault tolerance, right? And you recently published, which is why we're chatting today, a paper at SIGMOD this year. So can you give us the overview, or the problem statement, of your actual research here and what you're actually trying to achieve?
Yeah, yeah. So basically, because we are doing processing on these smaller devices, which are also unreliable, plus the network can be unreliable, it's important
to make sure that we are handling failures in case they happen. So for instance, as you mentioned
before, we have partners. One of our partners is Charité. I don't know if you've heard of it; it's the largest research hospital in Europe. And they have their ICUs, intensive care units, and the patients that lie there are getting constantly analyzed in terms of their heart rate, their temperature, whatever vitals are necessary for the analysis. And the idea there is that if we would deploy NebulaStream there,
it's crucial for the life safety of the patient that no data is lost
and all the data is getting processed.
So my topic is about how to make sure that we don't lose any data
in case the data was set by the user to be really important and necessary. There could be many solutions; one of them I actually discuss in my paper at SIGMOD 2024.
Cool, yeah. So given this,
then tell us a little bit more about what is problematic at the moment and what you're actually trying to achieve with your research. In the sense of: okay, so fault tolerance is really, really important, we need this because obviously it can cost lives if we don't look after our data correctly, right? So how has that problem been approached in stream processing engines at the moment, and what is the problem with the way it's been done at the minute?
Yeah, so the problem is that
most of the fault-tolerance solutions
were developed for the cloud
because mostly stream processing engines
were developed for the cloud.
So they do not have this assumption
of devices being heterogeneous,
meaning that they do not consider
that resources can be limited.
They assume that we have virtually unlimited resources, as the cloud has, and also that the network is reliable, while we are dealing with an unreliable network. Yeah, these are the main restrictions that we are focusing on, and that makes those solutions inapplicable to our use case.
Yeah. Maybe it would be good to illustrate this a little. You give a really nice example at the start of your paper that puts some numbers on just how problematic this can be if we don't factor in fault tolerance. Maybe you can tell us a bit more about that experiment, to illustrate the point.
Yeah. So for the experiment that I highlighted in my paper,
I actually took a Raspberry Pi, which has a really limited amount of memory. It was one
gigabyte available. And I deployed Flink there and I started submitting queries. But first,
I started submitting stateful queries that require intermediate processing. So in terms of Flink, I computed a median, which requires additional memory. And I started submitting these queries to this Raspberry
Pi nonstop until the point where query optimizer basically told me that, okay, we are out of memory, sorry,
you cannot deploy more queries.
This was 35 queries.
And then I said, okay, well, 35 queries is not bad,
but what about if I actually start running fault tolerance in the background? So I deployed everything completely the same, but I also asked Flink to enable fault tolerance. They have a checkpointing algorithm.
And I again started submitting these queries.
And after the 27th query, I had completely killed the device; it became unresponsive.
It took me two hours to bring it back to life.
And of course, Flink hadn't noticed that something was wrong, because nowadays, unfortunately,
query optimizers do not take fault
tolerance into account. They are really well tailored for resource analysis in terms of processing, in terms of different kinds of workloads, but they do not care about fault tolerance, and they kind of think it's something that runs along in a silo and does not require analysis.
Yeah, it's quite a serious drop-off, right, going from 35 down to the mid-20s queries. It's a significant drop-off when you actually need fault tolerance. And from what you said earlier on about the intensive care example, in that case you have to have this property; it's really important, we can't lose our data. There's no point in being fast if we lose all our data, right?
Yeah, exactly, exactly.
So,
given this experiment you ran and then
how did you formulate this problem that you wanted to solve?
Well, basically, the problem
was that, well, there were two sub-problems, actually. The first one is that we deploy fault tolerance by default on every device of the system, independently of the resources, independently of the user preferences. Fault tolerance is run by default everywhere.
And the second problem: even though it's run everywhere, its cost is not getting analyzed and included in any decision making. So we know that every device is participating in an algorithm that requires additional resources, but we never analyze these resources and just assume there are enough of them. So of course it results in a big problem in systems with limited resources, especially with small devices participating.
Cool. So let's talk about the solution then, Anastasia. So yeah, how did you go about solving this problem?
So first of all, I actually look at this solution from the perspective
that we don't need fault tolerance by default being run on all the devices,
especially considering the IoT installations,
which could include thousands or millions of devices,
running fault tolerance on every device
for one query is a bit too much.
So my idea was that we allow the user to choose the importance of the data on a per-query basis. And then once the query request is received at the master node, or the coordinator node, which runs in the cloud, it analyzes all the available devices and first makes a decision on a fault tolerance basis. So before deploying the processing operators, before deploying the query, it first decides where the optimal places for fault tolerance to run are, and then deploys processing operators only on the devices that were chosen to be reliable enough and that were chosen to participate in fault tolerance.
That being said, sometimes, you know, sensors, if we need a specific sensor, like it was an example that I gave to you about the temperature.
Let's say we need a temperature at the sensor number 22. So we cannot choose there a lot.
However, there is still a path between the sensor and the cloud, which includes
possibly multiple edge devices. So not all of them can participate in fault tolerance due to several reasons.
Some of them can be out of resources. And I actually also include in my analysis the individual reliability of the device, based on the manufacturer properties of the device. So sometimes it's more profitable that a more reliable device participates, with fewer resources involved in processing. So we make a kind of smart decision, and then we allow the query optimizer to deploy
processing operators on chosen devices. So overall, we basically change the priority of
how we deploy processing operators, like stressing out that fault tolerance is more important
than the processing.
As you pointed out, that speed is not so important
if the device has failed.
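As a loose illustration of the flow just described, fault tolerance placement decided first and processing operators deployed only on the chosen devices, here is a small Python sketch. The device model, the reliability threshold, and every function name are simplifying assumptions for illustration, not the paper's actual algorithm.

```python
# Loose sketch of the "fault tolerance first, operators second" idea.
# All names, thresholds, and the greedy rule are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Device:
    name: str
    free_slots: int       # resources still available on the device
    reliability: float    # 0..1, e.g. derived from manufacturer properties

def choose_ft_devices(path, ft_slots_needed=1, min_reliability=0.9):
    """Pick devices on the sensor-to-cloud path that are reliable enough
    and still have room to run a fault tolerance task."""
    return [d for d in path
            if d.reliability >= min_reliability and d.free_slots >= ft_slots_needed]

def place_query(path, operators):
    """Decide fault tolerance participation first, then deploy processing
    operators only on the devices chosen to participate."""
    ft_devices = choose_ft_devices(path)
    if not ft_devices:
        raise RuntimeError("no device on the path can host fault tolerance")
    plan = {}
    for op in operators:
        # prefer the eligible device closest to the sensor that still has room
        target = next((d for d in ft_devices if d.free_slots > 0), None)
        if target is None:
            raise RuntimeError("the chosen devices are out of slots")
        target.free_slots -= 1
        plan[op] = target.name
    return plan

path = [Device("sensor-22", 1, 0.80),
        Device("edge-a", 3, 0.95),
        Device("cloud", 100, 0.999)]
print(place_query(path, ["filter", "sink"]))  # {'filter': 'edge-a', 'sink': 'edge-a'}
```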
Yeah, right.
Yeah, so it's kind of moving fault tolerance
up to being a first-class citizen, right,
and the way you think about your data,
thinking about your system, which is really nice.
And kind of having the ability to have users express that and capture it is really nice. And it's quite a complex optimization problem, right? So you approach tackling that optimization, figuring out where to put fault tolerance, in two, maybe three, ways, I think, actually. There's the single-objective optimization and the multi-objective optimization, and there's one that has a dynamic aspect to it as well. So give us a rundown of the different approaches we can take to solve this optimization problem.
Yeah.
So, well, the single-objective optimization is basically easy. We either fix the fault tolerance level or we fix resource utilization: we fix resources and try to maximize the reliability, so the fault tolerance, or we fix fault tolerance and then try to minimize the resource utilization, which is a standard thing. And then multi-objective
optimization is a bit more complicated. There are several approaches developed for this.
Many of them are used by query optimizer nowadays for processing operators.
So basically, my takeaway there is fault tolerance can be represented as an operator,
as just a processing operator, basically, with also its own needs, its own costs.
And then we could use any of the existing multi-objective optimization strategies
or approaches to actually make a placement decision,
which is, well, NP-hard.
So, yeah.
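To picture the multi-objective side, here is a hedged Python sketch that scores candidate placements with a weighted sum, one of the strategies Anastasiia mentions later in the episode. The weights, cost numbers, and placement names are invented for illustration; the paper's actual objective and cost model may differ.

```python
# Hedged sketch: weighted-sum scoring of candidate fault tolerance placements.
# Lower is better for both terms; all numbers are made up for illustration.

candidates = {
    # placement option: (resource_cost, failure_probability)
    "backup-on-sensor": (0.9, 0.20),
    "backup-on-edge":   (0.5, 0.05),
    "backup-on-cloud":  (0.2, 0.01),
}

def weighted_score(resource_cost, failure_prob, w_resource=0.4, w_reliability=0.6):
    # the weights encode whether the deployment is more resource bound
    # or more reliability sensitive
    return w_resource * resource_cost + w_reliability * failure_prob

for name, (cost, p_fail) in candidates.items():
    print(f"{name}: score={weighted_score(cost, p_fail):.3f}")

best = min(candidates, key=lambda name: weighted_score(*candidates[name]))
print("chosen placement:", best)
```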
Cool. I just wanted to dig into a little bit more detail.
I don't think we actually covered it off yet,
but the various different fault tolerance approaches
you actually can have in this system.
I'm not sure if we've gone into them in much detail. There's upstream backup, passive standby, and active standby. So maybe you can tell the listener about those different strategies, so they can visualize how fault tolerance is actually achieved, and make this a little bit clearer.
Yeah, yeah, sure. So basically there are three main classes of approaches, I would say. The easiest one is upstream backup, which is basically buffering: before sending the data to the next device, we just buffer the data.
And in case next device fails, we will just resend the data.
Then there is active standby, which basically sends data along parallel paths, enabling parallel processing. And in case a device on one path fails, we still ensure that data is getting delivered, because there is an alternative path running. And then there is checkpointing, which makes a snapshot of the entire data, including the state of the device, and sends that snapshot somewhere, either remotely or stored persistently; different systems implement it differently.
However, the main point here, what is more important in terms of IoT and IoT-based stream
processing systems, is that all of them need to save the data, obviously, so that it can get replayed or resent in case of failure. But we cannot store all the data because, well, streams are infinite. We have to delete it from time to time. And we cannot just blindly delete it; we cannot say, okay, please delete it every two seconds and it will be fine, because we don't know which data has reached the user.
And the data reaches the user when it reaches the cloud and cloud actually gives it away to the user. So what we need,
we need to communicate with the cloud, with the coordinator or master node, and ask the master node: okay, can you please tell me what data is safe to be deleted? This communication requires some time, there is some delay, and this results in additional costs that fault tolerance requires. So obviously, the more often we ask the master node what we can delete, the less data we store, but the more network we need, because we send this question or acknowledgement message more frequently. So there is a trade-off between resources for every fault tolerance approach that exists right now, just because of the specifics of the IoT.
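Here is a minimal Python sketch of that trade-off for the upstream backup case: the operator buffers tuples for possible replay and only trims the buffer after asking the coordinator what is safe to delete. The simulation, the ack_delay parameter, and the numbers are assumptions for illustration, not how any particular engine implements it.

```python
# Minimal sketch of upstream backup with acknowledgement-based trimming.
# A smaller trim interval holds less memory but sends more ack messages.

from collections import deque

def run_upstream_backup(stream, trim_every, ack_delay):
    """Simulate one operator: buffer each tuple for replay, and every
    `trim_every` buffered tuples ask the coordinator what can be deleted.
    `ack_delay` models tuples arriving while the request is in flight."""
    buffer = deque()
    ack_messages = 0
    peak_buffered = 0
    for seq, _tuple in enumerate(stream):
        buffer.append((seq, _tuple))          # keep for possible replay
        peak_buffered = max(peak_buffered, len(buffer))
        if len(buffer) >= trim_every:
            ack_messages += 1                 # network round trip to the coordinator
            safe_up_to = seq - ack_delay      # the reply lags behind the stream
            while buffer and buffer[0][0] <= safe_up_to:
                buffer.popleft()              # memory reclaimed
    return peak_buffered, ack_messages

stream = range(1_000)
for trim_every in (10, 100, 500):
    mem, msgs = run_upstream_backup(stream, trim_every, ack_delay=5)
    print(f"trim every {trim_every:>3} tuples -> peak buffer {mem:>3}, ack messages {msgs}")
```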
Nice. So going back to the optimization problem: we've now got these different approaches in our heads, and revisiting this resource-reliability trade-off space we want to optimize along, how do we actually go about estimating what fault tolerance costs?
Yeah.
Well, so as I mentioned,
the main hyperparameter here will be the trimming frequency: how often do we delete this data that we actually store? And nowadays, most systems leave it as a hyperparameter. They say, okay, I let the user choose how often, or it's just a constant variable that is set system-wide, which is, well, not really optimal. In my paper, I ran a couple of experiments in the evaluation part showing how important it actually is to set this parameter correctly, how crucial it is for the system, because sometimes just choosing the parameter wrong can result in the system running out of memory because of fault tolerance alone, not because of the processing.
It's crazy, right?
It's kind of like the systems, in a sense, kicking the can down the road and making it the user's problem, right? It's like: you worry about this setting, I'm not going to do that. And then they kind of absolve themselves of any responsibility for setting it properly, which is the easy way out and is not ideal.
Cool. I don't know, do we have anything else we want to dig into a little bit more on the cost estimation angle, or should we talk about the solutions and the actual placement strategies you developed?
I think, yeah, we are done there.
Basically, there are three costs that I considered.
There's processing, network, and memory.
Processing cost does not quite change.
I mean, I ran a couple of experiments,
and it didn't show any real impact there.
However, for example, in terms of active standby, because we have to process on parallel paths, we would have to deploy redundant processing operators, which will basically just double the processing cost for us.
And the network-memory trade-off I already mentioned: it comes from storing data and the requirement to delete that data periodically due to the memory constraints, because we cannot store all data since we have infinite streams.
Yeah, cool. Well, in that case, tell us about naive fault tolerance placement and multi-objective fault tolerance placement, and then the one with the hyperparameters, which I think I liked the most, because I like it when things are dynamic, right? And you can, like... oh, cool.
Sorry.
So, well, the naive one is the naive one.
We say that, well, for example,
I got inspired by the existing query optimizer strategies.
So there are multiple placement strategies there.
The first and simplest ones are heuristics-based, which say: oh, okay, we have a processing operator, it needs one slot, whatever a slot means in the
system. And we say that, well, we have a device and it has, let's say, five slots. So we know
that there are five slots. We know the query. We know the number of operators we want to deploy to the device.
So we kind of have the theoretical estimation of resources on the device. And therefore,
my naive strategy was to say, okay, well, fault tolerance also needs one slot. So here we go.
We don't only need one slot for the processing operator; we also need one slot for the fault tolerance. However, when I ran the experiment with this kind of estimation, well, so first of all,
I repeated the Raspberry Pi and Flink experiment, but with NebulaStream on the Raspberry Pi.
And in terms of NebulaStream, I was able to deploy 64 queries without fault tolerance
and 42 queries with fault tolerance.
And 42 queries again, of course killed the Raspberry Pi.
So I killed it twice.
Yes, I did.
So, with the naive estimation saying, yeah, one slot for processing, one slot for fault tolerance: since we deploy 64 queries without fault tolerance, we were able to deploy 32 queries with fault tolerance, because now instead of one slot we needed two slots every time. So we halved the number of queries to 32, but we did not fail the Raspberry Pi, which was good. However, we knew that we could actually deploy 42, because it was with 42 queries that the Raspberry Pi died the last time.
So we knew that, well, the estimation is not really precise. We could do better than that.
And then I started looking at this memory-network pattern. And I calculated that, for example, we have this hyperparameter, the trimming frequency: how often do we trim? Let's say we trim every 100 tuple buffers, or just buffers.
And let's say we know the size of the buffer
because normally we know it.
It's configuration of the system.
So we can calculate approximately
how much memory we will need at the point of time.
So for fault tolerance we will store 100 buffers. Let's say each buffer is one kilobyte, for simplicity. So we will need 100 kilobytes at any given point.
And then after that, we will send a message to the master node, and the master node will reply to us. And there is still this delay; we don't know exactly how much memory it will cost us to wait for this message to reach the master and come back.
But well, knowing the network delay and approximate number of hops, we can still
predict that and this estimation is kind of safe there. So having this estimation,
we actually reached 42 queries, but without failing the Raspberry Pi, which is good. So it was the originally targeted amount of queries, but we did not even kill the poor device, which was good. But then I still went further and I said, okay, well, actually we could notice that we are getting out of memory at some point, and we could switch to sending these messages for cleaning the memory more often. That way we will utilize more network, but we will save more memory. And having this optimization, I reached 56 queries out of the original 64, which was actually a major result in terms of the number of queries deployed.
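As a back-of-the-envelope version of the estimate she walks through, here is some illustrative Python arithmetic. The trim interval and buffer size are the rough numbers from the conversation; the in-flight allowance and the memory budget are purely hypothetical figures, and the paper's real cost model is more detailed.

```python
# Illustrative arithmetic only: memory one query needs for fault tolerance,
# then how many queries fit in an assumed fault tolerance memory budget.

BUFFER_SIZE_KB = 1          # size of one tuple buffer (from the example above)
TRIM_EVERY_BUFFERS = 100    # trimming frequency chosen for the query
INFLIGHT_BUFFERS = 10       # assumed buffers arriving while the trim
                            # request and acknowledgement are in flight

def ft_memory_per_query_kb(trim_every=TRIM_EVERY_BUFFERS,
                           inflight=INFLIGHT_BUFFERS,
                           buffer_kb=BUFFER_SIZE_KB):
    """Buffers held until the next trim, plus buffers piling up during
    the acknowledgement round trip."""
    return (trim_every + inflight) * buffer_kb

per_query_kb = ft_memory_per_query_kb()
print(f"fault tolerance memory per query: ~{per_query_kb} KB")

FT_BUDGET_KB = 8 * 1024     # hypothetical slice of the Pi's RAM reserved for backups
print("queries that fit in that budget:", FT_BUDGET_KB // per_query_kb)
```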
That's awesome. I mean, as I said at the start, they're going to stop you from buying Raspberry Pis if you keep this up.
i hope they don't have my name on there
But it's really nice to see that iterative process as well: as you went through these experiments, you were kind of like, oh, actually I can maybe tweak it this way, or I can go this way and do these little actions. Because at the end, when you just read the research paper, it reads as if you had idea one, idea two, idea three, and they all just came through at the same time. So it's nice to see that sort of iterative process play out, that you did these experiments, and seeing the gain each time must have been a nice experience for you as well.
Cool. So I guess, given that we touched a little bit there on results and a few numbers, maybe we can talk about your experiments in a little bit more depth. So yeah, overall, how did you go about evaluating all of these different approaches you developed?
Yeah, and what datasets did you use?
Who did you compare against?
Obviously, Flink and NebulaStream were used, right?
But yeah, tell us more about your experiments is the question I'm trying to get to, Anastasia.
Yeah, sure. No worries.
Well, so the first experiment I basically already brought up as an example, in terms of how I came up with the different solutions.
So that's exactly the same measurements, but for four Raspberry Pis, not for one.
So you can just multiply these values by four, and that's how you will get my first estimation.
Then I actually looked at the resource consumption to see that actually, yeah, naive
approach is not really good at resource estimation because it just says, yes, it uses some kind of
slots. Who knows what slots mean? How many slots a Raspberry Pi has? And it's a bit like, yeah,
heuristic approach. Then we actually also, in the resource analysis, I think we see how Flink failed,
why it failed. It went out of memory, which was according to the initial assumption.
Then I ran a scalability analysis. Basically, I actually use Kubernetes for that. I used Google Cloud with up to 256 devices, I think. So I basically allocated
256 Raspberry Pis and used Kubernetes and Docker image to deploy them on every instance.
I wanted actually to run 512, but then Google Cloud wrote me a sad message saying that I'm
already utilizing all the CPU of Western Europe
so they cannot give me more.
So I had to stop at 256.
Yeah.
They need bigger
data centers is what we're saying here. This is a call to
action for Google. Build more data centers.
I'm actually a bit worried
about how many companies have me on their red list of not giving me any more devices or hardware resources.
Yeah.
And, well, the analysis that I find cool is actually this hyperparameter analysis, where I tried different trimming frequencies and saw how they influence the system and what happens. Because, well, what I can say is that if we don't trim the data in time, we will just be running out of buffers,
because normally how a system works,
they allocate a pool of buffers at the launch.
And then these buffers from this pool are getting used inside the system.
So unless the buffers are getting trimmed, they stay stored in the system and cannot be used to receive new data and send it further. So if we don't trim enough in time, we can reach a point where we don't have any more buffers, but we still haven't sent the message to the master node asking to delete the buffers. So we are just stuck, constantly waiting for new buffers but never actually freeing any of them. So I would say it's a crucial parameter, and it's really important to choose it wisely.
Yeah. Did you choose
that parameter? Obviously, how is it best to make that decision? Is there a way we can pick it in a manner that will always guarantee we don't hit this problem of not having any space left to get the message back to us to then be able to trim, right?
Yeah, I actually believe that the optimal value can be calculated
using my formula of memory estimation for fault tolerance. So basically, if we take the amount of buffers plus the delay of both legs of the trimming message, that should be a really valid formula to use if we want to estimate this value.
Yeah, I just wanted to double-check my understanding as well of these fault tolerance placement schemes: is it a deterministic guarantee that we will never, ever end up being under-provisioned again, hitting an OOM or anything like that? Or is it just best effort, and we could possibly still hit an OOM?
Yeah, so there is a basic problem in the IoT
with processing guarantees.
Well, so what you've mentioned right now,
these are processing guarantees and different systems support different guarantees. For example,
the strongest one is exactly once, meaning that we do not lose any data and we do not produce any duplicates in case of a failure. Then there is at least once, which means we don't lose any data, but we might have duplicates. At most once means we may lose data, but we don't have duplicates. And then there is none.
And so the problem of IoT is that because the data is getting produced at the lowest devices,
in case the producer, the sensor device itself, fails,
there is no way independently of my fault tolerance approach that I can reproduce this data because it's actually the producer who failed.
So I say in my papers that we start working from at most once. Unfortunately, at least once
and exactly once are not achievable for us. So we have different flavors of at most once. But answering your
question, no, you cannot be sure that your data will get delivered because in case a producer
will fail, then there won't be any data to deliver. Otherwise, I mean, there will be a probability with which the data will get delivered. I create several classes, just to say there is high probability, medium probability, and low probability, so that the user wouldn't have to use, you know, a number between zero and a hundred, but has these classes instead: like, oh yeah, I really badly want my data to get delivered, so then it's high.
Cool. Yeah, I guess that's really interesting.
I guess kind of leading on from that then,
I always like to ask about limitations of various solutions.
I mean, this is a massive improvement over the state of the art, right, over just not thinking about fault tolerance at all. But I always like to ask and probe a little bit about limitations. Obviously, the fact that we can't have exactly-once semantics is out of our control; it's just a fact of life in a way, so we have to work around that. But are there any situations or scenarios in which the placement strategies end up being suboptimal?
Well, there are multiple. I mean, so basically I am not the
inventor of placement, right? I'm just using the placement algorithm and I also treat fault
tolerance as a black box. So I basically say, yeah, any fault tolerance approach is an operator
and I just deploy it. So I, you know, don't take any responsibility; I'm serving as a bridge between the query optimizer and the fault tolerance guys.
But of course, there are several sub-optimalities there.
I mean, I used in my paper, for example,
a weighted sum algorithm, which is well considered to be sub-optimal in terms of multi-objective optimizations. We have the weights, and these weights are again chosen by either the developer or the user, saying, oh, okay, our system is memory restricted,
or our system is lacking network bandwidth, or whatever. Yeah. So in this sense, we are restricted.
What else? We are also, of course, restricted in that if we just have many unreliable devices, it wouldn't actually matter where we deploy fault tolerance, because the probability will be the same everywhere.
So, yeah, I mean, there could be many examples, but the major idea, I think, is still there.
It's just trying to help avoid the problem of fault tolerance running in silos.
Yeah, for sure. What's the natural next step for your work on fault tolerance in the internet of things and, yeah, in NebulaStream?
Yeah, so as I just mentioned, in the evaluation I highlighted
that there is a problem with scalability. Did I mention that? Maybe I should mention it. I actually found out that there is a small problem with scalability with any of the approaches, just because of this trimming pattern, where we have to ask somebody and we have
to wait. So there are two problems. The first one, if we have really long paths in between the last
device and the master node, there will be really many hops to jump over. And at some point, we cannot find an optimal amount of buffers to store, because the optimal amount is less than zero, which means, obviously, we're just not managing to trim data fast enough.
And the second problem is that the master node is serving as a central unit, which is doing a lot of parallel processing. So if we also have really many paths, then the master node struggles to answer all these questions about which buffers are safe to trim. So these are the two problems, and my future vision for this problem is to
actually move from the centralized architecture in terms of fault tolerance to the decentralized one.
So it's a bit peer-to-peer inspired, where each worker takes responsibility for itself and talks to its neighbors, maybe analyzing neighborhoods. And yeah, this is also the idea of the papers that I'm currently writing.
A teaser! Yeah, for the listener: keep your eyes peeled, there's more work coming soon.
It's interesting how you situate those future directions, because as I was thinking about this, like you said, the network topology is in a way what makes scalability a problem, right? How much control do you have over the network topology and which devices are connected to each other? How much do you have to just assume that's a given, or do you have control over the whole architecture, like, okay, well, if we connect these devices, that's going to make the whole picture a lot nicer; if we can connect A and B, that's going to make all our problems go away? How much control do you have over that, or how much do you just assume that the network, the way the hardware and things are connected, is a given, and you're just going to work on top of that platform?
Yeah, so at
the moment in NebulaStream, we have a coordinator running in the cloud that has an understanding of the global topology.
And we kind of have this assumption of the worker running in isolation without knowing
the global topology. So all the optimization decisions are made by the coordinator.
It then propagates those decisions partially to the workers, and the workers then perform them; well, in our case, we are compiling the query to run it really efficiently, to utilize resources really efficiently. But yeah, so we have this isolation assumption.
Nice. It'll be really interesting to see how your work progresses in this area, and to see how NebulaStream progresses as well. Cool. Yeah, let's talk about that a little
bit more actually then about impact. And obviously, I don't know how widely used Nebula Stream is
outside of sort of as a research vehicle or if it's used at all in any sort of production setting.
But has your kind of work on fault tolerance had any impact at the moment? Or what sort of impact do you think it can maybe have in the future
and maybe be incorporated into systems like Flink?
Yeah, well, so I find NebulaStream conceptually a bit different from Flink.
So it's like one can say a next generation of the system after Flink.
And what I say is that the algorithm I develop is targeted more towards this device
setting with limited resources. So Flink is running in the cloud, so they basically don't care about
this. But potentially, I think it's a major improvement in terms of, well, not only reliability, but also of understanding the nature of fault tolerance,
understanding the costs of fault tolerance. I think that if, for example, an engineer will
have to set up a large installation of IoT devices, my paper will help them understand better how many resources are needed: to estimate, for example, how many devices are needed for the system, how many resources the system has. And knowing also the use case, for example that the installation involves sensitive or important data, we can estimate the overall amount of devices. We don't blindly buy, let's say, thousands of devices and assume that, yeah, it will be enough; we can actually perform some kind of smart analysis for this decision making.
Yeah, nice. Some data-driven decisions, right, I think is the catchphrase for it. Cool. Yeah, so I guess my next few questions are more about the non-technical stuff, the stuff that happens along
the way, the journey when you're working on a paper like this. And the first one is: whilst working, maybe not specifically on this paper but on your PhD studies at large, what's been the most interesting thing that you've learned so far on that journey?
Well, I guess I kind of enjoy working with different types of hardware.
As you might have noticed, you like to destroy different types of hardware too!
Yeah, maybe I enjoyed destroying different types of hardware a bit too much.
That's true.
Including my laptop a couple of times.
Yeah, yeah.
But to be honest, yeah, before that, I didn't have any experience with raw hardware.
I had to manually connect Raspberry Pis and create a local network.
I also got to know like Kubernetes,
Docker and all this kind of infrastructure.
So it's like a first hand experience
that I got there,
which I've never had.
And also what was interesting for me
and is still interesting
is to be part of like a larger team
where we are all researchers contributing to what we are trying to build as an industrial project
so yeah that's also something new for me and something that I experienced during my PhD
And, well, of course, the freedom of decision making. That's exactly why I came here.
And that's exactly why I'm doing my PhD,
to have this freedom and to actually have an impact
in the future somewhere, hopefully.
Yeah, I'm sure you will.
I guess going off that, let me segue in from that last bit
about having that freedom to pursue your own ideas
and to kind of get creative. And maybe you hinted at it earlier; I kind of got an idea of how you work when you said you did this experiment, then tweaked it this way, and ended up developing this solution space. But yeah, how do you go about generating ideas, and what's your creative process?
Yeah, I actually thought about that a bit. And I think that I like visiting different
talks from different people, listening to the current research. We have plenty of visitors that give keynotes or talks, also industrial partners, for example. And somehow during these talks I get extra inspired. So I keep the notes on my iPhone open and just write down some ideas for my own papers. Sometimes I actually even skip part of the talk, just because I'm so into my own ideas that I get distracted. But yeah, I would say this is my major source of motivation.
That's a nice approach.
Yeah, I've had people before who've answered this question and mentioned this idea of having a kind of breadth and borrowing ideas from seeing the way people approach problems in other domains, or even within the same space, something closely related, maybe not as far away as biology, but just a different field within computer science. And there are a lot of parallels that can be drawn between the two, and you can take ideas from those, which gives you a nice way to generate ideas. Which is cool.
And yeah, so I guess we've made our way all the way to the last word.
But before we do that, I don't know if you saw my email about the possibility of giving the listener a recommendation?
Yeah, I saw it.
This is a new feature, Anastasia, so you've got the honor of delivering the listener the first recommendation. The recommendation can be anything tech-related: a blog post, a paper, a new system, a new tool, something you've encountered recently that you thought the listener might enjoy. So yeah, what do you have for us, Anastasia?
Yeah. So first of all, please do not destroy the hardware that you're actually supposed to use for the evaluation; that's just a general recommendation. But I thought about it, and I think I would recommend trying out NebulaStream, of course, because, well, I have first-hand experience here, and we plan an open source release at SIGMOD 2025. So it should be released this year... well, next year, right? And right now it's also available, but only for research. And we also have our small community where we support people that try to do stuff with NebulaStream, so people should not be scared to use it. We will support everyone, and I highly encourage everyone to play around with it.
Fantastic. And correct me if I'm wrong, but that's quite nice, because SIGMOD is in Berlin,
right? Yep, yep that is correct
There we go
perfect alignment there
We are hosting this
year yes. Yeah so it's a celebration
Cool, so yeah, now the last word. What's the one thing you want the listener to take away from this podcast, other than your awesome recommendation? Yeah, one more thing, the last word, to finish us off.
Well, so if you're in the database community, I would highly recommend not ignoring fault tolerance; think of it beforehand, and think of it all the time, as it is really important. And of course, if you're a PhD student, during your work I would generally recommend taking the time to enjoy the research part, enjoy the freedom of the decision making and choices, and try to change something, try to change the world and impact the research community somehow.
That's, I guess, the most
pleasant
part of the PhD, in
my opinion.
Awesome. So that's a great message. I'm plus one with you on the fault tolerance; I agree with you on that one. Very important. We've got to look
after your data. It's been a really nice,
really enjoyable chat, Anastasia. It's been lovely
talking to you, and I'm sure the listener will have really enjoyed
the conversation as well. We'll put links to everything in the show notes, so the listener can go and find and play with NebulaStream and read all of Anastasia's awesome work. We'll put links to that and everything in the notes, and yeah, we'll see you all
next time for some more awesome computer science research.