Disseminate: The Computer Science Research Podcast - Draco Xu | TSUBASA: Climate Network Construction on Historical and Real-Time Data | #3

Episode Date: July 4, 2022

Summary: A climate network represents the global climate system by the interactions of a set of anomaly time series. Network science has been applied to climate data to study the dynamics of a climate network. The core task, and the first step to enable interactive network science on climate data, is the efficient construction and update of a climate network on user-defined time windows. In this interview Draco talks about TSUBASA, an algorithm for the efficient construction of climate networks based on the exact calculation of Pearson's correlation of large time series. By pre-computing simple and low-overhead statistics, TSUBASA can efficiently compute the exact pairwise correlation of time series on arbitrary time windows at query time. For real-time data, TSUBASA proposes a fast and incremental way of updating a network at interactive speed. TSUBASA is at least one order of magnitude faster than approximate solutions for both historical and real-time data, and outperforms a baseline for time-series correlation calculation by up to two orders of magnitude.

Questions:

0:54 - Can you introduce your work, describe the problem your paper is aiming to solve and the motivation for doing so?
4:11 - What is the solution you developed? How did you tackle the problem?
6:50 - What is the improvement of TSUBASA over existing work?
8:59 - Are your tools/algorithms publicly available?
10:21 - What is the most interesting lesson or challenge faced whilst working on this topic?
11:51 - What are the future directions for your research?
15:43 - Are there other domains your research can be applied to?

Links:

Homepage
Paper (arXiv)
tsupy library

Contact Info:

Email: dracoxu@stanford.edu
Twitter: @DracoyunlongXu

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, the podcast bringing you the latest computer science research. I'm your host, Jack Wardby. Today we're recording from ACM SIGMOD/PODS in sunny Philadelphia. I'm delighted to say I'm joined by Draco Xu. Hi, I'm Draco. From the University of Rochester. And today we'll be talking about his paper, TSUBASA: Climate Network Construction on Historical and Real-Time Data.
Starting point is 00:00:46 Thank you for joining us on the show. Thank you for inviting me. Let's dive straight in. Can you first introduce your work and describe the problem that your paper is aiming to solve and the motivation for doing so? Okay, so the goal of TSUBASA is to provide climate scientists with a way to construct a climate network from time series. There are two basic concepts in a network. The first is the nodes, and the second is the edges. For the nodes, you can just imagine each one is like a city. So, for example, we have some people recording the temperature of, say, Philadelphia. What's the temperature
Starting point is 00:01:31 right now, in this second? And every five seconds they record a new reading, so it's always updating. And everybody in every city is doing that. So you can imagine maybe there are 1,000 cities in the world, so you have 1,000 people recording and updating every five seconds. So that's the nodes: the nodes are all the cities in the world. And then you want to know the edges. What's an edge? An edge could be the degree of correlation between the behaviors of these time series. So, for example, you want to know whether the weather here in Philadelphia is correlated with the weather in New York City. As a climate scientist, you might want to know these kinds of
Starting point is 00:02:13 things. And then we can compute a measurement called Pearson correlation to measure the degree of correlation of these time series. So you could compute that for Philadelphia and New York City, and you could compute it for every pair of cities. For example, if we have 1,000 cities, then you have 1,000 times 1,000 pairs, and you get a matrix. The matrix is nothing special: each point in the
Starting point is 00:02:48 matrix tells you the correlation between one pair. And that could be a large network if you have enough nodes. And the key point is that you want to update it every 5 seconds, because
Starting point is 00:03:03 the temperature keeps updating. And maybe your computer is not good enough to compute it in five minutes, or in five seconds, or in one second. So you need a fast way to compute it. The researchers in this area have proposed some ways to compute it. And before I get to how we address these problems,
Starting point is 00:03:31 I should say they can already do this quite well: they can get an accuracy above, say, 90%, but it's still not the exact value. And in certain types of research, the exact correlation value is really important. So if you want the exact value, can we find a way to compute the exact correlation? That's the main goal and the motivation of this research.
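To make the baseline concrete, here is a minimal sketch of the naive approach in Python: recomputing the whole correlation matrix from raw data. The station count, series length, and edge threshold are illustrative assumptions, not numbers from the paper.

```python
import numpy as np

# Naive baseline (all numbers illustrative, not the paper's setup):
# n stations, each with an anomaly time series of length t.
n_stations, t = 1000, 10_000
rng = np.random.default_rng(0)
series = rng.standard_normal((n_stations, t))

# Pairwise Pearson correlation: entry (i, j) is the candidate edge
# weight between city i and city j in the climate network.
corr = np.corrcoef(series)        # shape (1000, 1000)

# Threshold the matrix to get the network's adjacency matrix.
adjacency = np.abs(corr) > 0.9    # assumed cutoff, for illustration

# The pain point: with readings streaming in every few seconds, redoing
# this full recomputation from raw data is too slow at scale.
```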
Starting point is 00:04:00 Fantastic. That's brilliant. So knowing that, what is the solution you developed in your research? How did you tackle this problem? Okay, so the main idea is that you want to have something prepared ahead of time. There is a basic concept here: when we're talking about a correlation, there is a time period that you're talking about. You want to know whether the temperature in Philadelphia is correlated with the weather in New York City, but you need to specify a time range for it.
Starting point is 00:04:39 So for example, over the last one day, or the last ten days, or the last one year, or something like that. In our research, there's a definition to describe this: it's called a query window. Okay. And the query window might be small, might be large. And each time it updates, only a small amount of the data changes. For example, you want to know the correlation over the last five days, and then after five minutes it's updated, so only one point is new, and you want to incorporate that one point. So the key idea is that you want to summarize the information for the previous days,
Starting point is 00:05:29 and then you want to add the information of the one new data point. So there are two steps. The first is that you summarize the information for the past few days. And then you find a way to combine the information from the newly arriving data with the information that we prepared. When we're summarizing the information for the past data, that's where we need a concept called basic windows. Basically, you need to cut the time series into small groups. You can imagine that we group the data points in one hour into a small window,
Starting point is 00:06:11 and that's called a basic window. So then, for example, for yesterday we would have 24 basic windows. And you want to keep some statistics from these basic windows; you don't want to just keep the raw data. In the raw data there could be hundreds of data points in a window, but you might only want to keep, like, three values.
Starting point is 00:06:34 And from these, you can get the correlation. The statistics we use are the correlations between these basic windows, and their means, and their standard deviations. They are easy to compute. And so, yeah, I think that's the main point.
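As a rough illustration of why a few per-window statistics suffice, here is a sketch that stores simple sums per basic window and reconstructs the exact Pearson correlation over any run of whole basic windows. The paper keeps per-window correlations, means, and standard deviations; the sum-based form below carries equivalent information and is only meant to show the principle. The window size and names are assumptions.

```python
import numpy as np

W = 3600  # assumed basic window: one hour of 1-second readings

def window_stats(x, y):
    """Cheap per-basic-window statistics for a pair of series x, y."""
    xw, yw = x.reshape(-1, W), y.reshape(-1, W)   # rows = basic windows
    return {
        "sx": xw.sum(1), "sy": yw.sum(1),                   # sums
        "sxx": (xw * xw).sum(1), "syy": (yw * yw).sum(1),   # sums of squares
        "sxy": (xw * yw).sum(1),                            # cross-products
    }

def exact_corr(stats, first, last):
    """Exact Pearson correlation over basic windows first..last (inclusive)."""
    n = (last - first + 1) * W
    s = {k: v[first:last + 1].sum() for k, v in stats.items()}
    cov = s["sxy"] - s["sx"] * s["sy"] / n
    var = (s["sxx"] - s["sx"] ** 2 / n) * (s["syy"] - s["sy"] ** 2 / n)
    return cov / np.sqrt(var)
```

Note that when a new reading arrives, only the newest window's few sums change, which is what makes the incremental, real-time update cheap.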
Starting point is 00:07:06 Fantastic. What's the improvement of your solution over the existing work? Obviously you alluded before to yours being an exact solution, whereas the other ones are approximations of the correlation. Okay, yeah. So, as I mentioned, the biggest advantage, the improvement, is that we can achieve the exact value, 100% accuracy. And the second one is that in the previous work, the usability of the query window is really limited. So let me explain this. The basic window concept is used in the previous works too.
Starting point is 00:07:27 So when you want to get the correlation, for example, maybe you want to know the correlation between half past three and half past five. And if the basic windows are 3 to 4 and 4 to 5, you can never ask the algorithms or the system that question, because the starting point is half past three, and it's not aligned with 3 o'clock. And they require that the size of the query window is divisible by the size of the basic window.
Starting point is 00:08:06 So that's the first requirement of the previous algorithms. And the second one is that the starting point and the ending point of the query window should be aligned with the starting and ending points of the basic windows. That's the second limitation. And in our solution, we don't have these restrictions, so you can query from any point.
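One way to picture how arbitrary endpoints can be served: only the two partial windows at the edges of the query touch raw data, while every whole basic window in between is answered from the precomputed statistics. This is a hypothetical sketch of that bookkeeping, not code from the paper or from tsupy; all names are made up for illustration.

```python
# Hypothetical sketch: split an unaligned query into raw-data edges
# plus whole basic windows. W = basic window size in samples; qs, qe =
# query start/end sample indices (all names illustrative).
def split_query(qs, qe, W):
    first_full = -(-qs // W)          # ceil: index of first whole window
    last_full = qe // W               # index just past the last whole window
    head = (qs, first_full * W)       # leading partial window -> raw data
    tail = (last_full * W, qe)        # trailing partial window -> raw data
    middle = (first_full, last_full - 1)  # whole windows -> precomputed stats
    return head, middle, tail

# E.g. hourly windows, query from 3:30 to 5:30 (indices in seconds):
head, middle, tail = split_query(3 * 3600 + 1800, 5 * 3600 + 1800, 3600)
# head: raw 3:30-4:00, middle: the whole 4:00-5:00 window, tail: raw 5:00-5:30
```

Summing the raw edge segments together with the stored per-window sums still yields exact totals, so the correlation over the full, unaligned range stays exact.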
Starting point is 00:08:37 That makes things easier for the user, because the size of the basic windows is really important, and if you limit users so they can only start from this point or that point, that's really restrictive. This way we give them more possibilities to do their research, I guess. Yeah. Awesome. Great. So, I imagine that you have a toolkit available online on GitHub that people can go and use to apply your model and your techniques? Yeah, yeah. So, besides the code that we provide for this paper, we also
Starting point is 00:09:11 implemented it as a Python library. Okay, nice. And it's called tsupy. Actually, Jin Xu, who is sitting just beside me, is the main author of this library. In this library, we implemented all the algorithms that we mentioned in the paper, and we also implemented some further functions
Starting point is 00:09:34 that are useful in climate research, for example the sliding window one, and some other functions. So you can just search for tsupy on GitHub, T-S-U-P-Y, and you can also download it from PyPI. There is documentation there; you can follow it, and in your Python code or in your Jupyter notebook you can just use the functions that we have
Starting point is 00:10:03 already prepared for you. You can just load the dataset and do your research. That's brilliant. Well, we'll link to the library in the show notes along with all the relevant information, so people can go and find it. Okay, awesome. So the next question I have is: what is the most interesting or unexpected or challenging lesson that you learned while working on this topic? Like, the most challenging one? Yeah. I think the challenge is that it's a trade-off. So in the research, we talk about space complexity and time complexity.
Starting point is 00:10:46 Basically, the idea is, as I mentioned, that we need to summarize some information from the time series. Here, we only keep three statistics. But if we keep more, we might be more efficient at query time, which is when we generate
Starting point is 00:11:02 the network. But if the summaries take up too much space, then maybe you need a larger computer, a large server, to store these things. And if you store too few things, then you need more time to answer the query. Since we really want these algorithms and the library to be useful in research, we really need to find a balance between these things. And I think that's a key challenge in this research.
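To put rough numbers on that trade-off, here is a back-of-the-envelope count; all figures are illustrative assumptions, not measurements from the paper.

```python
# Illustrative storage arithmetic for the space/time trade-off.
n_series = 1000                   # stations (assumed)
samples = 365 * 24 * 3600         # one year of 1-second readings per series
W = 3600                          # hourly basic windows (assumed)
windows = samples // W            # 8,760 basic windows per series

raw = n_series * samples                             # ~3.2e10 raw values
per_series = n_series * windows * 2                  # sums + sums of squares
pairwise = n_series * (n_series - 1) // 2 * windows  # cross-terms: ~4.4e9

print(f"{raw:.1e} raw vs {per_series:.1e} + {pairwise:.1e} summary values")
# Per-series summaries are tiny next to the raw data, but the pairwise
# cross-terms dominate storage; shrinking W cuts query time while
# inflating exactly this term, which is the balance being described.
```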
Starting point is 00:11:34 My next question is: what plans do you have for future research in this area? What are the next steps? Where do you go from here? Okay, so as mentioned, we have this trade-off between time and space, and we can make some improvements there. For example, we can minimize the space complexity. And for the time complexity, we can also do something. We have another idea, which is that we first pick out some pairs of the time series.
Starting point is 00:12:10 So you figure out which pairs of time series could have a high correlation, you pick them, and then you check them. The picking step is not too expensive, so you can pick the candidates and then choose from the candidates. Right now we just compute every pair over these basic windows,
Starting point is 00:12:33 and that's not smart enough. So the idea is that we pick all the candidates, and then we choose from the candidates. That's the main idea for the next step. And actually, in the several months since we submitted the paper, we have also worked on another algorithm,
Starting point is 00:12:54 which is the sliding one. You can imagine that, in historical data, you want to know the correlation for 2020, then 2021, then 2022; that's called sliding. The size of the window is unchanged, but the starting point and end point change. For sliding, we have an improved algorithm,
Starting point is 00:13:18 which is like you make a prediction for the next step. In that way, you might not need to compute that one, but you can still get the result. So those are, like, two things that we want to do.
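For the sliding case, the same per-window statistics can be reused: when the query window slides forward by one basic window, you subtract the sums of the window that drops out and add the sums of the window that enters, instead of recomputing from scratch. A minimal sketch, reusing the illustrative `window_stats` layout from earlier; this shows the incremental idea, not the prediction-based algorithm being described.

```python
import numpy as np

def sliding_corrs(stats, span, W):
    """Correlations for every query window of `span` basic windows,
    slid one basic window at a time, with O(1) updates per step."""
    n = span * W
    s = {k: v[:span].sum() for k, v in stats.items()}  # first window's sums
    out = []
    for i in range(span, len(stats["sx"]) + 1):
        cov = s["sxy"] - s["sx"] * s["sy"] / n
        var = (s["sxx"] - s["sx"] ** 2 / n) * (s["syy"] - s["sy"] ** 2 / n)
        out.append(cov / np.sqrt(var))
        if i < len(stats["sx"]):          # slide: drop one window, add one
            for k in s:
                s[k] += stats[k][i] - stats[k][i - span]
    return out
```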
Starting point is 00:13:43 And also, this kind of correlation-based network is common in a lot of areas; it's not only important in climate research. For example, I mentioned that a node could be a city, but a node could also be a brain region. So neuroscientists also want to know the correlations between brain regions. And in neuroscience, that's called connectivity, or it's
Starting point is 00:14:00 called functional connectivity. They want to know, when you are looking at something, or when you are thinking of something, whether maybe two parts of your brain are correlated. And maybe this kind of correlation will tell us how the brain processes information, how it sends information. And beyond these neuroimaging questions, we can do single-neuron recordings to get at more neurophysiology questions, or something like that.
Starting point is 00:14:33 But the point here is that this kind of correlation-based network comes up in a lot of other fields. And another important thing is that in each field, in each research area, the focus is different. Here, in climate research, the time resolution is really important. For example, the temperature is updated every one second, every five seconds.
Starting point is 00:14:58 So we need to achieve that kind of speed. But maybe, for example, in neuroscience, the data is not updated so quickly; maybe there we need to achieve higher accuracy, or lower overhead, or faster speed, I don't know. Different fields
Starting point is 00:15:19 have different kinds of requirements. If we really want to extend this to other domains, for example if we want to write a library for the scientists in other domains, then we need to know their requirements, and then we can make some improvements for them. Yeah, that's great. So that will allow you to explore other areas of these trade-offs you were talking about. Yeah, yeah. That's great. What are the other application areas? So you mentioned, obviously, climate and neuroscience. What are the other common
Starting point is 00:15:48 application areas for this work? Finance, for example, I guess. So, the stock market. In some kinds of financial research, they want to know the correlation between some stocks and other financial time series. Or there is the one called Bitcoin. Okay, yeah. For Bitcoin, you want to know the correlation, and that one is updated really frequently, I guess more than once a second. So you can see this kind of research there.
Starting point is 00:16:22 Or you can also see it in physics, in experimental physics. They are doing experiments, so you have some sensors, you put them in the lab, and you want to know the correlation matrix or something like that. Okay, brilliant. Well, fantastic.
Starting point is 00:16:39 We'll end it there. Okay. Thank you so much, Draco, for coming on the show. If you're interested in knowing more about Draco's work, then the links to the paper and all of the relevant libraries and everything will be put in the show notes. So thank you, and see you next time. Thank you.
