Disseminate: The Computer Science Research Podcast - Draco Xu | TSUBASA: Climate Network Construction on Historical and Real-Time Data | #3
Episode Date: July 4, 2022
Summary: A climate network represents the global climate system by the interactions of a set of anomaly time series. Network science has been applied to climate data to study the dynamics of a climate network. The core task, and the first step to enable interactive network science on climate data, is the efficient construction and update of a climate network on user-defined time windows. In this interview Draco talks about TSUBASA, an algorithm for the efficient construction of climate networks based on the exact calculation of Pearson's correlation of large time series. By pre-computing simple and low-overhead statistics, TSUBASA can efficiently compute the exact pairwise correlation of time series on arbitrary time windows at query time. For real-time data, TSUBASA proposes a fast and incremental way of updating a network at interactive speed. TSUBASA is at least one order of magnitude faster than approximate solutions for both historical and real-time data and outperforms a baseline for time-series correlation calculation by up to two orders of magnitude.
Questions:
0:54 - Can you introduce your work, describe the problem your paper is aiming to solve and the motivation for doing so?
4:11 - What is the solution you developed? How did you tackle the problem?
6:50 - What is the improvement of TSUBASA over existing work?
8:59 - Are your tools/algorithms publicly available?
10:21 - What is the most interesting lesson or challenge faced whilst working on this topic?
11:51 - What are the future directions for your research?
15:43 - Are there other domains your research can be applied to?
Links: Homepage, Paper (arXiv), tsupy library
Contact Info: Email: dracoxu@stanford.edu, Twitter: @DracoyunlongXu
Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate, the podcast bringing you the latest computer science research.
I'm your host, Jack Wardby.
Today we're recording from ACM SIGMOD/PODS in sunny Philadelphia.
I'm delighted to say I'm joined by Draco Xu.
Hi, I'm Draco.
From the University of Rochester.
And today we'll be talking about his paper,
TSUBASA: Climate Network Construction on Historical and Real-Time Data.
Thank you for joining us on the show. Thank you for inviting me. Let's dive straight in.
Can you first introduce your work and describe the problem that your paper is aiming to solve
and the motivation for doing so? Okay, so the goal of TSUBASA is to provide climate scientists with a way to construct a climate network from time series.
There are two basic concepts in a network.
The first is the nodes, and the second is the edges.
For the nodes, you can imagine each one as a city. For example, suppose someone is recording the temperature in Philadelphia: what is the temperature right now, and then every five seconds they record a new value, so the series keeps updating. And someone in every city is doing the same thing, so you can imagine maybe there are 1,000 cities in the world, and you have 1,000 people recording temperatures and updating them every five seconds. That's the nodes: the nodes are all the cities in the world.
Then you want to know the edges. What is an edge? An edge could be the degree of correlation between the behaviors of these time series. For example, you want to know whether the weather here in Philadelphia is correlated with the weather in New York City. As a climate scientist you might want to know these kinds of things. And then we can compute a measure called the Pearson correlation to quantify the degree of correlation between these time series.
So you could compute that for Philadelphia and New York City, and you could compute it for every pair of cities. For example, if we have 1,000 cities, then you have 1,000 times 1,000 pairs, and you get a matrix. The matrix is nothing special: each entry tells you the correlation between one pair of cities. And that can be a large network if you have enough nodes.
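As a small, brute-force illustration (synthetic data, not the paper's method), building such a correlation matrix and thresholding it into a network looks roughly like this in Python:

```python
import numpy as np

# Illustration only: a brute-force "climate network" from per-city series.
n_cities, n_steps = 1000, 500
rng = np.random.default_rng(1)
temps = rng.standard_normal((n_cities, n_steps))  # rows = cities, columns = time steps

# Pairwise Pearson correlations give a 1000 x 1000 matrix: entry (i, j) is the
# edge weight between city i and city j.
corr = np.corrcoef(temps)

# Thresholding keeps only strongly correlated pairs as edges of the network.
adjacency = np.abs(corr) > 0.5
```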
And the key point is that you want to update this network every five seconds, because the temperature readings keep updating. And maybe your computer is not fast enough to recompute it in five minutes, let alone in five seconds or in one second. So you need a fast way to compute it.
Researchers in this area have proposed some ways to compute it, and the existing approaches do it quite well: they can reach an accuracy above 90%, but the values are still not exact. And in certain types of research, the exact correlation value is really important. So if you want the exact value, can we find a way to do that? Can we find a method that gives the exact value of the correlation? That's the main goal and the motivation of this research.
Fantastic. That's brilliant. So knowing that, what is the solution you developed in your
research? How did you tackle this problem? Okay, so the main idea is that you want to have something prepared in advance. There is a basic concept here: when we talk about a correlation, there is a time period we are talking about. You want to know whether the temperature in Philadelphia is correlated with the weather in New York City, but you need to specify a time range for it, for example the last one day, the last ten days, or the last one year. In our research there is a term for this: it's called a query window. And the query window might be small or it might be large.
And each time this window updates, only a small amount of the data changes. For example, you want to know the correlation over the last five days, and then after five minutes it's updated, so only one point is new, and you want to fold in that one point. So the key idea is that you summarize information for the data you already have, and then you add the information from the one new data point. There are two steps: first you summarize information for the past data, and then you find a way to combine the information from the newly arriving points with the information we have already prepared.
When we summarize the information for the past data, that's where we need a concept called basic windows. Basically, you cut the time series into small groups. You can imagine grouping the data points from one hour into a small window, and that's called a basic window. Then, for example, for yesterday we would have 24 basic windows. And then you want to compute some statistics from these basic windows. You don't want to just keep the raw data: the raw data could be hundreds of points, but you might only want to keep about three numbers per window, and from those you can still get the correlation. The statistics we use are the pairwise correlations between these basic windows, together with their means and standard deviations. They are easy to compute. I think that's the main point.
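To make that concrete, here is a rough sketch (not code from the paper or the tsupy library) of how per-basic-window means, standard deviations, and pairwise correlations can be recombined into the exact Pearson correlation over a query window made up of whole basic windows:

```python
import numpy as np

# Per basic window, keep only a few statistics instead of the raw points.
def basic_window_stats(x, y):
    n = len(x)
    mx, my = x.mean(), y.mean()
    sx, sy = x.std(), y.std()        # population standard deviations (ddof=0)
    r = np.corrcoef(x, y)[0, 1]      # pairwise correlation within the window
    return n, mx, my, sx, sy, r

# Combine the stored statistics of several basic windows into the exact
# Pearson correlation over their union -- no raw data needed at query time.
def combine(stats):
    N = Sx = Sy = Sxx = Syy = Sxy = 0.0
    for n, mx, my, sx, sy, r in stats:
        N   += n
        Sx  += n * mx                         # sum of x over the window
        Sy  += n * my                         # sum of y over the window
        Sxx += n * (sx**2 + mx**2)            # sum of x^2 over the window
        Syy += n * (sy**2 + my**2)            # sum of y^2 over the window
        Sxy += n * (r * sx * sy + mx * my)    # sum of x*y over the window
    num = N * Sxy - Sx * Sy
    den = np.sqrt((N * Sxx - Sx**2) * (N * Syy - Sy**2))
    return num / den

# Sanity check on synthetic data split into 24 "hourly" basic windows.
rng = np.random.default_rng(0)
x, y = rng.standard_normal(2400), rng.standard_normal(2400)
stats = [basic_window_stats(x[i:i + 100], y[i:i + 100]) for i in range(0, 2400, 100)]
assert np.isclose(combine(stats), np.corrcoef(x, y)[0, 1])
```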
Fantastic.
What's the improvement of your solution over the existing work?
And you alluded to before that yours is an exact solution, whereas the other ones are approximations of the correlation. Okay, yeah. So, as I mentioned,
the biggest advantage or improvement is that we can achieve the exact result, which is 100% accuracy. And the second one is that in the previous work, the usability of the query window is really limited.
So let me explain this.
So, for example,
the basic window concept is used in the previous works.
So when you want to get the correlation,
for example, maybe you want to know the correlation between half past three and half past five. And if the basic windows are three to four and four to five, you can never ask those algorithms or systems that question, because the starting point, half past three, is not aligned with three o'clock.
And so it requires that the size of the query window
is divisible by the size of the basic window.
So that's the first requirement of the previous algorithms.
And the second one is the starting point and the ending point of the query window
should be aligned with the starting point and ending point of the basic windows.
And that's the second limitation.
And in our solution, we don't have these things.
So you can query from any point
and that makes it much easier to use, because the size of the basic windows is really important, and if you limit users so that they can only start a query from this point or that point, that is really restrictive for climate research. Removing that restriction gives researchers more possibilities to do their work, I guess.
Yeah.
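As a purely illustrative sketch (one natural way to serve unaligned windows, not necessarily TSUBASA's exact mechanism from the paper), fully covered basic windows can contribute their cached statistics while the partial pieces at either end fall back to the raw points; shown here for a running sum, the simplest of the pre-computed statistics:

```python
import numpy as np

# Hypothetical helper: answer a sum query over [start, end) using cached
# per-basic-window sums for the fully covered windows and raw points for the
# partial pieces at the edges.
def query_sum(series, start, end, bw_size, cached_sums):
    first_full = -(-start // bw_size)   # index of the first fully covered basic window
    last_full = end // bw_size          # one past the last fully covered basic window
    if first_full >= last_full:         # query fits inside a single basic window
        return series[start:end].sum()
    total = sum(cached_sums[first_full:last_full])     # cached, no raw data touched
    total += series[start:first_full * bw_size].sum()  # partial piece on the left
    total += series[last_full * bw_size:end].sum()     # partial piece on the right
    return total

# Tiny check against a direct sum over the raw series.
series = np.arange(1000, dtype=float)
bw_size = 100
cached_sums = [series[i:i + bw_size].sum() for i in range(0, 1000, bw_size)]
assert query_sum(series, 350, 780, bw_size, cached_sums) == series[350:780].sum()
```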
Awesome. Great. So, I imagine that you have a toolkit available online on GitHub that people can go and use to try your model and your techniques?
Yeah, yeah. So, besides the code that we provide for this paper, we also implemented it as a Python library.
Okay, nice.
And it's called tsupy.
Actually, Jin Xu, who is sitting just beside me, is the main author of this library.
In this library we implemented all the algorithms that we mentioned in the paper, and we also implemented some further functions that are useful in climate research, for example sliding windows and some other functions.
So you can just search for tsupy on GitHub, T-S-U-P-Y, and you can also install it from PyPI. There is documentation there, and if you follow it, then in your Python code or in your Jupyter notebook you can just use the functions that we have already prepared for you: you can load a dataset and do your research.
That's brilliant.
Well, we'll link to the library in the show notes along with all the relevant information, so people can go and find it if they want. Okay, awesome. So the next question I have is: what is the most interesting or unexpected or challenging lesson that you learned while working on this topic?
So, like the most challenging one?
Yeah.
I think the challenge is that it's a trade-off. In this research we talk about space complexity and time complexity. Basically, as I mentioned, we need to summarize some information from the time series, and here we only keep three statistics. If we keep more, we might be more efficient at query time, which is when we generate the network. But if the sketches and summaries take up too much space, then maybe you need a larger computer, a big server, to store these things. And if you store too few things, then you need more time to answer the query. Since we really want these algorithms and the library to be useful for research, we really need to find a balance between these things, and I think that's a key challenge in this research.
My next question is: what plans do you have for future research in this area? What are the next steps?
Where do you go from here?
Okay, so as mentioned, there is a trade-off between time and space, and we can make some improvements there. For example, we can reduce the space complexity. For the time complexity, we can also do something. Another idea is that we first pick out some pairs of the time series: you work out which pairs could have a high correlation, you pick them, and then you check them. The picking step is not too expensive, so you can pick the candidates and then choose from the candidates. Right now we just compute every pair over the basic windows, and that's not smart enough. So the idea is that we pick all the candidates and then choose from among them. That's the main idea for the next step.
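A toy sketch of that prune-then-verify idea (illustrative only; how candidates would actually be picked is not taken from the paper, so the coarse downsampling here is just an assumption): a cheap estimate selects candidate pairs, and the exact Pearson correlation is computed only for those candidates.

```python
import numpy as np

# Pick candidate pairs with a cheap, coarse correlation estimate.
def candidate_pairs(series, coarse_threshold=0.4, stride=10):
    coarse = series[:, ::stride]          # downsampled view of every series
    approx = np.corrcoef(coarse)          # rough correlation estimates
    i, j = np.where(np.triu(np.abs(approx) > coarse_threshold, k=1))
    return list(zip(i, j))

# Compute the exact correlation only for the candidates.
def verify(series, pairs, threshold=0.5):
    exact = {}
    for i, j in pairs:
        r = np.corrcoef(series[i], series[j])[0, 1]
        if abs(r) > threshold:
            exact[(i, j)] = r
    return exact

rng = np.random.default_rng(2)
series = rng.standard_normal((50, 1000))
edges = verify(series, candidate_pairs(series))
```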
And actually, you know, after we submitted the paper, over the following several months we also worked on another algorithm, which is the sliding one. You can imagine that with historical data you want to know the correlation for 2020, then 2021, then 2022, and so on: that's a sliding window. The size of the window is unchanged, but the starting point and the ending point change. For the sliding case we have an improved algorithm, where you make a prediction for the next step, and in that way you might not need to recompute everything but can still get the result. So those are two things that we want to do.
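As a generic illustration of that kind of sliding-window computation (a minimal sketch, not the improved follow-up algorithm being described), a fixed-size window can maintain running sums and update the Pearson correlation in constant time as it slides:

```python
import numpy as np
from collections import deque

# Rolling Pearson correlation over a fixed-size window: each slide drops the
# oldest point, adds the newest, and adjusts the running sums in O(1).
class RollingCorrelation:
    def __init__(self, x_init, y_init):
        self.x, self.y = deque(x_init), deque(y_init)
        self.sx, self.sy = sum(self.x), sum(self.y)
        self.sxx = sum(v * v for v in self.x)
        self.syy = sum(v * v for v in self.y)
        self.sxy = sum(a * b for a, b in zip(self.x, self.y))

    def slide(self, x_new, y_new):
        x_old, y_old = self.x.popleft(), self.y.popleft()
        self.x.append(x_new)
        self.y.append(y_new)
        self.sx += x_new - x_old
        self.sy += y_new - y_old
        self.sxx += x_new * x_new - x_old * x_old
        self.syy += y_new * y_new - y_old * y_old
        self.sxy += x_new * y_new - x_old * y_old

    def corr(self):
        n = len(self.x)
        num = n * self.sxy - self.sx * self.sy
        den = np.sqrt((n * self.sxx - self.sx ** 2) * (n * self.syy - self.sy ** 2))
        return num / den

# Check the rolling result against a direct computation on the final window.
rng = np.random.default_rng(3)
x, y = rng.standard_normal(500), rng.standard_normal(500)
rc = RollingCorrelation(x[:365], y[:365])
for t in range(365, 500):
    rc.slide(x[t], y[t])
assert np.isclose(rc.corr(), np.corrcoef(x[135:500], y[135:500])[0, 1])
```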
And also, this kind of correlation-based network is common in a lot of areas; it's not only important in climate research. For example, here I mentioned that a node could be a city, but a node could also be a brain region. Neuroscientists also want to know the correlations between brain regions, and in neuroscience that's called connectivity, or functional connectivity. They want to know, when you are looking at something or thinking about something, whether two parts of your brain are correlated. And maybe this kind of correlation can tell us how the brain processes information and how it passes information along. And beyond these neuroimaging signals, we can also do single-neuron recordings to get more neurophysiological signals, or something like that.
But the point here is that this kind of correlation-based network is common in a lot of other areas. Another important thing, though, is that in each area, in each research field, the focus is different. Here, in climate research, the time resolution is really important: for example, the temperature is updated every second or every five seconds, so we need to achieve that kind of speed.
But maybe, for example, in neuroscience the data isn't updated so quickly; maybe there we need to achieve higher accuracy, or lower overhead, or faster speed, I don't know. Different fields have different requirements. If we really want to extend this to other domains, for example if we want to write a library for scientists in other domains, then we need to know their requirements and then make improvements for them.
Yeah, that's great. So that will allow you to explore other points in these trade-offs you're talking about.
Yeah, yeah.
That's great. What are the other application areas? You mentioned climate and neuroscience. What are the other common application areas for this work?
For example, finance, I guess. So, the stock market.
In some kinds of financial research, they want to know the correlation between stocks and other financial time series. Or there is Bitcoin.
Okay, yeah.
Bitcoin, you want to know the correlations there, and that data is updated really frequently, I guess even faster than once a second. So you can see this kind of research there. Or you can also see it in physics: there is experimental physics, where they are doing experiments, you have some sensors placed in the lab, and you want to know the correlation matrix or something like that.
Okay, brilliant.
Well, fantastic.
We'll end it there.
Okay.
Thank you so much, Draco, for coming on the show.
If you're interested in knowing more about Draco's work
then the links to the paper and all of the relevant
libraries and everything will be put in the
show notes. So thank you and
see you next time. Thank you.