Disseminate: The Computer Science Research Podcast - Thomas Hütter | JEDI: These aren’t the JSON documents you’re looking for | #4
Episode Date: July 8, 2022Summary:The JavaScript Object Notation (JSON) is a popular data format used in document stores to natively support semi-structured data.In this interview, Thomas talks about how he addressed the probl...em of JSON similarity lookup queries: given a query document and a distance threshold, retrieve all documents that are within the threshold from the query document, i.e., get me all similar documents!. Different from other hierarchical formats such as XML, JSON supports both ordered and unordered sibling collections within a single document which poses a new challenge to the tree model and distance computation. Thomas talks about his proposal JSON tree, a lossless tree representation of JSON documents, and define the JSON Edit Distance (JEDI), the first edit-based distance measure for JSON. He talks about the development of QuickJEDI, an algorithm that computes JEDI by leveraging a new technique to prune expensive sibling matchings. It outperforms a baseline algorithm by an order of magnitude in runtime. Our experimental evaluation shows that our solution scales to databases with millions of documents and JSON trees with tens of thousands of nodes.Questions:0:47: Can you explain to the listeners what is JSON?1:14: What is the problem you're trying to solve in your research? 1:48: What was the reason JSON was under researched? 2:13: What is the motivation for this research? Why do we need it? 2:52: What was the solution you developed to solve this problem?4:35: How does tree edit distance work?5:18: How do we go from tree edit distance to JEDI? 6:29: How did you evaluate JEDI? 8:31: Do other database systems provide similar functionality? 9:33: Can you tell the listeners more about AsterixDB? 10:20: What was the most challenge aspect of working on this topic?10:59: What are the future plans for this research? 11:56: What attracted you to working on similarity queries? Links:PaperSIGMOD PresentationAsterixDBthomas.huetter@plus.ac.atHomepage Hosted on Acast. See acast.com/privacy for more information.
Transcript
Discussion (0)
Hello and welcome to Disseminate, the podcast bringing you the latest computer science research.
I'm your host, Jack Wardby, and today we're recording from ACM SIGMOD pods in Philadelphia.
I'm delighted to say I'm joined by Thomas Hooter, who will be talking about his paper, Jedi.
These are the JSON documents you're looking for, which I think is the best name for a paper I've ever seen.
Thanks.
Thomas is a postdoc at the Database Research Group at the University of Salzburg,
and he focuses normally on novel algorithms for similarity queries
as well as their integration into fully-fledged database systems.
Welcome, Thomas. Thanks for joining us.
Thanks for having me, Chuck.
So let's dive straight in.
First of all, can you set the scene for your research for us and tell us what is JSON?
Okay, JSON is a data format. I guess everyone has in some variety touched it, whether it might be a data set, which has just the data format to store it, or I don't know, exchange data in mobile applications, or especially here for the database community, I guess we used it from document stores where we stored the semi-structured data.
Awesome. So what's the problem that you're trying to solve in your research?
So initially we have a lot of background, as you mentioned, in similarity search in the group and also hierarchical data.
And XML was like this huge of related work that was done but json kind of
wasn't touched ever so the initial question was why and i mean we found out during the process
that it's not that trivial okay but that was the initial spark so so what was the reason why it was
never took yeah so this is one of the major problems that occurred because JSON contains ordered as well as unordered data.
So in XML, you usually have the order given, and unordered data is actually really hard to tackle in any sense,
and especially for similarity.
What's the motivation for this line of work? Why do we need it?
So in similarity search in general, it's like not always it works that you do an exact
match like you don't always have a number and you can't compare five to five but sometimes like in
our example in the paper if you're following that you may crawl the web and you get from different
data sources different kind of data that are actually the same but stored in a different way so this exact match might not work at all and therefore some
differences should be tolerated and it's of course perfect if you can somehow quantify it
what was the solution you developed then talk us through the approach you took to solving it okay yeah so we started by
checking related work as usual and what you find is like yeah millions of online tools where you
paste your json documents you compare them but what it actually is it's a text document that
considers a string line by line and this essentially ignores a lot of information so the hierarchy because
json is nested you can have yeah yeah whatever many nestings and this is ignored as well as the
unordered nature of the document so the key values of the documents so those are actually not really
what we would like to have i mean it's nice to have, but essentially not, I guess, the end of the story.
And, yeah, therefore, we do not want to lose information.
So first of all, what we did is we need a representation of JSON documents that we can work with and does not lose information and due to the hierarchy a tree is a somehow natural representation
that we extend it to also keep like the order unordered information and based on that we went on
due to our experience with hierarchical similarity queries we went the straight approach there is
like the tree at a distance which is a common and well applied
similarity measure however we applied it for them and we have proven eventually that it's
mp complete so again not what we want can you talk a little bit more about how how the tree
at a distance actually works okay yeah at a distance in general is like so the huge benefit is that it's totally it's simplistic actually also for strings
or other things you have a set of edit operations and you basically just count the number of those
operations that you need to transform one document into the other and for us in the j's and t's
representation it's just how many nodes do you have to delete, insert,
or exchange the information in the node
that you get actually the final distance value.
So it's this number of edit operations.
So once we're at this point, what's the next step?
How do we move forward from this tree edit distance to Jedi?
To Jedi, yeah, right.
So we were at this dead end, like it's empty complete.
So what to do next?
Yeah, actually we could have a closer look
because what actually happens to the Chasen trees
if we apply the distance?
And we found out that it's too permissive
and it's too powerful.
So what it does, the tree at a distance
allows to split subtrees to merge them back again.
And since Chasen, each subtree is a subdocument, The tree-edit distance allows to split subtrees to merge them back again.
And since JSON each subtree is a subdocument, the document has its information.
We do not want to split it somehow or move across.
So we introduced a constraint that we would like to preserve this document structure within the JSON document.
And by introducing that constraint on top of the tree-edit distance, we were able to define this first,
JSON edit distance or chat-eye.
How did you go about evaluating your new approach?
We were able, so for the chat-eye,
to construct a baseline algorithm out of existing algorithms by combining them, adapting,
and that runs in higher complexities,
but still it's doable.
And we applied that as a baseline to do a similarity lookup.
So I have a database with a lot of JSON documents,
and I just query a document against the database
and give me all similar ones.
And, yeah, so the baseline approach, we evaluated,
I think, 22 data sets, and we were not really happy. So it didn't really scale and complexity
was too high. Yeah, the next step was to address two main issues that usually occur in similarity
search. So what happens if you have large databases? What happens if we have large documents?
And a typical approach, it's called filter verification framework, is we try to apply
that on top of it, but there are many steps we do have to address. One of them, so processing
large databases by comparing to all of the documents, a natural approach is to
build an index. Since there is no related work on JSON similarity, we had to introduce the first
JSON similarity index. And we did that. And further on, we also applied filters,
since the verification is that expensive and yeah also the verification algorithm
we were able to identify some weaknesses that we could address and overall we had then now this
nice filter verification framework that we again performed the evaluation on those data sets and
they scale nicely given this this finding and these algorithms you've developed,
how do you see them being integrated into maybe commercial database systems, document database
systems, such as Mongo and Couchbase, those sort of things? What do they do at the moment or don't
they have anything like this? And would your work naturally fit into a system such as that?
Yeah, I also explored that. And actually, if database systems provide similarity support,
then it's usually on simple data types,
like strings or maybe even arrays.
But that's the end of the story.
So it's not provided.
And actually, a current effort is to integrate my solutions into AsterixDB.
And I'm currently on that.
How's it going?
Yeah, challenging.
So the system is like...
I don't know much about this system.
Can you tell the listeners and myself a little bit more about the system?
Yeah, it's a big data management system developed at mainly UCI, so UC and Irvine.
Yeah. And, yeah, it's a parallel database system,
and, yeah, one main aspect where we've chosen it
is that it has a JSON-like data format.
It's a bit extended, but it's no issue for distance.
So my next question is,
what was the most interesting and unexpected
or challenging aspect or lesson that you've learned while working on this topic?
And more generally in the field of similarity queries.
I think on the one hand it was like these incremental problems that showed up and now we're there.
And then the next one arised.
And yeah, also actually that it's not that well supported as i also mentioned previously in
in database systems yeah so i think there's a lot of room to for future research in the area
that kind of leads naturally into the next question you've obviously mentioned you're
working on implementing this in a system right what's the other future directions for this
research where are you going to go with it next yeah so we have some things going on um one of
them is actually right now we compute a distance so it's a value that tells you how different but
what we're currently doing is creating a patch then we have json patches which may be beneficial
for example you don't always have to send the document all over again you just send the diff
and therefore it used the traffic for For example, we're also currently working
on clustering JSON documents,
where also the distance can be used
to group them together in similar groups,
which may be used, for example, to extract schemas,
which then helps to optimize
the query efficiency in database systems.
I guess the last thing I'll kind of ask is,
how did you end up working in this area?
And specifically, what attracted you to it?
And how can other people get started necessarily in your area of research?
So basically, I have to admit also that in our research group,
the similarity topic was kind of present.
And I also started at that university.
And also my pre-knowledge got directed in some other similarity.
So I started to gain knowledge and to also get on the topic, actually.
Brilliant.
Well, I'll put a link to your paper and everything in the show notes.
And thanks for joining us today.
And we'll leave it there.
Thanks for having me. Brilliant, yeah. And enjoy the rest of your time in Philadelphia. notes. And thanks for joining us today. And we'll leave it there. Thanks for having me.
Brilliant, yeah.
And enjoy the rest of your time in Philadelphia.
Thanks, you too. Bye.