Disseminate: The Computer Science Research Podcast - Thomas Hütter | JEDI: These aren’t the JSON documents you’re looking for | #4

Starting point is 00:00:00 Hello and welcome to Disseminate, the podcast bringing you the latest computer science research. I'm your host, Jack Wardby, and today we're recording from ACM SIGMOD pods in Philadelphia. I'm delighted to say I'm joined by Thomas Hooter, who will be talking about his paper, Jedi. These are the JSON documents you're looking for, which I think is the best name for a paper I've ever seen. Thanks. Thomas is a postdoc at the Database Research Group at the University of Salzburg, and he focuses normally on novel algorithms for similarity queries as well as their integration into fully-fledged database systems.

Starting point is 00:00:57 Welcome, Thomas. Thanks for joining us. Thanks for having me, Chuck. So let's dive straight in. First of all, can you set the scene for your research for us and tell us what is JSON? Okay, JSON is a data format. I guess everyone has in some variety touched it, whether it might be a data set, which has just the data format to store it, or I don't know, exchange data in mobile applications, or especially here for the database community, I guess we used it from document stores where we stored the semi-structured data. Awesome. So what's the problem that you're trying to solve in your research? So initially we have a lot of background, as you mentioned, in similarity search in the group and also hierarchical data. And XML was like this huge of related work that was done but json kind of

Starting point is 00:01:47 wasn't touched ever so the initial question was why and i mean we found out during the process that it's not that trivial okay but that was the initial spark so so what was the reason why it was never took yeah so this is one of the major problems that occurred because JSON contains ordered as well as unordered data. So in XML, you usually have the order given, and unordered data is actually really hard to tackle in any sense, and especially for similarity. What's the motivation for this line of work? Why do we need it? So in similarity search in general, it's like not always it works that you do an exact match like you don't always have a number and you can't compare five to five but sometimes like in

Starting point is 00:02:34 our example in the paper if you're following that you may crawl the web and you get from different data sources different kind of data that are actually the same but stored in a different way so this exact match might not work at all and therefore some differences should be tolerated and it's of course perfect if you can somehow quantify it what was the solution you developed then talk us through the approach you took to solving it okay yeah so we started by checking related work as usual and what you find is like yeah millions of online tools where you paste your json documents you compare them but what it actually is it's a text document that considers a string line by line and this essentially ignores a lot of information so the hierarchy because json is nested you can have yeah yeah whatever many nestings and this is ignored as well as the

Starting point is 00:03:34 unordered nature of the document so the key values of the documents so those are actually not really what we would like to have i mean it's nice to have, but essentially not, I guess, the end of the story. And, yeah, therefore, we do not want to lose information. So first of all, what we did is we need a representation of JSON documents that we can work with and does not lose information and due to the hierarchy a tree is a somehow natural representation that we extend it to also keep like the order unordered information and based on that we went on due to our experience with hierarchical similarity queries we went the straight approach there is like the tree at a distance which is a common and well applied similarity measure however we applied it for them and we have proven eventually that it's

Starting point is 00:04:32 mp complete so again not what we want can you talk a little bit more about how how the tree at a distance actually works okay yeah at a distance in general is like so the huge benefit is that it's totally it's simplistic actually also for strings or other things you have a set of edit operations and you basically just count the number of those operations that you need to transform one document into the other and for us in the j's and t's representation it's just how many nodes do you have to delete, insert, or exchange the information in the node that you get actually the final distance value. So it's this number of edit operations.

Starting point is 00:05:14 So once we're at this point, what's the next step? How do we move forward from this tree edit distance to Jedi? To Jedi, yeah, right. So we were at this dead end, like it's empty complete. So what to do next? Yeah, actually we could have a closer look because what actually happens to the Chasen trees if we apply the distance?

Starting point is 00:05:34 And we found out that it's too permissive and it's too powerful. So what it does, the tree at a distance allows to split subtrees to merge them back again. And since Chasen, each subtree is a subdocument, The tree-edit distance allows to split subtrees to merge them back again. And since JSON each subtree is a subdocument, the document has its information. We do not want to split it somehow or move across. So we introduced a constraint that we would like to preserve this document structure within the JSON document.

Starting point is 00:06:07 And by introducing that constraint on top of the tree-edit distance, we were able to define this first, JSON edit distance or chat-eye. How did you go about evaluating your new approach? We were able, so for the chat-eye, to construct a baseline algorithm out of existing algorithms by combining them, adapting, and that runs in higher complexities, but still it's doable. And we applied that as a baseline to do a similarity lookup.

Starting point is 00:06:34 So I have a database with a lot of JSON documents, and I just query a document against the database and give me all similar ones. And, yeah, so the baseline approach, we evaluated, I think, 22 data sets, and we were not really happy. So it didn't really scale and complexity was too high. Yeah, the next step was to address two main issues that usually occur in similarity search. So what happens if you have large databases? What happens if we have large documents? And a typical approach, it's called filter verification framework, is we try to apply

Starting point is 00:07:14 that on top of it, but there are many steps we do have to address. One of them, so processing large databases by comparing to all of the documents, a natural approach is to build an index. Since there is no related work on JSON similarity, we had to introduce the first JSON similarity index. And we did that. And further on, we also applied filters, since the verification is that expensive and yeah also the verification algorithm we were able to identify some weaknesses that we could address and overall we had then now this nice filter verification framework that we again performed the evaluation on those data sets and they scale nicely given this this finding and these algorithms you've developed,

Starting point is 00:08:07 how do you see them being integrated into maybe commercial database systems, document database systems, such as Mongo and Couchbase, those sort of things? What do they do at the moment or don't they have anything like this? And would your work naturally fit into a system such as that? Yeah, I also explored that. And actually, if database systems provide similarity support, then it's usually on simple data types, like strings or maybe even arrays. But that's the end of the story. So it's not provided.

Starting point is 00:08:36 And actually, a current effort is to integrate my solutions into AsterixDB. And I'm currently on that. How's it going? Yeah, challenging. So the system is like... I don't know much about this system. Can you tell the listeners and myself a little bit more about the system? Yeah, it's a big data management system developed at mainly UCI, so UC and Irvine.

Starting point is 00:09:02 Yeah. And, yeah, it's a parallel database system, and, yeah, one main aspect where we've chosen it is that it has a JSON-like data format. It's a bit extended, but it's no issue for distance. So my next question is, what was the most interesting and unexpected or challenging aspect or lesson that you've learned while working on this topic? And more generally in the field of similarity queries.

Starting point is 00:09:31 I think on the one hand it was like these incremental problems that showed up and now we're there. And then the next one arised. And yeah, also actually that it's not that well supported as i also mentioned previously in in database systems yeah so i think there's a lot of room to for future research in the area that kind of leads naturally into the next question you've obviously mentioned you're working on implementing this in a system right what's the other future directions for this research where are you going to go with it next yeah so we have some things going on um one of them is actually right now we compute a distance so it's a value that tells you how different but

Starting point is 00:10:11 what we're currently doing is creating a patch then we have json patches which may be beneficial for example you don't always have to send the document all over again you just send the diff and therefore it used the traffic for For example, we're also currently working on clustering JSON documents, where also the distance can be used to group them together in similar groups, which may be used, for example, to extract schemas, which then helps to optimize

Starting point is 00:10:41 the query efficiency in database systems. I guess the last thing I'll kind of ask is, how did you end up working in this area? And specifically, what attracted you to it? And how can other people get started necessarily in your area of research? So basically, I have to admit also that in our research group, the similarity topic was kind of present. And I also started at that university.

Starting point is 00:11:06 And also my pre-knowledge got directed in some other similarity. So I started to gain knowledge and to also get on the topic, actually. Brilliant. Well, I'll put a link to your paper and everything in the show notes. And thanks for joining us today. And we'll leave it there. Thanks for having me. Brilliant, yeah. And enjoy the rest of your time in Philadelphia. notes. And thanks for joining us today. And we'll leave it there. Thanks for having me. Brilliant, yeah.

Starting point is 00:11:26 And enjoy the rest of your time in Philadelphia. Thanks, you too. Bye.

Your Ad Here

Disseminate: The Computer Science Research Podcast - Thomas Hütter | JEDI: These aren’t the JSON documents you’re looking for | #4

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.