Disseminate: The Computer Science Research Podcast - Haralampos Gavriilidis | In-Situ Cross-Database Query Processing | #27

Episode Date: March 20, 2023

Summary: Today's organizations utilize a plethora of heterogeneous and autonomous DBMSes, many of them spread across different geo-locations. It is therefore crucial to have effective and efficient cross-database query processing capabilities. In this episode, Haralampos Gavriilidis tells us about XDB, an efficient middleware system that runs cross-database analytics over existing DBMSes. Tune in to learn more! Links: Preprint | Haralampos's homepage | Support the podcast here! Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, the Computer Science Research Podcast. I'm your host, Jack Waudby. I'm delighted to say I'm joined today by Haralampos Gavriilidis, who will be talking about his ICDE 2023 paper, In-Situ Cross-Database Query Processing. Haralampos is a research associate in the Database Systems and Information Management Group at TU Berlin, and his research interests lie within cross-platform query optimization and execution in federated polystore systems, big data query languages, and IoT data management. Haralampos, welcome to the show. Hi, Jack. Great to be here. Thanks for the invitation. The pleasure is all ours.
Starting point is 00:01:02 Let's dive straight in. So can you maybe tell us a little bit more about yourself and how you became interested in researching data management? Yeah, sure. So it all began in my previous life, where I was a web developer. And there, I actually enjoyed working with databases. So I was always waiting for the time to, you know, write my SQL queries and tune the database and things like that. And yeah, so that led to the fact that during my studies, I then kind of specialized in data management, if you will. And then I actually wanted to, you know, become a web developer. But then I realized that,
Starting point is 00:01:46 yeah, I don't want to become a web developer, I want to become a data engineer. But I said, okay, now I know nothing about data management. So, you know, let's do a master's in data management. And then at the same time, I dropped, development and I started working as a research assistant in our group. And this is when I found out that I'm also a little bit into research. And then, yeah, I was like, OK, let's give it a try. Let's continue with a PhD. So, yeah, a couple of years later um yeah here talking about uh my cross database paper fantastic so from the front end to the back end to research that's the that's the path
Starting point is 00:02:33 um cool so let's let's talk a little bit about cross database query processing then so maybe you can tell us kind of what it is and maybe why we need it yeah sure so um so what it is and maybe why we need it. Yeah, sure. So what it is, it's essentially decentralized query processing on, let's say, multiple heterogeneous data sources. So just imagine you have data lying around in multiple systems, multiple databases, and you want to process, let's say, analytics queries on that. So essentially, you need some way to answer your queries. And yeah, that's pretty much it, like from the very high level.
Starting point is 00:03:18 Essentially, it's something that we need for data integration, let's say. Okay, nice. So I guess maybe then we can talk about maybe some of the existing approaches to um to kind of sort of solving this problem and why we why we need to do research in this area still right right right right right right so um yeah so essentially we have the media to wrapper approach um that was introduced to, let's say, integrate data from different sources. And let's say, looking at nowadays setups where we have multiple systems, and we still use this architecture, let's say this has two major downsides.
Starting point is 00:04:00 So first, so we need an additional execution engine. And we all know query engines are really complex software artifacts. I mean, they get decades or many years to be built, and they need specialized people to be maintained, and so on and so forth. And then when having this additional query engine, you actually also need to deploy it somewhere. So you essentially need hardware resources,
Starting point is 00:04:27 like either on-premise or in the cloud, you would need to, let's say, give it some hardware to run on. And then you also need to maintain that infrastructure. So essentially you need what we nowadays call DevOps engineers to actually, you know, maintain this infrastructure with your software on it. And this all leads to monetary costs.
Starting point is 00:04:48 So this is one aspect. And the other, let's say, disadvantage of the Mediator App architecture is that it results to redundant data movement. So let's say I have two data sources, A and B, wanting to join and I want to join these two tables. Why should I pay for shipping both A and B instead of just shipping one of the two, like the smallest, like just a very simple example. And yeah, so overall, we see that we need additional hardware resources and human resources. And we have additional redundant data movement, which also is a bottleneck for performance. So this is, let's say, why I believe the mediator architecture is not suitable for all kinds of queries.
Starting point is 00:05:40 Fascinating. Yeah. So you've got the fact that you've got to provision extra resources and the monetary cost and then also the fact that you're moving data around, right? Cool. And what are the examples of systems that are architected in this way? Like what are the most common systems? Yeah. So, I mean, in the ancient times you would have this Tzimis system. I think it came from Stanford, if I'm not wrong. It was a research prototype.
Starting point is 00:06:07 Then we had the very popular Garlic system from IBM. When you take a course on data integration, this is like the thing that you will talk about. And then, so nowadays, we have this very popular Presto system, or like I think Amazon also markets this as AWS Athena. We can also do this with Spark. Essentially, every system that allows you to connect to different sources. So for example, also in Spark SQL, I can have different connectors for different databases,
Starting point is 00:06:44 but there essentially we would do the same. Like Spark, its distributed engine would act as a middleware. It would connect to the sources, see what it can push down, fetch all the data centrally in its execution engine, and then perform all these cross-database operations. Yeah, so other approaches are also Apache Drill. I'm not sure if you're aware of that so this is based i think on google's dremel paper um we also have cal's apache calcite um there are many systems
Starting point is 00:07:15 that allow you to do this kind of analytics and essentially there's a a lot of um yeah a lot of there's a big market in data integration i would say so that there are a lot of systems that um allow you to do this kind of analytics and they all rely on this um mediator app architecture essentially you have their own query engine you implement or you provide the wrappers essentially the connectors to the underlying sources and what they do is they fetch the data centrally and perform these cross database operations there. In the best case, they will also push down certain operations like projection selections to transfer this data, let's say. Cool. Great stuff. So we've kind of discussed like what the problems are with immediate wrapper
Starting point is 00:08:00 approach then. Can you, I know the system that you've introduced is called XDB and this idea of a cross-DBMS approach to solving this problem. So maybe you could give us the elevator pitch for this work and the system and for XDB. Yeah, sure. So nowadays we have data in many different systems. So coming from many different sources, we have transactional databases or data from data marketplaces or open data whatsoever. And these are all hosted on different systems. And to perform analytics on them, and actually gain value out of the data, we many times need to combine it to integrate it. And so today, you would use the mediator upper approach, or like systems that are based on the mediator-upper architecture. And as we discussed before, this is both slow and expensive.
Starting point is 00:08:52 And instead, we propose this cross-database query processing paradigm, which is leveraging existing systems to process queries. So what does this mean? I have my systems. They know systems to process queries. So what does this mean? I have my systems. They know how to process. They know how to, let's say, process queries. I want to exploit them. I don't want to pay for a new system. I want to use existing systems.
Starting point is 00:09:19 Okay, nice. Can we maybe dig into the details a little bit here? So the system that you implement is called XDB, right? So how do you go about achieving this idea of cross-database execution? So the idea is to combine existing techniques from different distributed query processing paradigms. So, I mean, there is the media interoperable architecture that we talked about, which is, let's say, based on research on federated databases.
Starting point is 00:09:48 But then we also have, let's say, distributed and parallel databases, or we have peer-to-peer databases. And they all have some nice characteristics. And we actually take the best from all and propose cross-database query processing. So essentially what we want to have is, let's say, in peer-to-peer and distributed databases, you have the workers, if you wish, talking with each other. So we want that. This results to efficiency. In federated databases, we have this nice user interface that abstracts away all the different sources. So we want that. Yeah,
Starting point is 00:10:33 so essentially, it's a mixture of all these approaches. Okay, cool. So given that you've taken, I guess, all the best bits, I think, and rolled them all together, how does, I guess, maybe tell us a little bit more about the actual architecture of xdb then what are the core components in it yeah yeah yeah right so um so let me guide you through uh let's say query processing in in our system so like i said, we want this nice user interface. So we expose to the users an interface where they can use multiple tables from multiple systems and essentially write SQL queries. And then we take this query and we have an optimizer that has three phases. So essentially, we optimize the query logically. Then we do some operator placement.
Starting point is 00:11:29 Essentially, we want to decide which operation is going to be executed on which underlying system. Then we have what we call the plan finalization. So there we actually create the delegation plan. And this delegation plan is the, let's say, core abstraction of our system because this is what allows us to then ship instructions to the underlying systems for then executing the query.
Starting point is 00:12:02 So essentially, you can imagine we get the query from the user, we optimize it, and also decide where it's going to be executed. So which part of the query is going to be executed where. And then we ship certain instructions to the underlying systems, such that then in the execution phase, they know how to process the query. And yeah, so after building the delegation plan, we deploy this on the system, on the underlying systems. And this is essentially a cascade of views on different systems.
Starting point is 00:12:35 And when we, let's say, query the root view, then we start this decentralized execution. So essentially a recursive evaluation of all the views that we just registered. And we can talk about the mechanics, of course, later. But this is the high-level overview. Okay, cool. So the thing that kind of jumped out there is that, and we'll talk a little bit more of this when we go into the mechanics,
Starting point is 00:13:01 but how are these systems, is there a kind of a common language amongst these systems in that you said that they ship instructions between, you basically determine the instructions you want and then ship to each system, and then they kind of communicate with each other. How do you ensure, like what sort of interface do you use, what sort of API do you use to ensure that everyone's
Starting point is 00:13:19 sort of speaking the same language? Right, right. So in this paper, we focus on relational systems so essentially um yeah systems with relational apis like sql apis and we also have so what we call let's say wrappers or we call them actually database connectors but essentially it's the same thing we also have wrappers to translate the instructions from, let's say, our intermediate representation to the underlying systems. And so you can imagine we also have wrappers,
Starting point is 00:13:52 but they don't really participate in execution. So we just have them to, let's say, deploy this delegation plan to the underlying systems. And what we use is essentially views, which, I mean, yeah, if not all, most of the systems support. And we also use now for the systems to talk with each other, we rely on the SQL Med standard. So I'm not sure if you're aware of the SQL Med standard. So let me say a few sentences about this. So I think this was introduced in the late 90s or beginning of the 2000.
Starting point is 00:14:32 And MAT stands for Management of External Data. And essentially, this was introduced to allow databases process queries with tables that are not located physically within the systems so you know because databases community has also received a lot of criticism about this so databases actually cage the data so essentially a database wants to know like wants to decide how the data is going to be layout, how the data is going to look like, and what kind of indexes you're going to build on it. So essentially, database needs to know about the data. But many times, in industrial scenarios, you need to also, you know, combine multiple data sources. So I think it was introduced to allow users querying, let's say, text files or spreadsheets
Starting point is 00:15:30 together with their tables in the databases. And then it was also extended to allow querying data from different databases. So the SQL MAT standard is implemented by many of today's systems. And, yeah, as with any standard, every database vendor tries their best, let's say. And we have
Starting point is 00:15:56 in Postgres, let's say this is implemented through what they call foreign data wrappers. Essentially you give some connection details. You can have, let's say, a JDBC foreign data wrapper, where you just load your JDBC driver and the connection details, and then you can query the data. Or in MariaDB, it's implemented through a new storage engine, which is called the Connect Engine. I don't know, in Apache Hive, it can be supported also through JDBC connectors.
Starting point is 00:16:29 So they're called external tables, not foreign tables. Essentially, we see many vendors implementing it in many different ways, but the functionality remains the same. So the SQL MATCH standard allows you to process query on a database on data that is not native to it. And it fetches it during the runtime. Closing the parentheses on the SQL math standard. So again, going back to your original question. So we use, let's say, wrappers ourselves to send the instructions to the underlying systems,
Starting point is 00:17:03 which are essentially just translating relational operations to database-specific dialects. And for the communication between the systems, we use the SQLMED standard. Cool. So you alluded to it a little bit earlier, but let's dig into the mechanics. There's these two phases in XTB, the delegation and the execution phases. So maybe you can start off with the delegation phase, and then we, the delegation and the execution phases. So maybe you can start for the delegation phase, and then we'll move on to the execution phase. Yeah, so actually, before the delegation phase,
Starting point is 00:17:32 we also have this optimization phase, let's say. So as I said before, we get the query from the user, and then we have the three-phase optimization. So in the first phase, and we do this this because I mean, if you think about it, query optimization is already like super hard, you know, and then in the distributed system, it's like a bit harder. And then we are also in this kind of black box environment. And there it gets even more out of hand. So we decide to have this three-phase approach to simplify a bit optimization. So in the first phase, we act as if we would be on a single machine, on a single database system.
Starting point is 00:18:18 Essentially, we do traditional query optimization like selection, projection, pushdown, join ordering based on existing statistics, this kind of stuff. And then when we have this, let's say, logically, but not entirely logically because we also do join ordering, which is based on statistics, but let's say we have kind of an optimal plan, we need to decide where to place each operation, right? And now we make several assumptions here. So for example, all the operations,
Starting point is 00:18:55 so having a query plan and so traversing it bottom up, let's say, all the operations that originate from a single database, we want to keep it on that database, on that system. So it doesn't make sense to send my data away and then apply a projection. So essentially, I would want to minimize the data that I transfer as much as possible. So that means that for every, let's say, non-cross-database operation, we place it on the system of its origin. And now for cross-database operations,
Starting point is 00:19:41 we assume that it's good to execute it either on the one system or on the other, and not on a third one. So essentially, we don't want to rely on this, let's say, mediator pattern again, where we have a third system, and we ship our two datasets to another third system. So we make this assumption that it's good to transfer less data. And so this means that for cross-database operations, we just pick between these two systems. Yeah, and then we essentially place all the operations on the system. So we traverse our query plan bottom up and we decide the operator placement.
Starting point is 00:20:26 So where is each operator going to be executed? Yeah, then we create in the last phase, we create what we call this delegation plan. Essentially a delegation plan. You can imagine it as a direct acyclic graph. So we have nodes that are tasks. And then we also have edges between those nodes that are the data movement, right?
Starting point is 00:20:50 And now how we construct this delegation plan is by fusing together all the operators that have the same annotation, right? So you can imagine, again, we traverse our query plan bottom-up. Now it's annotated. And we fuse together all the operators with the same annotations into one task. With the annotations, I think I missed the step on what the annotations refer to.
Starting point is 00:21:19 During this operator placement phase, we want to decide for each operator where it's going to be executed. So essentially, the annotation is the database system that this operation is going to be executed on. And I mean, you can imagine like a query plan, which is this tree with relational operators. And we go operator by operator and we decide okay this operator is going to be executed there this operator is going to be executed there here um um i mean during the optimization we need yeah to decide um two things essentially one where the operator is going to be executed and the other is um how the operator is going to be executed. And the other is how the data is going to be moved. And now, I'd like to talk a bit about how we decide the placement, right? So we have this cross database operation. Essentially,
Starting point is 00:22:18 I'm at a node now in my query graph, in my query tree, that has inputs from two different sources. And there is where I need to decide, okay, where am I going to place this? And so for that, we have two approaches. Let's say one is, let's say we ask the systems. So, you know, there is like this explain like functionality in the systems, and there you also have a cost. So we ask the underlying systems like, oh, hey, if I would execute this query on your system, how much would I pay? And then we also ask the other system and yeah, then we decide. But of course, this is only like in an ideal world where I have the same database type, let's say, which gives me the same cost unit.
Starting point is 00:23:09 And what we can do there is, let's say, to calibrate between the costs, between the different costs, or to even calibrate, let's say, execute different queries on different systems and see how they perform between each other. And then depending on the characteristics of my queries and my data, I have some, let's say, function that gives me which placement is better, let's say. But we're also looking in machine learning techniques to learn the costs, essentially executing many operations on many systems. And then, let's say, let the model tell us, if you have this join and you're going to choose between this and this system, choose this one. Because we want to support this more.
Starting point is 00:24:02 So right now, we're more like in the gray box scenario where we have some kind of cost or something. But if we want to, let's say be really dark black box, we, yeah, we need to look into this same machine learning approach where we ask differences.
Starting point is 00:24:21 So we run different queries and then, and then based on the data and query characteristics, we can decide. So this is the first thing that we need to decide, placing the operations. And then second is the data movement. So what types of data movement do we have? So right now we just have two. But essentially this is pipelining or materializing so because i can move the data either you know by pipelining it or or materializing it and so why would i want to
Starting point is 00:24:55 choose between two because many systems would choose different local execution strategies if the data is local to them so for example with Postgres and its foreign data wrappers, it cannot parallelize operations when, let's say, we use a foreign data wrapper. Instead, when we materialize the data, it can do parallel processing. So this can, in some cases, lead to better performance. So essentially, these are the two things that we decide. So where are the operations going to be placed and how is the data going to be moved?
Starting point is 00:25:38 And once we decide this, we, let's say, encode this information into the delegation plan. I had a question there on the, when you're determining the cost of moving data around, do you factor in the actual dollar cost of moving, let's say you're operating like a cloud environment, of moving data between sort of maybe moving it from AWS to Azure or something like that,
Starting point is 00:26:03 or is that sort of out of scope? Yeah, very interesting question. So we actually don't look at monetary costs right now. But I mean, these are things that you can build into your optimizer. So also, I mean, I guess we're going to talk about this later, but also in the evaluation, we kind of evaluate also how much data is flowing around, right? And sometimes you might have different constraints. So for example, you might have three databases that are all located in an AWS region. And so I don't know how the pricing works, but I assume that, let's say,
Starting point is 00:26:47 it's cheaper to move data within a cloud vendor than switching cloud vendors. But these properties, you can all somehow encode or add to your optimizer. Another example for this would be, let's say, now we assume that we have this, let's say, mesh topology where all the nodes are connected with other nodes, which might not be true because sometimes you might have, let's say, an intranet and there may be within the intranet three databases can communicate with each other. But then, yeah, only one can communicate with the outside world. So we would have, let's say, placement constraints.
Starting point is 00:27:30 And these are all things that we also plan to look at and enhance our optimizer. So yeah, the first step was just to get something to actually have some somehow optimal query plans, and this worked fine. But of course, there are many ways we can extend this. And I'd say this intra-cloud or inter-cloud communication is definitely one of the things. I think also there was a paper about that.
Starting point is 00:28:02 It's called now Sky Computing. I think it was this American Hot OS conference, I think, where they talk about this processing data over multiple clouds and the next cloud computing thing, let's say. Just let me add one more thing on that. So I think also, Marina, I mentioned before data marketplaces, but so the focus right now in data marketplaces is, let's say, buying and selling data.
Starting point is 00:28:37 But I think this is going to evolve into like buying data together with compute because, let's say, we have several open issues. Let's say you're a data supplier, and I buy your data. And now I want to, let's say, get it. I mean, you host the data. I don't physically copy it. But then I want to do operations on the data so for example i want to select or project or aggregate who pays for that
Starting point is 00:29:14 do you pay as a data supplier or like how i don't know like um i'm also not into um this area of research but i think if let say, data suppliers also start thinking about compute together with data, this will be a very interesting direction. And there was this, I'm not sure if you're aware of the Mariposa paper. So this is also about data integration, but also views it from the economical perspective.
Starting point is 00:29:46 So data marketplaces are going to play an important role in the upcoming years. And I also think that we'll need different processing paradigms to combine data from multiple providers and follow different legislatory constraints. So GDPR, things like, oh, this data cannot move out of this continent, things like that. But, yeah, this is all, yeah, in the future. Okay, cool. Yeah, so where were we in the – we were talking about –
Starting point is 00:30:20 because we got a little bit sidetracked there, right? So we were talking about the details and the mechanisms of XDB, and we spoke about the delegation phase. Have we gotten to the execution phase? Yeah, no. We talked about optimization and delegation, and now we're going to talk about execution, I guess. So, okay. Where were we? we're going to talk about execution, I guess.
Starting point is 00:30:45 So, okay. Where were we? We had our delegation plan. Okay. And so actually we need to, let me refresh the delegation phase. So we take all these tasks and we translate them into DBMS specific statements. So this is where our database connectors or uppers come into play, right?
Starting point is 00:31:07 Because I have some tasks having some relational algebra expressions. And now I have different databases. I need to somehow store these tasks on the databases. So what we do is we go task by task. And we register our tasks as views on the systems so essentially i go i have my first task i go to my first system i register it as a view now this task is a dependency for the next next task so what i do is i go to the next system and I register this view as a foreign table, foreign tables we talked about before, right? And then I, let's say, define another view
Starting point is 00:31:55 that joins the foreign table with a local table. And then I do this for the remaining tasks. And what we have now is a cascade of views on on multiple systems right and and up to now nothing was really executed so we deployed our plan essentially now we have a cascade of views um essentially these are the instructions that um for the databases to then execute the query plan, right? So now we have all those views on all the systems. And if we want to execute that query, the only thing that we need to do is to go to the last system. And okay, this we abstract also way through our XDB client.
Starting point is 00:32:41 But what happens behind the scenes is we execute the last view on the system. So essentially the view of the system so essentially the root the view of the root node of our query plan and by executing it it will need to recursively evaluate the rest of the views right so and this has this um yeah let's call it trickle-down effect, where one view evaluates the other view, and then tuples start flowing. I mean, when we reach the last or like the first registered view, tuples start flowing through the systems, right? And then we have this nice inter-DBMS pipeline where we execute parts of the query on different systems in a decentralized way.
Starting point is 00:33:33 And yeah, I mean, that's pretty much it. Awesome. One question that jumped out at me, and maybe this is a bit of a stupid question, but you end up with this like decentralized execution right and is there any notion of like a session between so i'm guessing what i'm trying to say is is like is one dbms kind of aware that it's a member of like a bigger query basically or other than the fact that it has some like views created from like foreign tables but is it aware that that's sort of like tagged with this is query one basically so i thing that I'm trying to run here, or a query with some ID, or is it totally like it just does its part
Starting point is 00:34:09 and passes it on and then forgets about it? Yeah, good question. So actually, so the DBMS itself, no, it doesn't know that it's part of this bigger execution plan, right? It just knows I'm executing this part of the query. It doesn't even know that this thing that it's executing is part of a bigger query. But yeah, so then, of course, you assume access for injecting external data into systems and this is not allowed and things like that. And I believe that this is more like of a kind of like an implementation detail that, let's say, we should have access to, let's say, special part of the database where we are allowed to create you know temporary tables and things and there also need to be some mechanics there to you know clean
Starting point is 00:35:12 up after the query execution because you know we create all those views and foreign tables and yes of course i mean after each query um we we need to clean up and there are mechanisms for that. Yeah, so we're also working on, let's say, at some point releasing XTB. And there we want something more sophisticated. So right now it's just, you know, registering some views for a table showing, okay, this works. But, of course, if you want to do it in a more principled way, like, you know, production-like ready, I will not even say production ready or something.
Starting point is 00:35:55 Yeah, we would need something like sessions and each user being assigned to different, yeah, registered views and, like, to avoid name collisions and clean up even if something crashes, to have some mechanisms there to be able to clean up things. Yes, yes, of course. I guess let's talk a little bit more about XDB then. So how did you go about implementing it?
Starting point is 00:36:22 And obviously it's very much proof of concept at the moment, but like maybe you can talk us through the engineering effort that went into it and how you went about doing it. Yeah. So the nice thing about XDB is that it's really simple. I mean, so, I mean, I implemented it in Java, but, you know, you could have used pretty much any language to implement it. Yeah, so what we're also working on is, let's say, transitioning to Calcite for also query
Starting point is 00:36:56 parsing and parts of the optimization. So I'm not sure if you're aware of apache calcite so super powerful um uh query not only query optimization framework but also um query parsing and things like that also it also has an execution engine i mean it's an in-memory java-based engine which is not too performant but um yeah many different projects use it um either for SQL parsing or for query optimization. So I don't know. Also Apache Hive uses it. Flink uses it.
Starting point is 00:37:32 Many systems rely on Apache Consight. Yeah, but right now, you know, like we have pretty simple query parsing, straightforward traditional query optimization techniques and yeah database connectors we implement essentially through um yeah the respective i don't know like yeah we just need the respective um yeah jdbc drivers and um yeah the the different so we have interfaces that define the different methods to do the specifics of the delegation. So some systems, let's say, it's called like create external table and others is like create foreign table. You need different, yeah, let's say configurations. But yeah, I mean, the idea is pretty simple and um also the
Starting point is 00:38:28 implementation and we're actually um working on releasing a nice version of it nice that's great i was going to ask at some point kind of is this is it publicly available and i guess soon it will be is the answer i mean it is also right now. It's just that, you know, we're preparing it a little bit. Yeah, we're prettying it up. Giving it a polish, yeah. Right, yeah. How long did it take to implement, by the way? Like what was the sort of time there, Joris?
Starting point is 00:38:59 You said it was quite straightforward to do. Oh, yeah. I mean, it's quite straightforward. And i i mean i can't tell you now like a time range or something because i i actually worked on many different things at the same time i i don't have it in my head right now but it's it's pretty straightforward also adding a new system so if you just implement this interface with, you know, like register, foreign table, create view, things like that, it's very easy to add also new systems. Okay, cool. Yeah.
Starting point is 00:39:32 Okay, cool. Let's talk evaluation then. So I'm convinced that the cross database approach is the way, across DB is the approach to go. So let's talk some numbers. So let's go into the evaluation and maybe tell us, paint the picture for us here. What were you trying to, what questions were you trying to answer?
Starting point is 00:39:51 And what was the setup? Yeah, sure. So again, the goal, the general goal of XTB is also to kind of leverage and exploit what's already out there. So my initial thought was, okay, you know, we have all those systems.
Starting point is 00:40:12 So we have data stored on different database management systems, and they all know how to do relational operations. So why do we need this extra system? And these extra systems that are, let's say, introduced for this kind of job, so what one may call federated query engines, they're quite powerful within their execution engine. And one would say that it's a very difficult fight because there has been a lot of, let's say, engineering effort
Starting point is 00:40:51 and a lot of effort in building those query engines. So now I'm talking, let's say, about Presto, let's say, or Apache Spark. Yeah, so the main dimensions that we wanted to evaluate are, let's say, three or like four, depending on how you see it. So first, runtime performance. So, right, I mean, everyone wants to have fast queries, right? Then we also touched this a bit before. We have the amount of data that is transferred.
Starting point is 00:41:34 So how much data do we actually transfer? Because in the beginning I said, oh, we transfer less. And so we wanted to see like, okay, do we really transfer less? And so the first two things also have also, so there we are also interested in, let's say, scaling system, both in terms of systems that participate or like joints in a query or like different data sizes, right? And then we're also interested in the overhead that we produce with our optimization and delegation phase. Because we do multiple round trips with the databases and one might say, oh, this is expensive. You do all those round trips, ask for the costs, register all the views. This may be a bottleneck.
Starting point is 00:42:31 So evaluate XDB and our in-situ approach following those dimensions. And with regard to runtime performance, so actually we compare our system to the baselines I mentioned before Presto, but we actually compare against Trino, which is this fork of Presto that's a bit more actively maintained. And we also compare against the single node Postgres, which we use as a mediator. So essentially, you can use this SQL MAT standard also to act as a mediator, right?
Starting point is 00:43:03 So I have one Postgres, I register all the underlying tables on it. Yeah, so with regard to performance, we saw that we're up to six times faster. And I mean, I didn't mention that. So the setup is as follows. We take the TPC-H dataset. So this is like the standard benchmarking dataset for all app queries. So we now take this
Starting point is 00:43:31 dataset and we distribute it on multiple systems. So we put different tables on different systems. And to also have kind of different topology characteristics, we have what we call those table distributions. So we place the tables differently every time. And yeah, we executed some TPC-H queries, actually the ones with the joins. And yeah, so we saw that with regard to performance, we can achieve performance improvements. And this is based also on the fact that we're kind of parallelizing the query between the nodes, right? And we also transfer less data.
Starting point is 00:44:20 Now coming to data transfer, there we evaluate two scenarios. One is where, let's say, we have XDB on the cloud, essentially as a mediator, and the systems are all on-premise. And there we measure the data that goes from the, let's say, on-premise infrastructure to the cloud. And we also have what we call in the paper the geo-distributed setup. There we have this, let's say, geo-distributed setup of the systems, and we measure all the data that flows between the systems. And there we also saw that, yeah, we actually move quite less data than the mediator-upper approach.
Starting point is 00:45:10 And then finally, with regard to this, I mean, of course, we also executed the different queries with different scale factors, right? So this is, let's say, different sizes of data. And yeah, we saw that the performance scales well. And yeah, finally, we also execute. So we measure all those, let's say, let me call them preparation phases before execution. And there we saw that, okay, the overhead is very minimal. So the worst case, we had a 10-second overhead. I mean, of course for
Starting point is 00:45:45 short queries um 10 seconds is a lot yeah but um yeah when you need when you have larger scale factors and you need to wait more for the query um to finish um this overhead is quite negligible okay cool yes i i'm i'm convinced that the next time I need to do some cross-database query processing, I need XDB. But I guess to play sort of devil's advocate, or maybe not devil's advocate, but to take the other side for a moment,
Starting point is 00:46:16 what are the limitations of XDB? And like, what scenarios is the performance suboptimal? I mean, maybe touch on one there, like if you've got short queries, the preparation phase proportionally to it can be quite long can maybe dominate the query performance almost but are there any other um kind of limitations um yeah so limitations with with regard to performance yeah with regard to performance and maybe just sort of um the general sort of usability this is some kind of prototype but maybe just sort of the general usability.
Starting point is 00:46:48 But I mean, some of the shortcomings of it. Yeah, yeah. So look, so now we rely on, so in the evaluation, we also, we have two setups. The one is where we use, let's say, all Postgres databases with their foreign data wrappers because they have this, let's say what they call, or I call it the native foreign wrapper, which is specialized for Postgres, let's say. But we also have the Regenius setup where we use Postgres, MariaDB and Hive. And there we also, so the performance also depends on the performance of this implementation of the SQL MAT standard.
Starting point is 00:47:34 So we do the evaluation on the heterogeneous setup to show that it's possible and that still we have performance gains. But of course, there we are more dependent on the implementation of the SQL MAT standard. So for example if you have a slow JDBC driver
Starting point is 00:47:55 this might result to suboptimal performance. But this is also then this is also, then this would also hurt, let's say, the mediator because it would use the same kind of connector.
Starting point is 00:48:13 But yeah, essentially, this is one thing. And then also, if you think about it, yeah, we may run into suboptimal performance because of different query characteristics. So if you have, let's say, I don't know, if you want to union the intermediate results of many databases,
Starting point is 00:48:37 this might lead to, let's say, more data transfer. But yeah, so this is another case. Cool. Yeah, I mean, that makes sense. I guess this is sort of the use of SQL kind of men and the fact that you've scoped this work initially to focus on relational systems. I guess maybe it's on you. We can maybe touch on this when we speak about future work.
Starting point is 00:49:01 But I guess, have you going to explore maybe systems with different data models, like key value stores, speak about future work but i guess right how are you going to explore maybe systems of different data models like key value stores graph databases to have to kind of support processing across the heterogeneous data models as well right right right very interesting question and also i mean yes you can see that also has a limitation of the current xdb so that it works only for let's say relational systems and yeah so there we have there has been the idea of extending it also to other data models and query languages as well so the concept would still remain the same right so you could still instead of centralizing all your data,
Starting point is 00:49:46 kind of let different systems talk with each other. And now I also think if it's about joining data, it shouldn't be a problem either. But yeah, then when we run into, let's say, different semantics of data and also different, yeah, how to say, query languages or like query representations, we get into other troubles. And this is what also the poly stores, the poly store arena, the poly store research area try to address a bit. But yeah, so we're definitely also looking into incorporating other systems.
Starting point is 00:50:39 Essentially, what we need is, let's say, views and foreign tables or some kind of SQL math standard. And yeah, we're actually looking into that. Nice. Cool. That'd be really interesting to see how that work develops. Cool. So I guess kind of bringing the work back to how a software developer might be able to leverage your findings, or like a data engineer or whatever, might be able to leverage the findings in your research. What sort of impact do you think it can have, I guess is what I'm trying to ask.
Starting point is 00:51:17 You could think of the idea of decentralized query processing, right? Because, you know, whenever you get into this situation where you need to join data from two different sources, you're like, okay, I'm going to have this new system and I'm going to use that to integrate. But yeah, maybe one finding is that there are already good enough systems out there and we should see how we can leverage them to the maximum instead of you know every time building a new system reinventing the wheel um decentralization um can also pay off under of course under the right circumstances right so you cannot apply this to like every use case but um yeah essentially um leveraging what is what is out there and we have a lot of good really good
Starting point is 00:52:16 databases um so for example postgres is like really um well as an open source project, it's really strong. I admire it. Also other database systems out there. We should leverage what's out there and try introducing new concepts, new systems
Starting point is 00:52:40 all the time. The next question then, and we mentioned a little bit throughout the course of the show, but what's next for XDB? What's the roadmap? Yeah, so we mentioned extending it beyond the relational model. So I want to support more, let's say, nowadays you would call them NoSQL databases.
Starting point is 00:53:06 There is a lot, there is more that needs to be done in the optimizer. So different assumptions that we make in this work, we want to kind of lift them. And yeah, so enhance optimizer and then yeah, so also with regard to data movement. So we want to see there how we can do things better because with this decentralized approach, we have, let's say, intensive system communications, let's say. The Mediator Upper, by the way, also has, but we have between different types of systems and things like that.
Starting point is 00:53:54 So we also want to look there how we can do this more efficient and more generalized. And I hope that in the future I can also tell you about those things. Yeah, for sure. I'll get you back on to tell us about those in the future for definite. The next question is, I think I asked all my guests. So how do you go about kind of generating these ideas and then selecting the ones to work? And what's your process behind coming up with all these really cool, interesting projects?
Starting point is 00:54:23 Yeah. process behind coming up with all these really cool interesting projects yeah um so i'm a lot on twitter i mean i just recently joined but i see what people are talking about there i see what um people are complaining about because i ideally i want my research to be practical i mean like um yeah people should benefit from it um soon uh because you know as researchers and you probably also know like you know we do some we come up with some problems and solutions. And, you know, then we hope that, you know, in 10 or 20 years, when someone stumbles across this problem, they will, you know, open the book of the wise people. And they will, you know, find this one approach, like, oh, this is how I'm going to implement it, which is happening nowadays. So I think I, so, so when talking to people building database systems, like, oh, yeah, of course, we use standard techniques. I mean, when you, you're going to build a database system, you're not going to invent your new join algorithm,
Starting point is 00:55:35 of course, you're gonna, you know, go back to what was out there. But yeah, so then this is also what I like about databases. So it's quite practical and it touches also many different domains, right? Everyone somehow interacts with the database. So, yeah, I see what problems people have. I try to talk a lot with people, like also students, bachelor, master students, because they can also, you know, bachelor, master students, because they can also, you know, provoke you and like question your research. Sometimes the silliest questions are the ones that, you know, make you think a step forward.
Starting point is 00:56:22 Yeah. And then, I don't know, walking. Yeah. And then, I don't know, walking. Yeah. That's a really nice answer to that question. So, yeah, penultimate question now. What do you think is the biggest challenge in database research right now yeah so um the biggest challenge in in our community um i would say is to um somehow fit ai into the picture so everyone is talking about ai and you know like with different um ways like oh machine learning or one thing or the other um and there is the fear in the community somehow that um
Starting point is 00:57:15 everything now is about machine learning and like pure um database research will somehow vanish. And I mean, if you look into the conferences, which you probably do, you'll see that a lot of work is about machine learning. And I think a big challenge, but I think we're tackling it the right way, is to see how we can actually leverage machine learning to our benefit and not let it destroy us or consume us. And we see there is a nice line of work, which is called, I mean, you have systems for ML, right? And then you have also ML for systems.
Starting point is 00:58:20 And this is a pretty interesting line of work. I think, yeah, we're still early to make conclusions. But I'm sure that, let's say, we can leverage machine learning for database research. And I'm really curious to see what the future will bring on this. We have multiple startups in the space, and we have identified several use cases where using machine learning is beneficial. But, yeah, I think we have a longer road in front of us. Yeah, it's a really interesting direction to see how that plays out.
Starting point is 00:59:06 Cool. It's time for the last word now. So what's the one takeaway you want listeners to take away from this episode today and from your work on XTB? Yeah, so with regard to XTB, and I mean, one can also generalize this. So we don't need a new query engine or a new, yeah. Yeah, we don't need a new query engine for everything. We have good engines out there.
Starting point is 00:59:40 We have good existing pieces of work out there we have good um existing um pieces of work out there and um we we just need to find the right ways to leverage them and um yeah sometimes um decentralizing things can be beneficial um so yeah i think from my side, that's it. Brilliant. Let's, let's end it there. Thank you so much, our lampers for coming on.
Starting point is 01:00:12 It's been a fantastic episode. If the listeners more is interested to know more about her lampers work, I'll put the links to everything in all the relevant materials in the show notes. And you can support the podcast through buy me a coffee. And we'll put a link to that in the show notes. And you can support the podcast through Buy Me a Coffee, and we'll put a link to that in the show notes as well. And we'll see you all next time for some more awesome computer science research. Thank you.
