Disseminate: The Computer Science Research Podcast - Haralampos Gavriilidis | Fast and Scalable Data Transfer across Data Systems | #62
Episode Date: June 16, 2025

In this episode of Disseminate, we welcome Harry Gavriilidis back to the podcast to explore his latest research on fast and scalable data transfer across systems, soon to be presented at SIGMOD 2025. Building on his work with XDB, Harry introduces XDBC, a novel data transfer framework designed to balance performance and generalizability. They dive into the challenges of moving data across heterogeneous environments, ranging from cloud systems to IoT devices, and critique the limitations of current generic methods like JDBC and specialized point-to-point connectors. Harry walks us through the architecture of XDBC, which modularizes the data transfer pipeline into configurable stages like reading, serialization, compression, and networking. The episode highlights how this architecture adapts to varying performance constraints and introduces a cost-based optimizer to automate tuning for different environments. We also touch on future directions, including dynamic reconfiguration, fault tolerance, and learning-based optimizations. If you're interested in systems, performance engineering, or database interoperability, this episode is a must-listen. Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast.
I'm your host, Jack Wardby.
We're back today with another cutting edge episode and we're going to be talking about
fast and scalable data transfer across data systems with Harry Gavriilidis.
Harry is a, well, three times now you've been on the podcast, Harry, right?
So welcome back for the third time.
We're really glad to have you back on the show again to tell us about your latest
research, which is due to be published at SIGMOD this year, right?
So yeah, welcome to the show again, Harry.
Yeah.
Thanks for having me again.
I hope you can cope with me the third time.
Cool.
Yeah.
A quick plug as well for Harry's other episodes.
We've got one on in-situ cross-database query processing, which was episode 27; go check that out. And he's also got an episode where we talk about his work on SheetReader in the context of the DuckDB research series, which hasn't actually been released yet. So depending on when you listen to this podcast, it may or may not be out, but hopefully it'll be out and in your feed somewhere. So yeah, go check that out, listener. Two really good podcast episodes, that's for sure.
So let's get on to the topic at hand today then, Harry. But before we do that, the usual: tell us about your story. How did you get interested in database management, and tell us about your journey to where you are today?
Yeah, so overall, my journey in database systems started when I did my bachelor's at TU Berlin. Back in the day, the lab was developing Apache Flink.
I guess it has become a big success story.
I joined the lab when the Apache Flink story was fading out and new things were starting
to happen.
I did my bachelor thesis back then at the lab and then I thought data management looks pretty cool.
I went on and did a master's there as well. And yeah, one thing led to the other, and here I am.
Now hopefully soon finishing the PhD.
Yeah, and I'm looking forward to talking to you today
about the data transfer topic.
That's one of the, yeah,
it's the last topic of the PhD actually.
Nice, last and best?
Yeah. I mean, you know how it goes when you advance in the PhD.
You get more experience.
This is true. Yeah, for sure.
Well, I read it and it was a very interesting read.
So I'm sure the listeners are going to like what we've got to say about it today as well.
So cool. Going off that then, let's set a bit more context for this chat and talking about data transfer.
So tell us about what data transfer looks like in the modern environment, in the modern data ecosystem.
Tell us why it's important and what are the various challenges with it today? Right, so yeah, this work also originated,
I mean, it's related also to the XDB system,
where we were dealing with federated query processing.
And yeah, overall, we see that there are many use cases
around where you have to transfer data
from one system to the other, right?
You have the typical ETL use cases
where data flows through systems, gets transformed,
and prepared for downstream tasks
or for storage in a data warehouse.
You have federated query processing
where you need to combine multiple data systems together
to answer analytics queries.
You have IoT use cases where you have also different nodes communicating with one another
and they also transfer data.
Overall, data transfer is crucial for data systems interoperability.
When you have multiple systems working together or when you need to load your data, as one
would say, from one system into the other, this is a crucial task.
What we realized when we were doing the XDB work is that a large portion of the query processing time went
actually to data transfer. Data transfer involves many operations and these are expensive. In this work, we try to see how to mitigate this. One other aspect is that data transfer
occurs in many different environments.
So you may have your systems on the cloud.
So when both systems are in the same cloud
and there is a high-speed network between them
and the nodes are super powerful,
you might have two systems that are located
on different clouds,
so the network between them is not the best.
Then you might have data scientists working
on their laptops with like a normal internet connection
connected to the cloud.
You may have IoT nodes that are, let's say, less powerful.
And overall, we have different environments
where data transfer occurs.
And then, yeah, you asked about the state of the art for data transfer. I categorize data transfer approaches into two main categories.
So on the one hand, we have what I call generic data transfer methods.
And there you would have something like JDBC. There you just plug and play your driver and you can talk to any system, and that's super fine because it's generic; it allows you to talk to many systems. But on the other hand, it does not perform very well.
And there was also a paper some time ago from CWI, actually the DuckDB folks, and they showed
that the JDBC protocols are actually not that good.
And so that's the one category, the generic data transfer.
And then we also have what I call specialized data transfer techniques.
And there we have, let's say, these point-to-point connectors
where I can implement very efficient connectors between pairs of systems.
And of course, the advantage there is that I can squeeze out performance and
it can be the best achievable performance. But this means that I also have to implement a new connector for every pair of systems. And with this work, XDBC, we want to strike a balance between both worlds. So we want to be generic, and we also want to be efficient.
Yeah.
Nice. Yeah. Just the first comment I'm going to make off the back of that: it's really nice to see how one piece of research has sort of led to another, with your work on XDB kind of lighting the light bulb and going, oh, well, this is a problem, we can do this. It's a good example of how research can unearth new things and new directions.
So that's really cool to see. But yeah, what I'm getting from you there is that with data transfer there's a lot of heterogeneity across these systems, and it's difficult to have this movement of data between them. Is a lot of that, do you think, maybe a function of systems not making it easy to let go of their data, because they want to lock the data in? Is there a business case for making it sticky to get your data out, because you don't want your customers to go away? I don't know if that plays into it at all as well.
Yeah, so that's a very good point. Also, we have seen the emergence of Apache Arrow, this columnar in-memory data format, or representation.
And the idea there is to standardize formats
across systems, which is nice also as an interchange format
because of its binary representation, the compact format, and the nice integrations.
But then, as you said, all systems internally most of the time use their own format.
And yes, on the one hand, I think you're right. It's because of the business case and you want your
data and your customers caged in your system. But on the other hand, I think it's also because of
system-specific optimizations, so to say. So each system is optimized for a different use case
and they may benefit on the processing side from their own internal format.
Of course, in a perfect world, everyone would have something like Apache Arrow internally and then
you could just transfer your data using Apache Arrow everywhere and you wouldn't need to transform
it. But I think we're still far away from that.
Yeah, yeah, cool. And just to recap quickly: the current state of the art is that you have these generic approaches like JDBC, which are kind of good because they're generalized, you can plug and play and stitch these things together, but they perform pretty badly. And then on the other end of the spectrum, you've got these really specialized things between two systems, system A and system B, which can move data between that pair of systems really fast and really efficiently, but that doesn't generalize. So XDBC is going to solve this problem for us and give us a nice generalizable framework, but also something that doesn't sacrifice performance.
So, yeah, let's talk about the architecture of XDBC then.
And what does this system that's going to give us these nice
properties actually look like?
Mm-hmm.
Yeah.
So, we looked at the data transfer pipeline, right?
And because we want to be generic, let's focus on the generic part here.
For example, in JDBC, or in the Python ecosystem, SQLAlchemy, or you name it, connectors: you have a very tightly coupled pipeline. There are many steps involved in data transfer, so you have to read your data
from the source, you need to deserialize it into an intermediate format, you need
to maybe compress it, you need to send it over the network, and then on the other
side you do the reverse. What we observed is that there is little configuration or optimization opportunity there. There are very few parameters that you can set, and all these steps are kind of tightly coupled together. So the main idea of XDBC is first to decompose this pipeline, and then to
be able to individually configure all these different steps of the pipeline to achieve
best performance in all environments. So the architecture, you can imagine it as
like different components that are interconnected with queues,
and they are individually configurable and scalable, let's say, in terms of parallelism.
When you want to add a new system, you implement only a few of those components. We have some
interfaces, and for example, to add a new source, you would implement the reader and the deserializer and then you can join the party and I'd say act as a source
for all the different systems that are already part of the XDBC framework.
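To make the plugin idea concrete, here is a minimal C++ sketch of what such component interfaces could look like. All names here (Buffer, Reader, Deserializer) are hypothetical illustrations, not the actual XDBC API:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical buffer handed between pipeline components.
struct Buffer {
    std::vector<uint8_t> bytes;
};

// Source-specific reader: pulls raw data (e.g. Postgres wire format,
// CSV bytes) into a buffer; returns false once the source is exhausted.
class Reader {
public:
    virtual ~Reader() = default;
    virtual bool fill(Buffer& out) = 0;
};

// Source-specific deserializer: converts raw source data into the
// framework's intermediate format (row-based or PAX-like columnar).
class Deserializer {
public:
    virtual ~Deserializer() = default;
    virtual Buffer toIntermediate(const Buffer& raw) = 0;
};

// Per the episode, implementing this pair for a new system is enough
// for it to act as a source for every target already in the framework.
```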
Nice, cool. So we've got these reader-writer interfaces, and the general idea is to decompose this whole process of moving data into these different logical components. If we went through the lifecycle or the flow of an actual request, like, hey, all of a sudden I want to get some data from system A to system B, what does that look like with XDBC? What's the flow of requests, and how does it all piece together?

Right, right. So, I mean, yeah, the flow you already mentioned.
So, let's have an example.
Actually, let's say we want to load the table from a database
into a Pandas data frame, right?
I think this is also the example that we have in the paper.
The idea there is that, to now load the database table into the Pandas data frame, you would initiate the transfer. On the Pandas side, we have the XDBC client. XDBC consists of a server that sits next to our source system
and the client that sits next to the target system.
The request of the framework that wants to load the data goes through the client,
and then the client communicates with the server with which data you want to have transferred and then the server
actually initiates the reader and the reader starts pulling data from the source system and
then these data are organized in buffers. So we fill those buffers, the buffers get into our system from the reader to the deserializer.
Their deserialization happens to some intermediate formats.
We support different intermediate formats.
Then you go to the compressor, data gets compressed.
You send the data over the network, it gets
into the client, we may decompress the data if it's compressed, then we serialize it into
the target format, and finally we load it, in this case, into the Pandas data frame. And yeah, so as I said, there are these many different steps involved.
And now, let's say the Pandas environment runs in a laptop, right?
So this means that maybe we don't have so much memory available or we don't have so
many cores available.
The cool part here is that we can configure a different number of threads for each of the different components.
We have seen in the experiments that, for example, over-provisioning leads to worse performance.
You have to actually find out the best parameter configuration.
And this is part of the optimizer that I will come to later. But also, for example, if now
the laptop is somewhere with a normal internet connection, we would probably enable compression.
But then, for example, if you use some service like Google
Colab or something that sits in the cloud next to your database you would
probably not enable compression. So this is the cool part. So you can configure
all those different steps individually in different environments and adapt the
parameters in such a way that you get the best performance possible.
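As a rough illustration of the tuning surface Harry is describing, here is a hypothetical per-transfer configuration in C++; every knob name and default below is invented for illustration, not taken from XDBC:

```cpp
#include <cstddef>
#include <string>

// Hypothetical configuration: one thread count per pipeline stage,
// plus buffer, format, and compression choices.
struct TransferConfig {
    int readThreads        = 1;
    int deserializeThreads = 2;
    int compressThreads    = 0;        // 0 = compression disabled
    int networkThreads     = 1;
    std::size_t bufferKB   = 256;      // size of each pooled buffer
    std::string format     = "pax";    // intermediate: "row" or "pax"
    std::string codec      = "none";   // e.g. "zstd", "lz4", "snappy"
};

// A laptop on a home connection might enable "zstd" with few threads,
// while two cloud VMs on a fast link would likely keep "none".
```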
Yeah, I like the granularity you have there with each component, and that you can change it to suit the heterogeneous environment that you happen to find yourself in, right? That feels great.
I have a question on the setup. When we're setting up, does the client spawn the server process, or do you have to manually set up both of these processes? Do you always co-locate them with the source and the target system, or can they be standalone? Do you have to stand those two things up before you can then send the client's request to the server?
Right, very good question. Yeah, so this is up to the implementation.
But yeah, we have both. So for example, when you want to load data from a Postgres database, right now we have a standalone XDBC server that actually sits next to the Postgres instance. But we have also done some work recently where we try to integrate it as an extension into Postgres, so that you can actually
just hit the Postgres database without any service running next to it.
Then for the clients, it's a bit different. For example, we have an Apache Spark source, and there we just use the data source API; it's just another source in the system. The main XDBC functionality is packaged as a shared library, and then you need to integrate it into your framework. So for example, when we want to load data into Postgres, we use
a foreign data wrapper that allows you to read data from external sources. Then for example,
for Apache Spark, we use the data source API as mentioned, and there we wrap the shared library with JNI,
because XDBC is implemented in C++.
So you need the bridge between JVM and native code.
And then for Pandas, we use the CPython integration, or pybind11, where you can also integrate native code with Python code.
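For readers unfamiliar with how native code gets exposed to Python, here is a minimal pybind11 sketch of the kind of binding being described; the module name, function, and signature are hypothetical, not the actual XDBC client API:

```cpp
#include <pybind11/pybind11.h>
#include <string>

namespace py = pybind11;

// Hypothetical native entry point; the real XDBC client API may differ.
int transfer(const std::string& source, const std::string& table) {
    // ... drive the native client pipeline here ...
    return 0;
}

// Expose the C++ function to Python. From Python one could then call:
//   import xdbc_client
//   xdbc_client.transfer("postgres://host/db", "lineitem")
PYBIND11_MODULE(xdbc_client, m) {
    m.def("transfer", &transfer, "Start a transfer into this process",
          py::arg("source"), py::arg("table"));
}
```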
Nice, cool. Going off the components, there are a few other ingredients in the runtime of XDBC that hold everything together.
You touched on the memory management a second ago, but if you want to go into that bit more, then please do. But there are other aspects of it around parallelism and back pressure.
And maybe you can tell us about those runtime components as well.
Yeah.
So, as I mentioned, we have these different components that communicate
through queues, and the memory is also configurable. By memory, I mean the
memory that you want your server to use
or your client to use.
For example, if you're on the cloud, you may have available a lot of memory.
But if you're on a resource-constrained device, you might not want to use a lot of memory.
And in our experiments, we actually also saw that with very little memory
in the order of megabytes, you can also efficiently transfer gigabytes of data.
But yeah, so we have a memory pool and we divide it into logical buffers.
And then, the pipeline starts at the reader, on the server side, right? And we actually get the foreign data that sits, in our example, in Postgres, and we read it into our buffers. So then we have Postgres data in our buffers.
So we then need to send it to the deserializer, but we actually don't send the whole buffers.
We just send pointers to the data and the components communicate through queues.
So we actually queue, if you want, the buffer IDs, and all the queues also have a maximum capacity.
We enqueue the elements, the buffer IDs into the queues,
and this way all the different threads of one component
can actually, whenever they are done,
they get an element from their input queue, do the work, and send it to the next component.
This means that at any one time, a thread of a component occupies a buffer, the one it gets from the upstream task, and then it gets a free one, does the work, puts the result in there, and sends it to the downstream component.
Now, of course, in a queuing system like this,
the throughput is bound by the slowest component.
By setting this maximum capacity at the queues, we ensure that the back pressure is handled.
So whenever some component is a bottleneck, it will back-propagate, and through these capacity constraints on the queues,
we will not overflow the system. So in the end, the reader will just stop reading new buffers
and putting them into the system.
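The back-pressure mechanism described here is essentially a bounded blocking queue. A minimal C++ sketch, assuming buffer IDs are plain integers (an illustrative simplification, not the XDBC implementation):

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// Bounded queue of buffer IDs: a full queue blocks the producer, so a
// slow stage back-propagates pressure all the way to the reader.
class BoundedQueue {
    std::queue<int> q_;            // buffer IDs, not the buffers themselves
    const std::size_t capacity_;
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(int bufferId) {
        std::unique_lock<std::mutex> lk(m_);
        // Block while the queue is full, i.e. downstream is too slow.
        notFull_.wait(lk, [&] { return q_.size() < capacity_; });
        q_.push(bufferId);
        notEmpty_.notify_one();
    }

    int pop() {
        std::unique_lock<std::mutex> lk(m_);
        notEmpty_.wait(lk, [&] { return !q_.empty(); });
        int id = q_.front();
        q_.pop();
        notFull_.notify_one();     // wake a blocked producer
        return id;
    }
};
```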
Nice, you have this inherent back pressure,
which is really nice, right?
Because you don't want these things overflowing.
Cool, yeah, so we've been talking about the components in sort of the abstract sense: deserialize, serialize, et cetera. Let's talk about the actual physical implementations you have of those; I guess we can think of them almost like operators in a database, in a way, right? So, yeah, tell us about the physical implementations you have of the various components in this pipeline.
Yeah, right. As you mentioned, in a way the architecture is like a data flow system. We have the logical components that we have talked about, the reader, the deserializer, etc.
And we then have some physical implementations of them.
For example, the reader and the serializer physical implementations are up to the developers
who want to connect their systems.
For example, you could have a physical reader that reads data from a Postgres database or from a CSV file or from a ClickHouse database or whatever system you want. And then the deserializer.
Here we need to transform the data from the source format into an intermediate format.
And here we can choose the intermediate format and we provide our own row format.
I mean, row and columnar formats.
This is, yeah, I mean, is columnar Pax-like. Essentially, this is like Arrow,
where we organize the data in record batches, if I would use the Arrow terminology. You
can also bring your own intermediate format.
This means you would then implement a new deserializer that takes Postgres data, for example, or whatever data you have, and deserializes it into a different intermediate format.
Then we have the different compressors. Right now, we have support for some standard compression libraries. So the physical implementations would be for ZSTD, LZ4, LZO, and Snappy.
But because we operate on a buffer at a time, you can very easily just bring your own compression
algorithm and plug it in as a physical compression operator.
And then for the network, right now we use TCP transfers.
So right now we leverage the Boost ASIO library for networking.
But again, if you want, you can bring your own RDMA library or UDP or whatever.
So then you would need to implement pairs of senders and receivers.
And then on the other side, it's just the reverse, right?
So you would need the decompressors. I mean, this goes again hand in hand with the compressors, and on the other hand, the serializers and the writers. These are also system specific.
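As an illustration of how a pluggable physical compressor could look, here is a C++ sketch of a per-buffer compressor interface with a zstd-backed implementation; the interface names are hypothetical, though the ZSTD_* calls are the real zstd API (error handling omitted for brevity):

```cpp
#include <cstdint>
#include <vector>
#include <zstd.h>   // link with -lzstd

// Hypothetical per-buffer compressor interface; because the pipeline
// works one buffer at a time, a new algorithm implements just this pair.
class Compressor {
public:
    virtual ~Compressor() = default;
    virtual std::vector<uint8_t> compress(const std::vector<uint8_t>& in) = 0;
    virtual std::vector<uint8_t> decompress(const std::vector<uint8_t>& in) = 0;
};

// Example physical implementation backed by the zstd library.
class ZstdCompressor : public Compressor {
public:
    std::vector<uint8_t> compress(const std::vector<uint8_t>& in) override {
        std::vector<uint8_t> out(ZSTD_compressBound(in.size()));
        std::size_t n = ZSTD_compress(out.data(), out.size(),
                                      in.data(), in.size(), /*level=*/3);
        out.resize(n);   // shrink to the actual compressed size
        return out;
    }
    std::vector<uint8_t> decompress(const std::vector<uint8_t>& in) override {
        // zstd frames record the original size in their header.
        unsigned long long raw = ZSTD_getFrameContentSize(in.data(), in.size());
        std::vector<uint8_t> out(static_cast<std::size_t>(raw));
        ZSTD_decompress(out.data(), out.size(), in.data(), in.size());
        return out;
    }
};
```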
Nice, cool. Given that we've got this smorgasbord of various components, the really interesting part, I think the really appealing part of this whole thing, is the optimization: how do you pick, given a certain environment and systems and whatnot, what is the best setup to use? So yeah, tell us about the optimization algorithm you have to pick all these different components, these different operators, and piece them all together to be the most efficient.
Yeah, so yeah, up to now we talked about the general transfer framework, right?
And that we now have all those nice options to configure,
so we can configure the different threads of the component.
We can pick different intermediate formats.
We can pick different buffer sizes.
We can pick different compression algorithms.
And of course, as you said, we can configure it now,
but who knows what to pick in each environment.
And to show the impact of different configurations,
we ran some experiments where we simulated different environments and just varied the number of threads that we use, and we saw that actually
there might be an order of magnitude performance difference. So if you pick a very bad configuration,
instead of a just okay one, you are way off
and you don't want this to happen.
So how do we go about this?
To solve this, we have an optimizer.
So the optimizer gets as input what we call the base throughput rates of each component.
This we obtain through profiling.
You can imagine either you do one run in the beginning or this can even be...
One can obtain this information also on the fly, like just in the beginning of a transfer.
So you want to know how fast each component processes a buffer.
And given this information, now we have our base throughputs.
And we also have a number of threads that we can give to our server and client overall.
The optimizer is pretty straightforward in that sense that we go to the slowest component
and we just keep adding threads until it's not the slowest anymore.
We do this in an iterative fashion until we are out of threads.
So for example, we start and we see that the deserializer is the slowest component. So we just add threads to it, and we estimate the throughput that it would get with different threads using Gustafson's law.
So, of course, you don't have linear scalability when you just add threads to a task.
So, I mean, there is some estimation work there.
But the idea is that we go to the slowest component, for example, the deserializer.
We say, okay, with one thread, you achieve X throughput, now take two threads.
And now we reevaluate and we see, okay, now the network is the bottleneck.
Okay, now let's add compression.
Now we have compression, can we do faster?
And yeah, what I forgot to say is that
we also have some bounds, right? For example, you have your network limit. So even if you try very hard, you cannot be faster than the network. So we optimize the number of workers for each component until we run out of available threads or we have reached the upper limit, let's say.
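A minimal sketch of the greedy heuristic as described: profile per-stage base rates, then repeatedly give a thread to the current bottleneck, estimating scaling with a Gustafson-style model. The stage names, base rates, serial fraction, and thread budget below are all made up for illustration:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Stage { const char* name; double baseRate; int threads = 1; };

// Gustafson-style scaling estimate: speedup(n) = n - alpha*(n-1),
// where alpha is the stage's (assumed) serial fraction.
double rate(const Stage& s, double alpha = 0.1) {
    return s.baseRate * (s.threads - alpha * (s.threads - 1));
}

int main() {
    std::vector<Stage> stages = {          // MB/s per thread (made up)
        {"read", 400}, {"deserialize", 150}, {"compress", 250}, {"network", 300}};
    int budget = 16 - (int)stages.size();  // threads left after 1 each

    while (budget-- > 0) {
        // Find the current bottleneck and relieve it with one thread.
        auto slowest = std::min_element(stages.begin(), stages.end(),
            [](const Stage& a, const Stage& b) { return rate(a) < rate(b); });
        slowest->threads++;
    }
    for (const auto& s : stages)
        std::printf("%-12s threads=%d est. rate=%.0f MB/s\n",
                    s.name, s.threads, rate(s));
    // End-to-end throughput is bounded by the slowest stage (and by
    // hard limits like network bandwidth, omitted here).
}
```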
Sorry, I'll continue.
I just wanted to say, and we might talk about this later as well, that in the evaluation we saw that, depending on the dataset, different compression libraries work differently. Or depending on the available number of threads that you have in your system, you may want to use different compression libraries.
So we saw that it actually makes sense to configure these components individually.
Nice.
Yeah, I was just going to say eventually you reach the bounds of the laws of
physics, right? Essentially, you can't go any faster. It's physically impossible to.
Yeah, on this, and maybe I'm getting into the evaluation, but how quickly can you converge to the optimal strategy for the transfer? Like, if it's a really short transfer, I guess it might take time to reach the optimal setup. Is there a size of transfer before it pays off? And my other sub-question was: is there memory over time, of being like, oh, this transfer looks similar to one I did before on this same sort of hardware environment, therefore I can jump nearer to the optimal quicker?
Yeah, very good question.
So in the first version of XDBC,
I mean, this is already a couple of months old,
but it's an ongoing project.
But in the first version,
we would
just have a profile for a particular data transfer, a particular environment,
and the heuristic optimizer would output one configuration and this would be it.
So, of course, when you would run the transfer again, you would get your new profiles and adjust, but this we didn't explore
so much.
Now, we have an ongoing project where we actually try to learn, let's say, from past transfers
or we try to build.
It's very cool work actually, but I don't have the full results yet, but the idea there is that we can categorize different environments
and we can sample some configurations so that we see which configurations run well in different environments
so that we have a good starting point. Because we do all these micro benchmarks,
and the idea is to build a model
that we can already embed into the connector
so that it has a good starting point on the one hand,
so that if you identify in which environment you are,
you already go with a good transfer,
but it
also will learn on the fly.
And we have seen that, so after a couple of iterations, you already converge close enough
to the best possible throughput.
But this is not part of the original paper.
Yeah. Oh, sure. And I guess following on from that, the big acronym that's lurking in the background: AI and machine learning. When are we going to get an ML-based optimizer for XDBC? I guess we're kind of getting there, with it learning over time. But yeah, I guess there's a whole host of techniques from ML-based query optimization and all that sort of stuff that can maybe be applied to this setting as well, maybe.
Yeah, yeah. So, I mean, I don't want to spoil, but this is ongoing work. And actually, we have
applied some ML techniques, but also more traditional optimization techniques. And to be honest, traditional optimization techniques also don't perform that badly.
Yeah.
That's kind of consistent with some other research I've seen and people I've spoken to: a lot of the time these traditional, less fancy techniques can get you like 80% of the way there, or good enough 90% of the time or whatever, which is that complexity versus simplicity and understandability trade-off as well. And yeah, so that's interesting that that may apply to this setting as well. What's it like, I mean, obviously this has been a research
project, but from like the user experience, what is the tooling around monitoring how things are happening?
Can I visualize how long my transfer might take given the way things are at the moment?
Is that sort of information exposed in any way?
Yeah.
So, this is also not part of the original work, but it's ongoing work; things move fast. Here we also want to be able to change the parameters during runtime. So, in the first version, I said we have the heuristic optimizer. We give it one configuration,
the transfer runs, and then maybe in the next transfer you can re-optimize. But what we're
working on now is actually dynamic reconfiguration, so that during runtime you can scale the different
amounts of threads for each component and stuff like that. And therefore we expose this, because the optimizer actually needs this information. I mean, as you mentioned, you as a user might also want to see what the bottleneck of your current data transfer is. But this is also important for the optimizer, right? It needs this information to make a better decision. So yeah, we have an endpoint where you can both set the different parameters,
but also get the current runtime information.
Yeah.
Nice. That's really cool.
Let's talk about some numbers and an evaluation and put some numbers to this discussion.
So yeah, first of all, set us up.
What was the experimental setup like?
What were the benchmarks you ran? And yeah, tell us the results.
Yeah. So first we wanted to see, end-to-end: how do we compare against the different state-of-the-art tools for transferring data? Can we reach the performance of specialized connectors?
And do we also outperform the generic solutions out there?
So this was our goal.
And to also have different environments, we used a setup with Docker, and Docker has some really nice properties where you can set the number of CPUs that you want the container to have, and you can also change the network characteristics, with a tc-like utility.
So this is the setup.
So we spawn our systems in Docker containers,
we configure different CPU and network characteristics, and we run our
benchmarks. And as I said, we wanted to first see end-to-end how does the
performance look like, and therefore we picked different use cases.
For example, from the data science domain, one would want to load the Postgres table
into a Pandas data frame.
And here we also tried out different data sets.
And an important baseline here was ConnectorX.
This is also a recent VLDB paper. We have this as a specialized connector because this one is really optimized
for transferring your database tables into Pandas data frames.
There we saw that we are actually on par with ConnectorX,
sometimes a little bit faster, sometimes a little bit slower,
but overall we're in the same ballpark.
Then we also have DuckDB, because it has specialized, let's say, database readers, and in this case we use its Postgres reader. And here we are, I would say, around three to four times faster
when loading a Postgres table into a data frame.
Then we also used these generic connectors: Modin has one, which can also parallelize, and then we also have the turbodbc connector. And there we are, I mean, around maybe 10 times faster. But again, these are generic connectors, right, so this is expected behavior. Then maybe I can also talk a bit about when we want to transfer a Postgres
table to Spark. So here we also use the standard JDBC driver of Postgres in Spark.
And here we also see, for the lineitem dataset, for example, that XDBC is around five times faster than the default JDBC connection.
Then I also have another interesting experiment
where we transfer a Postgres table to a Postgres table.
So here we also have as baselines JDBC.
So Postgres allows you to have these nice foreign data wrappers, right?
And we talked about this also in the XDB podcast.
So the idea there is that you can actually access data that's not inside your Postgres database,
but living somewhere else. And one way to implement a foreign data wrapper is with a JDBC connector.
And there, for the lineitem dataset, we had a timeout after 20 minutes, so the transfer couldn't finish. But for the others, we saw something like an order of magnitude difference.
And the interesting part here is that Postgres also has its own native foreign data wrappers.
So these are kind of built to transfer Postgres data between Postgres instances.
And here we actually performed quite a bit better.
So this is also like two to five times better,
which is pretty interesting because yeah,
these Postgres foreign data wrappers are really optimized
for Postgres instances.
Yeah, yeah, yeah.
Yeah.
So these were some numbers from the end-to-end experiments.
And then we also have different, let's say, microbenchmarks.
For example, we wanted to see what is the bottleneck in different pipelines.
So, for example, we have this microbenchmark where we transfer a CSV file to a CSV file. And there, if we increase the read parallelism, it doesn't help, because the read is not the bottleneck; it's the deserialization. So there we see, okay, we might need to increase the parallelism degree of the deserializer.
But then when we did exactly the same experiment, but we read from Postgres, we saw, okay, actually
the read is the bottleneck. So we should increase the read parallelism. And this is again to
motivate a little bit why we need an optimizer and why each environment is
different. And then I think I just want to share different some for the formats because there is a conception that columnar format is always better.
You should always go columnar. This is not true actually.
So we saw that of course when you compress sometimes,
I mean because the value domain is more constrained in each column, this may help,
but it's not the case for all datasets. Here we saw that columnar is not always beneficial
as an intermediate format. Also, with regard to compression, what is interesting is that the compression library you should choose also depends on the number of threads you have. There's compression speed, compression ratio, and decompression speed.
You always trade off those three things.
We saw that, for example, in very high-speed networks,
compression doesn't help if you don't put enough workers for the compression and decompression.
And then for a low amount of parallelism, for example, if you are on a constrained device
and you can only allocate a few threads for compression, you might want to choose a different compression algorithm
that will give you better end-to-end performance.
Even though you may transfer more data because your compression and decompression is slower, you might want to choose a particular compression mechanism; but then this also changes as you increase the threads you allocate to the compressor and decompressor.
Yeah, and maybe one last thing. What we also try to do is to evaluate our cost model.
So we compared the estimated throughputs with actual throughputs that our optimizer provided.
They're not perfect, but they work out. They're not too far away from reality, the estimated ones.
Most of the time, we see that it's overestimated, which is not a problem because this means
that we would allocate a bit more threads.
We wouldn't have worse performance than expected.
In that sense, the cost model works fine right now.
But as I said, we also have ongoing work on the optimizer, so we hope to be better there as well.
Yeah, to close that gap over time.
Yeah, there are some fantastic insights there as well, and I think if we take a holistic view of all of the results, the idea behind XDBC is clearly validated here. This thing definitely improves on what else we've got out there in the state of the art.
I guess the natural question that follows in this discussion
is, we've probably touched on quite a lot of bits of where
you go next, what's in the pipeline with XDBC coming up.
One topic I wanted to ask you about, and I don't know where this fits into XDBC, is fault tolerance: what happens when data transfers go wrong, like things crash, or the network's a little bit flaky? How do we handle faults at the moment? Because from personal experience with large data transfers, the longer they run, the more chance something's going to go wrong. So how do you handle things going wrong, essentially?
Yeah.
Good question.
But I don't have a good answer. So, this work was more focused on: okay, do we benefit from decomposing the data transfer pipeline? Do we benefit from different configuration knobs?
But of course, we have not addressed different things like, for example, fault tolerance.
So, on the network side, we rely on TCP. So, if something gets lost in the network,
we hope that the underlying TCP library will take care of that.
But yeah, there could be more sophisticated things on the fault tolerance side.
So, for example, having different checkpoints or being able to...
Because right now, all this is in memory, right?
Yes.
And with crashes, forget about it.
But we could have something like a write ahead log where you could say how much data actually you read from the source and how much of it was written in the target system so that you can just replay from a particular point in time and not just restart everything.
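Purely as an illustration of that checkpointing idea (future work, not something XDBC implements), here is a tiny C++ sketch of persisting transfer progress so a restart could resume rather than replay from scratch; all names and the file layout are invented:

```cpp
#include <cstdint>
#include <cstdio>

// Progress record: how far the reader and writer have gotten, so a
// failed transfer can resume from the last durable offsets.
struct TransferCheckpoint {
    uint64_t sourceRowsRead    = 0;
    uint64_t targetRowsWritten = 0;
};

void persist(const TransferCheckpoint& cp) {
    // In a real system this append would be fsync'd (write-ahead style)
    // and use a portable on-disk encoding rather than a raw struct.
    std::FILE* f = std::fopen("transfer.wal", "ab");
    if (f) { std::fwrite(&cp, sizeof cp, 1, f); std::fclose(f); }
}

int main() {
    TransferCheckpoint cp{100000, 98304};
    persist(cp);  // on restart, replay the source from sourceRowsRead
}
```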
The next set of questions is obviously around impacts and surprises. So, kind of jumping off with the impact to start with.
If we've got our crystal ball here, what sort of future impact do you think XDBC can have, or maybe has already had? What's the long-term vision for it, maybe over a longer time horizon? Like, what impact do you think it can have?
Yeah, I think on the one hand, it would be interesting to see if existing data transfer tools, like standard JDBC, could expose more configuration knobs. Because right now, for example, in JDBC, you can just configure, what is it, the fetch size, or how many tuples you fetch from the database.
That's it. I could see
different systems exposing more configuration knobs on the one hand.
I mean, optimizer in production is always a tricky thing,
so people don't want too many things to happen automatically. Of course, in a database,
you want the query optimizer to pick the best plan for you.
But these machine learning approaches, yeah, I'm not sure how much people want to use them in production, because we're talking about the real world now, outside the research paper. But yeah, so exposing more configuration knobs for data transfer techniques would be one thing. And then, there are few interoperability mechanisms between systems, so maybe this work could spark some more initiatives. I mean, obviously Apache Arrow is one such thing, but this is more like an intermediate format. Of course, there are also things built around it; you know, there is the Flight server, which also kind of provides similar functionality here. But this is also not really decomposable and configurable; you always need to go through Arrow. It would be interesting to see how different systems may want to work together. That would be cool. Of course, with the XDBC project we will try to provide
different connectors for different systems so that you can play around
and transfer your data fast. But again, right now we're just a little team
and this can be seen as something you could use as a prototype. But of course, I mean, to go further, this would
need more serious work.
Yeah, yeah, cool. I guess on that note, it is kind of an open source project people can go and play with, right? If they so choose.

Yeah. As we talk, the project is not...
I think during SIGMOD we will have an open source version that people can play around with.
Right now we have code for the experiments, of course, because this was part of the...
When you submit the research paper, you always need to be reproducible.
But it's a different thing to expose a framework that's usable.
But we have made good progress on that in the last months.
And I think that during SIGMOD, we will be open source.
Nice.
Ready for the big reveal.
Cool.
Yeah.
So I guess we'll just go for two more questions here. And the first is surprises: what caught you off guard whilst working on XDBC? What was the most interesting thing that you did not expect?
Yeah, so I mean, what was interesting for me was that I hadn't done too much serious work with C++ before. So for me, it was also an adventure.
Yeah, getting to know the low-level details.
And it's crazy how a branch can cut your performance. It's so crazy.
But then what was more interesting was that for different systems and different environments, you need so many different configurations. This I didn't expect.
Yeah.
Yeah, that's cool.
Yeah, there's a lot of diversity and heterogeneity out there in the wild, right?
Yeah.
Yeah.
Cool.
Yeah.
So the penultimate one is something that I'm trying out, Harry, so if nothing comes to mind then don't worry about it. I don't know whether we did this for SheetReader, I think we did, but in a DuckDB context: it's about giving some sort of recommendation, anything tech related or computer science related, be it a blog post, paper, or tool that you've encountered recently that made you think, oh, that was fun, I enjoyed that, that was cool, go check it out?
Yeah, I think it may sound funny, but CLion.
So we often don't appreciate the tooling that we get so much.
So yeah, it's very nice because we had very different environment setups; we integrated Python with C++, and then also Java and JNI, and we have different containers running here and there. And it's amazing to see how these tools, in particular the JetBrains family that I use, how well they work, what amazing solutions they provide, and how much they make our life easier.
Yeah, we take it for granted a little bit, right?
And also, CLion is a great play on words. It's a good name, so yeah, it has that going for it as well.
Yeah, cool.
I guess with that then, let's move on to the last word then.
So what's the one thing you'd like the listener to get from this podcast episode today?
Yeah, so, just recapping: I think data transfer is crucial in today's applications. And we have really focused on making performance better within each individual data system, right?
But then often we forget about how important integration is, and in particular the interoperability between systems. So I think a lot of performance and resources get wasted on poor interoperability mechanisms. And yeah, I think XDBC is a step towards making interoperability mechanisms better, but of course there is more to come.
And we look forward to that. Thanks again, Harry, for coming on the show. It's always a pleasure to have you on. Oh yeah, and we'll put links to everything in the show notes as well, listener, so if you want to check out anything we've mentioned today, you can go and find it there.

Yeah, and I would also like to mention that this is work in collaboration with my colleagues
Kaustubh Beedkar from IIT Delhi, and Matthias Böhm at the BIFOLD institute at TU Berlin, and of course Volker Markl, also from BIFOLD and TU Berlin.
Cool, great stuff. And yeah, I guess we'll see you all next time for some more awesome computer science research.
Yeah, thanks for having me Jack.