Disseminate: The Computer Science Research Podcast - Haralampos Gavriilidis | Fast and Scalable Data Transfer across Data Systems | #62
Episode Date: June 16, 2025

In this episode of Disseminate, we welcome Harry Gavriilidis back to the podcast to explore his latest research on fast and scalable data transfer across systems, soon to be presented at SIGMOD 2025. Building on his work with XDB, Harry introduces XDBC, a novel data transfer framework designed to balance performance and generalizability. They dive into the challenges of moving data across heterogeneous environments, ranging from cloud systems to IoT devices, and critique the limitations of current generic methods like JDBC and specialized point-to-point connectors. Harry walks us through the architecture of XDBC, which modularizes the data transfer pipeline into configurable stages like reading, serialization, compression, and networking. The episode highlights how this architecture adapts to varying performance constraints and introduces a cost-based optimizer to automate tuning for different environments. We also touch on future directions, including dynamic reconfiguration, fault tolerance, and learning-based optimizations. If you're interested in systems, performance engineering, or database interoperability, this episode is a must-listen. Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast.
I'm your host, Jack Wardby.
We're back today with another cutting edge episode and we're going to be talking about
fast and scalable data transfer across data systems with Harry Gavriilidis.
Harry is a, well, three times now you've been on the podcast, Harry, right?
So welcome back for the third time.
We're really glad to have you back on the show again to tell us about your latest
research, which is due to be published at SIGMOD this year, right?
So yeah, welcome to the show again, Harry.
Yeah.
Thanks for having me again.
I hope you can cope with me the third time.
Cool.
Yeah.
A quick plug as well for Harry's other episodes.
We've got one on in-situ cross-database query processing, which was episode 27; go check that out. And he's also got an episode where we talk about his work on SheetReader in the context of the DuckDB research series, which hasn't actually been released yet. So depending on when you listen to this podcast, it may or may not be out, but hopefully it'll be out and in your feed somewhere. So yeah, go check that out, listener. Two really good podcast episodes, that's for sure.
So let's get on to the topic at hand today then, Harry. But before we do that, the usual: tell us about your story. How did you get interested in database management, and tell us about your journey to where you are today?
Yeah, so overall, my journey in database systems started when I did my bachelor's at TU Berlin. Back in the day, the lab was developing Apache Flink.
I guess it has become a big success story.
I joined the lab when the Apache Flink story was fading out and new things were starting
to happen.
I did my bachelor thesis back then at the lab and then I thought data management looks pretty cool.
I went on and did a master's there as well. And yeah, one thing led to the other, and here I am.
Now hopefully soon finishing the PhD.
Yeah, and I'm looking forward to talking to you today
about the data transfer topic.
That's one of the, yeah,
it's the last topic of the PhD actually.
Nice, last and best?
Yeah. I mean, you know how it goes when you advance in the PhD.
You get more experience.
This is true. Yeah, for sure.
Well, I read it and it was a very interesting read.
So I'm sure the listeners are going to like what we've got to say about it today as well.
So cool. Going off that then, let's set a bit more context for this chat and talking about data transfer.
So tell us about what data transfer looks like in the modern environment, in the modern data ecosystem.
Tell us why it's important and what are the various challenges with it today? Right, so yeah, this work also originated,
I mean, it's related also to the XDB system,
where we were dealing with federated query processing.
And yeah, overall, we see that there are many use cases
around where you have to transfer data
from one system to the other, right?
You have the typical ETL use cases
where data flows through systems, gets transformed,
and prepared for downstream tasks
or for storage in a data warehouse.
You have federated query processing
where you need to combine multiple data systems together
to answer analytics queries.
You have IoT use cases where you have also different nodes communicating with one another
and they also transfer data.
Overall, data transfer is crucial for data systems interoperability.
When you have multiple systems working together or when you need to load your data, as one
would say, from one system into the other, this is a crucial task.
What we realized when we were doing the XDB work is that a large portion of the query processing time went
actually to data transfer. Data transfer involves many operations and these are expensive. In this work, we try to see how to mitigate this. One other aspect is that data transfer
occurs in many different environments.
So you may have your systems on the cloud.
So when both systems are in the same cloud
and there is a high-speed network between them
and the nodes are super powerful,
you might have two systems that are located
on different clouds,
so the network between them is not the best.
Then you might have data scientists working
on their laptops with like a normal internet connection
connected to the cloud.
You may have IoT nodes that are, let's say, less powerful.
And overall, we have different environments
where data transfer occurs.
And then, yeah, you asked about the state of the art for data transfer. I categorize data transfer approaches into two main categories.
So on the one hand, we have what I call generic data transfer methods.
And there you would have something like JDBC. There you just plug and play your driver and you can talk to any system, and that's super fine because it's generic; it allows you to talk to many systems. But on the other hand, it does not perform very well.
And there was also a paper some time ago from CWI, actually the DuckDB folks, and they showed
that the JDBC protocols are actually not that good.
And so that's the one category, the generic data transfer.
And then we also have what I call specialized data transfer techniques.
And there we have, let's say, these point-to-point connectors
where I can implement very efficient connectors between pairs of systems.
And of course, the advantage there is that I can squeeze out performance and
it can be the best achievable performance. But this means that I also have to implement a new connector for every pair of systems. And with this work, XDBC, we want to strike a balance between both worlds. So we want to be generic, and we also want to be efficient.
Yeah.
Nice. Yeah. Just the first comment I'm going to make off the back of that: it's really nice to see how one piece of research has sort of led to another, with your work on XDB kind of lighting the light bulb and going, oh, well, this is a problem, we can do this. It's a good example of how research can unearth new things and new directions.
So that's really cool to see. But yeah, what I'm getting from you there is that with data transfer there's a lot of heterogeneity across these systems, and it's difficult to have this movement of data between them. Is a lot of that, do you think, maybe a function of systems not making it easy to let go of their data, because they want to lock the data in? Is there a business case for making it sticky to get your data out, because you don't want your customers to go away? I don't know if that plays into it at all as well.
Yeah, so that's a very good point. Also, we have seen the emergence of Apache Arrow, this columnar in-memory data format, or representation.
And the idea there is to standardize formats
across systems, which is nice also as an interchange format
because of its binary representation, the compact format, and the nice integrations.
But then, as you said, all systems internally most of the time use their own format.
And yes, on the one hand, I think you're right. It's because of the business case and you want your
data and your customers caged in your system. But on the other hand, I think it's also because of
system-specific optimizations, so to say. So each system is optimized for a different use case
and they may benefit on the processing side from their own internal format.
Of course, in a perfect world, everyone would have something like Apache Arrow internally and then
you could just transfer your data using Apache Arrow everywhere and you wouldn't need to transform
it. But I think we're still far away from that.
Yeah, yeah, cool. And just to recap quickly: the current state of the art is that you have these generic approaches like JDBC, which are kind of good because they're generalized, you can plug and play and stitch these things together, but they perform pretty badly. And then on the other end of the spectrum, you've got these really specialized things between two systems, system A and system B, which can move data between that pair of systems really fast and really efficiently, but that doesn't generalize. So XDBC is going to solve this problem for us and give us a nice generalizable framework, but also something that doesn't sacrifice performance.
So, yeah, let's talk about the architecture of XDBC then.
And what does this system that's going to give us these nice
properties actually look like?
Mm-hmm.
Yeah.
So, we looked at the data transfer pipeline, right?
And because we want to be generic, let's focus on the generic part here.
For example, in JDBC, or in the Python ecosystem, SQLAlchemy, or you name it, connectors: you have a very tightly coupled pipeline. There are many steps involved in data transfer, so you have to read your data
from the source, you need to deserialize it into an intermediate format, you need
to maybe compress it, you need to send it over the network, and then on the other
side you do the reverse. What we observed is that there is little configuration or optimization opportunity there. There are very few parameters that you can set, and all these steps are kind of tightly coupled together. So the main idea of XDBC is first to decompose this pipeline, and then to
be able to individually configure all these different steps of the pipeline to achieve
best performance in all environments. So the architecture, you can imagine it as
like different components that are interconnected with queues,
and they are individually configurable and scalable, let's say, in terms of parallelism.
When you want to add a new system, you implement only a few of those components. We have some
interfaces, and for example, to add a new source, you would implement the reader and the deserializer and then you can join the party and I'd say act as a source
for all the different systems that are already part of the XDBC framework.
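To make the plugin idea concrete, here is a minimal C++ sketch of what such component interfaces could look like. All names here (Buffer, Reader, Deserializer) are hypothetical illustrations, not the actual XDBC API:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical buffer handed between pipeline components.
struct Buffer {
    std::vector<uint8_t> bytes;
};

// Source-specific reader: pulls raw data (e.g. Postgres wire format,
// CSV bytes) into a buffer; returns false once the source is exhausted.
class Reader {
public:
    virtual ~Reader() = default;
    virtual bool fill(Buffer& out) = 0;
};

// Source-specific deserializer: converts raw source data into the
// framework's intermediate format (row-based or PAX-like columnar).
class Deserializer {
public:
    virtual ~Deserializer() = default;
    virtual Buffer toIntermediate(const Buffer& raw) = 0;
};

// Per the episode, implementing this pair for a new system is enough
// for it to act as a source for every target already in the framework.
```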
Nice, cool. So we've got these reader-writer interfaces, and the general idea is to decompose this whole process of moving data into these different logical components. If we went through the lifecycle or the flow of an actual request, like, hey, all of a sudden I want to get some data from system A to system B, what does that look like with XDBC? What's the flow of requests, and how does it all piece together?

Right, right. So, I mean, yeah, the flow you already mentioned.
So, let's have an example.
Actually, let's say we want to load the table from a database
into a Pandas data frame, right?
I think this is also the example that we have in the paper.
The idea there is that, to now load the database table into the Pandas data frame, you would initiate the transfer. On the Pandas side, we have the XDBC client. XDBC consists of a server that sits next to our source system
and the client that sits next to the target system.
The request of the framework that wants to load the data goes through the client,
and then the client communicates with the server with which data you want to have transferred and then the server
actually initiates the reader and the reader starts pulling data from the source system and
then these data are organized in buffers. So we fill those buffers, the buffers get into our system from the reader to the deserializer.
Their deserialization happens to some intermediate formats.
We support different intermediate formats.
Then you go to the compressor, data gets compressed.
You send the data over the network, it gets
into the client, we may decompress the data if it's compressed, then we serialize it into
the target format, and finally we load it, in this case, into the Pandas data frame. And yeah, so as I said, there are these many different steps involved.
And now, let's say the Pandas environment runs in a laptop, right?
So this means that maybe we don't have so much memory available or we don't have so
many cores available.
The cool part here is that we can configure a different number of threads for each of the different components.
We have seen in the experiments that, for example, over-provisioning leads to worse performance.
You have to actually find out the best parameter configuration.
And this is part of the optimizer that I will come to later. But also, for example, if now
the laptop is somewhere with a normal internet connection, we would probably enable compression.
But then, for example, if you use some service like Google
Colab or something that sits in the cloud next to your database you would
probably not enable compression. So this is the cool part. So you can configure
all those different steps individually in different environments and adapt the
parameters in such a way that you get the best performance possible.
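As a rough illustration of the tuning surface Harry is describing, here is a hypothetical per-transfer configuration in C++; every knob name and default below is invented for illustration, not taken from XDBC:

```cpp
#include <cstddef>
#include <string>

// Hypothetical configuration: one thread count per pipeline stage,
// plus buffer, format, and compression choices.
struct TransferConfig {
    int readThreads        = 1;
    int deserializeThreads = 2;
    int compressThreads    = 0;        // 0 = compression disabled
    int networkThreads     = 1;
    std::size_t bufferKB   = 256;      // size of each pooled buffer
    std::string format     = "pax";    // intermediate: "row" or "pax"
    std::string codec      = "none";   // e.g. "zstd", "lz4", "snappy"
};

// A laptop on a home connection might enable "zstd" with few threads,
// while two cloud VMs on a fast link would likely keep "none".
```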
Yeah, I like the granularity you have there with each component, and that you can change it to suit the heterogeneous environment that you happen to find yourself in, right? That feels great.
I have a question on the setup. When we're setting up, does the client spawn the server process, or do you have to manually set up both of these processes? Do you always co-locate them with the source and the target system, or can they be standalone? Do you have to stand those two things up before you can then send the client's request to the server?
Right, very good question. Yeah, so this is up to the implementation.
But yeah, we have both. So for example, when you want to load data from a Postgres database, right now we have a standalone XDBC server that actually sits next to the Postgres instance. But we have also done some work recently where we try to integrate it as an extension into Postgres, so that you can actually
just hit the Postgres database without any service running next to it.
Then for the clients, it's a bit different. For example, we have an Apache Spark source, and there we just use the data source API; it's just another source in the system. The main XDBC functionality is packaged as a shared library, and then you need to integrate it into your framework. So for example, when we want to load data into Postgres, we use
a foreign data wrapper that allows you to read data from external sources. Then for example,
for Apache Spark, we use the data source API as mentioned, and there we wrap the shared library with JNI,
because XDBC is implemented in C++.
So you need the bridge between JVM and native code.
And then for Pandas, we use the CPython integration, or pybind11, where you can also integrate native code with Python code.
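For readers unfamiliar with how native code gets exposed to Python, here is a minimal pybind11 sketch of the kind of binding being described; the module name, function, and signature are hypothetical, not the actual XDBC client API:

```cpp
#include <pybind11/pybind11.h>
#include <string>

namespace py = pybind11;

// Hypothetical native entry point; the real XDBC client API may differ.
int transfer(const std::string& source, const std::string& table) {
    // ... drive the native client pipeline here ...
    return 0;
}

// Expose the C++ function to Python. From Python one could then call:
//   import xdbc_client
//   xdbc_client.transfer("postgres://host/db", "lineitem")
PYBIND11_MODULE(xdbc_client, m) {
    m.def("transfer", &transfer, "Start a transfer into this process",
          py::arg("source"), py::arg("table"));
}
```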
Nice, cool. Going off the components, there are a few other ingredients in the runtime of XDBC that hold everything together.
You touched on the memory management a second ago, but if you want to go into that bit more, then please do. But there are other aspects of it around parallelism and back pressure.
And maybe you can tell us about those runtime components as well.
Yeah.
So, as I mentioned, we have these different components that communicate
through queues, and the memory is also configurable. By memory, I mean the
memory that you want your server to use
or your client to use.
For example, if you're on the cloud, you may have available a lot of memory.
But if you're on a resource-constrained device, you might not want to use a lot of memory.
And in our experiments, we actually also saw that with very little memory
in the order of megabytes, you can also efficiently transfer gigabytes of data.
But yeah, so we have a memory pool and we divide it into logical buffers.
And then, the pipeline starts at the reader, on the server side, right? And we actually get the foreign data that sits, in our example, in Postgres, and we read it into our buffers. So then we have Postgres data in our buffers.
So we then need to send it to the deserializer, but we actually don't send the whole buffers.
We just send pointers to the data and the components communicate through queues.
So we actually queue, if you want, the buffer IDs, and all the queues also have a maximum capacity.
We enqueue the elements, the buffer IDs into the queues,
and this way all the different threads of one component
can actually, whenever they are done,
they get an element from their input queue, do the work, and send it to the next component.
This means that at any one time, a thread of a component occupies a buffer, the one it gets from the upstream task, and then it gets a free one, does the work, puts the result in there, and sends it to the downstream component.
Now, of course, in a queuing system like this,
the throughput is bound by the slowest component.
By setting this maximum capacity at the queues, we ensure that the back pressure is handled.
So whenever some component is a bottleneck, it will back-propagate, and through these capacity constraints on the queues,
we will not overflow the system. So in the end, the reader will just stop reading new buffers
and putting them into the system.
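The back-pressure mechanism described here is essentially a bounded blocking queue. A minimal C++ sketch, assuming buffer IDs are plain integers (an illustrative simplification, not the XDBC implementation):

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// Bounded queue of buffer IDs: a full queue blocks the producer, so a
// slow stage back-propagates pressure all the way to the reader.
class BoundedQueue {
    std::queue<int> q_;            // buffer IDs, not the buffers themselves
    const std::size_t capacity_;
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(int bufferId) {
        std::unique_lock<std::mutex> lk(m_);
        // Block while the queue is full, i.e. downstream is too slow.
        notFull_.wait(lk, [&] { return q_.size() < capacity_; });
        q_.push(bufferId);
        notEmpty_.notify_one();
    }

    int pop() {
        std::unique_lock<std::mutex> lk(m_);
        notEmpty_.wait(lk, [&] { return !q_.empty(); });
        int id = q_.front();
        q_.pop();
        notFull_.notify_one();     // wake a blocked producer
        return id;
    }
};
```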
Nice, you have this inherent back pressure,
which is really nice, right?
Because you don't want these things overflowing.
Cool, yeah, so we've been talking about the components in sort of the abstract sense: deserialize, serialize, et cetera. Let's talk about the actual physical implementations you have of those; I guess we can think of them almost like operators in a database, in a way, right? So, yeah, tell us about the physical implementations you have of the various components in this pipeline.
Yeah, right. As you mentioned, in a way the architecture is like a data flow system. We have the logical components that we have talked about, the reader, the deserializer, etc.
And we then have some physical implementations of them.
For example, the reader and the serializer physical implementations are up to the developers
who want to connect their systems.
For example, you could have a physical reader that reads data from a Postgres database or from a CSV file or from a ClickHouse database or whatever system you want. And then the deserializer.
Here we need to transform the data from the source format into an intermediate format.
And here we can choose the intermediate format and we provide our own row format.
I mean, row and columnar formats.
This is, yeah, I mean, is columnar Pax-like. Essentially, this is like Arrow,
where we organize the data in record batches, if I would use the Arrow terminology. You
can also bring your own intermediate format.
This means you would then implement a new deserializer that takes Postgres data, for example, or whatever data you have, and deserializes it into a different intermediate format.
Then we have the different compressors. Right now, we have support for some standard compression libraries. So the physical implementations would be for ZSTD, LZ4, LZO, and Snappy.
But because we operate on a buffer at a time, you can very easily just bring your own compression
algorithm and plug it in as a physical compression operator.
And then for the network, right now we use TCP transfers.
So right now we leverage the Boost ASIO library for networking.
But again, if you want, you can bring your own RDMA library or UDP or whatever.
So then you would need to implement pairs of senders and receivers.
And then on the other side, it's just the reverse, right?
So you would need the decompressors. I mean, this goes again hand in hand with the compressors, and on the other hand, the serializers and the writers. These are also system specific.
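As an illustration of how a pluggable physical compressor could look, here is a C++ sketch of a per-buffer compressor interface with a zstd-backed implementation; the interface names are hypothetical, though the ZSTD_* calls are the real zstd API (error handling omitted for brevity):

```cpp
#include <cstdint>
#include <vector>
#include <zstd.h>   // link with -lzstd

// Hypothetical per-buffer compressor interface; because the pipeline
// works one buffer at a time, a new algorithm implements just this pair.
class Compressor {
public:
    virtual ~Compressor() = default;
    virtual std::vector<uint8_t> compress(const std::vector<uint8_t>& in) = 0;
    virtual std::vector<uint8_t> decompress(const std::vector<uint8_t>& in) = 0;
};

// Example physical implementation backed by the zstd library.
class ZstdCompressor : public Compressor {
public:
    std::vector<uint8_t> compress(const std::vector<uint8_t>& in) override {
        std::vector<uint8_t> out(ZSTD_compressBound(in.size()));
        std::size_t n = ZSTD_compress(out.data(), out.size(),
                                      in.data(), in.size(), /*level=*/3);
        out.resize(n);   // shrink to the actual compressed size
        return out;
    }
    std::vector<uint8_t> decompress(const std::vector<uint8_t>& in) override {
        // zstd frames record the original size in their header.
        unsigned long long raw = ZSTD_getFrameContentSize(in.data(), in.size());
        std::vector<uint8_t> out(static_cast<std::size_t>(raw));
        ZSTD_decompress(out.data(), out.size(), in.data(), in.size());
        return out;
    }
};
```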
Nice, cool. Given that we've got this smorgasbord of various components, the really interesting part, I think the really appealing part of this whole thing, is the optimization: how do you pick, given a certain environment and systems and whatnot, what is the best setup to use? So yeah, tell us about the optimization algorithm you have to pick all these different components, these different operators, and piece them all together to be the most efficient.
Yeah, so yeah, up to now we talked about the general transfer framework, right?
And that we now have all those nice options to configure,
so we can configure the different threads of the component.
We can pick different intermediate formats.
We can pick different buffer sizes.
We can pick different compression algorithms.
And of course, as you said, we can configure it now,
but who knows what to pick in each environment.
And to show the impact of different configurations,
we ran some experiments where we simulated different environments and just varied the number of threads that we use, and we saw that actually
there might be an order of magnitude performance difference. So if you pick a very bad configuration,
instead of a just okay one, you are way off
and you don't want this to happen.
So how do we go about this?
To solve this, we have an optimizer.
So the optimizer gets as input what we call the base throughput rates of each component.
This we obtain through profiling.
You can imagine either you do one run in the beginning or this can even be...
One can obtain this information also on the fly, like just in the beginning of a transfer.
So you want to know how fast each component processes a buffer.
And given this information, now we have our base throughputs.
And we also have a number of threads that we can give to our server and client overall.
The optimizer is pretty straightforward in that sense that we go to the slowest component
and we just keep adding threads until it's not the slowest anymore.
We do this in an iterative fashion until we are out of threads.
So for example, we start and we see that the deserializer is the slowest component. So we just add threads to it, and we estimate the throughput that it would get with different threads using Gustafson's law.
So, of course, you don't have linear scalability when you just add threads to a task.
So, I mean, there is some estimation work there.
But the idea is that we go to the slowest component, for example, the deserializer.
We say, okay, with one thread, you achieve X throughput, now take two threads.
And now we reevaluate and we see, okay, now the network is the bottleneck.
Okay, now let's add compression.
Now we have compression, can we do faster?
And yeah, what I forgot to say is that
we also have some bounds, right? For example, you have your network limit. So even if you try very hard, you cannot be faster than the network. So we optimize the number of workers for each component until we run out of available threads or we have reached the upper limit, let's say.
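A minimal sketch of the greedy heuristic as described: profile per-stage base rates, then repeatedly give a thread to the current bottleneck, estimating scaling with a Gustafson-style model. The stage names, base rates, serial fraction, and thread budget below are all made up for illustration:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Stage { const char* name; double baseRate; int threads = 1; };

// Gustafson-style scaling estimate: speedup(n) = n - alpha*(n-1),
// where alpha is the stage's (assumed) serial fraction.
double rate(const Stage& s, double alpha = 0.1) {
    return s.baseRate * (s.threads - alpha * (s.threads - 1));
}

int main() {
    std::vector<Stage> stages = {          // MB/s per thread (made up)
        {"read", 400}, {"deserialize", 150}, {"compress", 250}, {"network", 300}};
    int budget = 16 - (int)stages.size();  // threads left after 1 each

    while (budget-- > 0) {
        // Find the current bottleneck and relieve it with one thread.
        auto slowest = std::min_element(stages.begin(), stages.end(),
            [](const Stage& a, const Stage& b) { return rate(a) < rate(b); });
        slowest->threads++;
    }
    for (const auto& s : stages)
        std::printf("%-12s threads=%d est. rate=%.0f MB/s\n",
                    s.name, s.threads, rate(s));
    // End-to-end throughput is bounded by the slowest stage (and by
    // hard limits like network bandwidth, omitted here).
}
```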
Sorry, I'll continue.
I just wanted to say, and we might talk about this later as well, that in the evaluation we saw that, depending on the dataset, different compression libraries work differently. Or depending on the available number of threads that you have in your system, you may want to use different compression libraries.
So we saw that it actually makes sense to configure these components individually.
Nice.
Yeah, I was just going to say eventually you reach the bounds of the laws of
physics, right? Essentially, you can't go any faster. It's physically impossible to.
Yeah, on this, and maybe I'm getting into the evaluation, but how quickly can you converge to the optimal strategy for the transfer? Like, if it's a really short transfer, I guess it might take time to reach the optimal setup. Is there a size of transfer before it pays off? And my other sub-question was: is there memory over time, of being like, oh, this transfer looks similar to one I did before on this same sort of hardware environment, therefore I can jump nearer to the optimal quicker?
Yeah, very good question.
So in the first version of XDBC,
I mean, this is already a couple of months old,
but it's an ongoing project.
But in the first version,
we would
just have a profile for a particular data transfer, a particular environment,
and the heuristic optimizer would output one configuration and this would be it.
So, of course, when you would run the transfer again, you would get your new profiles and adjust, but this we didn't explore
so much.
Now, we have an ongoing project where we actually try to learn, let's say, from past transfers
or we try to build.
It's very cool work actually, but I don't have the full results yet, but the idea there is that we can categorize different environments
and we can sample some configurations so that we see which configurations run well in different environments
so that we have a good starting point. Because we do all these micro benchmarks,
and the idea is to build a model
that we can already embed into the connector
so that it has a good starting point on the one hand,
so that if you identify in which environment you are,
you already go with a good transfer,
but it
also will learn on the fly.
And we have seen that, so after a couple of iterations, you already converge close enough
to the best possible throughput.
But this is not part of the original paper.
Yeah. Oh, sure. And I guess following on from that, the big acronym that's lurking in the background: AI and machine learning. When are we going to get an ML-based optimizer for XDBC? I guess we're kind of getting there, with it learning over time. But yeah, I guess there's a whole host of techniques from ML-based query optimization and all that sort of stuff that can maybe be applied to this setting as well, maybe.
Yeah, yeah. So, I mean, I don't want to spoil, but this is ongoing work. And actually, we have
applied some ML techniques, but also more traditional optimization techniques. And to be honest, traditional optimization techniques also don't perform that badly.
Yeah.
That's kind of consistent with some other research I've seen and people I've spoken to: a lot of the time these traditional, less fancy techniques can get you like 80% of the way there, or good enough 90% of the time or whatever, which is that complexity versus simplicity and understandability trade-off as well. And yeah, so that's interesting that that may apply to this setting as well. What's it like, I mean, obviously this has been a research
project, but from like the user experience, what is the tooling around monitoring how things are happening?
Can I visualize how long my transfer might take given the way things are at the moment?
Is that sort of information exposed in any way?
Yeah.
So, this is also not part of the original work, but it's ongoing work; things move fast. Here we also want to be able to change the parameters during runtime. So, in the first version, I said we have the heuristic optimizer. We give it one configuration,
the transfer runs, and then maybe in the next transfer you can re-optimize. But what we're
working on now is actually dynamic reconfiguration, so that during runtime you can scale the different
amounts of threads for each component and stuff like that. And therefore we expose this, because the optimizer actually needs this information. I mean, as you mentioned, you as a user might also want to see what the bottleneck of your current data transfer is. But this is also important for the optimizer, right? It needs this information to make a better decision. So yeah, we have an endpoint where you can both set the different parameters,
but also get the current runtime information.
Yeah.
Nice. That's really cool.
Let's talk about some numbers and an evaluation and put some numbers to this discussion.
So yeah, first of all, set us up.
What was the experimental setup like?
What were the benchmarks you ran? And yeah, tell us the results.
Yeah. So first we wanted to see, end-to-end: how do we compare against the different state-of-the-art tools for transferring data? Can we reach the performance of specialized connectors?
And do we also outperform the generic solutions out there?
So this was our goal.
And to also have different environments, we used a setup with Docker, and Docker has some really nice properties where you can set the number of CPUs that you want the container to have, and you can also change the network characteristics, with a tc-like utility.
So this is the setup.
So we spawn our systems in Docker containers,
we configure different CPU and network characteristics, and we run our
benchmarks. And as I said, we wanted to first see end-to-end how does the
performance look like, and therefore we picked different use cases.
For example, from the data science domain, one would want to load the Postgres table
into a Pandas data frame.
And here we also tried out different data sets.
And an important baseline here was ConnectorX.
This is also a recent VLDB paper. We have this as a specialized connector because this one is really optimized
for transferring your database tables into Pandas data frames.
There we saw that we are actually on par with ConnectorX,
sometimes a little bit faster, sometimes a little bit slower,
but overall we're in the same ballpark.
Then we also have DuckDB, because it has specialized, let's say, database readers, and in this case we use its Postgres reader. And here we are, I would say, around three to four times faster
when loading a Postgres table into a data frame.
Then we also used these generic connectors: Modin has one, which can also parallelize, and then we also have the turbodbc connector. And there we are, I mean, around maybe 10 times faster. But again, these are generic connectors, right, so this is expected behavior. Then maybe I can also talk a bit about when we want to transfer a Postgres
table to Spark. So here we also use the standard JDBC driver of Postgres in Spark.
And here we also see, for the lineitem dataset, for example, that XDBC is around five times faster than the default JDBC connection.
Then I also have another interesting experiment
where we transfer a Postgres table to a Postgres table.
So here we also have as baselines JDBC.
So Postgres allows you to have these nice foreign data wrappers, right?
And we talked about this also in the XDB podcast.
So the idea there is that you can actually access data that's not inside your Postgres database,
but living somewhere else. And one way to implement a foreign data wrapper is with a JDBC connector.
And there, for the lineitem dataset, we had a timeout after 20 minutes, so the transfer couldn't finish. But for the others, we saw something like an order of magnitude difference.
And the interesting part here is that Postgres also has its own native foreign data wrappers.
So these are kind of built to transfer Postgres data between Postgres instances.
And here we actually performed quite a bit better.
So this is also like two to five times better,
which is pretty interesting because yeah,
these Postgres foreign data wrappers are really optimized
for Postgres instances.
Yeah, yeah, yeah.
Yeah.
So these were some numbers from the end-to-end experiments.
And then we also have different, let's say, microbenchmarks.
For example, we wanted to see what is the bottleneck in different pipelines.
So, for example, we have this microbenchmark where we transfer a CSV file to a CSV file. And there, if we increase the read parallelism, it doesn't help, because the read is not the bottleneck; it's the deserialization. So there we see, okay, we might need to increase the parallelism degree of the deserializer.
But then when we did exactly the same experiment, but we read from Postgres, we saw, okay, actually
the read is the bottleneck. So we should increase the read parallelism. And this is again to
motivate a little bit why we need an optimizer and why each environment is
different. And then I think I just want to share different some for the formats because there is a conception that columnar format is always better.
You should always go columnar. This is not true actually.
So we saw that of course when you compress sometimes,
I mean because the value domain is more constrained in each column, this may help,
but it's not the case for all datasets. Here we saw that columnar is not always beneficial
as an intermediate format. Also, with regard to compression, what is interesting is that the compression library you should choose also depends on the number of threads you have. There's compression speed, compression ratio, and decompression speed.
You always trade off those three things.
We saw that, for example, in very high-speed networks,
compression doesn't help if you don't put enough workers for the compression and decompression.
And then for a low amount of parallelism, for example, if you are on a constrained device
and you can only allocate a few threads for compression, you might want to choose a different compression algorithm
that will give you better end-to-end performance.
Even though you may transfer more data because your compression and decompression is slower, you might want to choose a particular compression mechanism; but then this also changes as you increase the threads you allocate to the compressor and decompressor.
Yeah, and maybe one last thing. What we also try to do is to evaluate our cost model.
So we compared the estimated throughputs with actual throughputs that our optimizer provided.
They're not perfect, but they work out. They're not too far away from reality, the estimated ones.
Most of the time, we see that it's overestimated, which is not a problem because this means
that we would allocate a bit more threads.
We wouldn't have worse performance than expected.
In that sense, the cost model works fine right now.
But as I said, we also have ongoing work on the optimizer, so we hope to be better there as well.
Yeah, to close that gap over time.
Yeah, there are some fantastic insights there as well, and I think if we take a holistic view of all of the results, the idea behind XDBC is clearly validated here. This thing definitely improves on what else we've got out there in the state of the art.
I guess the natural question that follows in this discussion
is, we've probably touched on quite a lot of bits of where
you go next, what's in the pipeline with XDBC coming up.
One topic I wanted to ask you about, and I don't know where this fits into XDBC, is fault tolerance: what happens when data transfers go wrong, like things crash, or the network's a little bit flaky? How do we handle faults at the moment? Because from personal experience with large data transfers, the longer they run, the more chance something's going to go wrong. So how do you handle things going wrong, essentially?
Yeah.
Good question.
But I don't have a good answer. So, this work was more focused on: okay, do we benefit from decomposing the data transfer pipeline? Do we benefit from different configuration knobs?
But of course, we have not addressed different things like, for example, fault tolerance.
So, on the network side, we rely on TCP. So, if something gets lost in the network,
we hope that the underlying TCP library will take care of that.
But yeah, there could be more sophisticated things on the fault tolerance side.
So, for example, having different checkpoints or being able to...
Because right now, all this is in memory, right?
Yes.
And with crashes, forget about it.
But we could have something like a write ahead log where you could say how much data actually you read from the source and how much of it was written in the target system so that you can just replay from a particular point in time and not just restart everything.
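Purely as an illustration of that checkpointing idea (future work, not something XDBC implements), here is a tiny C++ sketch of persisting transfer progress so a restart could resume rather than replay from scratch; all names and the file layout are invented:

```cpp
#include <cstdint>
#include <cstdio>

// Progress record: how far the reader and writer have gotten, so a
// failed transfer can resume from the last durable offsets.
struct TransferCheckpoint {
    uint64_t sourceRowsRead    = 0;
    uint64_t targetRowsWritten = 0;
};

void persist(const TransferCheckpoint& cp) {
    // In a real system this append would be fsync'd (write-ahead style)
    // and use a portable on-disk encoding rather than a raw struct.
    std::FILE* f = std::fopen("transfer.wal", "ab");
    if (f) { std::fwrite(&cp, sizeof cp, 1, f); std::fclose(f); }
}

int main() {
    TransferCheckpoint cp{100000, 98304};
    persist(cp);  // on restart, replay the source from sourceRowsRead
}
```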
The next set of questions is obviously around impacts and surprises. So, kind of jumping off with the impact to start with.
If we've got our crystal ball here, what sort of future impact do you think XDBC can have, or maybe has already had? What's the long-term vision for it, maybe over a longer time horizon? Like, what impact do you think it can have?
Yeah, I think on the one hand, it would be interesting to see if existing data transfer tools, like standard JDBC, could expose more configuration knobs. Because right now, for example, in JDBC, you can just configure, what is it, the fetch size, or how many tuples you fetch from the database.
That's it. I could see
different systems exposing more configuration knobs on the one hand.
I mean, optimizer in production is always a tricky thing,
so people don't want too many things to happen automatically. Of course, in a database,
you want the query optimizer to pick the best plan for you.
But these machine learning approaches, yeah, I'm not sure how much people want to use them in production, because we're talking about the real world now, outside the research paper. But yeah, so exposing more configuration knobs for data transfer techniques would be one thing. And then, there are few interoperability mechanisms between systems, so maybe this work could spark some more initiatives. I mean, obviously Apache Arrow is one such thing, but this is more like an intermediate format. Of course, there are also things built around it; you know, there is the Flight server, which also kind of provides similar functionality here. But this is also not really decomposable and configurable; you always need to go through Arrow. It would be interesting to see how different systems may want to work together. That would be cool. Of course, with the XDBC project we will try to provide
different connectors for different systems so that you can play around
and transfer your data fast. But again, right now we're just a little team
and this can be seen as something you could use as a prototype. But of course, I mean, to go further, this would
need more serious work.
Yeah, yeah, cool. I guess on that note, it is kind of an open source project people can go and play with, right? If they so choose.

Yeah. As we talk, the project is not...
I think during SIGMOD we will have an open source version that people can play around with.
Right now we have code for the experiments, of course, because this was part of the...
When you submit the research paper, you always need to be reproducible.
But it's a different thing to expose a framework that's usable.
But we have made good progress on that in the last months.
And I think that during SIGMOD, we will be open source.
Nice.
Ready for the big reveal.
Cool.
Yeah.
So I guess we'll just go for two more questions here. And the first is surprises: what caught you off guard whilst working on XDBC? What was the most interesting thing that you did not expect?
Yeah, so I mean, what was interesting for me was that I hadn't done too much serious work with C++ before. So for me, it was also an adventure.
Yeah, getting to know the low-level details.
And it's crazy how a branch can cut your performance. It's so crazy.
But then what was more interesting was that for different systems and different environments, you need so many different configurations. This I didn't expect.
Yeah.
Yeah, that's cool.
Yeah, there's a lot of diversity and heterogeneity out there in the wild, right?
Yeah.
Yeah.
Cool.
Yeah.
So the penultimate one is something that I'm trying out, Harry, so if nothing comes to mind then don't worry about it. I don't know whether we did this for SheetReader, I think we did, but in a DuckDB context: it's about giving some sort of recommendation, anything tech related or computer science related, be it a blog post, paper, or tool that you've encountered recently that made you think, oh, that was fun, I enjoyed that, that was cool, go check it out?
Yeah, I think it may sound funny, but CLion.
So we often don't appreciate the tooling that we get so much.
So yeah, it's very nice because we had very different environment setups; we integrated Python with C++, and then also Java and JNI, and we have different containers running here and there. And it's amazing to see how these tools, in particular the JetBrains family that I use, how well they work, what amazing solutions they provide, and how much they make our life easier.
Yeah, we take it for granted a little bit, right?
And also, CLion is a great play on words. It's a good name, so yeah, it has that going for it as well.
Yeah, cool.
I guess with that then, let's move on to the last word then.
So what's the one thing you'd like the listener to get from this podcast episode today?
Yeah, so, just recapping: I think data transfer is crucial in today's applications. And we have really focused on making performance better within each individual data system, right?
But then often we forget about how important integration is, and in particular the interoperability between systems. So I think a lot of performance and resources get wasted on poor interoperability mechanisms. And yeah, I think XDBC is a step towards making interoperability mechanisms better, but of course there is more to come.
And we look forward to that. Thanks again, Harry, for coming on the show. It's always a pleasure to have you on. Oh yeah, and we'll put links to everything in the show notes as well, listener, so if you want to check out anything we've mentioned today, you can go and find it there.

Yeah, and I would also like to mention that this is work in collaboration with my colleagues
Kaustubh Beedkar from IIT Delhi, and Matthias Böhm at the BIFOLD institute at TU Berlin, and of course Volker Markl, also from BIFOLD and TU Berlin.
Cool, great stuff. And yeah, I guess we'll see you all next time for some more awesome computer science research.
Yeah, thanks for having me Jack.