Disseminate: The Computer Science Research Podcast - Lukas Vogel | Data Pipes: Declarative Control over Data Movement | #28
Episode Date: March 28, 2023
Summary: Today's storage landscape offers a deep and heterogeneous stack of technologies that promises to meet even the most demanding data intensive workload needs. The diversity of technologies, however, presents a challenge. Parts of it are not controlled directly by the application, e.g., the cache layers, and the parts that are controlled often require the programmer to deal with very different transfer mechanisms, such as disk and network APIs. Combining these different abstractions properly requires great skill, and even so, expert-written programs can lead to sub-optimal utilization of the storage stack and present performance unpredictability. In this episode, Lukas Vogel tells us how we can combat these issues with a new programming abstraction called Data Pipes. Tune in to learn more!
Links: Paper, Homepage, Twitter, LinkedIn
Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host Jack Wardby.
Today we are joined by Lukas Vogel, who will be talking about his CIDR paper,
Data Pipes: Declarative Control over Data Movement. Lukas is a PhD student at the Technical
University of Munich and his research areas are adaptive storage and non-volatile memory.
Welcome to the show, Lukas.
Thanks for having me.
Great stuff. Can you start off maybe by telling us a little bit more about yourself and how you became interested in researching databases and data management?
Yeah, okay. So like you said, I'm Lukas. I'm currently a fifth year student at the database group at TU Munich.
And I'm now more or less pretty close to submitting my dissertation. As for how I got into database research, well, I think I was always interested in programming
close to the hardware, like, think about cache locality and associativity, where that kind of stuff
matters, right, where you have to think about SIMD instructions and so on. And I never thought about
databases as being in this context, right? But at TU Munich, the database chair is very close to hardware.
So during my master's, I attended a lecture where we actually had to build our own database system in C++ from the ground up with Professor Neumann.
And this kind of showed me that a database could be very close to hardware.
And then I started there and I never regretted it.
So that's my path to database research. Amazing, like, trying to build a database from scratch as
a master's sort of project, it's quite a daunting task, right? I bet that was fun. Yeah, really fun. Like,
of course, we didn't do everything right, but we got it to execute SQL queries, and, uh, a lot of
fun, a lot of hard work.
But I think it was the lecture I learned the most from, actually, in the whole master's and bachelor's degree,
because I actually had to build some stuff.
Amazing. Yeah, that sounds great.
Cool. So let's talk about the star of the show today, data pipes. So declarative control over data movement.
Can you maybe give us the high-level sell for this, kind of the elevator pitch?
Yeah, of course.
So I would say most performance-critical programs and algorithms
are probably a lot about data movement as well, right?
You have data on some disk or somewhere.
You then have to move it to your CPU through DRAM caches and so on,
do some stuff with it, then maybe buffer it somewhere,
get it back, and all that kind of stuff.
So you have a lot of data movement already in algorithms, if you think about it or not.
And of course, hardware manufacturers noted that and they tried to introduce shortcuts
for users to use without using the CPU.
So the CPU would be free for computation.
So that is what brought us DMA, direct memory access and stuff like this, right?
The issue, however, is that those shortcuts are often bottom-up, right?
So you're not meant as a developer to know about those shortcuts.
The hardware thinks about them and tries to activate them whenever it's a good thing to use them.
So that movement happens implicitly.
For example, think about using mmap, right?
You mmap some memory region, and then your operating system thinks about when to move that stuff
actually into your DRAM, and then the CPU thinks about when to move that stuff into cache.
But this is hard to use, right?
This is all happening implicitly.
And if you want to do it better than what the hardware figures out by itself, you have
issues, right?
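To make the implicit-movement point concrete, here is a minimal sketch (not from the paper, just an illustration, with error handling omitted) of the mmap pattern being described: the program never says when bytes should travel from disk to DRAM to cache, the OS and CPU decide for it.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Minimal illustration of implicit data movement via mmap.
long sum_bytes(const char* path) {
    int fd = open(path, O_RDONLY);
    struct stat st{};
    fstat(fd, &st);
    auto* data = static_cast<const unsigned char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    long sum = 0;
    for (off_t i = 0; i < st.st_size; ++i) {
        // Each access may fault: the OS decides when the page reaches DRAM,
        // and the CPU decides when the line enters (and leaves) the caches.
        sum += data[i];
    }
    munmap(const_cast<unsigned char*>(data), st.st_size);
    close(fd);
    return sum;
}
```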
And for those reasons, we present the vision of data pipes. The idea is, instead of this
bottom-up
approach where everything is happening implicitly, we say, why not do data movement top-down, right? You
as the developer explicitly state which data you expect to be where and when to move it to where
you need it, so you can make more efficient use of the resources of the system. That's kind of the
abstract elevator pitch, I would say. Amazing, that's really cool. Um, I guess, as we dig into sort of how it's working, maybe you can start by telling
us a little bit more about sort of the modern storage hierarchy, right? There's a lot of acronyms
in this space, right? So maybe you can sort of give us an overview of at least what all these things
mean, and kind of the primitives we have available today. Yeah, we had
the high-level sell a little bit there, but talk about maybe some more of the primitives we have.
Yeah, there actually were a lot of issues, right? Like, when we started talking about that stuff
in the research group, there were lots of acronyms I didn't even know, and, like, yeah, really hard, right?
So I'll give you the overview of the most important ones, I'd say. So I think the issue
nowadays is, you just have this classical storage hierarchy pyramid, right? You have, like, HDD on the bottom, really slow, then you have your DRAM,
and then you have your caches and your registers and so on. And this has been true for, like, 40
years, and applications have been built around that kind of hierarchy. The operating system
expects this hierarchy to be there, and so on. But nowadays we have lots of new devices that
kind of fit in this hierarchy, but not really. So, for example, we have NVMe SSDs, which are a lot faster compared to the old SATA SSDs,
also more expensive, but mostly worth it, right? Um, but they are attached over PCIe, for
example. Then, well, it's been killed now, but, like, a year ago, we had Intel Optane
persistent memory, which is like DRAM, right?
You address it via load and store instructions from the CPU, but it's persistent, right?
You can shut the system off and you still have your memory there.
The downside of it, it's slower than DRAM.
So it's not just a drop-in replacement.
It's like slower by a factor of two or three or something.
Then of course, we also have network, right?
So we can have everything that is attached locally.
We can also attach it remotely via some
network interconnect via RDMA.
We nowadays even have
disaggregated storage over, like, CXL,
which also works over PCIe.
So yeah, so
this pyramid, like, there are lots of
weird attachments to the side now.
So it's really hard to manage.
And of course, everything in this pyramid is also
accessed differently.
So some stuff is accessed quite easily, like, for example, SSDs over NVMe.
That's just a protocol you can use or your operating system can use.
But then we have kind of those leaky abstractions.
For example, if you think about the cache: historically, we were never really meant to know about the cache, right?
It's there to be transparent,
so you just access some stuff, and then it will be moved into the cache by the CPU and will be
flushed when we don't need it anymore. But nowadays, for example with persistent memory, we kind of need
to have control over the cache. So we have instructions like CLFLUSH, which flushes cache
lines back to DRAM. We have CLDEMOTE, which demotes stuff from L2 cache to L3 cache. We have the prefetcher that
prefetches stuff into the cache.
And if you build a modern performance-critical application, you have to know about that stuff, but historically you were not really meant to know about this, right? Intel doesn't really want you to control the cache that much.
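As a rough illustration of the instructions just mentioned, here is a small sketch (not from the paper) assuming an x86 CPU and a compiler exposing the corresponding intrinsics; CLDEMOTE in particular only exists on newer parts, so it is left as a comment.

```cpp
#include <immintrin.h>
#include <cstddef>

// Sketch: explicit cache control from user code on x86.
void process_buffer(char* buf, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 64) {      // 64-byte cache lines
        _mm_prefetch(buf + i + 256, _MM_HINT_T0);  // ask the prefetcher to pull a later line in
        buf[i] += 1;                               // touch the current line (the "work")
        _mm_clflush(buf + i);                      // CLFLUSH: write the line back and evict it
        // _cldemote(buf + i);                     // CLDEMOTE (newer CPUs): demote toward L3 instead of evicting
    }
}
```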
And then, of course, we have even stranger shortcuts.
So in the paper, we have two, called
DDIO and IOAT. So the idea of both, well, I think IOAT was introduced in 2006 by Intel, DDIO in 2012. The idea was, if you have, like,
networking, and you have your network card, and you have to move the packets to the CPU,
it's too slow to do it with the CPU. So with IOAT, you have a DMA unit
that directly moves that data into memory for you
without the CPU's involvement.
And we found out you can kind of misuse this
to also move data between persistent memory and DRAM.
I think IOAT was invented in 2006, or introduced,
and PMEM was introduced like three years ago.
So it was never meant to be used that way,
but you can kind of use it that way.
A happy coincidence.
Yeah, I think nobody even at Intel knew about that.
It just works that way,
which is really great for us,
but there are issues, I think, we will come to later as well.
And the other thing is DDIO.
This was also meant for networking,
I think introduced in 2012 for 100 gigabit Ethernet.
And there the idea was that you could directly move stuff from the network card into your L3 cache.
Because if you moved it to DRAM, it would be too slow.
Because then when you access it, you would have a cache miss and have to load it.
So to process the packets at line speed, you can directly DMA them into the cache.
And it turns out also, not meant by Intel that way, I think,
is that you don't only have to use network cards for that. It works over PCIe, so you can just as well use
NVMe SSDs to move data from your SSD directly into your cache. So we thought, that's great stuff, right? Why not use this to have more efficient data paths where we don't have
to involve the CPU at all,
because everything here works with DMA. But the issue with, like, this zoo of primitives is that
you have a lot of different abstraction levels here. Like, is it managed by the system? Is there
an interface you can use? Some interface, but you're not really meant to use it? Different philosophies,
and originally designed for different tasks. Like I said, we
tried to use it for stuff it wasn't meant to be used for. Yeah, so that's kind of a coarse overview of, you know, nowadays. Awesome stuff. Um, so you took something out of that, that you can kind of use
these primitives, like DDIO, to improve things, right? So you actually have a really nice case study
in your paper that uses external sorts
to illustrate the potential of these primitives
if we can kind of, not misuse,
but use them in certain ways
to improve how we move data around, right?
So can you maybe walk us through this example
and illustrate why, how it can improve that?
So external sort, like we said,
we want to speed up cases where your data
movement is kind of predictable.
And I felt like there's not a better example than external sort, right?
It's very predictable in the way you have to move data, right?
So you start with your data on some kind of background storage, let's say an SSD.
Um, then you have to sort initial runs, right?
Like you, you move small packets of data into your caches, then sort them and move them back
to DRAM.
So that's predictable movement, right?
You know the size of the stuff you move in, you know the size you have to move out of
your cache again.
Then when your main memory is full, you kind of have to spill those sorted runs from DRAM
to some kind of background storage.
That's the point of external sort if it doesn't fit into DRAM.
Then, of course, later on, you have to load them back into the system.
Here again, you can then move them directly into the cache, because you have to merge the
sorted runs in the cache again with your CPU.
And then you have to write them back: like you mentioned, you send them to DRAM,
and then from DRAM you write the merged runs back to
the output storage at the end. So the idea is that those are really predictable
movements, right? You know beforehand when to move what where. And also, we found out there exists a
lot of data movement primitives for that stuff, right? So for the first part, where you load the
unsorted runs into cache, you can use DDIO to move them directly from SSD to cache.
Then, of course, from the cache to DRAM, you can explicitly flush them with a CLFLUSH instruction from your CPU,
because you know you won't need that piece of the run again after you've sorted it.
Then from DRAM to PMEM, we can use IOAT to move from one kind of memory to the other kind of memory,
unbeknownst to Intel, who didn't invent it
to be used
that way. And then later on,
when we have those
sorted runs on PMEM, we can move them back into
cache with IOAT as well,
and then move them back to
SSD again. So we thought,
for every kind of movement, there exists
a nice primitive that doesn't involve the CPU
at all, so the CPU can be busy doing sorting or other database stuff in the background,
but we do the data movement for the CPU.
So we thought, why not use external sort as a motivating example for the paper and show how you could profit here.
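To summarize the flow just walked through, here is a small sketch (purely illustrative, not code or an API from the paper) listing each movement in the external sort together with the primitive discussed for it in the conversation:

```cpp
#include <array>
#include <string_view>

// Each step of the external sort and the data-movement primitive discussed above.
struct Movement {
    std::string_view from, to, primitive;
};

constexpr std::array<Movement, 5> kExternalSortFlow{{
    {"SSD",   "L3 cache", "DDIO: DMA unsorted chunks straight into the cache"},
    {"cache", "DRAM",     "CLFLUSH: evict each sorted run, it is not needed again soon"},
    {"DRAM",  "PMEM",     "IOAT: spill sorted runs when main memory fills up"},
    {"PMEM",  "cache",    "IOAT: stream runs back in for merging"},
    {"DRAM",  "SSD",      "write the merged output back to storage"},
}};
```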
Fantastic. And that's a nice segue into the next question, which is: you actually quantify the performance gains you
can get by the strategic use of these primitives. So can you tell us a little bit about
how you actually went about quantifying that? Kind of tell us about your experimental setup and the
questions you were trying to answer. Obviously, you're trying to see how fast it went, right? But,
I mean, yeah, can you elaborate on that a little bit more, please? Yeah, actually, interestingly,
speed was not really that important to us.
So we thought we had kind of two goals.
So of course, we benchmark DDIO and IOAT, right?
So moving data to and from SSD, and to and from PMEM.
And the idea here was for one,
of course,
are there actually performance benefits of doing it that way?
Right?
Like if,
if the old fashioned way of just using the CPU to move data around is as
fast or faster,
why, why do it that way?
So the idea was it should be at least the same speed while offloading
computation. So you don't need the CPU to do that stuff.
So like if that weren't true, like why would we care? Right.
But the second point I think, and that's equally as important is,
the question was, are they actually usable in the way we wanted to use them?
Right. Up until now, it's just like, wouldn't it be nice if we could do that?
Is it actually possible to do it?
It was not really clear; like I said, nobody used it that way, I think.
There are some papers that did kind of that stuff, but like I said, DDIO was meant for NICs, and IOAT as well.
So we designed two experiments. So for DDIO, we said, okay, let's assume we want to have this case
where we load unsorted data in chunks from your disk,
from SSD into the cache and sort them.
So we simulated them by moving data with varying chunk size from the SSD
directly to the L3 cache with DDIO enabled and disabled to compare.
And then we just iterated over the data in memory
to ensure that it's actually touched, right?
So you actually need to load it into the registers.
And at the same time,
we also ran a bandwidth-intensive workload on the side
to really stress the DRAM bandwidth,
to make sure we actually could see
if we have an advantage of directly
loading into the cache or not. So this simulates, like, a system that at the same time does something
completely different as well, right? We're not the only tenant on the system. Okay, so that was
DDIO. And for IOAT, right, the point we wanted to make is: when we have the data sorted in DRAM in
runs, we now have to evict them into backing storage,
runs, we now have to evict them into backing storage,
in this case, simulated by PMEM,
and then back again later on,
when we have to merge those sorted runs again.
So in this case,
we also had those runs in varying chunk sizes
and just moved them back and forth, once with IOAT enabled.
And in the other case, where we disabled it,
we just used memcpy, right?
So this is the way you would move stuff from or to PMEM
if you don't have some kind of DMA unit that can do that.
And those were the two experiments we tried to run.
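For the CPU-driven baseline, here is a rough sketch of what that fallback path can look like (an illustration only, assuming the PMEM device is exposed as a DAX-mapped region; the benchmark code in the paper may differ): a plain memcpy into the persistent region, followed by cache-line flushes so the data actually reaches the media.

```cpp
#include <immintrin.h>
#include <cstring>
#include <cstddef>

// CPU-driven fallback: copy a sorted run from DRAM into a PMEM-mapped region.
// Every byte travels through the CPU and its caches, unlike the IOAT path.
void spill_run_with_cpu(char* pmem_dst, const char* dram_src, std::size_t len) {
    std::memcpy(pmem_dst, dram_src, len);    // load/store copy through the CPU
    for (std::size_t i = 0; i < len; i += 64) {
        _mm_clflush(pmem_dst + i);           // push each cache line out to the PMEM DIMM
    }
    _mm_sfence();                            // order the flushes before continuing
}
```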
Awesome, great. So let's talk numbers then.
So what were the key results of each of the experiments?
What were the gains and was it more usable as well?
Obviously, there was that angle to the experiments as well.
What were your findings there? Yeah, so for DDIO, we were surprised that it actually was usable, right? We
thought, you know, um, moving stuff into L3 cache from an SSD, you know, DRAM is already so much faster
than SSD, it shouldn't really matter if you move it into DRAM or cache, because, like, the time it
takes to load something from DRAM into cache is not that
big, compared to, like, the SSD only having a throughput of like three gigabytes a second or something.
But it turned out, if you really stress the system, it makes a huge difference. Like, as long as the
chunks you load fit into the L3 cache, I think we had improvements from, like, one
gigabyte a second to like 1.4 gigabytes a second or something like
this, and of course also reduced latency. So DDIO actually is a good thing here, right, even if you
think about really slow SSDs. The downside, however, and this was the second part of the experiment,
is the usability is really bad, because DDIO, like, it's not something Intel wants you to mess with. So it's either globally enabled for a device or disabled.
And there are some configuration parameters,
but they're not documented at all.
They are like some proprietary registers in the CPU.
You can kind of change some values
that nobody really knows what they do,
but they make it faster or slower.
And I actually looked at, like, Intel documentation, and
they kind of agreed, like, it would be nice to have some kind of parameters to change this, but this
is not documented and not supported at the moment, stay tuned. And I think this documentation
hasn't been updated in the last 10 years. And, you know, it makes sense for Intel, because they say,
right, we built this for this one use case, and it just works out of the box, and it makes it
faster. So why not just use it that way?
But it kind of makes it bad if you, like me,
try to use it for something it wasn't intended to be used for.
So much about DDIO.
For IOAT, of course, we also did the experiment.
And we found out IOAT is really great if you move stuff from PMEM to DRAM.
So we have three times the throughput.
So nine gigabytes a second
instead of three gigabytes a second.
And on top of that,
we don't have any CPU involvement at all.
So the CPU is free to do other stuff.
So this, of course, is great
because PMEM actually is really intensive on the CPU,
because every move is a load and store instruction,
because it's the same interface as DRAM.
So great in that regard.
So really awesome.
If you want to move stuff from PMEM to DRAM,
use IOAT.
On the other hand, the issue is
if you move stuff from DRAM to PMEM,
it's actually really bad.
And we were really puzzled by that.
Why would it be?
It's fast in one way.
It should also be fast the other way.
Also, if you move stuff that way,
like PMEM has like a really high read and write bandwidth.
Like this shouldn't be a bottleneck.
So to get to the bottom of this,
we also actually measured the write traffic
that was on the PMEM DIMM itself.
So it comes in sticks
that look like normal DRAM DIMMs,
and you can measure on the physical memory
what the throughput is there.
And it turns out it's three times higher there
than it is on the data that's actually moved.
And we were really puzzled by that.
Why is that the case?
And then we found out that IOAT actually has a feature
which is called Direct Cache Access, DCA.
The idea being that Intel said in 2006,
right, if you move some stuff from one memory location
to another, you probably also want to do some computation on it.
Because otherwise, why would you have moved it?
So they've also put it into the cache, which is great if you actually want to use it.
But if you just want to move it out of DRAM into PMEM, you don't need it to be cached because you actually explicitly don't want to access it.
So for one, it's slower already because we have this detour to the cache.
On the other hand, also,
the cache is then evicted semi-randomly in the end.
So it's not evicted sequentially.
And PMEM internally has a block size of 256 bytes.
So it only reads data in 256-byte blocks.
And if you write randomly to it,
that means you have a write amplification, right?
Because you just put some data into one of those blocks, and
then write the whole block back if you write random
data. In this case, cache line is
64 bytes, but those blocks are 256
bytes in size, so four times
the size, and you have
big write amplification there. So it turns
out, because of this feature, what
was a really great idea in 2006,
PMEM now is slow.
And the worst thing is, DCA, you can't even disable it in modern CPUs.
I think there are some BIOS settings
in, like, CPUs that are 15 years old,
but nowadays it's just forgotten.
So nothing you can do there.
So like with DDIO, it's like great in theory,
but in practice,
it doesn't really work out really well
because hardware designers 20 years ago
made assumptions that just don't hold
for modern hardware anymore.
Yeah, I just wanted to ask you about
when you were trying to figure out how to use these
and you're looking in the docs,
how easy was that to kind of figure out,
oh, I need to change this number to make things go faster?
It was horrible.
Yeah, so there's a mix of
documentation from 2006
of some Linux kernel maintainers that built this stuff in 2006.
Then there's documentation from 2011,
why they threw it out of the kernel again,
because it's not...
Then we have some papers that kind of try some of the stuff.
Then, you know, there is not really an interface you can use.
So we used SPDK because it supports IOAT.
But then, of course, SPDK is like a beast in itself
that is really hard to use.
It's a mess.
And, like, of course, like, it's my own fault, right?
Because we try to do something that hasn't been connected in that way before.
So we can't expect, like, otherwise it wouldn't be interesting research.
But like I said, the idea is,
you're not meant to use that stuff in the way we used it. The idea is, Intel just built that
stuff for specific workloads, and the idea is that you just buy the CPU and stuff just gets faster
without you doing anything. And that is really great if you are, like, on this happy path. I say,
yeah, I have a NIC and I need 100 gigabit Ethernet, and I just can buy the newest Xeon CPU and it just works.
But it's like this narrow path.
If you stay on this path, it's great.
If you go somewhere else, it just falls apart.
Yeah, yeah, cool.
But I'm convinced that we need to do something here
so we can leverage these primitives.
We can do something interesting in this space.
So tell me more about the data pipes vision
and maybe some of the key principles that
underpin this vision? Yeah, okay, so the underlying issue to us was this kind of bottom-up design.
So those primitives are designed for a specific use case, and if your use case differs, you're kind
of out of luck. So what we want to do is, we want to say data movement should be explicit. So
if you have to care about it anyway, right, like if
you know about cache
associativity and like cache sizes
and
when to move data where anyway,
you might as well do it explicitly
because the interfaces
don't really work and you have to
work around interfaces that don't really work.
You might as well have a nice interface
to use, right? So we want
to make it declarative. So
you say, tell the system
what to move where
and when, but not how.
And to this end,
we introduce two things.
First one is a type system
for the location that the data could be in.
So currently, if you look at the status
quo, more or less everything is just kind of a pointer, right?
So access to some memory location might,
even though the data being pointed to is already in DRAM,
then it just moves the data into caches.
Maybe it's already cached and nothing happens at all.
Maybe like the memory area you point to was like MMAP
and actually is on a slow HDD
and accessing it might just be a page fault
and you have to move all the stuff into DRAM
and then into cache.
Maybe it uses the OS page cache and so on.
So actually you don't really know what's happening
if you just access some data.
So instead what we want to introduce
is what we call resource locators.
And the idea is that everything is behind a resource locator
and this resource locator forces
you to think about it, right? For example,
you would never directly access a byte in
an HDD resource locator, while you
would in a cache resource locator.
And this forces you to think about where your data
actually is. And secondly,
after we've typed our
memory in that way,
we can say, okay, now to move data between resource locators, which is important, right?
Like you can't process data that is being stored on SSD.
We introduce the concept of data pipes.
And data pipes are pipes that kind of connect those locators.
So you can think about a DRAM-to-PMEM pipe that uses IOAT, or a PMEM-to-cache pipe that uses IOAT the other way around with direct cache access.
And the idea is that these pipes then connect between those locators.
And then you have some kind of transmit call or something similar, which actually issues a request to move that data.
And this makes it declarative and explicit. And optimally,
of course, those pipes map onto existing primitives like IOAT or DDIO or whatever,
or maybe even new primitives in the future if vendors introduce new primitives. Of course,
there are no primitives for all combinations of source and sink, right? So you might have a
software fallback for those kinds of things. But even if you use a software fallback, at least you have this notion of, you know, where your data is and where
it's being moved to. And what we also present in the paper is different flavors of pipes. So the idea is
that you declare, but then in the background some runtime has to schedule the movement, when to move
what where. And we have, like, three proposals, where we say
one would be, you have this blocking system,
like traditional I/O.
For example, you would use a read syscall to move the data.
You say pipe.transmit, and then the pipe blocks until it's done.
We also have this approach where you say,
maybe you want to have inversion of control,
where you just say, I want this data moved there,
and please notify me when you're done. Or we even have a proposal where we say maybe the OS could have support for pipes as well.
But in some way, the vision is that we say we have those kind of implicit leaky abstractions,
and we would like them to be like an explicit, intentional interface where you declare what
to move there.
Awesome.
Yeah, I'd just like to touch on what would the syntax look like for this?
Obviously, it's harder in an audio medium to kind of talk about this,
but how would that look like?
Yeah, so
in the paper, we actually have
three different
code listings to actually show you what
the syntax could look like. Of course,
it's hard to describe here, but yeah, the idea
is to use a pipe, what you
would do is, you first would declare your locators, right?
We'd say, okay, kind of a case of external sort, I have data on SSD.
So I declare an SSD locator, which would, as an argument, for example, take like the path to the file, if it's like a file on an SSD.
And then, of course, I say, okay, I want to move that data into the cache.
So I would also have a cache resource locator, which says, okay, I want to move that data to be processed.
It needs to be in the cache.
And this is already assuming a lot, right?
This already assumes that cache is kind of addressable, which is currently not really true.
But nonetheless, this is the vision we have.
And then, of course, you have like those two locators.
And then you just instantiate a pipe.
Okay, I want to have a pipe from locator A being the SSD locator
to locator B being the cache locator
and this just instantiates
a pipe and then you just call some kind
of transfer method that moves the data
from the SSD
to the cache. The idea being, however, that everything up until the
transfer call is just declarative. So we tell the system
where we have
our data, where we want it to be, and how
it should be connected, but we don't
tell the system, like we don't
issue calls to transfer
the data before we actually need it.
So the idea is that the system already knows
of your intentions before you actually
move the data, so it can do optimizations
there as well.
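Since it's hard to convey syntax over audio, here is a rough sketch of what that could look like (hypothetical code in the spirit of the paper's listings; the type names, the file path, and the transmit method are made up for illustration): declare the two locators, instantiate a pipe between them, and only then issue the transfer. The blocking flavor is shown; the callback/notify flavor mentioned above is noted in a comment.

```cpp
#include <cstddef>
#include <string>

// Hypothetical data-pipes interface, for illustration only.
struct SsdLocator   { std::string path; };   // data sitting in a file on an SSD
struct CacheLocator { std::size_t bytes; };  // a cache-resident staging area (assumes an addressable cache)

template <typename Src, typename Dst>
struct Pipe {
    Src src;
    Dst dst;
    // Declarative up to here: the runtime knows our intent but nothing has moved yet.
    void transmit(std::size_t offset, std::size_t len) {
        // Blocking flavor; a notify-me-when-done flavor would be the alternative.
        // A real runtime would map this onto DDIO/IOAT or a software fallback.
        (void)offset; (void)len;
    }
};

void sort_chunk_from_ssd() {
    SsdLocator   ssd{"/data/unsorted.bin"};  // hypothetical path
    CacheLocator cache{1 << 20};             // stage 1 MiB in the cache
    Pipe<SsdLocator, CacheLocator> pipe{ssd, cache};
    pipe.transmit(0, 1 << 20);               // only now is the movement actually issued
    // ... sort the chunk while it is cache-resident ...
}
```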
Nice, cool. I guess I'd just like to touch a little bit as well on what you think the sort of
limitations slash downsides of this vision would be. Obviously, we're adding, like, another
abstraction layer, so maybe there might be some potential performance implications there, with
another abstraction there. Um, yeah, so maybe you could elaborate on that for us a little bit.
Yeah, so first of all, um, regarding performance, yeah, you're right, it would be an
abstraction that would have a performance impact. So one thing is, really, for us, if
the performance stays the same, this is already a win. So of course, performance is important, but
I'd say for us,
it's more important
to kind of get this vision
where you'd be intentional
about this stuff.
And even if the performance
of like transferring data
would not be faster,
the whole thing about it
being intentional,
and you only transferring
what you actually need to transfer,
already probably impacts
the overall performance
of the system again.
So yeah, but you're right.
This is one possible downside.
There are downsides.
So another downside is it's not really applicable
if you don't know where to move data beforehand.
So in, like, these big data-intensive workloads,
that's easy to know.
But if you, say, have a system
that has a lot of transactions,
transactional database systems,
you might have a lot of erratic random reads or writes
where you really don't know beforehand
because you don't know when the user will start a new transaction.
In that case, pipes are not really a great fit, I'd say.
So if moving data around is not your problem,
pipes might not be the fit for your problem.
And of course, pipes kind of want to use the primitives,
optimized primitives, right?
So if we already have primitives, pipes are great.
If we don't have primitives for some kind of device pair,
we would like to have a primitive for that pair.
But, you know, it's asking a lot to go to Intel
and tell them that right now, like, we have this vision,
like, please, please build this thing.
So we don't think that's, like, a way we can go.
So, um, yeah, it's kind of dependent on there being primitives to use. So we think,
with the primitives that are already there, it's already a good thing, and like I said, you can have
software fallbacks, and even if you just use software, we think that's a good abstraction to
have. But, um, optimally, we would have more primitives there as well. Awesome. You never know, right? It could become so popular,
hopefully, that you actually then get this kind of feedback loop where Intel is actually motivated to kind of
almost fall in line a little bit, or kind of help out in that sense. So maybe. Yeah, I guess,
but I think the issue here is that, with Intel, really, it's just a mismatch of what the goal is.
Because for us, it's like, you know,
like we are experts in the system
and we want to use it to the fullest of its potential.
And, you know, in the database community,
we think about that stuff, right?
Like, how can you optimize for caching and so on?
But for Intel, like, they want to sell to their customers,
like, with this big legacy application:
just use our chip, and we
build this custom kind of
accelerator for exactly your use case,
and it will get faster, and you don't have to do anything.
And
this is a big selling point for those people,
because they don't build new systems. They
maintain big legacy systems.
So I think this is kind of the crux here,
that there are different kinds of goals to optimize for.
Cool. Um, okay, so where do we go next with data pipes, and what's next on the research agenda?
How do you go about realizing this vision? Well, one thing Intel already announced,
I think it's already released with the Sapphire Rapids platform,
is the Data Streaming Accelerator,
which already tries to unify all this stuff a little bit.
So we would like to, of course, look at that.
We really didn't have the hardware, or the time,
at the time of the data pipes paper to do that.
And of course, there are lots of open questions to tackle.
It's no secret, this is mostly a vision paper here.
So we have some code to prove that stuff,
but it's not an implementation
you can just use in your code.
So I think a big open question is,
for example,
how could we schedule the data movement?
Currently we say
there's probably some kind of runtime
you tell it to transmit data
through this pipe from A to B,
but how does this runtime actually look, right? Um, is it, like, does it just run on an additional core? Is it just some kind
of library in the background? Maybe would it be an OS feature, where your operating system would
support data pipes as, like, a native thing, where you can then just, like, fopen /dev/ssd_pipe or something like this.
Or even thinking further,
we thought a lot about like cloud context, right?
Like, for example, in the cloud,
you have big issues with like noisy neighbors, right?
If you have like two people running on the same hardware
and one, like, tries to do a data-intensive workload,
it might steal resources from the other one. And for that reason, they have to over-provision a lot, right,
so that if something like this happens, they can kind of buffer it. But if you add data pipes, of
course, in the cloud context, your system would know about your intention to move data and
could schedule it more efficiently. So in the cloud context, we think it could reduce a lot of issues of noisy
neighbors, and therefore you don't need to over-provision as much, which makes it really
enticing for cloud vendors, in our opinion. So yeah, cloud context would be another thing we
would like to look into. Cool. Awesome. Yeah. So I mean, for my next question, obviously,
with this being a vision paper, there's not necessarily a tool that a software developer
can go away and use today. But how do you think data engineers and database administrators can leverage
the findings in your research, and, kind of, maybe longer term, what impact do you think it could
potentially have? I think, yeah, like you said, it's a vision paper, so you can't just take the
implementation and make stuff faster right now but But I think the biggest takeaway should be that the paper should inspire people to be more intentional about the data movement and think about what's actually happening below the stack.
Because we're like building abstractions on top of abstractions on top of abstractions, right?
Like nowadays, like I said before, if you access a pointer, you have no idea what's actually happening.
Like, of course you can find out,
but in general, and then people are happy about that
because it's easy.
But I think if we throw away a lot of those abstractions
and re-engineer them in a way
to be like more close to the hardware,
like, for example, data pipes could be a
pretty thin wrapper around, like, those primitives
we talked about earlier.
You could get an interface that's
not a lot harder to use than what we have, but could give you a lot more benefits, performance
benefits. So I think the takeaway here is: think about how data is being moved in your system.
If you think about, for example, Postgres or something, like, database systems, they were engineered like 30 years ago.
And the whole thing is that they say, right, HDDs are slow.
We don't need to care to optimize a lot of other stuff in the beginning because we are IO bound anyway.
And I don't fault them for it, right?
Like this is how it has been. But if you throw away this assumption nowadays
of the hardware you have and the accelerators you have,
you might have engineered the system completely differently.
Just thinking about that, I think, could bring some benefits.
Yeah, for sure.
I was just going to say that the general awareness of this
is obviously, I think, in itself has potential for big impact.
So, yeah.
Cool.
Whilst you're working on this,
obviously you've kind of touched on loads of different things.
You've kind of gone deep into the weeds
and seen loads of different sorts of primitives
and different sorts of pieces of hardware and whatnot.
So if you can kind of capture,
what was the most sort of interesting thing
you kind of learned while working on this paper?
Maybe the thing that kind of caught you off guard as well.
I think the biggest thing probably was, like, the difference between theory and the real world. So, um, it's
not my first paper, right? So in my previous papers, in the beginning I said, like, I would like to do
this, and then I mostly achieved that. And of course, there were, like, setbacks and roadblocks and, like,
detours and bumps on the way, like, otherwise it wouldn't be an interesting research paper. But in
the end, I more or less did what I intended to do. And here, the story of the paper more or less
completely changed halfway through, as we didn't really find a way to achieve our original
goal. Because our original goal was actually to build this merge sort thing and say, like,
see, like, you can build a really fast merge
sort. And then I started implementing it,
and I found, like we talked about earlier,
lacking documentation,
interfaces that didn't really work,
you had to use opinionated frameworks like
SPDK, and so on.
Like, it got really messy. It turned out,
let's not write a paper about this merge sort,
write a paper about, like, how messy it is
and how you could maybe do it better.
And it turned out that, like,
embracing those difficulties
and making them into a story in the end
really worked out great.
And I think the paper really got better because of it
because I'm pretty proud now
that our paper solves a problem that I know exists, because I
encountered it while trying to write a different paper.
So,
so yeah,
I think the biggest lesson here is to like embrace failure,
I guess.
Like,
I was not happy when it didn't work, and, oh no,
like, sleepless nights,
right?
Like the whole paper falls apart,
but then it turns out actually it made the paper better in the end, I guess.
Yeah, that's good.
I mean, I normally ask kind of about the origin story
and the background of the paper
and how bumpy that journey was
from the kind of initial conception of the idea
to the actual end paper.
But it seems like the whole thing changed on you
halfway through, which was, I guess, unpleasant.
But in the end, it worked out for the best, right?
Yeah.
And like I said, the thing that kind of held the paper together,
however, from the beginning to end was that you had this idea, right?
We have those well-behaved algorithms, and they are about data movement.
And how can we make this work?
And I think this was like the core of the
story from beginning to end. So I think this helped us, that we said, like, yeah, we have this problem,
and maybe we approach another aspect of that problem, but we still try to solve this problem
of how can we have those well-behaved workloads, where we know what data moves when to where,
and how can we make them better. Sure. Just out of interest, I mean, obviously a lot of this is kind of
building off this assumption of having that well-behaved, um, algorithm, just so we can kind of
control, I guess, the state space of things that can kind of go on. But how do you think it would
perform on certain algorithms that may be a bit more unpredictable? Yeah, so this is an idea
we had, like, halfway through the paper, I think: that in actuality, if you think about data structures, take the LSM tree, an append-only data structure where you try to append new data, and
in the background try to merge the runs
to make sequential reads
and writes all the time, because they're optimized for
disks, where sequential reads and writes are king.
And in the end,
for example, if you use some
key-value store with an LSM tree backing it,
this workload is not well-behaved.
You have random reads and writes coming
in all the time, but the LSM tree kind of forces
your erratic workload into a well-behaved one,
by being append-only
and then doing sequential writes.
So I think you can make most
not well-behaved workloads
into well-behaved workloads
by thinking about the right data structure.
And yeah, we think about data structures here
as these, like, workload transformers, which
try to transform workloads into something that then again could benefit.
Like, an LSM tree could benefit from a data pipe, because it's really predictable
then.
But of course, you have to build stuff on top that's not covered by the paper, right?
You need the right data structures to make that work.
Yeah.
Yeah.
I guess also, as well,
what sort of other research are you working on at the moment? I mean, you've mentioned before that
this isn't your first paper, right? Fifth-year PhD student, you've been through this process many
times. So, kind of, what other research are you working on at the moment, or have been in the past?
Yeah, so my first big thing was actually analytical query processing.
And there, actually, price matters a lot.
Like, you have these big cloud databases where you have to read a lot of data, and hardware becomes a commodity, and you need to be cheap.
And there I built Mosaic, which was a storage engine,
which can fetch the data for you.
It's part of the database
system, but it also can recommend to you what hardware you should buy to maximize your performance for
a given budget, right? So the idea was that it says, right, like, 80% of your data you don't read
anyway, so you might as well put it on the cheapest storage possible. And then it draws you, like, this
nice Pareto curve, so it's like, if you increase your budget by 10%, you could
increase your performance by 30%.
On the other hand, if
you are low on budget,
with 80% of the budget
you could still have like 95% of
the performance, or something like this.
I published this at VLDB
three years ago, and then I said,
okay, enough of analytical, let's do
transactional.
And then I built Plush,
which is a persistent hash table for persistent storage.
And it kind of tries to be an LSM tree at the same time as well.
So the idea here is that persistent memory
has a really low write latency,
like insanely low.
So it's comparable to DRAM.
And at the same time being persistent. So we thought
why not leverage this to have
a data structure that can cope
with a lot of inserts.
And so we take the best of
LSM trees and apply it to
hash tables and
let this work on persistent memory.
Unfortunately, Intel now killed persistent
memory, which I'm still very angry about
because I think it's such great hardware.
It has such low load and write latency.
It won't be reached in the next decade
by anything else, I guess.
But I guess it just wasn't profitable enough for them.
So it turns out they killed it.
Yeah, so those were the big other two papers
I wrote in the past.
The second one being at VLDB '22.
But now, yeah, we're thinking about some follow-up for data pipes,
but I probably won't be the primary author for that
because I'm currently in the process of finishing my dissertation.
Cool. Yeah, the next question, I like to ask this to all my guests,
and it's really interesting to see how the responses diverge.
It's about kind of the creative process of generating ideas and then selecting which ones to work on. So I'd like to
kind of get your take on how you approach this. Yeah, so I think I never really had a structured
approach there. So what I did is, I said, like, let's do whatever sounds fun and interesting at
the moment, for different reasons.
So I had this discussion, I think, after my first big paper with my supervisor and some senior lab members.
I said, right, I did this Mosaic thing now, right?
It's on VLDB, and it's great, and I like it.
Should I now look into different aspects of that?
And how can we do it in the cloud?
How can we do it faster?
And they said, of course, you can do that, but that sounds pretty incremental, and also you will never have the
chance in your life to be as self-guided again as you are now as a PhD student, so just do what is
fun, right? Because you have the opportunity now. So I thought, yeah, okay, like, um, PMEM sounds
pretty interesting, um, new stuff from Intel, upcoming technology.
I didn't know that it would be killed a year later,
but still, and I did a little analysis
to my transactional stuff.
And so I came to that,
and I'm very grateful for my supervisor, of course,
because he allows me that freedom, right?
He's just like, yeah, as long as you do some interesting stuff,
it's fine. So really, yeah, my way to do it was just, like, do what seems fun within, like, the
confines of, like, the general topic that has to be done. Because, yeah, I try to enjoy my PhD and
do what sounds interesting to me. Yeah, as a guidance or a principle,
if it's fun and interesting, right,
then it naturally makes working on something more enjoyable
and therefore maybe you generate more ideas
based on the fact that you're enjoying it.
Also, my supervisor once taught me the issue is, right,
you can do a lot of follow-up stuff on stuff you already did,
but as soon as you exactly know
what the path will be, it's by
definition not an interesting research
topic, but because, you know,
if there isn't a possibility to
fail, it's probably not something very
new or novel, so
do something out there, right?
Maybe it fails, but as we've seen with the data
pipes paper, it still turns out
interesting.
If you just write 10,000 lines of code and then it will just work,
it might also be a nice paper, but
I think it's definitely not as interesting
as something that's totally out there
and might as well just not be worth it.
Yeah, I think you hit a nice point there.
The fact that something doesn't work in
itself can often be interesting.
It doesn't have to be, when you start out, that the end goal is this perfect, amazing,
super fantastic system or whatever, right? Like, the fact that you tried something and it failed is in
itself an interesting result a lot of the time, right? And, yeah, but maybe it's harder to publish that
sort of stuff, right? Yeah, unfortunately, it's really hard to publish negative findings, I
think, still, because, like, there's the stigma of, why should I care? I think it would be a lot better as a research community if we encouraged that more. Yeah, I agree
with you there, Lukas, yeah, for sure. Cool. Um, I guess I'm just going to follow on from that a little bit: what do
you think is the biggest challenge in, uh, database research at the moment? Well, I think I have two answers. So the first one would be, let's say, outward facing.
So I think the issue is to get people outside of our community to see how great database systems are.
So I talked a lot with, like, the bioinformatics people at our university, and also the machine learning people.
And, you know, they do a lot of stuff outside the database system, right? They just use the database system as, like, the store of data,
and then they do all their stuff in Python, and then they, like, try to reinvent joins and
so on, and, like, do all of this on the application level. And, um, of course, we can now, like, point at
them and say, yeah, of course, you don't know how it's supposed to be done. But I think it's a failure on us
that we as a community didn't get those people on board
to build the tools into our databases for them to use.
So I think we should invest a lot more
into the tooling to make it easier
for such people to use our systems
and show the advantages.
For example, if you look at DuckDB,
they just built a really easy to use database system. It's like two lines of code and it works,
and it just can replace whatever else people use beforehand just out of the box. And it has great
adoption because of that. And I think we totally missed the goal there in the past. So I think that's a direction we could go in more.
And then I think inward facing,
so I think that's like the TU Munich thing,
where we say people leave a lot of performance on the table.
So like we talked about earlier,
we build abstractions on top of abstractions.
We build like Spark clusters with lots of instances and lots of nodes.
And I think if we really carefully engineer the system, we can do a lot on a single node. And, um, yeah, it's, like, really
hard work to do, but I think, especially nowadays, where, like, performance doesn't scale as nicely,
like, the improvements slow down over the years, um, it is worth a lot if we maybe refocus a little bit
and try to get the most out of the hardware
we actually have standing around,
which is mostly idling.
Yeah, I totally agree.
Two really interesting challenges facing us there.
When you were talking about the outward-facing one,
the first thing that came to mind was DuckDB as well,
and they've positioned themselves perfectly
to solve that sort of usability problem and get,
like, data
scientists to use
databases. Because I've
experienced that too: people in
bioinformatics and other sorts of areas,
they don't want to touch a
database, because a lot of the time it's
hard to install and hard to operate.
They're just like, oh no, I'll just reinvent
the wheel myself in my own hacky way. But no, yeah, so there's that. And also, again, on the,
uh, inward-facing kind of direction as well, I feel like a lot of people historically have been
like, yeah, we want to make this distributed, we want to get as many nodes and as much compute as
possible, and that doesn't necessarily give you the best outcome, I don't think. So, like you say,
we leave a lot of performance on the table, as we have done in the past. So, yeah, interesting stuff. Um, I
guess it's time for the last question now. So what's the one thing you want the listener to
take away from this, from your work and from this podcast today? Yeah, so I would say the main
thing is, um, we shouldn't try to hide the complexity of what is happening below us on the stack.
So we should be aware of it.
And of course, not everybody can manage all the complexity and not everybody should.
So I think we should have nice interfaces helping us to deal with that complexity.
And we should think about them. And I
think we all agree, if CPUs were invented last year with exactly the performance they have now, people
probably would have chosen a lot of different abstractions for this stuff, because a lot of stuff has just grown
over the years, because it was a good idea at the time. So I think we should take care of that and think about how we could re-engineer stuff
to better fit the current landscape.
And so I would say,
if you're concerned at all with performance,
think of what's happening
below you in the stack,
and how you could better speak with that part.
That would be the big takeaway for me.
Great stuff.
Well, let's end it there.
Thanks so much, Lukas, for coming on the show.
It's been a fascinating conversation and best of luck with the write up.
I hope that,
um,
goes smoothly
and you hit the Q2 deadline.
Um,
great stuff.
Um,
if the listeners are interested in knowing more about Lukas's work,
we'll put links to,
um,
all the relevant materials in the show notes.
And if you enjoy the podcast,
please consider supporting the show through Buy Me a Coffee.
And we will see you all next time for some more awesome computer science research.