Disseminate: The Computer Science Research Podcast - Lukas Vogel | Data Pipes: Declarative Control over Data Movement | #28

Episode Date: March 28, 2023

Summary: Today’s storage landscape offers a deep and heterogeneous stack of technologies that promises to meet even the most demanding data intensive workload needs. The diversity of technologies, however, presents a challenge. Parts of it are not controlled directly by the application, e.g., the cache layers, and the parts that are controlled often require the programmer to deal with very different transfer mechanisms, such as disk and network APIs. Combining these different abstractions properly requires great skill, and even so, expert-written programs can lead to sub-optimal utilization of the storage stack and present performance unpredictability. In this episode, Lukas Vogel tells us how we can combat these issues with a new programming abstraction called Data Pipes. Tune in to learn more! Links: Paper | Homepage | Twitter | LinkedIn. Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host Jack Wardby. Today we are joined by Lukas Vogel who will be talking about his CIDR paper, Data Pipes: Declarative Control over Data Movement. Lukas is a PhD student at the Technical University of Munich and his research areas are adaptive storage and non-volatile memory. Welcome to the show, Lukas. Thanks for having me. Great stuff. Can you start off maybe by telling us a little bit more about yourself and how you became interested in researching databases and data management? Yeah, okay. So like you said, I'm Lukas. I'm currently a fifth-year student at the database group at TU Munich.
Starting point is 00:01:05 And I'm now more or less pretty close to submitting my dissertation. As for getting into database research, well, I think I was always interested in programming close to the hardware. Like, think about cache locality, associativity, where that kind of stuff matters, right, where you have to think about SIMD instructions and so on. And I never thought about databases as being in this context, right? But at TU Munich, the database chair is very close to hardware. So during my master's, I attended a lecture where we actually had to build our own database system in C++ from the ground up with Professor Neumann. And this kind of showed me that a database could be very close to hardware. And then I started there and I never regretted it. So that's my path to database research. Amazing. Like trying to build a database from scratch as
Starting point is 00:01:51 a master's sort of project, it's quite a daunting task, right? I bet that was fun. Yeah, really fun. Like, of course we didn't do everything right, but we got it to execute SQL queries, and, uh, a lot of fun, a lot of hard work. But I think it was the lecture I learned the most from, actually, in the whole master's and bachelor's degree, because I actually had to build some stuff. Amazing. Yeah, that sounds great. Cool. So let's talk about the star of the show today, data pipes. So declarative control over data movement. Can you maybe give us the high-level sell for this, kind of the elevator pitch?
Starting point is 00:02:23 Yeah, of course. So I would say most programs that are performance-critical algorithms probably are a lot about data movement as well, right? You have data on some disk or somewhere. You then have to move it to your CPU through DRAM caches and so on, do some stuff with it, then maybe buffer it somewhere, get it back, and all that kind of stuff. So you have a lot of data movement already in algorithms, if you think about it or not.
Starting point is 00:02:48 And of course, hardware manufacturers noted that and they tried to introduce shortcuts for users to use without using the CPU. So the CPU would be free for computation. So that is what brought us DMA, direct memory access and stuff like this, right? The issue, however, is that those shortcuts are often bottom-up, right? So you're not meant as a developer to know about those shortcuts. The hardware thinks about them and tries to activate them whenever it's a good thing to use them. So that movement happens implicitly.
Starting point is 00:03:18 For example, think about using mmap, right? You mmap some memory region, and then your operating system thinks about when to move that stuff actually into your DRAM, and then the CPU thinks about when to move that stuff into cache. But this is hard to use, right? This is all happening implicitly. And if you want to do it better than what the hardware figures out by itself, you have issues, right? And for those reasons, we present the vision of data pipes. The idea is, instead of this
Starting point is 00:03:44 like, bottom-up approach where everything is happening implicitly, we say, why not do data movement top down, right? You as the developer explicitly state which data you expect to be where and when to move it to where you need it, so you can make more efficient use of the resources of the system. That's kind of the abstract elevator pitch, I would say. Amazing, that's really cool. Um, I guess we could maybe, as we dig into sort of how it works, maybe you can start by telling us a little bit more about sort of the modern storage hierarchy, right? There's a lot of acronyms in this space, right? So maybe you can sort of give us an overview, at least what all these things mean, and kind of the primitives we have available today. Yeah, a little bit earlier
Starting point is 00:04:24 we did the high-level sell, but talk about maybe some more of the other primitives we have. Yeah, then actually there were a lot of issues, right, like when we started talking about that stuff in the research group, like lots of acronyms I didn't even know, and like, yeah, really hard, right? So I'll give you the overview of the most important ones, I'd say. So I think the issue nowadays is you just have this classical storage hierarchy, right? You have, like, HDD on the bottom, really slow, then you have your DRAM, and then you have your caches and your registers and so on. And this has been true for, like, 40 years, and applications have been built around that kind of hierarchy. The operating system
Starting point is 00:04:58 expects this hierarchy to be there and so on. But nowadays we have lots of new devices that kind of fit in this hierarchy, but not really. So for example, we have NVMe SSDs, which are a lot faster compared to the old SATA SSDs, also more expensive, but mostly worth it, right? Um, but they are attached over PCI Express, for example. Then, nowadays, although it's been killed now, but like a year ago, we had Intel Optane persistent memory, which is like DRAM, right? You address it via load and store instructions from the CPU, but it's persistent, right? You can shut the system off and you still have your memory there. The downside of it, it's slower than DRAM.
Starting point is 00:05:35 So it's not just a drop-in replacement. It's like slower by a factor of two or three or something. Then of course, we also have network, right? So we can have everything that is attached locally. We can also attach it remotely via some network interconnect via RDMA. We nowadays even have disaggregated storage over like CXL,
Starting point is 00:05:52 which also works over PCI Express. So yeah, this pyramid is like, there are lots of weird attachments to the side now. So it's really hard to manage. And of course, everything in this pyramid is also accessed differently. So some stuff is accessed quite easily, like, for example, SSDs over NVMe.
Starting point is 00:06:11 That's just a protocol you can use or your operating system can use. But then we have kind of those leaky abstractions. For example, if you think about the cache, historically we were never really meant to know about the cache, right? It's there to be transparent, so you just access some stuff and then it will be moved into the cache by the CPU and will be flushed when we don't need it anymore. But nowadays, for example, with persistent memory, we kind of need to have control over the cache. So we have instructions like clflush, which flushes cache lines back to DRAM, we have cldemote, which demotes stuff from L2 cache to L3 cache. We have the prefetcher that
Starting point is 00:06:46 prefetches stuff into the cache. And if you build a modern performance-critical application, you have to know about that stuff, but you're not really meant to know about this, actually, by history, right? Intel doesn't want you to control the cache as much. And then, of course, we have even stranger shortcuts.
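For readers who want to see what this kind of explicit cache control looks like in code, here is a minimal sketch using x86 compiler intrinsics. It is not from the paper; it assumes an x86-64 CPU and a compiler with CLFLUSHOPT and CLDEMOTE support (e.g., GCC or Clang with -mclflushopt -mcldemote), and simply illustrates the prefetch, demote, and flush hints mentioned above.

```cpp
// Minimal sketch of explicit cache control via intrinsics (illustrative only).
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;

// Hint that a buffer will be needed soon: pull it towards the caches.
void prefetch_range(const std::uint8_t* buf, std::size_t len) {
    for (std::size_t off = 0; off < len; off += kCacheLine)
        _mm_prefetch(reinterpret_cast<const char*>(buf + off), _MM_HINT_T0);
}

// Declare that a sorted run is no longer needed close to the core:
// either demote it towards L3 or flush it back to memory entirely.
void release_range(const std::uint8_t* buf, std::size_t len, bool flush) {
    for (std::size_t off = 0; off < len; off += kCacheLine) {
        if (flush)
            _mm_clflushopt(const_cast<std::uint8_t*>(buf + off)); // write back + invalidate
        else
            _mm_cldemote(buf + off);                              // demote to a farther cache level
    }
    _mm_sfence(); // order the flushes before later stores
}
```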
Starting point is 00:07:02 So in the paper, we have two. We call them DDIO and IOAT. So the idea of both: I think IOAT was introduced in 2006 by Intel, DDIO in 2012. The idea was, if you have, like, networking, and you have your network card and you have to move the packets to the CPU, it's too slow to do it with the CPU. So with IOAT, you have a DMA unit that directly moves that data into memory for you without the CPU's involvement. And we found out you can kind of misuse this
Starting point is 00:07:32 to also move data between persistent memory and DRAM. I think IOAT was invented in 2006, or introduced, and PMEM was introduced like three years ago. So it was never meant to be used that way, but you can kind of use it that way. A happy coincidence. Yeah, I think nobody even at Intel knew about that. It just happens to work that way,
Starting point is 00:07:53 which is really great for us, but there are issues, I think we'll come to them later as well. And the other thing is DDIO. This was also meant for networking, I think introduced in 2012 for 100 gigabit Ethernet. And there the idea was that you could directly move stuff from the network card into your L3 cache. Because if you moved it to DRAM, it would be too slow. Because then when you access it, you would have a cache miss and have to load it.
Starting point is 00:08:17 So to process the packets at line speed, you can directly DMA them into the cache. And it turns out also, not meant by Intel that way, I think, that you don't only have to use network cards for that. It works over PCIe, so you can just as well use NVMe SSDs to move data from your SSD directly into your cache. So we thought, great stuff, right? Why not use this to have more efficient data paths where we don't have to involve the CPU at all, because everything here works with DMA. But the issue with this zoo of primitives is that you have a lot of different abstraction levels here. Like, is it managed by the system? Is there
Starting point is 00:08:57 an interface you can use? Some have an interface, but you're not really meant to use it. Different philosophies, and originally designed for different tasks. Like I said, we tried to use it for stuff it wasn't meant to be used for. Yeah, so that's kind of a coarse overview of what you have nowadays. Awesome stuff. Um, so something we can take out of that is that you can kind of use these primitives, like DDIO, to improve things, right? So you actually have a really nice case study in your paper that uses external sort to illustrate the potential of these primitives, if we can kind of, not misuse, but use them in certain ways to improve how we move data around, right?
Starting point is 00:09:33 but use them in certain ways to improve how we move data around, right? So can you maybe walk us through this example and illustrate why, how it can improve that? So external sort, like we said, we want to speed up cases where your data movement is kind of predictable. And I felt like there's not a better example than external sort, right?
Starting point is 00:09:51 So you start with your data on some kind of background storage, let's say an SSD. Um, then you have to sort initial runs, right? Like, you move small packets of data into your caches, then sort them and move them back to DRAM. So that's predictable movement, right? You know the size of the stuff you move in, you know the size you have to move out of your cache again.
Starting point is 00:10:17 Then when your main memory is full, you kind of have to spill those sorted runs from DRAM to some kind of background storage. That's the point of external sort, if it doesn't fit into DRAM. Then, of course, later on, you have to load them back into the system. Here again, you can then move them directly into the cache, because you have to merge the sorted runs in the cache again with your CPU. And then you have to write them back: like you mentioned, they go through DRAM, and then from DRAM you write the merged runs back
Starting point is 00:10:45 to the output storage, like, at the end. So the idea is that those are really predictable movements, right? You know beforehand when to move what where. And also, we found out there exist a lot of data movement primitives for that stuff, right? So for the first part, where you load the unsorted runs into cache, you can use DDIO to move that directly from SSD to cache. Then, of course, from the cache to DRAM, you can explicitly flush them with a clflush instruction from your CPU, because you know you won't need that piece of run again after you've sorted it. Then from DRAM to PMEM, we can use IOAT to move from one kind of memory to the other kind of memory, unbeknownst
Starting point is 00:11:26 to Intel, who didn't invent it to be used that way. And then later on, when we have those sorted runs on PMEM, we can move them back into cache with IOAT as well and then move them back to SSD again. So we thought, for every kind of movement, there exists
Starting point is 00:11:42 a nice primitive that doesn't involve the CPU at all, so the CPU can be busy doing sorting or other database stuff in the background. But we do the data movement for the CPU. So we thought, why not use external sort as a motivating example and for the paper and show how you could profit here. Fantastic. And that's a nice segue into the next question is like you actually quantify the performance gains you can get by the strategic use of these of these primitives so can you tell us a little bit about how you actually went about quantifying like kind of tell us about your experimental setup and the questions you were trying to answer obviously you're trying to see how fast it went right but
Starting point is 00:12:18 Fastness was not really that important to us. So we thought we had kind of two goals. So of course we benchmark DDIO and IOAT, right? So moving data to and from SSD and to and from PMEM.
Starting point is 00:12:34 And the idea here was for one, of course, are there actually performance benefits of doing it that way? Right? Like if, if the old fashioned way of just using the CPU to move data around is as fast or faster, why, why do it that way?
Starting point is 00:12:45 So the idea was it should be at least the same speed while offloading computation. So you don't need the CPU to do that stuff. So like if that weren't true, like why would we care? Right. But the second point, I think, and that's equally as important, is the question: are they actually usable in the way we wanted to use them? Right. Up until now, it's just like, wouldn't it be nice if we could do that? Is it actually possible to do it? It was not really, like you said, nobody used it that way, I think. There are some papers that did kind of that stuff, but like I said, DDIO was meant for NICs and IOAT as well.
Starting point is 00:13:14 There are some papers that did kind of that stuff, but like I said, DDIO was meant for Nix and IoT as well. So we designed two experiments. So for DDIO, we said, okay, let's assume we want to have this case where we load unsorted data in chunks from your disk, from SSD into the cache and sort them. So we simulated them by moving data with varying chunk size from the SSD directly to the L3 cache with DDIO enabled and disabled to compare. And then we just iterated over the data and memory to assure that it's actually touched, right?
Starting point is 00:13:51 So you actually need to load it to the registers. And at the same time, we also ran a bandwidth-intensive workload on the side to really stress the DRAM bandwidth, to make sure we actually could see if we have an advantage of directly loading into the cache or not. So this simulates, like, the system at the same time doing something completely different as well, right? We're not the only tenant on the system. Okay, so that was
Starting point is 00:14:15 DDIO. And for IOAT, right, the point we wanted to make is when we have the data sorted on DRAM in runs, we now have to evict them into backing storage, in this case, simulated by PMEM, and then back again later on when we have to merge those sorted runs again. So in this case, we also had those runs in varying chunk sizes and just moved them back and forth, once with IOAT enabled.
Starting point is 00:14:40 And the other case where we disabled it, we just used memcpy, right? So this is the way you would move stuff from or to PMEM if you don't have some kind of DMA unit that can do that. And those were the two experiments we tried to run. Awesome, great. So let's talk numbers then. So what were the key results of each of the experiments? What were the gains, and was it more usable as well? Obviously, there was that angle to the experiments as well. What were your findings there?
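To make the baseline side of that setup concrete, here is a rough sketch of what the "disabled" case amounts to: copying runs chunk by chunk with memcpy and measuring throughput. The buffer and chunk sizes are made up for illustration; in the actual experiment the destination would be a PMEM-mapped region rather than ordinary DRAM, and the IOAT case would issue the same copies through a DMA engine (e.g., via SPDK) instead of the CPU.

```cpp
// Illustrative memcpy baseline for a chunked DRAM->"PMEM" copy benchmark.
// Sizes are arbitrary; a real run would map persistent memory as the target.
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    constexpr std::size_t kRunBytes   = 1ull << 30;  // 1 GiB of sorted runs
    constexpr std::size_t kChunkBytes = 1ull << 20;  // copy in 1 MiB chunks

    std::vector<char> src(kRunBytes, 'a');  // stands in for sorted runs in DRAM
    std::vector<char> dst(kRunBytes);       // stands in for the spill target

    auto start = std::chrono::steady_clock::now();
    for (std::size_t off = 0; off < kRunBytes; off += kChunkBytes)
        std::memcpy(dst.data() + off, src.data() + off, kChunkBytes);
    auto stop = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(stop - start).count();
    std::printf("memcpy baseline: %.2f GiB/s\n",
                (kRunBytes / (1024.0 * 1024.0 * 1024.0)) / secs);
}
```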
Starting point is 00:15:02 Yeah, so for DDIO, we were surprised that it actually was usable, right? We thought, you know, moving stuff into L3 cache from an SSD, you know, DRAM is already so much faster than SSD, it shouldn't really matter if you move it into DRAM or cache, because the time it takes to load something from DRAM to cache is not that big compared to, like, you only have a throughput of like three gigabytes a second or something. But it turned out, if you really stress the system, it makes a huge difference, as long as the chunks you load fit into the L3 cache. Like, I think we had improvements from, like,
Starting point is 00:15:42 one gigabyte a second to, like, 1.4 gigabytes a second or something like this, and of course also reduced latency. So DDIO actually is a good thing here, right, even if you think about really slow SSDs. The downside, however, and this was the second part of the experiment, the usability, is really bad, because DDIO is not something Intel wants you to mess with. So it's either globally enabled for a device or disabled. And there are some configuration parameters, but they're not documented at all. They are like some proprietary registers in the CPU. You can kind of change some values
Starting point is 00:16:18 that nobody really knows what they do, but they make it faster or slower. And I actually looked at the Intel documentation, and they, like, agreed it would be nice to have some kind of parameters to change this, but this is not documented and not supported at the moment, so stay tuned. And I think this documentation hasn't been updated in the last 10 years. And, you know, it makes sense for Intel, because they say, right, we built this for this one use case and it just works out of the box and it makes it faster, so why not just use it that way?
Starting point is 00:16:45 But it kind of makes it bad if you, like me, try to use it for something it wasn't intended to be used for. So much about DDIO. For IOAT, of course, we also did the experiment. And we found out IOAT is really great if you move stuff from PMEM to DRAM. So we have three times the throughput. So nine gigabytes a second instead of three gigabytes a second.
Starting point is 00:17:08 And on top of that, we don't have any CPU involvement at all. So the CPU is free to do other stuff. So this, of course, is great, because PMEM actually is really intensive on the CPU, because every move would be a load and store instruction, because it's the same interface as DRAM. So great in that regard.
Starting point is 00:17:27 So really awesome. If you want to move stuff from PMEM to DRAM, use IOAT. On the other hand, the issue is if you move stuff from DRAM to PMEM, it's actually really bad. And we were really puzzled by that. Why would it be?
Starting point is 00:17:40 It's fast in one way. It should also be fast the other way. Also, if you move stuff that way, like PMEM has like a really high read and write bandwidth. Like this shouldn't be a bottleneck. So to get behind this, we also measured actually the write traffic that was on the PMEM DIMM stick itself.
Starting point is 00:17:58 PMEM comes in sticks that look like normal DRAM DIMM sticks, and you can measure on the physical memory what the throughput is there. And it turns out it's three times higher there than the data that's actually moved. And we were really puzzled by that. Why is that the case?
Starting point is 00:18:13 And then we found out that IOAT actually has a feature which is called Direct Cache Access, DCA. The idea being that Intel said in 2006, right, if you move some stuff from one memory location to another, you probably also want to do some computation on it. Because otherwise, why would you have moved it?
Starting point is 00:18:41 So for one, it's slower already because we have this detour to the cache. On the other hand, also, the cache is then evicted semi-randomly in the end. So it's not evicted sequentially. And PMEM internally has a block size of 256 bytes. So it only reads data in 256-byte blocks. And if you write randomly to it, that means you have a write amplification, right?
Starting point is 00:19:02 Because you just put some data into one of those blocks, and then write the whole block back if you write random data. In this case, cache line is 64 bytes, but those blocks are 256 bytes in size, so four times the size, and you have big write amplification there. So it turns out, because of this feature, what
Starting point is 00:19:19 was a really great idea in 2006, PMEM now is slow. And the worst thing is, DCA, you can't even disable it in modern CPUs. I think there are some BIOS settings in, like, CPUs that are 15 years old, but nowadays it's just forgotten. So nothing you can do there. So like with DDIO, it's like great in theory,
Starting point is 00:19:38 but in practice, it doesn't really work out really well because hardware designers 20 years ago made assumptions that just don't hold for modern hardware anymore. Yeah, I just wanted to ask you about when you were trying to figure out how to use these and you're looking in the docs,
Starting point is 00:19:53 how easy was that to kind of figure out, oh, I need to change this number to make things go faster? It was horrible. Yeah, so there's a mix of documentation from 2006 of some Linux kernel maintainers that built this stuff in 2006. Then there's documentation from 2011, why they threw it out of the kernel again,
Starting point is 00:20:14 because it's not... Then we have some papers that kind of try some of the stuff. Then, you know, there is not really an interface you can use. So we used SPDK because it supports IOAT. But then, of course, SPDK is like a beast of itself, that is really hard to be used. It's a mess. And, like, of course, like, it's my own fault, right?
Starting point is 00:20:35 Because we try to do something that hasn't been connected in that way before. So we can't expect, like, otherwise it wouldn't be interesting research. But like I said, the idea is, you're not meant to use that stuff in the way we used it. The idea is Intel just built that stuff for specific workloads, and the idea is that you just buy the CPU and stuff just gets faster without you doing anything. And that is really great if you are on this happy path, where I say, yeah, I have a NIC and I need 100 gigabit Ethernet, and I just can buy the newest Xeon CPU and it just works. But it's like this narrow path.
Starting point is 00:21:08 If you stay on this path, it's great. If you stray somewhere else, it just falls apart. Yeah, yeah, cool. But I'm convinced that we need to do something here so we can leverage these primitives. We can do something interesting in this space. So tell me more about the data pipes vision and maybe some of the key principles that
Starting point is 00:21:25 underpin this vision. Yeah, okay. So the underlying issue to us was this kind of bottom-up design. So those primitives are designed for a specific use case, and if your use case differs, you're kind of out of luck. And what we want to do is, we want to say data movement should be explicit. So if you have to care about it anyway, right, like if you know about cache associativity and, like, cache sizes and when to move data where anyway,
Starting point is 00:21:54 you might as well do it explicitly because the interfaces don't really work and you have to work around interfaces that don't really work. You might as well have a nice interface and have to use it, right? So we want to make it declarative. So you say, tell the system
Starting point is 00:22:10 what to move where and when, but not how. And to this end, we introduce two things. First one is a type system for the location that the data could be in. So currently, if you look at the status quo, more or less everything is just kind of a pointer, right?
Starting point is 00:22:27 So when you access some memory location, maybe the data being pointed to is already in DRAM, and then it just moves the data into caches. Maybe it's already cached and nothing happens at all. Maybe, like, the memory area you point to was mmap'ed and actually is on a slow HDD, and accessing it might just be a page fault, and you have to move all the stuff into DRAM
Starting point is 00:22:50 and then into cache. Maybe it uses the OS page cache and so on. So actually you don't really know what's happening if you just access some data. So instead what we want to introduce is what we call resource locators. And the idea is that everything is behind a resource locator and this resource locator forces
Starting point is 00:23:08 you to think about it, right? For example, you would never directly access a byte in an HDD resource locator, while you would in a cache resource locator. And this forces you to think about where your data actually is. And secondly, after we've typed our memory in that way,
Starting point is 00:23:23 we can say, okay, now we want to move data between resource locators, which is important, right? Like, you can't process data that is being stored on SSD. We introduce the concept of data pipes. And data pipes are pipes that kind of connect those locators. So you can think about a DRAM to PMEM pipe that uses IOAT, or a PMEM to cache pipe that uses IOAT the other way around with direct cache access. And the idea is that these pipes then connect between those locators. And then you have some kind of transmit call or something similar, which actually issues a request to move that data. And this makes it declarative and explicit. And optimally,
Starting point is 00:24:06 of course, those pipes map onto existing primitives like IOAT or DDIO or whatever, or maybe even new primitives in the future if vendors introduce new primitives. Of course, there are no primitives for all combinations of source and sink, right? So you might have a software fallback for those kinds of things. But even if you use a software fallback at least you have this notion of you know where your data is and where it's been moved to and what we also present in the paper is different flavors of pipes so the idea is that you declare but then in the background some runtime has to schedule the movement when to move what where and we have like three proposals where we say, one would be you have this blocking system,
Starting point is 00:24:48 like traditional I/O. For example, you would use a read syscall to move the data. You say pipe.transmit, and then the pipe blocks until it's done. We also have this approach where you say, maybe you want to have inversion of control, where you just say, I want this data moved there, and please notify me when you're done. Or we even have a proposal where we say maybe the OS could have support for pipes as well. But in some way, the vision is that we say we have those kinds of implicit, leaky abstractions, and we would like them to be an explicit, intentional interface where you declare what to move where.
Starting point is 00:25:13 and we would like them to be like an explicit, intentional interface where you declare what to move there. Awesome. Yeah, I'd just like to touch on what would the syntax look like for this? Obviously, it's harder in an audio medium to kind of talk about this, but how would that look like? Yeah, so in the paper, we actually have
Starting point is 00:25:34 three different code listings to actually show you how the syntax could look like. Of course, it's hard to describe here, but yeah, the idea is to use a pipe, what you would do is, you first would declare your locators, right? We'd say, okay, kind of a case of external sort, I have data on SSD. So I declare an SSD locator, which would, as an argument, for example, take like the path to the file, if it's like a file on an SSD.
Starting point is 00:26:00 And then, of course, I say, okay, I want to move that data into the cache. So I would also have a cache resource locator, which says, okay, I want to move that data to be processed. It needs to be in the cache. And this is already assuming a lot, right? This already assumes that cache is kind of addressable, which is currently not really true. But nonetheless, this is the vision we have. And then, of course, you have like those two locators. And then you just instantiate a pipe.
Starting point is 00:26:28 Okay, I want to have a pipe from locator A being the SSD locator to locator B being the cache locator and this just instantiates a pipe and then you just call some kind of transfer method that moves the data from the SSD to the cache. The idea being that however, everything up until the
Starting point is 00:26:43 transfer call is just declarative. So we tell the system where we have our data, where we want it to be, and how it should be connected, but we don't tell the system, like we don't issue calls to transfer the data before we actually need it. So the idea is that the system already knows
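Since syntax is hard to convey in audio, here is a hypothetical sketch of the shape being described: declare the locators, declare the pipe between them, and only then issue the transfer. All names (SsdLocator, CacheLocator, SsdToCachePipe, transfer) are invented stand-ins rather than the paper's actual listings, and the transfer body below is just a plain software fallback that a real runtime would replace with DDIO, IOAT, or another primitive.

```cpp
// Hypothetical sketch of a declarative data pipe; names are illustrative only.
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

struct SsdLocator   { std::string path; };          // data sitting in a file on SSD
struct CacheLocator { std::vector<char> staging; }; // stand-in for a cache-sized buffer

struct SsdToCachePipe {
    SsdLocator&   src;
    CacheLocator& dst;

    // Everything up to here is declarative: nothing has moved yet.
    // transfer() issues the actual movement; a runtime could route it
    // through DDIO instead of this software fallback.
    bool transfer(std::size_t offset, std::size_t length) {
        dst.staging.resize(length);
        std::ifstream in(src.path, std::ios::binary);
        in.seekg(static_cast<std::streamoff>(offset));
        return static_cast<bool>(
            in.read(dst.staging.data(), static_cast<std::streamsize>(length)));
    }
};

int main() {
    SsdLocator   ssd{"/tmp/unsorted.bin"};  // hypothetical input file
    CacheLocator cache;                     // stage one run at a time
    SsdToCachePipe pipe{ssd, cache};

    if (pipe.transfer(/*offset=*/0, /*length=*/1 << 20))
        std::printf("staged %zu bytes for sorting\n", cache.staging.size());
}
```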
Starting point is 00:27:00 Nice, cool. I guess I'd just like to touch on a little bit as well what you think the sort of limitations slash downsides of this vision would be. Obviously we're adding, like, another abstraction layer, so maybe there might be some potential performance implications with adding another abstraction there. Um, yeah, so maybe you could elaborate on that for us a little bit. Yeah, yeah, so first of all, um, regarding performance, yeah, you're right, it would be an
Starting point is 00:27:30 abstraction that would have a performance impact. So one thing is, really, like, for us, if the performance stays the same, this is already a win. So of course performance is important, but I'd say for us, it's more important to kind of get this vision where you'd be intentional about this stuff. And even if the performance
Starting point is 00:27:50 of like transferring data would not be faster, the whole thing about it being intentionally and you only transfer what you actually need to transfer already probably impacts the overall performance
Starting point is 00:28:00 of the system again. So yeah, but you're right. This is one possible downside. There are downsides. So another downside is it's not really applicable if you don't know where to move data beforehand. So in like this big data intensive workloads, that's easy to know.
Starting point is 00:28:15 But like if you say like have a system that has a lot of transactions, transactional database systems, you might have a lot of erratic random reads or writes where you really don't know beforehand because you don't know when the user will start a new transaction. In that case, pipes are not really a great fit, I'd say. So if moving data around is not your problem,
Starting point is 00:28:36 pipes might not be the fit for your problem. And of course, pipes kind of want to use the optimized primitives, right? So if we already have primitives, pipes are great. If we don't have primitives for some kind of device pair, we would like to have a primitive for that pair. But, you know, it's asking a lot to go to Intel and tell them that right now, like, we have this vision,
Starting point is 00:29:00 like, please, please build this thing. So we don't think that's a way we can go. So, um, yeah, it's kind of dependent on there being primitives to use. So we think with the primitives that are already there, it's already a good thing, and like I said, you can have software fallbacks, and even if you just use the software fallback, we think that's a good abstraction to have. But, um, optimally we would have more primitives there as well. Awesome. You never know, right? It could become so popular, hopefully, that there's a kind of feedback loop and Intel is actually motivated to kind of fall in line a little bit, or kind of help out in that sense. So maybe. Yeah, I guess,
Starting point is 00:29:39 but I think the issue here is that with Intel it's just a mismatch of what the goal is. Because for us, it's like, you know, we are experts in the system and we want to use it to the fullest of its potential. And, you know, in the database community, we think about that stuff, right? Like, how can you optimize for caching and so on? But for Intel, like, they want to sell to their customers:
Starting point is 00:30:06 you have this big legacy application, just use our chip, and we build this custom kind of accelerator for exactly your use case, and it will get faster and you don't have to do anything. And this is a big selling point for those people, because they don't build new systems. They
Starting point is 00:30:21 maintain big legacy systems. So I think this is kind of the crux here, that there are different kinds of goals to optimize for. Cool. Um, okay, so where do we go next with data pipes, and what's next on the research agenda? How do you go about realizing this vision? One thing Intel already announced, and I think it's already released with the Sapphire Rapids platform, is the Data Streaming Accelerator, which already tries to unify all this stuff a little bit.
Starting point is 00:30:52 So we would like to, of course, look at that. We really didn't have hardware or the time of the data pipes paper to do that. And of course, there are lots of open questions to tackle. It's not secret, this is mostly a vision paper here. So we have some code to prove that stuff, but it's not an implementation you can just use in your code.
Starting point is 00:31:12 So I think a big open question is, for example, how could we schedule the data movement? Currently we say there's probably some kind of runtime you tell it to transmit data through this pipe from A to B, but how does this runtime actually look right um is it just like does it just run on an additional core it's like a just some kind
Starting point is 00:31:33 of library in the background maybe would it be an os feature that your operating system would support data pipes as like native thing where you can then just like fopen dev slash ssdpipe or something like this. Or even thinking further, we thought a lot about like cloud context, right? Like, for example, in the cloud, you have big issues with like noisy neighbors, right? If you have like two people running on the same hardware and one like tries to do a like a data intensive workload,
Starting point is 00:32:04 it might steal resource from the other one. And that reason they have to over provision a lot right so that if something like this happens they can kind of buffer it but if you add data pipes of course in the cloud context your your system would know about you your intention to move data and could schedule it more efficiently so in the cloud context we think it could reduce a lot of issues of noisy neighbors, and therefore you don't need to over-provision as much, which makes it really enticing for cloud vendors, in our opinion. So yeah, cloud context would be another thing we would like to look into. Cool. Awesome. Yeah. So I mean, for my next question, obviously, with this being a vision paper, there's not necessarily a tool that a software developer
Starting point is 00:32:43 can go away and use today. But how do you think kind of data engineers, database administrators, can leverage the findings in your research, and kind of maybe longer term, what impact do you think it could potentially have? I think, yeah, like you said, it's a vision paper, so you can't just take the implementation and make stuff faster right now. But I think the biggest takeaway should be that the paper should inspire people to be more intentional about the data movement and think about what's actually happening below in the stack. Because we're, like, building abstractions on top of abstractions on top of abstractions, right? Like I said before, if I access a pointer, you have no idea what's actually happening. Like, of course you can find out, but in general, people are happy about that
Starting point is 00:33:29 because it's easy. But I think if we throw away a lot of those abstractions and re-engineer them in a way to be like more close to the hardware, like for example, data pipes could be a bit, pretty thin wrapper around like those primitives we talked about earlier. You could get an interface that's
Starting point is 00:33:45 not a lot harder to use than what we have, but could give you a lot more benefits and performance benefits. So I think the takeaway here is: think about how data is being moved in your system. If you think about, for example, like, Postgres or something, like database systems, they were engineered like 30 years ago. And the whole thing is that they say, right, HDDs are slow. We don't need to care to optimize a lot of other stuff in the beginning, because we are IO bound anyway. And I don't fault them for it, right? Like, this is how it has been. But if you throw away this assumption nowadays, with the hardware you have and the accelerators you have,
Starting point is 00:34:28 you might have engineered the system completely different. Just thinking about that, I think, could bring some benefits. Yeah, for sure. I was just going to say that the general awareness of this is obviously, I think, in itself has potential for big impact. So, yeah. Cool. Whilst you're working on this,
Starting point is 00:34:46 obviously you've kind of touched on loads of different things. You've kind of gone deep into the weeds and said loads of different sort of primitives and different sort of pieces of hardware and whatnot. So if you can kind of capture, what was the most sort of interesting thing you kind of learned while working on this paper? Maybe the thing that kind of caught you off guard as well.
Starting point is 00:35:06 I think the biggest thing probably was the difference between theory and the real world. So, um, it's not my first paper, right? So in my previous papers, in the beginning I said, like, I would like to do this, and then I mostly achieved that. And of course there were, like, setbacks and roadblocks and detours and bumps on the way, or, like, otherwise it wouldn't be an interesting research paper, but in the end I more or less did what I intended to do. And here, the story of the paper more or less completely changed halfway through, as we didn't really find a way to achieve our original goal, because our original goal was actually to build this merge sort thing and say, like, see, you can build a really fast merge
Starting point is 00:35:50 sort. And then I started implementing it, and I found, like we talked about earlier, lacking documentation, interfaces that didn't really work, you had to use opinionated frameworks like SPDK, and so on. It got really messy. So it turned out,
Starting point is 00:36:05 let's not write a paper about this merge sort, write a paper about, like, how messy it is and how you could maybe do it better. And it turned out that, like, embracing those difficulties and making them into a story in the end really worked out great. And I think the paper really got better because of it,
Starting point is 00:36:21 because I'm pretty proud now that our paper solves a problem that I know exists, because I encountered it while trying to write a different paper. So, yeah, I think the biggest lesson here is to, like, embrace failure, I guess. Like, I was not happy when it didn't work, and I had, you know, sleepless nights, right?
Starting point is 00:36:36 I was like not happy when it didn't work and I go, no, like sleepless nights, right? Like the whole paper falls apart, but then it turns out actually it made the paper better in the end, I guess. Yeah, that's good. I mean, I normally ask kind of about the origin story
Starting point is 00:36:50 and the background of the paper and how bumpy that journey was from the kind of initial conception of the idea to the actual end paper. But it seems like the whole thing changed on you halfway through, which was, I guess, unpleasant. But in the end, it worked out for the best, right? Yeah.
Starting point is 00:37:08 And like I said, the thing that kind of held the paper together, however, from the beginning to the end was that we had this idea, right? We have those well-behaved algorithms, and they are about data movement. And how can we make this work? And I think this was, like, the core of the story from beginning to end. So I think this helped us, that we said, yeah, we have this problem, and maybe we approach another aspect of that problem, but we still try to solve this problem of how can we have those well-behaved workloads where we know what data moves when and to where,
Starting point is 00:37:39 how can we make them better. Sure. Just out of interest, I mean, obviously a lot of this is kind of building off this assumption of having that well-behaved, um, algorithm, just so we can kind of control, I guess, the state space of things that can kind of go on. But how do you think it would perform on certain algorithms that may be a bit more unpredictable? Yeah, so this is an idea we had, like, halfway through the paper, I think, that in actuality, if you think about data structures, take an LSM tree, for example: it's an append-only data structure where you try to append new data and in the background try to merge them, to make sequential reads and writes all the time, because they're optimized for
Starting point is 00:38:31 disks, where sequential reads and writes are king. And in the end, for example, if you use some key-value store with an LSM tree backing it, this workload is not well-behaved.
Starting point is 00:38:48 You have, like, random reads and writes coming in all the time, but the LSM tree kind of forces your erratic workload into a well-behaved one, by being append-only and then doing sequential writes. So I think you can make most not-well-behaved workloads into well-behaved workloads by thinking about the right data structure. And, yeah, we thought about data structures
Starting point is 00:39:04 as these, like, workload transformers, which try to transform them into something that then again could benefit. Like I said, an LSM tree could benefit from a data pipe, because it's really predictable then. But of course, you have to build stuff on top that's not covered by the paper, right? You need the right data structures to make that work. Yeah. Yeah.
Starting point is 00:39:24 I guess, also, as well, what sort of other research are you working on at the moment? I mean, you've mentioned before that this isn't your first paper, right? Fifth-year PhD student, you've been through this process many times. So kind of what other research are you working on at the moment, or have been in the past? Yeah, so my first big thing was actually analytical query processing. And there, actually, price matters a lot. Like, you have these big cloud databases where you have to read a lot of data, and hardware becomes a commodity and you need to be cheap. And there I built Mosaic, which was a storage engine,
Starting point is 00:40:01 which can fetch the data for you. It's part of the database system, but it also can recommend what hardware you should buy to maximize your performance for a given budget, right? So the idea was that it says, right, like, 80% of your data you don't read anyway, so you might as well put it on the cheapest storage possible. And then it draws you this nice Pareto curve, like, if you increase your budget by 10%, you could increase your performance by 30%. On the other hand, if
Starting point is 00:40:30 you are out of budget, with 80% of the budget you could still have, like, 95% of the performance or something like this. I published this at VLDB three years ago, and then I said, okay, enough of analytical, let's do transactional.
Starting point is 00:40:46 And then I built Plush, which is a persistent hash table for persistent storage. And it kind of tries to be an LSM tree at the same time as well. So the idea here is that persistent memory has a really low write latency, like insanely low. So it's comparable to DRAM. And at the same time being persistent. So we thought
Starting point is 00:41:08 why not leverage this to have a data structure that can cope a lot with inserts. And so we take the best of LSM trees and apply it to hash tables and let this work on persistent memory. Unfortunately, Intel now killed persistent
Starting point is 00:41:23 memory, which I'm still very angry about because I think it's such great hardware. It has such low load-to-write latency. It won't be reached in the next decade by anything else, I guess. But I guess it just didn't make it profitable enough for them. So it turns out they killed it. Yeah, so those were the big other two papers
Starting point is 00:41:41 I wrote in the past. The second one being on VLDB 22. But now, yeah, we're thinking about some follow-up for data pipes, but I probably won't be the primary author for that because I'm currently in the process of finishing my dissertation. Cool. Yeah, the next question, I like to ask this to all my guests, and it's really interesting to see how the responses diverge. It's about the kind of the creative process of generating ideas and then selecting which ones to work on so i'd like to
Starting point is 00:42:09 kind of get your take on how you approach this yeah so so so i think i never really had a structured approach there so so what i did is i said like let's do whatever sounds fun and interesting at the moment for different reasons. So I had this discussion, I think, after my first big paper with my supervisor and some senior lab members. I said, right, I did this Mosaic thing now, right? It's on VLDB, and it's great, and I like it. Should I now look into different aspects of that? And how can we do it in the cloud?
Starting point is 00:42:42 How can we do it faster? And they said, of course, you can do that, but that sounds pretty incremental, and also, you will never have the chance in your life to be as self-guided again as you are now as a PhD student, so just do what is fun, right? Because you have the opportunity now. So I thought, yeah, okay, like, um, PMEM sounds pretty interesting, um, new stuff from Intel, upcoming technology. I didn't know that it would be killed a year later, but still, and I did a little analysis to my transactional stuff.
Starting point is 00:43:14 And so I came to that, and I'm very grateful for my supervisor, of course, because he allows me that freedom, right? He's just like, yeah, as long as you do some interesting stuff, it's fine so so so i really yeah my my way to do it was just like do what seems fun within like the confines of like the general topic that that has to be done because yeah i try to enjoy my phd and do what what sounds interesting to me yeah Yeah, as a guidance or a principle, if it's fun and interesting, right,
Starting point is 00:43:47 then it naturally makes working on something more enjoyable and therefore maybe you generate more ideas based on the fact that you're enjoying it. Also, my supervisor once taught me the issue is, right, you can do a lot of follow-up stuff on stuff you already did, but as soon as you exactly know what the path will be, it's by definition not an interesting research
Starting point is 00:44:10 topic, because, you know, if there isn't a possibility to fail, it's probably not something very new or novel, so do something out there, right? Maybe it fails, but as we've seen with the data pipes paper, it still turns out interesting.
Starting point is 00:44:28 If you just write 10,000 lines of code and you know it will just work, it might also be a nice paper, but I think it's definitely not as interesting as something that's totally out there and might as well just not be worth it. Yeah, I think you hit a nice point there. The fact that something doesn't work in itself can often be interesting.
Starting point is 00:44:43 It doesn't have to be, when you start out, the end goal doesn't have to be this perfect, amazing, super fantastic system or whatever, right? Like, the fact that you tried something and it failed is in itself an interesting result a lot of the time, right? And, yeah, but maybe it's harder to publish that sort of stuff, right? Yeah, unfortunately, it's really hard to publish negative findings, I think, still, because there's the stigma of, why should I care? I think it would be a lot better as a research community if we encouraged that more. Yeah, I agree with you there, Lukas, yeah, for sure. Cool. Um, I guess I'm just going to jump on that a little bit: what do you think is the biggest challenge in, uh, database research at the moment? Well, I think I have two answers. So the first one would be, let's say, outward facing. So I think the issue is to get people outside of our community to see how great database systems are.
Starting point is 00:45:35 So I talked a lot with, like, the bioinformatics people at our university and also the machine learning people. And, you know, they do a lot of stuff outside the database system, right? They just use the database system as, like, the store of data, and then they do all their stuff in Python, and then they, like, try to reinvent joins and so on, and do all of this on the application level. And, um, of course we can now point at them and say, yeah, of course, you don't know how it's supposed to be. But I think it's a failure on us that we as a community didn't get those people on board, to build the tools into our databases for them to use. So I think we should invest a lot more
Starting point is 00:46:17 into the tooling to make it easier for such people to use our systems and show the advantages. For example, if you look at DuckDB, they just built a really easy to use database system. It's like two lines of code and it works, and it just can replace whatever else people use beforehand just out of the box. And it has great adoption because of that. And I think we totally missed the goal there in the past. So I think that's a direction we could go in more. And then I think inward facing,
Starting point is 00:46:50 so I think that's, like, the TU Munich thing, where we say people leave a lot of performance on the table. So like we talked about earlier, we build abstractions on top of abstractions. We build, like, Spark clusters with lots of instances and lots of nodes. And I think if we really carefully engineer the system, we can do a lot on a single node. And, um, yeah, it's really hard work to do, but I think especially nowadays, where performance doesn't scale as nicely and improvements slow down over the years, um, it is worth a lot if we maybe refocus a little bit
Starting point is 00:47:26 and try to get the most out of the hardware we actually have standing around, and it's mostly idling. Yeah, I totally agree. Two really interesting challenges facing us there. When you were talking about the outward-facing one, the first thing that came to mind was DuckDB as well, and they've positioned themselves perfectly
Starting point is 00:47:43 of solving that sort of usability problem and getting data scientists to use databases, because I've experienced it too: there are people in bioinformatics and other sorts of areas that don't want to touch a database, because a lot of the time it's
Starting point is 00:47:59 hard to install and hard to operate. They're just like, oh no, I'll just reinvent the wheel myself in my own hacky way. But no, yeah, so they'll be there. And also, again, on the, um, inward-facing direction as well, I feel like a lot of people historically have been like, yeah, we want to make this distributed, we want to get as many nodes and as much compute as possible, and that doesn't necessarily give you the best outcome, I don't think. So, like you say, we leave a lot of performance on the table, as we have done in the past. So, yeah, interesting stuff. Um, I guess it's time for the last question now. So what's the one thing you want the listener to
Starting point is 00:48:35 take away from this, from your work and from this podcast today? Yeah, so I would say the main thing is, um, we shouldn't try to hide the complexity of what is happening below us in the stack. So we should be aware of it. And of course, not everybody can manage all the complexity, and not everybody should. So I think we should have nice interfaces helping us to deal with that complexity. And we should think about them. And I think we all agree, if CPUs were invented last year with exactly the performance they have now, people probably would have chosen a lot of different abstractions for this stuff, because a lot of stuff has just grown over the years because it was a good idea at the time. So I think we should take care of that
Starting point is 00:49:16 and think about how we could re-engineer stuff to better fit the current landscape. And so I would say, if you're concerned at all with performance, think of what's happening below you in the stack, and how we could better speak with that part. That would be the big takeaway for me.
Starting point is 00:49:41 Great stuff. Well, let's end it there. Thanks so much, Lukas, for coming on the show. It's been a fascinating conversation, and best of luck with the write-up. I hope that goes smoothly and you hit the
Starting point is 00:49:50 Q2 deadline. Um, great stuff. Um,
Starting point is 00:49:59 if the listener is interested in knowing more about Lukas's work, we'll put links to, um, all the relevant materials in the show notes. And if you enjoy the podcast, please consider supporting the show through Buy Me a Coffee. And we will see you all next time for some more awesome computer science research.
