Disseminate: The Computer Science Research Podcast - Hamish Nicholson | HetCache: Synergising NVMe Storage and GPU acceleration for Memory-Efficient Analytics | #22

Episode Date: February 13, 2023

Summary: In this episode, Hamish Nicholson tells us about HetCache, a storage engine for analytical workloads that optimizes the data access paths and tunes data placement by co-optimizing for the combinations of different memories, compute devices, and queries. Specifically, we present how the increasingly complex storage hierarchy impacts analytical query processing in GPU-NVMe-accelerated servers. HetCache accelerates analytics on CPU-GPU servers for larger-than-memory datasets through proportional and access-path-aware data placement. Tune in to hear more!

Links: Paper | Personal website | LinkedIn | Twitter

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, the computer science research podcast. I'm your host, Jack Waudby. Today we have another installment from the 2023 edition of CIDR, and I'm delighted to say I'm joined by Hamish Nicholson, who will be talking about his paper, HetCache: Synergizing NVMe Storage and GPU Acceleration for Memory-Efficient Analytics. Hamish is a PhD student at EPFL in Switzerland, and his research looks at how we can use modern storage hardware to improve analytical processing.
Starting point is 00:00:53 Hamish, welcome to the show. Thanks for having me. It's great to have you. I hope I pronounced HetCache right, right? I mean, I can't mess this one up. Sometimes I mess the names up, but HetCache, that's a pretty easy one. Okay, cool. Right, let's get started. So obviously I've given you a very brief introduction there, but can you maybe tell us a little bit more about yourself and how you became interested in database research and specifically in how we can apply modern hardware to this area? Yeah, for sure.
Starting point is 00:01:21 So about myself, I'm a second year PhD, like you said, at EPFL, working with Professor Anastasia Elmaki. And I originally actually got interested in database research during my undergrad when I took a course with Stratos Idrios. And by the second week, it was straight in to very hardware-aware optimizations like Radix Joins and being TLB-aware. And that just got me hooked on designing software that is very hardware-aware for maximum performance. That's cool. So let's dive into the topic of today, Hetcache. Can you start off by giving us maybe the elevator pitch for it, kind of what it is and sort of set the scene for us?
Starting point is 00:02:09 Yeah. So Hetcache is an approach for caching for scan-heavy analytical workloads, specifically using heterogeneous hardware for query processing, for example, using both CPUs and GPUs. And so Hetc cache leverages both workload and hardware information to make caching decisions. So from a hardware perspective, it takes into account block versus byte addressable storage and the bandwidths of different transfer paths.
Starting point is 00:02:35 For example, transferring data between the NVMe storage and GPUs or between NVMe storage and CPU memory. And from a workload perspective, it considers the selectivity of access to each page of input data, as well as observing the processing throughput of query execution. And by processing throughput,
Starting point is 00:02:55 I mean if you're running your query on a CPU, how many gigabytes per second is the CPU consuming of input data? And so using this information determines how much of each column to cache on each device. The aim of avoiding caching more data in memory then is beneficial. And so it's trying to use memory more efficiently to get the same or better performance than existing approaches to caching. Okay, nice. And that kind of leads nicely into my next question. Obviously, caching is not a new thing, right? It's as old as time almost. So,
Starting point is 00:03:31 what are the problems with other caching policies? Yeah, caching has been around since we've had computers. Current approaches to caching assume that every cache hit is always a win, and conversely, that every cache miss is always a loss. However, with modern hardware and with workloads that are more bandwidth-oriented, such as analytics with its heavy sequential accesses, bandwidth is more important than latency. And so when we have high bandwidth storage as we do today, for example, with NVMe storage, we don't necessarily need every data access to be a cache hit in order to get maximum performance.
Starting point is 00:04:15 And so current approaches are generally frequency or recency based, and they just assume that it's best to cache the most frequently used data, even if it doesn't necessarily improve query execution times. And so the heuristic that has worked for a very long time, no longer works as well with modern hardware with these sorts of analytical workloads. And the other approach to caching, rather than caching, for example, pages, is to move entire objects and pin them in memory. For example, placing entire columns or tables in memory. But this can also result in wasting memory because you don't necessarily need to have all of your data in memory to achieve maximum performance. You maybe only need a proportion of your data in memory to achieve maximum performance you maybe
Starting point is 00:05:05 only need a proportion of the data in memory for example if you have your uh column made up of pages maybe you only need like half of the column to be in memory so that the effective bandwidth to access that column is greater than the query processing throughput nice cool so yeah so yeah the frequency based approaches don't map well and obviously the the other approach will just stick in as much as you can in in memory is also is not idealizer cool so you kind of also put forward in your paper that um which there still is like a necessity to store some data in memory. We can't just get it all off MDME. Why is this?
Starting point is 00:05:53 So you're right. And so the benefit of memory really depends on your workload. If your workload is very lightweight and you're just basically scanning very large amounts of data and maybe doing a very simple aggregate, it's very hard to beat memory because your query effectively runs at memory bandwidth. And memory bandwidth is basically always strictly greater than storage bandwidth because when you're accessing data from a block device, such as an NVMe drive, you first need to transfer the data from the NVMe drive into your memory, which uses memory bandwidth, and then you need to read it from memory to your CPU, which uses memory bandwidth
Starting point is 00:06:35 again. And so it's not really possible to have your effective storage bandwidth be greater than half of your memory bandwidth. So if your query is particularly fast and runs close to memory bandwidth, then storage will always be slower. But these are very, very lightweight workloads. And most query processing workloads are a bit more complicated, may involve multiple joins or expensive or more expensive user-defined functions. And in that case, like the query processing throughput is generally quite a bit slower than your memory bandwidth. And these are the cases where you don't need to have your data in memory. Okay, nice.
Starting point is 00:07:13 So, yes, it kind of really illustrates the importance of the characteristics of the workload there. So I was obviously, when we were taught at school, from a class at university, there so i was obviously we when we when we're taught at school that we uh from classical university that our memory hierarchies are nice and nice and linear but unfortunately we don't live in this world anymore right so how does all of the heterogeneity and sort of memory hierarchies these days change our assumptions about how we should do caching? Yes. So it's nice to have this abstract thought of the conventional memory hierarchy
Starting point is 00:07:49 because it's quite easy to reason about. You go up the memory hierarchy and you have more performance. You have lower latency, lower capacity. It's more expensive, but it's faster. But when we have, for example, accelerators that have their own memory as well, and GPUs are the prime example we use in this paper, that these accelerators can access both their own local memory and they can also access other types of memory within the server. To give a very specific example, there is NVIDIA's universal virtual address space feature,
Starting point is 00:08:28 which enables GPUs to use pointers that can point either into CPU memory or into their own local memory. And so they can directly access data from system memory. And this complicates caching quite significantly because instead of having a linear hierarchy where you just move from one device to the next going up the hierarchy, there's actually kind of like a fork. And so we end up with what I've been calling
Starting point is 00:08:57 a bushy hierarchy, where from storage, you can go into GPU memory or from storage, you can go into system memory. And then from the GPU, it can either read data from its own memory that memory or from storage, you can go into system memory. And then from the GPU, it can either read data from its own memory that's come from storage, or it can read data that's in system memory. And so the reason this makes caching complicated
Starting point is 00:09:15 is that we no longer just need to decide what to cache. We also need to decide where to cache. I'm sure you can see the sort of the explosion and the possible paths that you can take through this sort of hierarchy now. Yeah, so it kind of, I'm sure you can see the sort of the explosion and the possible paths that you can take through this sort of hierarchy now. Yeah, so I'm convinced. I'm really, the need for head cash has really been strongly motivated
Starting point is 00:09:34 by these last few points you've made. So can we dig into the details a little bit about head cash? And can you tell us about, there's sort of two main features to it, shall we say there's stage simulated transfers and the heterogeneity where caching let's start off with the stage simulated transfers and tell me how what that is and how that works yeah uh so let me first just set the scene uh with some numbers for context for this discussion and so today uh gpu is typically attached to a server using PCIe and using 16 PCIe lanes.
Starting point is 00:10:08 And what this means is that with current generation PCIe 4, this results in about 32 gigabytes of bandwidth between the GPU and everything else on the system. And in practice, this is slightly less than 32 gigabytes a second due to the overheads of PCIe transactions. An NVMe drive is also connected using PCIe, but normally using four lanes. In practice, this means we see about seven gigabytes a second of read bandwidth from an NVMe drive. In our setup, we were using an array of NVMe drives. And so we had a total of 86 gigabytes a second of read bandwidth. And this compares to about 125 gigabytes per second of memory bandwidth
Starting point is 00:10:51 on our particular server. And these numbers will vary quite significantly across different hardware, but we're using this as a point of reference. And so with staged semi-lazy transfers, the problem that we're trying to solve is that we have storage bandwidth that exceeds the interconnect bandwidth to our GPU. So we have more storage bandwidth than ability to move data onto our GPU. And so our main motivation that leads to stage semi-lazy transfers is that during query processing, when you're doing a table scan, you often don't actually need to access all of the data. For example, if you had a query over multiple columns and you were applying a filter to the first column and then doing some
Starting point is 00:11:41 other calculations such as an aggregate on the other columns that passed that predicate, then you only need to access the values that pass the predicate on those other columns. But with NVMe drives, you can't just skip the values that didn't pass the predicate because NVMe drives are a block device. And so you need to read entire blocks from the device at a time because it's not a byte addressable device like DRAM or GPU memory. So leading into the stage, semi-lazy transfers, our approach to do transfers from NVMe storage to the GPU is to do direct transfers from the NVMe drives to the GPU for pages of data that will be accessed nearly in their entirety and where we're going to need most of the values. And so we use NVIDIA's GPU direct storage API that allows us to go from NVMe to GPU and bypass the system memory for
Starting point is 00:12:42 those sorts of transfers. Then when we want to transfer pages that will be more selectively accessed where we're going to need fewer values than the entire page, we stage the pages by moving them from NVMe storage into CPU memory, and then the GPU can directly access the values it needs in DRAM and only move those values across the interconnect. And so the purpose of stage semi-lazy transfers is to reduce the amount of data that we need to move over the GPU
Starting point is 00:13:12 interconnect, which is our main bottleneck when we're doing query processing on the GPU for NVMe resident data. Nice. So kind of following on from that, maybe obviously this works in tandem with the heterogeneity where caching. So how do you determine which pages you should take which path, basically? Which data should take which path? How does that work? For which pages should take which path, we currently use a heuristic based on the estimated selectivity for that page. And so we found that in micro benchmarks for our particular system, if you were selecting less than 10% of the values on a page, it was more efficient to stage the page in memory than to directly transfer the page to the GPU. That's interesting.
Starting point is 00:14:06 I was 10%. I guess, is there any sort of underlying reason why 10% or does it just kind of, I guess, fall out of the microbench, but maybe it just happens to be that? Is there any sort of deterministic reason why? It gets a little bit complicated to get into the details, but the reasons are around basically NVIDIA's bit of a black box. Okay.
Starting point is 00:14:32 And how, so the feature we leverage, UVA from NVIDIA, effectively bypasses the GPU memory so the GPU can directly access pages, can directly access effectively cache lines from the CPU. But there is a bit of haziness around like read coalescing, where if the GPU issues multiple reads that are close together, whether they'll be merged into one request to go over to the CPU and fetch those values. And so it's a bit murky to have an analytical model for this. Although some people I believe have tried. And so what we did was we formed a micro benchmark to determine like
Starting point is 00:15:14 what cutoff we would use. Sure. That makes sense. Yeah. I think that value would depend on, for example, like which generation of NVIDIA GPU you're using and which version of GPU drivers you're using. And so I would, if you're trying to replicate that number on different hardware, I would suggest also running a micro benchmark. Cool. Yeah, great. So we've got stage simulated transfers in the bank. Tell us a little bit about heterogeneity aware caching.
Starting point is 00:15:44 How does this work? Cool. So this plays in in the bank tell us a little bit about heterogeneity aware caching how does this work cool so this plays in in the case where when we have staged semi-lazy transfers and we've actually moved in some cases we can move the bottleneck from the interconnect to the storage bandwidth because even though we're reducing the amount of data that goes over the interconnect we still need to load the data from our NVMe drives into system memory. And we can, in some cases with very high selectivities, actually move the bottleneck back to the NVMe storage. And so in this case, it is beneficial to cache the data that the GPU is going to be consuming
Starting point is 00:16:19 on the CPU memory to reduce the amount of data that needs to be loaded from the NVMe drive. And we can do this, or it makes sense to do this rather, because the CPU has relatively higher access to the storage bandwidth. And so it doesn't need to cache as much data for its own processing and its own memory. And so we can use that larger DRAM capacity to cache data for the GPU because the GPU is relatively memory capacity constrained. And we also do some caching on the GPU itself, but we can't cache a very large amount of data and so what we do cache on the gpu is the data that's accessed with very high selectivity uh since those pages consume the most interconnect bandwidth if they're uncached so as a frame of reference how much can you actually cache on a gpu uh we set the cache size to 10 gigabytes uh because we have a 40 gigabyte gpu you still need GPU memory for query processing as well as just caching.
Starting point is 00:17:29 Cool. That makes sense. Great. So we've got these two features and you have implemented these right in a system called Proteus. Can you tell us a little bit about Proteus and how you went about implementing these ideas in Proteus. Yes. So Proteus is a database engine that my lab at EPFL has been working on. And Proteus is a compiled database engine, and we leverage the LLVM framework to generate and compile code for each query. However, Hacknash was primarily implemented in plain C++, and the generated code calls into some of the C++ functions to request pages be made available on different devices. Okay, was it a pretty, I probably realized I just pronounced proteus wrong, it's proteus, right, okay. What does proteus actually mean? Is it like a Roman thing? So there have been a very large number.
Starting point is 00:18:28 There's a very heavy Greek influence in my lab over the years. My memory probably serves me wrong, but if I remember, I think it comes from Greek mythology. And I think Proteus might be an alternative name for Zeus. But fact check that before putting it out. Yeah, I'll double check that before we put that out. Cool. What else did I want to ask?
Starting point is 00:18:55 Yes, on the implementation effort, how long did it take you to implement it? Was it a difficult thing to implement or was it pretty straightforward? I mean, I have no familiarity with Proteus or how easy it is to adjust and augment. So different parts of Proteus are easy to adjust in different ways. Because Proteus, before I arrived as a PhD, was primarily an in-memory system, I had been working to add support for accessing device on persistent storage, accessing data on persistent storage, primarily NVMe drives.
Starting point is 00:19:33 And so it wasn't too bad to add support for NVMe drives, particularly because in HeadCache, we're looking at every-to-only workload, which makes our lives substantially easier. And the way I went about implementing it was reasonably straightforward because Proteus implements the head exchange framework, which was published a couple of years ago in VLDB. And in that framework, it introduces a couple of new operators. One, there is the router, which is similar to the classical exchange operator. And then the other one that's more relevant to this discussion is the memmove operator. And so what this operator does is it ensures that the input pages for a pipeline, for a query pipeline you're about to invoke, are in the appropriate data or in the appropriate memory location for that pipeline. And that was primarily used for CPU-GPU hybrid processing and to move data from a from cpu memory to gpu memory and i basically hijacked
Starting point is 00:20:46 that operator to call into the storage layer instead and so be able to move logical pages that may be on disk or in memory into a specific location cool cool yeah just on a quick aside real quick i've just quickly googled what proteus was and it turns out he was uh an early sea god or god of rivers and and one of the several deities who home whom homer calls the old man of the sea so yeah he seems like he was some sea god basically in greek mythology and but yeah, cool, there we go. Nice. Okay, cool. So given that, let's talk about the evaluation.
Starting point is 00:21:32 Yeah. What was you trying to ascertain when you were evaluating Hetkash? What were the questions you were trying to answer? And can you maybe just tell us about your experimental setup and workloads used and things such as this? Yes. So there were kind of two main questions we wanted to answer. One was how sensitive to the selectivity of the staged semi-lazy transfers were. And the other is in ideal conditions, how memory efficient our approach could be. How much less memory could we use
Starting point is 00:22:05 and still achieve approximately in memory performance? And for the workload, we were using queries from the star schema benchmark. And for in terms of hardware, we were running on a AMD epic server with an Nvidia A40 GPU. And we had 12 PCIe 4 NVMe SSDs. Cool.
Starting point is 00:22:28 Can I interject real quick? What does PCIe, I should have asked this earlier on, what does PCIe stand for? What's the acronym stand for? You know what? I'm going to Google it. It's something, something interconnect express. Okay. In the past, there was peripheral component interconnect express. Got it. Okay. In the past, there was Peripheral Component Interconnect
Starting point is 00:22:46 Express. Got it. It's called Express because there was an older PCI, which predates PCIe, which was older and slower. Nice. So there's TACPE on the end of it, and we've got the Express. Cool. Nice.
Starting point is 00:23:01 Sorry, I interrupted you there. Did you have anything else to add on, like experimental stuff? So you mentioned you used the star schema benchmark. Can you maybe tell us a little bit more about that? Yes. So the star schema benchmark is intended as a benchmark to emulate kind of data warehousing queries. And by its name, it implements a star schema and there are a number of different query flights within the benchmark there are four query flights and each query flight has a number
Starting point is 00:23:33 of queries that are similar but vary in their selectivity and it's meant to give a sense of real world-ish workloads for data warehousing. Nice, cool. So let's talk about the key results then. Tell me the numbers. Yes. So our two main results are that stage semi-lazy transfers can improve query execution times on GPUs
Starting point is 00:24:00 with NVMe resident data quite significantly. So in the case of query 3.4, which is a very selective query from the benchmark, we can improve the execution time by 45% as we move the bottleneck for this query from the GPU. Then from the caching perspective, we show that when using both the CPU and GPU for processing, we can get within a couple percent of fully in-memory performance
Starting point is 00:24:24 without storing all of the data in memory. So for query 1.3 from the benchmark, which is a fairly lightweight query that does little more than a table scan, we can use 25% less memory to achieve in-memory performance. But for query 3.1, which is more processing intensive, we can approach in-memory performance without using any system memory for caching. And for both of those queries, we still use a little bit of the GPU caching, about policies to be workload aware and to know which data is useful or not useful to admit to the cache nice and so that's obviously the headline number there 45 and um it sounds like you've really um made the point that we need to care about our workload here so i kind of want to touch on, are there any situations necessarily
Starting point is 00:25:29 where head caching performance is suboptimal? Or is it always going to be an improvement? I'm going to hit it here. What are the limitations of this approach? Yes. So first of all, any caching solution, if your workload's never going to read the same data twice, it's not going to be all that helpful.
Starting point is 00:25:46 And head cache is also not going to be super great when you have a very diverse set of queries, but which all have the same working set. And by diverse here, I mean in terms of the selectivities of the queries and also the processing throughput of queries. So in this case, even if half of your queries don't need any data in memory, if a few of those queries heavily depend on memory, you're going to need to cache the data regardless to have the maximum performance for the workload overall. And so there won't really be a memory efficiency gain in that case, just because while it's useful for half your workload, the other half of your workload using the same data really needs the memory. And kind of on the limitation side, we currently, in our implementation, kind of assume that the
Starting point is 00:26:37 columns are reasonably uniformly distributed. As from our planner, we use the per column selectivity estimates and assume that they apply it to each page. So if you had wildly different selectivities per page, the current implementation will struggle a little bit. But if you had good per page statistics for the estimated selectivity, it should still perform decently. And the other main limitation at the moment is that it's currently tabled to full table scan workloads. It doesn't work with data skipping, such as skipping partitions or page skipping with zone maps. But this is actually something we're working on at the moment.
Starting point is 00:27:16 Nice. I look forward to reading that paper when it comes out. So you mentioned earlier on that this obviously is just totally read only and would they like how would it look like from adding some sort of rights into the mix or would is the sort of work we're targeting here just purely like obviously it's just analytical right so that typically is read only right but what if like how does like sort of ingestation ingestation of data and stuff kind of work? Or is that kind of an orthogonal problem to this? In some ways it's orthogonal, but not entirely.
Starting point is 00:27:52 In terms of ingesting data, if you're append-only, it's relatively simple. It's nearly as simple as being just read-only. If you start adding updates and transactions into the mix, it starts getting a little bit more complicated as you start needing to think about consistency and data freshness and snapshotting. And I think it is applicable, but there would be substantially more engineering effort to make it work.
Starting point is 00:28:28 It's something that I've been whiteboarding with colleagues in my lab because I have colleagues who are interested in HTAP or hybrid transactional analytical processing. I'm trying to figure out how to integrate and create a buffer pool that works both well for transactions and well for analytics. Nice, nice. works both well for transactions and well for analytics. Nice, nice. Like orthogonal goals. One is like latency and the other is like bandwidth. That sounds good. So I mean, on that as well, so obviously you've run the star schema benchmark.
Starting point is 00:29:00 How would it, I mean, I guess maybe it was the reason why you chose that, not, the probably reason why you chose why you chose that one right but obviously did you look at like maybe doing something else i mean like using tpc tpch or all those other benchmarks to see how it would fare on those or would you think it'd just be the same sort of outcome um i think it would probably be similar-ish. The reason that the Star Schema Benchmark was particularly... The reason we chose the Star Schema Benchmark was because it has these query flights that have similar queries with similar processing throughputs, but with different selectivities across the flight.
Starting point is 00:29:42 That was the main reason I went with it, because it was helpful to show that selectivity sensitivity sure sure that makes sense yeah and i kind of another thing that kind of springs to mind off the back of what we've just been talking about with the respective transactions is that obviously this focuses heavily on um cashing for like analytics and workloads would there be any benefit at all if you had the same sort of heterogeneous heterogeneous hardware and we just let's just like say we're all tp for a second we're not doing any of the big heavy stuff are there any sort of uh can it be like adapted to
Starting point is 00:30:15 that situation as well or is there no sort of really obviously the the bottlenecks are in different places for that type of workload so maybe this this wouldn't be useful. But I'm just thinking if you had a transaction processing system on top of heterogeneous hardware, how could you leverage that? That's a good question. There hasn't been a huge amount of work of transaction processing in general on heterogeneous hardware. There's been a little bit recently of some people... There has been work
Starting point is 00:30:47 on doing transaction processing on GPUs, but it is reasonably tricky, especially with ad hoc workloads because GPUs aren't really as good at this kind of task parallel processing. But ignoring the merits of GPUs for transactions, I think it probably applies less because when you're doing transaction processing,
Starting point is 00:31:17 you have, to use the word again, you have a bit more of a heterogeneous workload and transactions may be accessing like quite different data. And so the way you partition the work across your CPU and GPU might be quite different than in the analytical case. Like in the GPU case, I think a common-ish approach for LTP workloads is to batch transactions together and ship them to the GPU for execution. And so I think in that case, you existing approaches already have to worry about where the data is for each transaction,
Starting point is 00:31:54 rather than worrying about how much of a query the CPU or the GPU will process. It's more like, where will each transaction execute? Sure. That, that, that, that sounds good. And also as well kind of so i'm guessing um head cash is publicly available right and i want to kind of obviously ask or there's a follow-on to that um is proteus is it purely an academic system or does it have any sort of use outside and is anyone kind of using it for anything other than experimentation? Yes. Proteus, well, no. Proteus is purely an academic system at the moment. And while it's not currently open source, we are actively in the process of open sourcing it.
Starting point is 00:32:37 So Proteus should be available in the near future. Sounds good. I can update the show notes when it does become publicly available. Cool. So my next question is, as a software developer, for example, how can I leverage these findings in your research? On top of that, what sort of impact do you think these findings can have for system architects,
Starting point is 00:33:07 for software developers, and people working with databases? Yeah. So I think the main impact is to inspire people to rethink caching and the storage hierarchy. And while the exact approach that we've taken in this paper may not be directly applicable to every system in the world, I think it's important to realize how powerful modern storage is and to think of the hierarchy as less of this strict linear hierarchy we all learned in undergrad computer science, where each level is more performant than the last,
Starting point is 00:33:32 but to think of storage media, including memory, in terms of its properties instead. For example, is it byte addressable or is it block storage? How much bandwidth does it provide? What is its latency? And I think thinking about storage in these sorts of,
Starting point is 00:33:49 thinking about storage with these sorts of properties, instead of as a hierarchy one, we get more interesting and relevant over the next decade as we get more interesting and niche storage devices. For example, I think Samsung has this hybrid DRAM flash byte addressable CXL
Starting point is 00:34:09 storage, which doesn't really fit in this current storage hierarchy anywhere. It's like byte addressable persistent storage. You can't just like slot that in as like, ah, that goes between memory and my NVvme ssd cool so that could be a game changer right that works how like how far away from um sort of being available to play around with is that sort of so far is it just sort of rumored at the moment um it's one of those things that it's been announced but not when it will be available and if you have i believe that if you are a partner with samsung you might be able to get access to this hardware for evaluation um but it's not publicly available currently like i don't this is pure speculation, but my guess would be in a year
Starting point is 00:35:06 or two away. Look forward to seeing what impact that has. I guess kind of following on from that, it's hard to predict this sort of thing, but do you think there are, maybe this is it, maybe that is the game changer, but do you think there's any other sort of hardware on
Starting point is 00:35:21 the horizon that could really change the way we totally architect systems, like totally rethink everything. I mean, this probably sounds like a candidate actually, but is there anything else sort of out there that you think might be interesting? There is a couple, like one like broad class of hardware that's coming in the coming years is
Starting point is 00:35:43 CXL. And so CXL is kind of like PCIe, but better with lower licensee and which will enable some interesting new devices. And particularly the one that people are very hyped up and excited about is this idea of memory pooling and disaggregated memory or remote memory, far memory, whatever people want to call it, where you can attach more memory to your server by basically just like plugging it into a CXL slot or accessing it across
Starting point is 00:36:20 the rack that's in a, and it could be in a different chassis or server. And so you can dynamically expand the amount of memory available to you. And I think this will be particularly interesting, especially because in future iterations of CXL, you'll be able to share this memory potentially across different servers and so you may have shared memory pooling i think that will be interesting as we're going to end up with kind of like byte addressable far storage and i'm interested to see what what people do with this plenty of interesting things on horizon and it's kind of going. My next question is going to be, how easy do you think it would be to
Starting point is 00:37:10 take head cache and implement it into an existing system? Do you think it would be pretty straightforward to do or difficult? It really depends on the exact system. There are a couple of axes at play here. First, whether or not the database is already designed to run on GPUs or not. For a system that uses GPUs or uses GPUs and CPUs, I think that staged semi-lazy transfers would be reasonably straightforward to integrate. If the system doesn't leverage GPUs or other types of accelerators, remember already, then, you know,
Starting point is 00:37:43 staged semi-lazy transfers are not really applicable unless you also want to build a processing engine for accelerators. And kind of the second part is the execution model. And so implementing a head cache with a pipelined execution model is more straightforward as it's easier to observe how quickly a query is consuming the input data. And for the caching aspect, it may be challenging software engineering, as most current caching policies don't need to be aware of what they're caching or really higher levels of abstraction, they just need to know how often this page is used. So implementing hit cache may require providing the caching layer with quite a bit more metadata from higher layers than is conventional. I would imagine that could make the software engineering aspect a little tricky. Yeah, sure. I mean, I guess in the same breath, though, I mean, 40% potential performance gains is quite an attractive proposition to to try to strive for
Starting point is 00:38:46 and maybe probably is worth the engineering effort but yeah no cool and the next kind of series of questions i ask i kind of ask everyone and the kind of um like kind of like my stock questions but it's always interesting to see the diversity in answers across these questions um so the first one is, what do you think, well, what was the most interesting and maybe unexpected lesson that you learned while working on Hetcache? So I think the most interesting lesson is that storage, or at least locally attached storage,
Starting point is 00:39:16 is really quite fast. And in some cases, it's so fast, you need to begin looking at CPU microarchitecture. So when I was working on this last year, I'm working on an AMD Epic Milan machine. And each of these CPUs is composed of four different chiplets sitting on top of a storage die. And so there are four PCIe roots, one for each of these chiplets. And these chiplets contain the cores and their caches. And then the IO die has the PCI controller and the memory controllers.
Starting point is 00:39:52 And what I ran into is that I had, I think, all of my drives sitting on one of the NUMA nodes corresponding to one of the chiplets. And I couldn't get all my storage bandwidth. And I was going nuts trying to find a bug in my code, thinking like, why can't I get my storage bandwidth? What's going on here? And then after a couple of weeks looking at my code, I'm like, oh my God, it's a hardware issue. Because they're all attached to this one NUMA node,
Starting point is 00:40:16 the way it was working was that all these drives were using the memory bandwidth from that local NUMA node. And so I couldn't achieve my expected storage bandwidth from that many drives because I was out of memory bandwidth on that NUMA node with the bias configurations I had. And then I was like, wow, chiplets are complicated. And storage is more complicated than I thought it was.
Starting point is 00:40:42 And I went into a long deep dive about how CPUs work and how PCIe works. Sounds like a fun journey though. It was fun. It was, you don't really expect it to be a hardware issue when you're writing code. You always kind of expect that you're the one who's messed up somewhere. Yeah, that's true. That's really cool. How long was it before you realised it was a hardware problem? I think it was like 10 days or like two weeks. That's not too bad. I mean, at least you weren't like kind of at that like banging head against the wall for like months before you realised, right?
Starting point is 00:41:16 So two weeks is not pleasant, but I mean, still. Yeah, could have been worse. Great. So, I mean, have been worse. Great. So, I mean, you kind of hinted at there, like, kind of you've been on this journey for a while. So from the initial idea of this paper to the side of publication, what were the things along the way that you tried that failed? And just maybe talk us through that sort of research journey in general. Yeah. So the initial idea was how do we combine fast MDMA storage with GPU acceleration for analytical query processing? And so kind of the initial line of work we had started was looking at compressed data and whether we should be caching compressed pages on the
Starting point is 00:42:03 GPU because GPUs are pretty good at decompression. And that would mean we could get more data over the interconnect to the GPU. And we could also cache more data on the GPU and then decompress it on the fly when we do query execution. And we kind of ultimately gave up on this idea because I got stuck in a pit of engineering
Starting point is 00:42:23 and ran into some other weird uh bugs that i could not explain after a couple weeks and so i kind of put it on the back burner and shifted ideas yeah on the backlog we'll move on to this yeah you may come back to in this compression and gpus how long was this end-to-end journey then? From like, let's do this to let's publish. How long was that journey? So I had published a paper at Daemon 2022 looking at this NVMe storage and CPU execution side. And then after that,
Starting point is 00:43:05 my advisor was like, you should submit something to CIDR. Think about what you should submit. And so basically since Damon last year, I've been thinking that's how long the process has been. And I've also started working with GPUs because one of the senior PhDs in the lab at the time was a GPU expert and so having him on hand was very helpful guidance. Yes I can imagine that's cool and so
Starting point is 00:43:35 where do you go next then with head cash what's what's in store in the future? There's a few things that are kind of related to each other and kind of not in some ways. One, at some point, I want to get back to this compression part, because realistically, in the stored completely uncompressed just because the overhead of decompressing data on the CPU is a pretty high overhead. And then another direction is potentially looking into network storage, especially as we have these high bandwidth NICs these days and seeing maybe whether these ideas can also be applied to network storage and whether we need to like cache data locally.
Starting point is 00:44:31 And that's kind of like also kind of like looking more into like the cloud context with like disaggregated storage. That's a very vague idea at the moment. And then a bit more concretely, trying to work on support for more than just full table scans or full column scans and adding the support for data skipping and taking that into account
Starting point is 00:44:56 in the caching. Nice. Plenty of interesting research directions there. That would be a lot of fun to be had, I'm sure. Great stuff. So, I mean, kind of on that kind of how do you go about sort of determining determining like generating ideas of like where to take your research it and how do you then determine which one's worth pursuing it's a very difficult question to answer entirely.
Starting point is 00:45:26 So one of the sorts of ideas is we work on an academic database system, right? And so there are lots of things that are unimplemented until we need to use them and implement them. And so whenever we need to implement something, for example, a buffer pool, we sit around and whiteboard things for a while, thinking about what assumptions people might have made in the past to design these components and seeing whether these assumptions are still true for new or emerging hardware. And so we often find we get good ideas just by implementing stuff and seeing what problems or insights we find along the way. For example, when I originally started my PhD, I had no idea what I was going to do.
Starting point is 00:46:07 And someone pointed me at like, oh, we have fast storage now, like maybe write a join algorithm that's optimized for spilling to NVMe storage if the intermediate state is too large. And when I started doing this, I was like, huh, the storage isn't the bottleneck for query processing for a bunch of these queries. That's odd. That goes against like the assumption I have in my head that storage is always the bottleneck.
Starting point is 00:46:32 And so kind of just like running into like insights by implementing things and seeing what happens. That's a nice, that's a nice process. I like the, the approach of sort of like having the brain trust as well of where you kind of get down. So, okay, now we need to implement a nice process. I like the approach of having the brain trust as well, of where you kind of get down. So, okay, now we need to implement a buff pool. How should we do this?
Starting point is 00:46:49 That sounds like a really enjoyable, creative process. For sure. Great stuff. So, I've got two more questions for you. The penultimate one is kind of, I guess, quite high level, quite vague. What do you think is the biggest challenge in database research now i think that one of the biggest challenges is how hardware is becoming increasingly heterogeneous heterogeneous and this is difficult because there's a trade-off between performance and portability. So conventionally, if you're doing performance tuning on, say,
Starting point is 00:47:36 a regular CPU like x86 or ARM, it's reasonably straightforward. You kind of think about your cap hierarchy and fitting stuff into memory or into into caches but when you have all these different accelerators it gets more difficult to reason about things especially if you have software that can run in lots of different places and ensuring that it's like performant in all of those places and I think hardware is only going to become more heterogeneous. For example, like Intel's taking this turn with Sapphire Rapids to piling on, Sapphire Rapids being the CPUs they released a couple of weeks ago. And with this new generation CPU, Intel has piled on all these accelerators into their new CPUs. And this means that software developers have to take into account all these different features that may or may not be present on all the different systems they run.
Starting point is 00:48:42 And so I think that we need to have new abstractions in order to use these accelerators. because otherwise they're just like too difficult and time consuming to be worth programming for, even if they do result in more efficient or faster processing. And from a data movement perspective, specifically, there was a really good paper at CIDR this year as well, called Data Pipes, which basically made data movement declarative. And then underneath the hood would leverage these different hardware features if they existed to do transfers the most efficiently. And so I think working on these sorts of abstractions to make programming for heterogeneous hardware easier is going to be really important in the coming years. That's really cool.
Starting point is 00:49:26 And now it's time for the last word. What's the one thing you want the listener to take away from your work on Hetcache? Cool. I'm going to cheat and say two things. Okay, yeah, yeah. I'll let you off. First, NVMe storage is really fast and competitive
Starting point is 00:49:41 with main memory for primary storage for coarse-grained accesses. So for scale-up systems, we should preferably be using memory where it's actually better for latency-critical or finely accessed data, and then leverage NVMe storage for data where the bandwidth aspect is more important than the latency. And secondly, the storage hierarchy is only going to get messier and less linear over the next decade. And so when we have new technologies such as the Excel remote memory and accelerators only getting more common,
Starting point is 00:50:16 we're going to need a new mental model for thinking about the storage hierarchy. Amazing. Let's end it there. Thanks so much, Hamish. It's been a fantastic conversation. And if the listeners interested in knowing more about Hamish's work, we'll put links to all the relevant materials in the show notes. So you can go and check those out and we will see you all next time for some
Starting point is 00:50:35 more awesome computer science research. Thank you.
