Disseminate: The Computer Science Research Podcast - Hamish Nicholson | HetCache: Synergising NVMe Storage and GPU acceleration for Memory-Efficient Analytics | #22
Episode Date: February 13, 2023
Summary: In this episode, Hamish Nicholson tells us about HetCache, a storage engine for analytical workloads that optimizes the data access paths and tunes data placement by co-optimizing for the combinations of different memories, compute devices, and queries. Specifically, we present how the increasingly complex storage hierarchy impacts analytical query processing in GPU-NVMe-accelerated servers. HetCache accelerates analytics on CPU-GPU servers for larger-than-memory datasets through proportional and access-path-aware data placement. Tune in to hear more!
Links: Paper, Personal website, LinkedIn, Twitter
Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate, the computer science research podcast.
I'm your host, Jack Wardby.
Today we have another installment from the 2023 edition of CIDR,
and I'm delighted to say I'm joined by Hamish Nicholson,
who will be talking about his paper,
HetCache: Synergizing NVMe Storage and GPU Acceleration for Memory-Efficient Analytics.
Hamish is a PhD student at EPFL in Switzerland,
and his research looks at how we can use modern storage hardware to improve analytical processing.
Hamish, welcome to the show. Thanks for having me.
It's great to have you. I hope I pronounced HetCache right, right? I mean, I can't
mess this one up. Sometimes I mess the names up, but HetCache, that's just a pretty easy one.
Okay, cool. Right, let's get started.
So obviously I've given you a very brief introduction there,
but can you maybe tell us a little bit more about yourself and how you became interested in database research
and specifically in how we can apply modern hardware to this area?
Yeah, for sure.
So about myself, I'm a second year PhD, like you said, at EPFL, working with Professor Anastasia Ailamaki.
And I originally actually got interested in database research during my undergrad when I took a course with Stratos Idreos.
And by the second week, it went straight into very hardware-aware optimizations like radix joins and being TLB-aware.
And that just got me hooked on designing software that is very hardware-aware for maximum performance.
That's cool.
So let's dive into the topic of today, HetCache.
Can you start off by giving us maybe the elevator pitch for it,
kind of what it is and sort of set the scene for us?
Yeah. So HetCache is an approach to caching for scan-heavy analytical workloads,
specifically using heterogeneous hardware for query processing,
for example, using both CPUs and GPUs.
And so HetCache leverages both workload
and hardware information to make caching decisions.
So from a hardware perspective,
it takes into account block versus byte addressable storage
and the bandwidths of different transfer paths.
For example, transferring data
between the NVMe storage and GPUs
or between NVMe storage and CPU memory.
And from a workload perspective,
it considers the selectivity of access to each page of input data,
as well as observing the processing throughput
of query execution.
And by processing throughput,
I mean if you're running your query on a CPU,
how many gigabytes per second
is the CPU consuming of input data?
And so it uses this information to determine how much of each column to cache on each device. The aim is to avoid caching more data in memory than is beneficial. And so it's trying to use memory more efficiently to get the same or better performance
than existing approaches to caching. Okay, nice. And that kind of leads nicely into
my next question. Obviously, caching is not a new thing, right? It's as old as time almost. So,
what are the problems with other caching policies? Yeah, caching has been around since we've had
computers. Current approaches to caching assume that every cache hit is always a win, and
conversely, that every cache miss is always a loss. However, with modern hardware and with
workloads that are more bandwidth-oriented, such as analytics with its heavy sequential accesses,
bandwidth is more important than latency. And so when we have high bandwidth storage as we do today,
for example, with NVMe storage,
we don't necessarily need every data access to be a cache hit
in order to get maximum performance.
And so current approaches are generally frequency or recency based,
and they just assume that it's best to cache the most frequently used data,
even if it doesn't necessarily improve query execution times. And so the heuristic that has worked for a very long time no longer works as well with modern hardware for these sorts of
analytical workloads. And the other approach to caching, rather than caching, for example, pages, is to move entire objects and pin them in memory.
For example, placing entire columns or tables in memory.
But this can also result in wasting memory because you don't necessarily need to have all of your data in memory to achieve maximum performance.
You maybe only need a proportion of the data in memory. For example, if you have your column made up of pages, maybe you only need half of the column to be in memory, so that the effective bandwidth to access that column is greater than the query processing throughput.
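To make that intuition concrete, here is a minimal back-of-the-envelope sketch in C++ of that kind of calculation. It assumes a simplified model in which the cached and uncached parts of a column are read one after the other; the function name, the numbers, and the model itself are illustrative only, not HetCache's actual placement logic.

```cpp
#include <algorithm>
#include <cstdio>

// Simplified serial-transfer model (not the actual HetCache cost model):
// a column scan reads the cached fraction f from DRAM and the rest from NVMe,
// one after the other. We want the effective scan bandwidth to be at least
// the query's processing throughput T, and solve for the smallest f:
//
//   effective_bw(f) = 1 / (f / mem_bw + (1 - f) / storage_bw)  >=  T
//   =>  f >= (1/T - 1/storage_bw) / (1/mem_bw - 1/storage_bw)
double min_cached_fraction(double processing_gbps,
                           double storage_gbps,
                           double memory_gbps) {
  if (processing_gbps <= storage_gbps) return 0.0;  // storage alone keeps up
  if (processing_gbps >= memory_gbps)  return 1.0;  // even full caching only just matches memory
  double f = (1.0 / processing_gbps - 1.0 / storage_gbps) /
             (1.0 / memory_gbps     - 1.0 / storage_gbps);
  return std::clamp(f, 0.0, 1.0);
}

int main() {
  // Ballpark numbers from the episode: ~86 GB/s NVMe array, ~125 GB/s DRAM.
  double f = min_cached_fraction(/*processing_gbps=*/100.0,
                                 /*storage_gbps=*/86.0,
                                 /*memory_gbps=*/125.0);
  std::printf("cache at least %.0f%% of the column\n", f * 100.0);  // roughly 45%
}
```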
Nice, cool. So yeah, the frequency-based approaches don't map well, and obviously the other approach of just sticking in as much as you can in memory is not ideal either. Cool. So you also put forward in your paper that there still is a necessity to store some data in memory. We can't just get it all off NVMe. Why is this?
So you're right.
And so the benefit of memory really depends on your workload.
If your workload is very lightweight and you're just basically scanning very large
amounts of data and maybe doing a very simple aggregate, it's very hard to beat memory because
your query effectively runs at memory bandwidth. And memory bandwidth is basically always strictly
greater than storage bandwidth because when you're accessing data from a block device, such as an
NVMe drive, you first need to transfer the data from the NVMe drive into your memory, which uses
memory bandwidth, and then you need to read it from memory to your CPU, which uses memory bandwidth
again. And so it's not really possible to have your effective storage bandwidth be greater than
half of your memory bandwidth. So if your query is particularly fast and runs close to memory bandwidth, then storage will always be slower.
But these are very, very lightweight workloads. And most query processing workloads are a bit
more complicated, may involve multiple joins or expensive or more expensive user-defined functions.
And in that case, like the query processing throughput
is generally quite a bit slower than your memory bandwidth.
And these are the cases where you don't need to have your data in memory.
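As a rough illustration of that argument, here is a simplified sketch with the episode's ballpark numbers, not a claim about any particular server: reading block storage through DRAM spends memory bandwidth twice, once for the DMA write into DRAM and once for the CPU read out of it.

```cpp
#include <algorithm>
#include <cstdio>

// Reading NVMe-resident data on the CPU uses memory bandwidth twice (the block
// is written into DRAM, then read back out by the CPU), so the effective
// storage-to-CPU bandwidth cannot exceed half of the memory bandwidth.
// Numbers are illustrative.
double effective_storage_to_cpu_gbps(double storage_gbps, double memory_gbps) {
  return std::min(storage_gbps, memory_gbps / 2.0);
}

int main() {
  // ~86 GB/s NVMe array, ~125 GB/s DRAM  ->  capped at 62.5 GB/s effective.
  std::printf("%.1f GB/s\n", effective_storage_to_cpu_gbps(86.0, 125.0));
}
```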
Okay, nice.
So, yes, it kind of really illustrates the importance
of the characteristics of the workload there.
So obviously, when we were taught at school, or in classes at university, we learned that our memory hierarchies are nice and linear. But unfortunately we don't live in that world anymore, right? So how does all of the heterogeneity in memory hierarchies
these days change our assumptions about how we should do caching? Yes. So it's nice to have this abstract thought
of the conventional memory hierarchy
because it's quite easy to reason about.
You go up the memory hierarchy
and you have more performance.
You have lower latency, lower capacity.
It's more expensive, but it's faster.
But when we have, for example,
accelerators that have their own memory as well, and GPUs are the prime example we use in this paper,
these accelerators can access both their own local memory and other types of memory within the server. To give a very specific example, there is NVIDIA's unified virtual addressing (UVA) feature,
which enables GPUs to use pointers that can point either into CPU memory or into their own local memory.
And so they can directly access data from system memory.
And this complicates caching quite significantly
because instead of having a linear hierarchy
where you just move from one device to the next
going up the hierarchy,
there's actually kind of like a fork.
And so we end up with what I've been calling
a bushy hierarchy,
where from storage, you can go into GPU memory
or from storage, you can go into system memory.
And then from the GPU,
it can either read data from its own memory
that's come from storage,
or it can read data that's in system memory.
And so the reason this makes caching complicated
is that we no longer just need to decide what to cache.
We also need to decide where to cache.
I'm sure you can see the sort of explosion in the possible paths that you can take through this sort of hierarchy now.
Yeah, so I'm convinced.
I really am; the need for HetCache has really been strongly motivated by these last few points you've made. So can we dig into the details a little bit about HetCache?
And can you tell us about,
there are sort of two main features to it, shall we say: staged semi-lazy transfers and heterogeneity-aware caching. Let's start off with the staged semi-lazy transfers; tell me what that is and how that works.
Yeah, so let me first just set the scene with some numbers for context for this discussion. So today a GPU is typically attached to a server using PCIe, using 16 PCIe lanes. And what this means is that with current generation PCIe 4, this results in about 32 gigabytes per second of bandwidth between the GPU and everything else on the system.
And in practice, this is slightly less than 32 gigabytes a second due to the overheads of
PCIe transactions.
An NVMe drive is also connected using PCIe, but normally using four lanes.
In practice, this means we see about seven gigabytes a second of read bandwidth from
an NVMe drive.
In our setup, we were using an array of NVMe drives. And so we had a total of 86 gigabytes a second
of read bandwidth. And this compares to about 125 gigabytes per second of memory bandwidth
on our particular server. And these numbers will vary quite significantly across different
hardware, but we're using this as a point of reference. And so with staged semi-lazy transfers,
the problem that we're trying to solve
is that we have storage bandwidth that exceeds the interconnect bandwidth to our GPU. So we have
more storage bandwidth than ability to move data onto our GPU. And so our main motivation that
leads to stage semi-lazy transfers is that during query processing, when you're doing a table scan,
you often don't actually need to access all of the data. For example, if you had a query
over multiple columns and you were applying a filter to the first column and then doing some
other calculations such as an aggregate on the other columns that passed
that predicate, then you only need to access the values that pass the predicate on those other
columns. But with NVMe drives, you can't just skip the values that didn't pass the predicate
because NVMe drives are a block device. And so you need to read entire blocks from the device at a time because it's not
a byte-addressable device like DRAM or GPU memory. So leading into the staged semi-lazy transfers,
our approach to do transfers from NVMe storage to the GPU is to do direct transfers from the NVMe drives to the GPU for pages of data that will be accessed nearly
in their entirety and where we're going to need most of the values. And so we use NVIDIA's GPU
direct storage API that allows us to go from NVMe to GPU and bypass the system memory for
those sorts of transfers. Then when we want to transfer pages
that will be more selectively accessed
where we're going to need fewer values than the entire page,
we stage the pages by moving them from NVMe storage
into CPU memory,
and then the GPU can directly access the values it needs in DRAM
and only move those values across the interconnect.
And so the purpose of staged semi-lazy transfers is to reduce the amount of data that we need to move over the GPU
interconnect, which is our main bottleneck when we're doing query processing on the GPU
for NVMe-resident data. Nice. So kind of following on from that, obviously this works in tandem with the heterogeneity-aware caching. So how do you determine which pages should take which path, basically? Which data should take which path? How does that work? For which pages should
take which path, we currently use a heuristic based on the estimated selectivity for that page. And so we found that in micro
benchmarks for our particular system, if you were selecting less than 10% of the values on a page,
it was more efficient to stage the page in memory than to directly transfer the page to the GPU.
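Here is a hypothetical sketch of that per-page decision; the type names are made up for illustration, and the 10% cutoff is simply the value from the microbenchmark mentioned above, which would need to be re-measured on other hardware.

```cpp
// Illustrative sketch of the per-page transfer-path heuristic described above.
enum class TransferPath {
  kDirectNvmeToGpu,   // whole page goes straight from NVMe to GPU memory
  kStageInCpuMemory   // page staged in DRAM; GPU pulls only the needed values over UVA
};

// Machine-specific cutoff derived from a microbenchmark; re-measure per system.
constexpr double kStagingSelectivityCutoff = 0.10;

TransferPath choose_path(double estimated_selectivity) {
  // Highly selective access (few values needed): stage in CPU memory so only
  // the touched values cross the CPU-GPU interconnect. Otherwise, ship the
  // whole page directly from NVMe to the GPU.
  return estimated_selectivity < kStagingSelectivityCutoff
             ? TransferPath::kStageInCpuMemory
             : TransferPath::kDirectNvmeToGpu;
}
```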
That's interesting.
Ah, 10%. I guess, is there any sort of underlying reason why 10%, or does it just kind of fall out of the microbenchmark, and it just happens to be that?
Is there any sort of deterministic reason why?
It gets a little bit complicated to get into the details,
but the reasons are around
basically NVIDIA being a bit of a black box.
Okay.
And how, so the feature we leverage,
UVA from NVIDIA,
effectively bypasses the GPU memory
so the GPU can directly access pages,
can directly access effectively cache lines from the CPU.
But there is a bit of haziness around like read coalescing, where if the GPU issues multiple reads that are close together, whether they'll be merged into one request to go over to the CPU and fetch those values.
And so it's a bit murky to have an analytical model for this. Although some
people I believe have tried. And so what we did was we performed a microbenchmark to determine what cutoff we would use. Sure. That makes sense. Yeah. I think that value would depend on, for
example, like which generation of NVIDIA GPU you're using
and which version of GPU drivers you're using.
And so I would, if you're trying to replicate that number
on different hardware, I would suggest also running a micro benchmark.
Cool. Yeah, great.
So we've got staged semi-lazy transfers in the bank.
Tell us a little bit about heterogeneity-aware caching.
How does this work?
Cool. So this plays in in the case where we have staged semi-lazy transfers, and in some cases we can actually move the bottleneck from the interconnect to the storage
bandwidth because even though we're reducing the amount of data that goes over the interconnect we
still need to load the data from our NVMe drives into system memory.
And we can, in some cases with very high selectivities,
actually move the bottleneck back to the NVMe storage.
And so in this case,
it is beneficial to cache the data that the GPU is going to be consuming
on the CPU memory to reduce the amount of data that needs to be loaded from the NVMe drive.
And we can do this, or it makes sense to do this rather, because the CPU has relatively higher
access to the storage bandwidth. And so it doesn't need to cache as much data for its own processing in its own memory. And so we can use that larger DRAM capacity to cache data for the GPU, because the GPU is relatively memory capacity constrained.
And we also do some caching on the GPU itself, but we can't cache a very large amount of data. And so what we do cache on the GPU is the data that's accessed with very high selectivity, since those pages consume the most interconnect bandwidth if they're uncached.
So as a frame of reference, how much can you actually cache on a GPU?
We set the cache size to 10 gigabytes, because with a 40 gigabyte GPU you still need GPU memory for query processing as well as just caching.
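A hypothetical sketch of this placement idea follows; it is not the actual HetCache algorithm, and the struct fields and ranking heuristic are illustrative. Pages are ranked by how much interconnect traffic they would cause if left uncached, the small GPU budget takes the worst offenders, and a larger CPU-DRAM budget absorbs further pages to take load off the NVMe drives.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct PageInfo {
  uint64_t page_id;
  uint64_t size_bytes;
  double   selectivity;       // fraction of values the queries actually touch
  double   access_frequency;  // how often the workload scans this page
};

enum class Placement { kGpuMemory, kCpuMemory, kNvmeOnly };

std::vector<std::pair<uint64_t, Placement>>
place_pages(std::vector<PageInfo> pages,
            uint64_t gpu_budget_bytes,    // e.g. 10 GB of a 40 GB GPU
            uint64_t cpu_budget_bytes) {
  // Pages that are scanned often and nearly in full cost the most interconnect
  // bandwidth when uncached, so they are the most valuable to pin on the GPU.
  std::sort(pages.begin(), pages.end(), [](const PageInfo& a, const PageInfo& b) {
    return a.access_frequency * a.selectivity * a.size_bytes >
           b.access_frequency * b.selectivity * b.size_bytes;
  });

  std::vector<std::pair<uint64_t, Placement>> plan;
  uint64_t gpu_used = 0, cpu_used = 0;
  for (const PageInfo& p : pages) {
    if (gpu_used + p.size_bytes <= gpu_budget_bytes) {
      gpu_used += p.size_bytes;
      plan.emplace_back(p.page_id, Placement::kGpuMemory);
    } else if (cpu_used + p.size_bytes <= cpu_budget_bytes) {
      cpu_used += p.size_bytes;
      plan.emplace_back(p.page_id, Placement::kCpuMemory);
    } else {
      plan.emplace_back(p.page_id, Placement::kNvmeOnly);
    }
  }
  return plan;
}
```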
Cool. That makes sense.
Great. So we've got these two features and you have implemented these right in a system called Proteus.
Can you tell us a little bit about Proteus and how you went about implementing these ideas in Proteus?
Yes. So Proteus is a database engine that my lab at EPFL has been working on. And Proteus is a compiled database engine, and we leverage the LLVM framework to generate and compile code
for each query. However, HetCache was primarily implemented in plain C++, and the generated code
calls into some of the C++ functions to request pages be made available on different devices.
Okay, I probably realize I just pronounced Proteus wrong, it's Proteus, right? Okay. What does Proteus actually mean? Is it like a Roman thing? So, there have been a very large number.
There's a very heavy Greek influence in my lab over the years.
My memory probably serves me wrong, but if I remember,
I think it comes from Greek mythology.
And I think Proteus might be an alternative name for Zeus.
But fact check that before putting it out.
Yeah, I'll double check that before we put that out.
Cool.
What else did I want to ask?
Yes, on the implementation effort,
how long did it take you to implement it?
Was it a difficult thing to implement
or was it pretty straightforward?
I mean, I have no familiarity with Proteus or how easy it is to adjust and augment.
So different parts of Proteus are easy to adjust in different ways.
Because Proteus, before I arrived as a PhD, was primarily an in-memory system, I had been working to add support for accessing data on persistent storage, primarily NVMe drives.
And so it wasn't too bad to add support for NVMe drives, particularly because in HeadCache,
we're looking at a read-only workload, which makes our lives substantially easier. And the way I went about implementing it was
reasonably straightforward because Proteus implements the HetExchange framework,
which was published a couple of years ago in VLDB. And in that framework,
it introduces a couple of new operators. One, there is the router, which is similar to the
classical exchange operator. And then the other one that's more relevant to this discussion is
the memmove operator. And so what this operator does is it ensures that the input pages for a query pipeline you're about to invoke are in the appropriate memory location for that pipeline.
And that was primarily used for CPU-GPU hybrid processing, to move data from CPU memory to GPU memory. And I basically hijacked that operator to call into the storage layer instead, so as to be able to move logical pages that may be on disk or in memory into a specific location.
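A hypothetical sketch of what such an operator boundary can look like; the names below are illustrative only and not the real Proteus or HetExchange API.

```cpp
#include <cstdint>
#include <vector>

// A memmove-style step sits in front of a query pipeline and asks the storage
// layer to make the pipeline's input pages resident on the device that will
// execute it, wherever those pages currently live (GPU memory, CPU memory, NVMe).
enum class Device { kCpu, kGpu };

struct LogicalPage {
  uint64_t column_id;
  uint64_t page_index;
};

class StorageLayer {  // placeholder for the engine's storage/caching layer
 public:
  // Blocks (or schedules an async transfer) until the page is accessible
  // from `target`, choosing the transfer path internally.
  virtual void make_resident(const LogicalPage& page, Device target) = 0;
  virtual ~StorageLayer() = default;
};

// The step performed before invoking a pipeline on `target`.
void prepare_pipeline_inputs(StorageLayer& storage,
                             const std::vector<LogicalPage>& inputs,
                             Device target) {
  for (const LogicalPage& page : inputs) {
    storage.make_resident(page, target);
  }
}
```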
Cool, cool. Yeah, just on a quick aside, I've just quickly Googled what Proteus was, and it turns out he was an early sea god, or god of rivers, and one of the several deities whom Homer calls the Old Man of the Sea. So yeah, he was some sea god, basically, in Greek mythology. But yeah, cool, there we go.
Nice. Okay, cool.
So given that, let's talk about the evaluation.
Yeah.
What were you trying to ascertain when you were evaluating HetCache?
What were the questions you were trying to answer?
And can you maybe just tell us about your experimental setup and workloads used and things such as this?
Yes. So there were kind of two main questions we wanted to answer.
One was how sensitive to selectivity the staged semi-lazy transfers were. And the other was, in ideal conditions, how memory efficient our approach could be.
How much less memory could we use
and still achieve approximately in memory performance?
And for the workload, we were using queries
from the star schema benchmark.
And in terms of hardware, we were running on an AMD EPYC server with an NVIDIA A40 GPU.
And we had 12 PCIe 4 NVMe SSDs.
Cool.
Can I interject real quick?
What does PCIe, I should have asked this earlier on, what does PCIe stand for?
What's the acronym stand for?
You know what?
I'm going to Google it.
It's something, something interconnect express.
Okay.
Peripheral Component Interconnect Express. Got it.
It's called Express because there was an
older PCI,
which predates PCIe, which
was older and slower.
Nice. So they tacked the E on the end of it, and we've got the Express.
Cool. Nice.
Sorry, I interrupted you there. Did you have anything
else to add on, like experimental stuff?
So you mentioned you used the star schema benchmark.
Can you maybe tell us a little bit more about that?
Yes.
So the star schema benchmark is intended as a benchmark to emulate kind of data warehousing queries.
And as its name suggests, it implements a star schema, and there are a number of different query flights within the benchmark. There are four query flights, and each query flight has a number of queries that are similar but vary in their selectivity. And it's meant to give a sense of real-world-ish workloads for data warehousing.
Nice, cool.
So let's talk about the key results then.
Tell me the numbers.
Yes.
So our two main results are that stage semi-lazy transfers
can improve query execution times on GPUs
with NVMe resident data quite significantly.
So in the case of query 3.4,
which is a very selective query from the benchmark,
we can improve the execution time by 45%
as we move the bottleneck for this query away from the GPU interconnect.
Then from the caching perspective,
we show that when using both the CPU and GPU for processing,
we can get within a couple percent of fully in-memory performance
without storing all of the data in memory. So for query 1.3 from the benchmark, which is a fairly
lightweight query that does little more than a table scan, we can use 25% less memory to achieve
in-memory performance. But for query 3.1, which is more processing intensive, we can approach in-memory performance without using any system memory for caching.
And for both of those queries, we still use a little bit of the GPU caching. It comes down to caching policies being workload-aware and knowing which data is useful or not useful to admit to the cache.
Nice. And so that's obviously the headline number there, 45%. And it sounds like you've really made the point that we need to care about our workload here. So I kind of want to touch on,
are there any situations necessarily
where HetCache's performance is suboptimal?
Or is it always going to be an improvement?
I'm just going to hit you with it here.
What are the limitations of this approach?
Yes.
So first of all, any caching solution,
if your workload's never going to read the same data twice,
it's not going to be all that helpful.
And HetCache is also not going to be super great when you have a very diverse set of queries which all have the same working set.
And by diverse here, I mean in terms of the selectivities of the queries and also the processing throughput of queries.
So in this case, even if half of your queries don't need any data in memory, if a few of those
queries heavily depend on memory, you're going to need to cache the data regardless to have the
maximum performance for the workload overall. And so there won't really be a memory efficiency
gain in that case, just because while it's useful for half
your workload, the other half of your workload using the same data really needs the memory.
And kind of on the limitation side, we currently, in our implementation, kind of assume that the
columns are reasonably uniformly distributed, as from our planner we use the per-column selectivity estimates and assume that they apply to each page. So if you had wildly different selectivities per page, the current
implementation will struggle a little bit. But if you had good per page statistics for the estimated
selectivity, it should still perform decently. And the other main limitation at the moment is
that it's currently tailored to full table scan workloads.
It doesn't work with data skipping, such as skipping partitions or page skipping with zone maps.
But this is actually something we're working on at the moment.
Nice. I look forward to reading that paper when it comes out.
So you mentioned earlier on that this obviously is just totally read-only. How would it look if you added some sort of writes into the mix? Or is the sort of work we're targeting here purely analytical, which typically is read-only, right? But then how does ingestion of data and stuff kind of work?
Or is that kind of an orthogonal problem to this?
In some ways it's orthogonal, but not entirely.
In terms of ingesting data, if you're append-only, it's relatively simple.
It's nearly as simple as being just read-only.
If you start adding updates and transactions into the mix,
it starts getting a little bit more complicated
as you start needing to think about consistency
and data freshness and snapshotting.
And I think it is applicable,
but there would be substantially more engineering effort to make it work.
It's something that I've been whiteboarding with colleagues in my lab because I have colleagues who are interested in HTAP or hybrid transactional analytical processing.
I'm trying to figure out how to integrate and create a buffer pool that works both well for transactions and well for analytics.
Nice, nice. Like orthogonal goals.
One is like latency and the other is like bandwidth.
That sounds good.
So I mean, on that as well,
so obviously you've run the star schema benchmark.
How would it, I mean, I guess that was probably the reason why you chose that one, right? But did you look at maybe doing something else, like using TPC-H or the other benchmarks, to see how it would fare on those? Or do you think it'd just be the same sort of outcome?
I think it would probably be similar-ish. The reason that the Star Schema Benchmark was particularly...
The reason we chose the Star Schema Benchmark
was because it has these query flights
that have similar queries with similar processing throughputs,
but with different selectivities across the flight.
That was the main reason I went with it,
because it was
helpful to show that selectivity sensitivity.
Sure, sure, that makes sense. Yeah. Another thing that kind of springs to mind, off the back of what we've just been talking about with respect to transactions, is that obviously this focuses heavily on caching for analytical workloads. Would there be any benefit at all if you had the same sort of heterogeneous hardware, and let's just say we're all OLTP for a second, we're not doing any of the big heavy stuff? Can it be adapted to that situation as well? Or is there no real benefit, because obviously the bottlenecks are in different places for that type of workload, so maybe this wouldn't be useful? But I'm just thinking, if you had a transaction processing system on top of heterogeneous hardware, how could you leverage that?
That's a good question. There hasn't been
a huge amount of work of transaction processing
in general on heterogeneous hardware. There's been a little bit recently
of some people...
There has been work
on doing transaction processing on GPUs,
but it is reasonably tricky,
especially with ad hoc workloads
because GPUs aren't really as good
at this kind of task parallel processing.
But ignoring the merits of GPUs for transactions,
I think it probably applies less
because when you're doing transaction processing,
you have, to use the word again,
you have a bit more of a heterogeneous workload
and transactions may be accessing like quite different
data. And so the way you partition the work across your CPU and GPU might be quite different than in
the analytical case. Like in the GPU case, I think a common-ish approach for OLTP workloads is to batch transactions together and ship them to the GPU for execution. And so I think in that case, existing approaches already have to worry about where the data is for each
transaction,
rather than worrying about how much of a query the CPU or the GPU will process.
It's more like, where will each transaction execute?
Sure. That sounds good.
And also, as well, I'm guessing HetCache is publicly available, right? And I want to ask, as a follow-on to that, is Proteus purely an academic system, or does it have any sort of use outside? Is anyone using it for anything other than experimentation?
Yes. Proteus, well, no. Proteus
is purely an academic system at the moment. And while it's not
currently open source, we are actively in the process of open sourcing it.
So Proteus should be
available in the near future. Sounds good. I can
update the show notes when it does become publicly available.
Cool.
So my next question is, as a software developer, for example,
how can I leverage these findings in your research?
On top of that, what sort of impact do you think these findings can have
for system architects,
for software developers, and people working with databases?
Yeah.
So I think the main impact is to inspire people to rethink caching and the storage hierarchy.
And while the exact approach that we've taken in this paper may not be directly applicable to every system in the world, I think it's important to realize how powerful modern storage
is and to think of the hierarchy
as less of this strict linear hierarchy
we all learned in undergrad computer science,
where each level is more performant than the last,
but to think of storage media, including memory,
in terms of its properties instead.
For example, is it byte addressable
or is it block storage?
How much bandwidth does it provide?
What is its latency?
And I think thinking about storage with these sorts of properties, instead of as a hierarchy, will only get more interesting and relevant over the next decade as we get more interesting and niche storage devices.
For example, I think Samsung has this hybrid DRAM flash byte addressable CXL
storage, which doesn't really fit in this current storage hierarchy anywhere. It's like byte
addressable persistent storage. You can't just like slot that in as like, ah, that goes between
memory and my NVMe SSD.
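A small sketch of that "describe media by their properties" framing; the struct and the values in it are purely illustrative.

```cpp
#include <cstdint>

// Characterize each storage medium by its properties rather than by a fixed
// position in a linear hierarchy. All numbers below are illustrative only.
struct StorageMedium {
  const char* name;
  bool     byte_addressable;   // can we load/store individual values?
  bool     persistent;
  double   read_bandwidth_gbps;
  double   access_latency_us;
  uint64_t capacity_gb;
};

constexpr StorageMedium kExamples[] = {
    {"DRAM",           true,  false, 125.0,  0.1,   512},
    {"GPU memory",     true,  false, 600.0,  0.5,    40},
    {"NVMe SSD array", false, true,   86.0, 80.0, 16000},
    // A byte-addressable, persistent CXL flash device would not slot neatly
    // between DRAM and NVMe; it is simply another point in this space.
};
```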
Cool, so that could be a game changer, right? How far away is that from being available to play around with? Or is it just sort of rumored at the moment?
It's one of those things where it's been announced, but not when it will be available. And I believe that if you are a partner with Samsung, you might be able to get access to this hardware for evaluation, but it's not publicly available currently. This is pure speculation, but my guess would be it's a year
or two away.
Look forward to
seeing what impact that has. I guess
kind of following on from that,
it's hard to predict this sort of thing,
but do you think there are, maybe this is it,
maybe that is the game changer, but do you think there's any other
sort of hardware on
the horizon that could
really change the way we totally architect systems,
like totally rethink everything.
I mean, this probably sounds like a candidate actually,
but is there anything else sort of out there that you think might be
interesting?
There is a couple,
like one like broad class of hardware that's coming in the coming years is
CXL. And so CXL is kind of like PCIe,
but better with lower latency
and which will enable some interesting new devices.
And particularly the one that people are very hyped up
and excited about is this idea of memory pooling
and disaggregated memory or remote
memory, far memory, whatever people want to call it, where you can attach more memory
to your server by basically just like plugging it into a CXL slot or accessing it across
the rack, where it could be in a different chassis or server. And so you can
dynamically expand the amount of memory available to you. And I think this will be particularly
interesting, especially because in future iterations of CXL, you'll be able to share
this memory potentially across different servers, and so you may have shared memory pooling. I think that will be interesting, as we're going to end up with kind of like byte-addressable far storage, and I'm interested to see what people do with this.
Plenty of interesting things on the horizon. My next question is going to be,
how easy do you think it would be to
take HetCache and implement it into an existing system?
Do you think it would be pretty straightforward to do or difficult?
It really depends on the exact system.
There are a couple of axes at play here.
First, whether or not the database is already
designed to run on GPUs or not. For a system that uses GPUs or uses GPUs and CPUs, I think that
staged semi-lazy transfers would be reasonably straightforward to integrate. If the system
doesn't leverage GPUs or other types of accelerators already, then, you know,
staged semi-lazy transfers are not really applicable unless you also want to build a processing engine for accelerators.
And kind of the second part is the execution model. And so implementing HetCache with a pipelined execution model is more straightforward, as it's easier to observe how quickly a query is consuming
the input data. And for the caching aspect, it may be challenging software engineering,
as most current caching policies don't need to be aware of what they're caching, or really of higher levels of abstraction; they just need to know how often a page is used. So implementing HetCache may require providing the caching layer with quite a bit more metadata from higher layers than is conventional.
I would imagine that could make the software engineering aspect a little tricky.
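A hypothetical sketch of what that extra metadata could look like at the cache interface; the names are made up for illustration and are not an existing API.

```cpp
#include <cstdint>

// Extra hints a workload-aware policy would need from the layers above it,
// compared to a classic frequency/recency cache that only sees page IDs.
struct AccessHints {
  double estimated_selectivity;  // from the planner, per column/page
  double processing_gbps;        // observed throughput of the consuming query
  bool   consumed_on_gpu;        // which device will read this page
};

class CachePolicy {
 public:
  // A conventional policy only needs the page ID; a workload-aware one also
  // uses the hints to decide whether caching this page actually helps.
  virtual bool should_admit(uint64_t page_id, const AccessHints& hints) = 0;
  virtual ~CachePolicy() = default;
};
```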
Yeah, sure. I mean, I guess in the same breath, though, 40% potential performance gains is quite an attractive proposition to strive for, and probably worth the engineering effort. But yeah, cool. The next series of questions I ask, I kind of ask everyone; they're kind of like my stock questions, but it's always interesting to see the diversity in answers across them. So the first one is, what was the most interesting and maybe unexpected lesson that you learned while working on HetCache?
So I think the most interesting lesson
is that storage, or at least locally attached storage,
is really quite fast.
And in some cases, it's so fast,
you need to begin looking at CPU microarchitecture.
So when I was working on this last year, I was working on an AMD EPYC Milan machine. And each of these CPUs is composed of four different chiplets sitting on top of an I/O die. And so there are four PCIe roots, one for each of these chiplets. And these chiplets contain the cores and their caches. And then the I/O die has the PCIe controller and the memory controllers.
And what I ran into is that I had, I think, all of my drives sitting on one of the NUMA nodes corresponding to one of the chiplets.
And I couldn't get all my storage bandwidth.
And I was going nuts trying to find a bug in my code,
thinking like, why can't I get my storage bandwidth?
What's going on here?
And then after a couple of weeks looking at my code,
I'm like, oh my God, it's a hardware issue.
Because they're all attached to this one NUMA node,
the way it was working was that all these drives
were using the memory bandwidth from that local NUMA node.
And so I couldn't achieve my expected storage bandwidth
from that many drives
because I was out of memory bandwidth on that NUMA node
with the bias configurations I had.
And then I was like, wow, chiplets are complicated.
And storage is more complicated than I thought it was.
And I went into a long deep dive
about how
CPUs work and how PCIe works. Sounds like a fun journey though. It was fun. It was,
you don't really expect it to be a hardware issue when you're writing code. You always kind of
expect that you're the one who's messed up somewhere. Yeah, that's true. That's really
cool. How long was it before you realised it was a hardware problem?
I think it was like 10 days or like two weeks.
That's not too bad. I mean, at least you weren't like kind of at that like banging head against the wall for like months before you realised, right?
So two weeks is not pleasant, but I mean, still. Yeah, could have been worse.
Great.
So, I mean, you kind of hinted at there, like, kind of you've been on this journey for a while. So from the initial idea of this paper to the side of publication, what were the things along the way that you tried that failed?
And just maybe talk us through that sort of research journey in general.
Yeah. So the initial idea was how do we combine fast NVMe storage with GPU
acceleration for analytical query processing?
And so kind of the initial line of work we had started was looking at
compressed data and whether we should be caching compressed pages on the
GPU because GPUs are pretty good at decompression.
And that would mean we could get more data
over the interconnect to the GPU.
And we could also cache more data on the GPU
and then decompress it on the fly
when we do query execution.
And we kind of ultimately gave up on this idea
because I got stuck in a pit of engineering
and ran into some other weird
bugs that I could not explain after a couple of weeks, and so I kind of put it on the back burner and shifted ideas.
Yeah, it's on the backlog, we'll move on. Yeah, you may come back to this compression and GPUs idea. How long was this end-to-end journey then?
From like, let's do this to let's publish.
How long was that journey?
So I had published a paper at DaMoN 2022 looking at this NVMe storage and CPU execution side.
And then after that,
my advisor was like,
you should submit something to CIDR.
Think about what you should submit.
And so basically since DaMoN last year, I've been thinking about it. That's how long the process has been.
And I've also started working with GPUs because one of the senior PhDs in the
lab at the time was a GPU expert
and so having him on hand was very helpful guidance.
Yes, I can imagine. That's cool. And so where do you go next then with HetCache? What's in store in the future?
There's a few things that are kind of related to each other and kind of not in some ways.
One, at some point, I want to get back to this compression part, because realistically, at the moment the data is stored completely uncompressed, just because the overhead of decompressing data on the CPU is a pretty high overhead.
And then another direction is potentially looking into network storage, especially as
we have these high bandwidth NICs these days and seeing
maybe whether these ideas can also be applied to network storage and whether we need to
like cache data locally.
And that's kind of like also kind of like looking more into like the cloud context with
like disaggregated storage.
That's a very vague idea at the moment. And then a bit more concretely,
trying to work on support for more than just full table scans
or full column scans
and adding the support
for data skipping
and taking that into account
in the caching.
Nice.
Plenty of interesting research directions there.
That would be a lot of fun to be had,
I'm sure.
Great stuff. So, I mean, kind of on that, how do you go about generating ideas of where to take your research? And how do you then determine which ones are worth pursuing?
It's a very difficult question to answer entirely.
So one of the sorts of ideas is we work on an academic database system, right?
And so there are lots of things that are unimplemented until we need to use them and implement them.
And so whenever we need to implement something, for example, a buffer pool, we sit around
and whiteboard things for a while, thinking about what assumptions people might have made in
the past to design these components and seeing whether these assumptions are still true for new
or emerging hardware. And so we often find we get good ideas just by implementing stuff and seeing
what problems or insights we find along the way. For example, when I originally started my PhD,
I had no idea what I was going to do.
And someone pointed me at like, oh, we have fast storage now,
like maybe write a join algorithm that's optimized for spilling to NVMe
storage if the intermediate state is too large.
And when I started doing this, I was like, huh,
the storage isn't the bottleneck for query processing for a bunch of these
queries. That's odd.
That goes against like the assumption I have in my head that storage is always
the bottleneck.
And so kind of just like running into like insights by implementing things and
seeing what happens.
That's a nice process. I like the approach of having the brain trust as well, where you kind of get down to it: okay, now we need to implement a buffer pool.
How should we do this?
That sounds like a really enjoyable, creative process.
For sure.
Great stuff.
So, I've got two more questions for you.
The penultimate one is kind of, I guess, quite high level, quite vague.
What do you think is the biggest challenge in database research now?
I think that one of the biggest challenges is how hardware is becoming increasingly heterogeneous. And this is difficult because there's a trade-off between
performance and portability. So conventionally, if you're doing performance tuning on, say,
a regular CPU like x86 or ARM, it's reasonably straightforward. You kind of think about your
cache hierarchy and fitting stuff into memory or into caches. But when you have all these different accelerators, it gets more difficult to reason about things, especially if you have software that can run in lots of different places and you're ensuring that it's performant in all of those places. And I think hardware is only going to become more heterogeneous.
For example, Intel's taken this turn with Sapphire Rapids, Sapphire Rapids being the CPUs they released a couple of weeks ago. And with this new generation of CPU, Intel has piled on all these accelerators into their new CPUs.
And this means that software developers have to take into account all these different features that may or may not be present on all the different systems they run.
And so I think that we need to have new abstractions in order to use these accelerators, because otherwise they're just too difficult
and time consuming to be worth programming for, even if they do result in more efficient
or faster processing.
And from a data movement perspective, specifically, there was a really good paper at CIDR this
year as well, called Data Pipes, which basically made data movement declarative. And then underneath the hood would
leverage these different hardware features if they existed to do transfers the most efficiently.
And so I think working on these sorts of abstractions to make programming for heterogeneous
hardware easier is going to be really important in the coming years. That's really cool.
And now it's time for the last word. What's the one
thing you want the listener to take
away from your work on Hetcache?
Cool. I'm going to cheat and say
two things. Okay, yeah, yeah.
I'll let you off.
First, NVMe storage is
really fast and competitive
with main memory for primary storage
for coarse-grained accesses.
So for scale-up systems, we should preferably be using memory where it's actually better
for latency-critical or fine-grained accessed data, and then leverage NVMe storage for data where
the bandwidth aspect is more important than the latency.
And secondly, the storage hierarchy is only going to get messier and less linear over the next decade.
And so when we have new technologies such as CXL remote memory
and accelerators only getting more common,
we're going to need a new mental model
for thinking about the storage hierarchy.
Amazing. Let's end it there.
Thanks so much, Hamish.
It's been a fantastic conversation.
And if the listeners are interested in knowing more about Hamish's work,
we'll put links to all the relevant materials in the show notes.
So you can go and check those out and we will see you all next time for some
more awesome computer science research. Thank you.