Disseminate: The Computer Science Research Podcast - Tarikul Islam Papon | ACEing the Bufferpool Management Paradigm for Modern Storage Devices | #35

Episode Date: June 20, 2023

Summary: Compared to hard disk drives (HDDs), solid-state drives (SSDs) have two fundamentally different properties: (i) read/write asymmetry (writes are slower than reads) and (ii) access concurrency (multiple I/Os can be executed in parallel to saturate the device bandwidth). But database operators are often designed without considering storage asymmetry and concurrency, resulting in device underutilization. In this episode, Tarikul Islam Papon tells us about his work on a new Asymmetry & Concurrency aware bufferpool management scheme (ACE) that batches writes based on device concurrency and performs them in parallel to amortize the asymmetric write cost. Tune in to learn more!

Links: ICDE'23 Paper | Papon's Homepage | Papon's LinkedIn | Buy me a coffee

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host, Jack Wardby. A reminder: if you enjoy the show, please consider supporting us through Buy Me a Coffee. It really helps us keep making the show. Today, I'm pleased to say I'm joined by Tarikul Islam Papon, who will be telling us everything we need to know about his recent paper, ACEing the Buffer Pool Management Paradigm for Modern Storage Devices. Papon is a PhD student at Boston University. So I'll give you a very short introduction there, Papon, but maybe you can tell us a little bit more about yourself and how you became interested in data management research. Hi, Jack. Thanks for the introduction, and I'm really happy to be here.
Starting point is 00:01:02 So I am, as you have already said, a PhD candidate at Boston University. I started here in 2019. Before my PhD, I worked as a lecturer in Bangladesh, where I completed my bachelor's and master's; I served as a lecturer there for four years. And during my master's, I got involved with a couple of projects that involved handling large amounts of data. One of them was actually to find the most influential nodes in a social network, so there we had to crawl billions of nodes and eventually handle a large amount of data. So from there, I kind of grew that fascination toward data management.
Starting point is 00:01:47 And eventually, I decided to do my doctorate over here. Fantastic stuff. So yeah, today we're going to be talking about secondary storage devices and buffer pool managers. So maybe we can take one of those at a time and start off by telling us about hard disk drives and solid state drives and what they are and how they differ yeah so i mean for the last couple of years i've been working on this domain of how to actually build better data systems for modern data modern storage devices so by modern storage devices i mean solid state drives so if we take a look at like the landscape of the storage devices,
Starting point is 00:02:25 it has changed quite dramatically over the last three or four decades. From tape, we moved to hard disk. Hard disk has been the dominating device, and over the last couple of decades, SSDs, or solid-state drives, have been slowly taking the place of hard disk.
Starting point is 00:02:44 So if we want to talk about the fundamental differences, or the fundamental properties, of these two types of devices: hard disks are mechanical devices. There are a few rotating platters, and there is a head that moves. Because of these mechanical components, there is a restriction on how fast you can actually access data from hard disks, and random accesses are slower because of this movement; sequential accesses can be faster. In contrast
Starting point is 00:03:10 to like all of these ssds are purely electrical device so they store data in chips and big and they kind of adopt a very hierarchical structure which allows them to perform concurrent IOs like you can do multiple reads and writes parallelly which hard disks do not allow you to do that and yeah so because of the adoption of this electrical device and also the hierarchical architecture SSDs have fast random access and also quite fast sequential access as well. And one key bottleneck or key challenge of SSDs is you can only write a limited amount of times in an SSD. So every time you are writing, you are slowly destroying the cells a little bit. So it kind of puts a maximum limit of how many writes you can perform on the disk and writes are slightly slower than writes can be up to
Starting point is 00:04:12 one order of magnitude or maybe up to 10 times slower than reads it varies across devices and writes are slightly slower than reads so essentially these are the basic differences between these two types of devices awesome yeah just out of interest there you said that every time you write to a an ssd you destroy you destroy it a little bit and that's kind of a permanent permanent thing how how many how long how i forgot really right intensive maybe we can talk about later on but a really right intensive sort of work on how long will my SSD last? So yeah this is a very good question and the answer is actually it depends like what kind of SSDs that you are using obviously
Starting point is 00:04:51 but for a normal consumer for a regular consumer I guess it will last around five to ten years with heavy right extensive environment and yes yes, as you can see, like most people will not actually realize this issue. So this is an issue. And obviously when you want to develop something for SSDs, you need to keep this in mind. But at the same time, like most people will not never really actually realize this issue
Starting point is 00:05:19 because the number of writes you can do, it is not too small. So it's quite big actually. Yeah, like most users won't run against that sort of extreme. That's a good point. Cool, so that's one half of the story today. So then the second half is these buffer pool managers. And can you maybe tell us a little bit more about what a buffer pool manager is
Starting point is 00:05:38 and what role it plays in a database? So essentially, most of the problem of the data management community comes from the fact that secondary storage devices are slower than main memory. So that is kind of like the fundamental bottleneck that where we are at. So when you have a database, if the database is stored in your storage device, and you need to access the database so what most database systems does is that it kind of keeps a small memory in the small set of pages it keeps track of small set of pages in the main memory so that works like a buffer uh so what is the purpose of this buffer it keeps track of let's say most frequently accessed pages or the pages that are being most frequently
Starting point is 00:06:26 being accessed. It keeps track of them so that you do not have to go to the storage device to access them. That is essentially the purpose of the buffer pool for a database, for a DBMS or database management system. Now, most research actually focuses on developing the scheduling policy for this buffer pool, like which pages to keep in the buffer pool. That has been a very active area of research, and many algorithms, like LRU (least recently used), least frequently used, clock, and so on, have emerged. And there have been new policies specifically for flash
Starting point is 00:07:06 devices uh which try to minimize the number of rights because ssds suffer from that right endurance problem uh we kind of wanted to tackle this problem from a different perspective we did not want to modify the page replacement policy rather uh we wanted to see if we can actually use the devices internal properties to get the most benefit out of the devices. So by utilizing the device more, you can use this with any page replacement policy. And we wanted to see that how we can make the buffer pool faster. And if the buffer pool is fast, the whole database performance is actually affected through that. Amazing. That kind of preempts my next question.
Starting point is 00:07:44 As I say, can you give us the uh the elevator pitch for ace but i guess that's it it's like rather than changing the algorithm we use to kind of move pages in and out of the buffer pool it's more how can we exploit and kind of co-design the the system with the hardware to kind of to make better use of the actual properties of the underlying hardware right yes exactly exactly you kind of summarized it very well so kind of just linking that back to kind of you mentioned a few sort of algorithms have been designed for like flash and other things what are i mean these two sort of challenges and shortcomings you identify in your paper of these existing profitable managers can you maybe elaborate on what these are a little bit more yeah so most of these approaches what they try to do is that first of all they always read and like they exchange one page for one they can exchange one right for one
Starting point is 00:08:37 read so for example when you want to evict a page from the buffer pool and if that page is actually a dirty page so you have to clean it by dirty i mean that there has been updates applied on that page and when you are flushing it you you need to apply those changes permanently in your storage devices so you have to write it and it is very possible that probably the request that caused this eviction is a read request so you are essentially exchanging one read for one write and none of these approaches or none of the as far as we know none of these approaches actually take into account the device concurrency so can you can actually they always evict one page from
Starting point is 00:09:18 the buffer pool and read one page can we write multiple pages from the buffer pool that was the started that was the question that we initially started that can we write multiple pages from the buffer pool? That was the question that we initially started, that can we write multiple pages concurrently so that we can clean most of the pages and then the next request will be mostly for clean pages and there will be spaces in the buffer pool since you have already written back multiple pages. So that's kind of what started the whole idea.
Starting point is 00:09:47 Kind of with these challenges in mind then, and when you were kind of trying to answer this question and starting to work in this sort of space, you kind of, I'm guessing somewhere on there, you kind of, okay, we want to refactor the buffer pool design space. I guess this maybe happened at the start when you were trying to get a grasp of what's out there and how we could do this. So can you maybe tell us about this refactored design space i guess this maybe happened at the start when you're trying to get a grasp of what's out there and how we could do this so can you maybe tell us about this refactor design space
Starting point is 00:10:08 you came up with what it looks like and then how did it go about influencing how you designed ace so just to summarize that like the major two challenges of state-of-the-art systems are essentially they do not consider the asymmetry of the device so writes are slower than read if you need to consider this and also the main power of modern storage devices that is concurrency that you can actually perform concurrent IOs these systems do not take into account that so we kind of ask the question and and one other one other key thing like this is kind of associated with these two challenges is that when you are writing when the writing and eviction are two different steps
Starting point is 00:10:52 but essentially all these approaches they kind of take two decisions like these two decisions based on one decision and that is the page replacement policy like which page needs to be evicted if the page is dirty then you write it back so there is no separation between between like the eviction and also the write back so we wanted to ask that it can by decoupling these two these two design decisions and that is the which pages to be written back and which pages to be evicted uh can we do something? And also, can we evict multiple pages? So we kind of drew a new buffer pool design space and we listed these questions that can we write more pages concurrently? Can we write, can we evict multiple pages from the buffer pool?
Starting point is 00:11:38 If so, can we prefetch multiple pages so that the performance can be impacted? So overall in the design space, there are like four major components. That is the eviction policy, and it includes the page replacement algorithm and along with the design decision that how many pages to be evicted from the buffer pool. And then there is the component of writer
Starting point is 00:12:01 and the writer component decides how many pages to be written back concurrently. And we always decide to write multiple pages since devices can support concurrent IOT. And the other component is the reader. And the reader essentially is a prefetcher. And we have tested with two prefetchers, but any prefetcher can be integrated.
Starting point is 00:12:23 And honestly, in our experiments, we did not find prefetching to be that effective because it heavily depends on the workload characteristics. But yeah, you can, I mean, if you know that the workload has a predictive pattern, then it helps. We have also seen that. So yeah, with all this,
Starting point is 00:12:42 that is kind of like the refactor design space where we split up the eviction from the writeback. Awesome. Yeah, that makes total sense. Cool. So I guess kind of with that then, tell us how ACE works. So the primary idea of ACE is pretty simple. That is every time you need to write back a page, you write multiple pages. And how many pages do you write? You write based on the device's concurrency. So now I will talk a little bit about what the device concurrency means.
Starting point is 00:13:10 So we experimented with several SSDs, and we found that every device has like a sweet concurrency point. And this actually varies from read requests, from write requests. So for write requests, most SSDs have a concurrency value of around 8 to 25. It depends based on the devices. So let's say the write concurrency is 8. That means that you can actually issue 8 IOs at the cost of one IO, almost at the cost of one IO. It should not be like, it is not exactly one IO, but the increase in the latency is not that significant. And the main thing is that when you are issuing these eight IOs,
Starting point is 00:13:54 you get to utilize the full bandwidth of the device. That is where the main benefit comes from. So a device that can support, let's say, 30 megabytes per second for one IO, it can give you up until 240 Mbps or something like this. It might not be exactly linear, but it can give you almost 8x or 6x throughput. So by using that concurrency value, that sweet spot, that right concurrency value, it decides how many pages to write back and if the prefetching is enabled then also then s also evicts evicts multiple pages from
Starting point is 00:14:34 the buffer pool and this way it clears up uh the space and that's that space is filled up by the prefetcher and the choices are different for uh the right the pages to be written back and the pages to be evicted in that sense that the pages that are being written back that are those are always dirty pages and the pages that are being evicted that is driven by the page replacement policy so it will just evict the last uh eight pages from the buffer pool or x pages from the buffer so as we touched on they buffer pool. So you guys were going to touch on that. You said that there's a sweet spot in the concurrency. So my question is how you determine that.
Starting point is 00:15:10 Is it sort of based on like the hardware characteristics? So it's a sense of like, I know I've got this spec of SSD basically. And I can say, okay, this pretty much is around eight, whereas this brand is around like, don't know 20 or is it more could you get a lot of variability kind of within brands yeah yeah this is a very good question uh so to know the to know the exact like very exact value of these right concurrency uh what you should do is perform a benchmarking there you try to do a stress test performance uh using like varying the number of ios of the device and uh computing or
Starting point is 00:15:54 calculating the calculating the bandwidth or the throughput that you are achieving that is like the best way to do it but if you let's say you don't have the, you don't know like which tool to use or for this specific example, you only have the specs of the SSDs, you can almost blindly use the number of channels of the device, the optimal right concurrency of the device, but not necessarily the exact value, but it will give you a very close estimate of the value. Yeah, you'll be in the right ballpark, right? So if the true value is 10, you're picking 11 rather than picking 20, right, I guess. Yeah.
Starting point is 00:16:39 Cool. So, I mean, obviously you mentioned earlier on, going back to this, like SSD sort of like degrade over time. Does that mean that the suite spotting I mean, obviously, you mentioned earlier on, going back to this, like, SSD sort of, like, degrade over time. Does that mean that the sweet spot is also changing over time as well? That is actually a very good question, and I will test it out.
Starting point is 00:16:56 That's what I will say. So the thing is, to test it out, you really need to run it for a very long amount of time. So I have been working on these devices for like one and a half years and I haven't really, even very recently, I actually performed an experiment and I did not see any degradation of these devices. So the device's write endurance might, so I mean this is only my hypothesis hypothesis uh so the devices write endurance might not actually affect the bandwidth or the read write concurrency so you can write probably let's say uh one billion
Starting point is 00:17:33 times in an ssd cell uh but every time you are writing you will get almost like very close to that bandwidth that you would have achieved from the devices. So I don't think it directly affects it, but yeah, it is a possibility. And to do that, we need to actually test the value for a longer period of time to actually see what is happening. So before we go any further, actually, I just wanted to kind of ask, what does ACE stand for? It's an acronym, right? So I kind of, yeah, what's the meaning behind it? The acronym stands for Asymmetry Concurrency Aware Buffer Pool. So
Starting point is 00:18:12 it kind of takes into account the two key properties of ACCDs, that is read-write asymmetry, writes are slower than reads and the concurrency, that is it can perform concurrently. So from that it came to become ACE. Nice, it's a catchy acronym it's a good name i'm glad that you like it so we're talking about implementation there so i know for the experiments in in in your paper you you
Starting point is 00:18:39 implemented ace in postgres so i'm i'm interested to know kind of more about that experience and how difficult it was dealing with kind of, I'm guessing, quite a large and old code base. And yeah, tell us a little bit more about the implementation effort. Yeah, the implementation effort was heavy, but the thing is we did not really change most stuff of the DBMSms like in post-based most of our changes were focused in the storage region and very specifically in the buffer pool and we needed to implement some low level uh low level primer that i could actually use the concurrent ios of the devices so we needed to write some libraries for you to get them um and yeah the the bit like the benefit of ace is that uh so to test this we had to implement uh several page replacement policies in postgres and the the implementation of the page investment policies like we tested with four
Starting point is 00:19:42 uh lru cf lru cfRU stands for clean first LRU. And there is also another variant of LRU that is LRU WSR and also the clock sweep, which is the default Bayesian placement policy of Postgres. So we implemented all these Bayesian placement policies, but these were totally independent of like the actual implementation of ace so the like that is essentially the beauty of ace that you don't really need to change the replacement policy but it is kind of like a wrapper that sits on top of the uh of these policies and and decides like we how many pages to write and writes them concurrently and evicts them if required how easy would it be to take the existing implementation and port it to a different database system?
Starting point is 00:20:29 So mostly it will require some software engineering effort, not more than that. So it will not, I mean, what I'm trying to say is that it will not take too much effort from the perspective because the idea, you can actually report it very easily that you need to write the the the change is minimal that you need to write multiple pages at the same time and you decide the eviction based on the page's policy and this very simple idea can give you good good performance benefit because you are utilizing the device more you are getting
Starting point is 00:21:03 more utilization from the device that is so yeah so i guess let's let's talk some results then so can you tell us about your experimental setup and and the experiment you and how you went about evaluating it so we evaluated is we implemented it in our server and our server has has very hard, like it is a very powerful machine with 384 gigs of main memory and everything. We implemented this in Postgres and evaluated with the standard APCC benchmark and also we created some synthetic traces and just to stress test is for different scenarios. So just to summarize the results, when the workload contains more writes,
Starting point is 00:21:48 ACE performs more because the main benefit of ACE comes from concurrent writing or efficient writing. So when the workload has more writes, the benefit of ACE is more. It can be as high as up to, like in our experiments, we found that it can be up to 2x.
Starting point is 00:22:08 You can gain in some cases when there is like the, for write-only workloads, for as you reduce the number of writes in the workload, the performance benefit reduces, but you can still gain benefit. We tested with 90 percent reads and 10 percent write,
Starting point is 00:22:30 which is a very read-intensive workload, and we found out that it can still be 15 percent to 20 percent faster. Also for read-heavy workload, if the workload has a predictive pattern, the performance benefit increases slightly. And we also experimented with the TPCC benchmark. And in the TPCC benchmark, the TPCC benchmark consists of five transactions, and each transaction has different properties. So we saw that the write-heavy transaction had the most benefit of, I don't remember the exact number,
Starting point is 00:23:05 but I think it was 1.5x and the mix transactions had a benefit of 1.3x and the read-only transactions did not have any benefit. So the main takeaway from these experiments, what I would say is that even if the workload has a very small amount of right, S can gain benefit from you. But if there is no right, it is fine.
Starting point is 00:23:29 Because then S just converts to a very classic powerful manager. So no performance penalty at all. But if the workload has even a very small portion of right, you will gain some benefit by adopting this idea. And very interestingly, we observed that we performed the experiment across multiple devices and we saw that the device that has the highest asymmetry, that is the device where the wire write is the slowest compared to reads, in those devices has the highest performance benefit and the reason behind this is essentially since those in those devices writes are more expensive if you can
Starting point is 00:24:13 perform efficient writing or if you can perform concurrent writing then you get to amortize the expensive write cost and that gives you the like more speed up in those devices so a few questions before i go that there so in the in the asymmetry there between the read and the right forms is it higher end ssds that have a i mean any correlation between the sort of the cost of the ssd and the asymmetry or is it kind of the cheaper ones have are all over the place you've got a big disparity between read and write performance whereas it's more sort of even for higher end or is it kind of the cheaper ones have are all over the place you've got a big disparity between read and write performance whereas it's more sort of even for higher end or is there no sort of cost I'm kind of maybe touch on the cost implications maybe a little bit uh yeah good question so there
Starting point is 00:24:54 is no direct comparison or no direct correlation between the cost and the reader asymmetry as far as I know the the thing is some SSDs they are read intensive and the read intensivewrite asymmetry, as far as I know. The thing is, some SSDs, they are read-intensive, and the read-intensive SSDs tend to have a higher read-write asymmetry. And it can be very fast. Like, the reads can be very fast, and the device can be expensive and everything, but the read-write asymmetry can be, let's say, 10 or something like this. Whereas there can be devices, like, there can be let's say 10 or something like this whereas there can be devices like there can be very low-end devices which has nominal read rate asymmetry like two or three or something
Starting point is 00:25:33 like this so it does not really directly depend on the on the cost rather the specifics like the internals of the device and yeah what you want to get from the device when designing it this is great it sounds like if like a like almost like a free lunch right so there's this does the wrapper layer introduce no sort of overhead at all like it's totally like it's totally negligible non-negligible is the because i guess there must be some sort of computation happening there to kind of determine what should happen next. But that's very minimal, is it? Yeah. So when you are using concurrent IOs and when you can actually write multiple IOs and at the cost of a single IO, you can get the benefit.
Starting point is 00:26:20 And it is not really a single IO, as I have already mentioned. There is a penalty, but the penalty is not like the exact number of the number number of the ios so you can gain by adopting this very simple idea you can actually gain some performance benefit and there is a catch the catch is the the catch is again uh again almost negligible uh since we ace writes a bit more aggressively since you are writing more like every time you need to write you write back one page you are writing and the number of writing let's say eight pages uh since you are writing a bit more in our experiments we observed that this value remains between 0.1 percent like the increase remains between 0.1 percent you might think that okay it should have
Starting point is 00:27:07 been much much more more because s is writing more pages but more pages at one go but that's not the case because since when you have written back multiple pages the pages has been cleaned so the next evictions will be actually free of cost so the next evictions you will just evict clean pages so evicting a clean page from the buffer pool is not an operation. You just drop it from the buffer pool. You just forget about it. So ACE writes beforehand, it writes preemptively,
Starting point is 00:27:37 but in the long run, it almost gets amortized over time. And the increase in write remains between 0.1 percent which is almost invisible so if you as can say there's no sort of scenarios at the moment obviously where it's very dependent on the workload but where kind of ace's performance is kind of suboptimal or like i guess are there any other limitations of ace at the moment as As far as we know, the right endurance and the slight increase in right amplification, that is essentially the penalty of ACE. Other than that, even if the workload contains a little bit of right, then ACE will have performance benefit.
Starting point is 00:28:22 The concept of ACE will have performance benefit. The concept of Ace will have performance benefit. And the idea is very simple, but you can actually get decent benefits. So that is kind of like the beauty of this idea. And we kind of feel that this way. And during our submission process, we have actually seen that sometimes, I mean, some people can actually understand that,
Starting point is 00:28:50 okay, this is a very simple idea. And how is it novel? And this is kind of like an engineering effort. So you kind of get this, but eventually as we kind of evolved and refactored the buffer pool design space and after adding the prefature, then we kind of got the satisfaction of what we wanted to actually build. So, yeah, even with the simple idea, you can get a good benefit.
Starting point is 00:29:19 Yeah, I mean, there's beauty in simplicity, right? Like something doesn't have to be super complicated to, to, to, to kind of become the worth of something and the, the, the quality of an idea is not linked to its complexity. Right. So the fact that it's, it's, it's simple, it works really well. And it is, it's like it lives up to its name and gives you pretty much good performance in like almost every scenario is like, it's, it's fantastic. And for sure. Cool. So guess well on that kind of where do you go next then so is it kind of can we exploit different types
Starting point is 00:29:54 of hardware or yeah what's the next steps with with ace so like now what we are working on is that we are trying to import this idea of asymmetry and concurrency in different aspects of data management so very recently we have worked on developing a graph database on like a graph manager that can utilize the concurrent io and can implement some algorithms so that is one aspect we also want to work a little bit on as you know as you probably know that our lab works a lot on lsm trees and log structure merge trees so we want to import this idea of asymmetry and concurrency into lsm compactions and see how they can benefit and so yeah the like the buffer pool was just one example of this idea of asymmetric concurrency, our storage system design.
Starting point is 00:30:49 And you can adopt this idea into multiple different, different applications and see the benefit. Nice little plug there for Suva Deep's episode on LSM trees. So, yeah, the listener can go and check that out as well. Cool. But yeah, I mean, I can imagine some hypothetical scenario where on on sl lsm tree so your listener can go and check that out as well um cool but yeah no so on that also i mean i can imagine some hypothetical scenario where um i don't know i'm running my i'm running my database in the cloud and all of a sudden i want to change the the underlying hardware that it's running on and i want to move from i don't know ssds to some other sort of cool new storage device that my cloud provider is offering me.
Starting point is 00:31:25 Do you think it'd be possible to kind of have that wrapper, like almost be intelligent enough to say, okay, well, the hardware's changed now, so I need to change what I'm doing to give better recommendations, essentially, of what I can write out or what I should pre-fetch, for example? Or is that not a particularly promising research direction?
Starting point is 00:31:51 Oh, that is a very good point and yeah i mean we haven't done that yet but this is actually a very cool idea and uh this this can be actually very easily doable uh when you are running because you know the underlying hardware at least even if you don't have enough time to do a thorough benchmarking and get the exact concurrency values of the devices by knowing little bit of the devices and most of these SSDs they have this smart technology which kind of holds all the information of the devices like the number of channels number of writes that has been performed number of erases and so on and so on it keeps track of a lot of statistics and everything so yeah by getting that it can be i mean you can get that ballpark value and you can probably you can actually seamlessly switch between between devices if required in such a cloud environment but yeah it is
Starting point is 00:32:41 it is actually a cool idea and can be a very good extension uh when using this concept awesome cool so what can what impact do you think this this work can have then like as i know as a software developer data engineer or one of these types of jobs how do you think they could leverage um your findings in their sort of day-to-day working life? Yeah, like the key takeaway, like when you are using or when you are developing systems for SSDs, try to keep in mind the differences between hard disks and SSDs like read-write asymmetry and access concurrency. If you keep in mind that, yeah, we have this property,
Starting point is 00:33:27 so how can we exploit these properties or how can we get the most out of these devices? Can we do concurrent IO and how many IOs can we perform? If you go beyond your limit, that will hurt your performance. So these are essentially the questions that you need, that you sort of kind of keep in mind and then design systems that are better tailored for these kind of situations yeah i think that's really good advice sort of be aware that of the hardware you're running on right and sort of like how can you exploit those those characteristics to kind of
Starting point is 00:34:02 to to make whatever it is you're building more performant right that's really cool um across i mean how long have you been working on on this ace project and kind of following on from that sorry the what's the most interesting sort of lesson you've sort of learned whilst whilst on that journey uh yeah that is a good question so uh i started working on ace in like late 2020s and uh we kind of uh finished it uh last year so it was like one and a half year or maybe some something around yeah i think one and a half year of project but during that time we got to actually this is not like the only contribution of our work we previously also published a daemon paper and a cider paper based on these characteristics of modern storage devices
Starting point is 00:34:52 so we got to learn about uh ssd concurrency how they are different and read rate asymmetry how to measure them we developed a tool to actually do this benchmarking and then we kind of analytically did a study that how the systems that can actually batch writes and can if they can perform those batch drives concurrently how will be the performance improvement we model this and so on and during the development of AS, we got to learn these cool things about SSDs. Like, for example, smart is one of them. You get to know these very cool stats of the devices.
Starting point is 00:35:34 Like one of our reviewers asked the impact of SSD endurance and we kind of had to perform an experiment that shows the number of writes of the devices and it does not increase so i got to learn that all these devices have these statistics in the flash controller and you can access them and you get to know about this so there is like narrowing down to one lesson will be very difficult, but I think these small learnings about different devices and the different buffer pool managers, the state-of-the-art systems you get to learn about them, what research has been done.
Starting point is 00:36:18 And yeah, I mean, narrowing down to one lesson will be difficult. That's okay. You can have many lessons. The more lessons, the better, right? I mean, that's good. Were there any sort of dead ends along the way? Like any things you tried that sort of failed? Or you're like, oh, damn, we need to change this now. Or was it just one big sort of upward trajectory throughout the program?
Starting point is 00:36:42 No, no, no. No, definitely not. So as I said that we kind of started it in 2020 and we almost had a working buffer pool. That was a prototype. So it was not implemented in Postgres. So we had it running in 2021. We went on.
Starting point is 00:36:59 But the key idea more or less remained the same. But there has been a lot of optimizations, like the refactored uh design space that was not a previously there so previously the main idea was just to write multiple pages whenever possible and see what happens with that and like separating the eviction from the write back that was not the that was not initially there. And yeah, we got some very good reviews from the reviewers. They suggested us to implement it in a real system and also how we can actually,
Starting point is 00:37:34 if we can actually use prefeture and so on. So we got some very good ideas and feedbacks from the reviewers and we started implementing those ideas and then we moved on to implementing it in the Postgres. So yeah, it has actually changed a little bit over time, but not too much. But in the buffer pool that was like two years ago and the buffer pool that we have now,
Starting point is 00:38:01 the performance is drastically different because like by the refactoring and also adding some optimizations in the code is actually impacted a lot over time. I guess, is this your main sort of project you've been working on? I guess my question here is, what other things are you working on at the moment?
Starting point is 00:38:21 Can you tell the listener about your other research? Yeah, so I work a little bit on LSM. So we have, with Shubhadeep, things you work on at home why can you tell a listener about your other research uh yeah so i work a little bit on lsm so i we have with shubhadeep we worked on lithi uh so the lithis and delete our lsm engine so i contributed in one of the one of the one of the implementational aspects of lithi uh so i do a little bit of LSM stuff. And one of my other big projects is actually, we call it relational memory. So the idea of relational memory is a software hardware co-design space where you use the concept of hardware specialization and you use specialized hardware like FPGA. So our key idea is that we have a row store and how we can actually access its different columns
Starting point is 00:39:16 without accessing unnecessary data in the cache line. So can we actually do it? And we have built a prototype. We call it relational memory and it can actually use again custom FPGA based hardware. We built the first prototype and on a PSPL platform,
Starting point is 00:39:38 we experimented that and we are seeing that, yeah, you can actually perform like faster than the DRAM. It can like faster than the dm it can be faster than the dm in some cases and for some analytical queries it can be as fast as ddd then so yeah it can provide comparative comparative performance so that is actually very cool very cool work that we have been working on now we are trying to extend this idea of relational memory to the memory controller. And also, we have been thinking about making this a concept more general,
Starting point is 00:40:14 not keeping it specific to database systems. Rather, we are working now on a data reorganization unit that can benefit any sort of data systems like tensors or matrix operations or image processing and that sort of stuff. So you can essentially you can accelerate any operation that involves a set of data or involves transforming data from one shape to another shape. Working on some really cool things there. So, I mean, I really like this which is my favorite question and it's like i'll kind of can you tell me more about your creative process how do you go about sort of generating ideas and then selecting yeah this is something that i'm going to spend the next two to three years maybe working on like how do you go about doing that so i can tell you how we started this project and so the the main idea of this pro this
Starting point is 00:41:08 project was actually this is manos's brainchild and manos had this idea of uh developing uh asymmetry and concurrency our buffer pool so he shared the he shared several ideas with me and one of one of them was this uh concept of developing better data systems for considering the asymmetry and concurrency of devices. And I kind of liked the idea because I don't, honestly, I don't really know. But for some reason, it intrigued me. And I, that probably one of the reasons I have always kind of had a soft corner for hardware, like during my bachelor's, I developed, I worked, I worked a lot on hardwares and stuff. So when I, I got to know that, yeah, there were some projects related to storage devices and also like the
Starting point is 00:41:56 relational memory that is also FPGA based and hardware based. So that probably intrigued me, I don't really know. And now if i want to talk about like the ideas generation process or like the whole process of developing a system uh like what i always want to do is that like to get the solution of a problem you need to understand the problem well and you also need to understand the problem from different perspectives like you can solve a problem from many different ways. Like in multiple angles, it can be done. So if you have the good idea of these different angles,
Starting point is 00:42:33 then try to think like which angle you can actually get a solution from there. So like if you know all the perspective and then you get to have some sort of intuition in one of these perspectives, yeah, maybe if we tweak this, then something can happen. Do some pen and paper work, see if it is actually viable. And then the next step is essentially testing your idea and build a prototype. So, I mean, the first prototype that should work well for, let's say, one to 1,000 nodes. And yeah, after you have a working prototype that kind of supports your hypothesis,
Starting point is 00:43:15 and then you can start working on optimizing it and maybe adding more features or probably better designs and so on and so on so these two steps actually have like it is not like it it does not happen in one iteration there will be some back and forth between the between developing the prototype as you work more and more towards your idea you develop better understanding of the of the whole system and a whole process and you get more ideas. And I guess that's it.
Starting point is 00:43:48 That's a fantastic answer to that question. Like I say, it's really intriguing to see how everyone works because there's so much like the disparity in answers across that to that question. It's fantastic just to see how everyone's got their own little way of doing things. It's great. But no, that's really cool.
Starting point is 00:44:02 Yeah, so just a couple more questions now, Papon. So the first one is, um but no that's really cool yeah so just just a couple more questions now pop on so the first one is is what do you think is the biggest challenge in databases data management research today wow that this is a very like a big question and yeah very big question again i again i don't think I can pinpoint to one problem, but maybe handling the different types of data and the data that are now being generated from many sources,
Starting point is 00:44:36 like from different sources and different varieties of data with this exponential growth of data that is being generated from different sources, handling this large amount of data is a challenge. So there has been a lot of work that is being going on with indexing, learned indexes is a cool thing, and developing good key value store that supports good concurrency control, transaction management. So all these are actually important. And also, like with the new privacy regulations coming in place, so privacy is a big thing.
Starting point is 00:45:12 And like with machine learning, many people are using many machine learning concepts to accelerate many, like small components of data management system, which is also a very cool way of doing things, especially like auto-tuning or robust workload or robust system design based on workload uncertainty and so on and so on. So, yeah, I mean, pinpointing it to one thing will be different.
Starting point is 00:45:44 I guess I have already listed 10 challenges or so on and so on. So a lot of stuff are happening. But I think handling this large amount of data from a variety of sources while ensuring the privacy, I think, yeah, that can summarize the whole thing. Yeah, that sounds great. Plenty of interesting research left out there for us all to do, right? cool um okay last it's time for the the last word now so uh what's the one takeaway you want the listener to to get from this this podcast today know your ssd know your storage device know the know the characteristics of the device and while you are designing the systems make sure that
Starting point is 00:46:27 you keep into account those properties and tune your system to that to get the most out of your device. Fantastic let's end it there then so thanks so much Pappan for coming on the show it's been a fantastic conversation I've really enjoyed it if the listener is interested in knowing more about performance work we'll put a link to everything in the show notes so you can go and find that and again if you enjoy the show please consider um supporting us through buy me a coffee and we'll see you next time for some more awesome computer science research you
