Grey Beards on Systems - 77: GreyBeards talk high performance databases with Brian Bulkowski, Founder & CTO, Aerospike
Episode Date: December 7, 2018
In this episode we discuss high performance databases and the storage needed to get there, with Brian Bulkowski, Founder and CTO of Aerospike. Howard met Brian at an Intel Optane event last summer and thought he'd be a good person to talk with. I couldn't agree more.
Transcript
Hey everybody, Ray Lucchesi here with Howard Marks.
Welcome to the next episode of the GreyBeards on Storage podcast, the show where we get
GreyBeard storage bloggers to talk with system vendors to discuss upcoming products, technologies
and trends affecting the data center today. This GreyBeards on Storage episode
was recorded on November 30th, 2018. We have with us here today Brian Bulkowski, CTO and founder of
Aerospike. So Brian, why don't you tell us a little bit about yourself? And I believe Aerospike is an
in-memory database, and what's so special about that? Hey Ray, great to be able to talk to your
audience. Thanks for having me on. Aerospike is not exactly an in-memory database. We've had to
call ourselves an in-memory database because we have the kind of speed that is associated with
in-memory. I remember one day I was talking to one of the Gartner analysts and I said,
so is Flash storage or is Flash memory? And he looks at me kind of confused, and I'm like, well, you know, what is it?
Is it storage or is it memory?
That's the only two things we have in the world right now,
and now we've got this NAND stuff.
And he goes, well, I think it's memory.
And I said, well, actually, I think it's storage.
And what we've used so far for NAND is we put it behind storage interfaces.
Yeah, well, we do now. We do now. And so the thing about NAND that makes it like in-memory, if you use it properly, is random access,
right? And so storage, what we think of as storage has typically been limited by the five
millisecond seek time. And that's been kind of an ironclad thing, right? We've had five milliseconds, you know, okay, fine, short stroke it down to 2.5 milliseconds.
But, you know, that's been a number since the mid 90s.
So sure, there's, you know, all kinds of fancy tiering that's been done with this
system and that system and, you know, SAN systems that have that kind of
tiering, but still there's this fundamental two and a half milliseconds
that you have to undergo, which simply doesn't exist with Flash.
So when I looked at the world in 2008 and someone said to me,
hey, Brian, I think the world needs a new database,
and I said, you're joking.
That's a hard problem, building a new database.
The answer was Flash.
How can we,
in fact, create internal data structures that are different from the on-disk layouts?
Yeah, and really, it's all about seeks, because streaming speeds of rotational drives are pretty
high. Yeah, Flash can beat them by 2x to 3x, but it's not, I mean, 250 seeks per second or maybe 500 seeks per second versus 100,000, 250,000 per
device. It's like having a 3PAR cabinet in a two-and-a-half-inch drive, right?
Well, not without the resiliency, but yeah. Yeah, I guess. Yeah.
Which is why you need two of them, right? Yeah.
Yeah, you need at least two.
There is a read-write asymmetrical access time thing there,
but I think for the most part, we all agree with what you're saying. Yeah, and the writes are faster than the disks used to be anyway.
Right.
So some of the traditional in-memory database tricks,
Aerospike isn't exactly able to do,
and that's around some of basically joins.
And so one of the nice things when you're on a single machine,
not in a distributed system, and you're in pure RAM,
is having links over to another piece of information
for sort of, you know, OLAP cube style systems becomes a lot easier.
Because it's all addressing of memory and stuff like that.
Yeah, yeah, yeah.
Yeah, exactly.
It's all pointers.
And, you know, it's all five nanoseconds away as opposed to being seven.
Basically, I think of it as about 100 microseconds is what you have to pay for a NAND read.
So we sort of structured Aerospike around that. We saw traditional databases that were working at 5,000, 10,000, 50,000 operations per second, database operations for primary key.
And analytics is all well and good, but what about good old primary key optimizations?
Well, if you're doing that kind of problem, you're a transactional database,
and transactional databases need persistence. Oh, yeah.
Before you go on, when you're talking about SSDs or NAND, it comes in so many flavors these days.
I'm assuming you're talking about NVMe SSDs and not NVDIMM kinds of things with NAND embedded in them and stuff like that?
Or I guess you could probably use both, maybe. I don't know. Well, this comes up to the topic
we're all going to talk about here later regarding, you know, putting storage on memory buses.
But certainly when we started the whole project back in 2008, we were really thinking about things
that looked like storage, right? That were across a SAS bus, across a SATA bus.
Because I didn't really like, I wasn't super enthused
about things like Diablo memory and those dual sided
persistent things simply because they were too small.
They were 16 gigabyte, 32 gigabyte, and it's like,
I don't know, that just seems boring to me.
And I think people who think about storage like to think these days about terabytes and petabytes.
We've started thinking about exabytes. Yeah, definitely. And there's one company out there
talking zettabytes, but yeah. So really, then we had to
do some fundamental operating system research
and try to figure out what's the fastest way through the most common data center operating
system of the day, which is to say Linux, to access storage devices on a SATA or SAS bus.
And the answer was good old read and write. A lot of people sort of took the mmap,
memory-mapped system approach.
We found that to be incredibly slow, four times slower than simply calling read and write through either synchronous or asynchronous APIs.
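For listeners who want to see the two Linux paths Brian is comparing, here is a minimal sketch, not Aerospike's actual code: a plain pread() on a device opened with O_DIRECT versus touching the same block through an mmap() mapping. The device path, block size, and offset are made up for illustration.

    /* sketch_read_paths.c - illustrative only; device path and block size are
       assumptions, not Aerospike defaults. Build: gcc -O2 sketch_read_paths.c */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define BLOCK (128 * 1024)

    int main(void) {
        /* Path 1: good old pread() with O_DIRECT, buffer aligned for DMA */
        int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }
        void *buf;
        if (posix_memalign(&buf, 4096, BLOCK)) return 1;
        if (pread(fd, buf, BLOCK, 0) < 0) perror("pread");

        /* Path 2: mmap the same region and fault it in by touching it */
        int fd2 = open("/dev/nvme0n1", O_RDONLY);
        void *map = mmap(NULL, BLOCK, PROT_READ, MAP_SHARED, fd2, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }
        volatile char sink = ((char *)map)[0];  /* page fault pulls the block in */
        (void)sink;

        munmap(map, BLOCK);
        free(buf);
        close(fd);
        close(fd2);
        return 0;
    }

The mmap path looks simpler, but every cold access goes through a page fault and the kernel's readahead heuristics, which is the overhead Brian's team measured as roughly four times slower for their access pattern.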
So a lot of what we did in founding the code base was to say, what's the code path Oracle takes?
Because I know that that's well optimized by all the storage vendors.
And so we have to think of ourselves not as an in-memory database from that perspective.
But on the other hand, we get this random access capability of how are we going to actually do primary key access?
How are we going to traverse indexes and avoid the kind of parallelism bottlenecks, so we can get up into the hundreds of thousands per second per server and into the multi-millions per cluster?
Well, NVMe must help with that parallelization, doesn't it? It gets even better.
There's a funny thing about that. The short answer is no, and the long answer is kind of.
So really what we've got is PCIe underlying both, right? So you've got a PCIe bus going out these
days to SATA and SAS attach anyway.
At least that's the first hop out to your HBA.
And then you've got something, you know, that's fan out under it.
NVMe is just a protocol.
It's just like SATA.
It's just like SCSI.
There isn't that much in it. There's been this sort of running back and forth where, even though they have NVMe drives underneath, they actually support
what's called scsi_mq, the multi-queue SCSI driver, as the primary interface, because the
SCSI driver is faster than the NVMe driver. There's been more time to optimize it and guys
have tweaked it. Yeah, I've been talking to Intel. I mean, they've got claims that the SCSI driver has, you know,
much more overhead than the NVMe driver. But then it goes back and forth. It goes back and forth, and,
you know, I love my friends at Intel, but I also do a lot of measurements. And so they
were out saying, you know, the NVMe driver is ready for prime time a couple of years ago.
And I had to sort of, you know, keep my mouth shut and look at the ceiling while they were saying all that stuff.
That wasn't what our customer base was saying.
The good news is the NVMe drivers now that are shipped in the operating system are super good.
We love kernels above 4.4, particularly. 4.13 was also a nice vintage,
got a lot of good improvements in the kernel. We tried to talk anyone out of kernels before 4.4,
simply because of NVMe driver support. And what has really happened, though, is
devices, the underlying hardware devices being shipped by manufacturers. If they put the NVMe stamp on it, it has to be fast.
And so they've ended up bifurcating the market and say, well, NVMe is fast.
Our fast drive supports NVMe.
Our slow drives support SAS and SCSI.
Okay, so that's their market differentiator.
Is there an underlying core technology that causes that to be true?
Meh.
No, they spend a little bit more money on the controllers in the NVMe device
so they have more cores
and more paths to the flash.
But I hope what we're talking about here
is a temporary thing.
I would much rather see NVMe replace SATA and SAS.
I think, yeah.
I mean, it's certainly in our deployments,
it's pretty much 100% all NVMe these days.
And there's some really,
the NVMe committee is doing
some really interesting upcoming work.
There's a particular one
that allows you to sort of partition out
a drive into logical drives
all the way down
through the garbage collection system.
Because that's one of the cases
where you have a lot of cross-talk,
performance cross-talk, where you have one user of the drive
and another user of the drive,
and writes really end up interfering with reads as part of NAND.
You're saying namespaces can actually have their garbage collection
partitioned out within a namespace?
I'm not sure whether they're going to use namespace as their underlying piece, but there's a piece that they're working on in the committee about
that. And this is driven by cloud vendors, right? Cloud vendors need to have virtual machines
where the noisy neighbor doesn't impact. So they want to be able to have basically this virtual
machine using that and then another virtual machine and them really, really, really being separated all the way down through controllers and chips all the way into the device being fully separated.
Yeah, it's an interesting combination of multiple namespace support and the kind of host managed drive stuff that we were talking about a few years ago.
So that you as the application know which namespaces are doing garbage collection
and you can write to the other ones.
Absolutely.
We recently introduced a system like that in Aerospike that we call Quiesce
that allows us to basically pause an individual server
and allow garbage collection to occur on that system.
We actually did it primarily for one of the cloud vendors that has a tendency, all virtual machine systems
tend to want to be able to live migrate you.
VMware does that famously, but also Google Cloud does it.
But they give you a little notification.
So we can actually route all of the read traffic
away to other systems.
And then when the hardware migration is done,
we can actually come back, so we don't have any timeouts during these migration periods.
It's the same concept underneath, too.
Just on my own education here.
So, Aerospike is a distributed database that operates across multiple servers and multiple storage devices and that sort of thing?
Is that?
Yes. So I think in 2008, if you were founding a database company, distributed systems were just, of course, you're going to build a distributed system. And my goal when I was thinking about it was HA. I was thinking a little bit less about horizontal scale, although that's obviously important, and a little bit more about, well, if you're really not going to have any timeouts due to a hardware failure, okay, it's great to have a SAN with, you know, all kinds of crosswise RAID error correction, you know, recombination stuff.
On the other hand, well, what if a server blows up?
So if you're going to handle a case of an entire database server blowing up and immediately re-vectoring to some other database server, then you've just built a distributed system.
So let's start with a distributed system all the way down.
I got you.
I got you.
That's interesting.
Okay.
Okay.
And you do the data protection in the application layer, right?
We do.
So what we'll do is we're sort of from a direct attach heritage, even though we run fine on
SANs.
I just had a major customer do a real nice SAN launch
today, if you guys are following the Euro-region TIPS payment processing system. So they're using
SAN. So we've run fine on SANs, but we're really direct attach heritage. So we figure, you know,
the server is going to be the failure domain, and it has a bunch of storage that is directly
attached. That's sort of how we think about the world. Therefore,
we need two copies on multiple servers. The problem there is error correction and sort of
doing a lot of the classic storage tricks don't really work when you've got entirely different
servers that you expect to fail. So that's sort of our philosophy. So it's heterogeneous servers,
heterogeneous storage effectively for your distributed database?
Well, we like a lot of homogeneity from a distributed systems perspective. If one particular machine has half the amount of RAM, that's kind of hard to plan for. If you think about server problems, as opposed to storage problems, you've got the axis of CPU size and number of cores, you've got the axis of the amount of RAM,
less so the speed of the RAM.
RAM's all pretty much the same.
And then you've got the network,
and network processing is probably the biggest bottleneck we have these days.
So you get too much heterogeneity
across three entire axes,
and it's very hard to plan
and run a practical distributed system.
Yeah, and the kind of customers you're selling to
are happy to go,
oh, we need eight nodes for that. We'll get eight. They're pretty much the same. Yeah. And cloud makes that easy
too, right? So, you know, just pick your instance type, roll eight of them out, you know, stamp,
stamp, stamp, and off to the races. Okay. So today you're using those SSDs in those servers for persistence, right? Yeah, persistence and size, absolutely.
So you and I met at Intel's introduction of their Optane DIMM product.
So let's talk about that, how that affects database servers and structure and such, because
it seems to me that makes lots of things a lot more addressable.
Absolutely. So very, very excited about the crosspoint technology in general. And there's
a lot to talk about within it. And I think it's underappreciated these days in the database
industry in general. Well, the current Optane SSDs are five times as expensive as an NVMe flash SSD
on a per-gigabyte basis. I might say four, but I won't quibble with five. And so you're really going,
do I want one terabyte of really fast or five terabytes of not quite as fast
for the same money? And the
difference isn't big enough to be really compelling. I am going to agree with you 100 percent. And I made
a presentation to Intel where I said, hey guys, your P4610 is really an awesome NAND drive,
and you are going to have trouble with your Optane drive competing against your own drive.
So the good news from Intel's perspective
is they have a horse in both camps.
But what I know as a database guy is PCIe is the problem.
And we look at all of our,
we have a lot of data.
What?
Yeah.
So you want to go to the memory bus.
Okay, I got you.
I got you.
One four-lane PCIe socket where you plug a U.2 drive in is 25 gigabits per second.
It's actually the latency.
The latency is killing us more than anything else.
But it's like 10 microseconds, 50 microseconds latency to get to access that drive out there versus nanoseconds kind of thing. Is that the numbers we're talking about?
Yeah, exactly. So, you know, there's a lot of cases in databases where you might have to pull, you know, we're really cautious in Aerospike to try to pull only once for every particular database operation.
But there's cases where you have to pull two or three different discrete blocks of memory.
And so then that starts adding up when you're trying to build something high performance.
So the presentation I originally made before the one we met at, which was the one regarding the NVMe Optane drive,
I showed a slide that said, if your write load is very, very high, then Optane makes sense.
And therefore, Optane makes sense, kind of the way HP just announced it for 3PAR as,
this is our write buffer in a huge storage system.
Absolutely. So the curve that
we measured regarding Optane NVMe was with NAND, a typical NAND drive, if you've got a 50 plus
percent write load or even 90 percent write load, when you start pushing a lot of throughput and
you're not very over-provisioned, this is back to your price point, I know. If you're not very over provisioned, you will start seeing a lot of latency on your read
path. And Optane simply does not do that. You look at the
curve and it's still just, you know, you're just eating PCIe latency at 10 microseconds,
absolutely regardless of your read-write ratios.
Dead flat.
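That flat-versus-spiky behavior is easy to check for yourself. Here is a toy 95th/99th percentile read-latency probe, a hedged sketch rather than Aerospike's benchmark tooling; the device path and sample counts are made up. Run it while a separate tool such as fio applies a heavy write load to the same device, and compare the tail latencies on a NAND SSD versus an Optane SSD.

    /* sketch_latency_probe.c - toy tail-latency probe, illustrative only.
       Build: gcc -O2 sketch_latency_probe.c */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define SAMPLES 10000
    #define BLOCK   4096

    static int cmp(const void *a, const void *b) {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    int main(void) {
        int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }
        void *buf;
        if (posix_memalign(&buf, 4096, BLOCK)) return 1;

        static double lat_us[SAMPLES];
        for (int i = 0; i < SAMPLES; i++) {
            off_t off = ((off_t)(rand() % 1000000)) * BLOCK;  /* random 4K read */
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            if (pread(fd, buf, BLOCK, off) < 0) perror("pread");
            clock_gettime(CLOCK_MONOTONIC, &t1);
            lat_us[i] = (t1.tv_sec - t0.tv_sec) * 1e6 +
                        (t1.tv_nsec - t0.tv_nsec) / 1e3;
        }
        qsort(lat_us, SAMPLES, sizeof(double), cmp);
        printf("median %.1f us, 95th %.1f us, 99th %.1f us\n",
               lat_us[SAMPLES / 2], lat_us[SAMPLES * 95 / 100],
               lat_us[SAMPLES * 99 / 100]);
        free(buf);
        close(fd);
        return 0;
    }

On a busy, lightly over-provisioned NAND drive the 95th and 99th percentiles climb sharply as the write load grows; that is the curve Brian describes, and the one Optane keeps flat.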
But that's now a sophisticated story.
It's harder to do.
You can't just walk in and say, it's faster.
You have to say, what's your read-write ratio?
How much does the difference matter between a 250-microsecond read, because I have garbage collection pauses, versus a 10-microsecond one?
And how sensitive is your application to the 95th
percentile latency as opposed to the average latency? And it's about consistency and predictability,
not about absolute performance. So what's a typical read-write ratio for Aerospike workloads?
I mean, is that based on what the customer is doing?
Is that more application dependent than database dependent? Well, you know, our view of the world is skewed. I'll admit that because people come to Aerospike when they have a mixed workload problem.
So we were built from the ground up to handle reads and writes at the same time very fast.
And we find that to be common in internet style workloads.
So what happens in a lot of internet style workloads is, you know, a user wants to do something. You often have to read a profile, read some information, make some decisions.
And then you often write back to that profile, hey, I just served this person this thing.
So you get a huge amount of fixed 50-50 mixed workloads.
Yeah, so we think it's pretty common.
On the other hand, you know, people come to us with their 50-50 workloads.
Maybe we don't see a lot of the other ones.
Because they have problems with other devices, databases.
I got you, I got you.
The ones that are 80-20 work fine on Oracle.
Or just about any database, really. Just about any database. Exactly. So, you know,
in memory, and this is one of the reasons, you know, going back to your in-memory question,
in-memory is always known as being very good at mixed workloads and high write workloads.
And so we often want to walk in and say, hi, Aerospike, the in-memory database, because
someone will say, oh, you're great at mixed workloads and high write loads and all kinds
of stuff. And we go, absolutely. By the way, we're doing it on NAND. And so you've got the benefits of NAND, and, you know, 50 cents these
days, 50 cents a gig, and persistence at the same time. How would you like some of that instead of
just having to buy a pile of DRAM? Well, and then that pile of DRAM means you need a pile of servers
to put it in. Yep, low density. So let's talk about persistent memory, speaking of density and persistence.
So the 3D cross-point comes in with higher density, memory bus access, not quite memory speeds, but close.
Is that kind of what you're looking at?
Yeah.
So if you look at the specs and what they've said, you know, Intel has these nice glossies about it,
but so DRAM is at, what, 5 nanos to 7 nanos,
and they have a nice, efficient L1, L2, L3 cache
that's tightly bound to the CPU, blah, blah, blah.
And Crosspoint is out there saying,
well, you know, we're going to be pretty slow.
We're at 50 nanos and 100 nanos per read, and read writes are the same.
And by the way, there's no garbage collection.
We have to do a little wear leveling maybe, but we don't have to do garbage collection.
So, you know, more like a traditional rotational disk, which still has to deal with bad blocks,
but it's not like, you know, they have to do a true garbage collection. So 10 times slower than DRAM, but a hundred times or 200, you know, a thousand, not, yeah,
I guess a thousand times faster than SSDs.
And more parallel because no garbage collection and none of the crazy interactions and...
More consistency, that sort of thing.
I gotcha.
Yeah.
And also more symmetric.
You know, Michael, it was a great point you mentioned, Howard,
about symmetry in asymmetry in NAND.
And Aerospike has had to build a bunch of technology around those asymmetries.
And that's after the SSD vendors have done millions of things to get around that asymmetry.
Absolutely. And so we have to know what they do. We have to measure what they do. We have to
call them out on their bugs. That's always a fun part of my job.
You probably have to get far enough to recognize when SSDs have exceeded the RAM buffer as a write
cache and are therefore getting slow.
Well, it's the page write activity.
It's the garbage collection activity that's behind the page write activity. I mean, all these things start to interfere with normal access, reads and writes.
So the good news about persistent memory is we don't have to deal with that.
When we first took a look at the PCIe NVMe-based Optane stuff,
we were a little underwhelmed. First of all,
with density: the original go-to-market Intel stuff was 375 gigabytes each, in either add-in card,
you know, HHHL, or 2.5-inch form factors. And when I presented that to my customer base, they just looked at me funny,
because they're all buying, you know, they're getting out of their 1.5 teras and they're looking hungrily at the six teras. And you walk in and say,
you know, 0.4 tera and they go, really? Well, yeah. And 0.4 terabytes in a PCIe slot. And I
only have a few of those. I can't just throw shelves on.
Yes.
On the other hand, they were also doing a 2.5 NVMe,
and so you could jam 50 of those into a Supermicro,
but that seemed kind of esoteric to everyone.
And Dell didn't have 2.5s. Only now have they released their Gen.
Server vendors were slow to adopt the U.2 support,
except for Supermicro.
Yes, except for Supermicro.
So now we're living in a different world
where Optane NVMe is at 1.5 Tera,
which at least puts them on the map.
But on the other hand,
the NAND guys have done a lot of great work also. So we're seeing,
you know, performance levels. It's still hard to make a case for that. So but flipping over to the
memory bus, everything changes. And the challenge with the memory bus is the memory bus is built for
non persistent stuff. And that was always a challenge previously for the guys doing Diablo
memory and stuff like that.
But it's even more of a challenge if you're trying to have a truly persistent device.
And that's really been the programming challenge.
Every panel I've been on discussing Crosspoint on a memory bus, everyone stands up and says, how hard was it to program to?
What did you have to do?
What are the interfaces look like?
How do you deal with a power cycle? All the things storage guys have known how to do? What are the interfaces look like? How do you how do you deal with, you know,
a power cycle? All the things storage guys have known how to do for a long time, you know, you
have to send a message. And then when it's fully committed, you get a message back, you have to
process that. And you know, that's the way storage works. Well, how exactly does that work on a
memory bus? So we have had a transactional memory spec for a while. And interestingly, Intel didn't
choose to implement that; they did basically some other stuff. And so there's this whole pmem.io interface,
we use the lowest level version of that. And essentially, you get something that's like
mapped memory, but we have to be very cautious about how the caching layers inside the subsystems
work in order to make sure that when we restart, it's correct, or that we have, you know, basically
proper checksums and, you know, bits that we set and clear,
which can then become a performance bottleneck
to make sure that the internal data structure is right.
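As a concrete illustration of that lowest-level pmem.io path, here is a minimal sketch using libpmem from PMDK. The file name, pool size, and the validity-byte scheme are purely illustrative assumptions, not Aerospike's actual on-media format.

    /* sketch_pmem.c - illustrative only; assumes PMDK's libpmem is installed
       (link with -lpmem) and /mnt/pmem0 is a DAX-mounted pmem filesystem */
    #include <libpmem.h>
    #include <stdio.h>
    #include <string.h>

    #define POOL_SIZE (64 * 1024 * 1024)

    int main(void) {
        size_t mapped_len;
        int is_pmem;
        /* Map a file on the DAX filesystem; it behaves like plain mapped memory */
        char *base = pmem_map_file("/mnt/pmem0/index.pool", POOL_SIZE,
                                   PMEM_FILE_CREATE, 0666, &mapped_len, &is_pmem);
        if (base == NULL) { perror("pmem_map_file"); return 1; }

        /* Write a record, then a validity flag, flushing caches in between so a
           power cycle can never see the flag set without the record behind it */
        const char record[] = "key=42,value=hello";
        memcpy(base + 1, record, sizeof(record));
        pmem_persist(base + 1, sizeof(record));  /* cache-line flush + fence */
        base[0] = 1;                             /* "this entry is valid" bit */
        pmem_persist(base, 1);

        pmem_unmap(base, mapped_len);
        return 0;
    }

The ordering is the point Brian is making: on the memory bus nothing tells you when a store is actually durable, so the application has to flush and fence around its own validity bits, and those extra flushes are exactly where the performance bottleneck he mentions can creep in.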
Given that we're talking about 512 gigabytes per DIMM,
and I would assume for your customers, 10 to 12 DIMMs is going to be the average.
If you're actually checking the state of all of that on a power cycle, it could be an hour before you can get access to the system.
And at that point, why don't you just put it on NAND and do an index rebuild. So that is the challenge of programming with this stuff,
is being able to know when it is correct
and when it is in sync with underlying storage.
And you see those changes to the BIOS
as well as the underlying access methods and stuff?
Or is it just more or less at the access method level?
Okay, it actually requires more than BIOS stuff. It requires some chipset changes, which is why Brian and I were at a
brilliant introduction several months ago. But the people listening to this podcast won't be able to
buy this stuff for another several months, because it's the next unannounced generation of servers
that will support it. Except there's been one interesting change in the market dynamics that I'm incredibly excited about.
You can get Intel, you can get Optane on the memory bus, Crosspoint on the memory bus today.
And that requires calling up Google, and Google has instances in their cloud that have it today.
In their cloud, okay. For those of us who run data centers, that's not get, that's use.
But for developers, that's perfect. You can go write your code and test.
Well, you know, they're marketing SAP on top of it. So SAP has basically a bundle with those things. And so you can run your SAP stack on the Google Cloud with six terabytes per instance.
Actually, I don't know what their instance size is, but we've all known that Intel is
going to, you know, support six terabytes per chassis, essentially per two-socket.
That's well known.
So basically they're saying, you know, hey, we're enterprise ready.
You use SAP, come and bring your SAP apps and we'll run it on this fancy crosspoint stuff,
and you're going to get a higher level of scale than you can get in your data center.
It's surprising that Google is the only one that offered that today,
but it's just a question of time, I guess, for the rest of those guys.
Well, Google, you know, we can argue which cloud is the better cloud all day.
I used to do a cloud podcast, by the way.
But that's, so, you know, AeroSpike has to be cloud agnostic, of course.
But one of the things about Google historically is they've put a lot of energy into building
their own hardware outside of the chip and even inside the chip in some cases.
So they have chip designers, they have motherboard designers, they have all kinds of people. And if
they wanted to do, for example, you know, a new memory bus, they probably have the in-house
technology. They certainly are well known for doing all their own boards and, you know, for
their data centers. So the fact that they could, since, you know, they don't have to support partners, they don't have to support Dell, they don't have to support
all of the ecosystem that Intel does. Um, if you are up for making your own boards and chips,
I would expect actually Google to be, you know, easily six months ahead and lo and behold,
here we are. Um, you know, but, but to your point, it is more for, you know, we see it more as a POC
thing, to be honest, because, uh, you know, it's a way for us all to get our fingers in it.
Right. And the good news for everybody else is that if developers get to do that kind of testing and development in the cloud, after I can actually buy the hardware, the software will come faster.
And that is a good thing.
Because it's on-site and stuff like that.
So what do you think the speedup is going from, let's say,
your basic NVMe SSD today to an in-memory Optane solution
with similar capacities?
I mean, is it in fact a 1,000x speedup?
It can't be a 1,000x speedup.
So the real question is the bottleneck shifts from PCIe or NAND,
depending on exactly where you are and whether you've got Optane behind the PCIe.
It shifts to the network.
And so AirSpike is an operational database.
We've never really been CPU limited, right? There's some really interesting GPU-based column store analytics systems. As a transactional system, it's been all about the network, actually, for a long time. Back in the day, you know, 10 gig was just starting to roll out and it was a little pricey.
10 gig is now super common, 40 is a little pricey, and 100 is considered pretty
esoteric, primarily because of run length problems, right? Well, once you're up in the 100s,
you're having trouble making more than, what, about five meters. But 25 is right in there. Yeah. And so with all of that, it's like, okay,
you know, the balance of terror shifts so starkly over to the network.
You would think the network would be a bottleneck even with NVMe
SSDs. I mean, well, it is.
When we start talking about persistent memory on the memory bus, there are going to be things that
are problems for other people before the network, Brian. People are going to put ext3 on
an Optane DIMM and wonder why it's not that ridiculously fast. You know, there's all sorts
of software layers in there
that are going to get people in trouble first.
There's a whole other train of thought there,
which is file systems bug me.
And the reason is really,
I don't have anything wrong with the file metaphor.
It's just that everything's POSIX
and even the exact kind of interface that POSIX is, and the fact that there are inodes.
There's a whole bunch of inefficiencies that have just crept into those interfaces and then they've ossified.
If you want to try to do interesting file system research, you're just hamstrung by this one particular API.
Databases, we've had an opening in APIs, right? We've got SQL, and SQL is more extensible than the file system API,
but then you've got all these crazy NoSQL guys
who could go out there and do their own APIs
and really try to innovate there.
So that's my only problem with file systems
is really the API.
And it's like, okay, so with a clustered modern database
of any sort, us or our competitors,
you get to specify how much consistency you want
as part of the read and the write.
Whoa, whoa, whoa, whoa, whoa, whoa, whoa, whoa.
You get to decide how much consistency.
Like what?
Two-tenths or 20% or 100% or yes or no?
Yes. So, for example, do you want to be – the big flag in Aerospike is full linearizable versus what we call session consistency.
Session consistency allows stale reads in some corner cases, and full linearizable doesn't.
And due to that, we have to do one extra round trip during the course of the
transaction and there's a performance penalty of about 30%. So we're write consistent in all the
cases, but we've got this one knob of read consistency. And you look at Cassandra,
Cassandra has even a more flexible system for consistency. You've got quorum reads,
partial quorum reads, all these kinds of things. And
they do both read and write consistency on that basis, et cetera, et cetera. And it's on an API
call by API call basis. See, as storage guys, we have problems with this. You write data to storage,
it better come back the same every time. We only have one level of consistency we believe in, consistent.
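To make the per-call consistency knob concrete, here is a tiny hypothetical sketch in C. This is not the actual Aerospike client API; the enum, function name, and stub behavior are invented for illustration, and the comment just restates the trade-off Brian described.

    /* Hypothetical per-call consistency knob, loosely modeled on what Brian
       describes; not the real Aerospike client API. */
    #include <stdio.h>

    typedef enum {
        READ_SESSION,       /* may return stale reads in some corner cases */
        READ_LINEARIZABLE   /* extra round trip per transaction, roughly 30% slower */
    } read_consistency;

    /* A stand-in for a client read call that takes consistency per request */
    static int kv_get(const char *key, read_consistency level, char *out, size_t len) {
        /* A real client would do one extra replica round trip for
           READ_LINEARIZABLE before answering. Stubbed here. */
        snprintf(out, len, "value-for-%s (%s)", key,
                 level == READ_LINEARIZABLE ? "linearizable" : "session");
        return 0;
    }

    int main(void) {
        char buf[64];
        kv_get("user:42", READ_SESSION, buf, sizeof buf);       /* fast path */
        puts(buf);
        kv_get("user:42", READ_LINEARIZABLE, buf, sizeof buf);  /* strict path */
        puts(buf);
        return 0;
    }

The point is that the choice rides along with every read, which is exactly the flexibility Brian says the storage stack never exposes.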
And I think that's a problem as a database guy because, look, you look at most databases, take the old relational ones.
Let's throw out the crazy NoSQL guys.
And everyone would relax consistency for some parts of their data set.
It's part of the configuration of a relational system.
MySQL did not support referential integrity for a long time.
Oracle version 6 did not support referential integrity.
You would get application-level inconsistency time and time again, up through, what, Oracle
9 or something like that.
And even then you can run different modes.
So the database guys, they say,
well, maybe you want to be consistent, maybe you don't,
maybe you want to be faster, maybe you want to be slower.
And this has been true for 40 years.
I guess I never realized that.
And then the funny thing is you turn around to the storage guys
and all they do is 100 percent consistency. Well, so why did I relax
consistency at the database layer? I relaxed consistency at the database layer, and I'm not
telling the storage guys, so they can't go fast. Yeah, the problem is the storage guys
live in a world of nobody ever tells us what this data is or what's important. Therefore, we treat everything as if it's completely important. Yes. So the ability to innovate
within the storage ecosystem is limited. And so when I looked at this problem, I said,
I want to do something new and interesting. I'm going to have to be on the database side
of the equation, even if what I'm doing kind of looks like it might be a block store,
which is to say key value lookups in database terms.
Huh.
So you really want to invent,
you want different consistency levels at the storage layer.
That's an interesting concept.
No, I want storage to just go away.
Storage can just go away.
You know, just leave it to the database. We'll take all the storage dollars, move them over to database dollars. That's my plan.
We're not entirely happy with that. But I do like the NVMe committee, because I think they're kind of sophisticated about being able to do some pretty interesting stuff, right?
So when we do multi-pathing and you talk about multi-pathing, you talk about namespaces and
namespace options, I think there's room to start taking some of the learnings and databases
and see if we can make storage that can do new tricks.
I'm kind of excited about that.
So back to persistent memory.
So we've got this stuff on the memory bus.
And the interesting part about it is, okay, so DRAM was kind of limited to about one terabyte per physical box, right?
You had to go into esoterica to get beyond that.
And this is part of Intel's strategy because then Intel can say, do you take the amount
of RAM in the world, and then we can calculate the number of Xeon CPUs, sockets, required to
actually power that much RAM. And this is our slice of the market, since we're not in the DRAM
market. So they had to create a limiter in order to sell enough CPUs. And, you know, power to them.
It's a business model. So now they're coming out and saying, well, now that we've got this cross
point, we're going to move at a different level of scale. We're going to allow
you to put a different amount of storage-y thing off of the core Intel CPUs. So everyone in the
in-memory space is getting this nice little scalability boost of moving from essentially half a tera
per socket to three tera per socket, which ends up being sort of one tera to six because of the
de facto two socket use of these things. So that's all well and good, but then that amount of scale
is nice. But then, you know, we were talking earlier, we're talking about hundreds of terabytes. How do you even with six terabytes per
box, how do you get up to 300 terabytes? And oh, by the way, you need two copies, because it has
to be, you know, ha, or maybe even three copies, where storage guys, two copies isn't enough.
There you go. So when I was first approached by Intel with this,
I said, well, you know, that's all well and good, but I don't think my customers in three years are going to be really excited about going from one terabyte to six terabytes.
I think they're going to be excited about going from 20 terabytes to 100 terabytes. So what we launched was to say, well, with our hybrid architecture, we're putting our indexes in RAM because we need the parallelism of RAM in our indexes to drive NVMe, which demands parallelism.
That was sort of our core trick. And I said, well, great. So let's rip out the DRAM of that system
and make it a persistent memory. And then we can have six terabytes of indexes per 100 terabytes of NVMe. And that's going to be a really, really swank system.
And by the way, we're probably still going to be bottlenecked by NAND at that point anyway,
so it's going to come for free. We're going to have no performance drop when we switch our
indexes from DRAM to persistent memory. And oh, by the way, no more index rebuilds if we do a careful job.
Right, but you're only using it for kind of persistent.
Well, it's persistent, but it's not authoritative. You can always recreate that data, that index.
Correct. We can always go back. So we get a little bit of a wiggle room there because we can go back to NAND.
You mentioned level one, level two, level three caching in a prior discussion here.
Does the SCM, let's say indexing in your case, is it sitting in level one, level two and level three caches as well?
I mean, there's a challenge there that you might be flooding those instruction and data caches with indexing cache.
Yeah, but we've already got those optimizations for DRAM.
So our DRAM subsystem, there's already flags on individual op codes, right,
that tell you whether it's a bypass, whether to populate the cache,
whether to look through all that stuff, right?
So, yeah, we've already decorated all that.
We had to do that for DRAM.
I mean, DRAM is just as capable of doing cache pollution,
if not more so.
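For anyone curious what decorating individual accesses like that looks like, here is a minimal sketch using x86 intrinsics. This is generic illustration, not Aerospike code, and the buffer size is made up.

    /* sketch_nontemporal.c - illustrative; compile with gcc -O2 on x86-64 */
    #include <emmintrin.h>   /* SSE2: _mm_stream_si128 and friends */
    #include <stdlib.h>

    #define N (1 << 20)

    int main(void) {
        /* 16-byte alignment required for the streaming store below */
        __m128i *dst = aligned_alloc(16, N * sizeof(__m128i));
        if (!dst) return 1;

        __m128i v = _mm_set1_epi32(42);
        for (size_t i = 0; i < N; i++) {
            /* Non-temporal store: write around the L1/L2/L3 caches so a big,
               one-pass scan doesn't evict the hot index working set */
            _mm_stream_si128(&dst[i], v);
        }
        _mm_sfence();   /* make the streaming stores globally visible */

        free(dst);
        return 0;
    }

Prefetch hints and ordinary cached stores are the other decorations he alludes to; the database chooses, access by access, whether the data deserves a cache line, and that applies to DRAM and persistent memory alike.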
Yeah, yeah, it's been a long time since I programmed at that level.
No, but it's an interesting thought, Ray, that moving the data into DRAM as opposed to treating it as storage
changes how the CPU uses its cache
because now that's memory to be cached.
Yes.
And it means you have to think a little more carefully
about consistency.
And there's a little sidebar here
about the C programming language
because the C programming language is
the thing that accesses RAM, right?
Because you don't have a storage API. You actually have, you know, pointer following. How do you start
talking about that in the C language? They just say, oh, it's implementation dependent, implementation
dependent. What's the size of an integer? I don't know. It's implementation dependent.
Welcome to C. Yeah, welcome to C. So there's been talk about doing basically some new revs of the C language to try to deal with some of this stuff and to bring some of this storage slash DRAM stuff into C.
I think that's died out, but it's honestly kind of hard to use the APIs that we've got. And I still wonder whether we're going to see some of these essentially storage and persistence related keywords flip up into C and have the C programmer guys, the compiler guys try to help us out.
That's a whole different level of discussion.
Hey, Brian, this has been great.
Howard, any last questions for Brian?
Oh, too many.
Okay. Brian, anything you'd like to say to our listening audience?
Hey, Aerospike is open source.
Try it out.
And I feel like I didn't answer the one question about the performance differences we've measured.
Optane really, Crosspoint on the memory bus, really is doing very well for us from a performance perspective.
So we're seeing, you know, very small hits compared to DRAM; the bottlenecks are elsewhere.
So that's one thing we're excited about.
But yeah, Aerospike's on the website.
Power, a lot of fun stuff.
Distributed system.
If you have friends, if you're a storage guy, you might have database friends.
So spread the word is all I'm saying.
Storage guys and DBAs are not historically well associated.
Yeah, you might have friends that way, I guess I'd say.
Thin provisioning was, after all, created so that we could lie to DBAs.
Excellent.
Well, this has been great.
Thank you very much, Brian, for being on our show today.
Thanks, Ray.
Next time, we'll talk to another system storage technology person.
Any questions you want us to ask, please let us know.
And if you enjoy our podcast, tell your friends about it.
And please review us on iTunes and Google Play, as this will help get the word out.
That's it for now.
Bye, Howard.
Bye, Ray.
Bye, Brian.
Bye, Ray.
Until next time.