Grey Beards on Systems - 75: GreyBeards talk persistent memory IO with Andy Grimes, Principal Technologist, NetApp

Episode Date: November 6, 2018

Sponsored By: NetApp. In this episode we talk new persistent memory IO technology with Andy Grimes, Principal Technologist, NetApp. Andy presented at the NetApp Insight 2018 TechFieldDay Extra (TFD...x) event (video available here). If you get a chance we encourage you to watch the videos, as Andy did a great job describing their new MAX …

Transcript
Starting point is 00:00:00 Hey everybody, Ray Lucchesi here with Howard Marks here. Welcome to another sponsored episode of the Greybeards on Storage podcast. This Greybeards on Storage podcast is brought to you today by NetApp MAX Data and was recorded on October 28th, 2018. We have with us here today Andy Grimes, Principal Technologist at NetApp. The Greybeards talked with Andy at the NetApp Insight Tech Field Day Extra event last week. So Andy, why don't you tell us a little bit about yourself and what's new at NetApp? Thanks, Ray and Howard, for having me on.
Starting point is 00:00:41 Again, my name is Andy Grimes. I'm a Principal Technologist at NetApp. Been with NetApp for about 10 and a half years. Recently, I worked on the NetApp MAX Data product and bringing it to market. Before that, I was on NetApp Flash solutions. But now I'm actually moving over to our HCI solutions, which, ironically, will start to work with our MAX Data solutions, hopefully in the near future. So great to be here today and love to talk to you about MAX Data. What exactly is MAX Data? Is it a file system? Is it a dessert topping? Why do I care? Well, so MAX Data is a software product that NetApp has just announced we will be releasing in the very near future. But it's a software-defined flash solution. So we think of it as the next
Starting point is 00:01:23 generation of flash, leveraging persistent memory solutions as they come to market. And MaxData is effectively a way to project or to tier your data from traditional storage, SAN storage, into a server on the memory bus on a persistent memory DIMM and then accelerate it to memory speeds faster than is capable with any other technology in the market. So this would be like NVDIMS and 3D crosspoint kinds of memory solutions? Is that how you'd see this being used? Yes, the opportunity is, you know, we've had DRAM for a long period of time. Lately, we've started to have NVDIMS, and soon we will have OptaneDIMS. And there are three different types of technologies, but DRAM is, of course, volatile memory. NVDIMS is more expensive, power-protected DRAM-based technology. And
Starting point is 00:02:16 then now we finally have OptaneDIMS, which will be coming from Intel, hopefully in the near future. And hopefully at a lower cost per gigabyte than DRAM. That's the expectation, although Intel ultimately owns their pricing. But NVDIMs are more expensive today, but it does give you memory performance speeds with the power consistency of 3D NAND. But Optane DIMMs really promise much more than that. Higher performance, near memory speeds with higher write resiliency and other performance advantages. But mainly it's going to come in very, very large capacities,
Starting point is 00:02:51 which allow us to store data literally on the memory bus. Okay, so today, if I go to HPE or Dell, they'd be happy to sell me 16, maybe 32 gig NVDIMMs. So that means in a server, I might have 256 gigabytes. Are you just making that available or are you using that to accelerate access to something else?
Starting point is 00:03:20 So what we're doing with MaxData that's really unusual in the market is we're taking the MaxData software load when you install it on a system, it uses your memory tier is some type of dim on the memory bus. And what the software does is it actually configures a single mount point concatenating the memory tier and the storage tier. And the storage tier is an underlying block device, which in our first release will be a NetApp on tap LUN, or multiple on tap LUNs as the back ends tier. But basically, what max data does is creates a single mount point with a memory tier and a storage tier. And what allows you to do is write data into the mount point and it enters through a POSIX API. So traditional application semantics continue to work. But basically what we're doing is all writes and
Starting point is 00:04:11 reads occur onto, excuse me, all writes occur onto that persistent memory tier, and then will be tiered down to the storage tier at a later time. Reads will be opportunistically promoted into the primary memory tier for extremely high performance,, low-latency reads as well as writes. So what sort of performance can we expect from an access latency perspective? So we've seen single-digit microsecond latencies for single-threaded, single-core workloads. We are running in user space to help us with GPL licensing. So we don't have complete control when we get to multi-threaded multi-core, but single-threaded single-core will see under 10 microseconds of latency.
Starting point is 00:04:51 When we get to multi-core multi-threaded, we'll typically see higher latencies, but typically still under that 40 microsecond latency number, although it will vary a little bit. But that's still phenomenal compared to what we've achieved with externally attached SAN in the 150 to 200 microsecond range, where we really run into the network bottleneck limitations. Yeah, we were really impressed with the NVMe over Fabric Solutions pushing 100 microseconds just last month. Yeah. It's been a crazy world where you do literally talk to customers and you go, here we are at 150 200 100 microseconds of latency oh i can do 10 well sometimes i might do 40 oh and it's like well give me a break
Starting point is 00:05:30 you know we found we found a couple of solutions actually prefer user space because they can do polling rather than interrupt driven and stuff like that so it's kind of an interesting you know uh dilemma as to where where you're actually benefiting from kernel space versus user space. But nonetheless... We had a number of customers actually request the user space shift. So that was why we made that change. Yeah, we're seeing a lot more of that. Intel's encouraging it with SPDK. I think that pendulum swings back for a while. Exactly. So a couple other things that are really interesting, though,
Starting point is 00:06:05 is persistent memory brought us into, and I'm an old storage guy, so I don't trust servers. Yeah. I believe we own the sacred duty to never lose the data and never corrupt the data, and preferably never let you lose access to the data, and it's a religion. Amen.
Starting point is 00:06:22 And I've been in the world of, you know, HCI and some of the HCI companies out there are three nines of reliability and oh, what's a little data corruption between friends. Nobody will ever notice if they don't read those files anyway. Exactly. So I really look at Max Data
Starting point is 00:06:38 and that was one of the things that we required as a storage company. When we actually looked at Max Data, we thought, okay, I'm using persistent writes in a server and the servers are going to burn down, fall over and sink into the swamp at any moment. So we need a way to protect those rights. Normally DRAM, for example, we would restrict them to only read. So they always existed somewhere else. With MaxData, we have to be able to not only serve writes in order to equally serve an existing application without
Starting point is 00:07:05 modification, but I also have to protect them with enterprise class tools. So that's why MaxData also includes the ability to do snapshots, which effectively take the data in the T1 space, flag all unique blocks, and then copy them down to the storage tier for persistence on an external system. So if you trigger a snapshot or a snapsync in max data, we will actually trigger the copy down and then the on tap snapshot if it's an on tap backend with a snap mirror, a snap vault, and all the typical NetApp data protection tools. So you have a complete enterprise compliant
Starting point is 00:07:41 data protection suite, which has never been available to in-memory applications up to now. I thought there was another choice. Yeah. The other thing we do is I still look at a server as likely to fail. And so we want to go even further than a snapshot, I want to be able to actually protect a server by copying the memory tier to a second server memory tier. And that's called max recovery. And we basically take two max data instances, we peer them together over a private 100 gig ethernet RDMA network, and every write that lands in the in the memory tier of a max data primary server
Starting point is 00:08:16 will be replicated to the memory tier of a secondary server. So is that replication synchronous? It's synchronous, and it's RPO zero. So we will never, ever lose a write. Right. There you go. There's some overhead with that, right? Yeah, but we're not talking about a Metro cluster looks like one big array fails over automatically yet, right? No. No, it's an active passive architecture. So you'll have a primary to secondary. But in the event of a failure, we will copy the data back over 100 gig Ethernet RDMA network. So we've seen a copy back times of approximately three minutes per terabyte. So it's, it's an incredibly fast architecture. And we've seen like in MongoDB instances, the recovery time for a shard failure can be amazingly fast, you know, just a few minutes for most implementations. So you've got, you know, in-memory acceleration performance without
Starting point is 00:09:12 really modifying the application, which is incredibly cool to add to your enterprise arsenal, but also the ability to protect and even recover more quickly in various failure scenarios. And so this is sort of allowing persistent. So a couple of questions. You know, the file IO stack in Linux has been relatively convoluted, enhanced and evolved over many, many generations. And it's pretty deep.
Starting point is 00:09:37 Are you guys using the file IO stack or? So we're using our own custom file system called, it's been submitted to the Linux kernel. It's called Zoofs. In our branding, it's MaxFS. It is actually part of the reason NetApp acquired the original company that this technology is derived from, PlexiStore, about a year and a half. The MaxFS file system and the Zoofs file system that's derived from it is extremely low metadata, low overhead, very, very low latency. And while we're bringing the MaxData product to market, we see a tremendous
Starting point is 00:10:12 opportunity for that. We're also integrating it into our ONTAP ecosystem, into our internal components. So it's a really, really slick capability that we've added to our arsenal. You mentioned metadata. So there are directories and inodes and all this stuff associated with the persistent memory file system? So typically we do use metadata, but we don't have the journaling and the multi-metadata recovery that is typically used for OS file systems. This is typically a memory file system with much more elegant, simpler paging access. We actually do have a memory API as well, and the ability to add functions like actually pinning data into memory. So if you have files that are stored in the Max Data Mount Point,
Starting point is 00:10:57 you can actually, there's semantics to pin specific files, so they are always accelerated. So there's some interesting capabilities that we'll continue to add to later on. But what's really brilliant about the solution is it is software. So I could run it on a server that a customer purchases, that memory that a customer purchases, without custom chipsets, without custom hardware design, without custom flash modules that popular wisdom and certainly the marketing departments of many organizations have insisted you've always needed to go this fast. And we've actually proven that we can do it from software. With either DRAM or NVDIMM or sometime later Optane DIMMS kind of thing.
Starting point is 00:11:41 Yeah, yeah. Well, so wait a second, Ray. Did you say we could use ordinary DRAM for this? I believe so. That's what he mentioned earlier on, right? I mean, if you had enough DRAM to do something of substance, right? Well, DRAM costs a lot less than NVDIMS. I mean, I'm a risk-taker. No, I'm not. I'm a storage guy. Never mind. So, we'd certainly support DRAM, and if your application tolerates it, we will support DRAM at release. We see the opportunity though, is the inclusion of persistent memory into your data workflows. NVDIMM now with the cost metrics makes sense. And then Optane
Starting point is 00:12:17 DIMMs when they're available and in sufficient quantities with the servers that support them. But what's beautiful about our software architecture is we can actually innovate very quickly. Unlike hardware-based platforms, and no disrespect to ONTAP or ElementOS or Centricity or any of the other hardware platforms out there, it takes longer to release things in hardware because you're tied to hardware release cycles
Starting point is 00:12:41 and you're typically tied to much broader feature sets that you have to validate or regress against. More, more appliances out there and such. Yeah. Well, that and spinning an ASIC takes a while. Yeah.
Starting point is 00:12:53 Yeah. Especially in that case. But what we're seeing with max data, that's beautiful is, is it's, we plan to release a new version about every three months. So when we talk to a customer about, I don't have a really
Starting point is 00:13:06 heavy hardware interface matrix I have to validate and constantly innovate and retest. So at release, we plan to support an ONTAP backend with ONTAP LUNs behind it requiring ONTAP 9.5. But we have tested it with server standalone with internal SSDs. We've tested it with Element OS in our HCI product. We've tested it with cloud, where we've actually run instances in Linux images in Amazon with DRAM with EBS storage. And that actually brings you to the DRAM example. There may be great use cases for using DRAM in Amazon for accelerating applications. You see Max Data as part of the Amazon marketplace at some point?
Starting point is 00:13:48 It was back in the Plexi store days, and we're evaluating how quickly to bring it to market. Because we're already doing a number of early access programs with customers, and about a third of them were AWS. About a third of them are server, and about a third of the requests are sand attached. And that's been pretty interesting for us. And as a software product, we can go in the direction the market requests.
Starting point is 00:14:11 Yeah, I can see a lot of use cases for applications like Mongo, where it's sharded up the wazoo and protected inter-node so that short-term losses of it being DRAM aren't really that much of a problem. I've been thinking a long time that persistent memory meant new database engines, but you guys are bringing the advantages of persistent memory to existing application models. Yeah, that's the most exciting thing for me. And when we actually talked to an analyst, it was really exciting because they said, do you know how much money a hyperscaler spends on DRAM? And we were like, no. And they said, it's a massive percentage. It's a massive cost for them. And if Optane DIMMs, for example, are 33% of the cost of DRAM was one estimate. And again, that's subject to Intel. But the reality
Starting point is 00:14:59 is, is that there was an article in El Reg a couple of days later saying, you know, one of the big hyperscalaters just got a ship and they don't know what to do with them. And we're like, well, here you go. I can, you know, take a server. I can put Optane DIMMs in it. I can give it a traditional, you know, storage tier back end, and the application will never know the difference.
Starting point is 00:15:19 And if that's an internal serving software, if it's integrated into a PC vendor or an ISV stack, it doesn't matter. It doesn't know. And it means I don't have to rewrite my application to run with Aero, Spike, or HANA. Yep, exactly. And I don't give up my data protection for my enterprise class solutions either, which should accelerate adoption. So it's a pretty exciting space for us to be in, actually. You could think Intel would have bought you guys by this time or something like that considering the fact that they want to you know
Starting point is 00:15:48 increase adoption of persistent memory optane dims specifically that app got there first yeah i guess i guess you mentioned uh you know the operating software operating system software what versions of linux you guys currently run on? And is that going to be extended, I guess? Currently, we run on IBM Red Hat. Oh, wait, sorry. That's going to be late. Or Purple Hat. No, Red Hat 7.5, CentOS 7.5 are the current versions,
Starting point is 00:16:16 although it's fairly easy for us to qualify a different one. The other thing that was interesting, just to digress slightly, was at the NetApp Insight Spotlight Sessions, we actually had Lenovo and Cisco on stage with us when we announced Max Data with Intel. What does that mean? Lenovo and Cisco typically aren't on stage at the same time for anything. So you've got servers, Lenovo servers running this, as well as Cisco UCS?
Starting point is 00:16:45 Yeah, we're qualifying Cisco UCS right now. But Lenovo, we have an OEM with Lenovo, and both of them wanted to be on stage for the announcement because this is relevant to their business. And then Intel, we invited Intel to be on stage, and they certainly wanted to be because this is relevant to their business. And to a company that, you know,
Starting point is 00:17:06 three or four years ago was, was getting our, our, our flash hat back on. It's pretty cool to be in a position where we're now actually relevant to memory vendors and server vendors. And even at insight where I presented one of our sessions in the, in the public call, I had a memory vendor come up to me afterwards and said,
Starting point is 00:17:23 we absolutely have to talk to you. You guys are going places. No one else can. And that's a pretty cool come up to me afterwards and said, we absolutely have to talk to you. You guys are going places no one else can. And that's a pretty cool place to be for, you know, somebody who people used to think of us as a storage company. And isn't it nice that I can buy standardized NVDIMMs from Micron or Viking and know that they're going to work in my Lenovo or my Cisco server? It wasn't all that long ago where it was, oh, you want to use NVDIMS?
Starting point is 00:17:47 There's these three models that Supermicro has with the proper BIOS support, and that's all there is. Yeah, I had a great conversation with a customer a few weeks ago, and it was one of those nightmare meetings where you walk in and the guy's like, I like software-defined. I'm like, we do too. And he's like, what? He's like, I'm testing all these NVME startups,
Starting point is 00:18:06 and they're top of rack, and they're all great, and what are you doing? And I'm like, well, I'm just, we do too. And he's like, what? He's like, I'm testing all these NVMe startups and they're top of rack and they're all great. And what are you doing? And I'm like, well, I'm just putting it right on the memory bus. And he's like, what? Wait a minute. He's like, back up. He's like, I don't have to buy your expensive cards. I'm like, nope.
Starting point is 00:18:15 He's like, I don't have to buy your servers if I don't want to. I'm like, nope. We'd like it to when we get to HCI, of course. But today it's a software load. So he was like, that's exactly what I want he was like that's exactly what i want to see that's exactly what i want to hear these top of rack appliances don't buy me anything because i'm still way out on the network and so it's actually pretty exciting conversations to get into um
Starting point is 00:18:36 because we're it's not really a storage company anymore and the joke is is we're not sure where we left the storage company let us know if you find it. Well, I don't know. You're addressing it via POSIX. That makes it storage to me. Yeah, yeah, yeah. But it's really a memory solution that happens to extend the storage infrastructure. Yeah, the line blur, the line blurs,
Starting point is 00:19:00 memory storage, what's really the difference? You know, I think of one as expensive and I think of one as less. It's fascinating when we talk to customers and they're like, you're doing what? Next thing I know I'm talking to application teams, server team, high performance computing teams, and then sometimes cloud architects. It's a really fun place to be. I've been invited to a lot of high performance computing RFPs lately. And back in the day, if you've read the Michael Lewis book, Flash Boys, that was the holy grail of speed.
Starting point is 00:19:32 Right, right. High-frequency trading and all that stuff. And this fits high-performance computing very well because those applications checkpoint. So I generate a snapshot at the checkpoint. And if that server falls over, I just recover to this checkpoint. You mentioned that the AHCI is a potential implementation as well. In this environment, it would be operating with Element X as a back-end? So at release, what we've announced and what we intend to support at shipment is an ONTAP backend attachment.
Starting point is 00:20:07 And it can be multiple LUNs to a single, and we can be multiple memory DIMMs up to six terabytes when we get to opting in support of memory tier. And backend will be, you know, one to the 25th. So 25 times the backend capacity for the storage tier. But at release, we will support ONTAP. We have the potential to support local SSD and server. We also have the potential to support ElementOS as a backend storage attachment. Centricity, we've tested both of them
Starting point is 00:20:40 and have had no issues with them. At release, we plan to support Linux only bare metal, because if you're going really, really fast, you're probably not virtualizing it yet. However, naturally, that's something that's a priority for us to support virtualization technologies almost immediately. But what we're really seeing is, you know, HCI is extremely attractive for us, because of course, it's our compute at that point. So we can invest in that compute space. But of course, our HCI architecture is so flexible,
Starting point is 00:21:11 you could go ahead and use your own servers sooner than that before we release our own Optane DIMM supported server. So there's a lot of flexibility with the technology because at some point, it just becomes a backend qualification, not really a whole new like re-architecture. And I can see different customer sets wanting different of those options. There's a bunch of the hyperscalers or people who'd like to think of themselves
Starting point is 00:21:36 as hyperscalers would like the local SSD, but would probably be better served with something else. Yeah, about a third of them have really said to us, while we're using Ceph, we're using Cassandra, excuse me, we're using Cassandra, MongoDB, or Oracle, and we'd like to use local internal SSD. And it's like, okay, sure, we can support that pending a future release item,
Starting point is 00:22:04 but we can test it fairly easily today. In fact, I have it running on my laptop in a CentOS image in DRAM with a local file for the block layer. And I'm running a couple hundred thousand IOPS of it. Yeah, but you demonstrate in poetry, but production's in prose. Yeah, I guess, I guess. So when is the release of the max data scheduled for? Do you have a particular, uh, timeframe at this point? Uh, currently that,
Starting point is 00:22:30 you know, the, the legal ease is by the end of the year. Um, we expect it sooner than that. Um, but it is a brand new product to market, uh, without any real predecessors. So we're going to get it right when we release it. Um, but as of right now, we're still tracking towards our release date, which should be in the near future, similar to the ONTAP 9.5 timeframe. What's been fascinating is we kind of, okay, who are your competitors? We're not sure. I don't think there's anything out there at this point. Well, rolling a whole lot of your own is your biggest competitor. Yeah.
Starting point is 00:23:02 Yeah. What we're seeing is the operating systems are adding the capabilities in. So we've seen, but they can't do the protections and they don't do the tiering for reads and writes effectively. The top of rack NVMe appliances
Starting point is 00:23:16 we're seeing rapidly niche on themselves. What they were designed for for NVMe performance we're now delivering with the ONTAP AFF and our EF series systems with NVMe over Fiber Channel, NVMe over Finnyband, and NVMe over Rocky support with the enterprise data protection tools. So we think persistent memory is going to rapidly make those NVMe niche arrays obsolete. And that's what we've heard from a lot of the customers we've talked to who are interested in this space. It's, wait a minute, what are you doing? Oh, you bought PlexiStore.
Starting point is 00:23:48 Smart move. What are you doing with it? Oh, you're integrating it with ONTAP. Awesome. Because then you get the sand performance of NVMe over Fiber Channel as your foundation, and then you put it in memory. Well, and we really do want those snapshots to process just as fast as they can. So back-end performance is still going to be important. Well, this has been great. Andy, anything you'd like to say to our listening audience? No, this has been a really fun ride. Four years ago, NetApp was in Flash, and we were a little late to the market, and we had some things to do.
Starting point is 00:24:18 But we certainly put being late to the market to good use. As one person said to me, we're really good at being fashionably late. So now first to market with NVMe over Fiber Channel, first to market with MaxData. We're going to bring the same innovation to our HCI products and to the cloud business. So you're going to see some really exciting things from us, but it's a fun time to be at NetApp.
Starting point is 00:24:39 Okay, well, this has been great. Thank you very much, Andy, for being on our show today. Awesome, thanks for the time. And thanks to NetApp for sponsoring this podcast. Next time, we'll talk with another system technology person. Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it and please review us on iTunes and Google Play as this will also help get the word out. That's it for now. Bye, Howard. Bye, Ray. And bye, Andy. Bye, guys.
