Grey Beards on Systems - 75: GreyBeards talk persistent memory IO with Andy Grimes, Principal Technologist, NetApp
Episode Date: November 6, 2018. Sponsored by: NetApp. In this episode we talk new persistent memory IO technology with Andy Grimes, Principal Technologist, NetApp. Andy presented at the NetApp Insight 2018 Tech Field Day Extra (TFDx) event (video available here). If you get a chance, we encourage you to watch the videos, as Andy did a great job describing their new MAX Data …
Transcript
Hey everybody, Ray Lucchesi here, with Howard Marks.
Welcome to another sponsored episode of the Greybeards on Storage podcast.
This Greybeards on Storage podcast is brought to you today by NetApp MAX Data and was recorded
on October 28th, 2018.
We have with us here today Andy Grimes, Principal Technologist at NetApp.
The Greybeards talked with Andy at the NetApp Insight Tech Field Day Extra event last week.
So Andy, why don't you tell us a little bit about yourself and what's new at NetApp?
Thanks, Ray and Howard, for having me on.
Again, my name is Andy Grimes. I'm a Principal Technologist at NetApp.
Been with NetApp for about 10 and a half years. Recently, I worked on the NetApp MAX Data product and bringing
it to market. Before that, I was on NetApp flash solutions. But now I'm actually moving over to our
HCI solutions, which, ironically, will start to work with our MAX Data solution,
hopefully in the near future. So great to be here today, and I'd love to talk to you about MAX Data. What exactly is MAX Data? Is it a file system? Is it a dessert topping? Why do I care?
Well, so MAX Data is a software product that NetApp has just announced we will be releasing in the
very near future. It's a software-defined flash solution. So we think of it as the next
generation of flash,
leveraging persistent memory solutions as they come to market. And MAX Data is effectively a way to project, or to tier, your data from traditional storage, SAN storage, into a server on the memory
bus, on a persistent memory DIMM, and then accelerate it to memory speeds faster than is possible with
any other technology in the market. So this would be like NVDIMMs and 3D XPoint kinds of memory
solutions? Is that how you'd see this being used? Yes. The opportunity is, you know, we've had DRAM
for a long period of time. Lately, we've started to have NVDIMMs, and soon we will have Optane DIMMs.
Those are three different types of technologies, but DRAM is, of course,
volatile memory. NVDIMMs are a more expensive, power-protected, DRAM-based technology. And
now we finally have Optane DIMMs, which will be coming from Intel, hopefully in the near future.
And hopefully at a lower cost per gigabyte than DRAM.
That's the expectation, although Intel ultimately owns their pricing. NVDIMMs are more expensive
today, but they do give you memory-speed performance with the power-loss persistence of 3D NAND.
But Optane DIMMs really promise much more than that. Higher performance, near memory speeds with higher write resiliency
and other performance advantages.
But mainly it's going to come
in very, very large capacities,
which allow us to store data
literally on the memory bus.
Okay, so today,
if I go to HPE or Dell,
they'd be happy to sell me
16, maybe 32 gig NVDIMMs.
So that means in a server, I might have 256 gigabytes.
Are you just making that available or are you using that to accelerate access to something else?
So what we're doing with MAX Data that's really unusual in the market is this: MAX Data is a
software load, and when you install it on a system, it uses as its memory tier some type of
DIMM on the memory bus. And what the software does is it actually configures a single mount point
concatenating the memory tier and the storage tier. And the storage tier is an underlying block device, which in our first
release will be a NetApp ONTAP LUN, or multiple ONTAP LUNs, as the backend tier. But basically,
what MAX Data does is create a single mount point with a memory tier and a storage tier.
And what that allows you to do is write data into the mount point, and it enters through a POSIX API. So traditional
application semantics continue to work. But basically, all writes and
reads occur onto, excuse me, all writes occur onto that persistent memory tier, and then they will be
tiered down to the storage tier at a later time. Reads will be opportunistically promoted into the
primary memory tier for extremely high-performance, low-latency reads as well as writes.
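To picture what "traditional application semantics continue to work" means in practice, here is a minimal sketch of ordinary POSIX-style file I/O in Python against a hypothetical MAX Data mount point; the path and file names are illustrative, and nothing here is specific to the MAX Data software itself.

```python
import os

# Hypothetical mount point; the memory tier and the backend LUN(s) behind it
# are presented to the application as one ordinary directory.
MOUNT_POINT = "/mnt/maxdata"

def write_record(name: str, payload: bytes) -> None:
    """Unmodified POSIX-style write: per the description above, it lands in the
    persistent memory tier first and is tiered down to storage later."""
    path = os.path.join(MOUNT_POINT, name)
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # durability point is the persistent memory tier
    finally:
        os.close(fd)

def read_record(name: str) -> bytes:
    """Reads are served from (and opportunistically promoted into) the memory tier."""
    with open(os.path.join(MOUNT_POINT, name), "rb") as f:
        return f.read()
```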
So what sort of performance can we expect from an access latency perspective?
So we've seen single-digit microsecond latencies for single-threaded, single-core workloads.
We are running in user space to help us with GPL licensing.
So we don't have complete control when we get to multi-threaded multi-core,
but single-threaded single-core will see under 10 microseconds of latency.
When we get to multi-core multi-threaded, we'll typically see higher latencies,
but typically still under that 40 microsecond latency number,
although it will vary a little bit.
But that's still phenomenal compared to what we've achieved with externally attached SAN in the 150 to 200 microsecond range, where we really run into the
network bottleneck limitations. Yeah, we were really impressed with the NVMe over Fabrics
solutions pushing 100 microseconds just last month. Yeah. It's been a crazy world where you
literally talk to customers and you go, here we are at 150, 200, 100 microseconds
of latency. Oh, I can do 10. Well, sometimes I might do 40. And it's like, well, give me a break.
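For listeners who want to see where numbers like these come from, a single-threaded latency probe can be as simple as the sketch below; the mount point path is hypothetical, and the results obviously depend on the DIMMs, the file system, and whatever sits underneath.

```python
import os
import statistics
import time

PATH = "/mnt/maxdata/latency_probe.bin"   # hypothetical file on the accelerated mount point
WRITES = 10_000
payload = os.urandom(4096)                # one 4 KiB block per write

fd = os.open(PATH, os.O_CREAT | os.O_WRONLY, 0o644)
samples_us = []
try:
    for _ in range(WRITES):
        t0 = time.perf_counter_ns()
        os.pwrite(fd, payload, 0)         # single-threaded, single-core write
        os.fsync(fd)
        samples_us.append((time.perf_counter_ns() - t0) / 1_000)
finally:
    os.close(fd)

print(f"median write latency: {statistics.median(samples_us):.1f} us")
print(f"p99 write latency:    {statistics.quantiles(samples_us, n=100)[98]:.1f} us")
```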
You know, we found a couple of solutions that actually prefer user space because they can do
polling rather than being interrupt-driven and stuff like that. So it's kind of an interesting
dilemma as to where you're actually benefiting from kernel space versus user space. But nonetheless...
We had a number of customers actually request the user space shift. So that
was why we made that change. Yeah, we're seeing a lot more of that.
Intel's encouraging it with SPDK. I think that pendulum
swings back for a while. Exactly. So a couple other things
that are really interesting, though,
is persistent memory brought us into,
and I'm an old storage guy, so I don't trust servers.
Yeah.
I believe we own the sacred duty to never lose the data
and never corrupt the data,
and preferably never let you lose access to the data,
and it's a religion.
Amen.
And I've been in the world of, you know,
HCI and some of the HCI companies out there
are three nines of reliability
and oh, what's a little data corruption between friends.
Nobody will ever notice
if they don't read those files anyway.
Exactly.
So I really look at Max Data
and that was one of the things
that we required as a storage company.
When we actually looked at Max Data,
we thought, okay, I'm persisting writes in a server, and the servers are going to burn down, fall over, and
sink into the swamp at any moment. So we need a way to protect those writes. Normally with DRAM,
for example, we would restrict it to reads only, so the data always existed somewhere else.
With MAX Data, we have to be able to not only serve writes, in order to support an existing
application without
modification, but also protect them with enterprise-class tools. So that's why MAX Data also
includes the ability to do snapshots, which effectively take the data in the tier 1 space,
flag all unique blocks, and then copy them down to the storage tier for persistence on an external system. So if you trigger a snapshot, or a snap sync, in MAX Data,
we will actually trigger the copy down
and then the ONTAP snapshot, if it's an ONTAP backend,
with SnapMirror, SnapVault,
and all the typical NetApp data protection tools.
So you have a complete enterprise compliant
data protection suite,
which has never been available
to in-memory applications up until now. I thought there was another choice. Yeah.
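To make that snapshot flow a little more concrete, here is a purely conceptual sketch in Python. It is not the MAX Data API or on-disk format, just an illustration of the idea Andy describes: all writes land in the memory tier, a snap sync copies the blocks unique to tier 1 down to the storage tier, and only then is the backend snapshot triggered.

```python
from dataclasses import dataclass, field

@dataclass
class TieredVolume:
    """Conceptual model of a tiered mount point: a small, fast memory tier
    in front of a large block storage tier."""
    memory_tier: dict = field(default_factory=dict)    # block_id -> bytes (persistent memory)
    storage_tier: dict = field(default_factory=dict)   # block_id -> bytes (e.g. a backend LUN)
    dirty: set = field(default_factory=set)            # blocks not yet copied down

    def write(self, block_id: int, data: bytes) -> None:
        # All writes land in the persistent memory tier first.
        self.memory_tier[block_id] = data
        self.dirty.add(block_id)

    def snap_sync(self, take_backend_snapshot) -> None:
        # Copy every block unique to tier 1 down to the storage tier...
        for block_id in sorted(self.dirty):
            self.storage_tier[block_id] = self.memory_tier[block_id]
        self.dirty.clear()
        # ...then trigger the backend snapshot (SnapMirror, SnapVault, etc. hang off this).
        take_backend_snapshot()
```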
The other thing we do is, I still look at a server as likely to fail. And so we want to go even further than a snapshot. I want to be able
to actually protect a server by copying the memory tier to a second server's memory tier. And that's
called MAX Recovery. And we basically take
two MAX Data instances, we peer them together over a private 100 gig Ethernet RDMA network,
and every write that lands in the memory tier of a MAX Data primary server
will be replicated to the memory tier of a secondary server.
So is that replication synchronous?
It's synchronous, and it's RPO zero. So we will never,
ever lose a write. Right. There you go. There's some overhead with that, right?
Yeah, but we're not talking about a MetroCluster-style, looks-like-one-big-array, fails-over-automatically setup yet, right? No. No, it's an active-passive architecture. So you'll have a primary and a
secondary. But in the event of a failure, we will copy the data back over that 100 gig Ethernet RDMA network. And we've seen copy-back times of approximately three minutes per terabyte. So it's an incredibly fast architecture. And we've seen, like in MongoDB instances, the recovery time for a shard failure can be amazingly fast, you know, just a few minutes
for most implementations. So you've got, you know, in-memory acceleration performance without
really modifying the application, which is incredibly cool to add to your enterprise arsenal,
but also the ability to protect and even recover more quickly in various failure scenarios.
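As a rough illustration of what "synchronous, RPO zero" means in the MAX Recovery description above, here is a conceptual Python sketch, not the product's actual protocol or wire format: a write is only acknowledged after it sits in the local persistent memory tier and the peer has confirmed its copy.

```python
import socket

class SynchronousReplicatingWriter:
    """Conceptual sketch of synchronous, RPO-zero write replication: every
    write lands in the local memory tier and on the peer before it is
    acknowledged to the application."""

    def __init__(self, local_tier: dict, peer_addr):
        self.local_tier = local_tier
        # Stand-in for the private 100 GbE RDMA link between the two servers.
        self.peer = socket.create_connection(peer_addr)

    def write(self, block_id: int, data: bytes) -> None:
        self.local_tier[block_id] = data    # local persistent memory tier
        self._replicate(block_id, data)     # block until the peer acknowledges
        # Only now is the write acknowledged upstream: RPO zero.

    def _replicate(self, block_id: int, data: bytes) -> None:
        header = block_id.to_bytes(8, "big") + len(data).to_bytes(4, "big")
        self.peer.sendall(header + data)
        if self.peer.recv(1) != b"\x01":    # wait for the peer's ack
            raise IOError("peer failed to acknowledge replicated write")
```

At the quoted copy-back rate of roughly three minutes per terabyte, rehydrating even a multi-terabyte memory tier after a failure stays in the tens of minutes.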
And so this is sort of allowing persistent.
So a couple of questions.
You know, the file IO stack in Linux
has been relatively convoluted,
enhanced and evolved over many, many generations.
And it's pretty deep.
Are you guys using the file IO stack or?
So we're using our own custom file system.
It's been submitted to the Linux kernel;
it's called ZUFS.
In our branding, it's MaxFS.
It is actually part of the reason NetApp acquired the original company that this technology is derived from, Plexistor, about a year and a half ago.
The MaxFS file system, and the ZUFS file system that's derived from it, is extremely low metadata, low overhead,
very, very low latency. And while we're bringing the MaxData product to market, we see a tremendous
opportunity for that. We're also integrating it into our ONTAP ecosystem, into our internal
components. So it's a really, really slick capability that we've added to our arsenal.
You mentioned metadata.
So there are directories and inodes and all this stuff associated with the persistent memory file system?
So typically we do use metadata, but we don't have the journaling and the multi-metadata recovery that is typically used for OS file systems.
This is typically a memory file system with much more elegant, simpler
paging access. We actually do have a memory API as well, and the ability to add functions like
actually pinning data into memory. So if you have files that are stored in the MAX Data mount point,
there are semantics to pin specific files so they are always accelerated.
So there's some interesting capabilities that we'll continue to add to later on. But what's really brilliant about the solution is it is software. So I could
run it on a server that a customer purchases, with memory that a customer purchases,
without custom chipsets, without custom hardware design, without the custom flash modules that
popular wisdom, and certainly the marketing departments of many organizations, have insisted
you've always needed to go this fast.
And we've actually proven that we can do it from software.
With either DRAM or NVDIMMs or, sometime later, Optane DIMMs, kind of thing.
Yeah, yeah.
Well, so wait a second, Ray. Did you say we could use ordinary DRAM for this?
I believe so. That's what he mentioned earlier on, right? I mean, if you had enough DRAM
to do something of substance, right?
Well, DRAM costs a lot less than NVDIMMs. I mean, I'm a risk-taker. No, I'm not. I'm
a storage guy. Never mind.
So we certainly support DRAM; if your application tolerates it, we will support DRAM at release. But the opportunity we see is the inclusion of persistent
memory into your data workflows: NVDIMMs now, where the cost metrics make sense, and then Optane
DIMMs when they're available in sufficient quantities with the servers that support them.
But what's beautiful about our software architecture
is we can actually innovate very quickly.
Unlike hardware-based platforms,
and no disrespect to ONTAP or Element OS or SANtricity
or any of the other hardware platforms out there,
it takes longer to release things in hardware
because you're tied to hardware release cycles
and you're typically tied to much broader feature sets
that you have to validate or regress against.
More appliances out there and such.
Yeah.
Well,
that and spinning an ASIC takes a while.
Yeah.
Yeah.
Especially in that case.
But what's beautiful about MAX Data
is that we plan to release a new version about every three months.
So when we talk to a customer about it,
I don't have a really
heavy hardware interface matrix that I have to validate against and constantly retest. So at release,
we plan to support an ONTAP backend with ONTAP LUNs behind it, requiring ONTAP 9.5. But we have
tested it with standalone servers with internal SSDs. We've tested it with Element OS in our HCI product.
We've tested it in the cloud, where we've actually run it in Linux instances in Amazon with
DRAM and EBS storage.
And that actually brings you to the DRAM example.
There may be great use cases for using DRAM in Amazon for accelerating applications.
Do you see MAX Data as part of the Amazon Marketplace at some point?
It was, back in the Plexistor days,
and we're evaluating how quickly to bring it to market.
Because we're already doing a number of early access programs with customers,
and about a third of them were AWS.
About a third of them are server-based,
and about a third of the requests are SAN-attached.
And that's been pretty interesting for us.
And as a software product, we can go in the direction the market requests.
Yeah, I can see a lot of use cases for applications like Mongo, where it's sharded up the wazoo
and protected inter-node so that short-term losses of it being DRAM aren't really that
much of a problem.
I've been thinking a long time that persistent memory meant new database engines, but you guys are bringing the advantages of persistent memory to existing application models.
Yeah, that's the most exciting thing for me.
And when we actually talked to an analyst, it was really exciting because they said, do you know how much money a hyperscaler spends on DRAM? And we were like, no. And they said,
it's a massive percentage. It's a massive cost for them. And if Optane DIMMs, for example,
are 33% of the cost of DRAM, which was one estimate, and again, that's subject to Intel. But the reality
is that there was an article in El Reg a couple of days later saying, you know, one of
the big hyperscalers just got a shipment
and they don't know what to do with them.
And we're like, well, here you go.
I can, you know, take a server.
I can put Optane DIMMs in it.
I can give it a traditional, you know, storage tier back end,
and the application will never know the difference.
And whether that's internal serving software,
or it's integrated into a PC vendor or an ISV stack, it doesn't matter.
It doesn't know.
And it means I don't have to rewrite my application to run with Aerospike or HANA.
Yep, exactly.
And I don't give up my data protection for my enterprise class solutions either, which should accelerate adoption.
So it's a pretty exciting space for us to be in, actually.
You'd think Intel would have bought you guys by this time or something like that, considering the fact that they want to, you know,
increase adoption of persistent memory, Optane DIMMs specifically. NetApp got there first.
Yeah, I guess, I guess. You mentioned, you know, the operating system software.
What versions of Linux do you guys currently run on? And is that going to be extended, I guess?
Currently, we run on IBM Red Hat.
Oh, wait, sorry.
That's going to be late.
Or Purple Hat.
No, Red Hat 7.5, CentOS 7.5 are the current versions,
although it's fairly easy for us to qualify a different one.
The other thing that was interesting, just to digress slightly,
was at the NetApp Insight Spotlight Sessions,
we actually had Lenovo and
Cisco on stage with us when we announced Max Data with Intel.
What does that mean? Lenovo and Cisco typically
aren't on stage at the same time for anything. So you've got servers,
Lenovo servers running this, as well as Cisco UCS?
Yeah, we're qualifying Cisco UCS right now.
But Lenovo, we have an OEM with Lenovo,
and both of them wanted to be on stage for the announcement
because this is relevant to their business.
And then Intel, we invited Intel to be on stage,
and they certainly wanted to be
because this is relevant to their business.
And for a company that, you know,
three or four years ago
was just getting our flash hat back on,
it's pretty cool to be in a position where we're now actually relevant to
memory vendors and server vendors.
And even at Insight, where I presented one of our sessions
in the public call,
I had a memory vendor come up to me afterwards and say,
we absolutely have to talk to you.
You guys are going places no one else can.
And that's a pretty cool place to be for, you know, somebody who people used to think
of as a storage company.
And isn't it nice that I can buy standardized NVDIMMs from Micron or Viking and know that
they're going to work in my Lenovo or my Cisco server?
It wasn't all that long ago where it was,
oh, you want to use NVDIMMs?
There's these three models that Supermicro has
with the proper BIOS support, and that's all there is.
Yeah, I had a great conversation with a customer a few weeks ago,
and it was one of those nightmare meetings where you walk in
and the guy's like, I like software-defined.
I'm like, we do too.
And he's like, what?
He's like, I'm testing all these NVMe startups,
and they're top of rack, and they're all great.
And what are you doing? And I'm like, well, I'm just putting it right
on the memory bus.
And he's like, what?
Wait a minute.
He's like, back up.
He's like, I don't have to buy your expensive cards.
I'm like, nope.
He's like, I don't have to buy your servers
if I don't want to.
I'm like, nope.
We'd like it to when we get to HCI, of course.
But today it's a software load.
So he was like, that's exactly what I want to see. That's
exactly what I want to hear. These top-of-rack appliances don't buy me anything because I'm
still way out on the network. And so it's actually pretty exciting conversations to get into,
because it's not really a storage company anymore. And the joke is, we're not sure where
we left the storage company. Let us know if you find it. Well, I don't know.
You're addressing it via POSIX.
That makes it storage to me.
Yeah, yeah, yeah.
But it's really a memory solution
that happens to extend the storage infrastructure.
Yeah, the line blurs. Memory, storage, what's really the difference?
memory storage, what's really the difference?
You know, I think of one as expensive and I think of one as less.
It's fascinating when we talk to customers and they're like, you're doing what?
Next thing I know I'm talking to application teams, server team, high performance
computing teams, and then sometimes cloud architects.
It's a really fun place to be. I've been invited to a lot of high performance
computing RFPs lately.
And back in the day, if you've read the Michael Lewis book, Flash Boys, that was the holy grail of speed.
Right, right.
High-frequency trading and all that stuff.
And this fits high-performance computing very well because those applications checkpoint.
So I generate a snapshot at the checkpoint.
And if that server falls over, I just recover to this checkpoint.
You mentioned that HCI is a potential implementation as well.
In this environment, it would be operating with Element OS as a backend?
So at release, what we've announced and what we intend to support at shipment is an ONTAP backend attachment.
And it can be multiple LUNs to a single mount point, and there can be multiple memory DIMMs, up to six terabytes when we get to Optane DIMM support for the memory tier.
And the backend will be, you know, one to 25.
So the storage tier can be 25 times the capacity of the memory tier.
But at release, we will support ONTAP.
We have the potential to support local SSD in the server.
We also have the potential to support ElementOS
as a backend storage attachment.
And SANtricity. We've tested both of them
and have had no issues with them.
At release, we plan to support Linux only bare metal,
because if you're going really, really fast, you're probably not virtualizing it yet. However,
naturally, that's something that's a priority for us to support virtualization technologies
almost immediately. But what we're really seeing is, you know, HCI is extremely attractive for us,
because of course, it's our compute at that point.
So we can invest in that compute space.
But of course, our HCI architecture is so flexible,
you could go ahead and use your own servers sooner than that
before we release our own Optane DIMM supported server.
So there's a lot of flexibility with the technology
because at some point,
it just becomes a backend qualification,
not really a whole new like re-architecture.
And I can see different customer sets wanting different of those options.
There's a bunch of the hyperscalers or people who'd like to think of themselves
as hyperscalers would like the local SSD,
but would probably be better served with something else.
Yeah, about a third of them have really said to us,
well, we're using Ceph, we're using Cassandra,
MongoDB, or Oracle,
and we'd like to use local internal SSD.
And it's like, okay, sure, we can support that
pending a future release item,
but we can test it fairly easily today.
In fact, I have it running on my laptop in a CentOS image in DRAM
with a local file for the block layer.
And I'm running a couple hundred thousand IOPS off of it.
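In the same spirit as that laptop experiment, a very rough single-threaded random-write IOPS probe might look like the sketch below; the file path is hypothetical and this is nowhere near a calibrated fio run, just a way to get a ballpark number.

```python
import os
import random
import time

PATH = "/mnt/maxdata/iops_probe.bin"      # hypothetical file on the accelerated mount point
FILE_SIZE = 1 << 30                       # 1 GiB working set
BLOCK = 4096
DURATION_S = 10

fd = os.open(PATH, os.O_CREAT | os.O_RDWR, 0o644)
os.ftruncate(fd, FILE_SIZE)
payload = os.urandom(BLOCK)

ops = 0
deadline = time.monotonic() + DURATION_S
try:
    while time.monotonic() < deadline:
        offset = random.randrange(0, FILE_SIZE // BLOCK) * BLOCK
        os.pwrite(fd, payload, offset)    # 4 KiB random write
        ops += 1
finally:
    os.close(fd)

print(f"~{ops / DURATION_S:,.0f} write IOPS (single-threaded)")
```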
Yeah, but you demonstrate in poetry, but production's in prose.
Yeah, I guess, I guess.
So when is the release of MAX
Data scheduled for? Do you have a particular timeframe at this point? Currently,
you know, the legalese is by the end of the year. We expect it sooner than that,
but it is a brand new product to market, without any real predecessors. So we're going
to get it right when we release it. But as of right now, we're still tracking towards our release date, which should be in the near future, similar to the ONTAP 9.5 timeframe.
What's been fascinating is we kind of, okay, who are your competitors?
We're not sure.
I don't think there's anything out there at this point.
Well, rolling a whole lot of your own is your biggest competitor.
Yeah.
Yeah.
What we're seeing is the operating systems
are adding the capabilities in.
So we've seen,
but they can't do the protections
and they don't do the tiering
for reads and writes effectively.
The top of rack NVMe appliances
we're seeing rapidly niche
on themselves.
What they were designed for,
NVMe performance,
we're now delivering
with the ONTAP AFF and our EF-Series systems, with NVMe over Fibre Channel, NVMe over InfiniBand, and NVMe over RoCE support, with the enterprise data protection tools.
So we think persistent memory is going to rapidly make those NVMe niche arrays obsolete.
And that's what we've heard from a lot of the customers we've talked to who are interested in this space. It's, wait a minute, what are you doing? Oh, you bought Plexistor.
Smart move. What are you doing with it? Oh, you're integrating it with ONTAP. Awesome. Because then
you get the SAN performance of NVMe over Fibre Channel as your foundation, and then you put it
in memory. Well, and we really do want those snapshots to process just as fast as they can. So
back-end performance is still going to be important.
Well, this has been great.
Andy, anything you'd like to say to our listening audience?
No, this has been a really fun ride.
Four years ago, NetApp got into flash, and we were a little late to the market, and we had some things to do.
But we certainly put being late to the market to good use.
As one person said to me, we're really good at being fashionably late.
So now we're first to market with NVMe over Fibre Channel,
first to market with MAX Data.
We're going to bring the same innovation
to our HCI products and to the cloud business.
So you're going to see some really exciting things from us,
but it's a fun time to be at NetApp.
Okay, well, this has been great.
Thank you very much, Andy, for being on our show today.
Awesome, thanks for the time.
And thanks to NetApp for sponsoring this podcast. Next time, we'll talk with another
system technology person. Any questions you want us to ask, please let us know. And if you enjoy
our podcast, tell your friends about it and please review us on iTunes and Google Play as this will
also help get the word out. That's it for now. Bye, Howard. Bye, Ray. And bye, Andy. Bye, guys.