Grey Beards on Systems - 74: Greybeards talk NVMe shared storage with Josh Goldenhar, VP Cust. Success, Excelero
Episode Date: October 30, 2018. Sponsored by: In this episode we talk NVMe shared storage with Josh Goldenhar (@eeschwa), VP, Customer Success at Excelero. Josh has been on our show before (please see our April 2017 podcast), the last time with Excelero's CTO & Co-founder, Yaniv Romem. This is Excelero's 1st sponsored GBoS podcast and we wish to welcome them …
Transcript
Hey everybody, Ray Lucchesi here with Howard Marks.
Welcome to another sponsored episode of the Greybeards on Storage podcast.
This Greybeards on Storage podcast is brought to you today by Excelero.
It was recorded on October 2nd, 2018. We have with us here today, Josh Goldenhar,
VP of Customer Success. So Josh, why don't you tell us a little bit about yourself and what's
new at Excelero? Thanks very much. Once again, this is Josh Goldenhar. Happy to be here. And
I am our VP of Customer Success, which is a fancy name for saying that I head up pre-sales, post-sales, and support.
And the good thing about that is that if you have to have support, that means you have customers.
Yeah. I've been with Excelero for four years now and am transitioning into this role this year because we said,
well, hey, we've got people using this and we need to formalize support and have that available.
So that's what we're doing here. And that's myself. I have a background in storage and
large-scale Linux and Unix administration. So that's me in a nutshell.
Oh, that's great.
Well, having customers is always a great thing.
We first saw you at Storage Field Day 12.
Ray was very impressed.
I was extremely impressed with the speeds that you guys were able to get out of two SSDs, my Lord.
Why don't we start with a quick review of what NVMesh is
and how the NVMe market is developing, because it seems to me it's moving
pretty well. It is. And you're both bringing up interesting points. We came out and we
kind of wowed the world with the performance you could get out of off-the-shelf components. And
that's really still the heart of NVMesh. So in a nutshell, NVMesh is software and it's software that you add to off-the-shelf
components such as servers, 25, 50, or 100 gigabit networking, and standard NVMe drives. We put all
these together, we pool this all as a resource and we've really created this next generation
SAN product. So it's like storage area networking,
but instead of Fibre Channel, you can use Ethernet.
The benefit is that you get half the cost
of 32 gig Fibre Channel, but three times the bandwidth.
But beyond that, it works kind of like you think about it.
And that's really our best claim to fame
is this crazy low latency access to things that look like LUNs or virtual NVMe
drives, but they go across the network so that you can go ahead and disaggregate your storage
if you want to and get local performance. So that's our key thing. If I wrap it up,
we're software. We allow you to pool NVMe resources over a network, access logical volumes carved out of that pool as if they were local.
But you get the benefits of shared storage, such as redundancy, the ability to have hosts fail, the ability to have drives fail.
It's kind of like a top of a rack storage system. Is that how you see this?
You could, but on a typical top of rack, when we're talking about that today, there's been a lot of buzz about NVMe over Fabric.
And NVMe over Fabric, at first glance, if you don't dive into it, sounds like what you need for top of rack storage.
That is, you put at the top of a rack a server with a bunch of NVMe drives, and you access those drives remotely with NVMe over Fibre Channel.
But we like Fibre Channel. We're old storage guys.
Excuse me. You know what? I misspoke. NVMe over Fabric is what I meant to say. NVMe
over Fibre Channel is another flavor of NVMe over Fabric. But it's important to understand here that
Fibre Channel NVMe or NVMe over Fabric, generally referring to an InfiniBand or an Ethernet Fabric, is just a transport.
It's just like saying Fibre Channel or saying iSCSI or saying SRP.
These are just the way you communicate from a client to a target. By itself, it doesn't imply any kind of management or the ability for multiple
hosts to use the same resource or logical volumes or redundancy or any kind of protection.
No, but that's not really a protocol's job.
Exactly. I agree with you. I agree with you.
I view NVMe over Fabrics as very much like iSCSI in that it's, look, here's transport,
and what we're transporting is NVMe where what we used to transport was SCSI.
And we wrapped it in TCP for iSCSI or we wrapped it in FCP for fiber channel.
Yeah, exactly.
And that's where I just want to make sure folks listening understand the differentiation:
you can put a server full of NVMe at the top of a rack and use off-the-shelf open source
NVMe over Fabric protocols.
And you can access those drives remotely, but I like to call this remote DAS.
Uh-huh.
Okay.
Yeah.
You basically, you attach that.
So you've succeeded in physically disaggregating.
So there is a benefit.
I'll have to admit there is a benefit there, which is that you can make that top of rack server be specialized to hold NVMe. And then you're free to have your
compute nodes just be compute nodes that don't have to host NVMe storage.
But you're implying here that NVMesh is above and beyond NVMe over Fabric?
Absolutely. And that's what NVMesh gives you: a complete storage solution. Instead of taking 100 remote drives and treating them like 100 different entities, we treat it like you would a virtual all-flash array.
Basically allowing you to carve out logical volumes that can optionally have RAID 0 striping or RAID 1 or RAID 10 mirrored volumes.
And these preserve the performance of NVMe.
NVMesh itself preserves the latency of NVMe so that you can use these logical volumes
like they're physical drives. The client hosts think they have a local drive,
but it's really a logical volume. Yeah, but it's a resilient logical volume.
Exactly. And so that was our claim to fame in having these resilient logical volumes perform like they were local.
But what we're excited to see over the last year is this very, very solid adoption of NVMe, where NVMe drives seem to be constantly in shortage, which is good for the vendors,
bad for folks buying them.
Flash has been in shortage till just recently too.
Yep.
Yep.
But NVMe, you might argue, at the top end shouldn't also be so constrained. But
anytime we go to order drives, or our customers are trying to order
a significant number of drives,
they're kind of difficult to get.
Well, that's interesting, because I heard at the Flash Memory Summit that NVMe drives
had matched the volume of SATA drives.
Now, most of that is, you know, M.2 for laptop applications.
But I think it's still an important indicator.
Yeah, it is.
There's another thing I'm not sure people are talking about,
though, which is that there's no real reason
for an NVMe drive to be more expensive
than a SATA drive, for example.
The NAND is the same.
And the controller? The drive vendors would tell you that,
oh, they've got money sunk in the development and the controller is more expensive.
But ask yourself what should be more expensive: a quote unquote controller that takes NVMe and translates directly to NAND, or one that has to hook up to PCIe, convert that to a different protocol via a bridge or a SAS or SATA controller, and
then go ahead and communicate.
An NVMe controller is just simply not as complex.
You wouldn't think, but maybe it has performance constraints on it that the SAS controller
might not.
Well, in the laptop market, the Marvell or IDT merchant controller that's in that SSD can't cost more than a dollar or two more than the SATA version, or people wouldn't put them in laptops.
Right. So now the reason I bring this up is that the source cost on a SATA drive versus an NVMe drive is barely different, especially if we're talking
about the same class of drive. I'm not trying to compare a write-optimized NVMe versus a
read-optimized SATA. I want the same class of drive. You know they're the same price. And
who in the world knows the pricing of drives and the components? The Super 8,
the Microsofts, Amazons, Facebooks, Googles, they insist on cost plus. So if you know the cost of
an NVMe drive is barely different, if at all, than a SATA drive, why wouldn't you buy NVMe drives
for your flash needs instead of SATA? So therefore, I would submit to you that most of the consumption is going to the Super 8s
and as you said, also for the consumer devices.
So that's how it's all being sucked up.
That's why the numbers are so high:
if you had the choice between the two and they're the same cost,
why wouldn't you choose NVMe?
Oh, the only reason I wouldn't choose NVMe
is because of availability. And if I'm buying new gear or designing new gear, I would.
But at the cutting edge, the people who are adopting this are now not just saying, I want NVMe just for the highest levels of performance.
They're saying, I want high levels of performance, but now I'm willing to give a little on the
latency. I need the read latency to be just as good as ever, but my write load is really not
that high. And so I want to work more towards a capacity
optimized layer of NVMe. I want all that performance for reads. I'm willing to give a
little on the latency for writes. And we're seeing this at the same time that, and this is public
information, Toshiba has announced, I think, a 15 terabyte drive, and is eyeing, in some of their
press releases, an even larger drive.
And when you start talking about drives with that kind of capacity,
you're not talking about ultimate performance anymore.
Exactly.
Even at four lanes of PCIe on the backside of that drive,
it's going to bottleneck at the interface, not at the flash.
You don't think the controller is going to be a problem there?
Well, no, but it's the same logic as we used to use with spinning disks.
And you don't put 400 drives on one fiber channel loop because you'll saturate it.
And you don't put 400 flash chips on one x4 PCIe slot because you'll saturate it. This is why some of the vendors with ridiculously large SSDs
advertise unlimited endurance
because you literally can't write data fast enough to damage it.
Exactly.
You're talking about that 100 or the 128 terabyte offering?
Yes.
Yeah, exactly.
If you're limited by that x4 interface,
it's simply impossible to overwrite the drive enough times in its rated lifetime because of that bottleneck there.
Amazing that we're sitting here talking about a 3.5 gigabyte per second interface as being a bottleneck.
Everything eventually becomes the bottleneck.
Yeah. So nonetheless, we are seeing these drives. And so people are starting to say, you know what, again, people kind of on that higher, not so much bleeding edge, but cutting edge. Not only do I want NVMe, but I'm not content with NVMe just for the highest levels of performance. If I'm filling out systems with 8 terabyte or 16 terabyte drives, obviously I'm doing that because I need a lot of capacity.
And just mirroring in this case is not going to cut it for me, especially if I don't have a heavy write load, if I'm mainly reads.
And some of us are sufficiently paranoid that for production workloads, we would want you to mirror three ways.
Exactly. And so that is what we're seeing. And hence, we're responding with that. That's the direction of NVMesh is that we really have these areas where we're starting to see very quickly now a change that people are not only accepting NVMe, but thinking that, okay, this is
going to become more mainstream. I'm going to use this for capacity. It's still for a tier zero
to a tier one, but they want that capacity optimization. They're willing to yield
a little on the write latency that these devices can provide.
So when you say capacity optimization, are you talking about RAID
versus mirroring or, God forbid, dedupe or compression or something like that?
Those will come, but I think it's always by a matter of degrees. So a dedupe and compression
on write is going to really severely impact latency. Well, depending on where you do it. If you land data in a cache, ack it, and then reduce it.
That's true.
And I am tainted by my own product.
So all of our claims on performance are synchronous down to the drives. We don't ever cache.
Right. And being all software, that's the best thing because you don't have NVDIMMs and special hardware and those things.
Yep. So we'll be reacting to that, or we are reacting to that by adding erasure coding or
distributed RAID into our product. It'll be branded Mesh Protect.
And that is coming out in its first version.
It will be very similar to RAID 6, in an 8 plus 2 configuration.
So you'll get 80% usable capacity versus 50% usable on mirroring.
Okay.
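For listeners who want to check that efficiency claim, here is a minimal back-of-the-envelope sketch in Python; it is generic RAID arithmetic for illustration, not Excelero code.

```python
# Back-of-the-envelope usable-capacity math for the protection schemes discussed above.
# Generic RAID arithmetic for illustration only, not Excelero/NVMesh source code.

def usable_fraction_parity(data_segments: int, parity_segments: int) -> float:
    """Usable fraction of raw capacity for an N+M parity stripe (e.g. 8+2)."""
    return data_segments / (data_segments + parity_segments)

def usable_fraction_mirror(copies: int) -> float:
    """Usable fraction for N-way mirroring (RAID 1/10 keeps one usable copy of N)."""
    return 1 / copies

print(usable_fraction_parity(8, 2))   # 0.8  -> the 80% usable quoted for Mesh Protect's 8+2
print(usable_fraction_mirror(2))      # 0.5  -> the 50% usable quoted for mirroring
print(usable_fraction_mirror(3))      # 0.33 -> Howard's "mirror three ways" case
```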
And NVMesh still does data placement in the client, right?
Exactly.
So the client now will be directly talking in a stripe of, in its simplest form, eight plus two drives, so 10 drives total.
It'll be calculating parity, and it will be talking to those 10 drives directly and placing data and placing the parity.
So the definition is still
centrally managed, but once that definition is handed to the client and the client attaches to
a logical volume, it's performing all IO directly to the hosts and the drives contained in
those hosts. And so the RAID stripe is dedicated to those eight plus two drives?
Is that how it'd work?
Or is it some sort of namespace?
I'm not even sure what the terminology is
that's carved out of the drives on the back end.
So it's a good question.
We do it similar to how we do RAID 10,
which is that none of our methods
ever take over an entire drive. That is, you're
not fixed to drive sets. So this is not like a traditional, when I say 10 drives, I don't mean
you have to have 10 drives and we use the whole thing and it's a fixed size. We allocate segments
from at least 10 drives and those segments can be of variable sizes. Now the segments as you stripe have to be
the same size, but I literally could take a 100 megabyte or 100 gigabyte segment off of a one
terabyte drive and the other nine drives could be three terabyte drives and I could still use only
a 100 gig segment, and going across those 10, I would create an 800 gigabyte volume because there'd be eight 100 gigabyte segments for data and
two for parity. And so I can stripe these across multiple different drives. And your next question
to me is going to be, well, what about in a drive failure? So we use excess capacity from the other
drives in the pool to backfill those segments. So in the case of when you did have a drive failure,
any segments that were in use on that drive would be replaced by alternate segments. And
so it may not be a one-for-one replacement. Everything's on a per volume or logical volume
basis. And on the same set of SSDs, I can intermix mirrored NVMe namespaces that I'm using for write-intensive applications
and erasure-coded namespaces, right?
Yep, absolutely.
That's still left up to you, and there are some best-practice suggestions if you wanted
to restrict that.
Let's say you had some very high-performance, low-latency write-intensive drives.
You could put those into
a pool to say, you know what, only RAID 1 or RAID 10 volume should only go to these drives.
But these kind of NVMesh or Mesh Protect distributed RAID volumes here, these should
go to these other drives. So we have mechanisms to go ahead and create different pools.
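To make the segment-allocation example above concrete, here is a small illustrative sketch of carving equal-size segments from a pool of differently sized drives for one 8 plus 2 volume; the drive names, sizes, and allocation logic are our assumptions, not NVMesh internals.

```python
# Illustrative sketch: carve equal-size segments from a pool of differently sized
# drives to back one 8+2 volume. Hypothetical model, not NVMesh's actual allocator.

SEGMENT_GB = 100                 # every segment in a stripe must be the same size
DATA_SEGS, PARITY_SEGS = 8, 2

# A pool of drives with heterogeneous free capacity (in GB), as in Josh's example:
# one 1 TB drive plus nine 3 TB drives.
pool = {"drive0": 1000, "drive1": 3000, "drive2": 3000, "drive3": 3000,
        "drive4": 3000, "drive5": 3000, "drive6": 3000, "drive7": 3000,
        "drive8": 3000, "drive9": 3000}

def allocate_stripe(pool: dict, seg_gb: int, n_segs: int) -> list:
    """Pick n_segs distinct drives with enough free space and reserve one segment on each."""
    chosen = [d for d, free in pool.items() if free >= seg_gb][:n_segs]
    if len(chosen) < n_segs:
        raise RuntimeError("not enough drives with free space for this stripe")
    for d in chosen:
        pool[d] -= seg_gb
    return chosen

drives = allocate_stripe(pool, SEGMENT_GB, DATA_SEGS + PARITY_SEGS)
usable_gb = DATA_SEGS * SEGMENT_GB
print(f"{len(drives)} drives used, {usable_gb} GB usable volume")  # 10 drives, 800 GB usable
```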
Okay. Yeah. So the other question would be, you mentioned that spare segments would be allocated across the other drives.
Do you do like a hot spare drive or anything like that?
And the other question was, what does a rebuild look like in this NVMesh world?
So we'll break it down into those two questions. Because we don't
treat drives as whole entities, you can allocate spare capacity. So we do have a mechanism
that we call provisioning groups. In that you can say, pre-reserve this space on these drives. So you could pre-allocate 20 terabytes in this VPG.
And then in your volumes, if every drive is three terabytes, let's say, that means as
long as you create a total number of volumes in that group that are 17 terabytes or less,
you have three terabytes of spare space.
Right, but I could choose to be really paranoid
and enforce that and say,
make sure that there's always one drive's worth
of free space in this pool.
Yeah, we can go ahead and via policies,
go ahead and have a pool defined that way.
Yeah, I have seen too many distributed
erasure code systems that don't do that.
Right, and you get in trouble when you have a failure.
Oh yeah, it's like, look, we have 500 gigabytes free
and a one terabyte drive just failed.
Yeah, that's not good.
No, no, no, bad things happen.
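Howard's concern boils down to a simple admission-control rule: never let free space in the pool drop below one drive's worth of capacity. Here is a minimal sketch of what such a policy check could look like; it is a hypothetical guard for illustration, not a description of NVMesh's actual policy engine.

```python
# Sketch of a "keep one drive's worth of headroom" policy check for a pool.
# Hypothetical guard logic for illustration; not how NVMesh implements its policies.

def can_provision(drive_sizes_gb: list, allocated_gb: int, request_gb: int) -> bool:
    """Allow a new volume only if, afterwards, free space still covers the largest drive."""
    total = sum(drive_sizes_gb)
    free_after = total - allocated_gb - request_gb
    return free_after >= max(drive_sizes_gb)

drives_gb = [1000] * 20   # a 20 TB provisioning group built from 1 TB drives
print(can_provision(drives_gb, allocated_gb=17000, request_gb=1000))  # True: 2 TB still free
print(can_provision(drives_gb, allocated_gb=17000, request_gb=2500))  # False: only 500 GB would remain,
                                                                      # not enough to rebuild a failed 1 TB drive
```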
And then to Ray's point, in our system,
rebuilds are done not by the clients themselves, but they are done by our
client-side functionality within the target server. That takes a little explanation. What I mean there
is that the host that contains the drive or drives that had a failure, it's the one that's
responsible for doing the rebuild activity.
So this way we don't have somebody's client desktop, maybe an M&E workstation
on a 25 gigabit link, which in an extreme case is technically the client here, talking to these target systems
that have multiple hundred gigabit
links and doing the rebuild. You wouldn't want that client doing the rebuild. So the targets do it for each other. Now,
we do favor client-side IO over the rebuild. So today, we allow the clients to have as much
performance as they can get, and we kind of make the rebuild go in the background. Although
there are planned features around that if you purposefully want to have the rebuild, for instance, complete as
fast as possible. Sometimes I'd like to just say I want a minimum rebuild rate, so I can avoid
the rebuild that takes forever. Yes, yeah. So for today, we favor the client, but tomorrow we'll go ahead. And there are even
ways today to speed up the rebuild simply by having multiple clients participate in that.
So what I mean by that is on our rebuilds, and this was the same for our mirrored rebuilds,
when a volume goes into a degraded mode, that is, it has to be rebuilt. And during that rebuild process,
in a mirrored case, we are only reading from the good mirror. But we write to both locations. And that's both the entity that's doing the rebuild, plus any client side activity. So if you naturally
have a clustered file system, and you have multiple clients doing writes, all the clients
will inadvertently help participate in the rebuild because anytime they do writes, they will
participate. And that's going to be the same thing for Mesh Protect. The difference is, of course,
if it's a data drive that failed, you'll have to go ahead and use parity to calculate the data for a read. But any writes that occur, we'll go ahead and rebuild basically a stripe so that the more
clients you have simultaneously doing write activity, you'll actually naturally speed up the rebuild.
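The point about client writes naturally helping a rebuild can be shown with a toy degraded-mirror model: reads are served only from the healthy copy, while every new write lands on both copies, so the replacement copy fills in as a side effect of normal I/O. This is a simplified illustration under those assumptions, not NVMesh's actual I/O path.

```python
# Toy model of a degraded RAID 1 volume: reads come from the good mirror only,
# writes go to both copies, so client writes repair the new mirror as a side effect.
# Simplified illustration only; not NVMesh's actual data path.

class DegradedMirror:
    def __init__(self, num_blocks: int):
        self.good = [0] * num_blocks           # surviving copy
        self.rebuilding = [None] * num_blocks  # replacement copy, None = not yet rebuilt

    def read(self, block: int) -> int:
        return self.good[block]                # degraded mode: only trust the good mirror

    def write(self, block: int, value: int) -> None:
        self.good[block] = value
        self.rebuilding[block] = value         # client write doubles as rebuild for this block

    def rebuild_remaining(self) -> int:
        """Background task: copy whatever client writes have not already repaired."""
        copied = 0
        for i, v in enumerate(self.rebuilding):
            if v is None:
                self.rebuilding[i] = self.good[i]
                copied += 1
        return copied

vol = DegradedMirror(num_blocks=8)
vol.write(3, 42)                               # a client write repairs block 3 for free
print(vol.rebuild_remaining())                 # background rebuild only had to copy 7 blocks
```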
And so as a software release, it's going to be available to all your current customer
base? Yeah. This is a big step up in that we're going to have this much higher efficiency.
But currently the plan is to go ahead and allow anybody who's at 1.2 to go ahead and upgrade to 2.0 if they're under a current support contract.
So there won't be any additional charge.
Yeah, and they won't have to reformat the back end because they're all mirrored today.
And if they want to add, you know, an erasure group, they could.
They can add new volumes.
Yeah, if they wanted to convert an existing volume, they will have to go ahead and back that up and then restore it into a newly created volume.
But yeah, if you have extra drives, you can just start using this and off you go.
Hey, this has been great. Howard, any last questions for Josh? Just, and this all runs over RoCE, right?
Yeah, we actually currently support RoCE on Ethernet. We still support InfiniBand,
but I think we see about 80% of our deployments on Ethernet.
And who knows?
There could be some surprises coming, too, of supporting much more commonly accepted protocols.
Oh, goody, Fiber Channel.
Please, let's not go there.
I wasn't thinking Fiber Channel, but that's a possibility.
But I was thinking more something on Ethernet that's not RoCE.
Yeah, yeah.
Since we just said it, yeah, TCP/IP is definitely a consideration.
Okay.
Josh, anything you'd like to say to our listening audience before we go off?
No.
It's always a pleasure to get the message out there.
I hope people find this interesting,
even if only for the market perspective on what's happening with NVMe.
We're seeing a lot of adoption.
We're seeing also a lot of excitement around NVMe over fabrics.
And I think NVMesh just can use these and really make this into a storage system for you.
And I'm always glad to bring that message out.
All right.
Well, this has been great. Thank you very much, Josh, for being on our show today. And thank you, Howard. Thank you, Ray.
Thanks to Excelero for sponsoring this podcast. Next time, we will talk with another system
storage technology person. Any questions you want us to ask, please let us know. And if you enjoy
our podcast, tell your friends about it. Please review us on iTunes and Google Play as this will
also help get the word out. That's it for now. Bye, Howard. Bye, Ray. Bye, Josh. Bye-bye, Ray. Until next time.