Grey Beards on Systems - 85: GreyBeards talk NVMe NAS with Howard Marks, Technologist Extraordinary and Plenipotentiary, VAST Data Inc.
Episode Date: August 1, 2019. As most of you know, Howard Marks was a founding co-host of the GreyBeards on Storage podcast and has since joined VAST Data, an NVMe file and object storage vendor headquartered in NY with R&D out of Israel. We first met with VAST at Storage Field Day 18 (SFD18, video presentation). Howard announced his employment at that event.
Transcript
Hey everybody, Ray Lucchesi here with Greg Schulz.
Welcome to the next episode of the Greybeards on Storage podcast,
the show where we get Greybeards Storage bloggers to talk with system vendors
to discuss upcoming products, technologies, and trends affecting the data center today.
This Greybeards on Storage episode was recorded on July 18, 2019. We have with us here today an
old friend, co-host emeritus Howard Marks, technologist extraordinaire and plenipotentiary
of VAST Data Inc. So Howard, why don't you tell us a little bit about what you've been
doing these days and about your new company?
Sure.
Well, I decided to take a trip on the vendor side and see what it was like telling a particular story instead of helping everybody else tell their stories.
And I had a conversation with the folks at VAST, and it looked like a really interesting story. Basically, our founders, who designed a couple of the first generation of all-flash arrays, XtremIO and Kaminario, looked at the
all-flash market and realized that, while it was possible to make systems faster than that generation of all-flash
arrays, the number of customers who were saying, you know, my XtremIO isn't fast enough, I need
something faster, was really small. What people were saying was, I love my all-flash array, I wish I could afford it for more of my applications. And so VAST is about bringing
all flash to the masses. And when I say large
scale, our entry point is on the order of 500 terabytes of data before data reduction. So,
you know, we're look, we're selling into the large enterprise, the national labs, the machine learning, the data-intensive applications, and people with lots of data.
So, are you focused primarily on the HPC marketplace then?
The goal is, and the design concept, is to be universal storage. Traditionally, we've divided storage up into tiers matching the cost
of storing any piece of data to the media we put it on. And the problem with that is as we discover
new value, we have data stuck on slow systems that don't let us extract that value.
So, you know, in many, many cases, there's things like, well, yes, the archive is in
an object store, but the object store isn't fast enough for the full text indexer to run against.
So when I have to do a query, I have to promote data up a tier, run the query, and then throw it away.
We've designed a system with no tiers that, in ways we'll get to later, brings the cost of all that flash down
to where we can compete with 7,200 RPM spinning-disk-based systems
for many applications.
All right, so let me try to unpack all this.
File and object store, but no block.
Right, file and object, but no block.
And NFS and SMB 3.whatever today?
Today, NFS v3 and S3; SMB and NFS v4 soon.
And, uh, S3 compatible? Compatibility, is that something that you can verify these days? I have no idea, you know, there's various versions of that. We support the core S3 API plus a couple of additional functions for specific uses. There's an uncommon S3 function called fan out
to say, here's an object, create me 400 copies of that object. And we have some customers who
find that to be attractive. So we support that as well as the usual stuff. But when you start
looking at the S3 functions that applications like backup applications and archive applications
and media digital asset management systems use, that's a common subset and we support
that common subset.
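As a rough illustration of what that fan-out extension saves a client from doing: with plain S3 semantics you would issue one copy request per target key. The sketch below uses the standard boto3 `copy_object` call; the bucket and key names are hypothetical, and the actual VAST fan-out API is not described in the episode.

```python
# What a client would otherwise do to get 400 copies of one object: 400
# server-side copy requests. A single fan-out request collapses this loop.
import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "media-masters"       # hypothetical bucket
SOURCE_KEY = "render/frame-0001.exr"  # hypothetical key

for i in range(400):
    s3.copy_object(
        Bucket=SOURCE_BUCKET,
        Key=f"render/copies/frame-0001-{i:03d}.exr",
        CopySource={"Bucket": SOURCE_BUCKET, "Key": SOURCE_KEY},
    )
```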
When we were talking at some Storage Field Day in my past life here, VAST made a, I don't know if it was the CTO or the CEO who made the comment
that we're going to kill off disk.
It was very provocative.
I say he's crazy.
But, you know, disk has been around for, gosh, I'm thinking 78 years now,
and it hasn't gone away. It might be, you know, moving down
scale or down the stack or down the tiers. But so, yeah, tell me about why that is. I mean, the
path for spinning disks is going to become like the path for tape. It becomes the
right solution for narrower and narrower use cases.
Gosh, but you're still selling millions and millions of disk drives.
Yes, but we're just reaching the point where that diminishment is starting for anything
but primary storage. Today, the number of people using all flash systems for things like backup and archives
is small because all flash systems have traditionally been too expensive for that.
And curiously, part of why they've been too expensive is because they've been too small.
If you think about flash wear and write endurance as an issue, if you have five small systems, each supporting a rack of servers running VMs, then you have to have each system be able to deal with the busiest of that fifth of your data.
If you put them all on one system, you get to amortize the wear across a much bigger pool of flash. The other thing you mentioned in the introductory spiel was this no-tiering concept.
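A quick back-of-the-envelope illustration of that wear-amortization point, using made-up workload numbers rather than anything from the episode:

```python
# One "hot" rack and four quiet ones. Siloed, each array must be endurance-rated
# for its own worst case; pooled, the same writes spread over all the flash.
daily_writes_tb = [40, 5, 5, 5, 5]   # hypothetical per-rack write load, TB/day
flash_per_system_tb = 500

worst_dwpd = max(w / flash_per_system_tb for w in daily_writes_tb)
print(f"siloed: hottest array sees {worst_dwpd:.3f} drive writes per day")

pooled_dwpd = sum(daily_writes_tb) / (flash_per_system_tb * len(daily_writes_tb))
print(f"pooled: whole cluster sees {pooled_dwpd:.3f} drive writes per day")
```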
Right.
You might want to tell me how that's going to work in an environment where you're doing primary storage, S3.
Now it's time to introduce the architecture.
So we sell two pieces of hardware.
You know, all of our IP is in the software, but we sell appliances because it's, people want to buy appliances and support is easier and all the other good reasons.
So we sell two kinds of appliances.
The main one is the VAST Enclosure. It's a fault-tolerant HA NVMe over Fabrics JBOF.
Holds 56 U.2 SSDs.
12 of those SSDs are Optane SSDs.
The other 44 are QLC SSDs. And we use QLC as a shortcut for SSDs that use QLC,
but we also mean SSDs that don't have a DRAM buffer and the super cap to protect the DRAM buffer. They're very low cost SSDs.
But wait, Optane?
You're not using Optane as a tier of storage?
You're just using Optane as a cache?
We're not even using Optane as a cache.
We're using Optane as a write buffer.
Oh, that's a cache?
That's a write buffer as a cache?
It is, but it's a limited case.
What we don't do is ever promote data from the flash to the Optane.
So because reads from the QLC flash are fast enough, there's no reason to say,
when somebody reads this block of data, I'm going to promote it to the Optane.
I'm just going to satisfy those reads from the flash.
So as the customers see it, when we talk about usable capacity, it's all the flash.
The Optane doesn't count at all.
The Optane's where we store all the metadata, because the other appliance is a four-servers-in-2U
appliance that we OEM from Intel, where all the software and intelligence runs.
And when we run our software on those servers and we turn them into what we call vast servers,
those servers satisfy NFS and S3 requests.
They manage the data on the SSDs in the enclosure.
They manage the metadata, but they're stateless.
All the state is stored in the Optane in the enclosures in a cluster.
So there's no east-west crosstalk between the VAST servers,
and the system will run just fine with one VAST server
because one VAST server has access to all the SSDs
and therefore all the data and all the metadata.
Now, 600 terabytes of data behind one pair of x86 processors
might bog down, but there's no functions that we're going to lose.
So it's a scale-out architecture.
We can have an arbitrary number of enclosures
and an arbitrary number of servers.
And because most of the performance comes from the servers,
we can scale performance and capacity independently.
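A small sketch of that scaling model: stateless servers supply protocol and compute performance, enclosures supply capacity and hold all the state, and the two scale independently. The per-unit figures below are illustrative assumptions, not VAST specifications.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    servers: int      # stateless protocol/compute nodes
    enclosures: int   # HA NVMe-oF JBOFs holding all the state

    GBPS_PER_SERVER = 5          # assumed throughput contribution per server
    RAW_TB_PER_ENCLOSURE = 600   # assumed raw QLC per enclosure

    @property
    def throughput_gbps(self) -> int:
        return self.servers * self.GBPS_PER_SERVER

    @property
    def capacity_tb(self) -> int:
        return self.enclosures * self.RAW_TB_PER_ENCLOSURE

# Need more performance for the same data? Add servers only.
print(Cluster(servers=4, enclosures=2).throughput_gbps)    # 20
print(Cluster(servers=16, enclosures=2).throughput_gbps)   # 80, same capacity
```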
You mentioned the QLC doesn't have a supercap or DRAM.
Which means that we save a lot of money. Yeah, but what about the data? The data integrity,
you're writing to a QLC device and all of a sudden the system dies.
Okay. So there's two questions here.
The first is the data integrity question and the second is the endurance question.
So in terms of the data integrity question,
the metadata is all transactional
and incoming data to the system
gets written to multiple Optane SSDs.
Later, it gets destaged from the Optane SSDs
to the QLC SSDs.
And in that process,
we reduce it using data reduction technology
that we guarantee is better
than anybody else's data reduction technology.
We erasure code it using very wide stripes with very high levels of protection.
And then we destage that data to the QLC SSDs
in megabyte writes. Once we've written a megabyte to that SSD, we send it the NVMe commit command, and the data is all written to Flash, and we don't start accessing the copy of the data we just put in the QLC till after the SSD acknowledges the commit.
Then we delete the copy that's in the optane. So in terms of if power fails in the middle of this operation,
another vast server is going to look at the metadata, see there's a
partial transaction, roll it back and destage that data again.
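Here is a simplified, runnable sketch of that ack-then-destage flow as just described; it is not VAST's actual code, and the reduction and erasure-coding steps are trivial placeholders.

```python
class OptaneBuffer:
    """Stand-in for data mirrored across Optane SSDs until it is safely on QLC."""
    def __init__(self):
        self.pending = {}                    # txn_id -> data still only in Optane
    def write(self, txn_id, data):
        self.pending[txn_id] = data          # really: mirrored to two or more Optane SSDs
        return "ACK"                         # the client's latency ends here
    def release(self, txn_id):
        del self.pending[txn_id]             # safe only after the QLC flush is acknowledged

class QlcStripe:
    """Stand-in for one wide stripe, written in 1 MB chunks and then flushed."""
    def __init__(self):
        self.committed = False
    def write_and_flush(self, strips):
        self.strips = strips                 # really: 1 MB writes to 150 data + 4 parity SSDs
        self.committed = True                # really: NVMe flush/commit acknowledged by each SSD

def destage(buf, txn_id):
    data = buf.pending[txn_id]
    reduced = data.encode().lower()                   # placeholder for similarity reduction
    strips = [reduced[i::154] for i in range(154)]    # placeholder for 150+4 erasure coding
    stripe = QlcStripe()
    stripe.write_and_flush(strips)
    if stripe.committed:                     # if we crash before this point, the Optane copy
        buf.release(txn_id)                  # survives and another server re-runs the destage

buf = OptaneBuffer()
print(buf.write(txn_id=1, data="hello greybeards"))   # ACK right after the Optane write
destage(buf, txn_id=1)                                # later, done by any VAST server
print(buf.pending)                                    # {} -- buffer copy released
```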
Now, so the wear leveling, stuff like that, is because you're doing megabyte-level writes?
That's part of it.
So if you think about, yeah, so let's stop at our erasure coding for a second.
So we don't let you choose an erasure code method.
We stripe as wide as we can, depending on how many enclosures you have.
So in a typical system, we stripe 150 data strips and four parity strips.
150 and four data and parity. Okay. So what's that? A five failure, right?
It supports, okay, it supports four failing devices.
Right.
We can fail four devices before data loss.
And that's a much higher level of resiliency than a typical system.
Most are plus one or plus two.
But the problem...
150 would imply multiple enclosures, right?
Three or four enclosures.
Yeah, 100, right.
If we have one enclosure, it's 36 plus four.
So there's 44 SSDs and we need room to rebuild.
With Reed-Solomon erasure codes, if we striped 150 plus four and an SSD failed, we'd have to read data from 149 SSDs to reconstruct the failed data.
A little rebuild issue. Okay. But you're not using Reed-Solomon?
Instead of Reed-Solomon, we use a new class of erasure codes called locally decodable codes. The parity strips can still be calculated from the same set of data strips, but by calculating over broader regions,
the parity strips tell us the net result of the data from three quarters of the strips.
So we only have to read a quarter of the strips. So if you think about it as parity, as opposed to the more complicated math and the way XOR works, you go, well, parity strips one and two tell me that if I XOR together the first 50 strips, it comes out to a one.
So I don't have to read.
We need the math gal here.
I have talked to Rachel about it. And so because we know what the total of the first 50 strips is,
I don't need to read the first 50 strips.
And that means...
So you're telling me you can read the four parity stripes, let's call them,
and you already know what three-fourths of the data drives will look like at that point?
Well, I know what the result of the parity calculation
of those three-quarters would be, and therefore
I don't need to read it. So you only have to read the one-fourth of it.
Exactly. Okay, I got you. I got you.
This is interesting. I like these decodable, locally
decodable codes. Yeah, it makes doing really wide stripes reasonable.
We're still reading on the order of 40.
You're still reading with 150 devices.
You're still reading lots.
But that's a lot less than 150.
And it means that at 150 plus four, we're down in the 3% overhead range, not the 25 or 30% overhead range most systems are.
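The arithmetic behind those claims, using the stripe geometry and the roughly-a-quarter rebuild factor stated above:

```python
data_strips, parity_strips = 150, 4

# Protection overhead of the wide stripe:
overhead = parity_strips / (data_strips + parity_strips)
print(f"overhead: {overhead:.1%}")            # ~2.6%, versus 25-33% for narrow RAID groups

# Reads needed to rebuild one failed strip:
reed_solomon_reads = 149                      # per the episode: read essentially the whole stripe
locally_decodable_reads = data_strips // 4    # "about a quarter of the strips"
print(reed_solomon_reads, locally_decodable_reads)   # 149 vs ~37, "on the order of 40"
```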
So we've got very high resiliency, very low overhead. You mentioned the guaranteed better data reduction capability than anybody else in the universe.
Is that what I heard?
You got to explain that, Howard.
Okay.
So first I'll explain the guarantee and then I'll explain how we get there.
The guarantee is really simple.
If you have unencrypted data.
What about media?
What about non-compressible?
What about, you know, you're just saying unencrypted, okay?
Right.
We're guaranteeing better.
We're not guaranteeing a specific amount.
And so if you have media data that on some other system reduces half of 1%, and on our system, it reduces two and a half percent. That's better.
So we're not, it may not be impressive, but it is two times better.
And so the guarantee simply says,
if you have a set of data that's not encrypted and you put it on some other guy's system
and it takes X gigabytes of space
and you put it on our system
and it takes Y gigabytes of space
and Y is greater than X,
then we will expand your system
so that you have enough space.
How we do that, it works like this. So like a deduplication
system, we break data up into chunks. And we break data up into chunks using a variable block size method,
the one covered in the Rocksoft patent.
So the boundaries of the chunks are determined by the data,
not by some size.
And we make chunks between 8 and 16K.
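A minimal sketch of content-defined (variable-size) chunking: boundaries are chosen where a hash of the bytes hits a pattern, so an edit early in a file shifts byte offsets but leaves most later chunk boundaries alone. The hash here is a toy byte-wise accumulator rather than the windowed rolling hash of the Rocksoft patent, and the mask and limits are illustrative.

```python
MIN_CHUNK, MAX_CHUNK = 8 * 1024, 16 * 1024
MASK = (1 << 13) - 1          # a boundary on average every ~8 KiB past MIN_CHUNK

def chunk(data: bytes):
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF        # toy hash; real systems use a rolling hash
        length = i - start + 1
        at_boundary = (h & MASK) == 0 and length >= MIN_CHUNK
        if at_boundary or length >= MAX_CHUNK:    # boundaries set by data, capped at 16K
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

blob = bytes(range(256)) * 200                    # ~50 KiB of sample data
print([len(c) for c in chunk(blob)])              # chunk lengths bounded by MIN/MAX
```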
And then, like a deduplication system, we hash the chunks.
But the difference is we don't use SHA-1.
SHA-1 is designed to be collision resistant,
and therefore SHA-1 is designed to have
a very large change in the hash
for a small change in the data.
We use a weaker hash function, it's not 160 bits. It's like 56 bits.
You can tell closeness of the data?
Similarity.
Similarity? Who deals with similarity? This is data. You can't do similarity on data, can you?
Well, sure it does. So we hash the chunk and we get a similarity hash.
And what the similarity hash means to us is a suggestion as to what compression dictionary will work best for this data.
Oh, you mean like text, office, media,
something like that? Is there a different Huffman code?
Well, no, more the LZ part than the Huffman part, right? But if you do standard compression,
you go, oh, look, there's seven zeros in a row. Let me replace that seven zeros with a symbol.
And then you have to build a dictionary that says, I saw this data and I replaced it with that symbol. If two chunks are similar,
then they have a lot of the same small repeating patterns in them that will generate the same
symbols. And therefore they'll compress with the same dictionary, but we don't have to store the dictionary twice. So we take the first chunk that generates a given similarity hash, and we store it as a reference block.
In metadata, I might add.
Yeah.
When a second chunk generates the same hash, we compress the two chunks together using Zstandard.
And that lets us get a delta chunk. That's the differences between the reference chunk
and our new chunk. And we just store that delta. And there's some code in there that goes,
oh, the delta is zero bytes, it is a duplicate, let me update the counters.
All right, so I've got data, let's call it text data. You've created a standard text compression dictionary when you first encountered the first text data block or chunk?
It's much more specific than text, but yes.
Okay, let's say Ray's text data, okay, with all my nuances of spelling and all that other stuff.
Well, you saved a file, yes. And then you did a save-as because you want to create a new version, and you edit and spell-check the new version and you save it again. So now we have chunks of data that are very similar, because the only difference is, in the reference chunk, plenipotentiary was misspelled, and in the new chunk, plenipotentiary is spelled properly.
So, you know, it's probably a 16K chunk with 10 bytes different.
And so our delta block that gets generated from that is going to be 150 bytes or something like that because there's many.
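An illustration of that reference-chunk-plus-delta idea using the python `zstandard` package's dictionary support: the first chunk seen for a given similarity hash is kept as a reference, and later similar chunks are compressed against it so that only a small delta is stored. The one-value MinHash below is a crude stand-in for the real 56-bit similarity hash, and none of this is VAST's actual implementation.

```python
import os
import zstandard as zstd

def similarity_hash(chunk: bytes) -> int:
    # One-value MinHash over 16-byte shingles: a ~10-byte edit changes only a
    # few shingles, so the minimum usually survives and similar chunks collide.
    shingles = (chunk[i:i + 16] for i in range(0, len(chunk) - 15, 8))
    return min(hash(s) & 0xFF_FFFF_FFFF_FFFF for s in shingles)   # 56-bit-ish value

references = {}   # similarity hash -> reference chunk, stored once

def store_chunk(chunk: bytes) -> bytes:
    h = similarity_hash(chunk)
    if h not in references:
        references[h] = chunk                      # first arrival becomes the reference
        return zstd.ZstdCompressor().compress(chunk)
    ref_dict = zstd.ZstdCompressionDict(references[h])
    return zstd.ZstdCompressor(dict_data=ref_dict).compress(chunk)   # the "delta"

original = os.urandom(16 * 1024)                   # a 16 KiB chunk of incompressible data
edited = bytearray(original)
edited[8000:8010] = b"spellfixed"                  # the ~10-byte "save as" edit
edited = bytes(edited)

print(len(store_chunk(original)))                  # ~16 KiB: the reference, incompressible
print(len(store_chunk(edited)))                    # hundreds of bytes, not another 16 KiB
```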
So your delta block has to point to the original block, right?
The original block probably needs to point to the delta block, because you don't ever actually want
to get rid of the original block, because, you know, you've got dependencies on a delta block, right?
So we have to keep a counter on the reference block that says how many delta
blocks there are that point to it.
Just like in deduplication.
Yes, only it's not a duplicate.
And when we rehydrate data, we need to read the delta block and the reference block and rehydrate the data from the combination.
So for a given chunk, if the original chunk, the reference block, and the new chunk are exactly the same, it acts like deduplication.
And we just store a pointer.
If they're similar, we store less than the whole second chunk. And it's all compressed with Zstandard,
which is one of the leading new compression schemes that does both LZ style compression
and Huffman style compression. So we compress as well as anybody compresses plus or minus 2% because we use one of the later better compression schemes.
We deduplicate or effectively deduplicate
just like everybody else
because if two chunks end up being the same,
we just store a pointer.
And then for the chunks that the other guys would store
simply compressed, and maybe store 8K for a 16K chunk,
we store hundreds of bytes instead of 8K.
Okay, I understand.
I like what you're doing.
I think it's very neat.
But the metadata is going to kill you.
You have to know how many bytes, the 100 bytes in this delta block, and another 417 in this one, 304 in that, and 5 in the last one.
And that's the next key to how we build the metadata,
or what metadata is in RAM
and what metadata is out on something that's much slower than RAM.
And you're doing this compression and all that stuff in DRAM on the servers
or on the Optane?
A combination.
The calculations happen in DRAM,
and the stripe gets assembled in Optane and then destaged.
And the Optane is the key to this reduction as well, because the whole reference chunk compression thing takes a little bit more time than conventional reduction. But since we ack once the data is in the Optane and we have terabytes of Optane,
not gigabytes of DRAM, the delay is post-ack, so user applications never see it. And as long
as we can destage data to the flash as fast as data is being written in, it doesn't matter how
many milliseconds or microseconds it takes
for any given block to get all the way down to the QLC.
And you mentioned the Optane writes are replicated across multiple Optanes. So if I'm writing a
block, let's say it's going to be fast written into Optane first, but you've got multiple versions of those?
Right. So in each enclosure, there's a dozen Optane SSDs. A VAST server gets an NFS request to write. It writes that data to multiple Optanes, across multiple enclosures if there are multiple enclosures. Then it acks.
And then later, a process on another VAST server starts the destage process.
And decompression and decoding and all the other stuff.
Right. And since there's a pool of VAST servers,
it all happens in parallel, and the more servers you have, the faster it happens.
Yeah, but I mean, you actually have to write these things to real Optane at some point.
And so you mentioned the NVMe over Fabric to the enclosures.
Is that RDMA?
It's RDMA.
It's 100 gigabit per second, either RoCE v2 or InfiniBand, depending on customer requirements.
And so the other thing about the – there's a couple of things about the metadata that are special.
The first is the metadata is all byte-oriented. It's not block oriented. So a file extent doesn't say this logical block and then you look someplace else and go, oh, that logical block is stored on this SSD.
That pointer says this SSD, this LBA, this many bytes in, this long.
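A tiny sketch of that byte-oriented extent pointer: it names the SSD, the LBA, a byte offset into it, and a byte length, rather than pointing at whole logical blocks. The field names and values are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Extent:
    ssd_id: int        # which SSD (in which enclosure)
    lba: int           # logical block address on that SSD
    byte_offset: int   # "this many bytes in"
    length: int        # "this long", in bytes -- no block rounding

# A file is then just an ordered list of such extents in metadata:
file_extents = [
    Extent(ssd_id=17, lba=0x3_0000, byte_offset=412, length=9_731),
    Extent(ssd_id=42, lba=0x1_8000, byte_offset=0, length=15_288),
]
print(sum(e.length for e in file_extents), "bytes in the file")
```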
And you're writing megabytes of data at a whack here, right? Well, so we're writing megabytes of data at a whack
because the SSDs we're using are very sensitive to write size.
So remember, there's no DRAM.
Yeah, and no supercap.
And the page size on QLC Flash is 64 to 128K.
So if you are a log-based file system that was designed for spinning disks,
and you write 4K blocks to these SSDs,
each 4K write takes 64K of flash, and then it has to garbage collect.
So these drives are rated at six petabytes written endurance for 4K writes
and 25 petabytes written for 128K writes, but we write one meg. The other big thing we do specifically for
flash endurance is we segregate data based on its life expectancy,
based on a heat map, a life map. If you look at a standard storage system, data, with small exceptions, gets written to the media in the order it is received by the system.
And so if I've got an Oracle database, that Oracle database is writing data to the database files, which is long-lived.
It's going to stay on the system a long time.
And data to the transaction logs, which is short-lived.
When we run the backup tonight, those logs are going to get expunged.
If we write data to the SSDs in the order it's received, that data is all interleaved together.
Then we send the SSD. Tonight,
we run the backup. We delete the log files. We send the SSDs a message that says,
okay, all the log file data is now invalidated. You can use it in your internal log system.
And now the SSD has to do garbage collection. And that increases the write amplification in the SSDs and the number
of times a given piece of data has been moved around from one place in the flash to another
place in the flash. We go, the log files are all in the log folder. We know the log folder is
short-lived. So we're going to write a thousand one meg stripes of temporary data, the temp folder, the log folder.
You know, we have a little bit of heuristics to figure out which kinds of files in your environment
are typically short-lived. And we put all the short-lived data together.
And then we put all the long-lived data together. And of course, we can do this
because we've got plenty of Optane to accumulate these huge blocks of data. So when we do garbage
collection in our log-based system, we tell the SSD, this gigabyte of data is now invalid. And then we send the trim. And in NVMe, it's called deallocate.
But we send the command to an SSD that says, okay, you can release all that data. And the SSD now has
erase blocks. And an erase block in this kind of flash is tens of megabytes in size.
But those erase blocks are completely empty, so it can just erase
them. It doesn't have to do any internal garbage collection. The thing people forget about garbage
collection is that a given system collects garbage till it reaches the point of, I have enough free space.
How much data you have to move to do that depends on the mix of valid and invalid data that you start with.
Transient or non-transient, they're intermixed in those pages.
By separating them, we have to garbage collect much less often because every stripe we garbage collect generates a lot of free space and therefore less wear.
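A rough illustration, with assumed numbers, of why that segregation cuts garbage-collection work: when the short-lived data expires, its stripes go entirely invalid and can simply be trimmed, while interleaved placement leaves every stripe partly valid and forces copies.

```python
STRIPES = 1000
STRIPE_MB = 1
short_lived_fraction = 0.5          # assume half the data is logs/temp that expires tonight

# Interleaved placement: every stripe is ~50% valid after the logs expire, so
# reclaiming the space means copying the valid half of each stripe somewhere else.
interleaved_copy_mb = STRIPES * STRIPE_MB * (1 - short_lived_fraction)

# Segregated placement: the log-only stripes are now 100% invalid; just trim them.
segregated_copy_mb = 0

print(f"data relocated by GC, interleaved: {interleaved_copy_mb:.0f} MB")
print(f"data relocated by GC, segregated : {segregated_copy_mb} MB")
```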
The whole system is designed to minimize wear on the flash, accommodating the geometry of the SSD
so that we can limit not only how much write amplification we create ourselves
by doing things like garbage collection, but also how much write amplification we
create inside the SSD, through managing the data. Okay, so back to the metadata.
The metadata is retained and maintained on Optane only. It never goes down to the QLC.
So the metadata structures are trees that are very wide,
512 nodes per layer.
It should be 512 leaves per node.
And very shallow.
It's a max seven steps through the metadata to find the end leaf that you're looking for
and the end leaf would be, potentially, a block, you know, a file extent, a file entry in a folder,
something like that. If you compare that to the prototypical Unix system with inodes,
it could be 40 or 50, walking your way down a folder, looking in inodes, looking in the next
folder. It's a lot of lookups. So we minimize that number of lookups. When we run out of space in the Optane,
the bottom leaves get forced down to QLC.
So it goes from seven fast lookups in the Optane to six fast lookups and one slightly slower lookup.
And that could go to two if necessary
or something like that,
based on the amount of capacity that's being...
We can handle,
you know, very, very large numbers of objects without pushing the next layer into the
flash. I think the challenge might be the S3 stuff, because the number of objects could be
extraordinary. Yes, and we're designed to handle billions of objects. And remember, it's all scale out.
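Quick arithmetic on the 512-wide, at-most-seven-level tree described above, showing why billions of objects fit comfortably within that depth:

```python
import math

fanout, max_depth = 512, 7

print(f"addressable leaves at depth {max_depth}: {fanout ** max_depth:.2e}")
# ~9.2e+18 leaves, far beyond billions of files/objects.

objects = 10_000_000_000            # ten billion files/objects
levels_needed = math.ceil(math.log(objects, fanout))
print(f"levels needed for {objects:,} objects: {levels_needed}")   # 4
```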
So every time you add an enclosure, you add more Optane.
And the metadata in each, you know, there's a root of a tree in each enclosure and a function for the servers to calculate where to find those when they
power up. So the expansion is relatively simple. There's not a lot of, we added two
enclosures and now we have to copy half the metadata over into it. You'd think you'd have to
copy a lot of metadata over as you're expanding the stripes and all that stuff. Well, as data
gets moved over, the metadata for that data has to get moved over, but there isn't a lot of proactive moving of things around.
We generally frown on proactively moving things around because it creates write amplification.
You mentioned the servers are stateless, but the enclosures obviously are not.
Right. All the state is in the enclosure.
And each enclosure has two fabric modules and 200 gig Ethernet ports
per fabric module.
And then a PCIe switching fabric
that connects the fabric modules
to the SSDs.
So there's really no computation
in there other than networking.
The enclosure we're using now,
you know, we don't build custom hardware,
the enclosure we're using now is an off-the-shelf product. It uses x86 servers as the
fabric modules. All they do is create NVMe over Fabrics connections between the Ethernet and the SSDs. We don't do any storage management in the enclosure.
The enclosure's dumb.
So let's talk basic file system services,
things like quota management, QoS, replication, global namespaces.
So it's one global namespace across the entire cluster.
We don't have multiple volumes or that whole concept. It's one namespace. It can be subdivided for multi-tenancy, and we can even assign tenants pools of vast servers to do QoS by tenant and say, here, you're going to have this much performance because you have this many vast servers.
Snapshots are coming any day now.
Replication coming later this year.
QoS right now is done via vast server pooling.
Remember, we're a startup.
We've been shipping for eight months.
Cloning versus snapshotting. So I guess that's, and your snapshots will, when they come, be read-only?
So the initial version of snapshots that ships any day now will be per file system or per system snapshots. The next version, which should ship this year,
will be for per folder snapshots.
Well, when you have a storage system that's an exabyte,
you really want to take snapshots of subsets. And so we will, in the fullness of time,
be able to do things like support vSphere hosts and vVols
and automatically generate per-folder snapshots.
This is all distant roadmap with no dates attached.
You mentioned vSphere.
So do you support an NFS data store for vSphere today, even though it's not vVols?
We do not currently support that use case.
That is something that will be changing relatively soon.
When you're a company at our stage, support means that there's a customer who needs support.
And so, you know, there's a little bit of testing we have to do.
And then there's just, you know, well, are we going to support it?
Well, who are we going to support it on?
And we're still at that stage.
And you mentioned that you support billions of objects.
Does that include billions of files?
Yes. So the metadata constructs what we call the VAST element store. It's not a file system. It's not an object store. It's an abstraction where a folder, excuse me, an S3 bucket is an NFS folder.
So you can create files via NFS and then we store an abstract version of ACLs,
not NTFS ACLs, not NFS4 ACLs,
but one that we can make serve both sets of purposes.
God, this has been great.
There's plenty of other questions I have.
Greg, do you have anything you want to ask?
Yeah.
So is there any client-side software required,
or is there a special client-side driver?
Nope.
It's not a parallel file system.
It's a file and object system.
So clients access the system via NFS.
If they want higher performance,
they can access the system via NFS over RDMA,
which eliminates all the TCP issues related to a single client accessing a single share and lets you get eight gigabytes per second of access from a single NFS over RDMA client to a single file.
Or they can use S3. But we do not build clients for user systems.
We do not build tiers. We have a no more tiers formula, just like the folks at Johnson's.
And where the no more tiers formula starts to become important, not so much internally, is because we're saying you can afford to put all the data you would put on your tier one and your other tiers on one system. So you get one data reduction realm, not seven deduplication realms, because today I've got data
on seven systems that deduplicate
independently.
And then we get to amortize the wear from your
hot applications against the data that's being used by your cold applications, and the average goes down.
We call it the virtuous cycle.
Sometimes it's the counterintuitive cycle.
The more data you put on an all-flash system, the more data you can afford to put on the all-flash system.
All right. Listen, Howard, do you have anything you'd like to say to our listening audience?
Well, so if you're dealing with large quantities of data and data-intensive applications like
machine learning or oil and gas exploration or real-time analytics, where your hot data and your cold data don't
segregate as well as they used to, we're the solution. One system, one tier, one performance
profile for all of those applications. Well, this has been great. Thank you very much, Howard, for being on our show again today. Wow, I've
been in the habit for so long, it's always a pleasure to be a Greybeard. Next time we'll talk
to another system storage technology person. Any question you want us to ask, please let us know, and
if you enjoy our podcast, tell your friends about it, and please review us on iTunes and Google Play,
as this will help get the word out. That's it for now. Bye, Greg. Bye, Ray. Bye, Howard. Bye, Ray. And that's it. Thanks, gents.