Greybeards on Storage - 0101: Greybeards talk with Howard Marks, Technologist Extraordinary & Plenipotentiary at VAST
Episode Date: April 30, 2020
As most of you know, Howard Marks (@deepstoragenet), Technologist Extraordinary & Plenipotentiary at VAST Data, used to be a Greybeards co-host and is still on our roster as a co-host emeritus. When I started to schedule this podcast, it was going to be our 100th podcast and we wanted to invite Howard and the rest …
Transcript
Hey everybody, Ray Lucchesi here with Keith Townsend and Matt Leib, and Howard Marks here,
but this time I'm a guest.
Welcome to the next episode of the Greybeards on Storage podcast, a show where we get Greybeards
storage bloggers to talk with system vendors and other experts to discuss upcoming products, technologies, and trends affecting the data center or our world today.
This Greybeards on Storage episode was recorded on April 23rd, 2020.
We have with us here today Howard Marks, a longtime friend, former co-host, and Technologist Extraordinary and Plenipotentiary at VAST.
So, Howard, why don't you tell us a little
bit about yourself and what's new at VAST? Well, I've been around here long enough. I don't think
I need to tell people much. I've been in the storage business kind of forever, but things
are exciting at VAST. As you guys know, we make a large-scale all-flash storage system for unstructured data. And we've been selling now for about 14 months.
That's been going really well.
And our investors rewarded us just the other day with...
Yeah, I saw something about that.
You guys got some more money, I think.
Yeah, just a little.
A hundred million dollar round at a $1.2 billion valuation. You know, I was going to go to NAB in the
unicorn aloha shirt, but then they canceled NAB. So the funding part is nice,
especially since we haven't spent the B round yet. So, you know, that extra $100 million
is going to come in really
handy in these kinds of times. Right, right. Well, it should give you plenty of runway.
So I don't think we've talked with Vast before. So you mentioned unstructured data. So what exactly
you guys do again? So I hate to remind you of this, Ray, but I have been on, I guess, before from VAST. But that aside,
so we make a large-scale all-flash scale-out system for unstructured data. So all-flash is relatively obvious. But we make systems that start at a petabyte and go up, and up extends
to exabytes. It's pretty unusual to see a petabyte of flash
for storage these days.
Is that all flash?
The problem is that while other vendors
might have been willing to make you a petabyte
all flash storage system,
nobody was able to afford it before.
So VAST, you know, our founder,
Renen Hallak, was the chief technical guy at XtremIO.
And as he was planning to leave, he started talking to XtremIO customers saying, okay, so what don't you like?
And nobody said, you know, this all-flash array isn't fast enough.
Yeah, I would say so. Everybody said, this all-flash array,
it costs too much for me to put all my data on,
and I have to use it just for my most critical
or my most performance-requiring applications.
VAST was designed to bring the cost of all-flash down to the point
where you could use it for things like archives without people
going, are you crazy? You can't afford that. Yeah. It's very unusual to be archiving to
all flash arrays, Howard. It is. We're the first people who make it possible.
And the secret sauce starts with QLC flash. Are you guys actually shipping QLC Flash? We are.
We've been shipping QLC Flash for about a year.
We buy the cheapest QLC Flash SSDs.
Intel makes them for a couple of the hyperscalers. And so this is their 15.36 terabyte usable SSDs,
no DRAM buffer, no power failure capacitors because they don't have the buffer that they have to protect.
And that means that the QLC is basically exposed to us.
And other people can't use those SSDs,
first of all, because they're single-ported
and most enterprise storage systems
are designed for dual-port drives.
So, Howard, you're describing
like my nightmare desktop setup
of QLC flash with no DRAM, single-ported.
That just doesn't sound like any kind of thing that I want to build a system around.
Well, it's not something you want to build a system around.
It's something you want really smart guys to build a system around
so that they can address all of the deficiencies of that really cheap QLC SSD that you're afraid of.
But they address it in software.
Well, there is a lot of software, but there's also a substantial amount of 3D crosspoint.
Optane and QLC?
Optane and QLC. So you're using Intel?
We are using Intel Optane SSDs
we're using Intel QLC SSDs
because they like us
not because we're locked into that
and the Optane
acts as a write buffer
all the data is only stored on QLC other than during writes?
yes
you guys sound like a walking commercial for Optane.
If you look at how other storage vendors use Optane, it looks very much like what other storage vendors were doing in 2009, 2010 with Flash SSDs.
Right.
It's, you know, oh, look here, we have this faster thing. We can use it as a cache.
We have this faster thing.
We can use it as a tier and we already have tiering software.
Because we wrote that 10 years ago when flash SSDs came out.
We are the first vendors to really look at 3D crosspoint and say,
how is this fundamentally different? So give me about
three minutes to run through some architectural points. So we disaggregate the compute part of
a storage system from the media. The compute part runs in software on x86 servers and connects to
the media over NVMe over Fabrics. The back end is NVMe over Fabrics? The back end is NVMe over Fabrics, over Ethernet, usually 100 gig Ethernet. For HPC customers
we run InfiniBand, because they like to integrate into their InfiniBand network. So, sorry to interrupt.
The media is stored in an HA enclosure. There are two fabric modules that each have 200 gig
Ethernet ports that route NVMe over Fabrics requests to the SSDs in the enclosure. There's
12 Optane SSDs and 44 QLC SSDs in each enclosure. It's Optane SSDs, not Optane memory?
It's Optane SSDs because we talk to them over NVMe over Fabrics. Optane DIMMs are local to
the CPU that the DIMM slot is in, right? We have disaggregated everything and we share everything, and you can't share a DIMM. You can with software, and maybe with
something like Gen-Z three years from now, but you really can't share a DIMM. So you've got this controller,
two controllers, 10 controllers? Scale-out controllers. Scale-out front-end
servers. You can call them controllers. They do a little bit more, but controller is a good enough term.
Every front-end server mounts over NVMe over Fabrics
every SSD in the cluster at boot time.
Wait, hold on.
Let me, because the scaling problem is coming directly into my head
when I hear you say that.
How many front-end servers?
An arbitrary number we've tested.
An arbitrary number.
Let's just say two to N, apparently.
You don't do a one, right?
No, no, no.
So we won't sell you one because performance is primarily dictated by the CPU.
You know, there's so much performance in the back end that how much CPU
provides is the primary determinant
of performance.
But the system will run on one.
So in failure modes,
you get degraded performance
all the way down to, I started with
20 and now there's only one.
I'm
kind of taken aback
that each front end mounts every SSD, because I was going to ask you about this, you know, data locality challenge from the front end to the back end.
But you're telling me that every SSD is mounted by every front end server.
Every SSD is mounted by every front-end server, which means that every bit of data
is equally local to every front-end server.
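To make the shared-everything idea concrete, here is a minimal sketch in Python of every front-end server mounting every SSD at boot. The names (cluster_map, nvme_of_connect, cnode) are invented for illustration; this is not VAST's software.

cluster_map = {
    "encl-1": [f"nqn.encl-1.ssd-{i}" for i in range(56)],   # 44 QLC + 12 Optane per enclosure
    "encl-2": [f"nqn.encl-2.ssd-{i}" for i in range(56)],
}

def nvme_of_connect(nqn: str) -> str:
    # Stand-in for an NVMe-over-Fabrics connect over 100GbE or InfiniBand.
    return f"connected:{nqn}"

class FrontEndServer:
    """A stateless 'controller' that mounts every SSD in every enclosure at boot."""
    def __init__(self, name: str):
        self.name = name
        self.mounts = []

    def boot(self) -> None:
        # Every front-end server connects to every SSD, so no data is more
        # "local" to one server than to another.
        for ssds in cluster_map.values():
            self.mounts.extend(nvme_of_connect(nqn) for nqn in ssds)

servers = [FrontEndServer(f"cnode-{i}") for i in range(4)]
for s in servers:
    s.boot()
print(all(len(s.mounts) == 112 for s in servers))   # True: everyone sees everything
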
So these guys are smarter than me
because this is, that's crazy, Howard.
Metadata locking across 25,000 front-end servers
to 44 SSDs.
We have not tested 25,000, but we have for 100.
100 servers?
So 100 front-end servers connect to, you know,
so 100 front-end servers in a typical arrangement
would be 400 or 500 SSDs on the back end.
We're going to find the gotcha in this
because there's a gotcha somewhere.
We're going to find it.
So let me fill you in on a couple of other pieces
and you can see if a gotcha shows up.
All right.
Okay.
So the front-end servers, the controllers, are stateless.
There's no metadata cache.
They simply process requests and do all the compute. So they
manage the data protection and the reduction and all of that stuff. But all the metadata is stored
in the 3D crosspoint. So if you write a file via NFS or next month via SMB, the front-end server creates that file
in the metadata structures in the 3D crosspoint
so that all of the metadata is equally available
to all of the front-end servers.
All right, so that's starting to make a little sense to me.
So I have low latency access to the 3D crosspoint SSDs.
Everyone has equal access over this 100 gigabit
Ethernet or
Or InfiniBand.
Or InfiniBand if I need
super low latency. It's usually more
InfiniBand because I already
use InfiniBand. 100 servers
trying to create
files at the same time
to a single metadata
data set? Is that kind of the world?
A distributed metadata data set, but not a dedicated metadata service or server like
you'd have in like Lustre. So you would distribute the metadata across all the optanes,
let's say 100 optanes in this 400 SSD configuration? Yeah. So is it writing to those?
I know we keep sidetracking you, Howard.
But is it writing those Optanes simultaneously,
or is it replicating across a given server set?
So when a front-end server makes a metadata modification
or writes new data to the write buffer, it writes immediately to two optanes in two enclosures before it considers it written.
It isn't primary, secondary.
So it's three optane versions of this metadata or two?
Two.
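As a rough sketch of that write path, here is the rule being described: persist to two Optane devices in two different enclosures, then acknowledge. The class names and the in-memory dict standing in for durable Optane are assumptions for illustration, not VAST's code.

import random

class OptaneBuffer:
    def __init__(self, enclosure: str, name: str):
        self.enclosure = enclosure
        self.name = name
        self.entries = {}            # key -> bytes, modeled here as durable

    def persist(self, key: str, data: bytes) -> None:
        self.entries[key] = data

def buffered_write(key: str, data: bytes, optanes) -> str:
    """Persist to two Optanes in two different enclosures, then acknowledge."""
    first = random.choice(optanes)
    second = random.choice([o for o in optanes if o.enclosure != first.enclosure])
    first.persist(key, data)
    second.persist(key, data)
    return "ack"                     # only now does the client see the write complete

optanes = [OptaneBuffer(f"encl-{e}", f"optane-{e}-{i}")
           for e in (1, 2) for i in range(12)]
print(buffered_write("file:/a/b.dat@0", b"new data", optanes))
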
Okay. So Howard, let's talk about metadata
management because at the end of the day
some CPU has to manage
the metadata. It has
to optimize it, etc.
It's a database still. Yeah.
So it's B-tree
based, and the trees are designed to be shallow. So there's a B-tree,
we call it a V-tree, but that aside, there's a V-tree for every object, you know, file or S3 object.
And that tree is a maximum of seven layers deep.
So any lookup to find, you know, where's this piece of data, is a maximum of seven queries to the metadata store. But since every element has its own tree, having hundreds of front ends, unless they're all accessing the same file, they're not fighting over the same metadata objects. They're each dealing with their own set of metadata.
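A toy illustration of that bounded-lookup idea, with an invented node layout (not VAST's on-media format): each object has its own shallow tree, so resolving an offset is at most seven reads from the shared metadata store.

MAX_DEPTH = 7

def v_tree_lookup(metadata_store: dict, root_key: str, file_offset: int):
    """Walk at most MAX_DEPTH nodes to map a file offset to a data extent."""
    node = metadata_store[root_key]
    for _ in range(MAX_DEPTH):
        if node["leaf"]:
            return node["extents"][file_offset // node["extent_size"]]
        # Pick the child whose key range covers the requested offset.
        child_key = next(k for (lo, hi), k in node["children"] if lo <= file_offset < hi)
        node = metadata_store[child_key]
    raise RuntimeError("tree deeper than the designed maximum")

# Two-level example: a root that points at one leaf holding extent addresses.
store = {
    "root:/data/file.bin": {"leaf": False,
                            "children": [((0, 1 << 30), "leaf:0")]},
    "leaf:0": {"leaf": True, "extent_size": 1 << 20,
               "extents": {0: ("encl-1", "ssd-17", 0x4000)}},
}
print(v_tree_lookup(store, "root:/data/file.bin", 4096))   # ('encl-1', 'ssd-17', 0x4000)
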
Yeah, but I still have to perform, like, database... It's still a database. I still have to perform database maintenance on this.
B-trees, maybe not.
Not really, more than maintaining the tree for any given file at any given time. You know, so with 100 servers, 400 or 500 SSDs, and 100 Optanes, how does the traffic cop work in this thing?
I'm going to access a file A. Which one of those 100 servers do I go to?
Those front-end servers have a pool of virtual IP addresses.
A pool, okay. And round-robin DNS assigns you to one of those front-end servers.
So there's a round-robin DNS in front of this thing?
Yeah.
And there's a Mellanox switch or something like that behind it?
So we use Mellanox switches primarily today.
Mellanox is an investor.
We're also qualifying other switches because it's 100 gig
ethernet. So in theory, each client is only writing to one of those metadata handlers.
I mean, you could, you know, I'm a Windows guy, so I'll think SMB, not NFS, but I could have
drives mapped to different names that point to different
front-end servers, but any one connection is going to go to one front-end server.
And continue to use that server until it's a failure or something like that.
Right.
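A minimal sketch of the client-assignment scheme as described, with made-up addresses: a pool of virtual IPs owned by the front-end servers, handed out by round-robin DNS, with each client sticking to the address it resolved until that server fails.

from itertools import cycle

virtual_ips = [f"10.0.0.{i}" for i in range(1, 9)]          # the VIP pool
front_ends = [f"cnode-{i}" for i in range(4)]

# Each front-end server currently owns a couple of VIPs.
vip_owner = {vip: front_ends[i % len(front_ends)] for i, vip in enumerate(virtual_ips)}

dns_round_robin = cycle(virtual_ips)                        # what the DNS server hands out

def resolve_for_new_client() -> str:
    return next(dns_round_robin)

# A client resolves once, then keeps talking to that one front end until it fails.
vip = resolve_for_new_client()
print(vip, "->", vip_owner[vip])
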
This is really cool, but I'm going to ask the obvious question. Why do I even want this? So you want this because the current class of scale-out systems
doesn't fit your needs.
So we haven't today talked about how we have a new class of erasure codes
that gives you N plus 4 protection with 3% overhead
or about how we have a new class of data reduction that we guarantee
reduces data better than anybody else's.
N plus four with only 3% overhead?
146 plus four.
Now, Ray laughs at 146 plus four because you're thinking of Reed-Solomon erasure codes.
And if we did 146 plus four with Reed-Solomon,
when a drive failed, we'd have to read from 145 drives
plus the first parity, and that's, you know,
everything, for just one drive failure.
If two drives fail, then it gets even worse.
We use a new class of erasure codes called locally decodable codes.
And what locally decodable means is that in order to rebuild, we only need to read one fourth of the surviving data strips and all four protection strips.
So we have to read 37 data strips to rebuild, not 145 data
strips to rebuild. Still, I mean, 37 is a good large amount of data that's going to be consumed
during a rebuild, right? I mean, these are 15 terabyte drives, 15.4-ish kind of? Yeah, 15.4-ish. In order to have 146 plus four, you have four enclosures, which means you probably have 24 or more front-end servers. And the rebuild process is distributed across all the front-end servers.
Each one rebuilds one RAID stripe at a time.
So if there were a million RAID stripes
that hit the failed SSD,
that's a million rebuild jobs
distributed across all the front-end servers.
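Back-of-the-envelope math for the rebuild reads described above; this only reproduces the counting from the conversation, not the actual code construction, and the function names are invented.

import math

DATA_STRIPS, PARITY_STRIPS = 146, 4

def reed_solomon_data_reads() -> int:
    # Classic Reed-Solomon: read the 145 surviving data strips (plus a parity strip).
    return DATA_STRIPS - 1

def locally_decodable_data_reads() -> int:
    # Locally decodable: about a quarter of the surviving data strips,
    # plus all four protection strips (not counted here).
    return math.ceil((DATA_STRIPS - 1) / 4)

print(reed_solomon_data_reads())                                  # 145
print(locally_decodable_data_reads())                             # 37
print(f"{PARITY_STRIPS / (DATA_STRIPS + PARITY_STRIPS):.1%} overhead")   # 2.7%

# The rebuild is many-to-many: every stripe that touched the failed SSD becomes
# one small job, spread across all the front-end servers in the cluster.
front_ends = [f"cnode-{i}" for i in range(24)]
stripes_to_rebuild = 1_000_000
jobs_per_server = stripes_to_rebuild // len(front_ends)           # ~41,666 stripes each
print(jobs_per_server)
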
I was thinking you were going to actually assign
like one of the 37 drives that are remaining to one of the front-end servers, but you're doing it
horizontally, not vertically. We're doing it horizontally. Everything parallelizes out
and it's a fail-in-place architecture. We don't have spare drives. It's not a many-to-one rebuild. It's a many-to-many
rebuild. We just use spare space across the cluster. Let's go one layer deeper, if there is another
layer deeper. Oh, you know me. There's always another layer deeper. There's another layer
deeper. We're dealing with some pretty stupid drives here.
These QLC drives are pretty dumb. Yes. How do you guys mitigate the stupidity of the drives
when it comes to data integrity, et cetera? Because I'm writing to QLC and I don't trust QLC.
First of all, let me talk about the endurance part for a second. And you can tell me how
geeky you want me to get. I'll start with
not very geeky. The spec sheet for these drives? Was it two writes per year or something like that?
It's 0.2 drive writes per day if you write 4K random IOs, but it's four drive writes per day,
20 times as much, if you write 128K sequential IOs. So you write sequential,
obviously. Okay. So we use the 3D crosspoint as a write buffer, and we accumulate very large
amounts of data before we start migrating it to the QLC. We created this system to use stupid SSDs.
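Quick arithmetic on those endurance numbers for a 15.36 TB QLC drive; the drive-writes-per-day figures are the ones quoted above, the rest is just multiplication.

CAPACITY_TB = 15.36

def tb_written_per_day(dwpd: float, capacity_tb: float = CAPACITY_TB) -> float:
    return dwpd * capacity_tb

print(tb_written_per_day(0.2))   # ~3.1 TB/day allowed if you hit it with 4K random writes
print(tb_written_per_day(4.0))   # ~61.4 TB/day allowed if every write is large and sequential
# Buffering in Optane and migrating in big sequential chunks keeps the QLC in
# the second regime, which is the whole point of the write path described here.
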
I'll tell you guys a little secret.
The plan called for using Open-Channel SSDs,
where the host is responsible
for flash translation.
And it turns out those are more expensive.
So we use these other stupid SSDs.
Don't have any translation on board?
Well, they do. But you don't use it. But we are aware of the internal structures.
And so we always write big writes that are a multiple of the page size of the flash
so that we never create page tears and force the SSD to garbage collect.
So what's the chances of overrunning the Optane write buffer?
So there's three terabytes of Optane write buffer per enclosure.
And the odds of overrunning it are nil because the process that sucks the data out of that
write buffer runs faster than the process that pushes the data into that write buffer.
So we can drain the buffer faster than we can fill it.
So how many SSDs are we talking per enclosure? Is it 40?
44. 44 and 12.
44 and 12.
And you're saying that you can actually empty the Optane faster than you can fill it?
Yes.
Now, emptying the Optane means reading it into something and writing out to a QLC SSD, right?
So we now do encryption at rest.
Encryption at rest.
Okay.
And so that means that the migration process means reading data.
And so, oh, another trick.
So the whole idea in maximizing the endurance of the QLC layer is to minimize how much garbage collection happens, both within the system and within the SSD.
So when we write new data to the system, we evaluate its life expectancy.
If it's a temp file or getting written to a temp folder or other things,
then we say this is going to be ephemeral data.
It's not going to last very long.
It's hard to do that sort of determination on the fly.
It is, but we don't have to be perfect.
If we are pretty good at predicting the life expectancy of the data,
and we write erasure coding stripes that contain just data we expect to be ephemeral,
come garbage collection time,
however percentage we were right,
that stripe is going to be empty because that data was ephemeral
and it was deleted and overwritten,
which means we only have to garbage collect
a very small amount of data.
And that means that we create less write amplification
on the back end. And of course, the data that we move, we go, oh, you've survived long enough to be
garbage collected. Your life expectancy is now in a higher class because you've made it past
infant mortality. Yeah. And, you know, a stripe in this kind of world is like,
you're saying 146 plus four, right?
Or something like that.
It may be max, right?
It's 36.
So in one enclosure, there's 44 SSDs.
So we stripe 36 plus four is 40,
which leaves us four SSDs worth of space to rebuild to.
And then we rotate the stripes so that, you know, it's distributed RAID. And then as you add
enclosures, the stripes get wider. And even so, it's 36 of these 128 kilobyte large block sizes at a time?
We write one megabyte. So it's 36 meg, let's say, for a
substripe, and then there's 40 with the plus four? No, no. On each SSD, one front-end server
writes one-meg substripes, and we write a one-gig-deep stripe on each SSD of data with the same life expectancy, so that when
we garbage collect, we're doing the erase at one gigabyte per SSD, because the erase block size in
this QLC flash is 200 megabytes-ish. So we're erasing a multiple of the erase block size
and, again, reducing the write amplification inside the SSD.
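The placement arithmetic just described, written out as a sketch; the numbers come from the conversation, and the classifier at the end is a toy stand-in for whatever life-expectancy prediction is actually used, not VAST's logic.

MiB = 1 << 20
GiB = 1 << 30

DATA_SSDS, PARITY_SSDS = 36, 4          # single-enclosure stripe: 36 + 4 of the 44 QLC SSDs
SUBSTRIPE_PER_SSD = 1 * MiB             # each front end writes 1 MB per SSD at a time
STRIPE_DEPTH_PER_SSD = 1 * GiB          # data of similar life expectancy, per SSD
ERASE_BLOCK = 200 * MiB                 # "200 megabytes-ish" in this QLC flash

data_per_substripe = DATA_SSDS * SUBSTRIPE_PER_SSD                    # 36 MB of data
total_per_substripe = (DATA_SSDS + PARITY_SSDS) * SUBSTRIPE_PER_SSD   # 40 MB on media

print(data_per_substripe // MiB, total_per_substripe // MiB)   # 36 40
print(STRIPE_DEPTH_PER_SSD // ERASE_BLOCK)   # ~5 erase blocks erased per SSD per GC pass

def life_expectancy_class(path: str) -> str:
    """Toy classifier: group writes so a whole stripe tends to die together."""
    if "/tmp/" in path or path.endswith(".tmp"):
        return "ephemeral"
    return "long-lived"     # promoted again if it survives to its first GC pass
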
So in theory, let's bring that higher
to like standard operating drive replacement.
I shouldn't have this very sudden case of, you know, a whole block of SSDs going out, because you guys have kind of planned that usage pretty well. And we'll replace any SSD that fails for any reason at any time on any system under maintenance.
And we will write you a maintenance contract for up to 10 years
and we'll write extensions to shorter contracts up to 10 years at a flat rate.
Not many storage vendors out there offer 10 years of maintenance at a whack.
We have a government customer who bought a 10-year service contract.
I'm impressed.
Part of the new economic model is that you can leave this on your floor for 10 years.
It's an all-flash system.
It's unlikely you're going to have performance drivers saying,
well, the new thing is so fast,
we have to replace this old thing in five or six years. But if your vendor says,
you know, in year six your maintenance becomes so high you might as well buy a new system,
then over a 10-year period you actually end up migrating twice. And that means there's two years
where if you're a typical sloppy enterprise, well, it takes a year from the day the new system comes
in till the day the old system goes out. And so you're paying for two systems for two of those
10 years. So the economics of, I didn't have to have my guys do a migration. I didn't have to buy
a new system. I didn't have to run two systems at the same time. That starts to add up pretty quick.
You mentioned somewhere earlier about this world-class data reduction,
better, best in the world. I'm not sure what the term was.
Guaranteed better data reduction than any other storage system.
So how does one do that with, you know,
these data reduction systems have been out there for years?
A whole new idea.
Really? You don't write the data?
Well, no, we do write the data, but we don't do conventional deduplication.
What?
All right, that needs to be explained, Howard.
In a conventional system, you've got compression, which takes very small repetitions in the
incoming data and finds them and replaces them with symbols.
Shorter symbols, I might add, but go ahead.
Deduplication finds larger exact duplicates and replaces them
with pointers. We use a technique that's based on similarity. So like deduplication, we break the
data up into chunks and we use variable chunk sizes with the same technique that the Rocksoft patent covers so that insertions don't throw us off.
But rather than hashing that block with a strong collision-resistant hash function like SHA-1,
that's designed to have a large change in the hash function output for a small change in the data input. We use a hash function that generates
the same output if the two blocks are sufficiently similar or are sufficiently close in cryptographic
distance. So similarity like that is the problem. If I've got a block where I've written and I've updated it with a block that's two characters different, I want both those blocks to be saved, Howard.
I didn't say that. We're not throwing one away, Ray. I told you we're not doing deduplication. But what similar means is those two blocks compress with the same compression dictionary, because only two bytes are different.
So all of those repetitions and symbols are the same. So we store, when the first block we see
comes in that generates a given hash, we compress it and we store it and it becomes a reference
block. When the second block generates the same hash, we compress the two blocks together and save the difference.
Huh? What do you mean compress it together? The first block is already compressed.
Right. The second block comes in and it hashes to the similar hash?
Right. And then we compress it alongside the first block using the first block's dictionary.
We don't have to save the dictionary a second time.
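To see why storing the dictionary only once matters, here is an illustration using the standard library's zlib preset-dictionary support. VAST uses ZStandard and its own similarity hash, so this is only a demonstration of the principle, not their implementation; the data and variable names are made up.

import random
import zlib

random.seed(0)
reference = bytes(random.getrandbits(8) for _ in range(8192))   # an incompressible block
similar = bytearray(reference)
similar[100:105] = b"DELTA"                                     # a small edit
similar = bytes(similar)

# Store the reference block compressed on its own (random data barely shrinks).
ref_compressed = zlib.compress(reference, 9)

# Compress the similar block against the reference block as a preset dictionary,
# so the output is mostly just "the differences".
c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, 9,
                     zlib.Z_DEFAULT_STRATEGY, reference)
with_dict = c.compress(similar) + c.flush()

alone = zlib.compress(similar, 9)
print(len(ref_compressed), len(alone), len(with_dict))
# The dictionary-assisted copy is far smaller than compressing the block alone.
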
Huffman encoding kind of thing? Is that what it is, a Huffman kind of thing?
We also do Huffman encoding,
but this is more about the LZ phase
than the Huffman phase.
So we use ZStandard,
Facebook's new compression algorithm. Yeah,
compression algorithms just keep getting better in various ways, and ZStandard is very good at
the degree of compression for the amount of compute that it needs. Yeah, that's always the
problem, right? Compression demands so much
horsepower to run effectively. And the hash itself is not cheap either. For most systems,
the problem is time. That, you know, if you're doing real-time, if you're doing in-line
reduction, then it has to be fast because you're affecting the write latency.
And if you have an NVRAM write buffer,
it's only a few gigabytes. It's small.
You have to be really worried about draining it all the time. And again,
that means you don't have much time and you can't do really extensive
compression. Once data is written to our Optane layer, we acknowledge the write.
The migration process from the Optane to the QLC is asynchronous.
And that means that as long as we can have enough bandwidth draining the Optane layer
so that it doesn't overflow, the time it takes to move any
piece of data from Optane to QLC and erasure code it and compress it and encrypt it doesn't really
matter. And we can scale compute by adding more front-end servers independently of scaling capacity in the enclosures.
And frankly, front-end servers are cheap.
They're just standard x86 servers.
Our software is licensed on the enclosure.
So if you want more front-end servers,
you're just buying hardware.
And how much CPU it takes to compress
becomes a function of how much hardware you have, so
whether that creates a performance problem, or a not-enough-CPU-to-do-other-things problem, is just a scaling issue.
You throw a little bit more cheap hardware at it. You mentioned encryption at rest somewhere
back in my memory here. Yeah. New feature in next week's announcements.
So encryption at rest is going to affect compression, deduplication, data reduction, and all that other stuff, right?
We encrypt last.
I understand you encrypt last, but so the hash is not.
Okay, I got it.
So your hash is sitting there in the metadata associated with the block, I guess. Yeah, the hash and which reference block it points to
and the CRC are all in the metadata that points to the data.
So two blocks that happen to be similar
could be two distinct files.
They don't even have to be close to one another.
And they get encrypted with the same key, I guess.
It's not like a volume level encryption or a pool.
It's system-wide encryption.
It's system-wide, I see.
You say encrypt my data at rest and all the data in the system is encrypted.
Selective encryption and data reduction is an interesting problem.
Not one I have spent a huge amount of time considering.
So, Howard, that brings me back to this point of kind of
work management across
let's say that we have 10
front-end servers.
Right.
While there isn't,
while the servers themselves are
stateless, the work is
not stateless. Like if I'm
in the middle of doing, if one of the nodes is in
the middle of doing encryption and then it fails or encryption is being done across a certain number
of nodes and that fails, where is that state maintained? Because while the servers themselves
are stateless, there is still state. There's always state. The key to our system
is that the state is in the enclosure, where another server can pick it up. So let me answer your
question and give you another example. Every update to our system is transactional. And so migrating a substripe from Optane to QLC is a transactional process.
And so we create a transaction token and keep track of progress
on that transaction. If the front-end server died in the middle of migrating a stripe,
the data is all still in the 3D crosspoint. The transaction didn't complete, it gets aborted,
and moving that data gets assigned to another VAST server, which starts migrating the data.
There may be some data that got written to the QLC that's going to get written
to the QLC again, and it will occupy a little bit of space until we garbage collect. So there's some,
so obviously I'm keeping this tracking on the 3D crosspoint, and there's some worker daemon on each node looking at that.
Yeah, they're not logs, but yes. So let me give you, let me give you a little more complicated
example. SMB. All right. NFS and S3, the first protocols we implemented are stateless. Every request is
independent, but SMB is very stateful because SMB does things like opportunistic locking,
where a node caches data that it's reading and tells the server, I have this data cached.
And if some other user updates that data, the server sends a message that says invalidate your cache for this because somebody changed it.
That requires that the SMB server have a huge amount of state about every SMB client connection.
In everybody else's scale-out NAS running SMB 2.1,
if a node fails, that state is lost and the user has to reconnect.
Because we store that state in the 3D crosspoint, when a node in our system fails,
one of the other nodes picks up that virtual IP address and picks up the state, and the client
retries within the SMB timeout, and the user never notices that anything went wrong.
So yeah, the system has state; the front-end servers don't have state. And compared to something like an Isilon or some other scale-out system that's shared nothing,
that means failures of the front-end server nodes don't cause a rebuild event
because there's nothing that they were exclusive owners of.
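A conceptual sketch of "the state lives in the enclosure, the front ends are disposable": transaction records, SMB session state, and virtual-IP ownership sit in a shared store standing in for the 3D crosspoint, so a survivor can take over a failed node's address and its in-flight work. All names here are invented for illustration.

shared_state = {
    "transactions": {"txn-42": {"op": "migrate-substripe", "progress": "started",
                                "owner": "cnode-3"}},
    "smb_sessions": {"10.0.0.7": {"client": "WIN-DESKTOP-9", "oplocks": ["report.xlsx"]}},
    "vip_owner": {"10.0.0.7": "cnode-3"},
}

def fail_over(dead_node: str, survivor: str, state: dict) -> None:
    # 1. Reassign the dead node's virtual IPs; clients retry within the SMB timeout.
    for vip, owner in state["vip_owner"].items():
        if owner == dead_node:
            state["vip_owner"][vip] = survivor
    # 2. Abort and reassign any transaction the dead node owned; the data is
    #    still in the write buffer, so the survivor simply redoes the migration.
    for txn in state["transactions"].values():
        if txn["owner"] == dead_node:
            txn["progress"] = "aborted"
            txn["owner"] = survivor
    # No rebuild is triggered: the failed front end owned no data exclusively.

fail_over("cnode-3", "cnode-1", shared_state)
print(shared_state["vip_owner"], shared_state["transactions"]["txn-42"]["owner"])
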
Having trouble visualizing.
We need a whiteboard or something like that for our podcast, but that would be a different domain.
So this goes back to that argument that me, Matt, and Ray had over Kubernetes and state.
So in a truly stateless system, which the VAST system seems like it is, at least on the front end, not the system, but the nodes are stateless.
Yeah, the system is very stateful.
Any storage system is stateful, but the front end servers are stateless.
Yeah, the system is stateful, but how you maintain that state is based on the underlying
processes on the storage system itself.
So how would I approve this?
Now, go back and teach Kubernetes guys how to do this.
I'm just a storage guy.
All I know about Kubernetes is that we have a CSI and it works great.
Okay. Talk to me about,
do you guys do snapshots and replication and mirroring and,
and across systems and all that stuff?
So as you guys know,
we're a startup.
And so you come out with a product that meets the needs of some customers.
We're making our real push into what you would think of as enterprise users with the release that comes out at the end of April.
That includes SMB support and a snap to S3 function so that you can take snapshots and replicate that data to S3 and have offsite backups.
And we just treat remote backup,
remote snapshots like a special class of snapshots.
So there's a dot-vast remote folder
you can browse to restore a file from them. And encryption at rest.
So those features are, you know, just coming out.
And you guys already support S3, right? As an object protocol.
Yeah, we've supported NFS and S3 from day one.
All of our protocols are in-house development.
We don't use Samba or MoSMB or anything like that, because all that state stuff I was just describing and the failover only work because we've tied SMB so tightly to the distributed, excuse me, disaggregated shared-everything architecture.
Because we have the SMB server store its state in crosspoint, we can do the failover. Yeah, we're going to have to get you on a different podcast and talk about the networking part of this because I have a bunch
of networking questions about that back end. Even if it's NVMe over Fabrics, there is still the
process of moving the packets on and off the network in a way that makes sense from a latency and pace perspective.
So there's still a lot there to unpack. Oh, yeah. Well, I can tell, you know,
some of this isn't going to be out for a little while, but I can tell you that
we bottlenecked at the PCIe slot. We ran out of lanes.
On the back end or the front end?
In the back end.
Yeah, that's pretty impressive because, you know, as you talk about DPDK and all the technologies needed to move at line rate, just in general, to say that the bottleneck is PCIe, you guys are doing some pretty special stuff to optimize the code to get the packets on and off and reduce
that latency. It'll be pretty interesting to see what you guys are doing. I'd be glad to sit down
with you, Keith. Well, you have to sit down with all of us. We might move it to the CTO advisor
so we can do the whiteboard part. All right, gentlemen, this has been great. Keith and Matt,
any last questions for Howard before we close?
You know, I would ask Howard, how do you feel about working on the vendor side?
What's the change in your business?
Besides paycheck?
The steady paycheck is a nice thing.
The health insurance is a nice thing. I am still learning how to make everything I say be bright and sunny
because I'm a marketing guy, but, you know, it all works. It's a good team. You know,
I took the job because I was getting lonely and wanted to work with a team, and I'm very happy with this
team.
And, you know, it's a great story to tell.
The technology is unique.
We're doing things nobody else does.
And if you look at, you know, so we talked about the similarity data reduction.
But if you look at the systems, you know, we don't compete with Pure FlashArray or SolidFire. We compete with Isilon and Spectrum Scale, GPFS, and Lustre.
And on those systems, data reduction's been, yeah, you could compress your data in the cold
tier, but we don't recommend it for anything with performance. Or, yeah, we de-dupe, but it's a
background job. And you have to make sure that the background job
doesn't run when your users are busy
because it consumes a lot of the system performance.
We're data reduction all the time
over tens or hundreds of petabytes of data.
And people have had problems
getting that stuff to scale before.
And again, having all the metadata in the 3D crosspoint
means we don't have the usual dedupe problem of,
does that hash table fit in memory?
Interesting.
Keith, anything you'd like to ask?
Well, I'll just leave it off with the comment.
It looks pretty cool.
The solution is pretty cool.
I'll have to talk to my friends at Intel
to see if I can get them to sponsor
putting one of those on my data center for a little bit.
Yeah, let me know. I'd like to get
a hands-on myself.
I think we can make an argument for it, guys.
Yeah, yeah.
With our standard
enclosure
being 675 terabytes
of raw capacity,
home labs are not really a topic.
I don't have a home lab, Howard.
I'm in a 450,000 square foot data center.
So I think I can manage it.
Yeah, I understand.
I was the guy who had more than a home lab
for a long time.
Yeah, there you go.
Howard, anything you'd like to say
to our listening audience before we close?
Yeah, I think that our listening audience
should be watching VAST because even if you're not in the large scale that we operate in, you know, we're
doing interesting things and taking a new approach to doing storage. And somebody will follow along
and go, that was a good idea. What happens if you try and scale that down? Exactly.
Well, this has been great. Thank you very much, Howard, for being on our show today.
Always a pleasure, guys.
I miss being a gray beard.
Well, you're still a greybeard, but you're not a Greybeards on Storage greybeard.
Next time we will talk to another system storage technology person.
Any questions you want us to ask, please let us know.
And if you enjoy our podcast, tell your friends about it.
And please review us on iTunes and Google Play as this will help get the word out.
That's it for now. Bye, Keith. Bye, Matt. Bye, Ray. Bye, Ray. And bye, Howard. Bye, Ray. Bye.