Grey Beards on Systems - 0101: Greybeards talk with Howard Marks, Technologist Extraordinary & Plenipotentiary at VAST

Episode Date: April 30, 2020

As most of you know, Howard Marks (@deepstoragenet), Technologist Extraordinary & Plenipotentiary at VAST Data, used to be a Greybeards co-host and is still on our roster as a co-host emeritus. When I started to schedule this podcast, it was going to be our 100th podcast and we wanted to invite Howard and the rest …

Transcript
Starting point is 00:00:00 Hey everybody, Ray Lucchesi here with Keith Townsend and Matt Leib, and Howard Marks here, but this time I'm a guest. Welcome to the next episode of the Greybeards on Storage podcast, a show where we get Greybeard storage bloggers to talk with system vendors and other experts to discuss upcoming products, technologies, and trends affecting the data center or our world today. This Greybeards on Storage episode was recorded on April 23rd, 2020. We have with us here today Howard Marks, a longtime friend, former co-host, and Technologist Extraordinary and Plenipotentiary at VAST. So, Howard, why don't you tell us a little bit about yourself and what's new at VAST? Well, I've been around here long enough. I don't think
Starting point is 00:00:49 I need to tell people much. I've been in the storage business kind of forever, but things are exciting at VAST. As you guys know, we make a large-scale all-flash storage system for unstructured data. And we've been selling now for about 14 months. That's been going really well. And our investors rewarded us just the other day with... Yeah, I saw something about that. You guys got some more money, I think. Yeah, just a little. A hundred million dollar round at a $1.2 billion valuation. You know, I was going to go to NAB in the
Starting point is 00:01:27 unicorn aloha shirt, but then they canceled NAB. So, you know, the funding part is nice, especially since we haven't spent the B round yet. So, you know, that extra 100 million dollars is going to come in really handy in these kinds of times. Right, right. Well, it should give you plenty of runway. So I don't think we've talked with VAST before. So you mentioned unstructured data. So what exactly do you guys do again? So I hate to remind you of this, Ray, but I have been on before from VAST. But that aside, so we make a large-scale all-flash scale-out system for unstructured data. So all-flash is relatively obvious. But we make systems that start at a petabyte and go up, and up extends to exabytes. It's pretty unusual to see a petabyte of flash
Starting point is 00:02:25 for storage these days. Is that all flash? The problem is that while other vendors might have been willing to make you a petabyte all-flash storage system, nobody was able to afford it before. So VAST, you know, our founder, Renen Hallak, was the chief technical guy at XtremIO.
Starting point is 00:02:48 And as he was planning to leave, he started talking to XtremIO customers saying, okay, so what don't you like? And nobody said, you know, this all-flash array isn't fast enough. Yeah, I would say so. Everybody said, this all-flash array costs too much for me to put all my data on, and I have to use it just for my most critical or my most performance-requiring applications. VAST was designed to bring the cost of all-flash down to the point where you could use it for things like archives without people
Starting point is 00:03:25 going, are you crazy? You can't afford that. Yeah. It's very unusual to be archiving to all flash arrays, Howard. It is. We're the first people who make it possible. And the secret sauce starts with QLC flash. Are you guys actually shipping QLC Flash? We are. We've been shipping QLC Flash for about a year. We buy the cheapest QLC Flash SSDs. Intel makes them for a couple of the hyperscalers. And so this is their 15.36 terabyte usable SSDs, no DRAM buffer, no power failure capacitors because they don't have the buffer that they have to protect. And that means that the QLC is basically exposed to us.
Starting point is 00:04:23 And other people can't use those SSDs, first of all, because they're single-ported and most enterprise storage systems are designed for dual-port drives. So, Howard, you're describing like my nightmare desktop setup of a QLC flash with no DRAM, single-ported. That just doesn't sound like any kind of thing that I want to build a system around.
Starting point is 00:04:48 Well, it's not something you want to build a system around. It's something you want really smart guys to build a system around so that they can address all of the deficiencies of that really cheap QLC SSD that you're afraid of. But they address it in software. Well, there is a lot of software, but there's also a substantial amount of 3D XPoint. Optane and QLC? Optane and QLC. So you're using Intel? We are using Intel Optane SSDs
Starting point is 00:05:25 and we're using Intel QLC SSDs because they like us, not because we're locked into that. And the Optane acts as a write buffer. All the data is only stored on QLC other than during writes? Yes. You guys sound like a walking commercial for Optane.
Starting point is 00:05:49 If you look at how other storage vendors use Optane, it looks very much like what other storage vendors were doing in 2009, 2010 with flash SSDs. Right. It's, you know, oh, look here, we have this faster thing. We can use it as a cache. We have this faster thing. We can use it as a tier, and we already have tiering software, because we wrote that 10 years ago when flash SSDs came out. We are the first vendors to really look at 3D XPoint and say, how is this fundamentally different? So give me about
Starting point is 00:06:27 three minutes to run through some architectural points. So we disaggregate the compute part of a storage system from the media. The compute part runs in software on x86 servers and connects to the media over NVMe over Fabrics. The back end is NVMe over Fabrics? The back end is NVMe over Fabrics. Ethernet? Usually 100 gig Ethernet. For HPC customers we run InfiniBand, because they like to integrate into their InfiniBand network. So, sorry to interrupt, the media is stored in an HA enclosure? There are two fabric modules that each have 200 gig Ethernet ports that route NVMe over Fabrics requests to the SSDs in the enclosure. There's 12 Optane SSDs and 44 QLC SSDs in each enclosure. It's Optane SSDs, not Optane memory?
Starting point is 00:07:26 It's Optane SSDs because we talk to them over NVMe over Fabrics. Optane DIMMs are local to the CPU that the DIMM slot is in, right? We have disaggregated everything and we share everything, and you can't share a DIMM. You can with software, and maybe with, like, Gen-Z three years from now, but you really can't share a DIMM. So you've got this controller... two controllers, 10 controllers, scale-out controllers? Scale-out front-end servers. You can call them controllers, they do a little bit more, but controller is a good enough term. Every front-end server mounts over NVMe over Fabrics every SSD in the cluster at boot time. Wait, hold on.
Starting point is 00:08:18 Let me, because the scaling problem is coming directly into my head when I hear you say that. How many front-end servers? An arbitrary number we've tested. An arbitrary number. Let's just say two to N, apparently. You don't do a one, right? No, no, no.
Starting point is 00:08:34 So we won't sell you one, because performance is primarily dictated by the CPU. You know, there's so much performance in the back end that how much CPU you provide is the primary determinant of performance. But the system will run on one. So in failure modes, you get degraded performance all the way down to, I started with
Starting point is 00:08:58 20 and now there's only one. I'm kind of taken aback about each front end mounts every SSD, because I was going to ask you about this, you know, data locality challenge from the front end to the back end. But you're telling me that every SSD is mounted by every front-end server. Every SSD is mounted by every front-end server, which means that every bit of data is equally local to every front-end server. So these guys are smarter than me
Starting point is 00:09:33 because this is, that's crazy, Howard. Metadata locking across 25,000 front-end servers to 44 SSDs? We have not tested 25,000, but we have tested 100. 100 servers? So 100 front-end servers connect to, you know, 100 front-end servers in a typical arrangement would be 400 or 500 SSDs on the back end.
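To make the shared-everything layout concrete, here is a minimal Python sketch of the idea just described: every front-end server mounts every SSD in every enclosure over NVMe over Fabrics, so no data is more local to one server than another. The class names and counts are invented for illustration; this is not VAST's code.

```python
# Illustrative model of a disaggregated, shared-everything cluster:
# every front-end server "mounts" every SSD in every enclosure,
# so any server can reach any byte without a data-locality hop.
# Names and counts are hypothetical, for illustration only.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Enclosure:
    name: str
    optane_ssds: List[str] = field(default_factory=list)
    qlc_ssds: List[str] = field(default_factory=list)

    @classmethod
    def build(cls, name: str) -> "Enclosure":
        # 12 Optane write-buffer/metadata SSDs and 44 QLC capacity SSDs per enclosure
        return cls(
            name=name,
            optane_ssds=[f"{name}/optane{i}" for i in range(12)],
            qlc_ssds=[f"{name}/qlc{i}" for i in range(44)],
        )

@dataclass
class FrontEndServer:
    name: str
    mounted: List[str] = field(default_factory=list)

    def mount_all(self, enclosures: List[Enclosure]) -> None:
        # At boot, mount every SSD in the cluster (modeled here as a flat list).
        self.mounted = [ssd for e in enclosures for ssd in (e.optane_ssds + e.qlc_ssds)]

enclosures = [Enclosure.build(f"encl{i}") for i in range(4)]
servers = [FrontEndServer(f"cnode{i}") for i in range(8)]
for s in servers:
    s.mount_all(enclosures)

# Every server sees the same 4 * (12 + 44) = 224 devices.
assert all(len(s.mounted) == 224 for s in servers)
print(servers[0].name, "mounts", len(servers[0].mounted), "SSDs; so does", servers[-1].name)
```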
Starting point is 00:10:05 We're going to find the gotcha in this because there's a gotcha somewhere. We're going to find it. So let me fill you in on a couple of other pieces and you can see if a gotcha shows up. All right. Okay. So the front-end servers, the controllers, are stateless.
Starting point is 00:10:21 There's no metadata cache. They simply process requests and do all the compute. So they manage the data protection and the reduction and all of that stuff. But all the metadata is stored in the 3D XPoint. So if you write a file via NFS, or next month via SMB, the front-end server creates that file in the metadata structures in the 3D XPoint, so that all of the metadata is equally available to all of the front-end servers. All right, so that's starting to make a little sense to me.
Starting point is 00:10:58 So I have low latency access to the 3D XPoint SSDs. Everyone has equal access over this 100 gigabit Ethernet or InfiniBand. Or InfiniBand if I need super low latency. It's usually more InfiniBand because I already use InfiniBand. 100 servers
Starting point is 00:11:18 trying to create files at the same time to a single metadata data set? Is that kind of the world? A distributed metadata data set, but not a dedicated metadata service or server like you'd have in, like, Lustre. So you would distribute the metadata across all the Optanes, let's say 100 Optanes in this 400-SSD configuration? Yeah. So is it writing to those? I know we keep sidetracking you, Howard.
Starting point is 00:11:49 But is it writing those Optanes simultaneously, or is it replicating across a given server set? So when a front-end server makes a metadata modification or writes new data to the write buffer, it writes immediately to two Optanes in two enclosures before it considers it written. It isn't primary, secondary. So it's three Optane versions of this metadata or two? Two. Okay.
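A small sketch of the write-acknowledgement rule Howard just described, assuming an in-memory stand-in for the Optane devices: an update is copied to Optane in two different enclosures before it is acknowledged. The device names and the naive round-robin placement are hypothetical.

```python
# Sketch of the durability rule above: a metadata update (or buffered write)
# is copied to Optane devices in two *different* enclosures before it is
# considered written. The "devices" dict stands in for real NVMe-oF targets.

import itertools

devices = {                      # enclosure -> list of Optane device names
    "encl0": ["encl0/optane0", "encl0/optane1"],
    "encl1": ["encl1/optane0", "encl1/optane1"],
}
storage = {}                     # device name -> list of committed records
rr = itertools.cycle(sorted(devices))   # naive round-robin over enclosures

def write_durable(record: bytes) -> bool:
    """Return True only after the record is on Optane in two enclosures."""
    target_enclosures = [next(rr), next(rr)]
    if target_enclosures[0] == target_enclosures[1]:
        raise RuntimeError("need two distinct enclosures")
    for encl in target_enclosures:
        dev = devices[encl][0]                 # pick a device in that enclosure
        storage.setdefault(dev, []).append(record)
    return True                                # ack to the client only now

assert write_durable(b"create /home/ray/file_a")
print({dev: len(recs) for dev, recs in storage.items()})
```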
Starting point is 00:12:26 management because at the end of the day some CPU has to manage the metadata. It has to optimize it, etc. It's a database still. Yeah. So it's B-tree based and the trees are designed to be
Starting point is 00:12:42 so there's a B-tree we call it a V-tree but aside, there's a V tree for every object, you know, file or S3 object. And that tree is a maximum of seven layers deep. So any lookup to find, you know, where's this piece of data is a maximum of seven queries to the metadata base. But since every element has its own tree, having hundreds of front ends, unless they're all accessing the same file, they're not fighting over the same metadata objects. They're just dealing with the same set of metadata. Yeah, but I still have to perform like database. It's still a database. I still have to perform database maintenance on this. B-trees, maybe not. Not really more than maintaining the tree for any given file at any given time. You know, so if 100 servers with 500 and 400 SSDs and 100 Optanes, how does the traffic cop work in this thing? I'm going to access a file A. Which one of those 100 servers do I go to?
Starting point is 00:13:58 Those front-end servers have a pool of virtual IP addresses. A pool, okay. And round-robin DNS assigns you to one of those front-end servers. So there's a round-robin DNS in front of this thing? Yeah. And there's a Mellanox switch or something like that behind it? So we use Mellanox switches primarily today. Mellanox is an investor. We're also qualifying other switches because it's 100 gig
Starting point is 00:14:25 ethernet. So in theory, each client is only writing to one of those metadata handlers. I mean, you could, you know, I'm a Windows guy, so I'll think SMB, not NFS, but I could have drives mapped to different names that point to different front-end servers, but any one connection is going to go to one front-end server. And continue to use that server until it's a failure or something like that. Right. This is really cool, but I'm going to ask the obvious question. Why do I even want this? So you want this because the current class of scale-out systems doesn't fit your needs.
Starting point is 00:15:11 So we haven't today talked about how we have a new class of erasure codes that gives you N plus 4 protection with 3% overhead, or about how we have a new class of data reduction that we guarantee reduces data better than anybody else's. N plus four with only 3% overhead? 146 plus four. Now, Ray laughs at 146 plus four because you're thinking of Reed-Solomon erasure codes. And if we did 146 plus four with Reed-Solomon,
Starting point is 00:15:53 when a drive failed, we'd have to read from 145 drives. If two drives fail, then it gets worse. And that's, you know, everything for one drive failure, where I just have to read the first parity. We use a new class of erasure codes called locally decodable codes. And what locally decodable means is that in order to rebuild, we only need to read one fourth of the surviving data strips and all four protection strips. So we have to read 37 data strips to rebuild, not 145 data strips. Still, I mean, 37 is a large amount of data that's going to be consumed
Starting point is 00:16:34 during a rebuild, right? I mean, these are 15 terabyte drives, 15.4-ish kind of? Yeah, 15.4-ish. In order to have 146 plus four, you have four enclosures, which means you probably have 24 or more front-end servers. And the rebuild process is distributed across all the front-end servers. Each one rebuilds one RAID stripe at a time. So if there were a million RAID stripes that hit the failed SSD, that's a million rebuild jobs distributed across all the front-end servers. I was thinking you were going to actually assign, like, one of the 37 drives that are remaining to one of the front-end servers, but you're doing it
Starting point is 00:17:31 horizontally, not vertically. We're doing it horizontally. Everything parallelizes out, and it's a fail-in-place architecture. We don't have spare drives. It's not a many-to-one rebuild, it's a many-to-many rebuild. We just use spare space across the cluster. Let's go one layer deeper, if there is another layer deeper. Oh, you know me. There's always another layer deeper. There's another layer deeper. We're dealing with some pretty stupid drives here. These QLC drives are pretty dumb. Yes. How do you guys mitigate the stupidity of the drives when it comes to data integrity, et cetera? Because I'm writing to QLC and I don't trust QLC. First of all, let me talk about the endurance part for a second. And you can tell me how
Starting point is 00:18:23 geeky you want me to get. I'll start with not very geeky. The spec sheet for these drives? Was it two writes per year or something like that? It's 0.2 drive writes per day if you write 4K random I/Os, but it's four drive writes per day, 20 times as much, if you write 128K sequential I/Os. So you write sequential, obviously. Okay. So we use the 3D XPoint as a write buffer and we accumulate very large amounts of data before we start migrating it to the QLC. We created this system to use stupid SSDs. I'll tell you guys a little secret. The plan called for using the open-channel SSDs
Starting point is 00:19:14 where the controller is responsible for flash translation. And it turns out those are more expensive. So we use these other stupid SSDs. They don't have any translation on board? Well, they do. But you don't use it. But we are aware of the internal structures. And so we always write big writes that are a multiple of the page size of the flash, so that we never create page tears and force the SSD to garbage collect.
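Putting rough numbers and a sketch behind the endurance point: the spec-sheet figures quoted above give sequential 128K writes about 20 times the endurance of 4K random writes, and the buffer only ever drains to QLC in large chunks that are a whole multiple of the flash page size. The page and chunk sizes below are hypothetical, not the drive's actual geometry.

```python
# Back-of-the-envelope version of the endurance point, plus the write-shaping
# trick: accumulate writes in a fast buffer and flush to QLC only in large,
# page-aligned chunks so the SSD never garbage-collects partial pages.

DWPD_4K_RANDOM = 0.2      # spec-sheet endurance writing 4K random I/O
DWPD_128K_SEQ = 4.0       # endurance writing 128K sequential
print(f"Large sequential writes buy {DWPD_128K_SEQ / DWPD_4K_RANDOM:.0f}x the rated endurance")

FLASH_PAGE = 64 * 1024            # hypothetical flash page size
FLUSH_CHUNK = 1024 * 1024         # drain in 1 MiB chunks (a whole number of pages)
assert FLUSH_CHUNK % FLASH_PAGE == 0

write_buffer = bytearray()        # stands in for the Optane write buffer
flushed = []                      # chunks "migrated" to QLC

def buffered_write(data: bytes) -> None:
    """Accumulate incoming writes; emit only large, page-aligned chunks."""
    write_buffer.extend(data)
    while len(write_buffer) >= FLUSH_CHUNK:
        flushed.append(bytes(write_buffer[:FLUSH_CHUNK]))
        del write_buffer[:FLUSH_CHUNK]

for _ in range(300):              # 300 small 4 KiB writes
    buffered_write(b"x" * 4096)
print(f"{len(flushed)} aligned chunk(s) migrated, {len(write_buffer)} bytes still buffered")
```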
Starting point is 00:19:47 So what are the chances of overrunning the Optane write buffer? So there's three terabytes of Optane write buffer per enclosure. And the odds of overrunning it are nil, because the process that sucks the data out of that write buffer runs faster than the process that pushes the data into that write buffer. So we can drain the buffer faster than we can fill it. So how many SSDs are we talking per enclosure? Is it 40? 44. 44 and 12. 44 and 12.
Starting point is 00:20:27 And you're saying that you can actually empty the Optane faster than you can fill it? Yes. Now, emptying the Optane means reading it into something and writing out to a QLC SSD, right? So we now do encryption at rest. Encryption at rest. Okay. And so that means that the migration process means reading data. And so, oh, another trick.
Starting point is 00:20:52 So the whole idea in maximizing the endurance of the QLC layer is to minimize how much garbage collection happens, both within the system and within the SSD. So when we write new data to the system, we evaluate its life expectancy. If it's a temp file or getting written to a temp folder or other things, then we say this is going to be ephemeral data. It's not going to last very long. It's hard to do that sort of determination on the fly. It is, but we don't have to be perfect. If we are pretty good at predicting the life expectancy of the data,
Starting point is 00:21:37 and we write erasure coding stripes that contain just data we expect to be ephemeral, come garbage collection time, to whatever percentage we were right, that stripe is going to be empty, because that data was ephemeral and it was deleted or overwritten, which means we only have to garbage collect a very small amount of data. And that means that we create less write amplification
Starting point is 00:22:07 on the back end. And of course, the data that we move, we go, oh, you've survived long enough to be garbage collected. Your life expectancy is now in a higher class because you've made it past infant mortality. Yeah. And, you know, a stripe in this kind of world is, like, you're saying 146 plus four, right? Or something like that. That may be the max, right? It's 36. So in one enclosure, there's 44 SSDs.
Starting point is 00:22:41 So we stripe 36 plus four is 40, which leaves us four SSDs worth of space to rebuild to. And then we rotate the stripes so that, you know, it's distributed RAID. And then as you add enclosures, the stripes get wider. Even so, 36 of these 128-kilobyte large block sizes at a time? We write one megabyte. So it's 36 meg, let's say, for a substripe, and then there's 40 with the plus four? No, no. On each SSD, one front-end server writes one-meg substripes, and we write a one-gig-deep stripe on each SSD of data with the same life expectancy, so that when we garbage collect, we're doing the erase at one gigabyte per SSD, because the erase block size in
Starting point is 00:23:38 this QLC flash is 200 megabytes-ish. So we're erasing a multiple of the erase block size and, again, reducing the write amplification inside the SSD. So in theory, let's bring that higher up to, like, standard operating drive replacement. I shouldn't have this very sudden, you know, whole block of SSDs going out, because you guys have kind of planned that usage pretty well. We'll replace a drive for any reason at any time on any system under maintenance. And we will write you a maintenance contract for up to 10 years, and we'll write extensions to shorter contracts up to 10 years at a flat rate. Not many storage vendors out there offer 10 years of maintenance at a whack.
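To put numbers on the erasure-code and stripe geometry from the last few exchanges: a 36+4 stripe inside one enclosure, a 146+4 stripe across four enclosures, and the rebuild-read figures Howard quoted (read about a quarter of the surviving data strips plus all four protection strips). The exact locally decodable code construction isn't public, so treat this as arithmetic rather than an implementation.

```python
# Worked arithmetic for the stripe widths and rebuild reads discussed above.

import math

def rebuild_reads(data_strips: int, protection_strips: int) -> dict:
    survivors = data_strips - 1            # one data SSD has failed
    return {
        "reed_solomon": survivors,                  # classic RS reads every survivor
        "ld_data": math.ceil(survivors / 4),        # locally decodable: ~1/4 of survivors
        "ld_protection": protection_strips,         # plus all four protection strips
    }

for d, p in [(36, 4), (146, 4)]:           # one enclosure vs. a four-enclosure stripe
    r = rebuild_reads(d, p)
    overhead = p / (d + p) * 100           # 146+4 works out to roughly 3%
    print(f"{d}+{p}: {overhead:.1f}% protection overhead; "
          f"Reed-Solomon rebuild reads {r['reed_solomon']} strips, "
          f"locally decodable reads about {r['ld_data']} data + {r['ld_protection']} protection strips")
```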
Starting point is 00:24:43 We have a government customer who bought a 10-year service contract. I'm impressed. Part of the new economic model is that you can leave this on your floor for 10 years. It's an all-flash system. It's unlikely you're going to have performance drivers saying, well, the new thing is so fast, we have to replace this old thing in five or six years. But if your vendor says, you know, in year six your maintenance becomes so high, you might as well buy a new system.
Starting point is 00:25:19 Then over a 10 year period, you actually end up migrating twice. And that means there's two years where if you're a typical sloppy enterprise, well, it takes a year from the day the new system comes in till the day the old system goes out. And so you're paying for two systems for two of those 10 years. So the economics of, I didn't have to have my guys do a migration. I didn't have to buy a new system. I didn't have to run two systems at the same time. That starts to add up pretty quick. You mentioned somewhere earlier about this world-class data reduction, better, best in the world. I'm not sure what the term was. Guaranteed better data reduction than any other storage system.
Starting point is 00:26:06 So how does one do that when, you know, these data reduction systems have been out there for years? A whole new idea. Really? You don't write the data? Well, no, we do write the data, but we don't do conventional deduplication. What? All right, that needs to be explained, Howard. In a conventional system, you've got compression, which takes very small repetitions in the
Starting point is 00:26:35 incoming data and finds them and replaces them with symbols. Shorter symbols, I might add, but go ahead. Deduplication finds larger exact duplicates and replaces them with pointers. We use a technique that's based on similarity. So like deduplication, we break the data up into chunks, and we use variable chunk sizes with the same technique that the Rocksoft patent covers, so that insertions don't throw us off. But rather than hashing that block with a strong collision-resistant hash function like SHA-1, which is designed to have a large change in the hash function output for a small change in the data input, we use a hash function that generates the same output if the two blocks are sufficiently similar, or sufficiently close in cryptographic
Starting point is 00:27:38 distance. So similarity like that is the problem. If I've got a block where I've written and I've updated it with a block that's two characters different, I want both those blocks to be saved, Howard. We're not throwing one away, Ray. I told you we're not doing deduplication. But what similar means is those two blocks dedupe with the same compression dictionary, because only two bytes are different. So all of those repetitions and symbols are the same. So when the first block we see comes in that generates a given hash, we compress it and we store it, and it becomes a reference block. When the second block generates the same hash, we compress the two blocks together and save the difference. Huh? What do you mean, compress it together? The first block is already compressed. Right. The second block comes in and it hashes to the similar hash? Right. And then we compress it alongside the first block using the first block's dictionary.
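Here is a minimal sketch of the "compress against a reference block" step just described, using zlib's preset-dictionary support from the Python standard library as a stand-in for the Zstandard dictionary compression Howard mentions. The similarity hash that decides two blocks are close enough is VAST's secret sauce and is not modeled here.

```python
# Compress a new block against an already-stored, similar reference block.

import os
import zlib

def compress_plain(block: bytes) -> bytes:
    return zlib.compress(block, 9)

def compress_against(block: bytes, reference: bytes) -> bytes:
    # Reuse the already-stored reference block as the compression dictionary,
    # so substrings shared with it become back-references instead of new data.
    c = zlib.compressobj(level=9, zdict=reference)
    return c.compress(block) + c.flush()

reference = os.urandom(4096)                                   # a block we already stored
similar = reference[:2000] + b"ten bytes!" + reference[2010:]  # same block, small edit

standalone = compress_plain(similar)
delta = compress_against(similar, reference)
print(f"compressed alone: {len(standalone)} bytes; "
      f"compressed against the reference's dictionary: {len(delta)} bytes")

# Decompression needs the same dictionary, which is why the metadata keeps a
# pointer from the similar block back to its reference block.
d = zlib.decompressobj(zdict=reference)
assert d.decompress(delta) + d.flush() == similar
```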
Starting point is 00:28:45 We don't have to save the dictionary a second time. Huffman encoding kind of thing? Is that, what, a Huff-Puff kind of thing? We also do Huffman encoding, but this is more about the LZ phase than the Huffman phase. So we use Zstandard, Facebook's new compression algorithm. It's one... yeah,
Starting point is 00:29:08 compression algorithms just keep getting better in various ways and um z standard is very good at the degree of compression for the amount of compute that it needs yeah that's always the problem right compression demands so much horsepower to run effectively. And the hash itself is not cheap either. For most systems, the problem is time. That, you know, if you're doing real-time, if you're doing in-line reduction, then it has to be fast because you're affecting the write latency. And if you're doing inline reduction, then it has to be fast because you're affecting the write latency. And if you have a NVRAM write buffer, it's only a few gigabytes. It's small.
Starting point is 00:29:55 You have to be really worried about draining it all the time. And again, that means you don't have much time, and you can't do really extensive compression. Once data is written to our Optane layer, we acknowledge the write. The migration process from the Optane to the QLC is asynchronous. And that means that as long as we have enough bandwidth draining the Optane layer so that it doesn't overflow, the time it takes to move any piece of data from Optane to QLC and erasure code it and compress it and encrypt it doesn't really matter. And we can scale compute by adding more front-end servers independently of scaling capacity in the enclosures.
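A compact sketch of the latency argument just made: acknowledge the write as soon as it is staged in the fast buffer, and do the expensive reduction work asynchronously on the way to QLC. Everything here is an in-memory stand-in, and zlib simply stands in for the real reduction, erasure coding, and encryption pipeline.

```python
# Ack early, reduce later: client latency stops at the buffer insert, and the
# compute-heavy work happens in a background drain step.

import zlib
from collections import deque

write_buffer = deque()        # stands in for data staged on Optane
qlc = []                      # stands in for reduced/encoded data on QLC

def handle_write(data: bytes) -> str:
    write_buffer.append(data)
    return "ack"              # client latency ends here; no compression yet

def migrate_one() -> None:
    """Background job: drain one buffered write, reduce it, then 'store' it."""
    data = write_buffer.popleft()
    reduced = zlib.compress(data, 9)      # stand-in for similarity reduction
    qlc.append(reduced)                   # erasure coding / encryption elided

for i in range(4):
    assert handle_write(f"record {i} ".encode() * 100) == "ack"

while write_buffer:                       # runs asynchronously in real life
    migrate_one()

print(f"{len(qlc)} chunks migrated; buffer empty: {not write_buffer}")
```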
Starting point is 00:30:47 And frankly, front-end servers are cheap. They're just standard x86 servers. Our software is licensed on the enclosure. So if you want more front-end servers, you're just buying hardware. And how much CPU it takes to compress becomes a function of how much hardware you have. Whether that creates a performance problem, or a there's-not-enough-CPU-to-do-other-things problem, is a scaling issue.
Starting point is 00:31:12 You throw a little bit more cheap hardware at it. You mentioned encryption at rest somewhere back in my memory here. Yeah. New feature in next week's announcements. So encryption at rest is going to affect compression, deduplication, data reduction, and all that other stuff, right? We encrypt last. I understand you encrypt last, but so the hash is not... Okay, I got it. So your hash is sitting there in the metadata associated with the block, I guess. Yeah, the hash and which reference block it points to and the CRC are all in the metadata that points to it.
Starting point is 00:31:52 So two blocks that happen to be similar could be two distinct files. They don't even have to be close to one another. And they get encrypted with the same key, I guess. It's not like a volume level encryption or a pool. It's system-wide encryption. It's system-wide, I see. You say encrypt my data at rest and all the data in the system is encrypted.
Starting point is 00:32:16 Selective encryption and data reduction is an interesting problem. Not one I have spent a huge amount of time considering. So, Howard, that brings me back to this point huge amount of time considering. So Howard, that brings me back to this point of kind of work management across let's say that we have 10 front-end servers. Right.
Starting point is 00:32:36 While there isn't, while the servers themselves are stateless, the work is not stateless. Like if I'm in the middle of doing, if one of the nodes is in the middle of doing encryption and then it fails or encryption is being done across a certain number of nodes and that fails, where is that state maintained? Because while the servers themselves are stateless, there is still state. There's always state. The key to our system
Starting point is 00:33:06 is the states in the enclosure where another server can pick it up. So let me answer your question and give you another example. Every update to our system is transactional. And so migrating a substripe from Optane to QLC is a transactional process. And so we create a transaction token and indicate progress on that, and keep track of progress on that transaction. If the front-end server died in the middle of migrating a stripe, the data is all still in the 3D crosspoint. The transaction didn't complete, it gets aborted, and moving that data gets assigned to another vast server, which starts migrating the data. There may be some data that got written to the QLC that's going to get written to the QLC again, and it will occupy a little bit of space until we garbage collect. So there's some,
Starting point is 00:34:15 so obviously I'm keeping this tracking on the 3D cross point, and there's some worker, Damon, on each, on each node looking at that. Yeah, they're not logs, but yes. So let me give you, let me give you a little more complicated example. SMB. All right. NFS and S3, the first protocols we implemented are stateless. Every request is independent, but SMB is very stateful because SMB does things like opportunistic locking, where a node caches data that it's reading and tells the server, I have this data cached. And if some other user updates that data, the server sends a message that says invalidate your cache for this because somebody changed it. That requires that the SMB server have a huge amount of state about every SMB client connection. In everybody else's scale out NAS running SMB 2.1.
Starting point is 00:35:25 If a node fails, that state is lost and the user has to reconnect. Because we store that state in the 3D cross point, when a node in our system fails, one of the other nodes picks up that virtual ip address and picks up the state and the client retries within the smb timeout and the user never noticed that anything went wrong so yeah the system has state the front end servers don't have state. And compared to something like an Isilon or some other scale-out system that's shared nothing, that means failures of the front-end server nodes don't cause a rebuild event because there's nothing that they were exclusive owners of. Having trouble visualizing.
Starting point is 00:36:20 We need a whiteboard or something like that for our podcast, but that would be a different domain. So this goes back to that argument that me, Matt, and Ray had over Kubernetes and state. So in a truly stateless system, which the VAS system seems like is a truly stateless system, at least on the front end, not the system, but the nodes are stateless. Yeah, the system is very stateful. Any storage system is stateful, but the front end servers are stateless. Yeah, the system is stateful, but how you maintain that state is based on the underlying processes on the storage system itself. So how would I approve this?
Starting point is 00:37:02 Now, go back and teach Kubernetes guys how to do this. I'm just a storage guy. All I know about Kubernetes is that we have a CSI and it works great. Okay. Talk to me about, do you guys do snapshots and replication and mirroring and, and across systems and all that stuff? So as you guys know, we're a startup.
Starting point is 00:37:21 And so you come out with a product that meets the needs of some customers. We're making our real push into what you would think of as enterprise users with the release that comes out at the end of April. That includes SMB support and a snap-to-S3 function, so that you can take snapshots and replicate that data to S3 and have offsite backups. And we just treat remote backups, remote snapshots, like a special class of snapshots. So there's a dot vast remote folder you can browse to restore a file from, and encryption at rest. So those features are, you know, just coming out.
Starting point is 00:38:08 And you guys already support S3, right? As an object protocol. Yeah, we've, we've, we've supported NFS and S3 from day one. All of our protocols are in-house development. We don't use Samba or MoSMB or anything like that because all that state stuff I I was just describing and the failover only works because we've tied SMB so tightly to the distributed, excuse me, disaggregated shared everything architecture. Because we've had the SMB server store at state and crosspoint, we can do the failover. Yeah, we're going to have to get you on a different podcast and talk about the networking part of this because I have a bunch of networking questions about that back end. Even if it's NVMe over Fabric, there is still the process of moving the packets in and off the network at a latency and pace perspective that makes sense. So there's still a lot there to unpack. Oh, yeah. Well, I can tell, you know,
Starting point is 00:39:13 some of this isn't going to be out for a little while, but I can tell you that we bottlenecked at the PCIe slot. We ran out of lanes. On the back end or the front end? In the back end. Yeah, that's pretty impressive because, you know, as you talk about DPDK and all the technologies needed to move at line rate, just in general, to say that the bottleneck is PCIe, you guys are doing some pretty special stuff to optimize the code to get the packets on and off and reduce that latency. It'll be pretty interesting to see what you guys are doing some pretty special stuff to optimize the cold to get the packets on and off and reduce that latency. It'll be pretty interesting to see what you guys are doing. I'd be glad to sit down with you, Keith. Well, you have to sit down with all of us. We might move it to the CTO advisor so we can do the whiteboard part. All right, gentlemen, this has been great. Keith and Matt,
Starting point is 00:40:03 any last questions for Howard before we close? You know, I would ask Howard, how do you feel about working on the vendor side? What's the change in your business? Besides paycheck? The steady paycheck is a nice thing. The health insurance is a nice thing. Um, I am still learning, um, how to make everything I say be bright and sunny because I'm a marketing guy, but, uh, you know, it all works. It's a good team. You know, I, I took a job cause I was getting lonely and wanted to work with a team and I'm very happy with this
Starting point is 00:40:44 team. And, you very happy with this team. And, you know, it's a great story to tell. The technology is unique. We're doing things nobody else does. And if you look at, you know, so we talked about the similarity data reduction. But if you look at the systems, you know, we don't compete with pure flash array or solid fire. We compete with Isilon and Spectrum Scale, GPFS, and Lustre. And on those systems, data reduction's been, yeah, you could compress your data in the cold tier, but we don't recommend it for anything with performance. Or, yeah, we de-dupe, but it's a
Starting point is 00:41:23 background job. And you have to make sure that the background job doesn't run when your users are busy because it consumes a lot of the system performance. We're data reduction all the time over tens or hundreds of petabytes of data. And people have had problems getting that stuff to scale before. And again, having all the metadata in the 3D crosspoint
Starting point is 00:41:45 means we don't have the usual dedupe problem of, does that hash table fit in memory? Interesting. Keith, anything you'd like to ask? Well, I'll just leave it off with the comment. It looks pretty cool. The solution is pretty cool. I'll have to talk to my friends at Intel
Starting point is 00:42:04 to see if I can get them to sponsor putting one of those on my data center for a little bit. Yeah, let me know. I'd like to get a hands-on myself. I think we can make an argument for it, guys. Yeah, yeah. With our standard enclosure
Starting point is 00:42:20 being 675 terabytes of raw capacity, home labs are not really a topic. I don't have a home lab, Howard. I'm in a 450,000 square foot data center. So I think I can manage it. Yeah, I understand. I was the guy who had more than a home lab
Starting point is 00:42:36 for a long time. Yeah, there you go. Howard, anything you'd like to say to our listening audience before we close? Yeah, I think that our listening audience should be watching VAST because even if you're not in the large scale that we operate in, you know, we're doing interesting things and taking a new approach to doing storage. And somebody will follow along and go, that was a good idea. What happens if you try and scale that down? Exactly.
Starting point is 00:43:02 Well, this has been great. Thank you very much, Howard, for being on our show today. Always a pleasure, guys. I miss being a gray beard. Well, you're still a gray beard, but you're not a gray beard on storage gray beard. Next time we will talk to another system storage technology person. Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it. And please review us on iTunes and Google Play as this will help get the word out.
Starting point is 00:43:25 That's it for now. Bye, Keith. Bye, Matt. Bye, Ray. Bye, Ray. And bye, Howard. Bye, Ray. Bye.
