Grey Beards on Systems - GreyBeards talk data-aware, scale-out file systems with Peter Godman, Co-founder & CEO, Qumulo

Episode Date: July 13, 2015

In this podcast we discuss Qumulo’s data-aware, scale-out file system storage with Peter Godman, Co-founder & CEO of Qumulo. Peter has been involved in scale-out storage for a while now, coming from (EMC) Isilon before starting Qumulo. Although this time he’s adding data-awareness to scale-out storage. Presently, Qumulo is vertically focused on the HPC and media/entertainment market …

Transcript
Starting point is 00:00:00 Hey everybody, Ray Lucchesi here and Howard Marks here. Welcome to the next episode of the Greybeards on Storage monthly podcast, a show where we get Greybeards storage and system bloggers to talk with storage and system vendors to discuss upcoming products, technologies, and trends affecting the data center today. Welcome to the 22nd episode of Greybeards on Storage, which was recorded on July 6, 2015. We have with us here today Peter Godman, CEO and founder of Qumulo. Why don't you tell us a little bit about Qumulo, Peter? Thanks, Howard and Ray.
Starting point is 00:00:44 Qumulo is an enterprise data storage company that was founded in 2012. We've been working on our product for about three and a quarter years. We launched our company right about three months ago in March of 2015. In the very limited amount of time that we've had product on the market,
Starting point is 00:01:03 we've acquired right around 30 customers. And the way to quickly understand what Qumulo does is we build storage that scales really, really easily, but that most importantly is great at telling you about what data you have stored. Traditional storage tends to focus on just the container and managing the container. And we do that, but we also focus on helping people understand their data footprint itself and manage it. So you've been out on the market, you said about three months, four months. Is that true? Yeah. So we actually started selling our first product, which is our QC24 product, in about August of last year. But we sold product for quite a while in stealth mode.
Starting point is 00:01:54 So we went out and found people that had the problem that we were trying to solve. And then with about 15 customers in March of 2015, we finally launched the company and product and announced the new hardware platform. That's great. So I'm a little bit confused by what you mean about information about your data because I've heard terms like that mean so many different things. But before we get there, let's go to just the real basics.
Starting point is 00:02:33 Block, file, scale out, scale up? Yeah, we build a scale out file storage system. So we say we build the world's first data-aware scale-out NAS. Okay. And data awareness means what? Okay. So back to data awareness. Lots of people want to talk about data in a lot of different ways. And you're right to pause on what that actually means for any particular vendor. The problem that we're focused on is one where organizations need to store billions now of individual files across petabytes of data. And rapidly come to the realization that they don't necessarily know what they have anymore, why it's growing, where the performance goes in terms of the actual data, who's using individual files, and then also grapple with
Starting point is 00:03:32 trying to back up and archive and understand what things need to be archived out of their data sets. So we're not a big data company. Big data companies are really focused on extracting value out of your individual files. You can kind of think of us as the big data of your metadata company. So you have metadata associated with billions and billions of files, and we help folks understand what those files actually are and what they have. I guess one question I would have is, are you creating additional metadata beyond what I would consider the standard NFS or CIFS-SMB metadata for files? Today, we're focused on the metadata that already exists inside files. you can imagine us extending that to be any sort of extensible metadata and the ability to provide analytics on extensible metadata and index, et cetera. So the canonical use case there, I guess, you know, you imagine the first step here in data awareness is tell me what proportion
Starting point is 00:04:38 of storage in my system is consumed by files that are, you know, three to seven megabytes in size. The next step might be, you know, tell me about what proportion of storage is consumed by files that are three to seven megabytes that are 192 kilobit MP3s or something like that. So you start with just the POSIX metadata. Next step out is things you can infer about the file by looking at it. And then the step beyond that, I think, would be the individual metadata tags that applications or end users make about the data. Interesting. And you said scale out. So is there a limit to the number of nodes?
Starting point is 00:05:17 I guess that's a critical question here. I mean, architecturally, there's always a limit of some type, right? Yeah. So our primary limit is just governed by what we can assemble to put together and test. So we build an N-way scale-out system. And the assumption is that we scale linearly. And in practice, our ability to deliver bigger and bigger systems is gated more than anything else by our ability to assemble more and more hardware. Practically speaking, we don't see a huge number of scalability limitations for our architecture right now. That said, our biggest systems in field are right around the 20-node mark right now.
Starting point is 00:05:56 That's not bad. And it's a shared nothing model, right? That's right. We build a shared nothing system. So the way to think about it is that all protection inside our scale-out file system lives at a scalable block layer. So there's a scalable block layer in our system that provides transactional semantics and also data protection. The file system itself thinks it's talking to pretty much a linear address space, but the file system itself is also a scale-out entity. So whereas some object systems and some scale-out systems will build protection and transactional semantics into the file layer, we actually build it into the block layer. better data efficiency because we can protect files against each other rather than just against themselves and also higher transactional performance. Interesting distinction you draw there Peter. Probably have to probe that for a little bit.
Starting point is 00:06:56 But so as far as the hardware is concerned, is there a specific inter-cluster network that you require or anything like that in InfiniBand or something like that? Today, our smaller systems use 10 gigabit Ethernet and our larger systems use 40 gigabit Ethernet. We're committed to open standards at this point. I've been involved in systems before that used InfiniBand, for example, and using that is pretty convenient. But at this point, you can get what you need out of Ethernet. And, you know, one of the themes for our company is that as time moves forward, vendors of something that, you know, you and I would call storage will have less and less
Starting point is 00:07:38 control over our hardware environment. So in the future, storage systems will run in other people's infrastructure as a service clouds. In the future, our customers will want more and more to use white box hardware that they're procuring on their own. And so we're focused on using all open hardware components. So Ethernet for interconnectivity. And rather than using a proprietary NV RAM
Starting point is 00:08:06 we use SSD to take all writes in the system and to handle commitment of those writes. Okay so the SSD is for a write buffer only or is it used for reads as well? Yeah so all data coming into the system lands in SSD and data may be read from SSD as well. And as the SSD begins to get full, data is selected to migrate down to spinning disk as a call to tier. So storage tiering to a large extent.
Starting point is 00:08:42 Yeah. I've dealt with this problem for a long time. And the traditional solution, scale aside, has been to run some third-party product that walks the file system and copies the metadata into a database so that I can query it. Have you guys made the actual file system metadata that kind of queryable, indexable database?
Starting point is 00:09:16 Or are there two data structures there? And how do you manage that? Yeah, great question. We've, you know, like you, something that we observed when we started the company uh just as an aside you know we we conducted a great deal of research starting the company we ended up talking to about 600 administrators of data storage and one of the things that we found like you is over and over again you see this pattern of things that walk the file system extract extract metadata, and put it into a separate database. And what we heard when we talked to people doing that is it provides a solution to some problems. If you don't need up to the minute
Starting point is 00:09:56 information or even up to the hour or day information, you can get a kind of view of what's going on in your storage. But in creating that view, you had to have a separate database to scale and manage a piece of software that walks through files in a file system and then pushes it out to this separate database. And in return for that, you get this kind of out-of-date information. If that walker is single-threaded,
Starting point is 00:10:21 it might take, say, 50 days to index a billion files. And if it's multi-threaded, then you consume a lot of IOPS looking at the same files over and over and over again, making sure that they haven't changed. So big problem. People are really unhappy with this situation. And the existing file systems bog down. Yeah, that's right.
Starting point is 00:10:43 None of the file systems are just not designed for, you know, I want to, even a backup job going, I want to check the archive bit on these four billion files. It takes forever. That's right. That's right. So, exactly right. So creation of a backup catalog, for example, starts to take forever and it results in large systems that are almost impossible to back up. Answering any kind of complicated question becomes really difficult. And then last, the caches of storage systems are not there to cater to the needs of things that want to repeatedly read all metadata out of the system. And yet in a traditional storage environment, that's what
Starting point is 00:11:25 ends up happening. All of your cache is completely filled with inodes associated with periodic walks of the entire file system tree. So that's a really bad situation. And we believe that storage should be rather more intelligent about that and should be able to answer complicated questions about data footprint virtually instantaneously. So what we decided to do, you asked the question about is it built in or is it external. So our database is completely built into the file system tree itself. And what we do is we basically build a hierarchical database into the file system such that you can answer really quickly complicated questions about resource usage and also really rapidly identify,
Starting point is 00:12:17 for example, all of the files in the system that have changed in the last 24 hours. So functionality was recently added in the last few weeks. So functionality was recently added in the last few weeks that makes that query, for example, very, very fast on our system. And what you can do inside any directory inside... Kind of the logical equivalent of being able to put an index on a metadata attribute.
Starting point is 00:12:42 Yeah, that's right. So one of the challenges with traditional relational database approach is it gets more and more expensive the more indexes you add. Keeping track of all these different data structures is really expensive. So what we came up with is a way that individual attributes of things can have functions applied to them, and those can be aggregated inside the file system tree. So, for example, one of the attributes that we aggregate is the maximum of the change time inside a directory structure. So, you can look at any directory in the system and say, tell me about the most
Starting point is 00:13:18 recent change time that exists somewhere in this tree. And if you, for example, want to say, I just want to know about these changes. For all the MP4s, I can very quickly find out if there are any at any node in the tree without having to enumerate its members. That's the idea. So, you know, you can imagine recursively starting at the root directory saying, hey, does this thing, does this directory have anything that's been changed in the last day?
Starting point is 00:13:47 No? Okay, I don't need to deal with it then. And you do that recursively in your search for files that match individual criteria. So that's how indexing works inside CumuloCore. And we did that, as you say, in response to people's pain associated with building these external databases of metadata. It's almost like you built your own internal database for the metadata. It's not really a relational database per se, but you've optimized the information. It's an older model. Yeah, absolutely. ISAM kind of thing.
Starting point is 00:14:25 Absolutely right. More Codacil like. Yeah, I suppose, yeah. Okay. Yeah, so the thing is with relational model is it's wonderful to be able to ask really, really complicated queries about very large data sets. But, you know,
Starting point is 00:14:42 when you build a file system, it has to perform like a file system. We sell a scalable file system product and so updating 10 indexes for every inode write we do just wouldn't work and so we had to find a way to build a high performance database that could answer most of the questions people had
Starting point is 00:14:59 and still have a file system My problem with things like ILM over the years has in no small part been just that the metadata available in POSIX is so meager
Starting point is 00:15:15 that it makes it difficult to make decisions. Do you have any extensibility to this? So today, we don't have an implementation of extensible metadata in our file system. You can expect that to change relatively quickly. What we do have is the ability to, as you say, first leverage POSIX metadata and then next derive it automatically from the files themselves. So lots and lots of file types have headers where you can read the information about the file.
Starting point is 00:15:48 For example, find out if it's an MP3 or whatever kind of file it is. When you start getting to JPEGs, there's a lot of data there. Yeah, right. There's actually a lot of data inside systems, and most of it never gets consulted. And it's going to be a long road for us to pull all of that data out of all of those different file types. It's one of the things that makes a vertically focused go-to-market strategy really appropriate for this sort of product because you need to understand 10, 20, 50 file formats at a time, rather than every file format the world has ever, you know, has ever come up with, which would be impossible. So, yeah, in answer to your question, as I say, I see this as a three-stage journey. The first is just POSIX. The second is, you know, POSIX plus whatever's in the file. And then the third stage will be
Starting point is 00:16:41 applications and possibly individuals tagging extended attributes on data. And of those, you know, like you, I've seen for a long time that getting humans to do any manual tagging of metadata on files is very painful and difficult. I blame Bill Gates. You know, I worked for a while on a system at Real Networks to do with having all content creators just tag metadata on their created content. And that project went on for a very long time. And I think the folks who were running it came to terms with the fact that getting humans to do things like that would always be very, very difficult. The funny part is there's one class of humans it's not that difficult for, and that's attorneys. Yeah, librarians maybe or something like that.
Starting point is 00:17:37 Because if you go to a law firm, law firms don't open Microsoft Word or WordPerfect, and amazingly enough, a lot of law firms still run WordPerfect, but a lawyer doesn't sit down to write a pleading by opening the word processor. He opens the document management system, and he says, I want to create for this client, for this matter, this kind of document, and then it creates all the metadata.
Starting point is 00:18:06 And he can't hit save until he enters all the rest of the metadata. And if Microsoft Office in 1995 had had an easy way for us to say, don't let anybody leave without filling out these four fields, life would be much easier today. For these sorts of things, yes, yes, yes. So you mentioned a vertically focused, I go to markets, so I assume you're vertically focused,
Starting point is 00:18:33 and what markets are you going after? Yeah, today the market that we've enjoyed the most traction in has been media, and we're also targeting life sciences and oil and gas also. And as I say, probably slightly more than two. Sorry, go ahead. No surprises there.
Starting point is 00:18:57 No surprises there. Yeah. Yeah, so the way I look at those markets is, you know, one way is to say this is commercial high-performance computing. See folks with really high-performance requirements that are using computers to analyze or create content. But the other side of that, I think, that unifies all of these fields is it's humans and computers working together on analyzing and creation. One thing that we've observed is when it's just an application that is the only data accessor, there you tend to see a lot more adoption of object storage. And also that application is
Starting point is 00:19:40 going to track metadata associated with its stored assets on its own. When there are humans on the other side of it, though, humans are great at creating things, and they're also great at creating waste product along the way that goes unindexed and not understood. And so these verticals have this in common. It tends to be places where there are people that are directing the creation and analysis of large amounts of data. I'll give you a random example. I was talking with a large biotech company recently that commented they had had a researcher visiting from Europe. And he had stayed for about 90 days. And sometime three months after his departure, they were sort of looking at the storage footprint. The storage
Starting point is 00:20:26 footprint was about 30 petabytes total. And they said, you know, something's just not adding up here. It's like someone came along and just sort of inflated the balloon a while ago. And it turns out that this researcher had, during his tenure at the institution, created about a petabyte of data and left it lying around all over this 30 petabyte file store. And this organization had no way of enumerating all of the data that belonged to this person or no way of understanding the fact that it was this person that had created this rapid growth and footprint. And what we see is that when folks do have billions of files, you see this problem over and over and over again, particularly when there are humans involved because the humans are making decisions.
Starting point is 00:21:10 They're creating temporary artifacts. And humans will often work with the belief that the underlying resources are free, storage in particular. Well, then it doesn't come out of my budget. Yeah. I have the same problem with my own desktop. Yeah, yeah. It's free, right? I never got a bill for it. Yeah, well.
Starting point is 00:21:32 So to solve that problem, I would want more than POSIX metadata holds for me. Well, doesn't POSIX indicate the creator of the... There's owner. Owner, yeah. It's quite the same. Yeah. And I kind of want historical.
Starting point is 00:21:56 You know, who wrote this file last? Not to mention the whole security auditing part. Yeah. So, you know, this particular example is a good example of one where just POSIX is sufficient. If you track through time utilization by owner or something, and you just use graph how that's developing over time, you can see this pretty easily just based on POSIX.
Starting point is 00:22:22 However, I completely agree with you that sometimes, you know, when all you're seeing is sort of the most recent owner and the user is different from the owner, then that's not sufficient. And what you actually want to see is information about the individual operations that have happened against the file. And I, you know, I expect that that's where audit comes in. That, you know, audit, usually the expectation is that is a system to preserve individual operations that have occurred against a file. Security in these spaces is obviously a huge concern as well. You know, right back to humans working on data, when it's a single application working on data, you don't normally have enormous security concerns. It's either
Starting point is 00:23:05 secure or it's not, and it tends to be perimeter security. When you need to have a thousand people working together on a piece of sensitive data, things get a lot more complicated, and that's where audit comes into play, as you say. So you have audit in the system today, or is that something that's one of these future... It's a future thing. We don't have it in the system today. And you mentioned security. Do you encrypt the data in the file system? Today, we don't have encryption in the file system. You're not even logging audit data right now. No, we're not logging audit data right now.
Starting point is 00:23:38 Because, well, that's too bad. Yeah. I think that, you know... Not that I want logs, because that's the real problem with auditing, you know. Even as humble a file server as Windows has had all of that auditing available for decades, but as soon as you turn it on,
Starting point is 00:24:03 you end up with megabytes of cryptic log files. And then you need to do. Splunk or whatever. Or you need the big data of your metadata. A couple of points about this. We as a company, Cumulo, try to be very transparent about what we do and what we don't do at any given point in time. We operate in a space where mature products have hundreds of years of investment in them, and it's pretty much inescapable. full-featured, scalable storage product that handles file protocols today without an enormous investment of human capital, basically. And so prioritization is a constant part of our existence.
Starting point is 00:24:54 So one thing that's kind of notable and interesting about the way we deliver our technology is in response to this particular problem. So we actually ship new versions of our storage software every two weeks. What? Yeah, every two weeks. Wait, wait, no. It used to take us longer than two weeks just to validate a new version of a system of storage. Right. So, yeah, so I'll tell you about how we do that. Almost every one of our customers updates to almost every build that we put out,
Starting point is 00:25:29 every two-weekly build. So that sounds crazy, right? It sounds like, well, that's just a recipe for total mayhem. So it's actually not. And it helps to understand how traditional storage systems have been qualified and certified. So the traditional way and the way that I've been exposed to in the past is that software developers write software. And occasionally we'll write a test here and there.
Starting point is 00:25:55 And then the whole bag gets thrown over to a QA organization whose job it is to verify that it works, for some definition of verify that it works. And so what you have is a huge disconnect between the people who are producing the artifact and the people who are certifying it, right? And the people who certify end up doing really more tests than anything else. So we looked at that and looked at what people were doing in the SaaS world with rolling out software continuously and kind of looked at how are people accomplishing that without periodically having their website fall apart? And the answer for us ended up being, we have a team of software engineers
Starting point is 00:26:37 that write almost all of the test coverage for their own code. So we have a really maniacal focus on testing every single aspect of a piece of code as it goes into the code base. So we do certification. For us, it takes nine or ten days. But we don't expect any significant issues to show up in that process,
Starting point is 00:27:00 and rarely do we see any significant issues show up. And the reason is that by the time, well, let's say 12 hours after a developer checks a piece of code into our system, it's already undergone a million tests or thereabouts, and many tens of thousands of unique tests of different facets of our software. So we're building things in a new way, and it's possible that other storage companies have done this in the past, and I think it's probable that most have never done things this way, and it actually works out really well. So we measure the number of days it takes for us to respond to a customer request for a new feature or a piece of, let's say, relatively minor functionality, and we measure that today as about 10 days, 10 business days from request to delivery. So we turn things around really, really fast.
Starting point is 00:27:47 And we do that having a pristine track record of no outage and no data loss across now 18 months of having systems in deployment. That's amazing. It's almost agile development applied to storage. It's absolutely agile development applied to storage. It's wonderful and scary all at the same time. Well, you know, I have to be honest with you. Remember, we're not gray beards for nothing. And if I could grow facial hair, I think it would be gray.
Starting point is 00:28:23 So I'm with you. You know, I too was skeptical about this. Honestly, Neil's my co-founder. He's VP of engineering. He said, no, Pete, look, we've got to do it this way because this is how things are going to be done. And these are all the great things. And I was honestly quite skeptical, but it's worked out really well. Yeah, I would say, you know, in the old days a six month validation activity was
Starting point is 00:28:45 relatively good. Nine to ten days seems... The world is moving faster than it used to, Ray. Wow, yeah, I understand that. Although, this method and turning
Starting point is 00:29:01 around new ASICs in ten days would become a challenge, to say the least. Yeah, I don't know if you could take all of this and apply it to hardware. It's worked pretty well in the software domain. But that all fits in with trying to keep anything proprietary out of hardware. We build systems that are, as I say, there's no custom MV RAM. There's no custom fabric. It's Ethernet and it's SSD, and that lets us move really fast on software. So what's the go-to-market model, software only?
Starting point is 00:29:35 Yeah, so even though we build a technology where pretty much all of the IP is in software, most of the systems that we sell are with bundled hardware. So we have two lines of appliances, the QC24 and the QC208. The number just signifies the amount of raw capacity in a system. And almost everyone buys those appliances. So one of the things that we observed along the way, talking to lots of folks that buy data storage, is that there is huge demand for software-only storage at the very, very high end of the market. When you talk to the largest investment banks, when you talk to hyperscale internet companies, people want software, but everyone else needs to buy appliances because no one can afford to qualify a piece of hardware against our software because it's expensive. It takes a long
Starting point is 00:30:25 time to make sure this piece of hardware isn't going to lie and destroy your data along the way. Yeah, there's another pocket of demand all the way at the bottom. It's software only lives at both ends. Yeah, it's the extreme side of things. I agree. We see it at the bottom as well. But in the middle, people want to buy... The funny thing is, frequently people want to do POCs on the hardware that's already in their data center, but then want to
Starting point is 00:30:54 go into production on your appliances. Yeah. So we do a lot of proofs of concepts and evaluations using virtual machines. It's really handy to have something software- only because people get to look at it in virtual machine context. But when things actually roll into production, they're mainly on appliances. We do have one customer today that is software only.
Starting point is 00:31:18 So we're quite willing to do it for the right opportunity. But you over and over again run into this problem of how are you going to qualify that this piece of hardware isn't going to let you down. And you need to have enough scale to make that worth doing. Yeah. Yeah, yeah. Makes sense. So speaking of performance, do you guys have any benchmark types of numbers for your systems? I think we publish benchmarks out to all of our customers every two weeks. We don't presently talk about performance numbers publicly. Okay. But the architecture of the system sounds like it's set up to provide good performance with both large and small IOs.
Starting point is 00:32:07 Yeah, that's right. The system is designed to deliver good small IO performance. Our scale-out block store is based on a 4K block size, for example. And simultaneously, one of the expectations people have of a scale-out file offering is that it delivers very high throughput, and we deliver high throughput as well. Right. Well, I mean, especially in some of the markets you're in, like media, where I'm going to suck a multi-megabyte file up and render it for a while
Starting point is 00:32:39 and then push it back down. Yep, that's right. So how's the data laid out on the back end? So I mean each node controls a certain hash space for the
Starting point is 00:32:56 scale out file system or is it This is the point where I'm going to start practicing the law without a license I think. I'll talk a little bit about it. And then Neil or Aaron will run into my office at the end of this and say, you got that totally wrong, Pete. So basically the file system, the block store in the system is the segmented distributed block store where five gigabyte contiguous address spaces are laid out between individual nodes. So the file system itself says, okay, these next blocks are going to go into,
Starting point is 00:33:33 we call these things P-stores, these five gigabyte contiguous regions. So it'll say, I'm going to put this into P-store 11, and I'm going to write it offset 102 or something like that. So the file system knows about that segmented P store space, but it doesn't know anything about how those P stores are actually laid out and distributed between individual nodes in the system. So it's not a pure linear address space the file system sees. It's segmented, but it nets out to be the same thing.
Starting point is 00:34:01 And the protection level is like RAID 6 kinds of things, or is RAID 1? So yeah, today, each of the P-Stores is a mirrored P-Store. So it's a two-way mirrored P-Store. So a couple of notes about why we do this. We're actually working on a ratio coding right now, but I can't tell you yet when it will come out. It will be in the not very distant future. So today, we do 2x mirroring. So, you know, the key to reliability in a mirrored system is having incredibly fast rebuild times. They have to be very, very fast. And one of the reasons why we did all protection at the block layer instead of the file layer is it's the only way that you can make hard disks operate entirely sequentially while you're rebuilding data.
Starting point is 00:34:47 So in CumuloCore, as the system re-protects after a drive failure, all drives in the entire system participate in reading data that was left exposed by the failure of that component, and they all operate entirely sequentially. And what that means is on a minimum-sized QC208, our 208 terabytes per node system, it takes about, as I understand it, about an hour and 20 minutes to re-protect away from a failed 8-terabyte drive.
Starting point is 00:35:24 So about an hour and 20 minutes for failed eight terabyte drive. And then that amount of time halves every time you double the system size. So an eight node system should be about 40 minutes and a 16 node system should be about 20 minutes. And that scales down. And that's regardless of whether you have small files or large files or transactional access patent or a sequential access patent. But a node failure that takes multiple drives offline is also going to take linearly longer. Yeah, so node failure, we deal with those entirely through just node replacement.
Starting point is 00:35:59 So we just bring in another chassis, pull out the drives, and put them in the new chassis. We don't have problems with having a custom NVRAM or something. It's just SSDs and disks. So if you put them in a different chassis, then you're back to normal. Right. So you really don't have to rebuild. You just kind of do a mind meld between the old down node and the new node that you just brought in. You just move the disks and the SSDs,
Starting point is 00:36:26 and then you're up and running, I guess. Yep. And you provide a spares kit that a customer can buy to be able to do that? Yeah, that's right. So, well, we have, yeah, so we'll always sell it. I don't want to wait the four hours for something I can do myself.
Starting point is 00:36:42 Oh, right. It's really one of the things that annoys me most is when I walk into data centers and they go, yeah, and we have 172 of these. And when something goes wrong, we call and they bring us one to replace it. Right, right. That's like, you have 172?
Starting point is 00:36:57 Why don't you have 174? So, you know, the other side of that is you don't necessarily want, you know, the other side of that is you don't necessarily want, you know, we could deliver extra chassis to everyone with a bunch of blank drives. But you'd have to spare out enough capacity that that's a lot of capacity that's just sort of sitting still for a redundant system. So the way we deal with it is if a node fails, the system stays up and we, you know, we bring in another chassis and then just transfer the disks over, and then you're done. Okay. So I need to hit the performance thing one more time, Peter. Yeah, I've been following spec SFS since I was a wee child.
Starting point is 00:37:40 And I'm looking at it. Every once in a while, we see a scale-out system out there, and it performs real well, does lots of high numbers of NFS ops or CIFS ops and things of that nature. But the recent version of SpecSFS 2014, there has been zero non-SpecSFS submissions. And I was wondering if you're aware of what the problem is with respect to that and number one
Starting point is 00:38:07 and number two if you are planning to release a spec SFS submission. Yeah, so we will. We're looking forward to doing so. We are kind of in an interesting gap
Starting point is 00:38:21 again on spec. I suppose you wouldn't know this but I was involved in doing a lot of the spec work at Isilon a long time ago. I was one of the people working on performance of that system. And I don't know what's up with the new one. I know that it's application-centric. I don't know if there are compatibility issues for people running presumably extracted parts of application backends against storage. I wish I knew the answer to the question, but I don't know if there are compatibility issues for people running presumably extracted parts of application backends against storage. I wish I knew the answer to the question, but I don't know why people aren't publishing right now. In my last performance analysis, I kind of went into this. I think it's not that
Starting point is 00:38:55 bad as I suspect. I mean, when SpecSFS 2000, oh, I don't know, the prior one, S97R1 or something like that changed over to the next 2008. There was probably a quarter delay for a couple of items, but it wasn't really a big change. This one is a sizable change. They went to applications. They went to an operating system file stack.
Starting point is 00:39:19 It's a major change from my perspective. It seems the limited amount I've read about it makes it seem like a huge change. As you know, the problem with all benchmarks is that vendors have a really great incentive to go game them. Any way they can figure out how to do that, you know, by making replicas of, you know, whole data sets or figuring out what, you know, what particular pattern of bytes gets written into a file and a benchmark or whatever silly stuff. Peter, when I was at PC Magazine, one manufacturer of video chips customized them to run our benchmarks faster. Video chips? Yeah.
Starting point is 00:40:01 This VGA chip would identify the data pattern that we wrote as part of our benchmark and would read the screen out of a buffer rather than render it. I don't want to hear this, Howard. So, you know, I've been on the other side of that and I've seen sort of all the incentives that creates to optimize towards a particular benchmark. I actually really like spec as a benchmark because traditionally it hits a lot of different operation types and is focused on stuff that matters. Even though everyone reports IOPS on spec, it does report latency, and latency really does matter. So I like all that about that lineage, but at the same time, we've, you know, we've only cursorily run spec in the life of our company and we primarily focus our efforts and performance to, you know, are our customers satisfied with the performance that they get and optimizing only for what they're trying to do because that seems like a healthier short-term direction for the company.
Starting point is 00:41:06 I think longer term, you can expect us to say, oh, yeah, but here's the deal. As you move from three verticals to 20, the more general case applications become more significant. There is a video version of the SpecSFS. So one of the four applications is a video, almost a media server kind of thing.
Starting point is 00:41:25 So you might take a look at that. All right. You know, we've kind of hit the 40-minute mark here. I got one more thing I want to talk about, though, Ray. All right, Howard, go for it. Because the problem comes now, I've got this ginormous scale-out file system. How do I back it up? Yeah, how do you back it up?
Starting point is 00:41:50 Because we've already talked about, well, and walking the file system to find out who's got the archive bit set isn't practical when I've got millions and millions of files. And your index will speed that up, but only to the point where NetBackup can handle it. Well, yeah, so there are a few things that go wrong with backup. The first is identifying rapidly all of the files that have changed in a certain time period. We solved that problem. I mean, you can get an answer to that question that takes an amount of time in proportion to the number of changed files.
Starting point is 00:42:27 And you don't have to return the whole thing at once. We can even stream that information back, right? So I actually provide a stream of things that have changed. So we feel like the technology we've built is going to solve that particular problem. And the second problem, a lot of backup systems... But you have to do that in a way that doesn't require me to rewrite my backup app. Right. Yes. My next point was about backup applications. So, you know, so backup applications need to understand that. And then second, they need to be able to parallelize very large data sets. If CumuloCore comes back and says, oh, this is the first time you're backing this up. So there are 1.6 billion files in here. Most backup applications are going to choke, especially if they're running
Starting point is 00:43:12 single threaded. Now, if they have the ability to paralyze their operations, the situation looks better. But usually they're going to either need to look at all data in the system to figure out how to parallelize, or they need some help in some way. So the other thing that we're doing through our analytics is allowing very rapid partitioning of very large trees. You say, break down this tree of 1.7 billion files and 3 petabytes of data into 100 equal-sized chunks, right? And because of the hierarchical database functionality of our system, we can do that near instantaneously,
Starting point is 00:43:51 actually carve a huge file system into many equal-sized chunks ready for parallelization. Are you actually changing the directory structure, or are you doing this virtualized? No, we actually can just respond with a list of inclusions and exclusions that define that slice of a file system tree. So if you imagine how you describe to Async inclusions and exclusions, you can imagine that Cumulo can say these are the inclusions and exclusions that define this particular slice of this tree. So that's what we're doing on that front is we make it so that we can actually help the application parallelize. Now, as I said, we're a very transparent company. Our integrations with backup software right now, they're not yet, but we're actively working on integrating with applications
Starting point is 00:44:39 to make it so that this thing is the first very large file system that is actually easy to backup. Okay. You mentioned parallelization. Do you support like NFS v4? No. So that's a whole separate topic in terms of parallel access through clients. No, we don't do NFS v4. We will do as demand dictates. I think we're still kind of in chicken-the-egg territory on NFS before, or PNFS particularly, in that client support is still a bit problematic and client compatibility is still problematic,
Starting point is 00:45:14 and then vendor support is also patchy, and we're part of that part of the problem. But it's tough to prioritize that when there's only so much actual demand. Yeah, I guess the other question on version levels with the SMB3, are you guys up to that level yet? Yeah, we do SMB2.1 right now, and again, it's demand-based. I think the big driver for SMB3 will be completely transparent failover semantics when nodes come and go in your scale-out storage system. That's going to be one of the earliest drivers for us. Rather than the performance gains, which not as many people ask for,
Starting point is 00:45:57 it'll be completely undetectable failure. You know, no dialogue box failure. Yeah. I guess I have one further question. Snapshot support? Not today. Not today. High on the priority list.
Starting point is 00:46:13 I understand. No, I think I've got it. It's, you know, like a lot of the scale-out systems, you know, they're hitting that set of verticals where the requirements are pretty well definable. I like the idea. I like the analytics. And, of course, I always want more. Yeah, don't we all?
Starting point is 00:46:39 We wouldn't be gray birds if we didn't. No. So, Peter, is there anything else you'd like to say? You're with me on that too i always want more i'm never never satisfied um no i mean uh you know to your point about wanting more stay in touch um these are uh these are vectors not points you know so things change fast with us so i'll look forward to uh keeping you and hopefully your audience up to date all right good well this has been great it's's been a pleasure to have Peter with us here on our podcast.
Starting point is 00:47:09 Next month, we'll talk to another startup storage technology person. Any questions you want to ask, please let us know. That's it for now. Bye, Howard. Bye, Ray. Until next time. Thanks again, Peter. Thank you so much, Howard and Ray.
Starting point is 00:47:20 All right. Thank you. Until next time.
