Grey Beards on Systems - 098: GreyBeards talk data protection & visualization for massive unstructured data repositories with Christian Smith, VP Product at Igneous
Episode Date: March 24, 2020. Sponsored By: Igneous. Even before COVID-19 there was a lot of file data being created and mined, but with the advent of the pandemic, this has accelerated considerably. As such, it seemed an appropriate time to talk with Christian Smith, VP of Product at Igneous (@IgneousIO), a company that targets the protection and visibility of massive unstructured data repositories.
Transcript
Hey everybody, Ray Lucchesi here with Matt Leib.
Welcome to another sponsored episode of the GreyBeards on Storage podcast.
This GreyBeards on Storage podcast is brought to you today by Igneous and was recorded on March 18, 2020.
We have with us here today Christian Smith, VP of Product at Igneous.
So, Christian, why don't you tell us a little bit about yourself, your company, and your SaaS data management solutions for unstructured data?
Sure. Hey, Matt. Hey, Ray.
Kind of crazy times today.
I think we're all huddled down.
I'm in the office here, and I think I'm the only one here.
There are fewer people here than there are at my home right now. So I hope everybody's safe at home, and
we'll get through this. But talking about Igneous: my name is Christian Smith, product manager
here. I've been at Igneous since we started. I've been at places like NetApp and Isilon and SGI
in my past. Really, who we are is a bunch of file system guys.
Our team, our founders, they wrote WAFL at NetApp.
They wrote OneFS at Isilon.
And when we got to this point where we were getting ready to think about starting a company,
it was really about what does the next generation of data management look like?
We didn't want to write another file system. We didn't want to write another NAS device.
We really wanted to say, you know, data is growing, unstructured data is growing,
machine generated data is growing. It's across segments like, you know, life sciences and high
tech manufacturing and finance and insurance. It's all the places where just
the machines are generating more data than people are able to generate on their own.
And as you start looking at these people, they had a couple of gigabytes 10 years ago,
and now they're looking at petabytes. And in the distant future, it won't be uncommon for them to
have hundreds of petabytes in their environment.
Oh my God.
Yeah. So this presents some challenges, right? How do you do just simple things like how do
you move data? How do you protect data at this scale? Where does the cloud fit in in this?
And you think that hundreds of petabytes will be on-prem?
I don't think it'll be on-prem. I think it'll be a hybrid environment more than it'll be
on-prem. So I think in the world we're looking at where there's just so many services that are
available in the cloud to process data or leverage cloud capabilities that it's going to be, you know,
data is going to get generated on-prem. You can't move a sequencer into the cloud. You can't move...
well, not yet. Yeah, maybe Amazon will
go, you know, buy those and you can send your samples in. But there's a lot of these places
where the machines generate the data are just, they're there on premises, they're in labs,
they're next to scientists, they're next to researchers. And so there's going to be some
hybrid environment that exists in these environments overall. And really, it's like,
how do you protect it? How do you move it? How do you even see what you have? Like just thinking about
visibility of your unstructured data: any kind of common question, like how much data do you have
and how many files do you have, usually gets an answer like, well, we bought this much and it's 80% full.
So I guess we have this much data. Well, what's your file count? I have no idea.
And so we really, you know, came and approached this as those common capabilities of, you know,
see, protect, and move. And then, how do you do this in a way that, you know, reduces friction and
is built for the scale of hundreds of petabytes, even though you're starting out at a terabytes type of scale? And so that's really what Igneous was about. And
we had to go through and kind of re-architect how you deal with this data. Can you really protect
hundreds of petabytes of data? It seems like the file scan would take like, you know, a couple of
months or something like that, wouldn't it? You know, you can protect hundreds of petabytes of
data. And that goes back to kind of the architecture piece of this, which is that in today's modern architectures,
everything has some flash in it, everything has lots of spindles behind it, and so moving data is
not the challenge. You're correct in saying the hard part is actually going and determining
what's changed. And that's
where we had to go back and, you know, kind of rewrite the way you talk to these devices and do
it in a, you know, a scalable, efficient, multi-threaded way that still doesn't disrupt,
you know, the applications that are running. And so, you know, we have deployments that have
customers have, you know, 40 to a hundred billion files in it. And we're-
Billion?
Billion, as in B.
And we're going through those every single night
looking for change rate and comparing that
and finding all the files that changed
and moving them over to a secondary tier of storage
in their environment.
And so you can do it,
but you have to really kind of understand file systems,
understand data,
understand how you talk to these systems and do it at the scale of the enterprise.
And these aren't your NAS boxes. These are somebody else's, right?
These are, yeah. This is like Isilon. This is NetApp. This is Qumulo. This is Pure FlashBlade.
This is Lustre, Gluster, GPFS. I mean, it ranges depending on which segment you're intersecting.
Like is it life sciences?
It tends to be Isilon.
If it's chip design, it tends to be NetApp.
And you go into the physics guys and they have like huge Lustre or GPFS deployments.
But you're not actually managing that storage.
You're managing the data that resides on that storage.
Correct.
That's a fine line there, Matt.
Well, so, I mean, the truth is a NetApp is going to be managed by NetApp. You know, a Pure FlashBlade is going to be managed by Purity.
But amongst all of that, you have sort of a layer, not that it adds any overhead to it, but it's actually digging into the data that sits
on there. My question actually is more along the lines of, does this leverage the metadata tables
that are inherent in the existing architectures and replicating those? Or how does that work? Yeah, so a lot of these architectures
have metadata that is based in Flash today. And so those kind of common ways that you would go
scan through the file systems mean you can do it much faster. That said, there's a lot of
architectures that don't have it on Flash today, and it's still sitting on spinning disk.
And so you have to have a model that can work in both ways.
And the way that we really do that is, you know, our client, we wrote a proprietary client.
It talks directly to NFS or SMB/CIFS or object.
And the way that it works is it goes, opens a bunch of connections kind of across the
NAS device. And then in that there are threads and those threads have a proprietary way that
they're crawling through the file systems looking for change. And the kind of way we think about
this is as you're crawling through looking for change, you're measuring latency as you go through
so that you don't disrupt customers'
applications. And so it's like a go big, go wide. And then be smart about how you go big and wide,
both in terms of you're not hard partitioning the file system, you're dynamically allocating
threads to scan through the file system when you need it. And then as threads finish, you just
reallocate them to new places. And so a lot of the rsyncs or robocopies fail because you go
create these static bindings in the file system. And one of those threads might get done, but the
other one's now crawling through something that has 10 billion more files in it or 100 million
more files in it. And now you're kind of stuck, like you're
just waiting for that thing to finish, that last kind of crawl to happen, which could take a lot
longer. And so as we're doing this kind of distributed crawl through the system and determining
this rate of change, we're also seeing what does the latency look like on the system? And so we'll
scan up and scan down the number of threads we're using so that applications run. A level zero on a petabyte of data could take a day.
A level zero on 100 petabytes of data could take two weeks. And so that's why you got to have this
capability to scale up and scale down as you go.
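As a rough illustration of the latency-aware crawl Christian describes, here is a minimal Python sketch: the directory queue is shared rather than statically partitioned, workers hand newly found directories back to the pool, and a controller scales the number of active workers up or down based on the latency the NAS is reporting back. All names, thresholds, and the mount point are hypothetical assumptions for illustration, not Igneous code.

```python
import os
import queue
import threading
import time
from collections import deque
from statistics import mean

LATENCY_BUDGET_MS = 20           # assumed per-call latency target; tune for the NAS in question
MAX_WORKERS = 16

dir_queue = queue.Queue()        # shared work queue: no static partitioning of the file tree
latencies = deque(maxlen=200)    # recent per-call latencies observed against the NAS
allowed = MAX_WORKERS            # how many workers may run concurrently right now
lock = threading.Lock()

def crawl_worker(worker_id: int) -> None:
    while True:
        # Back off while the controller has scaled concurrency below our index.
        while worker_id >= allowed and not dir_queue.empty():
            time.sleep(0.1)
        try:
            path = dir_queue.get(timeout=2)
        except queue.Empty:
            return                          # nothing left to crawl (sketch-level termination)
        start = time.time()
        try:
            with os.scandir(path) as it:
                entries = list(it)
        except OSError:
            continue                        # skip unreadable directories
        with lock:
            latencies.append((time.time() - start) * 1000)
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                dir_queue.put(entry.path)   # newly found directories go back on the shared queue
            # else: compare entry.stat() mtime/size to the previous catalog to flag changes

def controller() -> None:
    """Scan the worker count up or down based on the latency the NAS is showing us."""
    global allowed
    while True:
        time.sleep(1)
        with lock:
            if not latencies:
                continue
            avg = mean(latencies)
            if avg > LATENCY_BUDGET_MS and allowed > 1:
                allowed -= 1                # NAS is getting busy: throttle the crawl
            elif avg < LATENCY_BUDGET_MS / 2 and allowed < MAX_WORKERS:
                allowed += 1                # headroom available: go wider

if __name__ == "__main__":
    dir_queue.put("/mnt/nas_export")        # hypothetical NFS mount to crawl
    threading.Thread(target=controller, daemon=True).start()
    workers = [threading.Thread(target=crawl_worker, args=(i,)) for i in range(MAX_WORKERS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

The real product presumably does far more (snapshot diffing, per-array tuning, resumable state), but the shape of the loop is the point: one shared queue, a dynamic worker count, and latency feedback instead of hard partitions.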
So once you understand, let's say, the changed files across multiple NAS boxes and all that stuff, the movement has to be
somewhat challenging as well, right?
Yeah, because you end up in this world where everybody that's done rsync before knows it
works great for big files.
And then all of a sudden you hit a patch of small files and it grinds to a halt.
And so the kind of the architecture piece is what we call adaptive
scanners, which is how we scan through things. And we talk about tasks that can scan at 400,000
files per second. IntelliMove is our move engine. It's the stuff that as it's going through and it
finds data, it's handing it off into different thread pools. And so one of those thread pools
says, if you encounter small files, you want to kind of aggregate them together and move them as
a much bigger chunk. And as you get to big files, you want to break them up into multiple threads
and push them as multiple threads. And the whole intent there is that you just have a lot of data
in flight. You're really keeping networks busy. You're really just pushing this kind of pipeline effect of stuff that you're finding, getting it up into memory, and pushing it out as fast as possible on the network.
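To make the small-file versus big-file handling concrete, here is a hedged Python sketch of that dispatch idea: small files get aggregated into larger bundles before upload, while big files get split into byte ranges that move on separate threads. The thresholds, function names (put_batch, put_range), and paths are made up for illustration; this is not Igneous's IntelliMove API.

```python
from concurrent.futures import ThreadPoolExecutor

SMALL_FILE_LIMIT = 1 << 20    # 1 MiB: below this, files are bundled together (assumed threshold)
BATCH_TARGET = 64 << 20       # aim for roughly 64 MiB aggregated bundles
CHUNK_SIZE = 256 << 20        # large files move as 256 MiB ranges, each on its own thread

pool = ThreadPoolExecutor(max_workers=8)

def put_batch(paths):
    # Placeholder: stream a bundle of small files to the target as one object.
    print(f"uploading bundle of {len(paths)} small files")

def put_range(path, offset, length):
    # Placeholder: upload one byte range of a large file.
    print(f"uploading {path} bytes {offset}..{offset + length}")

def dispatch(changed_files):
    """changed_files: iterable of (path, size_in_bytes) pairs produced by the change scan."""
    batch, batch_bytes = [], 0
    for path, size in changed_files:
        if size < SMALL_FILE_LIMIT:
            batch.append(path)
            batch_bytes += size
            if batch_bytes >= BATCH_TARGET:        # flush an aggregated chunk of small files
                pool.submit(put_batch, batch)
                batch, batch_bytes = [], 0
        else:
            # break a big file into ranges and push each range as its own task
            for offset in range(0, size, CHUNK_SIZE):
                pool.submit(put_range, path, offset, min(CHUNK_SIZE, size - offset))
    if batch:
        pool.submit(put_batch, batch)

if __name__ == "__main__":
    dispatch([("/mnt/nas/a.txt", 4_096),
              ("/mnt/nas/b.log", 12_000),
              ("/mnt/nas/scan.mov", 900 << 20)])
    pool.shutdown(wait=True)
```

Either way the goal is the same: keep a lot of data in flight so the network, not the file size mix, is the bottleneck.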
And the targets for the data protection would be what, per se?
And this has been a lot of fun, because essentially when
we started out, the targets for this data were, you know, kind of Supermicro servers or appliances that we deployed. Two years
ago, it started to become cloud. And so everybody had a cloud component, and we co-launched
with Microsoft and their Azure Archive blob storage, which all of a sudden had this price point of $12 per terabyte per year, which
then turned into Amazon having their Glacier Deep Archive.
And so now cloud is a much more of a target for these backup operations.
And people have direct connect to these cloud providers now.
They're much more common.
There's a lot more points of presence out there to target those.
And a couple of things, though, you've got to think about if you're using the cloud, which we strongly encourage everybody to go do.
It basically outsources all of your tape or your secondary systems or your data centers, and makes it really easy to manage and deal with.
But you have to think about how am I going to move data in effectively and efficiently,
which means avoiding transaction costs. Like I can't take a billion files and put a billion files in the cloud. Your transaction costs are going to be like 50 grand because they do charge
you just to move things. Even though it says data in is free, you know, every thousand PUTs is a nickel on Amazon Deep Archive, right?
So it becomes pretty expensive fast.
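The arithmetic behind that "50 grand" figure is worth seeing. A quick back-of-the-envelope in Python, using the roughly-a-nickel-per-thousand-PUTs rate quoted here (check current provider pricing before relying on it):

```python
files = 1_000_000_000        # one billion individual files, one PUT request each
price_per_1k_puts = 0.05     # ~a nickel per thousand PUT requests, as quoted in the episode

put_cost = files / 1_000 * price_per_1k_puts
print(f"${put_cost:,.0f}")   # -> $50,000 in request charges before a single byte is billed for storage

# Aggregating small files into, say, 100 MiB bundles before upload cuts the request
# count, and therefore this cost, by several orders of magnitude.
```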
And then conversely, you got to think about how do I expire data?
And so when I expire data, I can't just rehydrate everything and expire data and then compact
it and rewrite it.
I have to do it in a way that's like economically efficient. And so I have to detach
expiration of data, which is a business operation from the actual cleanup of data, which is an
economic operation. And so we've done some- Is it all done by policy though?
It is done by policy. So the policy is the business end of it, which says keep this for 90 days.
And so once you expire that, that data is no longer recoverable,
but you don't want to immediately go clean that up on the cloud side,
because, with it sitting in Glacier or Azure Archive,
that could be a very expensive event to go clean up immediately.
And so what you want to do is kind of build up these
expirations over time.
And when it now becomes cheaper to rehydrate a chunk of data, pull out the expired data
and rewrite it, is when you want to go do that operation versus keeping it stored there
in perpetuity so that you're not growing unbounded.
It's deferred garbage collection per se when you actually need to do it and stuff like
that.
That's exactly it.
And it's the first time garbage collection has ever existed
with like a true dollar cost behind it.
Right.
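One way to picture that detached, dollar-aware garbage collection is a simple break-even test: expiration is recorded immediately as a policy event, but a chunk only gets rehydrated and rewritten once the storage being wasted on expired bytes outweighs the cost of pulling the chunk back. The prices and function below are illustrative assumptions, not Igneous's actual cost model.

```python
def should_compact(chunk_bytes: int,
                   dead_fraction: float,
                   retrieval_cost_per_gb: float = 0.0025,     # assumed archive-tier bulk retrieval price
                   storage_cost_per_gb_month: float = 0.001,  # assumed archive-tier storage price
                   horizon_months: int = 12) -> bool:
    """Expiration already happened as a business/policy event; this is the separate
    economic decision: is it now cheaper to rehydrate the chunk, drop the expired
    records, and rewrite it than to keep paying to store the dead bytes?
    (Ignores re-upload request costs for simplicity.)"""
    gib = chunk_bytes / 2**30
    rewrite_cost = gib * retrieval_cost_per_gb                             # pull the chunk back and rewrite it
    savings = gib * dead_fraction * storage_cost_per_gb_month * horizon_months
    return savings > rewrite_cost

# A 100 GiB chunk that is only 10% expired stays put; one that is 60% expired is worth compacting.
print(should_compact(100 * 2**30, 0.10))   # False
print(should_compact(100 * 2**30, 0.60))   # True
```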
So, I mean, how does this thing get deployed?
I mean, if we're talking hundreds of petabytes
and billions of files across multiple NAS solutions,
I mean, it's got to be a nightmare.
Yeah, so it is, to deal with in the normal world. With Igneous, look, we've dealt with scale-out for a long time. We approach this as a scale-out architecture. And so regardless of whether you have a single site that's really large and you need to deploy multiple VMs, or multiple sites, we deploy through VMs. Those VMs are in kind of the cloud-native
format, which is just a collection of containers that are in those VMs doing some work.
So we deploy, you know, this VM that sits on premises. It's talking to our SaaS offering,
our cloud offering. So that VM is really just the stateless thing that lives in their environment
that's doing the scanning and the movement. All the kind of orchestration and all the configuration is really through the SaaS portal
of this. And then you can continue to deploy more and more VMs based on the size of a given data
center environment, or you could deploy more and more VMs across multiple sites, but you're still
just managing it as a single pane of glass, no matter how you distribute those VMs out.
So, I'm sorry, this VM, is it any particular flavor?
Can you choose, I want to run it as a Hyper-V device or a VMware device?
Our default is OVAs.
That seems to be the most prevalent, but we've deployed in Hyper-V.
We've deployed in KVM environments before.
So it's pretty flexible. It's a model where what we deploy is really lightweight. The total VM size is 100
gigs... we don't really, sorry, it's like 100 megs. And then when we deploy, the footprint of that VM
is pretty light. It's like, you know, eight cores, 32 gigs of RAM
in that VM. And then the total footprint of that VM, once it's all loaded up and deployed, is
about 100 gigs worth of space. So pretty easy, pretty lightweight to go deploy.
And that's a Linux based OS?
It's a Linux based OS, yes.
So who in this world's got hundreds of petabytes of data with billions of files these days? Surprisingly, name all your big life sciences organizations, name all your big
M&E shops. Media guys? Name all your big financial institutions. It exists everywhere. So
we just published a case study with a company called
Quantum Spatial. And so they fly airplanes, take pictures of the ground, and then they process that
data. And then they, from there, produce results that they provide to their clients. And it could
be everything from LIDAR to radar. It could be aerial imagery. It could be over crops, it could be over forests. There's
all sorts of use cases why you go do this picture taking. So that comes off a plane,
it gets loaded onto NAS devices, it gets processed. From there, then they produce the results to their
customers. And then there's this backup and archive workflow. So they want to go either
protect that data, so they have a good copy
to fall back to if they want to go back and reprocess things. And so there's a finite period
of backup. And then once they're really done with the project, they want to go archive it. And so
they'll go archive it off into Azure. And that sits there according to their contract regulations.
And their workflow is pretty fluid because they're always like, you think of archive as kind of a one-way thing. We're just going to go archive this data once,
and that's going to sit there for a long time. But they actually have really more of an active
archive environment. Like they're putting the data into archive and then pulling it back,
you know, probably a month later to go do some reprocessing against it. And the reason why it's
so active is, they're just trying to keep their
primary storage in check, given how much capacity they've deployed in their limited data center
space. And so they've got to have things that can move that data back and forth as fast as possible
for them. You've talked about data protection. Is there another solution that you have besides that?
Yeah. So one thing just to kind of circle back on: QSI is one, too, where I would call them a good hybrid environment, where they protect to the cloud. That's an archive to the cloud. But they also do a lot of that locally, too. So that kind of 30-to-90-day window actually lands locally. So we're one of the few that's in this category of a SaaS backup provider that gives customers the flexibility: do you want to protect
that data to Isilon or NetApp? Do you want to use that as the target for your data? Do you want to
use an object store on-premises like ECS or StorageGrid or Cloudian or Scality? Or do you
want to use any of the cloud providers? We interoperate with GCP and Azure and AWS and Wasabi. And so I think we're the only
one out there that has that broad range of targets and has a SaaS offering. And that's
Data Protect, right? That's the backup and archive piece of it. The other part that we've
been working on is called Data Discover. And that was something we launched mid last year.
Now, Data Discover is visibility
into all your data. So, you know, it's really hard to understand what you have. So you have to ask
things like, how hot and how cold is your data? And Data Discover is the offering that goes out
and scans all your data where it exists in place and gives you a heat map so that you can start to
make decisions about what to do with your data. Do I want to archive some data? Do I want to back it up? Do I want to go delete some data? Do I want to promote
some data to flash? And so a very unique offering that you finally have like a dashboard to control
what your footprint looks like. Are you charging on a petabyte basis or something? Or how does
this work? Is it because it's a SaaS offering? It's almost like a monthly charge? Lots of flexibility.
We have two modes.
Like one mode is subscription of data under management.
And that's more of your backup and archive.
And you can prepay, you know, by month, year, multi-year.
Then you have our Data Discover.
And that is by VM, actually.
Like since that's a pretty lightweight VM, we look at it and
go, okay, how much data are you scanning and how big is the footprint and how fast do you want
those results? So like a typical VM that we deploy could be scanning six petabytes of data and, you
know, three or four billion files, and it does it in a day and a half or so. As you start getting bigger
than that, you know, let's say you hit the 20, the 30, the 40 petabyte mark, you might want to deploy more VMs
so that you keep that visibility of data in that short of a window so it's always fresh.
So you're getting that high scan rate. But others will say, hey, I have 100 petabytes.
I've got two VMs and, you know, we scan that data once a week and that's good enough for us.
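Those scan-window figures imply a steady per-VM rate you can sanity-check, and they show why adding VMs shrinks the freshness window roughly linearly. A rough estimate in Python, using only the numbers mentioned above:

```python
files = 4_000_000_000                 # ~4 billion files on roughly 6 PB, per the example above
scan_seconds = 1.5 * 86_400           # "a day and a half" for one scanning VM

per_vm_rate = files / scan_seconds    # ≈ 30,900 files scanned per second, sustained
print(f"{per_vm_rate:,.0f} files/sec per VM")

# To refresh the same namespace daily instead, you'd want proportionally more VMs:
target_window = 86_400                # one day, in seconds
vms_needed = files / (per_vm_rate * target_window)
print(f"{vms_needed:.1f} VMs for a one-day window")   # ~1.5, i.e. round up to 2
```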
So you've got two charge models, I should say. One is by capacity under management and the other
one is by VM?
That's correct. That's correct. And we have some bundles out there right now: for 100 terabytes of data
under management plus Data Discover, so Data Protect and Data Discover together,
it's 30K a year to get started, and then costs change as you scale that up.
Since this backup scenario seems really robust, is there a provision in place for endpoint?
Even like desktop stuff?
Yeah. So we've been really focused on the big data right now. We feel like we have a long runway to just doing a portfolio of data
management capabilities that we've just touched the surface of. And we haven't even talked about
search and finding data and all the data you have. We haven't talked about GDPR or PII discovery.
We haven't talked about distributed enterprises
and collaboration.
So I think we have a pretty good runway
for stuff that our customers are certainly asking us to do
that we'll stick to where we're at for now.
And then maybe as that we cover more of that market,
we'll reach out into different segments.
One of the questions we get asked a lot is, my God, I love the service. It's so easy to use. We had it up and
running in a half an hour. Why can't you do this for my VM environment? And we kind of stick true
to our roots and say, you've got to be the best at what you do. And you know, there's NAS
and SAN, and we're on the NAS side. Like, that is the world that we operate in, and we have a lot of
great capabilities. And so we see a long
roadmap for this that will continue to grow as more and more unstructured data is generated.
Hey, this has been great. Matt, any last questions for Christian?
Boy, you know, I find the whole backup space to be very, very interesting, incredibly active,
and full of new approaches. And personally, I'm a huge fan of sort of the whole SaaS approach. I think we could have this conversation for hours and still probably barely scratch the surface of your business use case. But, you know, obviously, we don't have time for that. But I really did enjoy the conversation, and I could see myself digging in much deeper to this product.
Okay. So, Christian, anything else you'd like to say to our listening audience before we close?
One thing that's time-sensitive is that there are a lot of people out there doing research right now in the life sciences space. We certainly want to be part of the broader group that can help out in that. And
using things like tape in this environment becomes a challenge. So we're offering free usage of our
services to go back up data to the cloud for customers that are doing research around COVID-19 at this point. If anybody is in
that camp, please reach out to us. We'll have some broader announcements around this coming up,
but we certainly want to play a part. And this is a time when lives are at risk and
losing data while lives are at risk is a very difficult environment to be in. So please reach
out. That's great. If you want to give me
a link, I can put it on the podcast post and we'll raise awareness from it. Sounds great.
Well, this has been great. Thank you very much, Christian, for being on our show today. And thanks
to Igneous for sponsoring this podcast. Thank you guys. Nice talking to you. Thank you, Christian.
Next time, we'll talk with another system storage technology person. Any questions you want us to
ask, please let us know. If you enjoy our podcast, tell your friends about it.
Please review us on Apple Podcasts, Google Play, and Spotify,
as this will also help get the word out.
That's it for now.
Bye, Matt.
Bye, Ray.
Until next time.
Thanks.