Grey Beards on Systems - 098: GreyBeards talk data protection & visualization for massive unstructured data repositories with Christian Smith, VP Product at Igneous
Episode Date: March 24, 2020. Sponsored By: Igneous. Even before COVID-19 there was a lot of file data being created and mined, but with the advent of the pandemic, this has accelerated considerably. As such, it seemed an appropriate time to talk with Christian Smith, VP of Product at Igneous (@IgneousIO), a company that targets the protection and visibility of massive unstructured data repositories.
Transcript
Hey everybody, Ray Lucchesi here with Matt Leib.
Welcome to another sponsored episode of the GreyBeards on Storage podcast.
This GreyBeards on Storage podcast is brought to you today by Igneous and was recorded on March 18, 2020.
We have with us here today Christian Smith, VP of Product at Igneous.
So, Christian, why don't you tell us a little bit about yourself, your company, and your SaaS data management solutions for unstructured data?
Sure. Hey, Matt. Hey, Ray.
Kind of crazy times today.
I think we're all huddled down.
I'm in the office here, and I think I'm the only one here.
There are fewer people here than there are at my home right now. So I hope everybody's safe at home, and
we'll get through this. But talking about Igneous: my name is Christian Smith, product manager
here. I've been at Igneous since we started. I've been at places like NetApp and Isilon and SGI
in my past. Really, who we are is a bunch of file system guys.
Our team, our founders, they wrote WAFL at NetApp.
They wrote OneFS at Isilon.
And when we got to this point where we were getting ready to think about starting a company,
it was really about what does the next generation of data management look like?
We didn't want to write another file system. We didn't want to write another NAS device.
We really wanted to say, you know, data is growing, unstructured data is growing,
machine generated data is growing. It's across segments like, you know, life sciences and high
tech manufacturing and finance and insurance. It's all the places where just
the machines are generating more data than people are able to generate on their own.
And as you start looking at these people, they had a couple of gigabytes 10 years ago,
and now they're looking at petabytes. And in the distant future, it won't be uncommon for them to
have hundreds of petabytes in their environment.
Oh my God.
Yeah. So this presents some challenges, right? How do you do just simple things like how do
you move data? How do you protect data at this scale? Where does the cloud fit in in this?
And you think that hundreds of petabytes will be on-prem?
I don't think it'll be on-prem. I think it'll be a hybrid environment more than it'll be
on-prem. So I think in the world we're looking at where there's just so many services that are
available in the cloud to process data or leverage cloud capabilities that it's going to be, you know,
data is going to get generated on-prem. You can't move a sequencer into the cloud. You can't move...
well, not yet. Yeah, maybe Amazon will
go, you know, buy those and you can send your samples in. But there's a lot of these places
where the machines generate the data are just, they're there on premises, they're in labs,
they're next to scientists, they're next to researchers. And so there's going to be some
hybrid environment that exists in these environments overall. And really, it's like,
how do you protect it? How do you move it? How do you even see what you have? Like just thinking about
visibility of your unstructured data: any kind of common question, like how much data do you have
and how many files do you have, usually gets an answer like, well, we bought this much and it's 80% full.
So I guess we have this much data. Well, what's your file count? I have no idea.
And so we really, you know, came and approached this as those common capabilities of, you know,
see, protect, and move. And then, how do you do this in a way that, you know, reduces friction and
is built for the scale of hundreds of petabytes, even though you're starting out at a terabytes type of scale? And so that's really what Igneous was about. And
we had to go through and kind of re-architect how you deal with this data. Can you really protect
hundreds of petabytes of data? It seems like the file scan would take like, you know, a couple of
months or something like that, wouldn't it? You know, you can protect hundreds of petabytes of
data. And that goes back to kind of the architecture piece of this, which is that in today's modern architectures,
everything has some flash in it, everything has lots of spindles behind it, and so moving data is
not the challenge. You're correct in saying the hard part is actually going and determining
what's changed. And that's
where we had to go back and, you know, kind of rewrite the way you talk to these devices and do
it in a, you know, a scalable, efficient, multi-threaded way that still doesn't disrupt,
you know, the applications that are running. And so, you know, we have deployments that have
customers have, you know, 40 to a hundred billion files in it. And we're-
Billion?
Billion, as in B.
And we're going through those every single night
looking for change rate and comparing that
and finding all the files that changed
and moving them over to a secondary tier of storage
in their environment.
And so you can do it,
but you have to really kind of understand file systems,
understand data,
understand how you talk to these systems and do it at the scale of the enterprise.
And these aren't your NAS boxes. These are somebody else's, right?
These are, yeah. This is like Isilon. This is NetApp. This is Qumulo. This is Pure FlashBlade.
This is Lustre, Gluster, GPFS. I mean, it ranges depending on which segment you're intersecting.
Like is it life sciences?
It tends to be Isilon.
If it's chip design, it tends to be NetApp.
And you go into the physics guys and they have like huge Lustre or GPFS deployments.
But you're not actually managing that storage.
You're managing the data that resides on that storage.
Correct.
That's a fine line there, Matt.
Well, so, I mean, the truth is a NetApp is going to be managed by NetApp. You know, a Pure FlashBlade is going to be managed by Purity.
But amongst all of that, you have sort of a layer, not that it adds any overhead to it, but it's actually digging into the data that sits
on there. My question actually is more along the lines of, does this leverage the metadata tables
that are inherent in the existing architectures and replicating those? Or how does that work? Yeah, so a lot of these architectures
have metadata that is based in Flash today. And so those kind of common ways that you would go
scan through the file systems mean you can do it much faster. That said, there's a lot of
architectures that don't have it on Flash today, and it's still sitting on spinning disk.
And so you have to have a model that can work in both ways.
And the way that we really do that is, you know, our client, we wrote a proprietary client.
It talks directly to NFS or SMB/CIFS or object.
And the way that it works is it goes, opens a bunch of connections kind of across the
NAS device. And then in that there are threads and those threads have a proprietary way that
they're crawling through the file systems looking for change. And the kind of way we think about
this is as you're crawling through looking for change, you're measuring latency as you go through
so that you don't disrupt customers'
applications. And so it's like a go big, go wide. And then be smart about how you go big and wide,
both in terms of you're not hard partitioning the file system, you're dynamically allocating
threads to scan through the file system when you need it. And then as threads finish, you just
reallocate them to new places. And so a lot of the rsyncs or robocopies fail because you go
create these static bindings in the file system. And one of those threads might get done, but the
other one's now crawling through something that has 10 billion more files in it or 100 million
more files in it. And now you're kind of stuck, like you're
just waiting for that thing to finish, that last kind of crawl to happen, which could take a lot
longer. And so as we're doing this kind of distributed crawl through the system and determining
this rate of change, we're also seeing what does the latency look like on the system? And so we'll
scan up and scan down the number of threads we're using so that applications run. A level zero on a petabyte of data could take a day.
A level zero on 100 petabytes of data could take two weeks. And so that's why you got to have this
capability to scale up and scale down as you go.
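As a rough illustration of the latency-aware crawl Christian describes, here is a minimal Python sketch: the directory queue is shared rather than statically partitioned, workers hand newly found directories back to the pool, and a controller scales the number of active workers up or down based on the latency the NAS is reporting back. All names, thresholds, and the mount point are hypothetical assumptions for illustration, not Igneous code.

```python
import os
import queue
import threading
import time
from collections import deque
from statistics import mean

LATENCY_BUDGET_MS = 20           # assumed per-call latency target; tune for the NAS in question
MAX_WORKERS = 16

dir_queue = queue.Queue()        # shared work queue: no static partitioning of the file tree
latencies = deque(maxlen=200)    # recent per-call latencies observed against the NAS
allowed = MAX_WORKERS            # how many workers may run concurrently right now
lock = threading.Lock()

def crawl_worker(worker_id: int) -> None:
    while True:
        # Back off while the controller has scaled concurrency below our index.
        while worker_id >= allowed and not dir_queue.empty():
            time.sleep(0.1)
        try:
            path = dir_queue.get(timeout=2)
        except queue.Empty:
            return                          # nothing left to crawl (sketch-level termination)
        start = time.time()
        try:
            with os.scandir(path) as it:
                entries = list(it)
        except OSError:
            continue                        # skip unreadable directories
        with lock:
            latencies.append((time.time() - start) * 1000)
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                dir_queue.put(entry.path)   # newly found directories go back on the shared queue
            # else: compare entry.stat() mtime/size to the previous catalog to flag changes

def controller() -> None:
    """Scan the worker count up or down based on the latency the NAS is showing us."""
    global allowed
    while True:
        time.sleep(1)
        with lock:
            if not latencies:
                continue
            avg = mean(latencies)
            if avg > LATENCY_BUDGET_MS and allowed > 1:
                allowed -= 1                # NAS is getting busy: throttle the crawl
            elif avg < LATENCY_BUDGET_MS / 2 and allowed < MAX_WORKERS:
                allowed += 1                # headroom available: go wider

if __name__ == "__main__":
    dir_queue.put("/mnt/nas_export")        # hypothetical NFS mount to crawl
    threading.Thread(target=controller, daemon=True).start()
    workers = [threading.Thread(target=crawl_worker, args=(i,)) for i in range(MAX_WORKERS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

The real product presumably does far more (snapshot diffing, per-array tuning, resumable state), but the shape of the loop is the point: one shared queue, a dynamic worker count, and latency feedback instead of hard partitions.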
So once you understand, let's say, the changed files across multiple NAS boxes and all that stuff, the movement has to be
somewhat challenging as well, right?
Yeah, because you end up in this world where everybody that's done rsync before knows it
works great for big files.
And then all of a sudden you hit a patch of small files and it grinds to a halt.
And so the kind of the architecture piece is what we call adaptive
scanners, which is how we scan through things. And we talk about tasks that can scan at 400,000
files per second. IntelliMove is our move engine. It's the stuff that as it's going through and it
finds data, it's handing it off into different thread pools. And so one of those thread pools
says, if you encounter small files, you want to kind of aggregate them together and move them as
a much bigger chunk. And as you get to big files, you want to break them up into multiple threads
and push them as multiple threads. And the whole intent there is that you just have a lot of data
in flight. You're really keeping networks busy. You're really just pushing this kind of pipeline effect of stuff that you're finding, getting it up into memory, and pushing it out as fast as possible on the network.
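To make the small-file versus big-file handling concrete, here is a hedged Python sketch of that dispatch idea: small files get aggregated into larger bundles before upload, while big files get split into byte ranges that move on separate threads. The thresholds, function names (put_batch, put_range), and paths are made up for illustration; this is not Igneous's IntelliMove API.

```python
from concurrent.futures import ThreadPoolExecutor

SMALL_FILE_LIMIT = 1 << 20    # 1 MiB: below this, files are bundled together (assumed threshold)
BATCH_TARGET = 64 << 20       # aim for roughly 64 MiB aggregated bundles
CHUNK_SIZE = 256 << 20        # large files move as 256 MiB ranges, each on its own thread

pool = ThreadPoolExecutor(max_workers=8)

def put_batch(paths):
    # Placeholder: stream a bundle of small files to the target as one object.
    print(f"uploading bundle of {len(paths)} small files")

def put_range(path, offset, length):
    # Placeholder: upload one byte range of a large file.
    print(f"uploading {path} bytes {offset}..{offset + length}")

def dispatch(changed_files):
    """changed_files: iterable of (path, size_in_bytes) pairs produced by the change scan."""
    batch, batch_bytes = [], 0
    for path, size in changed_files:
        if size < SMALL_FILE_LIMIT:
            batch.append(path)
            batch_bytes += size
            if batch_bytes >= BATCH_TARGET:        # flush an aggregated chunk of small files
                pool.submit(put_batch, batch)
                batch, batch_bytes = [], 0
        else:
            # break a big file into ranges and push each range as its own task
            for offset in range(0, size, CHUNK_SIZE):
                pool.submit(put_range, path, offset, min(CHUNK_SIZE, size - offset))
    if batch:
        pool.submit(put_batch, batch)

if __name__ == "__main__":
    dispatch([("/mnt/nas/a.txt", 4_096),
              ("/mnt/nas/b.log", 12_000),
              ("/mnt/nas/scan.mov", 900 << 20)])
    pool.shutdown(wait=True)
```

Either way the goal is the same: keep a lot of data in flight so the network, not the file size mix, is the bottleneck.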
And the targets for the data protection would be what, per se?
And this has been a lot of fun, because essentially when
we started out, the targets for this data were, you know, kind of Supermicro servers or appliances that we deployed. Two years
ago, it started to become cloud. And so everybody had a cloud component, and we co-launched
with Microsoft and their Azure Archive blob storage, which all of a sudden had this price point of $12 per terabyte per year, which
then turned into Amazon having their Glacier Deep Archive.
And so now cloud is a much more of a target for these backup operations.
And people have direct connect to these cloud providers now.
They're much more common.
There's a lot more points of presence out there to target those.
And a couple of things, though, you've got to think about if you're using the cloud, which we strongly encourage everybody to go do.
It basically outsources all of your tape or your secondary systems or your data centers, and makes it really easy to manage and deal with.
But you have to think about how am I going to move data in effectively and efficiently,
which means avoiding transaction costs. Like I can't take a billion files and put a billion files in the cloud. Your transaction costs are going to be like 50 grand because they do charge
you just to move things. Even though it says data in is free, you know, every thousand PUTs is a nickel on Amazon Deep Archive, right?
So it becomes pretty expensive fast.
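The arithmetic behind that "50 grand" figure is worth seeing. A quick back-of-the-envelope in Python, using the roughly-a-nickel-per-thousand-PUTs rate quoted here (check current provider pricing before relying on it):

```python
files = 1_000_000_000        # one billion individual files, one PUT request each
price_per_1k_puts = 0.05     # ~a nickel per thousand PUT requests, as quoted in the episode

put_cost = files / 1_000 * price_per_1k_puts
print(f"${put_cost:,.0f}")   # -> $50,000 in request charges before a single byte is billed for storage

# Aggregating small files into, say, 100 MiB bundles before upload cuts the request
# count, and therefore this cost, by several orders of magnitude.
```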
And then conversely, you got to think about how do I expire data?
And so when I expire data, I can't just rehydrate everything and expire data and then compact
it and rewrite it.
I have to do it in a way that's like economically efficient. And so I have to detach
expiration of data, which is a business operation from the actual cleanup of data, which is an
economic operation. And so we've done some- Is it all done by policy though?
It is done by policy. So the policy is the business end of it, which says keep this for 90 days.
And so once you expire that, that data is no longer recoverable,
but you don't want to immediately go clean that up on the cloud side,
because, with it sitting in Glacier or Azure Archive,
that could be a very expensive event to go clean up immediately.
And so what you want to do is kind of build up these
expirations over time.
And when it now becomes cheaper to rehydrate a chunk of data, pull out the expired data
and rewrite it, is when you want to go do that operation versus keeping it stored there
in perpetuity so that you're not growing unbounded.
It's deferred garbage collection per se when you actually need to do it and stuff like
that.
That's exactly it.
And it's the first time garbage collection has ever existed
with like a true dollar cost behind it.
Right.
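One way to picture that detached, dollar-aware garbage collection is a simple break-even test: expiration is recorded immediately as a policy event, but a chunk only gets rehydrated and rewritten once the storage being wasted on expired bytes outweighs the cost of pulling the chunk back. The prices and function below are illustrative assumptions, not Igneous's actual cost model.

```python
def should_compact(chunk_bytes: int,
                   dead_fraction: float,
                   retrieval_cost_per_gb: float = 0.0025,     # assumed archive-tier bulk retrieval price
                   storage_cost_per_gb_month: float = 0.001,  # assumed archive-tier storage price
                   horizon_months: int = 12) -> bool:
    """Expiration already happened as a business/policy event; this is the separate
    economic decision: is it now cheaper to rehydrate the chunk, drop the expired
    records, and rewrite it than to keep paying to store the dead bytes?
    (Ignores re-upload request costs for simplicity.)"""
    gib = chunk_bytes / 2**30
    rewrite_cost = gib * retrieval_cost_per_gb                             # pull the chunk back and rewrite it
    savings = gib * dead_fraction * storage_cost_per_gb_month * horizon_months
    return savings > rewrite_cost

# A 100 GiB chunk that is only 10% expired stays put; one that is 60% expired is worth compacting.
print(should_compact(100 * 2**30, 0.10))   # False
print(should_compact(100 * 2**30, 0.60))   # True
```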
So, I mean, how does this thing get deployed?
I mean, if we're talking hundreds of petabytes
and billions of files across multiple NAS solutions,
I mean, it's got to be a nightmare.
Yeah, so it is, to deal with in the normal world. With Igneous, look, we've dealt with scale-out for a long time. We approach this as a scale-out architecture. And so regardless of whether you have a single site that's really large and you need to deploy multiple VMs, or multiple sites, we deploy through VMs. Those VMs are in kind of the cloud-native
format, which is just a collection of containers that are in those VMs doing some work.
So we deploy, you know, this VM that sits on premises. It's talking to our SaaS offering,
our cloud offering. So that VM is really just the stateless thing that lives in their environment
that's doing the scanning and the movement. All the kind of orchestration and all the configuration is really through the SaaS portal
of this. And then you can continue to deploy more and more VMs based on the size of a given data
center environment, or you could deploy more and more VMs across multiple sites, but you're still
just managing it as a single pane of glass, no matter how you distribute those VMs out.
So, I'm sorry, this VM, is it any particular flavor?
Can you choose, I want to run it as a Hyper-V device or a VMware device?
Our default is OVAs.
That seems to be the most prevalent, but we've deployed in Hyper-V.
We've deployed in KVM environments before.
So it's pretty flexible. It's a model where what we deploy is really lightweight. The total VM size is 100
gigs... we don't really, sorry, it's like 100 megs. And then when we deploy, the footprint of that VM
is pretty light. It's like, you know, eight cores, 32 gigs of RAM
in that VM. And then the total footprint of that VM, once it's all loaded up and deployed, is
about 100 gigs worth of space. So pretty easy, pretty lightweight to go deploy.
And that's a Linux based OS?
It's a Linux based OS, yes.
So who in this world's got hundreds of petabytes of data with billions of files these days? Surprisingly, name all your big life sciences organizations, name all your big
M&E shops. Media guys? Name all your big financial institutions. It exists everywhere. So
we just published a case study with a company called
Quantum Spatial. And so they fly airplanes, take pictures of the ground, and then they process that
data. And then they, from there, produce results that they provide to their clients. And it could
be everything from LIDAR to radar. It could be aerial imagery. It could be over crops, it could be over forests. There's
all sorts of use cases why you go do this picture taking. So that comes off a plane,
it gets loaded onto NAS devices, it gets processed. From there, then they produce the results to their
customers. And then there's this backup and archive workflow. So they want to go either
protect that data, so they have a good copy
to fall back to if they want to go back and reprocess things. And so there's a finite period
of backup. And then once they're really done with the project, they want to go archive it. And so
they'll go archive it off into Azure. And that sits there according to their contract regulations.
And their workflow is pretty fluid because they're always like, you think of archive as kind of a one-way thing. We're just going to go archive this data once,
and that's going to sit there for a long time. But they actually have really more of an active
archive environment. Like they're putting the data into archive and then pulling it back,
you know, probably a month later to go do some reprocessing against it. And the reason why it's
so active is, they're just trying to keep their
primary storage in check, given how much capacity they've deployed in their limited data center
space. And so they've got to have things that can move that data back and forth as fast as possible
for them. You've talked about data protection. Is there another solution that you have besides that?
Yeah. So one thing just to kind of circle back on: QSI is one, too, where I would call them a good hybrid environment, where they protect to the cloud. That's an archive to the cloud. But they also do a lot of that locally, too. So that kind of 30-to-90-day window actually lands locally. So we're one of the few that's in this category of a SaaS backup provider that gives customers the flexibility: do you want to protect
that data to Isilon or NetApp? Do you want to use that as the target for your data? Do you want to
use an object store on-premises like ECS or StorageGrid or Cloudian or Scality? Or do you
want to use any of the cloud providers? We interoperate with GCP and Azure and AWS and Wasabi. And so I think we're the only
one out there that has that broad range of targets and has a SaaS offering. And that's
Data Protect, right? That's the backup and archive piece of it. The other part that we've
been working on is called Data Discover. And that was something we launched mid last year.
Now, Data Discover is visibility
into all your data. So, you know, it's really hard to understand what you have. So you have to ask
things like, how hot and how cold is your data? And Data Discover is the offering that goes out
and scans all your data where it exists in place and gives you a heat map so that you can start to
make decisions about what to do with your data. Do I want to archive some data? Do I want to back it up? Do I want to go delete some data? Do I want to promote
some data to flash? And so a very unique offering that you finally have like a dashboard to control
what your footprint looks like. Are you charging on a petabyte basis or something? Or how does
this work? Is it because it's a SaaS offering? It's almost like a monthly charge? Lots of flexibility.
We have two modes.
Like one mode is subscription of data under management.
And that's more of your backup and archive.
And you can prepay, you know, by month, year, multi-year.
Then you have our Data Discover.
And that is by VM, actually.
Like since that's a pretty lightweight VM, we look at it and
go, okay, how much data are you scanning and how big is the footprint and how fast do you want
those results? So like a typical VM that we deploy could be scanning six petabytes of data and, you
know, three or four billion files, and it does it in a day and a half or so. As you start getting bigger
than that, you know, let's say you hit the 20, the 30, the 40 petabyte mark, you might want to deploy more VMs
so that you keep that visibility of data in that short of a window so it's always fresh.
So you're getting that high scan rate. But others will say, hey, I have 100 petabytes.
I've got two VMs and, you know, we scan that data once a week and that's good enough for us.
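Those scan-window figures imply a steady per-VM rate you can sanity-check, and they show why adding VMs shrinks the freshness window roughly linearly. A rough estimate in Python, using only the numbers mentioned above:

```python
files = 4_000_000_000                 # ~4 billion files on roughly 6 PB, per the example above
scan_seconds = 1.5 * 86_400           # "a day and a half" for one scanning VM

per_vm_rate = files / scan_seconds    # ≈ 30,900 files scanned per second, sustained
print(f"{per_vm_rate:,.0f} files/sec per VM")

# To refresh the same namespace daily instead, you'd want proportionally more VMs:
target_window = 86_400                # one day, in seconds
vms_needed = files / (per_vm_rate * target_window)
print(f"{vms_needed:.1f} VMs for a one-day window")   # ~1.5, i.e. round up to 2
```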
So you've got two charge models, I should say. One is by capacity under management and the other
one is by VM?
That's correct. That's correct. And we have some bundles out there right now: for 100 terabytes of data
under management plus Data Discover, so Data Protect and Data Discover together,
it's 30K a year to get started, and then costs change as you scale that up.
Since this backup scenario seems really robust, is there a provision in place for endpoint?
Even like desktop stuff?
Yeah. So we've been really focused on the big data right now. We feel like we have a long runway to just doing a portfolio of data
management capabilities that we've just touched the surface of. And we haven't even talked about
search and finding data and all the data you have. We haven't talked about GDPR or PII discovery.
We haven't talked about distributed enterprises
and collaboration.
So I think we have a pretty good runway
for stuff that our customers are certainly asking us to do
that we'll stick to where we're at for now.
And then maybe as that we cover more of that market,
we'll reach out into different segments.
One of the questions we get asked a lot is, my God, I love the service. It's so easy to use. We had it up and
running in a half an hour. Why can't you do this for my VM environment? And we kind of stick true
to our roots and say, you've got to be the best at what you do. And you know, there's NAS
and SAN, and we're on the NAS side. Like, that is the world that we operate in, and we have a lot of
great capabilities. And so we see a long
roadmap for this that will continue to grow as more and more unstructured data is generated.
Hey, this has been great. Matt, any last questions for Christian?
Boy, you know, I find the whole backup space to be very, very interesting, incredibly active,
and full of new approaches. And personally, I'm a huge fan of sort of the whole SaaS approach. I think we could have this conversation for hours and still probably barely scratch the surface of your business use case. But, you know, obviously, we don't have time for that. But I really did enjoy the conversation, and I could see myself digging in much deeper to this product.
Okay. So, Christian, anything else you'd like to say to our listening audience before we close?
One thing that's time-sensitive is that there are a lot of people out there doing research right now in the life sciences space. We certainly want to be part of the broader group that can help out in that. And
using things like tape in this environment becomes a challenge. So we're offering free usage of our
services to go back up data to the cloud for customers that are doing research around COVID-19 at this point. If anybody is in
that camp, please reach out to us. We'll have some broader announcements around this coming up,
but we certainly want to play a part. And this is a time when lives are at risk and
losing data while lives are at risk is a very difficult environment to be in. So please reach
out. That's great. If you want to give me
a link, I can put it on the podcast post and we'll raise awareness from it. Sounds great.
Well, this has been great. Thank you very much, Christian, for being on our show today. And thanks
to Igneous for sponsoring this podcast. Thank you guys. Nice talking to you. Thank you, Christian.
Next time, we'll talk with another system storage technology person. Any questions you want us to
ask, please let us know. If you enjoy our podcast, tell your friends about it.
Please review us on Apple Podcasts, Google Play, and Spotify,
as this will also help get the word out.
That's it for now.
Bye, Matt.
Bye, Ray.
Until next time.
Thanks.