Grey Beards on Systems - 76: GreyBeards talk backup content, GDPR and cyber security with Jim McGann, VP Mkt & Bus. Dev., Index Engines
Episode Date: November 14, 2018
In this episode we talk indexing old backups, GDPR and CyberSense, a new approach to cyber security, with Jim McGann, VP Marketing and Business Development, Index Engines. Jim's an old industry hand that's been around backups, e-discovery and security almost since the beginning. Index Engines' solution to cyber security, CyberSense, is also offered by Dell EMC.
Transcript
Hey everybody, Ray Lucchesi here with Howard Marks here.
Welcome to the next episode of the Greybeards on Storage podcast,
a show where we get greybeard storage bloggers to talk with system vendors
to discuss upcoming products, technologies, and trends affecting the data center today. This Greybeards on Storage episode
was recorded on November 2nd, 2018. We have with us here today Jim McGann, VP of Marketing and
Business Development at Index Engines. So Jim, why don't you tell us a little bit about yourself
and what's new at Index Engines? Hey, thanks Ray. Thanks Howard. It's great to be joining the
podcast today. I've been with Index Engines for almost 15 years now. We've been in business for quite a while.
We are an enterprise indexing company, hence the name. Really, what we are is we add indexing to
the data center. So we can layer on top of network storage and crawl network storage to index it. A unique claim to fame that Howard has always loved
has been the ability to index backup images.
So we've engineered access to NetBackup, Commvault, Networker, TSM
to be able to understand what's in those formats,
whether it be on tape or disk.
This really came in handy during my consulting days.
As a consultant, I frequently got called in to help clean up after they fired a CIO and other people. And so there would inevitably be the day when someone really needed to retrieve some data and a box would come back from Iron Mountain and it would be tapes that we didn't have tape drives for anymore. Nobody remembered what backup software they used in 1992.
Right.
Sounds like NASA.
And just, you know, finding the four messages that relate to the sexual harassment suit
that we're in the middle of got a lot easier.
Yeah, a lot of what our mission in life is, similar to what Google's done for the Internet, making Internet data valuable through search, is to make enterprise information valuable through search, to be able to find it and manage it.
So there's some stuff where there are bodies buried in old backup tapes, as we know exist, like the mortgage crisis or harassment lawsuits or other types of lawsuits, oil spills. Companies, and this could be a whole separate podcast, use tapes for long-term retention and archiving.
So, is backup an archive? That's the question. So using it as an archive, you know, if you use it as an archive and it's an
old backup format that you don't have anymore, and then you throw it out in the salt mine somewhere
in Pennsylvania, how is that an effective archive? And when you need to go and find that silver
bullet email or needle in the haystack, how do you do that? Well, I'm glad you used the word
silver bullet because, well, everybody's always worried about the liability that might be in the data they're retaining.
But there was a time where I was working as a consultant to an advertising agency that lost an account and got sent an email that said, we're going to extend this account for 90 days while the new advertising agency staffs up and it'll be a million dollars a month. So that's $3 million. And of course they
sent that to the account exec who had just been fired because he lost the account.
And when it came time to submit the invoice, accounting kicked it back.
And I had to go find that $3 million email. So, you know, the stuff in
your old data isn't necessarily poison pills. Sometimes it really is that magic bean that
you've been looking for. Yeah, we have customers that really, you know, engineering companies that
manufacture bridges or buildings. We have a customer in the UK that rebuilt some infrastructure in London.
All that data, the data of value that could be repurposed,
engineering documents and so on, if they archive them on backup tapes,
they have no value.
We've been working with a lot of those organizations that use our technology to go mine that data on old tapes.
It could be an old TSM tape where now they're using a product like Avamar or Networker
to go find the data value, find me file types of this type, find me documents or PDFs of this type
or these type of AutoCAD documents and extract them into a cloud archive that makes it much
more usable, to regain access to that intellectual property. So it's not always bad. There is a lot of bad, though. I mean, we worked with a law firm on, and this is actually an interesting story I can talk about because the company no longer exists, a case where the company was a hedge fund, remember hedge funds back in the day, and they were ahead of their time. They allegedly hired someone from Microsoft and then started trading on Microsoft secrets. So the SEC was after them for years and years and
years. And all the information on their exchange server and their internal networks was cleansed.
And they hired these forensic analysts and these e-discovery
folks to go find it. No one could find it. So the CEO was in the process of going through a divorce. The wife basically knew what the SEC was looking for and found a copy of the email on the home computer and brought it to the SEC. That was the silver bullet. But the SEC said, well, we need to see it on their Exchange server.
So they hired a company that used our technology. They said, we'll go find the tapes that are the
backups of the Exchange server from June of whatever year it was. They did that. They scanned it. They said, type these keywords in to search for this email. And lo and behold, it was there, buried in some secret folder, and was found. That company no longer exists, and the SEC won that battle, right?
So Shakespeare was right.
Hell hath no fury like a woman scorned.
Let's not go there, Howard.
It's not good.
So what's new at Index Engines?
You guys were just at Tech Field Day, right? Yeah, so what we've been doing is obviously we've been in the business of being able to help customers manage legacy tape data.
So a lot of legal discovery work, a lot of people that want to go tapeless and just eliminate their tape archives and move the data to the cloud.
We have partnerships with AWS to migrate that or other cloud providers as well.
And that's been a solid business for us for a long, long time.
Because like Howard said, you could walk in the room and hand me 10-year-old TSM tapes, LTO1, DLT even,
and we could scan them from beginning to end without having the backup software, index the content, and surgically extract an individual email out of Exchange, out of the backup format, back onto disk, maintaining the integrity so it can be used for legal discovery.
So we're seeing an uptick in that business because of the EU GDPR.
I mean, customers using tapes as an archive and managing personal data on offline tapes
is not necessarily a good strategy.
So things like that are having customers reach out to us
and say, help us clean up our legacy tape museums
that are sitting in offsite storage.
So the GDPR has also kicked up our online network data indexing: finding personal data in their files and their networks, to be able to manage it.
So when you have a right-to-be-forgotten request, or other requests required by personal data regulations, that's not only the GDPR; California has a personal privacy act, and they're popping up all over the place. So we're seeing a lot of activity there. The interesting thing is what customers are doing: the first phase is, let's take a look at our storage and just start cleaning up the mess. The stuff that's made all these storage vendors rich, just hoarding and hoarding stockpiles of petabytes of data on old NetApp filers.
It's like, let's go look at that stuff. And we had one company that's an electronic manufacturer on the West Coast that
said, let's do an assessment of two and a half petabytes of data. So we were able to scan that
in a couple of weeks at a metadata level. They did a profile or assessment and found that about
a petabyte of it was a combination of old log files that had zero value, which they were backing
up and protecting for decades, and a bunch of useless files that they immediately purged. So they reclaimed 50% of that capacity. And then they
started looking at the other half of it to figure out what had value. When you have a personal data management issue, doing it across multiple petabytes is challenging for anybody; it helps when you can reduce that footprint, find the data of value, and then manage that. So that's been a very hot space for us, especially this past year.
Yeah. And IT has always wanted to throw things away, but has never felt
empowered enough to make the value judgment to say, this data is actually worthless.
Yeah. Yeah. Well, it's not part of their mission statement. It's really the business users. And
every company has had records managers, but they haven't really been empowered to make these decisions.
Well, and they've always considered themselves to have a very narrow scope. You know, the records management people are only concerned with, you know, the things the SEC is going to sue us about not keeping. Right. But I think things like these
privacy or personal data regulations, once there's some pretty serious sanctions or fines,
you know, with real companies other than the Facebooks of the world, people will start to
take notice. And I think they will start to ask questions: why are we keeping this? What are we keeping it for? Does it have business value? Who can get access to this stuff? All those questions they should have been asking, looking at the life cycle of data.
And I remember, you know, a decade ago we talked about ILM and managing data properly.
The whole archive thing.
And a decade before that we talked about HSM and it was the same thing.
Yeah.
So I think maybe third time's a charm.
Didn't Shakespeare – I think Shakespeare said that as well, Howard, right?
So I think we're definitely seeing a lot of traction in the GDPR.
But one of the interesting things that we spoke with Howard about a few weeks ago
at Tech Field Day was we've added analytics to our index.
So as you're indexing the data, indexing at a metadata level is interesting for doing some tiering, finding obsolete data or redundant data and so on. But when you go inside the content, as we do to do keyword search for legal discovery, you can do more, so we also added analytics to the product.
And when you say analytics, you're not talking about performance per se as much as content, is that?
Yes, exactly.
So it's really looking more at the integrity of the data.
So, you know, in terms of cyber and the ransomware and the cyber criminals that exist out there, you know, we know that they are getting into data centers.
So there are real-time protection tools, things like McAfee that do signature-based detection. There's Varonis, looking at user behavior. Those things are not 100 percent. So we know that data centers are getting breached and the data is being corrupted.
So what we've added to our product
is the ability to look at inside the content.
So it's not just metadata-based analytics,
which a number of vendors are doing,
which really doesn't have a lot of value.
It's content-based analytics.
So we're looking at some high-level stuff
like file type mismatch, reading the header of a file. We know they're attacking things like office documents. So read the header of a Word document and say, based on the header this is a Word document, but the file extension is .Loki or .encrypted. That's not good. We've also added the ability to look at the entropy of files or pages in a database.
We know that when a file becomes encrypted, the entropy score goes to 99. So we've created a score from zero to 100. Show me all files on the network with an entropy score of 99. If Word documents or PDFs have an entropy score of 99, that looks like corruption.
When you say entropy, you're talking about access or write data or?
No, we use an algorithm to look at the randomness, the disorder, of the file. When a file, or a page of a database, becomes corrupted, it becomes much more disordered.
Content entropy.
Oh, that's interesting.
Okay, I got you.
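For readers curious what such a score might look like, here is a minimal sketch of byte-level Shannon entropy in Python. The 0-to-100 scaling is an assumption chosen to match the scores Jim quotes; Index Engines' actual algorithm isn't public.

```python
import math
import os
from collections import Counter

def entropy_score(data: bytes) -> float:
    """Shannon entropy of a byte string, scaled to 0-100.

    Raw byte entropy runs from 0 bits/byte (all one value) to
    8 bits/byte (uniformly random, which is how encrypted data looks).
    """
    if not data:
        return 0.0
    n = len(data)
    bits = -sum((c / n) * math.log2(c / n) for c in Counter(data).values())
    return bits / 8 * 100  # map 0..8 bits/byte onto 0..100

# Ordinary text scores well below 100; random bytes (a stand-in
# for encrypted content) score close to 100.
print(entropy_score(b"the quick brown fox jumps over the lazy dog " * 100))
print(entropy_score(os.urandom(65536)))
```

A scan could then flag any .docx or .pdf whose score crosses a threshold like the 89 or 99 figures mentioned later in the conversation.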
Yeah. We've also created similarity scores, for when documents become very dissimilar, looking for corrupt files where they strip out content.
So basic analytics that are really indicative of a cyber attack.
And a lot of the attacks are doing the same thing. They're doing encryption of files, encryption of pages of a database. They're corrupting files. They're changing extensions of the files. They're doing basically all the same kind of stuff.
And mass deletions as well. So what we're doing is creating analytics based on that behavior. And then we're applying machine learning, which we've trained with all the recent malware that exists in the market, thousands and thousands of malware samples, to look at the statistics and find behavior that's indicative of an attack. At that point, it allows customers to look for that content and find data that is potentially corrupt.
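As a hypothetical illustration of combining scan-over-scan statistics, here is a hand-weighted score in Python. The feature names, weights, and thresholds are all invented for illustration; a system like CyberSense would apply a model trained on real malware, not fixed weights.

```python
from dataclasses import dataclass

@dataclass
class ScanStats:
    """Per-scan statistics of the kind described above (names are illustrative)."""
    pct_high_entropy: float   # share of files with entropy score >= 99
    pct_type_mismatch: float  # share of files whose header type != extension
    pct_known_bad_ext: float  # share of files with known ransomware extensions
    pct_deleted: float        # share of files deleted since the last scan

def attack_likelihood(prev: ScanStats, cur: ScanStats) -> float:
    """Toy score in [0, 1] weighting the *changes* between two scans.

    The weights below are made up; the point is only that the signal
    comes from scan-to-scan deltas, not any single snapshot.
    """
    feats = [
        max(0.0, cur.pct_high_entropy - prev.pct_high_entropy),
        max(0.0, cur.pct_type_mismatch - prev.pct_type_mismatch),
        cur.pct_known_bad_ext,
        cur.pct_deleted,
    ]
    weights = [0.35, 0.25, 0.3, 0.1]
    return min(1.0, sum(w * f * 10 for w, f in zip(weights, feats)))

baseline = ScanStats(0.02, 0.001, 0.0, 0.01)
suspect = ScanStats(0.40, 0.15, 0.08, 0.12)
print(attack_likelihood(baseline, baseline))  # near 0: nothing changed
print(attack_likelihood(baseline, suspect))   # high: looks like an attack
```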
So for you guys to do this sort of thing, it's no longer just a one-shot event that you come in and index and stuff like that.
It's more of an ongoing type of solution.
Is that how you'd classify it?
Exactly right. That's a good setup here.
So it's really about observations of the data.
And if you think about data protection, what has data protection been? It's been to support disaster recovery: building burns down, earthquake, whatever it is, right?
Well, it depends on the degree of the disaster frequently.
Frequently, it's just a user being stupid.
A non-stupid user making a mistake, for that matter.
Yeah.
Assuming they haven't eliminated all the stupid users out there.
We can only dream.
So what it's really doing is changing disaster recovery, making a cyber incident just part of it: looking at the data integrity and recovering from a cyber attack or corruption of the data just like any other disaster. You know, I was in a presentation yesterday where they were saying cyber is the new disaster.
It's really about how the data is corrupted. And if you look at attacks like Sony, the poster child, and others, they happen over time. So as data is slowly being corrupted, if you're continually doing observations of the data, looking at it every day, every couple of days, every week, you can find changes in the content that are indicative of a cyber attack. And if you find it tomorrow, you can replace those files with the last good copy, like disaster recovery solutions do, and continue your business without any interruption.
Well, I do have to track down whoever it was that opened the phishing email and has the malware.
Yeah, and eliminate the possibility of continuing.
But yeah, yeah, yeah.
Ray, I've told you a couple of times we're not allowed to kill users anymore.
Not in this country anyway.
But what we've also added is forensic tools, right?
So exactly right, Howard. So if you see 1,000 files were deleted or 1,000 files were encrypted,
we can index also the Windows event logs,
and we can tell you who modified those 1,000 files.
So if they were all modified by one user, John Doe,
then that account was breached,
and that account needs to be shut down immediately
because that's being used to execute the ransomware. We will also point to the executable that did that action,
that did the encryption or the deletion. So that should point to your malware.
So beyond just looking at the corruption of data, we can use the forensic tools to analyze that corruption activity
to do kind of a, you know, Inspector Clouseau on, you know, who did it, which user account
did it, what executable did it, to allow the cyber engineers to clean that stuff up, right?
Well, hopefully more Hercule Poirot than Clouseau, because Clouseau was not a very good detective.
He was a little bumbly, but he got it done, right?
So, I mean, you guys would have to be sitting in most of the systems, accessing all the storage on a periodic basis, doing this entropy scoring and maintaining this information over time. Is that kind of what you guys are doing?
Well, the beauty of it is, our product is a standalone product, but we've also partnered with the Dell EMC Cyber Recovery products.
So if you think about it, what their angle is, is isolate the crown jewels into a vault.
So get it off the network, air gap it off the network so it's isolated,
that if you are attacked, you can take this crown jewel data and recover your business
and integrate it as part of the backup process on a data domain, for example, right?
So that's a well-proven, well-defined process.
So customers are using NetWorker or Avamar, Commvault, TSM, sorry, Spectrum Protect, to back data up into an isolated vault.
Index Engine CyberSense understands those formats and can look at the data, look at
what's happening inside the data.
The beauty of that is it doesn't need to rehydrate the data out of the backup image, so it's
not going to allow the ransomware or malware to execute any further.
We look at it and do an integrity check on the data, and then come up with a decision like a red light or green light. Green light: everything looks okay. Red light kicks into the forensic processes that I just discussed.
So rather than actually looking at the storage, you're looking at the backup stream and verifying
that the backup stream is still consistent and good. Is that right?
Yeah, we're not in the path of backup. We're looking at the backup image.
We can look at storage.
So I think there's a hybrid environment
where customers may want to put data into a vault.
I think the vault has a lot of advantages of isolating it
because the attackers won't know it exists,
so they can't attack it.
So it's protecting that environment.
But if there's stuff
that's not moved into the vault, so maybe legal contracts or financial documents that you want
to scan and just do an integrity check. So maybe a company just wants to say, scan that server,
that legal server that has a bunch of legal documents and give me an average entropy score
of those files. And my threshold for that is going to be about an 89 score,
89 out of 100, anything higher than 89 I want to look into further.
Or tell me the files on there that have entropy scores of 99,
I want to investigate those.
So you may not want to do it.
Can you compare to the last scan?
Yeah, so the observations that we do constantly compare it.
So it's comparing statistics from one scan to the next
and then applying the machine learning to see how it's changed.
Yeah, because I'd really be concerned when the entropy level of a group of files stored together in one place changed.
Right, right.
I mean, you can do a baseline.
So if you're doing a day-one scan, just tell me if there's anything strange here. The obvious thing is, show me known ransomware extensions that exist on that server; that's a no-brainer. Show me files that have high entropy scores. Show me files that don't match their extension. Show me databases that have pages with high entropy scores. So there's a sense of using the analytics just to create an integrity check on your data, whether it be part of your backup process or network files that are being scanned.
So you're not doing anything like taking a checksum of the data as it exists and maintaining that checksum over time to make sure it hasn't changed. You're actually looking at the content of the data and trying to determine its randomness. I guess that's what entropy is, right?
That's one piece of it, entropy.
There's over 40 statistics that are being generated.
We're adding another 40 onto that.
So some of the statistics would be file type mismatch, when the extension doesn't match the actual file type, or known ransomware extensions, or file corruption, for example. So it's looking at a whole bunch of different statistics and then using the machine learning that was trained on all the common ransomware to say, based on those statistics, here's the attack vector that actually executed that corruption.
Oh, that's pretty impressive.
And the other thing you mentioned is that, for the most part, you're looking at the backup data rather than the active data. So you're not really impacting the performance of the current storage per se, if that's what they want, right?
Yeah, again, we could look at either. But I think
it's a natural extension of the backup process. So it turns disaster recovery into more data
governance and data integrity. So it's adding significant value. And as Howard mentioned, we're partnered with Dell EMC on this. That's what Dell EMC says is adding the value, and the competitive advantage: it's not just looking at some metadata analytics, or looking at the backup image to see if any kind of corruption has existed on it. Although it does do that check, which is very important.
All too often, I have had backup systems tell me, oh yeah, we backed up your data fine. Everything's fine. No problem. Only to discover that they were only doing the most cursory of checks.
Right, they're only doing metadata-based analytics; they're just looking at high-level metadata of files. So we took our analytics and we turned off the content and just did metadata only, and then compared it to turning the content
on. So when the content's on, we find a detection error rate of 0.5%. So the issue that we're seeing
with a lot of these other vendors is
false positives or false negatives. When you have a lot of false positives, you lose any kind of credibility with the cyber folks or the backup folks looking at it. It's like, it's always just false positives, so why should I bother to look at it?
Well, nobody ever replies to the application that pages them every night and says something's wrong.
Exactly. So doing content-based analytics, we're finding a half-percent detection error rate, minimizing the false positives. When we turned off the content and just did metadata only, like some of the other vendors are doing, the detection error rate jumped to 11%, 22 times higher than what we see. So if you have an 11% detection error rate, no one's going to spend any time looking at that stuff.
You're going to get, like Howard said, you're going to get emails,
you know, every 10 minutes saying, you know, potential corruption,
potential corruption.
And if you're just looking at metadata, you're not going to read the file header to determine what the real file type is, whether this is truly a Word document, to catch the file type mismatch. One of the most common malware behaviors is changing the extension to .Loki or .encrypted or .lol. You're not going to find stuff like that.
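The header-versus-extension check Jim describes can be sketched with a small table of well-known file "magic numbers." The table below is a tiny illustrative subset, not the product's detection logic.

```python
# A few well-known magic-number signatures (illustrative subset only).
MAGIC = {
    b"\x50\x4b\x03\x04": {".docx", ".xlsx", ".pptx", ".zip"},  # ZIP container
    b"%PDF":             {".pdf"},
    b"\xd0\xcf\x11\xe0": {".doc", ".xls", ".ppt"},             # legacy Office (OLE2)
}

def type_mismatch(header: bytes, filename: str) -> bool:
    """True when the header identifies a known type but the extension
    does not belong to it (e.g. a Word file renamed to .locky)."""
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    for magic, exts in MAGIC.items():
        if header.startswith(magic):
            return ext not in exts
    return False  # unknown header: this table can't judge it

print(type_mismatch(b"%PDF-1.7 ...", "report.pdf"))       # False: consistent
print(type_mismatch(b"\xd0\xcf\x11\xe0...", "q3.locky"))  # True: Office header, ransomware extension
```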
That's a lot easier to recover from than actually encrypting the data.
Yeah, so we're pretty excited about it.
We're getting some really good response.
I think, you know, the partnership with Dell EMC is incredibly exciting.
There's a lot of customers building vaults today to isolate these environments and to start doing analytics on the data to check the integrity of it.
So it's a new feature.
We announced it over the past year.
Dell EMC just formally announced the product last month.
And it's a good go-to-market partner for us.
It seems, Jim, you could actually partner with just about every backup vendor out there, right?
I mean, A, you currently support just about every one of their formats. And B, this would be a great add-on tool to all of them. Yeah, yeah, yeah, no doubt.
I was just wondering if there's any exclusivity with Dell EMC or is this something that
other backup vendors should be talking to you about? Yeah, well, I mean, I can't tell you about
our relationship with them, but we have a strong partnership with them.
You know, we do talk to other backup, other partners.
You know, some of them, a lot of them have added some of their own analytics.
So you kind of compete against what they're doing in a sense.
Yeah, which always puts folks like us in the difficult position of vendor A going, yes, we do that.
You don't need vendor B's product.
And us knowing, but vendor B does it so much better than you do.
Yeah.
Well, I think the value for us, which I know is kind of telling people what the difference is, apples and oranges here, is that these are the things that are critical when you're looking for malware and corrupted data: what kind of analytics you have, full content-based analytics with machine learning. You know, our ability to scale and provide indexing at petabyte class is critical to this application.
The ability to also index backup data. We have a customer that wants to process over two petabytes,
two to three petabytes of data,
process analytics on that stuff on a daily basis.
Ah, now we're talking real IO.
When you get customers that are really resource intensive in terms of indexing,
and I know there's a number of vendors out there that require massive resources and infrastructure to index, you know, hundreds of
terabytes of data, it just becomes unusable. You can't do it. You start, you know, having to do
sampling or subsets of data. And, you know, I think the customers that have been attacked and
have been through these cyber, you know, ransomware attacks really want to get confidence in their
data and check the integrity of it. So being able to support petabyte class
analytics is really a unique feature of Index Engines.
So what's the secret that allows you to do a petabyte of indexing, or a couple of petabytes of indexing, over the course of a day? I mean, in the old days it took something like 10% of the infrastructure to do something like that, right?
Well, if you were going to say tomorrow, let me write an indexing solution, right?
There's two approaches.
You can go and say, let me just go to the open source community and grab open source stuff and put it together and create something.
So Elasticsearch and Lucene-based solutions are very, very dominant out there. A lot of people just go out there and say, let's use Elasticsearch.
We know that that was designed for the Internet.
It came out of the Internet.
It's been enhanced, and it can scale to certain levels, but it can't scale to petabytes.
For a petabyte, you're going to need a ridiculous number of servers, thousands, to process this stuff.
It just doesn't work.
So what we did, and this is the secret sauce that we're very proud of, is we built it from scratch.
So we don't use a database to store the index.
It's actually a dynamic index that's very compressed.
I mean, the footprint of the index itself is about 1% for metadata.
So if you're indexing a terabyte of data or a petabyte of data,
the index footprint is 1% of that, which makes it highly performant and very scalable.
And if I index full text, it's 10 or 15%?
No, it's about 5%. So we store each word once. If we index your data, Howard, we would just store the word Howard once and have pointers to all the documents it exists in.
It was designed to be an enterprise class indexing platform.
Everything we cared about was the scale, the size of the index footprint, the performance, and the speed of it.
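"Store each word once, with pointers to the documents" describes a classic inverted index. Here is a minimal in-memory sketch in Python; the real engine's compressed, on-disk index is of course far more sophisticated than this.

```python
from collections import defaultdict

class InvertedIndex:
    """Minimal inverted index: each term is stored once, mapped to
    the set of document ids containing it (the 'pointers' above).
    Real engines add compression, positions, and an on-disk layout."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, *terms: str) -> set:
        """Docs containing ALL of the given terms."""
        sets = [self.postings.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()

idx = InvertedIndex()
idx.add("mail-001", "merger terms for the Howard account")
idx.add("mail-002", "lunch order for Howard")
print(idx.search("howard"))            # both documents
print(idx.search("howard", "merger"))  # just mail-001
```

Because each unique term is stored once, the index grows with vocabulary rather than raw data size, which is consistent with the small footprint percentages Jim cites.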
So we can crawl through network data and index at a terabyte
per hour. Like I said, the project we did on the West Coast, which was two and a half petabytes, we indexed with five or six virtual servers in about five weeks of indexing time. There was a three-letter-acronym company in there for six months trying to index the data before the company threw them out and said, you're not even 10% done indexing the data.
So we knew that we had to architect something that was unique.
And we put the effort into it and have really spent a lot of time to make it high speed
and high performance.
So when you get customers that say, I want to do analytics on petabytes of data, that doesn't scare us. You know, other vendors are going to talk you out of that, saying, no, you don't need to do that.
Yeah.
So do you guys play in the enterprise search business too?
Because I always think of you as backup data.
We do. I mean, we do. But people don't necessarily buy those solutions. The use case would be more on the legal discovery side, to find data. Empowering all the users to go and search their enterprise content, no one's going to fund that. No one's going to pay for it. But there should be smarter storage, where you embed search into storage, embed analytics into storage, whether it be backup images or network storage. I think that's what companies are going towards: adding that smarts so you can look at not only things like corruption of the data, but other types of analytics that we're adding to the product.
I would have to say storage devices that offer that service have not been that successful.
Now, the problem has always been that people really like their existing storage systems and then want to add the service to them.
You know, the lesson of data gravity wasn't that people didn't want
the services. It was they didn't want to buy a whole new array to get the services.
Yeah. Well, the beauty of us is that we're agnostic to those environments. So we've always
thought of the company as a layer on top of the data center, you know, whether it be your NetApp
boxes or Isilon boxes or your backup data, backup images, to be able to understand what exists,
you know, and manage it and now do analytics on it, right?
And you guys work with files as well as block storage?
Yeah, we're very file-based. So we're looking at the files, we look at emails for the analytics, we're looking at pages of a database.
So we're very much looking at the data.
And kind of in backup, we're kind of unwrapping the backup images.
So we're, you know, any of that, you know, multiplex, compressed backup images, getting inside those and looking at the data.
So from day one, we've always been very focused on files and unstructured data, the stuff that people are managing effectively.
And a lot of backups nowadays are all deduped and that sort of stuff. And how does that play
with you guys? Yeah, so we work with Data Domain very well. So as an example, with dedupe, Data Domain presents the data to us. For example, in the vault, they have a Data Domain in
the cyber recovery vault. Once they replicate data into the cyber recovery vault, the Data Domain
inside the vault attaches the retention lock to it to secure that data. And then their cyber recovery product notifies our CyberSense product that that's done,
and then presents a shadow copy of the data that we crawl through an NFS crawl. So we mount to it and crawl.
And for us, it looks like just an NFS crawl through that content.
Right, because the data domain is rehydrating that data.
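Once the vault presents that shadow copy over NFS, the crawl itself is conceptually simple: mount read-only and walk the tree, collecting per-file metadata for an index. A minimal sketch, assuming a hypothetical mount path and metadata fields; this is not Index Engines' actual implementation:

```python
import os

def crawl_nfs_mount(mount_point):
    """Walk a read-only NFS mount (e.g. a rehydrated shadow copy)
    and yield per-file metadata an indexer could ingest.
    The mount point path and field names are illustrative."""
    for dirpath, _dirnames, filenames in os.walk(mount_point):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # unreadable entry: skip it, never modify anything
            yield {"path": path, "size": st.st_size, "mtime": st.st_mtime}
```

Note the crawl only reads metadata and content; nothing is ever written back, which matches the read-only guarantee discussed below.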
But, you know, if I've used the deduplication built into Commvault or Veeam, can you guys read that data
and understand those dedupe mechanisms as well?
It depends.
There's a specific support matrix of what we do support.
So, you know, if you think about backup,
you're used to seeing, you know,
data that's multiplex or compressed
and coming in different, you know, different buckets.
So we have a very
intelligent dispatcher in our system that manages that stuff and really understands
those complex formats. So that's really our intellectual property is to get inside that.
And you talk to folks like, you know, the IBM TSM folks or the old TSM folks, and they're like,
you know, you can't understand our format. And it's like, well, hand me a 10-year-old TSM tape,
I'll scan it and show you what I can see. And then their eyes light up and it's like, you guys are crazy to have done what you've done here. Right.
And there we have the answer to the other question: what happens if a
vendor doesn't want to cooperate and describe their proprietary format?
Well, none of the vendors have done that. We've looked at it.
We've done nothing illegal.
We're not changing the format or we're not modifying it.
We're not saying, hey, take that old TSM backup image
and save it as a networker image.
Well, that would be a useful service.
Part of why backup applications are sticky is all those tapes in Iron Mountain are my phony baloney archive.
If I don't have ArcServe anymore, how am I going to read them?
Yeah, so that's not what we do.
It's nothing illegal.
We've looked at the format.
You look at the bits and bytes of it, and you can figure it out.
We've done that for Exchange and for Notes and for TSM and NetBackup and UltraBack even.
Right.
Well, none of us wanted to imply that reverse engineering was illegal.
I call it engineering access too.
It sounds a lot kinder and gentler.
Well, yeah, it's not like you're trying to compete.
And you guys are read-only to all of this data, right?
Yes.
We don't go change the data.
So we can't go.
And we do get these requests from some disreputable companies saying,
can you read a backup tape and go delete specific emails on that tape?
The answer to that isn't no, it's oh, hell no.
Well, I mean, the GDPR stuff is kind of interesting because at some level there's an implication that you go back and delete stuff that's been backed up as well.
So it can't be restored.
So, well, that's an interesting topic.
So, you know, a lot of how
customers are handling long-term retention off of backup kind of conflicts with the GDPR. So if
somebody has a right to be forgotten request, so Ray, if you go into, you know, your company and
say, you know, I no longer want you to have my personal data and you've got no regulatory
requirements to keep it. So prove that you delete it. If they've got, you know, old content on old backup tapes that they're using as archives,
they can't really go and delete that stuff. It's a tough game. It's almost a delete-on-restore
kind of thing. It all comes down to, and some court is going to have to tell us what to do,
and it's going to have to be an appellate court. So it's five years from now easy. Yeah. So I think, well, I think the customers that are using,
you know, just storing everything on tape, or even in the cloud these days, for long-term
retention in backup formats, won't really jibe with these personal data regulatory policies that are
coming up more and more. So I think there's definitely a trend
that's happening that needs to be further vetted, further understood, and I think the IT
organizations and the vendors need to provide solutions for this. So we're well positioned there,
so we can go and sit and look at those backup long-term retention content and make sense of it and extract the data value.
You're going to be a hugely useful tool for a lot of organizations just in the figuring out how we're going to deal with GDPR,
knowing where all this data might be stored.
Right.
Well, you know, what we see with GDPR is there's been a lot of people just holding back and just seeing what's happening.
So let's see who gets fined first.
You know, so it's kind of, you know, a little bit of a Russian roulette game going on right now.
But there's people that are like, we're just going to see how this kind of gets enforced.
But what we've seen is companies saying, you know what, it's a good time right now to kind of get a good understanding of what our data is. So data classification, classifying data on, you know, network storage as well as in backup
to figure out why are we keeping this? Start asking those hard questions. Does this have value?
Why are we keeping, you know, thousands of copies of this five-year-old PowerPoint
that no one's accessed in three years? About a product we never actually released.
Exactly.
So people are asking those hard questions and looking at their data.
And I think, you know, if you go through and you do an assessment on this,
and we have a service that does this too, that you can go and say,
let's just look at 100 terabytes and do an assessment and just do a, you know, a study on that.
And people find that typically 30, 40% of it is just stuff that they could delete
tomorrow and no one would ever miss and has no legal hold or regulatory requirements.
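That kind of assessment can be approximated with a simple age-based pass over a file tree. Here is a sketch that tallies how much data hasn't been accessed in three years; a real classification pass would also weigh legal hold, owner, and content type, and the threshold and logic here are illustrative assumptions, not the actual product's method:

```python
import os
import time

STALE_SECONDS = 3 * 365 * 24 * 3600  # "not accessed in three years"

def assess(root, now=None):
    """Tally total bytes vs. bytes in files whose last access time is
    older than the stale threshold. Returns (total, stale, stale_pct)."""
    now = now if now is not None else time.time()
    total = stale = 0
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue  # skip files we can't stat
            total += st.st_size
            if now - st.st_atime > STALE_SECONDS:
                stale += st.st_size
    pct = 100.0 * stale / total if total else 0.0
    return total, stale, pct
```

Running something like this over a 100-terabyte sample is the flavor of study described above: put a number on how much data is candidate for deletion before asking the hard questions.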
So it brings up a couple of questions. You mentioned you're available as a service offering.
I assume there's software licensing as well that you could purchase or how is that?
How do you pay for this? Or how do you buy this, rather? Yeah, we are a software company.
So we sell software.
We provide services that help to sell our software.
We have partners.
A lot of the governance folks that are out there, that are the advisory firms, use us
as well to do some of this data cleanup, data migration, some of these forensic analysis or e-discovery projects.
We have quite a few of those that do those.
So we have a services arm just because customers say,
I've got a hundred terabytes
that I wanna just see how this works.
And kind of you guys are the experts.
So you handhold me through this process,
I'll watch, I'll learn,
and then I can execute those policies on, you know,
another 500 terabytes. So that helps us to sell software. Yeah, yeah, yeah. So you sell the
software as a license. Also, partners have access to your software to do the service,
plus you offer a service. Correct. Yeah, that's pretty typical. A lot of software companies have
services just to help customers get going.
I mean, a lot of customers like the turnkey.
You know, it's like I just don't have the resources to do this,
and I kind of need to prove it out to management that there's a cost savings here.
So, you know, we did, for example, we have a remote catalog management service.
So we had a customer in Spain and Italy that was shutting
down data centers and they had an old NBU and TSM instance that they needed to retire.
So we have the ability, not only on the tape indexing, the tape backup data indexing,
but to ingest the old legacy catalog. So we provided that service from our US office,
where we remotely dialed into France and
Spain and ingested those catalogs, consolidated to TSM and the NetBackup instance, and allowed
them to shut it down. That was done in a couple weeks. They were very happy they didn't have the
resources to do it because those data centers were basically shut down.
You mentioned the cloud as well. Are some of your solutions available in the cloud?
We can run in the cloud. So there's a
lot of people looking at, especially with the cyber product, you know, using CyberSense as a service,
cyber recovery as a service. So we can run in the cloud to do analytics in the cloud as well.
yeah so you could support having like backup or archive data sitting on S3 or something like that and be able to scan it.
I'm not sure that's even available, but yeah.
So a typical deployment of that would be, say you index a bunch of tapes. So you have a thousand
tapes, they've been indexed and you say, hey, 10% of this data I want to keep for long-term
retention. So we can connect to any S3-enabled cloud, so Amazon, Glacier, move it up there, extract
the data, migrate it there, keep the index available so they can search it, and then
recover it from the cloud as well.
We keep the data in a managed index, so moving terabytes or hundreds of terabytes of data onto
disk or into the cloud makes no sense.
Keeping it indexed in a managed repository makes a lot of sense.
And that's what we do with most of those environments.
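The move-and-index step described above can be sketched as: upload selected files to an S3-compatible bucket while recording each object in a searchable index so it can be found and recovered later. The client interface, bucket name, and key scheme below are assumptions for illustration; with boto3, `s3_client = boto3.client("s3")` would satisfy the interface, and its `endpoint_url` parameter lets it point at any S3-enabled cloud:

```python
import os

def migrate_to_s3(s3_client, bucket, paths, index):
    """Upload each file to the bucket and record its location and size
    in the index dict. The key scheme (path under the bucket) and the
    index layout are illustrative, not a real product's format."""
    for path in paths:
        key = path.lstrip("/")  # simple key scheme for the sketch
        s3_client.upload_file(path, bucket, key)
        index[path] = {
            "bucket": bucket,
            "key": key,
            "size": os.path.getsize(path),
        }
    return index
```

Keeping the index local while the bulk data lives in the cloud is the design point made above: you search the managed index, and only pull objects back when a recovery actually needs them.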
Okay.
So could I have my 400 branch offices back up to S3 and then have you index it up in the cloud?
Yeah. I mean, the devil's in the details on those.
I mean, they're...
Which app?
Is it an application you support and formats and all that stuff?
Yeah, I mean, we can run the cloud as long as we can connect to it.
We can index it.
As long as they can present it as like, you know, we can mount to it.
We can scan and index the content.
I mean, a lot of the cloud providers, some of them do funny stuff. So it gets a little bit more complicated, right?
Yeah, yeah, yeah. Hey, this has been great. Howard, any last questions for Jim?
No, I think I get it. I've been, you know, Index Engines has been in my secret bag of tricks for
well over a decade.
Seems like a long time. All right, Jim, anything you'd like to say
to our listening audience?
No, I appreciate the time.
There's a lot of things that we do.
I mean, if there's any questions,
feel free to reach out to us at indexengines.com
and happy to talk to you further
about any of your data challenges that you have.
Okay.
Well, this has been great.
Thank you very much, Jim, for being on our show today.
Thanks, Ray. Thanks, Howard. Appreciate it. Next time, we will talk to another system
storage technology person. Any questions you want us to ask, please let us know. And if you enjoy
our podcast, tell your friends about it. And please review us on iTunes and Google Play,
as this will help get the word out. That's it for now. Bye, Howard. Bye, Ray.
Bye, Jim. Bye, guys. Until next time.