Grey Beards on Systems - 76: GreyBeards talk backup content, GDPR and cyber security with Jim McGann, VP Mkt & Bus. Dev., Index Engines

Episode Date: November 14, 2018

In this episode we talk indexing old backups, GDPR and CyberSense, a new approach to cyber security, with Jim McGann, VP Marketing and Business Development, Index Engines. Jim's an old industry hand that's been around backups, e-discovery and security almost since the beginning. Index Engines' solution to cyber security, CyberSense, is also offered by Dell EMC.

Transcript
Starting point is 00:00:00 Hey everybody, Ray Lucchesi here with Howard Marks here. Welcome to the next episode of the Greybeards on Storage podcast, a show where we get Greybeard storage bloggers to talk with system vendors to discuss upcoming products, technologies, and trends affecting the data center today. This Greybeards on Storage episode was recorded on November 2nd, 2018. We have with us here today Jim McGann, VP of Marketing and Business Development at Index Engines. So Jim, why don't you tell us a little bit about yourself and what's new at Index Engines? Hey, thanks Ray. Thanks Howard. It's great to be joining the podcast today. I've been with Index Engines for almost 15 years now. We've been in business for quite a while.
Starting point is 00:00:50 We are an enterprise indexing company, hence the name. Really, what we are is we add indexing to the data center. So we can layer on top of network storage and crawl network storage to index it. A unique claim to fame that Howard has always loved has been the ability to index backup images. So we've engineered access to NetBackup, Commvault, NetWorker, TSM to be able to understand what's in those formats, whether it be on tape or disk. This really came in handy during my consulting days. As a consultant, I frequently got called in to help clean up after they fired a CIO and other people. And so there would inevitably be the day when someone really needed to retrieve some data and a box would come back from Iron Mountain and it would be tapes that we didn't have tape drives for anymore. Nobody remembered what backup software they used in 1992.
Starting point is 00:01:48 Right. Sounds like NASA. And just, you know, finding the four messages that relate to the sexual harassment suit that we're in the middle of got a lot easier. Yeah, a lot of what our mission in life is, similar to what Google's done for the Internet, making Internet data valuable through search, is to make enterprise information valuable through search, to be able to find it and manage it. So some of the stuff where there's some bodies buried in old backup tapes, as we know that exist,
Starting point is 00:02:21 like the mortgage crisis or harassment lawsuits or, you know, other types of lawsuits, oil spills. You know, people use, companies use, and this could be a whole separate podcast, use tapes for long-term retention and archiving. So is backup an archive is the question. So using it as an archive, you know, if you use it as an archive and it's an old backup format that you don't have anymore, and then you throw it out in the salt mine somewhere in Pennsylvania, how is that an effective archive? And when you need to go and find that silver bullet email or needle in the haystack, how do you do that? Well, I'm glad you used the word silver bullet because, well, everybody's always worried about the liability that might be in the data they're retaining.
Starting point is 00:03:10 But there was a time where I was working as a consultant to an advertising agency that lost an account and got sent an email that said, we're going to extend this account for 90 days while the new advertising agency staffs up and it'll be a million dollars a month. So that's $3 million. And of course they sent that to the account exec who had just been fired because he lost the account. And when it came time to submit the invoice, accounting kicked it back. And I had to go find that $3 million email. So, you know, the stuff in your old data isn't necessarily poison pills. Sometimes it really is that magic bean that you've been looking for. Yeah, we have customers that really, you know, engineering companies that manufacture bridges or buildings. We have a customer in the UK that rebuilt some infrastructure in London. All that data, the data of value that could be repurposed,
Starting point is 00:04:13 engineering documents and so on, if they archive them on backup tapes, they have no value. We've been working with a lot of those organizations that use our technology to go mine that data on old tapes. It could be an old TSM tape where now they're using a product like Avamar or NetWorker, to go find the data value: find me file types of this type, find me documents or PDFs of this type or these types of AutoCAD documents, and extract them into a cloud archive that makes it much more usable to regain and reaccess that intellectual property. So what we do, yeah, it's not always bad. There is a lot of bad. I mean, we worked with a law firm, and this is actually
Starting point is 00:04:52 an interesting story and I can talk about it because the company no longer exists, you know, where they were a hedge fund, remember hedge funds back in the day, but they were ahead of their time. And they allegedly hired someone from Microsoft and then started trading on Microsoft secrets. So the SEC was after them for years and years and years. And all the information on their Exchange server and their internal networks was cleansed. And they hired these forensic analysts and these e-discovery folks to go find it. No one could find it. So the CEO was in the process of going through a divorce. The wife basically knew what the SEC was looking for and found an email, a copy of the email, on the home computer and brought it to the SEC. That was fine, that was a silver bullet, but the SEC said, well, we need to see it on their Exchange server.
Starting point is 00:05:49 So they hired a company that used our technology. They said, we'll go find the tapes that are the backups of the Exchange server from June, whatever year it was. They did that. They scanned it. They said, type these keywords in to search for this email, and lo and behold, it was there, buried in some secret folder, and was found, and that company no longer exists, and the SEC won that battle, right? So Shakespeare was right. Hell hath no fury like a woman scorned. Let's not go there, Howard. It's not good.
Starting point is 00:06:20 So what's new at Index Engines? You guys were just at Tech Field Day, right? Yeah, so what we've been doing is obviously we've been in the business of being able to help customers manage legacy tape data. So a lot of legal discovery work, a lot of people that want to go tapeless and just eliminate their tape archives and move the data to the cloud. We have partnerships with AWS to migrate that, or other cloud providers as well. And that's been a solid business for us for a long, long time. Because like Howard said, you could walk in the room and hand me 10-year-old TSM tapes, LTO-1, DLT even, and we could scan them from beginning to end without having the backup software, index the content, and surgically extract an individual email out of Exchange,
Starting point is 00:07:05 out of the backup format, back onto disk, maintain the integrity so it can be used for legal discovery. So we're seeing an uptick in that business because of the EU GDPR. I mean, customers using tapes as an archive and managing personal data on offline tapes is not necessarily a good strategy. So things like that are having customers reach out to us and say, help us clean up our legacy tape museums that are sitting in offsite storage.
Starting point is 00:07:33 So the GDPR has also kicked up our online, you know, network data indexing. So finding personal data in their files and their networks to be able to manage that. So when you have a right to be forgotten request, or other requests required by, you know, personal data regulations, that's not only the GDPR, but California has a personal privacy act, and they're popping up all over the place. So we're seeing a lot of activity there. So the interesting thing is, what customers are doing is, you know, the first phase is let's take a look at our storage and kind of just start cleaning up the mess.
Starting point is 00:08:10 You know, the stuff that's made all these storage vendors rich, of just hoarding and hoarding stockpiles of, you know, petabytes of data on old NetApp filers. It's like, let's go look at that stuff. And we had one company, an electronics manufacturer on the West Coast, that said, let's do an assessment of two and a half petabytes of data. So we were able to scan that in a couple of weeks at a metadata level. They did a profile or assessment and found that about a petabyte of it was a combination of old log files that had zero value, which they were backing up and protecting for decades, and a bunch of useless files that they immediately purged. So they reclaimed 50% of that capacity. And then they started looking at, you know, the other half of it to figure out what had value. So, you know, when you have a personal data management issue, you know, doing it across multiple petabytes is
Starting point is 00:08:59 challenging for anybody, but when you can reduce that footprint and find the data value and then help manage that. So that's been a very hot space for us, especially the past year. Yeah. And IT has always wanted to throw things away, but has never felt empowered enough to make the value judgment to say, this data is actually worthless. Yeah. Yeah. Well, it's not part of their mission statement. It's really the business users. And every company has had records managers, but they haven't really been empowered to make these decisions. Well, and they've always considered themselves to have a very narrow scope. You know, the records management people are only concerned with, you know, the things the SEC is going to sue us about not keeping. Right. But I think things like these privacy or personal data regulations, once there's some pretty serious sanctions or fines, you know, with real companies other than the Facebooks of the world, people will start to
Starting point is 00:09:56 take notice. And I think they will start to ask questions: why are we keeping this? What are we keeping? Does it have business value? Who can get access to this stuff? All those questions that they should have been asking, and looking at the life cycle of data. And I remember, you know, a decade ago we talked about ILM and managing data properly. The whole archive thing. And a decade before that we talked about HSM and it was the same thing. Yeah. So I think maybe third time's a charm. Didn't Shakespeare, I think Shakespeare said that as well, Howard, right?
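To make that first-phase assessment concrete: the crawl Jim describes reads only metadata, file age, type, and size, never the content. A minimal Python sketch of the idea, where the mount point, staleness cutoff, and junk extensions are all assumptions rather than anything Index Engines ships:

```python
import os
import time
from collections import Counter

STALE_YEARS = 5                       # assumption: untouched this long = stale
JUNK_EXTS = {".log", ".tmp", ".bak"}  # hypothetical "zero value" extensions

def profile_tree(root):
    """Metadata-only crawl: bucket capacity by extension and flag stale bytes."""
    now = time.time()
    by_ext = Counter()
    stale = total = 0
    for dirpath, _, names in os.walk(root):
        for name in names:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue                  # unreadable file; skip it
            total += st.st_size
            by_ext[os.path.splitext(name)[1].lower()] += st.st_size
            if now - st.st_mtime > STALE_YEARS * 365 * 86400:
                stale += st.st_size
    return total, stale, by_ext

total, stale, by_ext = profile_tree("/mnt/filer")   # hypothetical mount point
if total:
    print(f"stale: {stale / total:.0%} of {total} bytes")
    print("junk bytes:", sum(by_ext[e] for e in JUNK_EXTS))
```

A pass like this is what makes the case for purging before attempting content-level classification on what remains.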
Starting point is 00:10:29 So I think we're definitely seeing a lot of traction in the GDPR. But one of the interesting things that we spoke with Howard about a few weeks ago at Tech Field Day was we've added analytics to our index. So as you're indexing the data, you know, indexing at a metadata level is interesting to do some, you know, tiering and some, you know, finding, you know, obsolete data or redundant data or so on. But when you go inside the content, as we do to do keyword search, you know, for legal discovery, we also added analytics to the product. And when you say analytics, you're not talking about performance per se as much as content, is that?
Starting point is 00:11:14 Yes, exactly. So we're really looking at more the integrity of the data. So, you know, in terms of cyber and the ransomware and the cyber criminals that exist out there, you know, we know that they are getting into data centers. So there are real-time protection tools, you know, things like McAfee that do signature-based. There's Varonis that's looking at, you know, user behavior. Those things are not 100 percent. So we know that data centers are getting breached and the data is being corrupted. So what we've added to our product is the ability to look inside the content.
Starting point is 00:11:54 So it's not just metadata-based analytics, which a number of vendors are doing, which really doesn't have a lot of value. It's content-based analytics. So we're looking at some high-level stuff like file type mismatch, reading the header of a file. So we know they're attacking things like office documents. So read the header of a Word document and say, based on the header this is a Word document, but the file extension is .loki or .encrypted. So that's not good. We've also added the ability to look at the entropy of files or pages in a database. So we know when a file becomes encrypted, the entropy score goes to 99. So we've created a score from zero to 100. So show me all files with an entropy score of 99 on the network.
Starting point is 00:12:40 You know, Word documents or PDFs that have an entropy score of 99, that looks like corruption. When you say entropy, you're talking about access or write data or? No, we use an algorithm really to look at the, you know, random disorder of the file. So, you know, when a file becomes corrupted, it becomes much more disordered, or a page of a database. Content entropy. Oh, that's interesting. Okay, I got you. Yeah.
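Jim doesn't name the algorithm, but byte-frequency Shannon entropy, rescaled from 0-8 bits per byte to a 0-100 score, is the standard way to measure that disorder. A minimal sketch, not Index Engines' actual implementation:

```python
import math

def entropy_score(path, sample=1 << 20):
    """Byte-frequency Shannon entropy, rescaled from 0-8 bits/byte to 0-100."""
    with open(path, "rb") as f:
        data = f.read(sample)        # sampling the head is usually enough
    if not data:
        return 0
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    n = len(data)
    bits = -sum(c / n * math.log2(c / n) for c in counts if c)
    return round(bits / 8 * 100)

# A plain-text file typically lands around 50-60;
# an encrypted copy of the same file lands at 99-100.
```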
Starting point is 00:13:05 We've also created similarity scores, for when documents become very dissimilar, looking for corrupt files where they strip out content. So, basic analytics that are really indicative of a cyber attack. And a lot of the attacks are doing the same thing. They're doing encryption of files, encryption of pages of a database. They're corrupting files.
Starting point is 00:13:30 They're changing extensions of the files. They're doing all basically the same kind of stuff, and mass deletions as well. So what we're doing is creating analytics based on that behavior. And then we're applying machine learning, which we've trained with all the recent malware that exists in the market, thousands and thousands of malware samples, to be able to look at the statistics and find behavior that's indicative of an attack. So at that point, it allows customers to look for that content and find data that is potentially corrupt.
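As a rough illustration of that training step, not CyberSense's actual model: fit a classifier on per-scan statistics, with a few made-up feature vectors standing in for the thousands of malware runs Jim mentions:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features per observation: [share of files with entropy > 95,
# share with header/extension mismatch, deletion rate, rename rate].
X = [
    [0.001, 0.000, 0.002, 0.001],   # normal daily churn
    [0.003, 0.001, 0.001, 0.002],   # normal
    [0.310, 0.280, 0.050, 0.400],   # simulated ransomware run
    [0.020, 0.010, 0.700, 0.050],   # simulated mass deletion
]
y = [0, 0, 1, 1]                    # 0 = clean, 1 = attack-like

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("suspect" if clf.predict([[0.25, 0.30, 0.04, 0.35]])[0] else "clean")
```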
Starting point is 00:14:14 So for you guys to do this sort of thing, it's no longer just a one-shot event that you come in and index and stuff like that. It's more of an ongoing type of solution. Is that how you'd classify it? Exactly right. That's a good setup here. So's more of an ongoing type of solution. Is that how you'd classify it? Exactly right. That's a good setup here. So it's really about observations of the data. And if you think about data protection, so what has been data protection? Data protection has been to support disaster recovery, you know, building burns down, earthquake,
Starting point is 00:14:39 whatever it is, right? Well, it depends on the degree of the disaster frequently. Frequently, it's just a user being stupid. A non-stupid user making a mistake, for that matter. Yeah. Assuming they haven't eliminated all the stupid users out there. We can only dream. So what it's really doing is, you know, changing disaster recovery and making the cyber, you know, a cyber incident just part of that, looking at the data integrity and recovering from some of the cyber attack or corruption of the data, just like any other disaster.
Starting point is 00:15:24 So, you know, I was in a presentation yesterday where they were talking about how, you know, cyber is the new disaster. It's really about how the data is corrupted. And if you look at attacks like Sony, you know, the poster child, and others, they happen over time. So as data is slowly being corrupted, if you're continually doing observations of the data, you're looking at it every day, every couple days, every week, you can find changes in the content that are indicative of a cyber attack. And if you find it tomorrow, you can replace those files with the last good copy, like disaster recovery solutions do, and continue on with your business without any business interruption. Well, I do have to track down whoever it is that opened the phishing email and has the malware. Yeah, and eliminate the possibility of it continuing. But yeah, yeah, yeah.
Starting point is 00:16:11 Ray, I've told you a couple of times we're not allowed to kill users anymore. Not in this country anyway. But what we've also added is forensic tools, right? So exactly right, Howard. So if you see 1,000 files were deleted or 1,000 files were encrypted, we can also index the Windows event logs, and we can tell you who modified those 1,000 files. So if they were all modified by one user, John Doe, then that account was breached,
Starting point is 00:16:42 and that account needs to be shut down immediately because that's being used to execute the ransomware. We will also point to the executable that did that action, that did the encryption or the deletion. So that should point to your malware. So beyond just looking at the corruption of data, we can use the forensic tools to analyze that corruption activity, to do kind of a, you know, Inspector Clouseau on who did it, which user account did it, what executable did it, to allow the cyber engineers to clean that stuff up, right? Well, hopefully more Hercule Poirot than Clouseau, because Clouseau was not a very good detective. He was a little bumbly, but he got it done, right?
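A toy version of that attribution step, assuming the Windows event-log records have already been parsed into dicts; the field names and threshold are hypothetical, not Index Engines' schema:

```python
from collections import Counter, defaultdict

# Hypothetical pre-parsed file-modification events from the Windows event logs.
events = [
    {"user": "jdoe", "exe": r"C:\Temp\invoice.exe", "file": r"\\nas\docs\a.docx"},
    {"user": "jdoe", "exe": r"C:\Temp\invoice.exe", "file": r"\\nas\docs\b.xlsx"},
    {"user": "mray", "exe": r"C:\Windows\explorer.exe", "file": r"\\nas\docs\c.pptx"},
]

mods = Counter(e["user"] for e in events)   # modifications per account
exes = defaultdict(Counter)                 # executables used per account
for e in events:
    exes[e["user"]][e["exe"]] += 1

THRESHOLD = 2   # assumption: this many suspect modifications flags an account
for user, n in mods.items():
    if n >= THRESHOLD:
        exe = exes[user].most_common(1)[0][0]
        print(f"lock account {user!r}: {n} files modified via {exe}")
```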
Starting point is 00:17:30 So, I mean, you guys would have to be sitting in most of the systems, accessing all the storage on a periodic basis and doing this entropy scoring and maintaining this information over time. Is that kind of what you guys are doing? Well, the beauty of it, and so our product is a standalone product, but we've also partnered with the Dell EMC Cyber Recovery products. So if you think about it, what their angle is, is isolate the crown jewels into a vault. So get it off the network, air gap it off the network so it's isolated,
Starting point is 00:18:03 that if you are attacked, you can take this crown jewel data and recover your business, and integrate it as part of the backup process on a Data Domain, for example, right? So that's a well-proven, well-defined process. So if customers are using NetWorker or Avamar, Commvault, TSM, Spectrum Protect, sorry, to back data up into an isolated vault, Index Engines CyberSense understands those formats and can look at the data, look at what's happening inside the data. The beauty of that is it doesn't need to rehydrate the data out of the backup image, so it's
Starting point is 00:18:39 not going to allow the ransomware or malware to execute any further. We look at it and do an integrity check on the data, and then come up with a decision like a red light, green light: you know, green light, everything looks okay; red light would kick into that forensic process that I just discussed. So rather than actually looking at the storage, you're looking at the backup stream and verifying that the backup stream is still consistent and good. Is that it? Yeah, we're not in the path of backup. We're looking at the backup image.
Starting point is 00:19:07 We can look at storage. So I think there's a hybrid environment where customers may want to put data into a vault. I think the vault has a lot of advantages of isolating it because the attackers won't know it exists, so they can't attack it. So it's protecting that environment. But if there's stuff
Starting point is 00:19:26 that's not moved into the vault, so maybe legal contracts or financial documents that you want to scan and just do an integrity check. So maybe a company just wants to say, scan that server, that legal server that has a bunch of legal documents and give me an average entropy score of those files. And my threshold for that is going to be about an 89 score, 89 out of 100, anything higher than 89 I want to look into further. Or tell me the files on there that have entropy scores of 99, I want to investigate those. So you may not want to do it.
Starting point is 00:20:00 Can you compare to the last scan? Yeah, so the observations that we do are constantly compared. So it's comparing statistics from one scan to the next and then applying the machine learning to see how it's changed. Yeah, because I'm really concerned when the entropy level of a group of files in some place together changes. Right, right. I mean, you can do a baseline.
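A minimal sketch of that scan-to-scan comparison, with made-up per-directory entropy summaries and an assumed alert threshold:

```python
JUMP = 20   # assumption: a +20-point entropy move between scans is suspicious

def compare_scans(prev, curr, jump=JUMP):
    """Flag directories whose average entropy rose sharply since the baseline."""
    return [(d, prev.get(d, s), s)
            for d, s in curr.items()
            if s - prev.get(d, s) > jump]

# Hypothetical summaries: directory -> average entropy score (0-100).
day1 = {"/mnt/legal": 52, "/mnt/finance": 61}
day2 = {"/mnt/legal": 53, "/mnt/finance": 97}   # finance jumped overnight

for folder, was, cur in compare_scans(day1, day2):
    print(f"{folder}: entropy {was} -> {cur}, investigate")
```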
Starting point is 00:20:23 So if you're doing a day one scan and saying, you know, just tell me if there's anything strange here. I mean, the obvious thing is, show me known ransomware extensions that exist on that server. I mean, that's a no-brainer. Show me, you know, files that have high entropy scores. Show me files that don't match their extension. Show me databases that have high entropy score pages. So there's a sense of using the analytics to just create an integrity check on your data, whether it be as part of your backup process or whether it just be network files that are being scanned. So you're not doing anything like a checksum of the data as it exists, and maintaining that checksum over time to make sure it's not changed and stuff like that. You're actually looking at the content of the data and trying to determine the randomness nature of it. That's what entropy is, I guess, right? That's one piece of it, entropy.
Starting point is 00:21:16 There's over 40 statistics that are being generated. We're adding another 40 onto that. So some of the statistics would be file type mismatch, you know, when the extension doesn't match the actual file type, or known ransomware extension, or, you know, file corruption, for example. So it's looking at a whole bunch of different statistics and then using the machine learning, which was trained on all the common ransomware, to say, based on those statistics, it's going to tell you the attack vector that actually executed that corruption. Oh, that's pretty impressive. And so, the other thing you mentioned is that for the most part, you're looking at the backup data rather than the active data. So you're not really impacting any of the performance of the current
Starting point is 00:22:05 storage per se, if that's what they want, right? Yeah, again, we could look at either. But I think it's a natural extension of the backup process. So it turns disaster recovery into more data governance and data integrity. So it's adding significant value. And I think that that's really, you know, as Howard mentioned, we partnered with Dell EMC on this. That's what Dell EMC says is adding the value to this. And the competitive advantage is, you know, it's not just looking at, you know, maybe some metadata analytics, or looking at the backup image to see if any kind of corruption has existed on it. Although it does that check, which is very important. All too often, I have had backup systems tell me, oh yeah, we backed up your data fine. Everything's fine. No problem. Only to discover that they were only doing the most cursory of checks. And really, I said, yeah, we got that. So they're only doing metadata-based analytics. So they're just looking at high-level metadata of files. So we took our analytics and
Starting point is 00:23:30 we turned off the content and just did metadata only, and then compared it to turning the content on. So when the content's on, we find a detection error rate of 0.5%. So the issue that we're seeing with a lot of these other vendors is false positives or false negatives. So when you have a lot of false positives, you know, you lose any kind of integrity in terms of, you know, the cyber folks or the backup folks looking at it. It's like, it's always just false positives, so why should I bother to look at it? Well, nobody ever replies to the application that pages them every night and says something's wrong. Exactly. So with what we're doing, content-based analytics, we're finding a half percent detection
Starting point is 00:24:12 error rate, so minimizing the false positives. When we turned off the content and just did metadata only, like some of the other vendors are doing, the detection error rate jumped to 11%, 22 times higher than what we see. So if you have an 11% detection error rate, no one's going to spend any time looking at that stuff. You're going to get, like Howard said, you're going to get emails, you know, every 10 minutes saying, you know, potential corruption, potential corruption.
Starting point is 00:24:39 And if you're just looking at metadata, you're not going to read the file header to determine what the real file type is, whether this is truly a Word document, to look at the file type mismatch, which is one of the most common malware behaviors, changing the extension to .loki or .encrypted or .lol. You're not going to find stuff like that. That's a lot easier to recover from than actually encrypting the data. Yeah, so we're pretty excited about it. We're getting some really good response.
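The header check Jim keeps coming back to is a file-signature (magic bytes) comparison. A small sketch with a deliberately tiny signature table, using the extensions named in the episode; real tooling such as libmagic carries far larger tables:

```python
import os

MAGIC = {
    b"%PDF": {".pdf"},
    b"PK\x03\x04": {".docx", ".xlsx", ".pptx", ".zip"},   # OOXML is a zip
    b"\xd0\xcf\x11\xe0": {".doc", ".xls", ".ppt"},        # legacy OLE2 Office
}
RANSOM_EXTS = {".loki", ".encrypted", ".lol"}   # extensions named above

def check_file(path):
    """Compare a file's leading bytes against its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext in RANSOM_EXTS:
        return "known ransomware extension"
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, valid in MAGIC.items():
        if head.startswith(magic) and ext not in valid:
            return f"mismatch: header says {sorted(valid)}, name says {ext}"
    return "ok"
```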
Starting point is 00:25:10 I think, you know, the partnership with Dell EMC is incredibly exciting. There's a lot of customers building vaults today to isolate these environments and to start doing analytics on the data to check the integrity of it. So it's a new feature. We announced it over the past year. Dell EMC just formally announced the product last month. And it's a good go-to-market partner for us. It seems, Jim, you could actually partner with just about every backup vendor out there, right? I mean, A, you currently support just about every one of their formats. And B, this would be a great add-on tool to all of them. Yeah, yeah, yeah, no doubt.
Starting point is 00:25:52 I was just wondering if there's any exclusivity with Dell EMC or is this something that other backup vendors should be talking to you about? Yeah, well, I mean, I can't tell you about our relationship with them, but we have a strong partnership with them. You know, we do talk to other backup, other partners. You know, some of them, a lot of them have added some of their own analytics. So you kind of compete against what they're doing in a sense. Yeah, which always puts folks like us in the difficult position of vendor A going, yes, we do that. You don't need vendor B's product.
Starting point is 00:26:23 Yeah. Well, I think, I mean, the value of you guys, which I know is, you know, kind of telling people what the difference is, apples and oranges here, is like, these are the things that are critical when you're looking for malware and corrupted data: what kind of analytics, and full content-based
Starting point is 00:26:45 analytics with machine learning. You know, our ability to be able to scale and provide, you know, indexing in a petabyte class is critical to this application. The ability to also index backup data. We have a customer that wants to process over two petabytes, two to three petabytes of data, process analytics on that stuff on a daily basis. Ah, now we're talking real IO. When you get customers that are really resource intensive in terms of indexing, and I know there's a number of vendors out there that require massive resources and infrastructure to index, you know, hundreds of
Starting point is 00:27:25 terabytes of data, it just becomes unusable. You can't do it. You start, you know, having to do sampling or subsets of data. And, you know, I think the customers that have been attacked and have been through these cyber, you know, ransomware attacks really want to get confidence in their data and check the integrity of it. So being able to support petabyte-class analytics is really a unique feature of Index Engines. So what's the secret that allows you to do a petabyte of indexing, or a couple of petabytes of indexing, over the course of a day? I mean, without, you know, in the old days, it was like 10% of the infrastructure or something like that to do something like that, right? Well, so if you were tomorrow going to say, let me write an indexing solution, right?
Starting point is 00:28:11 There's two approaches. You can go and say, let me just go to the open source community and grab open source stuff and put it together and create something. So Elasticsearch, Lucene-based solutions are very, very dominant out there. A lot of people just go out there and say, let's use Elasticsearch. We know that that was designed for the Internet. It came out of the Internet. It's been enhanced, and it can scale to certain levels, but it can't scale to petabytes. For a petabyte, you're going to need a ridiculous number of servers, thousands of servers, to process this stuff.
Starting point is 00:28:44 It just doesn't work. So what we did, and this is the secret sauce that we're very proud of, is we built it from scratch. So we don't use a database to store the index. It's actually a dynamic index that's very compressed. I mean, the footprint of the index itself is about 1% for metadata. So if you're indexing a terabyte of data or a petabyte of data, the index footprint is 1% of that, which makes it high performant, very scalable. And if I index full text, it's 10 or 15%?
Starting point is 00:29:15 No, it's about 5%. So we store each word once. If we index your data, Howard, we would just store the word Howard once and have pointers to all the documents it exists in. It was designed to be an enterprise-class indexing platform. Everything we cared about was the scale, the size of the index footprint, the performance, and the speed of it. So we can crawl through network data and index at a terabyte per hour. Like I said, the project we did on the West Coast, which was two and a half petabytes, we indexed with five or six virtual servers in about five weeks of indexing time. There was a three-letter-acronym company in there for six months trying to index the data when the company threw them out and said, you're not even 10% done indexing the data.
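The store-each-word-once structure Jim describes is a classic inverted index; a toy sketch of the idea, not the compressed on-disk format Index Engines built:

```python
from collections import defaultdict

class InvertedIndex:
    """Each term is stored once, mapped to the IDs of documents containing it;
    that single-copy-plus-postings layout is what keeps the footprint small."""

    def __init__(self):
        self.postings = defaultdict(set)        # term -> {doc_id, ...}

    def add(self, doc_id, text):
        for term in set(text.lower().split()):  # each word once per document
            self.postings[term].add(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))

idx = InvertedIndex()
idx.add(1, "Howard asked about backup formats")
idx.add(2, "Howard and Ray discussed GDPR")
print(idx.search("Howard"))   # -> [1, 2]
```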
Starting point is 00:30:08 So we knew that we had to architect something that was unique. And we put the effort into it and have really spent a lot of time to make it high speed and high performance. So when you get customers that say, I want to do analytics on petabytes of data, that doesn't scare us. You know, for other vendors, they're going to talk you out of that and saying, no, you don't need to do analytics on petabytes of data. That doesn't scare us. You know, for other vendors, they're going to talk you out of that and saying, no, you don't need to do that. Yeah. So do you guys play in the enterprise search business too?
Starting point is 00:30:39 Because I always think of you as backup data. We do. I mean, we do. People don't buy those solutions necessarily. The use case would be more on the legal discovery side, you know, to find data. It's, you know, empowering all the users to go and search their enterprise content. I mean, no one's going to fund that. No one's going to pay for it. But I mean, there should be, you know, there should be smarter storage, you know, where you embed, you know, search into storage, you embed analytics into storage, whether it be backup images or whether it be network storage.
Starting point is 00:31:15 I mean, I think that's what companies are going towards is to be able to add that smarts to it so you can look at not only, you know, things like corruption of the data, but look at other types of analytics that we're adding to the product. I would have to say storage devices that offer that service have not been that successful. Now, the problem has always been that people really like their storage systems and then want to add the service. You know, the lesson of data gravity wasn't that people didn't want the services. It was they didn't want to buy a whole new array to get the services. Yeah. Well, the beauty of us is that we're agnostic to those environments. So we've always thought of the company as a layer on top of the data center, you know, whether it be your NetApp boxes or Isilon boxes or your backup data, backup images, to be able to understand what exists,
Starting point is 00:32:06 you know, and manage it and now do analytics on it, right? And you guys work with files as well as block storage? Yeah, we're very file-based. So we're looking at the files. Yeah, yeah. So very much we look at, we look at files, we look at emails for the analytics. We're looking at pages of a database. So we're very much looking at the data. And kind of in backup, we're kind of unwrapping the backup images. So we're, you know, any of that, you know, multiplex, compressed backup images, getting inside those and looking at the data. So from day one, we've always been very focused on files and unstructured data, the stuff that people are managing effectively.
Starting point is 00:32:45 And a lot of backups nowadays are all deduped and that sort of stuff. And how does that play with you guys? Yeah, so we work with Data Domain very well. So as an example, so deduped: Data Domain presents the data to us. So, for example, in the vault, they have a Data Domain in the cyber recovery vault. Once they replicate data into the cyber recovery vault, the Data Domain inside the vault attaches the retention lock to it to secure that data. And then their product, which is their cyber recovery product, notifies our CyberSense product that that's done, and then presents a shadow copy of the data that we crawl through an NFS crawl. So mount to it and crawl. And for us, it looks like just an NFS crawl through that content.
Starting point is 00:33:30 Right, because the Data Domain is rehydrating that data. But, you know, if I've used the deduplication built into Commvault or Veeam, can you guys read that data and understand those dedupe mechanisms as well? It depends. There's a specific support matrix of what we do support. So, you know, if you think about backup,
Starting point is 00:33:57 you're used to seeing, you know, data that's multiplex or compressed and coming in different, you know, different buckets. So we have a very intelligent dispatcher in our system that manages that stuff and really understands those complex formats. So that's really our intellectual property is to get inside that. And you talk to folks like, you know, the IBM TSM folks or the old TSM folks, and they're like, you know, you can't understand our format. And it's like, hand me a 10-year-old TSM tape,
Starting point is 00:34:24 I'll scan it and I'll show you what I can see. And then their eyes light up, and it's like, you guys are crazy that have done what you've done here. Right. And there we have the answer to the other question, which is, what happens if a vendor doesn't want to cooperate and describe their proprietary format? Well, none of the vendors have done that. We've looked at it. We've done nothing illegal. We're not changing the format or we're not modifying it. We're not saying, hey, take that old TSM backup image
Starting point is 00:34:56 and save it as a NetWorker image. Well, that would be a useful service. Part of why backup applications are sticky is all those tapes in Iron Mountain are my phony baloney archive. If I don't have ArcServe anymore, how am I going to read them? Yeah, so that's not what we do. It's nothing illegal. We've looked at the format. You look at the bits and bytes of it, and you can figure it out.
Starting point is 00:35:27 We've done that for Exchange and for Notes and for TSM and NetBackup and UltraBack even. Right. Well, none of us wanted to imply that reverse engineering was illegal. I call it engineering access too. It sounds a lot kinder and gentler. Well, yeah, it's not like you're trying to compete. And you guys are read-only to all of this data, right? Yes.
Starting point is 00:35:54 We don't go change the data. So we can't go. And we do get these requests from some unreputable companies saying, can you read a backup tape and go delete specific emails on that tape? The answer to that isn't no, it's oh, hell no. Well, I mean, the GDPR stuff is kind of interesting because at some level there's an implication that you go back and delete stuff that's been backed up as well, so it can't be restored. So, well, that's an interesting topic.
Starting point is 00:36:23 So, you know, a lot of how customers are handling long-term retention off of backup kind of conflicts with the GDPR. So if somebody has a right to be forgotten request, so Ray, if you go into, you know, your company and say, you know, I no longer want you to have my personal data and you've got no regulatory requirements to keep it. So prove that you delete it. If they've got, you know, old content on old backup tapes that they're using as archives, they can't really go and delete that stuff. It's a tough game. It's almost delete on restore kinds of thing. It all comes down to, and some court is going to have to tell us what to do, and it's going to have to be an appellate court. So it's five years from now easy. Yeah. So I think, well, I think the customers that are using,
Starting point is 00:37:09 you know, just storing everything on tape or, you know, even in the cloud these days, for long-term retention in backup formats, won't really jibe with these personal data regulatory policies that are coming up more and more. So I think there's definitely a trend that's happening that needs to be, um, further vetted, further understood, and I think the IT organizations and the vendors need to provide solutions for this. So we're well positioned there. We can go and look at that backup long-term retention content and make sense of it and extract the data value. You're going to be a hugely useful tool for a lot of organizations just in figuring out how we're going to deal with GDPR, knowing where all this data might be stored.
Starting point is 00:37:59 Right. Well, you know, what we see with GDPR is there's been a lot of people just holding back and seeing what's happening. So let's see who gets fined first. You know, so it's kind of, you know, a little bit of a Russian roulette game going on right now. There's people that are like, we're just going to see how this kind of gets enforced. But what we've seen is companies saying, you know what, it's a good time right now to kind of get a good understanding of what our data is. So data classification, and classifying data on, you know, network storage as well as in backup, to figure out: why are we keeping this? Start asking those hard questions. Does this have value? Why are we keeping, you know, thousands of copies of this five-year-old PowerPoint
Starting point is 00:38:39 that no one's accessed in three years? About a product we never actually released. Exactly. So people are asking those hard questions and looking at their data. And I think, you know, if you go through and you do an assessment on this, and we have a service that does this too, you can go and say, let's just look at 100 terabytes and do an assessment, just do a, you know, a study on that. And people find that typically 30, 40% of it is just stuff that they could delete tomorrow and no one would ever miss, and it has no legal hold or regulatory requirements.
Starting point is 00:39:12 So it brings up a couple of questions. You mentioned you're available as a service offering. I assume there's software licensing as well that you could purchase, or how is that? How do you pay for this? Or how do you buy this, rather? Yeah, we are a software company. So we sell software. We provide services that help to sell our software. We have partners. A lot of the governance folks that are out there, the advisory firms, use us as well to do some of this data cleanup, data migration, some of these forensic analysis or e-discovery projects.
Starting point is 00:39:47 We have quite a few of those that do those. So we have a services arm just because customers say, I've got a hundred terabytes that I wanna just see how this works. And kind of you guys are the experts. So you handhold me through this process, I'll watch, I'll learn, and then I can execute those policies on, you know,
Starting point is 00:40:06 another 500 terabytes. So that helps us to sell software. Yeah, yeah, yeah. So you sell the software as a license. Also, partners have access to your software to do the service, plus you offer a service. Correct. Yeah, that's pretty typical. A lot of software companies have services just to help customers get going. I mean, a lot of customers like the turnkey. You know, it's like I just don't have the resources to do this, and I kind of need to prove it out to management that there's a cost savings here. So, you know, we did, for example, we have a remote catalog management service.
Starting point is 00:40:44 So we had a customer in Spain and Italy that was shutting down data centers, and they had an old NBU and TSM instance that they needed to retire. So we have the ability, not only on the tape backup data indexing, but also to ingest the old legacy catalog. So we provided that service from our US office, where we remotely dialed into France and Spain and ingested those catalogs, consolidated the TSM and the NetBackup instances, and allowed them to shut it down. That was done in a couple weeks. They were very happy; they didn't have the resources to do it because those data centers were basically shut down.
Starting point is 00:41:18 You mentioned the cloud as well. Are some of your solutions available in the cloud? We can run in the cloud. So there's a lot of people looking at, especially with the cyber product, you know, using cyber as a service, cyber recovery as a service. So, you know, we can run in the cloud to do analytics in the cloud as well. Yeah, so you could support having, like, backup or archive data sitting on S3 or something like that and be able to scan it. I'm not sure that's even available, but yeah. So a typical deployment of that would be, say you index a bunch of tapes. So you have a thousand tapes, they've been indexed, and you say, hey, 10% of this data I want to keep for long-term
Starting point is 00:41:58 retention. So we can connect to any S3-enabled cloud, so Amazon, Glacier, move it up there, extract the data, migrate it there, keep the index available so they can search it, and then recover it from the cloud as well. We keep the data in a managed index, so moving terabytes or hundreds of terabytes of data onto disk or into the cloud makes no sense. Keeping it indexed in a managed repository makes a lot of sense. And that's what we do with most of those environments. Okay.
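A minimal sketch of that extract-and-tier step using boto3, assuming AWS credentials are already configured; the bucket, paths, and catalog structure are hypothetical:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-tape-archive"   # hypothetical bucket name

def tier_to_s3(local_path, key, catalog):
    """Upload an extracted file to the Glacier storage class and record its
    location in a local catalog so it stays searchable without a download."""
    s3.upload_file(local_path, BUCKET, key,
                   ExtraArgs={"StorageClass": "GLACIER"})
    catalog[key] = {"bucket": BUCKET, "storage_class": "GLACIER"}

catalog = {}
tier_to_s3("/restore/2009/contract.pdf", "2009/contract.pdf", catalog)
```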
Starting point is 00:42:34 So could I have my 400 branch offices back up to S3 and then have you index it up in the cloud? Yeah. I mean, the devil's in the details on those. I mean, they're... Which app? Is it an application you support, and formats and all that stuff? Yeah, I mean, we can run in the cloud as long as we can connect to it. We can index it. As long as they can present it so that, you know, we can mount to it.
Starting point is 00:43:02 We can scan and index the content. I mean, a lot of the cloud providers, some of them do funny stuff. So it gets a little bit more complicated, right? Yeah, yeah, yeah. Hey, this has been great. Howard, any last questions for Jim? No, I think I get it. You know, Index Engines has been in my secret bag of tricks for well over a decade. Seems like a long time. All right, Jim, anything you'd like to say to our listening audience? No, I appreciate the time.
Starting point is 00:43:29 There's a lot of things that we do. I mean, if there's any questions, feel free to reach out to us at indexengines.com and happy to talk to you further about any of your data challenges that you have. Okay. Well, this has been great. Thank you very much, Jim, for being on our show today.
Starting point is 00:43:44 Thanks, Ray. Thanks, Howard. Appreciate it. Next time, we will talk to another system storage technology person. Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it. And please review us on iTunes and Google Play, as this will help get the word out. That's it for now. Bye, Howard. Bye, Ray. Bye, Jim. Bye, guys. Until next time.
