Podcast Archive - StorageReview.com - Podcast #116: Delete those files!

Episode Date: February 22, 2023

Brian tackles the concept of data deletion on this podcast and why it is…

Transcript
Starting point is 00:00:00 Hey, thanks everyone for checking out the StorageReview podcast. Today we've got an interesting conversation about to happen on deleting data, and that's a concept that not many people or organizations are too familiar with. So Dr. Burke, how did you get into this notion that people should be deleting data? Well, it's actually an interesting point, right? So I spent most of my career in cybersecurity, and if you're doing cybersecurity cleanup, the question always is: well, what was this data doing here to begin with? Why wasn't it properly protected? Who had access to it?
Starting point is 00:00:41 oh we had forgotten there was an API that was open, right? So when I was asked what my predictions were for 2023, I was like, you know, I'm starting to see a lot of organizations are doing some downsizing. There's probably some groups that are going to get, well, downsized. And generally what then ends up happening is that who's taking care of the data that's left behind that was collected, right? So we're going to see some more data exposures. And so I figured one of the things that we're going to see in 2023 is organizations taking a better look at or a deeper thought about what data they actually keep and whether or not data is actually a liability as opposed to an asset. So it's funny, when your team reached out, I had to go back and think really the last time that we got pitched a data deletion
Starting point is 00:01:50 message. Now, we'll get it every now and then from a compliance angle, and we can get into that. But that's more of a legal protection than really being worried about the data. The data is sort of a side effect of that legal apparatus. But back then, Symantec was talking about it like, look, organizations, you're backing up all this data. You've got copies of it. You've got all these things going on. And some of it may be harmful for you to retain. And that's a little bit about some of what you're talking about from the risk profile perspective. But it's been ages. It's a little counter. You're not a storage guy, so this is good,
Starting point is 00:02:33 but it's very counter to the industry that's grown on selling you more space for your bits and it's also counter to the message that we're getting from all the acceleration guys like NVIDIA saying, you never want to delete anything because now you can train your models on more data for AI and ML. There's got to be a happy medium somewhere, though. What are your thoughts on that? Yeah, so there's a hidden promise in data, right? So I actually, it's interesting you should say that because I've been following the NVIDIA work
Starting point is 00:03:07 with great interest. So I have a PhD in machine learning, but that's from a time when the cloud wasn't actually a thing yet, right? So the problem always was how do you get enough data to train a model reliably? And nobody was all that
Starting point is 00:03:25 interested in machine learning back then, which is kind of how I ended up in cybersecurity. And then the cloud started happening, and what that meant is that a lot of software wasn't necessarily running on the desktop anymore. Software could run, you know, in a centralized location, which means all of a sudden you can collect a lot of data about how people are acting, how people are interacting, what they write, what they're interested in. And so all of a sudden, it's like, okay, well, this is making modeling possible. And if you're able to model, you're probably able to monetize that. So there is sort of this explosion of the belief that if you hold data, there's an enormous amount of value to be mined in that data. But even if we can't mine it today, we might be able to mine that data in the future.
Starting point is 00:04:14 And so it gets you into the sort of like, well, if I throw it away, I'll never be able to mine it. So I might as well keep it because storage is cheap. I think that's where that kind of got its start, right? And the difficulty with it is that a lot of data, as you were indicating, if it does get exposed, and a lot of it is getting exposed, could actually be really harmful to the organization, right? And so how do you realize or how do we weigh that sort of hidden promise in the data, if you will, against the risk of the liability of the loss of that data? What can we
Starting point is 00:04:53 learn from data in the future and how valuable is that? Well, I guess, so in your background in machine learning, if I'm modeling, I don't know, I hesitate to even give a specific model, but if I just pick any model, is there a point of diminishing returns then? Is that what you're suggesting? That if I've got three years of data, that the fourth year is not going to get me that much more in terms of value to train? Well, sure. That could be true. There could also be changing patterns through the years. I mean, we can debate. It depends really more on what you want to do with the data.
Starting point is 00:05:34 But I think that there's a difference here. If I'm collecting, for instance, certain medical efficacy data, like you have a medication and you're applying it to people, you want a really, really big body of data so that you make sure you don't make any mistakes with that, right? Because, you know, well, you're handing out meds, right? So there's a very specific purpose to the data that I'm collecting there, right, versus the concept of, well, I have a lot of shoppers. Or, you know, social networks are a great example of that, right? A lot of the value of the social network was in the,
Starting point is 00:06:16 well, there must be value to be had in understanding how a social network sticks together and how trends and things become popular, how things go viral, and maybe we can influence that, and then there is an advertising value to that. Now, I think that we're seeing slowly that, you know, although some value has certainly been found there, it's not at all living up to the promise, right? So a lot of this data, I would argue, has been collected under the premise of we'll find a way to make it useful in the future, even if we don't see it today. So let's just not throw it away for the time being. Right. So what you're suggesting then is that notion needs to be balanced with some sort of risk profile then to determine.
Starting point is 00:07:02 Because we obviously don't want to throw away everything, because especially immediate data has tremendous value. We may not want to hold on to everything forever, because the more data out there, potentially more risk. Do you see older data specifically at a higher risk or different risk than current data? Depends entirely on what one was collecting, right? Like if you, for instance, imagine you run a tax analysis software where people can do their taxes online.
Starting point is 00:07:36 You've probably got some pretty sensitive data, and the question is, well, do you need to keep it around? Is there any purpose to it? Well, there might be a purpose in the future, because then we can see the changing patterns of taxation and tax law, or how people do their taxes over the years, and serve them better. Okay, well, is there true value in that? Can we predict that value? And is it therefore worth keeping that data around for a long time? Or are there alternative ways in which we could store that data where the liability of the data loss becomes
Starting point is 00:08:05 mitigated or minimized, right? And there certainly are ways to approach that. But I think more even than that, Brian, is what I'm saying is very few organizations that I have helped over the years have taken a hard look or a deliberate look at the data that they're keeping from a perspective of liability, right? Like the calculation always seems to be data by itself could be valuable in the future. We could do something meaningful with it at some point. It doesn't cost us very much money at all to store it, manage it, and back it up. And therefore, we'll just keep it until we figure out what we want to do with it. And I think that's a false premise.
Starting point is 00:08:48 And that's really what I'm arguing we should push back against. Against the notion of save everything because one day it could be worth something, right? I think I had that notion in the 80s when I had baseball card sets. And it turns out that upon cleaning my room many years after my departure from my home, my mom discovered these and they ended up going in the recycle bin. I mean, we held on to the data forever. And it turns out it was probably worth less in the end than the card stock that they were printed on. So you're advocating good hygiene for data. If I'm going to hold on to data, then what's your argument there, that
Starting point is 00:09:40 it should be encrypted, or how else would you secure the data? Well, at least be deliberate about it, right? So I think the point about the baseball cards that you're raising there is, well, what's the liability of the baseball card becoming exposed? Well, not very much, right? That's in your particular scenario. And hence, my basement is full of boxes of stuff that I could maybe one day use in the kitchen, but the cost isn't very high. The data that is kept has a hidden cost associated with it that a lot of people don't realize.
Starting point is 00:10:20 Now, I'm going to give you sort of the example, right? Because I think it drives towards the answer you are seeking. At some point, I was part of an organization, and that organization was migrating from an on-premise SharePoint to an in-the-cloud SharePoint. And, you know, you go to an in-the-cloud SharePoint, you get all of a sudden a lot of statistics about your data. Now, this organization had about 800 people working there, and I saw that we had over 1.5 million Office documents on the SharePoint. And I'm thinking to myself, well, that's about 2,000 documents or something per person that works here, and it's probably been produced over the last 20 years. Nobody is ever going to go through this data and take a good look at it. Now, power of the cloud, right? You can type in any search query, and it was amazing the kind of personal and HR-sensitive things that were
Starting point is 00:11:13 to be found with just simple search queries there inside that organization, right? And it gets to this point of, okay, well, the data is kept. Well, why? Well, you know, there might be important information about a customer in there. There might be important information about an engineering process or data that we really should keep because, you know, whatever reason. Because we don't know what the potential future loss is if I get rid of the data, you decide to keep it. But in doing so, well, nobody's really busy taking care of that data. And I know a lot of organizations now are talking about chief data officers and bringing people on that are responsible for data. But if you really take a close look at their job
Starting point is 00:11:56 titles, sorry, their job descriptions, their job description is get more value from the data we collect to figure out other data to collect, right? Like, it's never figure out if we should be keeping the stuff that we're already having. Wait, what you're describing, the deliberate nature of looking at and managing your data and then figuring out how to secure it, whether it's encryption or whatever else, it's really hard and time-consuming.
Starting point is 00:12:24 So there was a company called DataGravity out of Nashua, New Hampshire. It didn't spin out of EqualLogic, but it was a lot of the same team that was involved in EqualLogic way back when. Anyway, it was a NAS basically, but hardware-wise, nothing special. What they had was a software layer on top that would help organizations scan their folders and say, am I exposing a social security number in any of these? And based on whatever policies you have in place, it would report back that this word or this format or this thing that looks like a phone number or whatever was found in these documents and shouldn't be. It's a violation of our legal policies or it's in violation of some sort of sovereignty. Is this being stored in Spain when it can only be stored in Portugal or something? So that company was one of the few
Starting point is 00:13:31 from a storage perspective to really go after this. And it didn't work out. And I'm not even sure what else is in the market to help companies do that. But back where I started is that's like a compliance activity that's still not what you're talking about in terms of a holistic data approach.
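The kind of policy scan described here can be illustrated with a minimal sketch. It is not how any particular product works; the pattern names, the two regular expressions, and the function names are assumptions made for illustration, and real policies would need far more patterns and tuning to limit false positives.

```python
import re

# Hypothetical patterns for illustration only; a real policy set would be broader.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def scan_text(text):
    """Return a dict mapping each pattern name to the matches found in text."""
    findings = {}
    for name, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            findings[name] = matches
    return findings


def scan_documents(documents):
    """Scan a {document_name: text} mapping; report only documents with findings."""
    report = {}
    for doc_name, text in documents.items():
        findings = scan_text(text)
        if findings:
            report[doc_name] = findings
    return report
```

A report like this only locates candidate violations; deciding what to do with each hit is the harder remediation step discussed next.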
Starting point is 00:13:53 But even that is really, really hard to look for things that you know to be a violation and then to remediate it if you find social security numbers in your data. Because really, what's to stop a person, we're our own worst enemies, a salesperson who puts a table together and collects this information, not knowingly doing anything wrong, but violating some sort of policies that gets out there into the corporate storage.
Starting point is 00:14:26 And now we've got a problem. Yeah, no, you're right. It's hard. So how do we get people to take on that hard challenge? I think that's right. And I mean, right now, I think that me just raising this point, that people are going to start thinking about it, is mostly a point of awareness raising, because you're right, it's really, really hard to do something about it.
Starting point is 00:14:50 I just gave you an example of 1.5 million Office documents. What does one do, right? DataGravity, I'm actually familiar with that company, right? So it's funny you should raise it. A similar technique, of course, was used by the data loss prevention companies, probably in the early 2000s, where you scan through documents that are on the wire and you see if there are certain compliance violations, right? That helps you pick some of that up. But really, if you want to do this right, it gets down to, well, what you're asking for is a complete data inventory and classification and subsequently a risk analysis of it, right?
Starting point is 00:15:28 And it is next to impossible, because it's not just documents. Data lives in databases, data lives in backups, right? Archives. Data lives in SharePoint sites, right? Data lives in data lakes, whatever structure they may, you know, data mud puddles really is what starts to end up happening half the time. That's basically boiling the ocean, right? But if we looked at what could be done, right? Like you have to optimize sort of like where you're going to have the biggest
Starting point is 00:16:07 gains first, right? So you can't simply be like, well, we're going to take every single bit of data at once, right? So if I was to come forward with some idea about how we might approach that, it all starts out with, well, the person who's responsible for that data, right? That's the number one place. And so when we talk about a chief data officer, if that role does exist in your enterprise, that shouldn't be, yeah, you know, we would like to get more value from our data, so we got a data scientist, he's the chief data officer, and that's that, right? So let's start out with someone who understands risk in the organization, because those people tend to understand value as well, often, and then give that a little bit of teeth, right? Allow a person like that to create policy. And when you're talking
Starting point is 00:16:55 about creating policy, right, a very good step two is: can you even come up with a process by which you associate a risk cost with a piece of data? And what I mean by that, right, is not to go and do every piece of data, but could you come up with a formula that simply says, well, here I've got social security numbers, I've got credit cards, I've got emails that were sent between workers, emails with customers. How do you come up with the process to say, what is the risk cost? What is the liability associated with the loss of that data? And that's going to fall into those categories you were talking about earlier,
Starting point is 00:17:33 namely, is there a compliance violation, right? Is this data by law, you know, PCI or HIPAA compliant? Must it be kept secret? Is it a reputational loss? Is there financial information, or, you know, could one commit fraud with this data? Is it classified information, right? Especially for government organizations and those working with them, right? So you can come
Starting point is 00:18:00 up with a sort of tiering system for what the liability in the data is, and start to assign a process to that. And then step number three, right, is think of a risk justification decision process based on that risk cost. Again, I'm not saying apply these processes far and wide. Think through how one would build them, because once you've got that template, now you can start to think about where would be a starting point to do that. What do we do with old data if we really don't think we want to throw it away? How do we take the fangs out of it, right? Because there's ways to deal with data that do not necessarily mean you're throwing it away, but that significantly reduce the liability in that, right? And so, you know, if you go through these sort of three steps that
Starting point is 00:18:51 I just kind of made up, right, one thing you are doing is you are being deliberate about what data you're keeping, how you're going to deal with it, and what the risk to the organization might be. And, you know, you'd be surprised how far you get, I think. Well, yeah. I mean, you talked about one spectrum of your chief data officer role being a data scientist. Clearly, that's probably not the best person. But man, it sure sounds like a legal operation, and I'm not sure that's the right answer either. I mean, certainly consultation from legal would be valuable, but anytime they make the rules,
Starting point is 00:19:32 I feel like they'd probably lean more towards hanging on to data than getting rid of it, because they're more paranoid than the data scientist guy. But I say delete. It doesn't necessarily have to be delete. You're also kind of lumping in security as part of that message, too. Yeah, I think certainly. And there's also a way, right? So obviously security and encryption. Okay, well, then there is access control, and there's auditing with access control, right? So these are standard cybersecurity processes, right? Identity access control processes that one ought to follow.
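Steps two and three from the discussion above (attach a risk cost to a category of data, then run a decision process on it) could be sketched roughly as follows. Everything specific here is an assumption for illustration: the weights, the 0 to 10 value scale, the thresholds, and the handling labels are hypothetical, and an organization would define its own.

```python
from dataclasses import dataclass

# Hypothetical weights per liability category; not taken from any standard.
CATEGORY_WEIGHTS = {
    "compliance": 5,    # e.g. data in scope for PCI or HIPAA
    "fraud": 4,         # could someone commit fraud with it?
    "reputational": 3,  # exposure would hurt the organization's reputation
    "classified": 5,    # classified or must be kept secret
}


@dataclass
class DataSet:
    name: str
    categories: tuple      # liability categories that apply to this data
    expected_value: int    # hypothetical 0-10 estimate of future value


def risk_cost(ds):
    """Step two: sum the weights of every liability category that applies."""
    return sum(CATEGORY_WEIGHTS[c] for c in ds.categories)


def decide(ds):
    """Step three: weigh risk cost against expected value to pick a handling."""
    cost = risk_cost(ds)
    if cost == 0:
        return "keep"
    if ds.expected_value >= cost:
        return "keep, encrypted and access-controlled"
    if ds.expected_value > 0:
        return "archive encrypted, key in escrow"
    return "delete"
```

The point of such a template is less the numbers than the fact that keeping a data set becomes an explicit, reviewable decision.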
Starting point is 00:20:09 A policy of what to do with older data, or data not currently in use, is not hard to create, right? And we already have that for shredding paper, destroying hard drives, and things like that. If these things exist, okay, what do we do with data if it's in the cloud? Do we just keep moving it? Do we keep holding on to it? Do we keep backing it up? Yeah, that's what Amazon wants you to do.
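As a rough illustration of where a policy for data that is no longer in use might start, the sketch below lists files that have not been modified within a chosen window. The function name and the use of modification time as the "in use" signal are assumptions; a real policy would also consider access logs, ownership, and legal holds.

```python
import os
import time


def find_stale_files(root, max_age_days, now=None):
    """Return sorted paths under root last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)
```

The resulting list is only a set of candidates for review, archiving, or deletion, not a decision in itself.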
Starting point is 00:20:37 Yeah, yeah. Well, it will be possible, right? Like, we can back it up, we can encrypt it, and we can store the key with, you know, the legal department, I don't care, but give it an escrow and simply say, we don't think we're going to use this data anymore. It was part of a project that was shut down, but because there might have been some value in there, we're not ready to let go. So let's encrypt this and let's essentially, um,
Starting point is 00:21:01 uh, defuse the data, if you will. And by the way, that can go one step further, right? Like there's all manner of processes around, you know, like the zero-knowledge proofs and stuff. These may be techniques you've heard of, right? Like a classical problem is, you know, when you're in a room with all your friends,
Starting point is 00:21:20 how do you figure out what the average salary is that all your friends make without anyone telling their salary to anyone else in the group, right? It's possible to compute that, you know, without having one person who knows the salary of anyone else, right? Like, there's processes to work with data in a way in which the data stays somewhat private. There's a number of startups in that area as well, right? So there's even a way to sort of think about, if we do have very sensitive data, are there some things we can do to, you know, possibly anonymize it, possibly split it up in such a way that it's no longer personally
Starting point is 00:21:58 identifiable or harmful, but it still would have almost the exact same value to us, right? It gets very creative to sort of what a chief data officer might be able to do with things like that, anything short of throwing it away or just encrypt it. Well, you started talking about the cloud and kind of growing forever, right? Because that's what they want to do. That's what they are doing.
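The average-salary puzzle mentioned above is usually solved with additive secret sharing, a simple form of secure multi-party computation. That technique name is supplied here, not by the speakers. The sketch below simulates all friends in one process, so it only illustrates the arithmetic; in a real protocol each friend would run on their own machine and see only the shares sent to them. The modulus and function names are assumptions.

```python
import secrets

MODULUS = 2**61 - 1  # assumed to be larger than any possible sum of salaries


def make_shares(value, n_parties):
    """Split value into n_parties random shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def private_average(salaries):
    """Compute the average from sums of shares, never combining raw salaries."""
    n = len(salaries)
    # Row i holds the shares friend i hands out, one share per friend.
    distributed = [make_shares(salary, n) for salary in salaries]
    # Friend j adds up the shares received (column j) and announces that subtotal.
    subtotals = [sum(row[j] for row in distributed) % MODULUS for j in range(n)]
    total = sum(subtotals) % MODULUS
    return total / n
```

Each individual share is a uniformly random number, so no single share or subtotal reveals a salary, yet the subtotals still add up to the true total.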
Starting point is 00:22:21 I mean, pretty clearly, right, the more bits under management, the more revenue and so on. You know, we talk a lot about cloud, about how the hardest part is to get the data back out. And it's not necessarily relevant to this part of the conversation. But the other big push is around environmental concerns. If we continue to store things that aren't worthwhile, forgetting about the risk, just aren't worthwhile for whatever reason, it's not a lot for a couple terabytes of data, but in aggregate, it starts to get pretty big, and we've got to spin up data centers and we've got to have equipment on or at least accessible. Even if it's on a tape drive
Starting point is 00:23:03 somewhere, it still has to reside somewhere and be managed and maintained and cooled and whatever. So it's not nothing. And for all these organizations with green initiatives that range from publicity to reality, that seems like another angle to look at this to say, just because we can and we can continue to get it for a penny a gig or whatever the number is, is that consistent with our other corporate initiatives too, right? Including ESG and whatever else. So that's an angle I hadn't even thought about. I like it. And the question back to you would then be, what is the energy cost per terabyte?
Starting point is 00:23:48 Is there like a, has anyone ever thought of that? I'm sure. I'm sure, because if you look at the Open Compute Project, which Meta and Azure and others are involved in, I mean, they're regularly talking about the power efficiency of their data centers and, you know, how they get there and cram all your useless Facebook photos into the smallest footprint possible. But I mean, you talked about social somewhere along the way here, and that's one of the big ones, right? We've got a Facebook account or an Instagram account and we're posting the videos, or TikTok. All of this content atrophies over time. Like the photo I took 10 years ago and posted.
Starting point is 00:24:37 The only person that's relevant to anymore is probably me. I mean, you don't care. My children hardly care, and it's a picture of them, you know? And so do we delete parts of the early internet that documented all of this through all these social media sites? I mean, are the early MySpace pages still in existence? And if they are, should they be? I mean, it is a fundamental question of data, and maybe, you know, we're morphing this conversation too far, but is there a greater philosophical, you know, reason why we shouldn't be hanging on to all of this stuff? You know,
Starting point is 00:25:18 Brian, you took this in a really interesting direction, and I'm actually smiling because it gets to one of those wonderful sort of party conversations if you're standing around with a few friends, right? Like, you know, should the internet start to forget some things? Because when we grew up, there was no internet, and, you know, things were forgotten. Today, right, what do we teach our children? Be extremely careful what you put on the internet, because when I hire a person, the first thing I do is I go through their LinkedIn and their Instagrams, their Facebooks and their Twitters to try to figure out what kind of person
Starting point is 00:25:54 this is, right? And you had different ideas 15 years ago than you do today, right? People change, right? And so that stays with you all your life, right? So you tell your children, be super careful what you do on the internet, don't say too many crazy things on Facebook, right? That's really the thing you're trying to tell them. What you're driving to is taking it one step further, right? Like, forgetting things is actually a very useful function. Otherwise humans wouldn't be forgetting things. We'd remember everything, right? I don't think there's necessarily an inherent limitation there. It's just not worth remembering things that are a long time ago, and therefore we don't have even bigger heads that can remember even more things, right? Like, I don't know, some gigantic heads. Okay.
Starting point is 00:26:40 Yeah, should the internet forget? I mean, I think about what we do in the storage world. You know, I've got a couple things sitting around me. NAS boxes, for instance, are widely popular because people can put them in their basement, throw a couple hard drives in there, and now store their family memories forever. And there's a use for that, and you hand it down to the generations. Just like, every now and then, Kevin, our lab guy, just digitized a bunch of old family eight millimeter.
Starting point is 00:27:15 And does the internet care? No, but his family does, probably, and wants that memory for forever and ever. But most people have abdicated that responsibility of storage to the social media companies or to a Google Drive or whatever that's free or low cost. There is an interesting dynamic there. I click the accept the terms button,
Starting point is 00:27:42 but I don't know what the actual guarantees or SLAs are from those services. Probably none would be my guess. But at the same time, if Meta said we're deleting everything prior to 2010, then there'd be some sort of backlash against that too. But it is an interesting conversation about in the corporate world, I think those organizations largely understand that they have data, they're creating data, they generally need to protect it and back it up. Where they are in that spectrum of how good they are is probably pretty wildly diverse. But on the end user, on the individual side, we're just putting it out there and kind of hoping for the best.
Starting point is 00:28:27 So the dynamics there are, I suppose, a little bit different for most people. Yeah, I would say there's sort of two things to think about there. Number one is, you are right. There is a reason that my family photographs are also on a NAS and that NAS gets backed up and that backup is off site, right? Because there's an enormous amount of inherent value to the photographs of my children when they were young, right? Like I want to never forget that. The point you make about giving that to a SaaS provider to safekeep, that opens up another discussion, right? Because it opens up the discussion not just of, well, are they going to delete all your Instagram photographs that are more than a decade old? Because, okay, yeah, pitchforks come out, but maybe you can download them before they do that.
Starting point is 00:29:23 And you've got more. So there's a way around it. The counterpoint to that is the bigger worry that I have about, okay, you clicked through that SLA that said we will not sell your photographs, we will not use your photographs for any other purposes, we will, you know, et cetera, et cetera. It's okay, I trust that. But right now we're going through a climate of, you know, the tech industry is getting squeezed a bit and decisions have to be made, and without a doubt companies get acquired, right? We saw this with Roomba, right? Perhaps the acquirer has a lot less ethics around what they want to do with the photographs or the
Starting point is 00:30:05 data that you stored there, and, you know, now it's no longer in your control, right? So that's sort of the counterpoint to that. Yeah, yeah. Well, and those guys certainly, you know, it's funny, a lot of the risk in any data now is being targeted through backups. So I spent a lot of time with Veeam or any backup provider of your choice. They'll all say the same thing, that the security attacks in organizations are going after the backups because once you control that, now you've got no ammo. You've got to pay the ransom and hope for the best. But still, backups are a big problem. And I'm laughing to myself because this came up at the
Starting point is 00:30:56 end of last year. If you look at the cloud service providers, they're starting to say in some cases, maybe quietly, that you don't need to back up. And the first question is, that's bizarre because that's counter to everything the storage industry has been saying since day zero, really. I mean, you talked about even in your case, the 3-2-1 of three copies, two media, one off-site, you're close to that. But the cloud argument is, well, we've never had a loss, which is probably not entirely true. But resiliency, I suppose, on its face does not make for data protection entirely. And I don't even know why I started going down that road. But when we start thinking about cloud, it's just different. And to your point, if that company gets acquired or goes under or whatever, or does have a material data failure, I mean, in some ways, I guess it's contrary
Starting point is 00:32:06 to your notion of we should be doing less. It might encourage people to have more copies, more places and create a bigger headache. I don't know. I mean, it's a difficult question. Well, I think what we're now doing is we're splitting it, right? So if there is data that you really want to keep, then the cloud is a fantastic place to stick it. You can still do 3-2-1 because the cloud is one medium
Starting point is 00:32:32 and it's also your offsite potentially, right? So the philosophy extends. And without a doubt, you've talked to many people about that already. This is data I want to keep. I have to make choices, right? Do I want to encrypt it? Well, yeah, probably. Who do I want to have access to the keys, right? Does, um, the
Starting point is 00:32:54 provider of my cell phone get to have the keys to that data, or should they not? The data that I put there, is there a way to demilitarize it, if you will, right? If this is data, for instance, about customers or patients, or financial information, can I do something with it in such a way that it is less harmful if it was ever exposed, right, but still equally valuable to me? That's data I want to keep. The second question, which is completely separate from that, which is how we really got this discussion started, is what data really shouldn't we be keeping anymore? Because the risk and the liability is so much bigger than the potential future gain we can see ahead of us. And that data doesn't need to go to the cloud and it doesn't need to go to the backup. In fact, it probably will stick around on a backup where it's just as vulnerable for
Starting point is 00:33:45 exposure, but that's a process question. So I think that's where that splits, on where you keep it. So we talked about backups. We talked about how keeping everything maybe is not a great idea. What are the other things that worry you? I've been thinking a little more lately. And when you were talking before about some of the layoffs,
Starting point is 00:34:10 thinking about access to data, permissions to data, even orphan data. So if I'm at an organization creating a bunch of whatever, or I have, you know, I'm an application guy that has written an application that's going out and doing some sort of task for the organization. Once I'm gone, who's monitoring, who's checking my folders on the corporate intranet or license from Google or AWS? And also the applications. I mean, you'd like to think someone's really actively managing those and making sure, you'd mentioned API calls before, that either we're not exposing our own API or the APIs we're consuming. I mean, there's just so much there that as you start to peel it back starts to get, I think, maybe even a little bit scarier as you think of all the ways you could get hurt.
Starting point is 00:35:04 But orphan data, like I said, is a big one. Copy data, having multiple copies of things that you're managing is problematic, not for the reasons, not security so much, but just footprint, just raw footprint. Is there anything else that you're worried about or do any of these, in your opinion, provide a good starting point to come after this problem? Is there an easy button somewhere to mitigate these things? Well, there never really is, is there? I suppose not. One place I would, one area we should talk about is the externalized cost to the benefit of the computing industry that we have been benefiting from, right? Like pretty much any great advancement that mankind has ever had ended up having some drawback we did not foresee at the time, right?
Starting point is 00:36:05 Oh, absolutely, yeah. At some point, the Surgeon General thought that smoking was a great way to relax, right? And then we discover what it does, right? So, you know, probably the same thing is happening with the fossil fuel burning, right? Like we can see that's like, oh, okay, maybe there was a drawback to doing that. And I think that perhaps we haven't quite seen the drawback of a storage medium that can remember everything forever and is actually pretty good at it. Like losing data is actually the exception that we get mad about. I think that what you're talking about is perhaps tipping the veil on this and this externalized cost,
Starting point is 00:36:46 right? This unintended side effect, right? Like, oh boy, I mean, you know, it can get pretty scary out there if you think about it, because I did mention API calls. We're seeing some great API security companies pop up today, but it's just scratching the surface of what needs to be done. You know, myself, you know, I've led a number of data science teams over the years, and data science teams generally, you know, consist of developers, people that write code or use tools that do data processing and are extremely creative in doing so. Having been in the cybersecurity space, I'm very cognizant of this, right? So what ends up happening is, you know, application developers are able to collect a lot of data that moves through the system, right? And sometimes it's just for debugging: this user logged in with this password, has this social security number and this credit card for this address, right? I'm just making sure it's debugged so that the application works, right? And then that gets written to a log and then it's
Starting point is 00:37:44 forgotten to turn off, or it was written to a log and not encrypted. And then, you know, that person's like, oh yeah, it's on my long list to do, so I gotta do something about it. Then that person leaves the organization. Who ever cleans up their data when they leave the organization, right? I've seen, of all the people that have worked for me over the years, one or two. It's like it's not the common case, right? Like it's not what people do. And I think sort of the scope and magnitude, right, it's not even the known unknowns.
Starting point is 00:38:14 It's the unknown unknowns that I haven't even thought about where I think that should keep one awake at night if you think about it. Well, you're doing a good job of making everyone scared of having... That wasn't my intent! Well, look, the conversation is usually around security and protection, which everyone should have and do, and that's great to have a backup and move your data off-site, whatever. But I think the industry at large doesn't generally have
Starting point is 00:38:46 this conversation. So that's why I thought it would be interesting. And literally, like I said, it was that event six or seven years ago was the last time when I heard someone really banging the drum about getting rid of data for security concerns. Man, you're swimming a little upstream though, I think, because of so much pressure and fear by organizations that they, like you said, that there may be a legitimate or perhaps it's a fallacy to think that I can get something out of this one day. Yeah, I just don't know how we educate and define that data officer role so that you get to the point where it's okay to get rid of some things. Well, and I think that's, you know, you're right about it, right? This thinking is a little bit further ahead. And this is why I thought it'd be enjoyable to get on your podcast
Starting point is 00:39:54 and just debate this issue with you, because there really is no immediate, like, solution here. But there is certainly some thinking, right, that might be great for our industry to start having around data as a liability. So really, I feel like we've explored it pretty far, to be honest. We took a turn and I thought we would. We got into greenness of hard drives and tape. So it's a worthwhile conversation, and it may require a role that doesn't fully exist yet. And I'd be interested, and maybe someone will pop up after we post this, I would be interested to hear their data policies and say, for all this file data that
Starting point is 00:40:50 we've created, sales pitches and slide decks, slide decks are the worst. I got a deck from somebody the other day, 280 slides or something insane. And they all have graphics and they all have colors. So each one of those is a couple meg times 280. It was so big that they, of course, it was shared on a cloud service. But in a large organization, how many of those slide decks are in any Fortune 2000 or Global 2000 or whatever? Insane numbers. I mean, the Word docs at least are pretty lightweight if they don't have images and embedded video and whatnot, but the PowerPoints are insane, and the amount of duplication that happens in them, because of course you copy a lot of
Starting point is 00:41:38 slides from the guy that came before you and then now your PowerPoint is 280 megabytes. Yeah, too many slides. But yeah, I think just putting some thought to it, even if you don't do anything, having the conversation in an organization of saying, we recognize that this is a concern. I think maybe even that obviously doesn't solve the problem, but recognizing the concern is a good starting point,
Starting point is 00:42:10 wouldn't you think? Yeah, I would say, I mean, like generally, right, I would say, especially if you do have, you know, a person responsible for the data in the organization, is there a thinking process that could go around classifying risk cost, right? What's the cost of the liability,
Starting point is 00:42:31 right? Never mind the revenue potential of the benefit. Like, what is the risk of the liability of that data? And if we were to classify that and we were able to do a good job at that, what would be the processes and controls and justifications we put around those types of data that we choose to keep versus those that we choose not to? I think that it's a healthy thought process to go around. It might actually not be so bad. I don't know. It sounds pretty awful to me. But then again, we don't have that much production data. I wonder too, if all the SaaS companies maybe could take this opportunity to
Starting point is 00:43:15 get out ahead of this and say, if I'm salesforce.com or something, hey, organization, you don't have to delete anything, but here's some best practices or here's some ideas about the types of records and some sort of timeline, and maybe we can help you mitigate some of that by putting that in effect and saying all sales deals that you lost that are over seven years old, we can, once a month, we can kind of clear those out. Yeah.
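The retention idea floated here (lost sales deals older than seven years, cleared out once a month) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Deal` record, its `status` values, and the `split_for_purge` helper are hypothetical names, not any real CRM's schema or API.

```python
from dataclasses import dataclass
from datetime import date


# Hypothetical record shape for illustration; not a real CRM schema.
@dataclass
class Deal:
    deal_id: str
    status: str       # assumed values: "won", "lost", "open"
    closed_on: date


RETENTION_YEARS = 7   # the "over seven years old" figure from the conversation


def years_before(day: date, years: int) -> date:
    """Return the same calendar day `years` earlier (Feb 29 falls back to Feb 28)."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        return day.replace(year=day.year - years, day=28)


def split_for_purge(deals, today: date):
    """Separate deals into (keep, purge); only old lost deals are purge candidates."""
    cutoff = years_before(today, RETENTION_YEARS)
    keep, purge = [], []
    for deal in deals:
        if deal.status == "lost" and deal.closed_on < cutoff:
            purge.append(deal)
        else:
            keep.append(deal)
    return keep, purge
```

A monthly job could call `split_for_purge(all_deals, date.today())` and hand the purge list to a review step before anything is actually deleted. Returning both lists, rather than deleting in place, is a deliberate choice in this sketch so the decision can be audited.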
Starting point is 00:43:50 So, you know, maybe, because now the world, everything's being consumed as a service now, even infrastructure on-prem is as a service through Dell Apex or HP GreenLake or whatever the big companies are. So it will be interesting to see as they consume, as organizations consume more cloud services, more software in the cloud, does that dynamic change how to best handle these data questions and where are the drivers of change to, to really be contemplative and, and come up with,
Starting point is 00:44:27 with at least understood answers, if not the best answers. And I think, you know, sort of as a, as a thought on that, the example of Salesforce is a, is a good one, right? The example of SharePoint is a, is a bad one, right? Because when we think about SaaS services, SaaS services typically deal with one type or class of data, right? A financial processing system, a sales processing system, a medical records processing system, right?
Starting point is 00:44:56 Like the type of data already sort of, there's a set of best practices around that. Whereas, right, like an open document store where like anything can be a PowerPoint or everything, you know, it becomes much harder, right? Like it's a big bag of data. So I think you're right that especially the SaaS companies could start to form that thinking as part of, you know, their service to their customers. Yeah, well, look, we're in agreement that organizations need to be thinking about this thing,
Starting point is 00:45:29 that whether it's the chief data officer or a legal function or CTO, CIO, whoever it is, somebody needs to help in that organization have this conversation and just be thoughtful in how data is managed and where it goes and how it's kept and all that sort of thing. And that there's still quite a bit of opportunity that backing up data is not sufficient for these types of tasks that we're talking about. And you can expose yourself or reduce your risk based on some of these decisions. So it's a good conversation.
Starting point is 00:46:09 And I appreciate you coming on and doing this. And hopefully we'll hear both the stories of people that have done really great. And I mean, perhaps they'll never know if they protected themselves by protecting themselves, which is sort of a strange thing. But also, if someone hears this and just goes to their boss and says, hey, are we doing this? Are we thinking about this? I mean, that would be a fantastic outcome, I think. Would be a great outcome. And it's been a real pleasure, too, to do this because the reality is I came here to
Starting point is 00:46:41 explore these ideas with you and I feel like I learned a lot. So I think it was a success for me, too. I mean, you're the doctor. I'm not sure if you learned from me that we accomplished as much as you think we did. You learn by unpacking ideas and mulling them over. So how about that? All right, I'll concede that. Thanks for doing this.
Starting point is 00:47:03 Appreciate your time. Absolutely. It was my pleasure.
