Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 05x13. Dark Data at the Edge

Starting point is 00:00:00 Welcome to Utilizing Tech, the podcast about emerging technology from Gestalt IT. This season of Utilizing Tech focuses on edge computing, which demands a new approach to compute, storage, networking, and more. I'm your host, Stephen Foskett, organizer of Tech Field Day and publisher of Gestalt IT. And joining me today as my co-host is Allison Klein. Welcome. Hey, Stephen. It's great to be back. It's great to have you back. And, you know, we are doing a special episode here, kind of mid-season, talking to some of our Field Day delegate friends about edge, which is one reason that we decided to invite Gina to join us today.

Starting point is 00:00:44 Gina, welcome to the show. Hey, y'all. Thanks for having me. So today on the episode, we are going to talk about something that is, I think, sort of lurking in the background of a lot of edge conversations. And that is this whole idea of, as Allison coined it, dark data. Allison, tell us a little bit about your thought of dark data. Well, dark data really, in my mind, refers to the data that an organization may have created or collected, but is unaware is within their data stores and is not acting on in terms of monetization or service delivery associated with that data. And I think that every organization, in terms of what they have, has dark data present, and it's that

Starting point is 00:01:36 opportunity to explore, you know, expanding what they're doing with their data, if they can actually find out where their dark data is and put a light on it. And I think one of the things that's challenging here is that there certainly is dark data in the data center and in the cloud, but in the edge, well, you really don't know what you've got, right, Gina? Well, you should kind of know what you've got if you're doing it right, I think. So that's kind of, I think, why you have the dark data. But I think if you're intentional with what you're doing at the edge and you know what data is being created and what data is available, you'll know whether it's something you want to pull into your data lake or if it's something you want to not keep or you'll be able to make those decisions. But I

Starting point is 00:02:20 think you need to know what kind of data you've got everywhere, though. Well, is that realistic? I mean, do people know what they've got going on out there? I just think that, you know, I've spent a lot of time in the industry, and I think that we assume that all of this is a lot easier than it actually is in practice. And the reality is IT organizations have tremendous priorities associated with security, keeping their workloads up. Maybe some things fall through the cracks in terms of data that's being created by different business groups at the edge. Maybe they don't have the right tools in place in terms of observability and discoverability so that when this data is created, it's obviously

Starting point is 00:03:05 flagged for their IT organization to explore. And maybe the IT organization knows about the data, but it's dark to the line of business that would actually see value from it. So I think that there's a lot of different angles that you can use in terms of thinking about what the challenge or opportunity statement of dark data is. Right. I do agree with you. I think IT probably knows about the existence of dark data or the ability to create more dark data and makes decisions based on costs and not necessarily costs of, you know, opportunities for the business.

Starting point is 00:03:47 Because I know we used to do that back in the day, and I don't think that's something that's really changed, you know. I think another opportunity to create dark data is probably the use of SaaS applications, which are used everywhere to try just to get business done as fast as possible. That's really interesting, Gina. How do you see the tie between SaaS and dark data? Can you say more? Well, sure. I help a lot of people with marketing because that's what I do. And I think a lot of times people get frustrated with the overall big company rules about how you can do marketing, especially collecting customer data. Where are your customers at? They

Starting point is 00:04:25 just want to throw something together, collect some leads and go on with it. So you'll have a sales team wherever they may be, a sales leader saying, just go get this done for me. Go use XYZ application. Just use your credit card, run it and get it. Sometimes really not thinking about, well, there's an entire, you know, CDP, which is supposed to be gathering all the data in one place from everywhere. It might be, it's the idea to have the single pane of glass, as they like to say, and also security and everything,

Starting point is 00:04:58 especially with personal information, which we know has lots of governance around it. So the overall line of business may not know about it. The sales leader that's decided to go off and create his own workshop to get the leads he needs to drive his business may not know about the opportunities of tying all the data in the organization together to find more customers and find more leads and the rest of it. So you end up with kind of fragmented siloed data all over the place. For edge, I think it would be interesting to see if there is, you know, cause I think it was interesting what you said a few minutes ago that there's,

Starting point is 00:05:37 it's really hard and there's not, are there any best practices that people are using when it comes to building an edge application? Are they just grabbing whatever looks good at the time, which will fit to get it out the door and fits within the cost constraints just to make it happen? And nobody ever circles back around and tries to put that into a bigger system. I think when you're talking about edge, one of the things that came up when we were talking about far edge applications and, for example, at retail was that you have maybe not SaaS applications specifically, but you have applications that are being brought in by different people, different departments. Some of them are outside applications. Some of them are inside, but from different departments. And all of those are running and are expected to run everywhere remotely. And I think that that seems like kind of an analogous situation. So you have,

Starting point is 00:06:38 you know, for example, maybe an advertising application that's running in your Quick Mart. And that application came from a third party vendor and it has no real ties back to the company except through its own communications links and its own service provider and its own data format and so on. And all of that is really isolated from the overall corporation. I think that that's really probably a very normal use case in the edge. And that's why when you brought up this idea of data silos and dark data, it really resonated with me in terms of edge, because I could totally see that happening in this environment where you've got all these different applications running. It really is like sales enablement applications running on phones and laptops and everything. It's just a completely different set of data, right? I think that one of the things that comes to mind is it's a technology problem and it's a human problem, because without full-stack integration and really thoughtful tie-in of, you know, the leads generated from this application automatically flowing into my CRM, you know, those things can be done, but are they done all the time? Sometimes not. The human problem is really having all of

Starting point is 00:08:06 these different organizations deploying applications and maybe not seeing the point. To Gina's point earlier, the sales leader doesn't really care if we capture those leads, that contact information for further purposes. He just wants to meet his quarterly numbers. And so, you know, lost opportunity. I think that, you know, the edge is kind of accelerating both of these challenges because so much of the data that is generated today is being created at the edge. It needs processing at the edge. And we are seeing that these two challenges, technology and cross-departmental communication, are just getting, you know, kind of in the way of the full value proposition that's in front of us. If you think about how much better you can do things if you have the data in one location that you can write applications against and how much better you can derive data for the entire organization.

Starting point is 00:09:09 Or think of it another way. If you can think of how much better it would be if you were able to take, you know, whatever it was, a sales leader, being able to push him to what he needs to create his own sales program, but all protected, using all the data from an organization. It's kind of short-sighted not to do that. But human nature is, like you said, just to get it out there and get it done, meet my number. Another thing that occurs to me is that in many cases with edge, we're talking about data that is intentionally darkened.

Starting point is 00:09:45 Let's say that. So you're collecting metrics from machinery or you are collecting video and processing it and so on. The idea is that you are going to collect and process that data at the edge. You're going to extract what you think is valuable today from that data. And then you are going to either chuck or just ignore that source data and send back only the results. And I think that in many ways, this is an optimization that people are doing in order to optimize bandwidth and connectivity and cost. But at the same time, that's dark data too, right? So I mean, if you

Starting point is 00:10:32 think about like a security camera example, for instance, you've got a whole bunch of security cameras, they're all processing data. You've got maybe a machine learning algorithm that is processing that and extracting the interesting bits and only sending the interesting bits back home. Who's to say that you don't need or want that data in the future? And even if you are pretty sure you don't need it because it's just an empty parking lot or a machine that's idle, you know, maybe some of the other data that you've rejected, you know, you might need. Or maybe the machine, the algorithm rejected some of it on its own. And you find that, oh, man, I wish I could see what was going on at this time code. I think that's another area where data could be dark, even in a system that is correctly configured to ship some data back to

Starting point is 00:11:21 the source, right? It's an interesting question, Stephen, because at some point, what is dark, right? Is it stuff that you don't know about? Is it stuff that you're not acting on? Is it stuff that you're throwing away? I don't know where the right demarcation point is. And I think that's worth exploring: if an IT organization has already looked at this particular use case and said, you know what, we're just going to move this portion

Starting point is 00:11:51 of the data because we already know that this other stuff is junk. I mean, it's almost like, at that point, does holding on to the junk become data hoarding? You know, where is the demarcation line where it's no longer dark, it's useless, and we're just going to get rid of it? I understand your point about the security cameras. I mean, maybe I've watched too many true crime episodes, but I always know that it's bad when they lose the data of the security camera. That's a common theme. But I think that, all joking aside and black humor aside, I think that if we think about an IT organization making thoughtful decisions about what they want to keep, I don't know if that is dark, if it's a really good data policy. Because I don't think that the right answer is that an organization should hold on to everything for all time because what is useless

Starting point is 00:12:56 may become useful in the future. I don't know. I could be talked out of that position. I like everything you said, but I do want to say, I'm not sure that the IT department should be deciding all of that information because there's more to it than just, you know, there's the cost involved with keeping it, but then there's also, it's all discoverable at that point in time. And there's also the security piece. Like, does keeping that much data, you know, what the data is, does any of that expose, you know, give you a wider exposure platform? So I think, again, this goes back to: you ought to know what your IT department's throwing away, and they should be able to provide

Starting point is 00:13:43 you with their rationale for why they're throwing it away. If they're hoarding it, they should be able to provide you with that rationale. But there probably needs to be more than the IT department involved: probably the legal department, whoever's responsible for security, and then just the business itself. Like, would you do anything with this data, and what would you do? And deciding, deciding from the business point of view, either because they don't know what they don't know, or because maybe they did decide that this wasn't needed, and now they find out it is. Or, you know, we, as you brought up at the beginning, Gina, that the idea that sort of third party external applications, and so on, you know, they may have value that,

Starting point is 00:14:45 that customers, that companies aren't even aware of. And I think in all those cases, it's, it's possible that there could be, you know, data out there at the edge that people kind of wish there wasn't or wish there, and maybe that's the next thing to talk about is sort of what is the repercussion here? I mean, is it bad? Is it bad to have data that you don't know about? Or is it okay? Is that just how business works? What are the risks here? I think that ultimately you've got a situation where it depends on the type of data that it is, right? There may be data that you don't necessarily want to hold on to from a standpoint of,

Starting point is 00:15:38 you know, it could open you up to privacy concerns. It could open you up to a number of different reasons why your lawyers might tell you, actually, it's not good business practice for us to hold on to this data. So to that point, if you don't know you're collecting it, and it's just sitting out at the edge somewhere, that could be exposing you to things that you don't want to be exposed to. And then I think that, you know, there's always security risks associated with data. And so if you're, if you don't know what you're protecting, then how do you know if you've been breached? If you don't know what value that data is, how do you know the cost of that breach or the business risk associated with this? That leads me to the determination that if it was my IT organization, I would not want to have dark data exposure because it's a very difficult thing to wrap my hands around quantifying what the risk of that exposure is. I would also think, you know,

Starting point is 00:16:41 as an application owner, I would want to know the risks for sure, but there's also an opportunity. So if I own an application or a program from a business perspective, I want to know if it's creating some, even if it's a SaaS application or if it's a, you know, if it's something we built for edge consumption. I want to know what data is being created or the possibility of what data can be created. I want to know what IT is not keeping. Is there something they could keep that we want to keep? And how do we make sure that's secure? And if it's discoverable, what does that mean to the organization? So I think there has to be a tighter connection between IT and the data to the data scientist that will help them solve like the last piece of the puzzle to help them solve a problem for the business. That bit of data that they didn't know existed till they did know.

Starting point is 00:17:54 So I think that the business folks need to understand what dark data is and work with their IT people to understand the risk of keeping the data. You know, it's interesting. I was just talking to the good folks over at Calyptia, Stephen, and one of the things that he said to me, and for those who don't know who Calyptia is, they build observability solutions. They're the ones who are behind Fluentd, and a lot of folks are familiar with Fluent Bit and Fluentd from the open source space. And one of the things that he said that really piqued my interest was his customers, after using their solutions, their data problems got worse

Starting point is 00:18:34 because they discovered their dark data. And all of a sudden they had a bigger, you know, data hygiene challenge ahead of them in terms of what to do with all of this data that they didn't even know existed. But it does kind of validate the point of what we're talking about, that every organization has it. A lot of companies are not using those observability solutions to find it. And once they do, they've got a big cleanup on aisle five challenge in front of them in terms of getting it in order and figuring out what's valuable. I want to talk a little bit about tools, too, here. So you mentioned Calyptia. A couple other companies that I've been talking to about similar issues.

Starting point is 00:19:16 I talked to Hammerspace earlier. using, basically bringing unstructured data from everywhere and anywhere into corporate control and centralization. Another company I just talked to today was Resilio, which is another company that's looking at basically how can we transport lots of data from lots of places and consolidate it into one spot. There's a whole world of tools out there. You're talking observability tools, you're talking data analytics tools. I mean, I'm sure that the usual suspects in the cloud would be useful in edge conditions as well in terms of basically collecting and centralizing and getting smart about data. And then, of course, there's one more elephant in the room, and that's AI. And I know that this is something that both of you have talked about in terms of how can we maybe use that as a tool to

Starting point is 00:20:20 help with this dark data problem. What do you guys think about tools and about AI specifically? Well, I'm interested to hear what Gina says on this. I mean, obviously, the holy grail of getting a handle on all this data is actually to go train an algorithm and do something interesting with it. And I think that that is the main opportunity in front of IT organizations to help be the center of business growth is looking at ways to apply AI to their business. I do think that, you know, does data discovery get better with AI? Does data hygiene get better with AI? I can't imagine that data hygiene wouldn't get better with AI.

Starting point is 00:21:08 But I haven't seen anything that says that, you know, somebody's come up with the solution yet. And maybe I'm just unaware. Gina, have you seen anything in the industry in this space? I think this is a really tricky question, right? Because AI is going to, and AI is going to use algorithms to extract meaning from data. So the quality of the data is going to be what actually gives you a result. So an example I have, I'm helping a company called Simon Data, and they have a platform that runs on top of Snowflake data.

Starting point is 00:21:48 And it's a marketing platform. So they're looking at taking all of the information about customers that might exist in Snowflake and turning that into meaning for the marketers so that they can give very personalized campaigns down to their customers that they already have or prospects they might see in the pipe. And what they have to do to do that is a ton of data cleaning. So they can't run any kind of AI to do anything until they say, okay, this customer, maybe this customer is, we'll say, Delta Airlines. I'm just going to throw something out there. I have no idea if they're even a customer of theirs. That might be in their CDP and even in their Snowflake platform, 500 different ways, which

Starting point is 00:22:37 is the one way they want to talk about that. And what's just the one source of data that's going to be the best source of data to describe their customer base and their prospect space? Customers have to go through a whole methodology of cleaning that data and getting it straight before they can even apply AI to it, because if you don't, you're going to get garbage out. So when you run an algorithm and it tries to find something or tries to do something, it's not going to do a great job. So if you have an algorithm that goes across all of your data sets to find the dark data, you'd have to have it just all really defined

Starting point is 00:23:16 really well, what is the, what is the data I'm looking for? And can I look at this? And you'd have to have that all defined before the algorithm could even do anything. So what you're saying, I think, is it's probably not going to find the dark data. I don't think so. I think that, well, I don't know. I don't think that would be an AI. I think that might be a tool or a script, like you guys are talking about, a tool to actually go and find those extra data sets.

Starting point is 00:23:51 Once you find them and you bring it in line with your other data sets that you have, then you probably could write an algorithm to do something, but like, what would you want it to do? I don't know if you can write an, maybe you could write an algorithm to clean the data, to get it to the right place. But I think that's one of those things where you're going to have to go in like line by line to find every single Delta airline, every single way it's misspelled, everything, because that's going to be potentially too hard to teach an algorithm to do because of the things that humans do to create the data. Yeah, I think that this comes back to the human problem, right? We create these challenges and create bad data. And until you go through the cleaning process, I haven't seen an AI cleaner yet.
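
The cleaning step described here, where one customer shows up under many spellings, can be sketched in a few lines of Python. This is only an illustration, not the method used by Simon Data or any vendor mentioned in the episode; the sample names, the suffix list, and the functions `normalize` and `group_variants` are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical raw values, as they might appear across several systems.
RAW_NAMES = [
    "Delta Airlines",
    "delta airlines",
    "Delta Air Lines, Inc.",
    "DELTA AIR LINES",
    "Delta  Airlines Inc",
    "United Airlines",
]

# Assumed list of legal suffixes to ignore when comparing names.
SUFFIXES = {"inc", "corp", "co", "llc", "ltd"}


def normalize(name: str) -> str:
    """Build a comparison key: lowercase, no punctuation, no suffixes, no spaces."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return "".join(word for word in words if word not in SUFFIXES)


def group_variants(names):
    """Group raw names that share the same normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    for key, variants in group_variants(RAW_NAMES).items():
        print(key, variants)
```

A rule-based pass like this only merges variants that differ in case, spacing, punctuation, or suffix. A typo such as "Detla Airlines" still lands in its own group, which is consistent with the point made in the conversation that some of this work ends up being line-by-line human review.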

Starting point is 00:24:38 That would be lovely because nobody likes data cleaning. Nobody likes doing it. But unless you do that, your algorithm is going to be trained in a way that isn't going to be effective. So I think that AI is the opportunity statement of why you want to go find that dark data, why you want to clean it,

Starting point is 00:25:00 why you want to talk across organizations of why it's valuable. But I don't think it's the solution to finding it. I think that's observability and I think it's discovery solutions. And I think it's roll up your sleeves and clean some data. Yeah, I think so too. I think that's more of an ops type of role, and then the data scientists and the data science is more for the algorithms. But I don't know, I feel like it shouldn't be so

Starting point is 00:25:36 all doom and gloom though about dark data. I think dark data can be very exciting and very good depending, you know, on what it is and the data that comes back. I just think it comes back to good old data center hygiene, though. It just does. You can't – I also want to say I think it would be really dangerous potentially to let AI, as things stand now, to let it go and clean itself or to clean its own data? Because what kind of dark data could an AI algorithm create? I think that they would create more dark data than not, or junk data. You have to be really careful with all of that right now.

Starting point is 00:26:16 I agree with you, Gina. I don't think it's a negative. I think it's buried treasure. You know, you're going to get a lot of treasure boxes. Maybe some of them are going to be empty, but that's kind of that thrill of, is this something that's going to be valuable that we didn't even know we had? That's exciting. I would say that there's actually a reason for optimism here when it comes to edge specifically, because one of the challenges and opportunities of, of this environment is that in many cases it's not general purpose computing. It's not general. It's very specific purpose that's being deployed by a specific organization for a specific reason.

Starting point is 00:26:57 And I think that that gives us the possibility, unlike, for example, you know, in the data center where, you know, basically almost anything could be run or on the desktop where certainly anything can be run, you know, at the edge, if you're deploying something, if you're deploying an application or some hardware, you kind of, you know what you're putting out there. It's not like there's just random stuff running out there. And that means that those organizations might have a better handle on the data that they're collecting, how they're processing it, how they're organizing it, how they're retrieving it, how they're centralizing it, than they might in other kinds of environments.

Starting point is 00:27:34 You also don't have the problem of the proverbial, you know, give my credit card to Amazon problem that we have in the cloud, where somebody can just deploy something and then it gets out of hand. Because there again, I mean, you're just not going to be deploying that stuff at the edge unless you really know what you're doing. And so, you know, for all the things that we've said, I do feel like there's some optimism here that this time may be better than previously. Well, thank you so much, Allison and Gina, for joining us today on Utilizing Tech. As we wrap this up, where can we connect with you and continue this conversation on Edge and any other topic you're interested in? Yeah, you can find me on LinkedIn, Gina Rosenthal,

Starting point is 00:28:16 or at Digital Sunshine Solutions with an S dot com. And also we talk about things just like this. We actually are going to publish a new episode of our podcast interviewing a data scientist. It's called Tech Aunties. So techaunties.com. from across the edge on my platform, as well as a 2023 edge report that you might want to download and check out around some of the key challenges associated with enterprise adoption of edge. You can also find me at tech Allison on Twitter and at Allison Klein on LinkedIn. And as for me, you'll find me here on Utilizing Edge every Monday, on the On-Premise IT podcast most Tuesdays, and on the weekly Gestalt IT News Rundown on Wednesdays. So thanks for listening to Utilizing Edge, part of the Utilizing Tech podcast series.

Starting point is 00:29:12 If you enjoyed this discussion, we would love to hear from you. Please reach out to us, find us on most social media networks at Utilizing Tech, or just drop me a line. You'll find me at S. Foskett on most social media networks. Also, if you like listening to this, you can find us in most podcast applications as well as on YouTube. This podcast is brought to you by gestaltit.com,

Starting point is 00:29:36 your home for IT coverage from across the enterprise. For show notes and more episodes, though, head to our special dedicated website, utilizingtech.com. And as I said, you can find us on social media at Utilizing Tech. Thanks for listening, and we will see you next week.
