Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 05x13. Dark Data at the Edge
Episode Date: July 24, 2023
In this episode of the Utilizing Tech podcast, Stephen Foskett, Allyson Klein, and Gina Rosenthal discuss dark data in edge computing. Dark data is unutilized or unknown data collected by organizations.... The distributed nature and use of third-party apps can make it challenging to handle dark data, limiting insights and posing security risks. Establishing a stronger IT-business connection is crucial. Observability solutions and data analytics can aid in discovering and centralizing dark data. AI has potential for data hygiene improvement, but human-driven cleaning is still necessary. Despite challenges, edge computing offers better data management due to controlled deployments.
Host: Stephen Foskett: https://www.twitter.com/SFoskett
Panelists: Allyson Klein: https://www.twitter.com/TechAllyson
Gina Rosenthal: https://www.twitter.com/Digi_Sunshine
Follow Gestalt IT and Utilizing Tech
Website: https://www.GestaltIT.com/
Utilizing Tech: https://www.UtilizingTech.com/
Twitter: https://www.twitter.com/GestaltIT
Twitter: https://www.twitter.com/UtilizingTech
LinkedIn: https://www.linkedin.com/company/Gestalt-IT
Tags: #UtilizingEdge, #DarkData, #EdgeComputing, #Edge
Transcript
Welcome to Utilizing Tech, the podcast about emerging technology from Gestalt IT.
This season of Utilizing Tech focuses on edge computing, which demands a new approach to compute, storage, networking, and more.
I'm your host, Stephen Foskett, organizer of Tech Field Day and publisher of Gestalt IT.
And joining me today as my co-host is Allyson Klein. Welcome.
Hey, Stephen. It's great to be back.
It's great to have you back.
And, you know, we are doing a special episode here, kind of mid-season, talking to some of our Field Day delegate friends about edge,
which is one reason that we decided to invite Gina to join us today.
Gina, welcome to the show.
Hey, y'all. Thanks for having me.
So today on the episode, we are going to talk about something that is, I think, sort of lurking
in the background of a lot of edge conversations. And that is this whole idea of, as Allyson coined
it, dark data. Allyson, tell us a little bit about your thoughts on dark data.
Well, dark data, in my mind, refers to data that an organization may have created
or collected but is unaware is sitting in its data stores, and so is not acting on in terms of monetization or service delivery associated
with that data. And I think that organizations have dark data present, and it's that
opportunity to explore, you know, expanding what they're doing with their data, if they can
actually find out where their dark data is and put a light on it.
And I think one of the things that's challenging here is that there certainly
is dark data in the data center and in the cloud, but at the edge, well, you really don't know what
you've got, right, Gina?
Well, you should kind of know what you've got if you're doing it right,
I think. So that's kind of, I think, why you have the dark data. But I think
if you're intentional with what you're doing at the edge and you know what data is being created
and what data is available, you'll know whether it's something you want to pull into your data
lake or something you don't want to keep, and you'll be able to make those decisions. But I
think you need to know what kind of data you've got everywhere, though.
Well, is that realistic?
I mean, do people know what they've got going on out there?
I just think that, you know, I've spent a lot of time in the industry, and I think that
we assume that all of this is a lot easier than it actually is in practice.
And the reality is IT organizations have tremendous priorities associated with
security, keeping their workloads up. Maybe some things fall through the cracks in terms of data
that's being created by different business groups at the edge. Maybe they don't have the right tools
in place in terms of observability and discoverability so that when this data is created, it's obviously
flagged for their IT organization to explore.
And maybe the IT organization knows about the data, but it's dark to the line of business
that would actually see value from it.
So I think that there's a lot of different angles that you can use in terms of thinking
about what the challenge or opportunity statement of dark data is.
Right. I do agree with you. I think IT probably knows about the existence of dark data or the
ability to create more dark data and makes decisions based on costs and not necessarily
costs of, you know, opportunities for the business.
Because I know we used to do that back in the day,
and I don't think that's something that's really changed, you know.
I think another opportunity to create dark data is probably the use of SaaS applications, which are used everywhere to try just to get business done as fast as possible.
That's really interesting, Gina.
How do you see the tie between SaaS and dark data? Can you say more?
Well, sure. I help a lot of people with marketing
because that's what I do. And I think a lot of times people get frustrated with the overall
big company rules about how you can do marketing, especially collecting customer data. Where are
your customers at? They
just want to throw something together, collect some leads and go on with it. So you'll have a
sales team wherever they may be, a sales leader saying, just go get this done for me. Go use
XYZ application. Just use your credit card, run it and get it. Sometimes really not thinking about,
well, there's an entire, you know, CDP,
which is supposed to be gathering all the data in one place from everywhere.
The idea is to have the single pane of glass,
as they like to say, and also security and everything,
especially with personal information,
which we know has lots of governance around it.
So the overall line of business may not know about it.
The sales leader that's decided to go off and create his own workshop to get the leads he needs to drive his business
may not know about the opportunities of tying all the data in the organization together to find more customers and find more leads and the rest of it.
So you end up with kind of fragmented siloed
data all over the place. For edge, I think it would be interesting to see, you know,
because I think it was interesting what you said a few minutes ago, that
it's really hard: are there any best practices that people are using when it
comes to building an edge application?
Or are they just grabbing whatever looks good at the time, whatever will get it out the door and fit within the cost constraints, just to make it happen?
And nobody ever circles back around and tries to put that into a bigger system.
I think when you're talking about edge, one of the things that came up when we were talking about far edge applications and, for example, at retail was that you have maybe not SaaS applications specifically, but you have applications that are being brought in by different people, different departments.
Some of them are outside applications.
Some of them are inside, but from different departments. And all of those are running and are expected to run everywhere
remotely. And I think that that seems like kind of an analogous situation. So you have,
you know, for example, maybe an advertising application that's running in your Quick Mart.
And that application came from a third party vendor and it has no real ties back to the company except through its own communications links and its own
service provider and its own data format and so on.
And all of that is really isolated from the overall corporation. I think that that's really probably a very normal use case in the edge. And that's why when you brought up this idea of data silos and dark data, it really resonated with me in terms of edge, because I could totally see that happening in this environment where you've got all these different applications running. It really is like sales enablement applications running on
phones and laptops and everything. It's just a completely different set of data, right?
I think that one of the things that comes to mind is that it's a technology problem and it's a human problem. The technology problem is full-stack integration and really thoughtful
tie-in, you know, so that the leads generated from this application automatically flow into my CRM.
You know, those things can be done, but are they done all the time? Sometimes not. The human problem is really having all of
these different organizations deploying applications and maybe not seeing the point.
To Gina's point earlier, the sales leader doesn't really care if we capture those leads,
that contact information, for further purposes. He just wants to meet his quarterly numbers.
And so, you know, lost opportunity. I think that, you know, the edge is kind of accelerating both of these challenges because so much of the data that is generated today is being created at
the edge. It needs processing at the edge. And we are seeing that these two challenges, technology and
cross-departmental communication, are just getting, you know, kind of in the way of the
full value proposition that's in front of us.
Think about how much better you can do
things if you have the data in one location that you can write applications against, and how much better you can derive data for the entire organization.
Or think of it another way.
Think of how much better it would be if you were able to give, you know, whatever
it was, a sales leader, what he needs to create his own sales program, but all protected and using all the data from the organization.
It's kind of short-sighted not to do that.
But human nature is, like you said, just to get it out there and get it done, meet my
number.
Another thing that occurs to me is that in many cases with edge, we're talking about
data that is intentionally darkened.
Let's say that.
So you're collecting metrics from machinery or you are collecting video and processing
it and so on.
The idea is that you are going to collect and process that data at the edge. You're going to extract
what you think is valuable today from that data. And then you are going to either chuck or just
ignore that source data and send back only the results. And I think that in many ways,
this is an optimization that people are doing in order to optimize bandwidth and
connectivity and cost. But at the same time, that's dark data too, right? So I mean, if you
think about like a security camera example, for instance, you've got a whole bunch of security
cameras, they're all processing data. You've got maybe a machine learning algorithm that is
processing that and extracting the interesting bits and only sending the interesting bits back home.
Who's to say that you don't need or want that data in the future?
And even if you are pretty sure you don't need it because it's just an empty parking lot or a machine that's idle, you know, maybe some of the other data that you've rejected, you know, you might need.
Or maybe the machine, the algorithm rejected some of it on its own. And you find that, oh,
man, I wish I could see what was going on at this time code. I think that's another area where
data could be dark, even in a system that is correctly configured to ship some data back to
the source, right?
It's an interesting question, Stephen, because at some point, what is dark, right?
Is it stuff that you don't know about?
Is it stuff that you're not acting on?
Is it stuff that you're throwing away?
I don't know where the right demarcation point is.
And I think that's worth exploring.
If an IT organization has already
looked at this particular use case and said, you know what, we're just going to move this portion
of the data because we already know that this other stuff is junk, I mean, it's almost
like, at that point, does holding on to the junk become data hoarding? Um, you know, where is the demarcation line
where it's no longer dark, it's useless, and we're just going to get rid of it? I understand your
point about the security cameras. I mean, maybe I've watched too many true crime, um, episodes,
but I always know that it's bad when they lose the data of the security camera. That's a common theme. But I think that, all joking aside and
black humor aside, if we think about an IT organization making thoughtful decisions
about what they want to keep, I don't know if that is dark or if it's a really good data policy. Because I don't think that the right
answer is that an organization should hold on to everything for all time because what is useless
may become useful in the future. I don't know. I could be talked out of that position.
I like everything you said, but I do want to say,
I'm not sure that the IT department should be deciding all of that, because
there's more to it than just, you know, the cost involved with keeping it.
It's all discoverable at that point in time. And there's also the security
piece: if IT is keeping that much data, do you know what the data is? Does any of that,
you know, give you a wider exposure platform? So I think again this goes back to:
you ought to know what your IT department's throwing away, and they should be able to provide
you with their rationale for why they're throwing it away.
And if they're hoarding it, they should be able to provide you with that rationale.
But there probably needs to be more than the IT department involved: probably the legal department,
whoever's responsible for security, and then just the business itself, like, would you do anything
with this data and what would you do? And deciding from the business point of view, either because they don't know what they don't know,
or because maybe they did decide that this wasn't needed, and now they find out it is.
Or, you know, as you brought up at the beginning, Gina, the idea that sort of
third-party external applications, and so on, you know, they may have value that
companies aren't even aware of. And I think in all those cases, it's
possible that there could be, you know, data out there at the edge that people
kind of wish there wasn't. And maybe that's the next thing to talk about:
sort of, what is the repercussion here?
I mean, is it bad? Is it bad to have data that you don't know about? Or is it okay? Is that just how
business works? What are the risks here?
I think that ultimately you've got a situation where
it depends on the type of data that it is, right?
There may be data that you don't necessarily want to hold on to from a standpoint of,
you know, it could open you up to privacy concerns. It could open you up to a number of different reasons why your lawyers might tell you, actually, it's not good business practice
for us to hold on to this data. So to that point, if you don't know you're collecting it, and it's just sitting out at the edge
somewhere, that could be exposing you to things that you don't want to be exposed to.
And then I think that, you know, there's always security risks associated with data.
And so if you don't know what you're protecting, then how do you know if you've been breached?
If you don't know what value that data is, how do you know the cost of that breach or the business risk associated with it? Those are the things that lead me to the determination that if it was my IT
organization, I would not want to have dark data exposure because it's a very difficult thing to
wrap my hands around quantifying what the risk of that exposure is.
I would also think, you know,
as an application owner, I would want to know the risks for sure, but there's also an opportunity.
So if I own an application or a program from a business perspective, I want to know if it's creating some, even if it's a SaaS application or if it's a, you know, if it's something we built for edge consumption.
I want to know what data is being created or the possibility of what data can be created.
I want to know what IT is not keeping.
Is there something they could keep that we want to keep?
And how do we make sure that's secure?
And if it's discoverable, what does that mean to the organization? So I think there has to be a tighter connection between IT and the business. The dark data could be the data that gives the data scientist, like, the last piece of the puzzle to help them solve a problem for the business:
that bit of data that they didn't know existed till they did know.
So I think that the business folks need to understand what dark data is and work with their IT people to understand the risk of keeping the data.
You know, it's interesting.
I was just talking to the
good folks over at Calyptia, Stephen, and one of the things that he said to me, and for those who
don't know who Calyptia is, they build observability solutions. They're the ones who are behind Fluentd,
and a lot of folks are familiar with Fluent Bit and Fluentd from the open source space.
And one of the things that he said that really
piqued my interest was that his customers, after using their solutions, found their data problems got worse
because they discovered their dark data. And all of a sudden they had a bigger, you know, data
hygiene challenge ahead of them in terms of what to do with all of this data that they
didn't even know existed. But it does kind of validate the point of what we're talking about,
that every organization has it. A lot of companies are not using those observability solutions to
find it. And once they do, they've got a big cleanup on aisle five challenge in front of them in terms of getting it in order and figuring out what's valuable.
I want to talk a little bit about tools, too, here.
So you mentioned Calyptia.
A couple of other companies that I've been talking to are working on similar issues.
I talked to Hammerspace earlier, which is basically bringing unstructured data from everywhere and anywhere into corporate control
and centralization. Another company I just talked to today was Resilio, which is another company
that's looking at basically how can we transport lots of data from lots of places and consolidate it into one spot. There's a whole
world of tools out there. You're talking observability tools, you're talking data
analytics tools. I mean, I'm sure that the usual suspects in the cloud would be useful
in edge conditions as well in terms of basically collecting and centralizing and getting smart about data.
And then, of course, there's one more elephant in the room, and that's AI. And I know that this is
something that both of you have talked about in terms of how can we maybe use that as a tool to
help with this dark data problem. What do you guys think about tools and about AI specifically?
Well, I'm interested to hear what Gina says on this.
I mean, obviously, the holy grail of getting a handle on all this data is actually to go
train an algorithm and do something interesting with it.
And I think that that is the main opportunity in front of IT organizations to help be the center of business growth is looking at ways to apply AI to their business.
I do think that, you know, does data discovery get better with AI?
Does data hygiene get better with AI?
I can't imagine that data hygiene wouldn't get better with AI.
But I haven't seen anything that says that, you know, somebody's come up with the solution yet.
And maybe I'm just unaware.
Gina, have you seen anything in the industry in this space?
I think this is a really tricky question, right?
Because AI is going to
use algorithms to extract meaning from data. So the quality of the data is going to be what
actually gives you a result. So an example I have, I'm helping a company called Simon Data,
and they have a platform that runs on top of Snowflake data.
And it's a marketing platform. So they're looking at taking all of the information about customers
that might exist in Snowflake and turning that into meaning for the marketers so that they can
give very personalized campaigns down to their customers
that they already have or prospects they might see in the pipe. And what they have to do to do
that is a ton of data cleaning. So they can't run any kind of AI to do anything until they say,
okay, this customer, maybe this customer is, um, we'll say Delta Airlines. I'm just going to throw something out there.
I have no idea if they're even a customer of theirs.
That might be in their CDP and even in their Snowflake platform 500 different ways. Which
is the one way they want to talk about that?
And what's the one source of data that's going to be the best source of data to describe
their customer base
and their prospect space? Customers have to go through a
whole methodology of cleaning that data and getting it straight before they can even apply AI
to it, because if you don't, you're going to get garbage out. So when you run an algorithm and it tries
to find something or tries to do something, it's not going to do a great job. So if you have an algorithm that goes across
all of your data sets to find the dark data, you'd have to have it all defined
really well: what is the data I'm looking for? And can I look at this? You'd
have to have that all defined before the algorithm could even do anything.
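To make the cleaning step concrete, here is a minimal Python sketch of the kind of script that could group spelling variants of one customer name for a person to review. It is an illustration under assumptions, not something Simon Data or Snowflake provides: the function names `normalize` and `group_variants`, the suffix list, and the 0.85 similarity threshold are all hypothetical choices, and the sample names are made up. It relies only on `re` and `difflib` from the Python standard library.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of corporate suffixes to ignore when comparing names.
SUFFIXES = {"inc", "co", "corp", "llc", "ltd"}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop common suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    tokens = [t for t in cleaned.split() if t not in SUFFIXES]
    return " ".join(tokens)


def group_variants(names, threshold=0.85):
    """Group raw names whose normalized forms look alike.

    Returns a list of (canonical_key, [raw names]) pairs. The first name
    seen in each group supplies the canonical key. The threshold is an
    assumed starting point and would need tuning on real data.
    """
    groups = []
    for raw in names:
        key = normalize(raw)
        for canon, members in groups:
            if SequenceMatcher(None, key, canon).ratio() >= threshold:
                members.append(raw)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append((key, [raw]))
    return groups


if __name__ == "__main__":
    # Made-up sample records, including a misspelling.
    sample = [
        "Delta Airlines",
        "DELTA AIRLINES, INC.",
        "Delta Air Lines",
        "Delta Airlnes",
        "United Airlines",
    ]
    for canon, members in group_variants(sample):
        print(canon, "<-", members)
```

A script like this only proposes groupings. Someone still has to confirm which spelling is the one to keep and whether two similar names really are the same customer, so it supports the cleaning work rather than replacing it.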
So what you're saying, I think, is it's probably not going to find the dark data.
I don't think so.
I think that, well, I don't know.
I don't think that would be an AI.
I think that might be a tool or a script, like you guys are talking about, a tool to
actually go and find those extra data sets.
Once you find them and you bring them in line with your other data sets, then you probably could write an algorithm to do something, but, like, what would you want it to do?
Maybe you could write an algorithm to clean the data, to get it to the
right place. But I think that's one of those things where you're going to have to go in, like, line by line to find every single Delta Airlines, every single way it's
misspelled, everything, because that's going to be potentially too hard to teach an algorithm to do
because of the things that humans do to create the data.
Yeah, I think that this comes back to the human problem, right?
We create these challenges and create bad data.
And until you go through the cleaning process,
I haven't seen an AI cleaner yet.
That would be lovely because nobody likes data cleaning.
Nobody likes doing it.
But unless you do that,
your algorithm is going to be trained
in a way that isn't going to be effective.
So I think that AI is the opportunity statement
of why you want to go find that dark data,
why you want to clean it,
why you want to talk across organizations
of why it's valuable.
But I don't think it's the solution to finding it.
I think that's observability and I think it's discovery solutions.
And I think it's roll up your sleeves and clean some data.
Yeah, I think so too.
I think that's more of an ops type of role, and then the data scientists
and the data science are more for the algorithms. But I don't know, I feel like it shouldn't be so
all doom and gloom, though, about dark data. I think dark data can be very exciting and very good,
depending, you know, on what it is and the data that comes back.
I just think it comes back to good old data center hygiene, though.
It just does.
You can't – I also want to say I think it would be really dangerous potentially to let
AI, as things stand now, go and clean its own data. Because
what kind of dark data could an AI algorithm create? I think that it would create
more dark data than not, or junk data. You have to be really careful with all of that right now.
I agree with you, Gina. I don't think it's a negative. I think it's buried treasure.
You know, you're going to get a lot of treasure boxes. Maybe some of them are going to
be empty, but that's kind of that thrill of, is this something that's going to be valuable that
we didn't even know we had? That's exciting. I would say that there's actually a reason for
optimism here when it comes to edge specifically, because one of the challenges and opportunities of this environment is that in many cases it's not general-purpose
computing.
It's a very specific purpose that's being deployed by a specific organization
for a specific reason.
And I think that that gives us the possibility, unlike, for example,
you know, in the data center where, you know,
basically almost anything could be run or on the desktop where certainly anything can
be run, you know, at the edge, if you're deploying something, if you're deploying an
application or some hardware, you kind of, you know what you're putting out there.
It's not like there's just random stuff running out there.
And that means that those organizations might have a better handle on the data that they're collecting, how they're processing it, how they're organizing it,
how they're retrieving it, how they're centralizing it, than they might in other kinds of environments.
You also don't have the problem of the proverbial, you know, give my credit card to Amazon problem
that we have in the cloud, where somebody can just deploy something and then it
gets out of hand. Because there again, I mean, you're just not going to be deploying that stuff
at the edge unless you really know what you're doing. And so, you know, for all the things that
we've said, I do feel like there's some optimism here that this time may be better than previously.
Well, thank you so much, Allyson and Gina, for joining us today on
Utilizing Tech. As we wrap this up, where can we connect with you and continue this conversation
on edge and any other topic you're interested in?
Yeah, you can find me on LinkedIn, Gina Rosenthal,
or at Digital Sunshine Solutions, with an S, dot com. And also we talk about things just like this.
We actually are going to publish a new episode of our podcast interviewing a data scientist. It's called Tech Aunties. So techaunties.com.
... from across the edge on my platform, as well as a 2023 edge report that you might want to download
and check out around some of the key challenges associated with enterprise adoption of edge.
You can also find me at TechAllyson on Twitter and at Allyson Klein on LinkedIn.
And as for me, you'll find me here on Utilizing Edge every Monday, on the On-Premise IT podcast
most Tuesdays, and on the weekly Gestalt IT News Rundown on Wednesdays.
So thanks for listening to Utilizing Edge,
part of the Utilizing Tech podcast series.
If you enjoyed this discussion, we would love to hear from you.
Please reach out to us, find us on most social media networks
at Utilizing Tech, or just drop me a line.
You'll find me at SFoskett on most social media networks.
Also, if you like listening to this,
you can find us in most podcast applications
as well as on YouTube.
This podcast is brought to you by gestaltit.com,
your home for IT coverage from across the enterprise.
For show notes and more episodes, though,
head to our special dedicated website, utilizingtech.com.
And as I said, you can find us on social media at Utilizing Tech.
Thanks for listening, and we will see you next week.