Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 2x05: How AI Can Save IT Operations From Drowning in Data with Josh Atwell from Splunk
Episode Date: February 2, 2021
AI is impacting IT operations more quickly than expected, and companies like Splunk are leveraging it to augment staff capabilities. Josh Atwell joins Andy Thurai and Stephen Foskett to discuss practical application of AI to help keep IT operations from drowning in data as applications are distributed in containers and the cloud. The key to using AI for operations is to leverage it to assist staff to process the volume and velocity of data, not replace them.
Guests and Hosts: Josh Atwell is Senior Technology Advocate at Splunk. Find Josh on Twitter as @Josh_Atwell and learn more about Splunk at Splunk.com. Andy Thurai, technology influencer and thought leader. Find Andy’s content at theFieldCTO.com and on Twitter at @AndyThurai. Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen’s writing at GestaltIT.com and on Twitter at @SFoskett.
Date: 2/2/2021
Tags: @SFoskett, @AndyThurai, @Josh_Atwell, @Splunk
Transcript
Welcome to Utilizing AI, the podcast about enterprise applications for machine learning,
deep learning, and other artificial intelligence topics. Each episode brings in experts in
enterprise infrastructure to discuss applications of AI in today's data center. Today, we're
discussing applications of AI in the modern data center and specifically AI operations
or AI-assisted operations.
First, let's meet our guest, Josh Atwell.
Thanks, Stephen.
I'm Josh Atwell, Senior Technology Advocate at Splunk.
I'm on Twitter at Josh underscore Atwell, and you can find me on podcasts and blogs
all over the place.
And I am Andy Thurai, Founder and Principal at TheFieldCTO.com.
We do a lot of content and AI observability workshops for large enterprises moving to cloud.
Check us out at thefieldcto.com or you can find me on Twitter at Andy Thurai.
And I'm Stephen Foskett, organizer of Tech Field Day and publisher of Gestalt IT.
You can find me on Twitter at S Foskett.
So, Josh, a lot of folks are really familiar with Splunk, especially
in the IT operations space, because it's become sort of an invaluable tool for, you know, basically
everything that we do in the day-to-day operations of IT infrastructure. But many people probably
aren't aware of all the ways that Splunk is implementing AI into the operations
workflow. And that's really what I want to focus
on here. Not so much Splunk, but really like, you know, how can AI be used to assist in IT operations?
Yeah, you're absolutely right, Stephen. I think most people are familiar with Splunk as a log
company, spending a lot of time with log analytics and consolidating logs. But we've
absolutely made a tremendous amount of investment,
particularly in machine learning and the application of various AI technologies,
those foundational technologies, being able to bring those in to assist IT operations
professionals and businesses in successfully managing, you know, especially their larger,
more complex environments. So, Josh, as you and I talked about,
in my role as an analyst, I just finished up two reports.
One is on AIOps, one is on observability.
And I heard a lot of things from vendors,
including yourself.
Curiosity question.
A lot of vendors are moving to cloud, of course,
most of them moving to AWS,
and they are having the cloud ops issues. But then also there's a lot of those digital enterprises trying to do private,
particularly the finance, banking, insurance, and whatnot, right? They're also transforming
themselves, their private data centers, because they can't work with the legacy systems anymore.
In your view, with the large customers you're dealing with, I'm not talking about the small,
mushrooming day-to-day operations that are coming up.
I'm talking about the large, established customers who have had a digital footprint for 20, 30, 40 years.
Are you seeing a general movement from them, running for the doors, running to the cloud, or making more investment in their data centers,
saying we've got to save that, or a combination of both? What do you see?
It's absolutely a combination of both.
A lot of the conversations I have with customers are really about,
you know, how they rationalize their application portfolio.
And I think that's the most important way to view, you know, that strategy.
You know, what applications do we need to optimize in the data center? Which applications
can we simply move that service or that utility to the cloud? Which applications do we need to
refactor so that we can take advantage of the cloud effectively? And which applications are
we building natively in the cloud to meet new needs or to replace functions and capabilities that at one point were in the
data center, but now need to be in the cloud so that they're closer to the applications that are
touching customers most frequently. And I think the larger companies, more established ones,
as you described, they have a lot more technology in their portfolio that does not lend itself to an easy, smooth transition to a cloud
service. And there's a variety of reasons for that. Prioritization, of course, comes in.
The function that application may have for the business. But the biggest limiter I see
really comes down to the data and the ability to move that data from their data center into the cloud effectively,
so it's located near those applications to reduce latency and to take advantage of the
other capabilities in the cloud. So it's hard for a lot of those applications in that portfolio
to necessarily move to the cloud, and there's a lot of work that has to happen to reconcile that.
So you mentioned the dirty word data.
So let's talk about that a little bit, right?
One of the issues, particularly with IT operations
that I'm a little surprised to see is
your existing IT digital footprint
was generating a decent amount of data.
You know, we thought that was huge in those days, you know,
and when they moved to the cloud, they are in shock.
They are in awe that, particularly when you distribute your applications,
when you have containers, even serverless, running all over the place,
the amount of logs, the amount of traces, the amount of distributed traces,
the amount of, you know, other
metrics you create is unbelievable.
First of all, the data is kind of siloed and distributed everywhere.
Which means you need to send the insights, as you were talking about, to the data, rather
than sending the data back in, because it can overwhelm people and the data lake itself.
And two, the volume of data. We're not talking about one time, two times the data.
There are companies I've seen,
that I've had conversations with, where
when you go from your enterprise data footprint to cloud,
you're talking about hundreds, even thousands of times
the alerts, notifications, and information,
the data points that telemetry
produces. How do you see companies handle that? I mean, that's a major issue for a lot of companies.
They are drowning in data right now, right? They absolutely are. I think the key component of
that, and what's interesting to me as a long-time IT professional, is that a lot of the data that we're seeing in this volume and velocity of data
isn't necessarily so distinct and unique. In the past, all those transactions
that we're now able to have full visibility into were hidden within an application. Nobody
ever coded that application to actually present that information out, because nobody thought it would be useful.
When you move to the cloud and everything's an API call and it's a service call or it's a utilization of a serverless function,
these are all distinct transactions that go over the wire.
And with each transaction, there's the request, there's the response, there's the acknowledgement,
there's the log of whether it succeeded or it failed.
There's information around every transaction that we now have the opportunity to inspect, that we have the opportunity to optimize on.
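To make that concrete, here is a minimal Python sketch of the kind of per-transaction telemetry being described, aggregated for error-rate and latency review. The record fields and service names are illustrative assumptions, not any particular product's schema.

```python
from statistics import quantiles

# Illustrative per-transaction records: every API/service call carries a
# request, a response, a status, and timing we can inspect after the fact.
transactions = [
    {"service": "checkout", "status": 200, "duration_ms": 42},
    {"service": "checkout", "status": 500, "duration_ms": 310},
    {"service": "inventory", "status": 200, "duration_ms": 18},
    {"service": "checkout", "status": 200, "duration_ms": 57},
]

def summarize(records, service):
    """Error rate and latency percentiles for one service's transactions."""
    calls = [r for r in records if r["service"] == service]
    errors = sum(1 for r in calls if r["status"] >= 500)
    durations = sorted(r["duration_ms"] for r in calls)
    p95 = quantiles(durations, n=20)[-1] if len(durations) > 1 else durations[0]
    return {"calls": len(calls), "error_rate": errors / len(calls), "p95_ms": p95}

print(summarize(transactions, "checkout"))
```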
We can review that from a cost-savings perspective or review it to optimize performance. And to the point that you're making, the biggest challenge is that we had not in the past
had a discipline in order to consolidate
and utilize that data.
I think that's been the biggest barrier.
And then the following barrier is,
once you've made the decision and said,
hey, this data has value,
we're gonna make an effort to get value from that data.
How do you centralize it?
You know, because it's happening in multiple services in multiple places. And it could be
coming from your on-prem environment and your cloud environment. And multiple clouds too.
And it could be across multiple clouds or multiple services. You know, if you're connecting something
to ServiceNow or Salesforce to an application that you have, they may be on completely different platforms
in completely different parts of the world. And so that transaction framework exists and that data
exists there. People need to now develop a discipline around how you consolidate and utilize
that data. Right. So that's an interesting point, right? It's the first point you made, which a lot of people don't even realize: in the past, when you had an application, as you said, what application information do you have?
Mostly it's the metrics that you collect saying if my application is working or not, if it is working fine, we're good, right? You test from maybe a couple of locations, maybe do a synthetic monitoring,
maybe some real user monitoring, you're done, right?
But now the application, how do you define an application?
It's not like, you know, running in one server
or two servers or maybe three servers
with a middleware and whatnot.
You're talking about a thousand containers,
which are 200 microservices
or 400 microservices running distributed.
That becomes your application.
So it's kind of a common word,
or a dirty word, nowadays, right?
So when you distribute that application,
having to collect all of them into one place,
whether it's centralized or just a place where you can analyze it,
not necessarily centralized.
And then when you have portions of that
moving into a different cloud,
for example, if you use Google
for one microservice among this application, it becomes really complicated. So on a related note, I have a question for you. So one
is about data. The second one is: do you commonly see companies rather move the models and insights
closer to the data? Or do you see that most companies would prefer to move the data to where they are comfortable?
Yeah, I think that's a great question.
The challenge there really comes down to where you need insight and where you can have action, like what context.
I hear people say, oh, I want a single pane of glass, you know, to be able to see everything. And while
at first that seems like that's what people want, I don't think that's actually the desired state.
I think the desired state is that as your context changes or the view that you need to have,
whether it being a macro view or a micro view, what you're really looking for is more like a lens response where you can zoom in and
zoom out as needed. And depending on the service and depending on what you're utilizing, that
information may not reside where your application is, or it may have to correlate with information
that's in another location, particularly as you take a more macro view and you're looking at,
you know, the holistics of a service or your service as it is compared and measured against business KPIs and business metrics.
And when you do that, that changes your scope and may change where you need to centralize or associate that data.
So Josh, a related question. I know I'm going a little bit too deep on this.
There's also a related issue. You guys do that, I know for a fact, but there are other vendors who do it
differently. In other words, particularly when it comes to operational data,
there's a difference between full fidelity data versus sampling. The reason most people
do sampling is because they couldn't handle
the volume of data, which we talked about, right? I'm not even talking about a centralized place;
even at the source of collection, the point of production, they couldn't handle it. So they
tend to kind of sample it. They try to reduce that, aggregate that, and produce insight and tell you
that's all you need, because, you know,
after all, it's not going to change in a matter of seconds or whatever, right? So that's good enough
for you to make a decision. And there is the next school of thought, which you guys do, which is more about:
I'm going to give you full fidelity data, not a small amount, not a sampling, all the information
you have, and all insights based on
information on all full-stack metrics, right? Which one is better, in your view? I mean,
depending on the situation, both could be better. Which one is better? And more importantly,
what do you see large digital enterprises, like, for example, Uber,
with the amount of data they produce, or Airbnb and things like that, doing? Where do
you see that going? Yeah, I'll use a metaphor, if I may, to help with that. And it's a photography
perspective. Within photography, photographers, especially professional ones, want to shoot in
RAW, right? Because they want the full scope of the data that the sensor can provide, and they need that
because they have specific needs and requirements to deliver on a certain product. And so sampling,
or a compressed JPEG format, isn't ideal for that; they're not going to be able to
deliver precisely what they want. And so they will shoot RAW, and they will take the hit on, you
know, the increased time it takes to process the image or the increased data to store the image.
But they do that because they need that flexibility.
And I'm not going to say that sampling is bad because most of what I have is JPEG and it's okay.
But what is necessary is that when you need that full fidelity, and when it's important to do that, you know, perhaps you're doing something with a green screen or you're going to do heavy editing,
or, in the case of the data that we have, where you really need to be looking into the
transactions and understanding what's happening, because a sample is going to lose insight
that you may need.
Or if you're training a learning model, like a machine learning model, you don't necessarily
want to use sampling there. You want to make sure that it's getting full fidelity, it can get it
processed so it can learn properly and be able to see that correlation between a sample set
and a full fidelity set. And so I'd look at it that way. Sampling can be just fine for a lot
of things. What you really, what organizations really are looking for is that flexibility and the opportunity and the option, right? Not having the option, I think, is the biggest liability versus making a decision one way or another. You need to have that flexibility. One thing I've noticed, though, is that a lot of these tools tend to be really siloed. Essentially, you know, it's a tool for like one very, very specific part of IT infrastructure. I think that that's something
that really is a real shortcoming with a lot of these tools. And I'm wondering, you know,
there's kind of a perverse thing going on here where we're getting these much, much more diverse
environments where we've got Kubernetes. And like you were saying, we're implementing containers and we're spreading data across all sorts of locations and
everything. And yet, in a way, you could look at Kubernetes as a way to organize a lot of this
infrastructure as well. So it's not just a way to spread out the, you know, compute. It's also a way to organize compute. And I'm wondering
if maybe that's going to help us because one of the challenges for a lot of AI applications is
that, you know, it has to deal with a lot of sort of surprise or, you know, out-of-the-ordinary data.
With, you know, a Kubernetes-based cloud infrastructure,
the data is almost self-describing.
And so we can end up with a situation where we have
basically better organized data,
even though it's more diverse and more widespread.
I don't know if you're following me there,
but it just seems interesting to me
that Kubernetes is both exploding our data centers
and also organizing them.
Yeah, I think that's a fair perspective.
I will counter slightly first, and then I'll dive in to why I agree with you,
in that as someone who's automated data centers quite a bit over the years,
especially pre-cloud, with VMware in particular,
they did a really great job, in my opinion,
of making information known about the infrastructure and where your applications resided, so that you could not only be informed about what's going on, but automate against that.
OpenStack attempted to open that up even more.
There were just other complexities and challenges there that I think have caused it to stall and falter. With Kubernetes in particular, though, while I won't
necessarily agree that it specifically makes that visibility better, I do think it makes that
capacity to deliver a consistent service both on-premises and in the cloud really valuable.
Because now, if you're going to refactor your application to be supported
to run on Kubernetes, in most cases, you'll be able to port or migrate that between different
types of Kubernetes instances. And I think that's critically important. And then in that, your
monitoring can remain fairly consistent. The view of the world, your automation, a lot of those
things won't transition dramatically.
I mean, you may have to make transitions depending on whether you're running OpenShift,
whether you're running on-prem, whether you're running as a public cloud service.
But providing more consistency in how the application can be managed if it's running
on Kubernetes, I think, is where the real key is.
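As a rough sketch of how that consistency shows up in practice, Kubernetes labels make workload telemetry largely self-describing, so the same query can run against an on-prem or a managed cloud cluster. This assumes the official kubernetes Python client, a reachable kubeconfig, and an app/team labeling convention of our own choosing.

```python
# Sketch: Kubernetes labels make workload telemetry largely self-describing,
# so the same query works on-prem or in a managed cloud cluster.
# Assumes the official `kubernetes` Python client and a kubeconfig in place.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

# Ask for every pod that declares itself part of the "checkout" app; the
# label selector is a convention we assume, not something Kubernetes mandates.
pods = v1.list_pod_for_all_namespaces(label_selector="app=checkout")
for pod in pods.items:
    print(pod.metadata.namespace, pod.metadata.name,
          (pod.metadata.labels or {}).get("team", "unowned"), pod.status.phase)
```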
So I'd also like to take the discussion, I think, a little bit more toward AI. So we had talked
about AIOps and how artificial intelligence can help to kind of cut through this mountain of data
and keep you, as Andy said, from drowning in data. What specific ways do you think AI is going to come to operational data and logs and things
like that? Yeah, I like the drowning in data. I think we just all need to evolve gills. But short
of evolving gills, I really think that the key in AI technologies for operations, I use the analogy more of like an Iron Man suit.
Our IT operations professionals, they understand their environments. They have a strong knowledge
base that they share amongst themselves and just from their learning. And what they really need is
some assistance in processing that volume and velocity that Andy pointed out and making it more effective
so that they can make actionable decisions
based on the information at hand.
And there's a few key areas that we see AI jumping in
and helping with IT operations.
A lot of that has to do with being able
to identify event correlation,
things that are happening that weren't anticipated,
maybe a series of events that come around,
anomaly detection, like we don't anticipate seeing this.
So let's look into it.
Being able to aggregate events.
So we've seen this event happen 200 times.
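A minimal sketch of the anomaly-detection piece: flag metric points that fall well outside a rolling baseline. The window size and three-sigma threshold are illustrative choices, not settings from any vendor's product.

```python
import statistics

def rolling_anomalies(values, window=30, threshold=3.0):
    """Flag points more than `threshold` standard deviations away from the
    mean of the preceding `window` points: 'we don't anticipate seeing this'."""
    flagged = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mean = statistics.fmean(baseline)
        stdev = statistics.pstdev(baseline) or 1e-9  # avoid divide-by-zero on flat data
        if abs(values[i] - mean) / stdev > threshold:
            flagged.append((i, values[i]))
    return flagged

# Steady request latency with one unexpected spike at the end.
latencies = [20 + (i % 5) for i in range(60)] + [400]
print(rolling_anomalies(latencies))
```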
I've long complained that any vendor, and I have said this about Splunk in the past,
that allows you to create an alert and open up a ticket in ServiceNow based on an alert
needs to also allow you to automate that so that you don't end up with a thousand of the same
alerts. And so things like that are, I think, the real frontline with the data analytics stuff and
being able to augment the capability and capacity
of our human IT operators.
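And to illustrate the thousand-identical-alerts complaint, here is a hedged sketch of alert deduplication: fingerprint each alert and suppress repeats within a time window before anything opens a ticket. The open_ticket function is a stand-in for a real ITSM integration, not an actual ServiceNow or Splunk API.

```python
import time

SUPPRESSION_WINDOW_S = 15 * 60  # collapse repeats of the same alert for 15 minutes
_last_seen = {}  # fingerprint -> (first_seen, count)

def open_ticket(alert):
    # Placeholder for a real ITSM integration (e.g., a ServiceNow call).
    print(f"TICKET: {alert['host']} {alert['check']} -> {alert['message']}")

def handle_alert(alert, now=None):
    """Open one ticket per (host, check) per window; count the rest."""
    now = now or time.time()
    fp = (alert["host"], alert["check"])
    first_seen, count = _last_seen.get(fp, (None, 0))
    if first_seen is not None and now - first_seen < SUPPRESSION_WINDOW_S:
        _last_seen[fp] = (first_seen, count + 1)  # aggregate: "seen 200 times"
        return
    _last_seen[fp] = (now, 1)
    open_ticket(alert)

for _ in range(1000):  # a thousand identical alerts -> exactly one ticket
    handle_alert({"host": "db01", "check": "disk_full", "message": "volume at 95%"})
```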
Wow, so you dropped all of the AI use cases quickly
in there: event correlation, noise reduction, you know.
And so one of the things also,
and this is where predictive analytics somewhat failed, is a
combination of anomaly detection and trending and seasonality.
AI can go through all of the data, a combination of metrics, both infrastructure as well
as container and application-level metrics, the whole nine yards.
I've seen some tools fairly accurately predict
what's going to happen within the next,
whatever the timeframe may be.
That could be useful in capacity planning.
You can, you know, kick up a bigger server, or some combination
thereof.
So basically I call that a Jedi trick:
seeing things before they happen.
Do you see, I see a lot of digital enterprises
that have that, by the way,
if you're a cloud native, cloud born,
it's a mandated thing for them.
They do it, right?
But the normal enterprise is still
kind of struggling with that.
So they are throwing more bodies at it,
especially with the pandemic and stuff.
Do you see that the tables have turned now
or are they going the other
route or they're still throwing more warm bodies to solve those issues? It's definitely a little
column A and a little column B. From our perspective, we absolutely have products that
when applied in your data center to monitor services and look at them; you know, we have customers who
are getting 30-minute-plus lead times on potential incidents. And it's also one of those things where,
you know, there's the paradox of, well, we've predicted that an incident is going to come;
if you take care of it, was that incident really going to happen? I appreciate that. So at the very
least, what I would argue is that we now have systems in place that can solve for the most critical problem of like drive space filling up, which still happens.
In fact, we just had an outage last week with Google dealing with a volume that filled up, and it nixed the authentication service for a while. I mean, this stuff still happens and it's just a well-known, easily understood problem
that we can apply technology
to look at that and say,
based on current consumption rate,
here is your level of threat
of this thing potentially
causing a service interruption.
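The drive-space example lends itself to a small worked sketch: fit the recent consumption rate and report roughly how long until the volume fills. This is a naive linear extrapolation for illustration; a real system would also account for seasonality and noise.

```python
# Naive sketch: given recent (hour, used_gb) samples, estimate hours until full.
samples = [(0, 410.0), (6, 418.5), (12, 427.0), (18, 436.0), (24, 444.5)]
capacity_gb = 500.0

n = len(samples)
mean_t = sum(t for t, _ in samples) / n
mean_u = sum(u for _, u in samples) / n
rate = sum((t - mean_t) * (u - mean_u) for t, u in samples) / \
       sum((t - mean_t) ** 2 for t, _ in samples)   # least-squares GB per hour

latest_t, latest_u = samples[-1]
hours_left = (capacity_gb - latest_u) / rate if rate > 0 else float("inf")
print(f"Filling at {rate:.2f} GB/h; roughly {hours_left:.0f} hours until full.")
```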
And one thing I also want to point out
that we haven't dove into
is not just the intelligence provided
in identifying a potential issue, but how you inform and notify responders to do something about it.
We've fallen into this trap of saying, well, here's my defined escalation plan and here are the people.
But it's not always the most intelligent way of getting to a solution. Being able to identify the right responder for a service, be able to provide the information on why the incident is being called or why a response is required, provide that information to the responder, make it easy for them to respond or to move that to the next
logical person, and then provide suggestions on how to automate the remediation of that in the
future. I mean, that's the real, you know, I mentioned the Iron Man suit, like that's the real
promise. That's what we're driving towards, you know, with respect to AIOps. It's not just finding problems.
Trust me, we can find problems.
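As a hedged sketch of the "right responder" idea: map services to owners and build a notification that says why the page fired, what the evidence is, and what a candidate remediation might be. The ownership table and payload fields here are assumptions for illustration, not how any specific product models this.

```python
# Sketch: route an incident to the service owner rather than a generic
# level-1 queue. The ownership map and payload fields are illustrative.
SERVICE_OWNERS = {
    "order-taking": {"primary": "pat@example.com", "fallback": "sre-oncall@example.com"},
    "auth":         {"primary": "sam@example.com", "fallback": "sre-oncall@example.com"},
}

def route_incident(service, summary, evidence, suggested_fix=None):
    owner = SERVICE_OWNERS.get(service, {"primary": "sre-oncall@example.com",
                                         "fallback": "sre-oncall@example.com"})
    return {
        "notify": owner["primary"],
        "escalate_to": owner["fallback"],       # the "next logical person"
        "service": service,
        "why": summary,                         # why a response is required
        "evidence": evidence,                   # the data behind the call
        "suggested_remediation": suggested_fix, # candidate for future automation
    }

page = route_incident("order-taking",
                      "checkout error rate above 5% for 10 minutes",
                      {"error_rate": 0.07, "p95_ms": 900},
                      suggested_fix="restart checkout pods / roll back last deploy")
print(page)
```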
A related note on that would be, I did a piece on this recently, a video.
In the past, when incidents were to happen, it had to be handled in a structured way: level one,
level two, level three, and then the SME and all that. Now, if AIOps can detect with
fairly accurate information, 99-plus, 99.99% accuracy, what's the need to involve level one, two, and
three? Just save the time, send it to the person who you know is exactly the one who's going to fix the problem,
you know, and avoid all this, right? So on that I agree with you; a lot of enterprises I see do that. I have a related question, though.
So, yes, ops people, cost center,
maintain application.
If it goes down, yell at them.
Used to happen.
Now with the distributed thing,
DevOps became a big thing.
Developers and ops,
coalesce and, you know, work together,
supposedly to make things better.
But now I'm seeing a new movement called BizDevOps.
The business people work with the dev people,
prioritizing things, what needs to be done,
and then it goes to the ops.
So that's a big movement now.
And then to go to your level of notification and escalation,
there are a few enterprises, customers
that I had a conversation with.
I don't want to name names,
but they have set up notifications. If my specific system, the order-taking system, goes down,
my business guy gets notified. He wakes up at, like, 12 at night and then he comes down to the
ops guy saying, what's the problem? I don't have a server. You know what? I don't care. You
spend a thousand dollars on it. I need the system up right now, because he is the guy who's going to be responsible to answer to the executives. How do you think that's gaining
traction? Is it just an outlier, or are most people doing it that way?
Oh, it's absolutely not an outlier. It is a desired state and it's becoming a critical
desired state. Within our organization, we focus on looking at it from a service level,
right? So if you look at a bank or a retailer delivering out a service, you know, the customer
satisfaction, customer expectation is tied to the health and availability of that service.
What we now see are more business leaders empowering their service owners and their product leaders to invest in
tools and capabilities so that they can maintain that visibility and understand the impact of
business KPIs on these new digital services. And I think that's been a dramatic shift. I mean,
the last time we saw something similar, the two phases would be obviously when the internet became
an e-commerce
platform and people started monetizing on the internet, there was a desire to understand
customer cart behavior. Amazon really nailed this also with recommendations and things of that
nature. Then we saw with mobile platforms, the same type of thing. Now it's with digital services
and how those services connect to multiple
connection points to customers who then have a big impact on the business. And then having that
quick visibility and awareness on that service availability or that service performance,
aligning with the business performance and the business capabilities.
I know that we could probably continue talking about this all day long, but we do have to wrap the episode. But before we go, there's one more thing that we're doing here in season two that I want to make sure I have a chance to put to you, Josh, since you're one of those people that I've known in the industry for a while: a few kind of fun follow-up questions for you on AI. Just say
whatever comes to your mind as an answer to these things. So here we go. And for the audience,
Josh is not prepared for these. He's just being taken by surprise. All right. Question number one,
Josh, are there any jobs that will be completely eliminated by AI in the next five years? In the next five years?
I am disinclined to believe that we will have jobs completely eliminated. I believe
we'll see reductions of jobs, particularly, you know, things around insurance claims,
around people checking to see
if my car warranty needs to be renewed. You know, I think a lot of the spray and pray
businesses will start seeing that. Yeah, I think we'll see a reduction in workforce
as a result, but I don't think we'll see jobs wholly eliminated.
All right. Question number two, is machine
learning a product or a feature? Machine learning, if it's being delivered as a service that can be
consumed by other people, then I would say that that's a product. I really think it's a feature.
I have a few different views on that, but primarily I think it's because machine learning is really the result of applying advanced algorithms that we've known for decades or hundreds of years to new problems.
I think it's going to be more with respect to the algorithm. The algorithms are more of a product than, I think, you know, the machine learning itself is. All right. And one more, this is one of the favorites of the panel because so many
people have disagreed on it. When will we see a full self-driving car that can go anywhere, anytime?
You know, five years ago I would have said there's no chance. And then we
started landing first-stage booster rockets, so I think anything's possible. I still think it's going to be a long way out. I still think that with the infrastructure that supports vehicles, whether you're looking at the highway system, metropolitan areas, or dirt roads, the level of complexity, and then adding, you know, people
and obstacles, I don't expect to see one other than in controlled environments anytime in the
near future. All right, well, thank you so much, Josh, and Andy, thank you so much for this conversation.
Again, I know that you guys could continue this discussion quite a lot, and if folks do want to
continue this discussion with you, Josh, where can they connect with you?
The easiest place to find me is on Twitter at Josh underscore Atwell.
Or you can find me at a variety of different events where I speak or places where I'm writing articles, a lot of them on AIOps topics.
And you can find me on Twitter at Andy Thurai,
or you can find me at thefieldcto.com.
That's thefieldcto.com.
And I'm Stephen Foskett.
You can find me on Twitter at sfoskett.
You can find my writing at gestaltit.com
and in your favorite search engine.
And of course, you can find me here every week
at Utilizing AI.
So thank you,
everyone, for listening to the Utilizing AI podcast. If you enjoyed this discussion,
please subscribe, rate, and review the show. And please do connect with our guests and hosts online and continue the conversation there. This podcast was brought to you by gestaltit.com,
your home for IT coverage from across the enterprise. For show notes and more episodes,
please go to utilizing dash AI dot com or find us on Twitter at utilizing underscore AI.
Thanks, and we'll see you next time.