Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 2x05: How AI Can Save IT Operations From Drowning in Data with Josh Atwell from Splunk
Episode Date: February 2, 2021
AI is impacting IT operations more quickly than expected, and companies like Splunk are leveraging it to augment staff capabilities. Josh Atwell joins Andy Thurai and Stephen Foskett to discuss practical application of AI to help keep IT operations from drowning in data as applications are distributed in containers and the cloud. The key to using AI for operations is to leverage it to assist staff to process the volume and velocity of data, not replace them.
Guests and Hosts: Josh Atwell is Senior Technology Advocate at Splunk. Find Josh on Twitter as @Josh_Atwell and learn more about Splunk at Splunk.com. Andy Thurai, technology influencer and thought leader. Find Andy’s content at theFieldCTO.com and on Twitter at @AndyThurai. Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen’s writing at GestaltIT.com and on Twitter at @SFoskett.
Date: 2/2/2021
Tags: @SFoskett, @AndyThurai, @Josh_Atwell, @Splunk
Transcript
Welcome to Utilizing AI, the podcast about enterprise applications for machine learning,
deep learning, and other artificial intelligence topics. Each episode brings in experts in
enterprise infrastructure to discuss applications of AI in today's data center. Today, we're
discussing applications of AI in the modern data center and specifically AI operations
or AI-assisted operations.
First, let's meet our guest, Josh Atwell.
Thanks, Stephen.
I'm Josh Atwell, Senior Technology Advocate at Splunk.
I'm on Twitter at Josh underscore Atwell, and you can find me on podcasts and blogs
all over the place.
And I am Andy Thurai, Founder and Principal at TheFieldCTO.com.
We do a lot of content and AI observability workshops for large enterprises moving to cloud.
Check us out at thefieldcto.com or you can find me on Twitter at Andy Thurai.
And I'm Stephen Foskett, organizer of Tech Field Day and publisher of Gestalt IT.
You can find me on Twitter at S Foskett.
So, Josh, a lot of folks are really familiar with Splunk, especially
in the IT operations space, because it's become sort of an invaluable tool for, you know, basically
everything that we do in the day-to-day operations of IT infrastructure. But many people probably
aren't aware of all the ways that Splunk is implementing AI into the operations
workflow. And that's really what I want to focus
on here. Not so much Splunk, but really like, you know, how can AI be used to assist in IT operations?
Yeah, you're absolutely right, Stephen. I think most people are familiar with Splunk as a log
company, spending a lot of time with log analytics and consolidating logs. But we've
absolutely made a tremendous amount of investment,
particularly in machine learning and the application of various AI technologies,
those foundational technologies, being able to bring those in to assist IT operations
professionals and businesses in successfully managing, you know, especially their larger,
more complex environments. So, Josh, as you and I talked about,
in my role as an analyst, I just finished up two reports.
One is on AIOps, one is on observability.
And I heard a lot of things from vendors,
including yourself.
Curiosity question.
A lot of vendors are moving to cloud, of course,
most of them moving to AWS,
and they are having the cloud ops issues. But then also there's a lot of those digital enterprises trying to do private,
particularly the finance, banking, insurance, and whatnot, right? They're also transforming
themselves, their private data centers, because they can't work with the legacy systems anymore.
In your view, with the large customers you're dealing with, I'm not talking about the small,
mushrooming day-to-day operations that are coming up.
I'm talking about the large, established customers who have had a digital footprint for 20, 30, 40 years.
Are you seeing a general movement from them, running for the doors, running to the cloud, or making more investment in their data centers,
saying we've got to save that, or a combination of both? What do you see?
It's absolutely a combination of both.
A lot of the conversations I have with customers are really about,
you know, how they rationalize their application portfolio.
And I think that's the most important way to view, you know, that strategy.
You know, what applications do we need to optimize in the data center? Which applications
can we simply move that service or that utility to the cloud? Which applications do we need to
refactor so that we can take advantage of the cloud effectively? And which applications are
we building natively in the cloud to meet new needs or to replace functions and capabilities that at one point were in the
data center, but now need to be in the cloud so that they're closer to the applications that are
touching customers most frequently. And I think the larger companies, more established ones,
as you described, they have a lot more technology in their portfolio that does not lend itself to an easy, smooth transition to a cloud
service. And there's a variety of reasons for that. Prioritization, of course, comes in.
The function that application may have for the business. But the biggest limiter I see
really comes down to the data and the ability to move that data from their data center into the cloud effectively,
so it's located near those applications to reduce latency and to take advantage of the
other capabilities in the cloud. So it's hard for a lot of those applications in that portfolio
to necessarily move to the cloud, and there's a lot of work that has to happen to reconcile that.
So you mentioned the dirty word data.
So let's talk about that a little bit, right?
One of the issues, particularly with IT operations
that I'm a little surprised to see is
your existing IT digital footprint
was generating a decent amount of data.
You know, we thought that was huge in those days, you know,
and when they moved to the cloud, they are in shock.
They are in awe that, particularly when you distribute your applications,
when you have containers, even serverless, running all over the place,
the amount of logs, the amount of traces, the amount of distributed traces,
the amount of, you know, other
metrics you create is unbelievable.
First of all, the data is kind of siloed and distributed everywhere.
Which means you need to send the insights, as you were talking about, to the data, rather
than sending the data back in, because it can overwhelm people and the data lake itself.
And two, the volume of data. We're not talking about one time, two times the data.
There are companies I've seen,
that I've had conversations with, where
when you go from your enterprise data footprint to cloud,
you're talking about hundreds, even thousands of times
the alerts, notifications, and information,
the data points that telemetry
produces. How do you see companies handle that? I mean, that's a major issue for a lot of companies.
They are drowning in data right now, right? They absolutely are. I think the key component of
that, and what's interesting to me as a long-time IT professional, is that a lot of the data that we're seeing in this volume and velocity of data
isn't necessarily so distinct and unique. In the past, all those transactions
that we're now able to have full visibility into were hidden within an application. Nobody
ever coded that application to actually present that information out, because nobody thought it would be useful.
When you move to the cloud and everything's an API call and it's a service call or it's a utilization of a serverless function,
these are all distinct transactions that go over the wire.
And with each transaction, there's the request, there's the response, there's the acknowledgement,
there's the log of whether it succeeded or it failed.
There's information around every transaction that we now have the opportunity to inspect, that we have the opportunity to optimize on.
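To make that concrete, here is a minimal Python sketch of the kind of per-transaction telemetry being described, aggregated for error-rate and latency review. The record fields and service names are illustrative assumptions, not any particular product's schema.

```python
from statistics import quantiles

# Illustrative per-transaction records: every API/service call carries a
# request, a response, a status, and timing we can inspect after the fact.
transactions = [
    {"service": "checkout", "status": 200, "duration_ms": 42},
    {"service": "checkout", "status": 500, "duration_ms": 310},
    {"service": "inventory", "status": 200, "duration_ms": 18},
    {"service": "checkout", "status": 200, "duration_ms": 57},
]

def summarize(records, service):
    """Error rate and latency percentiles for one service's transactions."""
    calls = [r for r in records if r["service"] == service]
    errors = sum(1 for r in calls if r["status"] >= 500)
    durations = sorted(r["duration_ms"] for r in calls)
    p95 = quantiles(durations, n=20)[-1] if len(durations) > 1 else durations[0]
    return {"calls": len(calls), "error_rate": errors / len(calls), "p95_ms": p95}

print(summarize(transactions, "checkout"))
```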
We can review that from a cost-savings perspective or review it to optimize performance. And to the point that you're making, the biggest challenge is that we had not in the past
had a discipline in order to consolidate
and utilize that data.
I think that's been the biggest barrier.
And then the following barrier is,
once you've made the decision and said,
hey, this data has value,
we're gonna make an effort to get value from that data.
How do you centralize it?
You know, because it's happening in multiple services in multiple places. And it could be
coming from your on-prem environment and your cloud environment. And multiple clouds too.
And it could be across multiple clouds or multiple services. You know, if you're connecting something
to ServiceNow or Salesforce to an application that you have, they may be on completely different platforms
in completely different parts of the world. And so that transaction framework exists and that data
exists there. People need to now develop a discipline around how you consolidate and utilize
that data. Right. So that's an interesting point, right? It's the first point you made, which a lot of people don't even realize: in the past, when you had an application, as you said, what application information do you have?
Mostly it's the metrics that you collect saying if my application is working or not, if it is working fine, we're good, right? You test from maybe a couple of locations, maybe do a synthetic monitoring,
maybe some real user monitoring, you're done, right?
But now the application, how do you define an application?
It's not like, you know, running in one server
or two servers or maybe three servers
with a middleware and whatnot.
You're talking about a thousand containers,
which are 200 microservices
or 400 microservices running distributed.
That becomes your application.
So it's kind of a common word,
or a dirty word, nowadays, right?
So when you distribute that application,
having to collect all of them into one place,
whether it's centralized or just a place where you can analyze it,
not necessarily centralized.
And then when you have portions of that
moving into a different cloud,
for example, if you use Google
for one microservice among this application, it becomes really complicated. So on a related note, I have a question for you. So one
is about data. The second one is: do you commonly see companies rather move the models and insights
closer to the data? Or do you see that most companies would prefer to move the data to where they are comfortable?
Yeah, I think that's a great question.
The challenge there really comes down to where you need insight and where you can have action, like what context.
I hear people say, oh, I want a single pane of glass, you know, to be able to see everything. And while
at first that seems like that's what people want, I don't think that's actually the desired state.
I think the desired state is that as your context changes or the view that you need to have,
whether it being a macro view or a micro view, what you're really looking for is more like a lens response where you can zoom in and
zoom out as needed. And depending on the service and depending on what you're utilizing, that
information may not reside where your application is, or it may have to correlate with information
that's in another location, particularly as you take a more macro view and you're looking at,
you know, the holistics of a service or your service as it is compared and measured against business KPIs and business metrics.
And when you do that, that changes your scope and may change where you need to centralize or associate that data.
So Josh, a related question. I know I'm going a little bit too deep on this.
There's also a related issue. You guys do that, I know for a fact, but there are other vendors who do it
differently. In other words, particularly when it comes to operational data,
there's a difference between full fidelity data versus sampling. The reason most people
do sampling is because they couldn't handle
the volume of data, which we talked about, right? I'm not even talking about a centralized place;
even at the source of collection, the point of production, they couldn't handle it. So they
tend to kind of sample it. They try to reduce that, aggregate that, and produce insight and tell you
that's all you need, because, you know,
after all, it's not going to change in a matter of seconds or whatever, right? So that's good enough
for you to make a decision. And there is the next school of thought, which you guys do, which is more about:
I'm going to give you full fidelity data, not a small amount, not a sampling, all the information
you have, and all insights based on
information on all full-stack metrics, right? Which one is better, in your view? I mean,
depending on the situation, both could be better. Which one is better? And more importantly,
what do you see large digital enterprises, like, for example, Uber,
with the amount of data they produce, or Airbnb and things like that, doing? Where do
you see that going? Yeah, I'll use a metaphor, if I may, to help with that. And it's a photography
perspective. Within photography, photographers, especially professional ones, want to shoot in
RAW, right? Because they want the full scope of the data that the sensor can provide, and they need that
because they have specific needs and requirements to deliver on a certain product. And so sampling,
or a compressed JPEG format, isn't ideal for that; they're not going to be able to
deliver precisely what they want. And so they will shoot RAW, and they will take the hit on, you
know, the increased time it takes to process the image or the increased data to store the image.
But they do that because they need that flexibility.
And I'm not going to say that sampling is bad because most of what I have is JPEG and it's okay.
But what is necessary is that when you need that full fidelity, and when it's important to do that, you know, perhaps you're doing something with a green screen or you're going to do heavy editing,
or, in the case of the data that we have, where you really need to be looking into the
transactions and understanding what's happening, because a sample is going to lose insight
that you may need.
Or if you're training a learning model, like a machine learning model, you don't necessarily
want to use sampling there. You want to make sure that it's getting full fidelity, it can get it
processed so it can learn properly and be able to see that correlation between a sample set
and a full fidelity set. And so I'd look at it that way. Sampling can be just fine for a lot
of things. What you really, what organizations really are looking for is that flexibility and the opportunity and the option, right? Not having the option, I think, is the biggest liability versus making a decision one way or another. You need to have that flexibility. One thing I've noticed, though, is that a lot of these tools tend to be really siloed. Essentially, you know, it's a tool for like one very, very specific part of IT infrastructure. I think that that's something
that really is a real shortcoming with a lot of these tools. And I'm wondering, you know,
there's kind of a perverse thing going on here where we're getting these much, much more diverse
environments where we've got Kubernetes. And like you were saying, we're implementing containers and we're spreading data across all sorts of locations and
everything. And yet, in a way, you could look at Kubernetes as a way to organize a lot of this
infrastructure as well. So it's not just a way to spread out the, you know, compute. It's also a way to organize compute. And I'm wondering
if maybe that's going to help us because one of the challenges for a lot of AI applications is
that, you know, it has to deal with a lot of sort of surprise or, you know, out-of-the-ordinary data.
With, you know, a Kubernetes-based cloud infrastructure,
the data is almost self-describing.
And so we can end up with a situation where we have
basically better organized data,
even though it's more diverse and more widespread.
I don't know if you're following me there,
but it just seems interesting to me
that Kubernetes is both exploding our data centers
and also organizing them.
Yeah, I think that's a fair perspective.
I will counter slightly first, and then I'll dive in to why I agree with you,
in that as someone who's automated data centers quite a bit over the years,
especially pre-cloud, with VMware in particular,
they did a really great job, in my opinion,
of making information known about the infrastructure and where your applications resided, so that you could not only be informed about what's going on, but automate against that.
OpenStack attempted to open that up even more.
There were just other complexities and challenges there that I think have caused it to stall and falter. With Kubernetes in particular, though, while I won't
necessarily agree that it specifically makes that visibility better, I do think it makes that
capacity to deliver a consistent service both on-premises and in the cloud really valuable.
Because now, if you're going to refactor your application to be supported
to run on Kubernetes, in most cases, you'll be able to port or migrate that between different
types of Kubernetes instances. And I think that's critically important. And then in that, your
monitoring can remain fairly consistent. The view of the world, your automation, a lot of those
things won't transition dramatically.
I mean, you may have to make transitions depending on whether you're running OpenShift,
whether you're running on-prem, whether you're running as a public cloud service.
But providing more consistency in how the application can be managed if it's running
on Kubernetes, I think, is where the real key is.
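As a rough sketch of how that consistency shows up in practice, Kubernetes labels make workload telemetry largely self-describing, so the same query can run against an on-prem or a managed cloud cluster. This assumes the official kubernetes Python client, a reachable kubeconfig, and an app/team labeling convention of our own choosing.

```python
# Sketch: Kubernetes labels make workload telemetry largely self-describing,
# so the same query works on-prem or in a managed cloud cluster.
# Assumes the official `kubernetes` Python client and a kubeconfig in place.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

# Ask for every pod that declares itself part of the "checkout" app; the
# label selector is a convention we assume, not something Kubernetes mandates.
pods = v1.list_pod_for_all_namespaces(label_selector="app=checkout")
for pod in pods.items:
    print(pod.metadata.namespace, pod.metadata.name,
          (pod.metadata.labels or {}).get("team", "unowned"), pod.status.phase)
```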
So I'd also like to take the discussion, I think, a little bit more toward AI. So we had talked
about AIOps and how artificial intelligence can help to kind of cut through this mountain of data
and keep you, as Andy said, from drowning in data. What specific ways do you think AI is going to come to operational data and logs and things
like that? Yeah, I like the drowning in data. I think we just all need to evolve gills. But short
of evolving gills, I really think that the key in AI technologies for operations, I use the analogy more of like an Iron Man suit.
Our IT operations professionals, they understand their environments. They have a strong knowledge
base that they share amongst themselves and just from their learning. And what they really need is
some assistance in processing that volume and velocity that Andy pointed out and making it more effective
so that they can make actionable decisions
based on the information at hand.
And there's a few key areas that we see AI jumping in
and helping with IT operations.
A lot of that has to do with being able
to identify event correlation,
things that are happening that weren't anticipated,
maybe a series of events that come around,
anomaly detection, like we don't anticipate seeing this.
So let's look into it.
Being able to aggregate events.
So we've seen this event happen 200 times.
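A minimal sketch of the anomaly-detection piece: flag metric points that fall well outside a rolling baseline. The window size and three-sigma threshold are illustrative choices, not settings from any vendor's product.

```python
import statistics

def rolling_anomalies(values, window=30, threshold=3.0):
    """Flag points more than `threshold` standard deviations away from the
    mean of the preceding `window` points: 'we don't anticipate seeing this'."""
    flagged = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mean = statistics.fmean(baseline)
        stdev = statistics.pstdev(baseline) or 1e-9  # avoid divide-by-zero on flat data
        if abs(values[i] - mean) / stdev > threshold:
            flagged.append((i, values[i]))
    return flagged

# Steady request latency with one unexpected spike at the end.
latencies = [20 + (i % 5) for i in range(60)] + [400]
print(rolling_anomalies(latencies))
```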
I've long complained that any vendor, and I have said this about Splunk in the past,
that allows you to create an alert and open up a ticket in ServiceNow based on an alert
needs to also allow you to automate that so that you don't end up with a thousand of the same
alerts. And so things like that are, I think, the real frontline with the data analytics stuff and
being able to augment the capability and capacity
of our human IT operators.
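And to illustrate the thousand-identical-alerts complaint, here is a hedged sketch of alert deduplication: fingerprint each alert and suppress repeats within a time window before anything opens a ticket. The open_ticket function is a stand-in for a real ITSM integration, not an actual ServiceNow or Splunk API.

```python
import time

SUPPRESSION_WINDOW_S = 15 * 60  # collapse repeats of the same alert for 15 minutes
_last_seen = {}  # fingerprint -> (first_seen, count)

def open_ticket(alert):
    # Placeholder for a real ITSM integration (e.g., a ServiceNow call).
    print(f"TICKET: {alert['host']} {alert['check']} -> {alert['message']}")

def handle_alert(alert, now=None):
    """Open one ticket per (host, check) per window; count the rest."""
    now = now or time.time()
    fp = (alert["host"], alert["check"])
    first_seen, count = _last_seen.get(fp, (None, 0))
    if first_seen is not None and now - first_seen < SUPPRESSION_WINDOW_S:
        _last_seen[fp] = (first_seen, count + 1)  # aggregate: "seen 200 times"
        return
    _last_seen[fp] = (now, 1)
    open_ticket(alert)

for _ in range(1000):  # a thousand identical alerts -> exactly one ticket
    handle_alert({"host": "db01", "check": "disk_full", "message": "volume at 95%"})
```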
Wow, so you dropped all of the AI use cases quickly
in there: event correlation, noise reduction, you know.
And so one of the things also,
and this is where predictive analytics somewhat failed, is a
combination of anomaly detection and trending and seasonality.
AI can go through all of the data, a combination of metrics, both infrastructure as well
as container and application-level metrics, the whole nine yards.
I've seen some tools fairly accurately predict
what's going to happen within the next,
whatever the timeframe may be.
That could be useful in capacity planning.
You can, you know, kick up a bigger server, or some combination
thereof.
So basically I call that a Jedi trick:
seeing things before they happen.
Do you see, I see a lot of digital enterprises
that have that, by the way,
if you're a cloud native, cloud born,
it's a mandated thing for them.
They do it, right?
But the normal enterprise is still
kind of struggling with that.
So they are throwing more bodies at it,
especially with the pandemic and stuff.
Do you see that the tables have turned now
or are they going the other
route or they're still throwing more warm bodies to solve those issues? It's definitely a little
column A and a little column B. From our perspective, we absolutely have products that
when applied in your data center to monitor services and look at them; you know, we have customers who
are getting 30-minute-plus lead times on potential incidents. And it's also one of those things where,
you know, there's the paradox of, well, we've predicted that an incident is going to come;
if you take care of it, was that incident really going to happen? I appreciate that. So at the very
least, what I would argue is that we now have systems in place that can solve for the most critical problem of like drive space filling up, which still happens.
In fact, we just had an outage last week with Google dealing with a volume that filled up, and it nixed the authentication service for a while. I mean, this stuff still happens and it's just a well-known, easily understood problem
that we can apply technology
to look at that and say,
based on current consumption rate,
here is your level of threat
of this thing potentially
causing a service interruption.
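The drive-space example lends itself to a small worked sketch: fit the recent consumption rate and report roughly how long until the volume fills. This is a naive linear extrapolation for illustration; a real system would also account for seasonality and noise.

```python
# Naive sketch: given recent (hour, used_gb) samples, estimate hours until full.
samples = [(0, 410.0), (6, 418.5), (12, 427.0), (18, 436.0), (24, 444.5)]
capacity_gb = 500.0

n = len(samples)
mean_t = sum(t for t, _ in samples) / n
mean_u = sum(u for _, u in samples) / n
rate = sum((t - mean_t) * (u - mean_u) for t, u in samples) / \
       sum((t - mean_t) ** 2 for t, _ in samples)   # least-squares GB per hour

latest_t, latest_u = samples[-1]
hours_left = (capacity_gb - latest_u) / rate if rate > 0 else float("inf")
print(f"Filling at {rate:.2f} GB/h; roughly {hours_left:.0f} hours until full.")
```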
And one thing I also want to point out
that we haven't dove into
is not just the intelligence provided
in identifying a potential issue, but how you inform and notify responders to do something about it.
We've fallen into this trap of saying, well, here's my defined escalation plan and here are the people.
But it's not always the most intelligent way of getting to a solution. Being able to identify the right responder for a service, be able to provide the information on why the incident is being called or why a response is required, provide that information to the responder, make it easy for them to respond or to move that to the next
logical person, and then provide suggestions on how to automate the remediation of that in the
future. I mean, that's the real, you know, I mentioned the Iron Man suit, like that's the real
promise. That's what we're driving towards, you know, with respect to AIOps. It's not just finding problems.
Trust me, we can find problems.
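As a hedged sketch of the "right responder" idea: map services to owners and build a notification that says why the page fired, what the evidence is, and what a candidate remediation might be. The ownership table and payload fields here are assumptions for illustration, not how any specific product models this.

```python
# Sketch: route an incident to the service owner rather than a generic
# level-1 queue. The ownership map and payload fields are illustrative.
SERVICE_OWNERS = {
    "order-taking": {"primary": "pat@example.com", "fallback": "sre-oncall@example.com"},
    "auth":         {"primary": "sam@example.com", "fallback": "sre-oncall@example.com"},
}

def route_incident(service, summary, evidence, suggested_fix=None):
    owner = SERVICE_OWNERS.get(service, {"primary": "sre-oncall@example.com",
                                         "fallback": "sre-oncall@example.com"})
    return {
        "notify": owner["primary"],
        "escalate_to": owner["fallback"],       # the "next logical person"
        "service": service,
        "why": summary,                         # why a response is required
        "evidence": evidence,                   # the data behind the call
        "suggested_remediation": suggested_fix, # candidate for future automation
    }

page = route_incident("order-taking",
                      "checkout error rate above 5% for 10 minutes",
                      {"error_rate": 0.07, "p95_ms": 900},
                      suggested_fix="restart checkout pods / roll back last deploy")
print(page)
```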
A related note on that would be, I did a piece on this recently, a video.
In the past, when incidents were to happen, it had to be handled in a structured way: level one,
level two, level three, and then the SME and all that. Now, if AIOps can detect with
fairly accurate information, 99-plus, 99.99% accuracy, what's the need to involve level one, two, and
three? Just save the time, send it to the person who you know is exactly the one who's going to fix the problem,
you know, and avoid all this, right? So on that I agree with you; a lot of enterprises I see do that. I have a related question, though.
So, yes, ops people, cost center,
maintain application.
If it goes down, yell at them.
Used to happen.
Now with the distributed thing,
DevOps became a big thing.
Developers and ops,
coalesce and, you know, work together,
supposedly to make things better.
But now I'm seeing a new movement called BizDevOps.
The business people work with the dev people,
prioritizing things, what needs to be done,
and then it goes to the ops.
So that's a big movement now.
And then to go to your level of notification and escalation,
there are a few enterprises, customers
that I had a conversation with.
I don't want to name names,
but they have set up notifications. If my specific system, the order-taking system, goes down,
my business guy gets notified. He wakes up at, like, 12 at night and then he comes down to the
ops guy saying, what's the problem? I don't have a server. You know what? I don't care. You
spend a thousand dollars on it. I need the system up right now, because he is the guy who's going to be responsible to answer to the executives. How do you think that's gaining
traction? Is it just an outlier, or are most people doing it that way?
Oh, it's absolutely not an outlier. It is a desired state and it's becoming a critical
desired state. Within our organization, we focus on looking at it from a service level,
right? So if you look at a bank or a retailer delivering out a service, you know, the customer
satisfaction, customer expectation is tied to the health and availability of that service.
What we now see are more business leaders empowering their service owners and their product leaders to invest in
tools and capabilities so that they can maintain that visibility and understand the impact of
business KPIs on these new digital services. And I think that's been a dramatic shift. I mean,
the last time we saw something similar, the two phases would be obviously when the internet became
an e-commerce
platform and people started monetizing on the internet, there was a desire to understand
customer cart behavior. Amazon really nailed this also with recommendations and things of that
nature. Then we saw with mobile platforms, the same type of thing. Now it's with digital services
and how those services connect to multiple
connection points to customers who then have a big impact on the business. And then having that
quick visibility and awareness on that service availability or that service performance,
aligning with the business performance and the business capabilities.
I know that we could probably continue talking about this all day long, but we do have to wrap the episode. But before we go, there's one more thing that we're doing here in season two that I want to make sure I have a chance to put to you, Josh, since you're one of those people that I've known in the industry for a while: a few kind of fun follow-up questions for you on AI. Just say
whatever comes to your mind as an answer to these things. So here we go. And for the audience,
Josh is not prepared for these. He's just being taken by surprise. All right. Question number one,
Josh, are there any jobs that will be completely eliminated by AI in the next five years? In the next five years?
I am disinclined to believe that we will have jobs completely eliminated. I believe
we'll see reductions of jobs, particularly, you know, things around insurance claims,
around people checking to see
if my car warranty needs to be renewed. You know, I think a lot of the spray and pray
businesses will start seeing that. Yeah, I think we'll see a reduction in workforce
as a result, but I don't think we'll see jobs wholly eliminated.
All right. Question number two, is machine
learning a product or a feature? Machine learning, if it's being delivered as a service that can be
consumed by other people, then I would say that that's a product. I really think it's a feature.
I have a few different views on that, but primarily I think it's because machine learning is really the result of applying advanced algorithms that we've known for decades or hundreds of years to new problems.
I think it's going to be more with respect to the algorithm. The algorithms are more of a product than, I think, you know, the machine learning itself is. All right. And one more, this is one of the favorites of the panel because so many
people have disagreed on it. When will we see a full self-driving car that can go anywhere, anytime?
You know, five years ago I would have said there's no chance. And then we
started landing first-stage booster rockets, so I think anything's possible. I still think it's going to be a long way out. I still think that with the infrastructure that supports vehicles, whether you're looking at the highway system, metropolitan areas, or dirt roads, the level of complexity, and then adding, you know, people
and obstacles, I don't expect to see one other than in controlled environments anytime in the
near future. All right, well, thank you so much, Josh, and Andy, thank you so much for this conversation.
Again, I know that you guys could continue this discussion quite a lot, and if folks do want to
continue this discussion with you, Josh, where can they connect with you?
The easiest place to find me is on Twitter at Josh underscore Atwell.
Or you can find me at a variety of different events where I speak or places where I'm writing articles, a lot of them on AIOps topics.
And you can find me on Twitter at Andy Thurai,
or you can find me at thefieldcto.com.
That's thefieldcto.com.
And I'm Stephen Foskett.
You can find me on Twitter at sfoskett.
You can find my writing at gestaltit.com
and in your favorite search engine.
And of course, you can find me here every week
at Utilizing AI.
So thank you,
everyone, for listening to the Utilizing AI podcast. If you enjoyed this discussion,
please subscribe, rate, and review the show. And please do connect with our guests and hosts online and continue the conversation there. This podcast was brought to you by gestaltit.com,
your home for IT coverage from across the enterprise. For show notes and more episodes,
please go to utilizing dash AI dot com or find us on Twitter at utilizing underscore AI.
Thanks, and we'll see you next time.