Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 04: Is Enterprise IT Operations Really Ready for AI with @RayLucchesi

Starting point is 00:00:00 Welcome to Utilizing AI, a podcast about enterprise adoption of machine learning and artificial intelligence technologies. I'm your host, Stephen Foskett from Gestalt IT. You can find me writing at gestaltit.com and you can find me on Twitter at @SFoskett. And I'm Ray Lucchesi. I'm a Greybeards on Storage podcaster. I'm also a Ray on Storage blogger, and I'm at silvertonconsulting.com. And my Twitter ID is @RayLucchesi. Thanks, Ray. Honestly, it's really great to have you here because I think that in a way, you embody the topic of utilizing AI more than many of our guests because you've been involved in enterprise IT for a long time, and you've seen a lot of different technologies come and go

Starting point is 00:00:47 and yet, you know, one of the things that I always enjoy about your blog and about your podcast and just talking to you generally is that you don't seem stuck in the past in terms of, you know, this is how it should be or this is how it was. It always seems like you have an open mind about new technologies. And, yeah, you know, you've been writing a lot about AI. I try to. Like, you know, I've been following AI for now, gosh, 30, 40 plus years. I've been really interested in it off and on during the AI winters and AI summers and all that stuff. It's really emerged over the last decade or so as a solution in, uh, pattern matching and that sort of thing that it never

Starting point is 00:01:26 had before. So it's kind of exciting from my perspective to see what it can and can't do. And I try to keep adept or try to keep up with the technology as it emerges. Exactly. And, you know, also, I think that by being involved in enterprise tech for a long time, you've seen how it really is. And one of my concerns, honestly, about artificial intelligence and machine learning is that a lot of the pitches don't seem all that practical. You know, a lot of the things that people are saying would require massive business model changes, you know, basically transform your whole way of doing business and use AI to predict, you know, future growth and all this. I mean, it's really business-y kinds of things. But when we look at the IT space, my question is, do you see this technology coming to IT operations as well? Or is it really a line of business technology? I see it both. I mean, to a large extent, from a line of business perspective,

Starting point is 00:02:26 you know, depending on the application, depending on the environment that they have, if there's something that might benefit from more accurate pattern recognition and that sort of stuff, AI could be a significant benefit. But, you know, Google and other organizations have shown that they can utilize AI to reduce power consumption, to reduce air cooling, reduce those sorts of things. So there's an operational aspect to AI as well. It's harder, I would say, from an operational perspective, IT operations. It's a question of matching. It's a pattern matching machine today. If you can define something that requires better pattern matching in IT operations, then this would be a natural for it. Yeah. And that to me is the natural first application of this. And I think that's really where we're seeing it. Things like processing log files. I think it's just a total slam dunk, right?

Starting point is 00:03:26 Yeah, because you've got, you know, you're looking for patterns out of logs to see what's going bad or what's in the process of going bad. You know, if you're doing some preventive maintenance kinds of things, if you want to try to, let's say you've got a thousand servers out there with, you know, 10,000 disks or something like that, you can utilize AI to try to ascertain which one of those disks is going bad and get rid of it or swap it out before it happens, before it fails and stuff like that.

Starting point is 00:03:52 I think those are natural types of situations that you could use this for. But again, it's what's a reasonable pattern recognition application of the technology. If you can come up with something like that and you've got the data, I think AI is a natural solution. AI deep learning today, right? Yeah, and I guess it depends on what you mean by AI, right? Right, right.
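
The disk example above comes down to spotting the one device whose telemetry stands out from the rest of the fleet. As a rough illustration only, here is a minimal Python sketch that swaps in a plain z-score outlier check for the deep learning model discussed in the conversation. The metric (a per-disk recent error count), the disk IDs, and the threshold of 3.0 are all assumptions made for the example, not anything InfoSight or another product is known to use.

```python
from statistics import mean, stdev

def flag_suspect_disks(error_counts, threshold=3.0):
    """Return IDs of disks whose error count is a high outlier versus the fleet.

    error_counts: dict mapping disk ID -> recent error count (hypothetical metric).
    threshold: z-score cutoff; 3.0 is an assumed starting point, not a tuned value.
    """
    values = list(error_counts.values())
    if len(values) < 2:
        return []  # not enough data to estimate spread
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # every disk looks identical, nothing stands out
    return sorted(
        disk for disk, count in error_counts.items()
        if (count - mu) / sigma > threshold
    )

# Hypothetical fleet: 10,000 disks with a low baseline error count,
# plus one disk reporting far more errors than its peers.
fleet = {f"disk{i:05d}": 2 for i in range(10_000)}
fleet["disk00042"] = 250

suspects = flag_suspect_disks(fleet)
print(suspects)
```

A real preventive maintenance model would learn from labeled failure history across many metrics; the point here is only that the input is per-device telemetry and the output is a short list of devices to swap before they fail.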

Starting point is 00:04:15 But let's say deep learning and, you know, machine learning applications. Certainly in storage, a lot of companies are using or claiming to use AI. So let's talk about that. You know, storage management, you know, you got things like the InfoSight. HPE's InfoSight, and there's a number of other generations of that across a number of other vendors. I think so. If you look at AI, if you look at InfoSight to a large extent, it cut its teeth on preventive maintenance of the solutions that it has. So it's trying to identify problems before they occur and have, you know, some activity or some action that a

Starting point is 00:04:53 customer can take or service can take to make sure it doesn't affect their operations. Again, it's pure pattern recognition. So they're getting massive amounts of data from every one of these systems out in the field. They've got this database out there. They can see when a problem actually does occur. And then they can try to feed that sort of pattern information into a deep learning model and go forward with it. And, you know, the challenge with AI is you have to continue to train it. It's not something you can train once and leave. You have to continue to train it.

Starting point is 00:05:26 So you're taking all this data in. Every time it sees something new, you're going to have to feed that back into it. And I think that that's one of those things that presents a lot of challenges on sort of an operational basis, but also on a technical basis. Because if you're going to continue to train it, then that means that you need hardware that is capable of doing the training. And I know that you've looked into that, like what it takes. Yeah, yeah, absolutely. It's, you know, in this recent post I did on machine learning performance, and, you know, one of the top players there was NVIDIA. They had their latest DGX A100 solution there.

Starting point is 00:06:05 They had over 14, 1,400 GPUs running this MiniGo algorithm to learn how to play Go to 50 percent effectivity. Not everybody's going to have that. Um, it can take a lot of time and effort to do this well, and the hardware requirements can be significant. It depends on the data, too. I mean, God, you know, I did a mini AI algorithm. I did my blog post titles. I'm trying to figure out which one was more popular and why they're more popular and have some way of predicting popularity of blog posts based on titles. And I can do it in a single GPU without too much of a problem.

Starting point is 00:06:41 I've got, you know, I've got a couple more I can fire up to do other stuff. But, you know, I've got less than 1,000 blog posts, you know, so there's less than 1,000 titles, maybe 10, 12 words each. It's not a lot of data. You know, customer InfoSight, every solution out there, every Nimble Storage system out there is running literally thousands of data items every minute, if not every second, and they're sending that back to HPE and they're feeding on, you know, it's a massive repository of data. It's a perfect solution for AI, but, you know, providing the hardware to run all that stuff through and try to, is significant. Yeah, absolutely. Do you think that in the future,

Starting point is 00:07:27 do you think that enterprise IT infrastructure, I mean, do you think the data center is gonna have like an AI corner that's sitting there running these applications? I think that, absolutely. I mean, today, you know, you're starting to see the introduction of Dell servers that are focused on GPUs, VxRail has a GPU configuration.

Starting point is 00:07:47 There are a number of solutions out there that are focused on, you know, providing computational technologies there that can be deployed for either video displays or for AI machine learning. And once you've got that, so let's say, I don't know, you have a thousand person organization running VDI, you probably got some servers out there with GPUs in there. During the day, devoted to VDI sorts of solutions. At night, if that even exists for you, because you might be worldwide, fire up some AI stuff. Well, that's an interesting thought. I was actually thinking of that when you were talking about the GPUs, because a lot of companies do currently have, you know, a lot of GPUs for VDI applications. And since it's a similar technology, really. It is exactly the same technology. I mean,

Starting point is 00:08:35 you know, like I said, we were talking before, I took my crypto mine, which was three GPUs, I converted it to an AI workstation, because the, you know, it just wasn't there anymore. Yeah. Yeah. So do you think though, that this is going to be something, I mean, you talked about the 24/7. Is this going to outgrow the VDI footprint? Is this going to be something where you're going to have a whole other world of enterprise tech, you know, just a whole other vertical that's AI? There are a number of hardware only solutions. Cerebras and Graphcore and those kinds of guys

Starting point is 00:09:14 are creating special processing units strictly for neural network processing or, you know, deep learning kinds of things. Those things exist. I mean, if you look at what NVIDIA did with their A100, it's focused to a large extent on machine learning workloads. I'm sure it can do GPU kinds of stuff, but it's been touted as an AI workstation environment.

Starting point is 00:09:39 You know, Google has their TPU. There's special hardware that's available. It's very costly. In my mind, the Cerebras thing is the size of a 16-inch wafer or something like that. It's got 40,000 cores on the thing. Okay, that can do that, but it's

Starting point is 00:09:55 a specialized solution. A lot depends. Absolutely. A lot depends on the amount of data you're working with and what you're trying to do. If it's a one-off or a two-off thing, use the GPUs you've got in-house and have at it. If it's going to be something you're going to be doing thousands of data items from every solution you have in the field every second, then you need something a little bit more beefy to work with this stuff. Well, it's interesting that you say that because you and I were both at the Intel Xeon Scalable launch, and I don't know if you remember, I mean, one of the things that they talked about was basically deep learning, as a deep learning platform. Uh, they

Starting point is 00:10:36 have, Intel is honestly, well, my feeling is that they went in the wrong direction initially by trying to make CPUs with just massive size, you know, vector calculators, you know, AVX-512. And then they totally like went the other way and they realized, oh wait, what do you mean? You only want 8-bit? We can do 8-bit. And suddenly, you know, they're talking about, you know, the new, new, which is, you know, low resolution. They have their own deep learning library

Starting point is 00:11:05 that they've supplied, which is lower resolution than 32 and 64 bit floating point. And can it do the job? I think it can. It doesn't seem like 64 bit resolution is required for most AI opportunities. And 16 is probably adequate and eight might work. And in fact, this last blog post

Starting point is 00:11:24 I wrote on machine learning performance, this was solutions that were learning, um, you know, MiniGo, how to play Go and stuff like that. And an Intel Xeon system, Copper Lake six or something like that, came in like number six or seven on the list. 14,000, not 1,400, A100 GPUs was first. These guys, so that was like, it took 20 seconds to do this. And there was eight Copper Lake servers with maybe four CPUs each or cores each

Starting point is 00:11:55 or something like that. And it took like 500 seconds. So you can do it. It's not impossible. But that was their pitch, and I remember at the time we were all kind of like scratching our heads, like, are you kidding me? You can't compete with GPUs. But their pitch was, um, Xeon may not be able to compete with GPUs, but Xeon can compete with no GPUs. In other words, if you're not gonna buy GPUs, yeah, then the Xeon can take on a lot of that workload. You know, you can kind of

Starting point is 00:12:26 repurpose your virtual infrastructure or your cloud, you know, your on-premises cloud, to do the deep learning calculations. And then you don't have to buy those GPUs. Yeah. And to me, that was like, oh, I get what you guys are saying. You're not trying to say that Xeon is going to beat the A100. You're trying to say that not having the A100 is a viable alternative. Yeah, yeah. I mean, you have to use their library. You have to do some conversion activity of your model. But they claim it's all TensorFlow compatible.
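
The 8-bit point and the "conversion activity" mentioned above can be illustrated with a small Python sketch of symmetric linear quantization, which maps floating point weights onto integers in the range -127 to 127 with a single scale factor. This is a generic textbook scheme shown under assumed toy weights; it is not Intel's library or its actual conversion procedure, which the conversation does not describe.

```python
def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127].

    Returns the integer values and the scale needed to map them back.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Map quantized integers back to approximate floating point values."""
    return [q * scale for q in quantized]

# Hypothetical weights from one layer of a trained model.
weights = [0.5, -1.27, 0.031, 1.0]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer step bounds the error at half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now fits in one byte instead of four or eight, at the cost of a rounding error no larger than half the scale. Whether that loss is acceptable depends on the model, which matches the hedged "16 is probably adequate and eight might work" assessment in the discussion.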

Starting point is 00:12:57 And if that's the case, then you can do it. And honestly, I think that, like you said, I mean, as well, maybe it's not as fast, but maybe that's a way that companies might not need to have. Because I'm just imagining this, I mean, like my background in data center comes from when there used to be lots of verticals in the data center. Like there was the telecom rack. There was the, you know, the finance rack. There was the HR rack. There was the, you know, manufacturing rack, and all these kind of different things, everybody had their own

Starting point is 00:13:29 vertically integrated stack of equipment. And then we kind of virtualized and commoditized the data center with VMware. And everybody could share this, or more applications could share the same equipment. But I'm just thinking like, okay, well, if it needs GPUs, then that's a specialized piece of hardware. That means we're going to not be able to commoditize that. Yeah. I think, well, I mean, you know, obviously there are more than just NVIDIA in the GPU game, and that's a solution. A lot of these environments do VDI, have GPUs, they're just not maybe using them all day long, depending on the size of the organization, that sort of thing. So that's an option. And if you want, you can use, you know, Xeon CPUs to do it with the proper libraries and all that stuff. It's another step. It's not a big deal from my perspective to make that conversion. I haven't tried that

Starting point is 00:14:19 logic, but it's not saying I couldn't do it. Well, then the other thing though, is that, you know, Intel, yeah, they maybe got caught a little bit on the back foot when it came to, you know, the GPU dominance of deep learning, but they're no fools and they've been working really hard. And in fact, just in August they announced a whole range of, you know, sort of GPGPU platforms that they can now produce. They'll be rolling those out. Like I said, they've got new instructions in the Xeon. I do think that this is something that maybe is going to be part of the future data center.

Starting point is 00:15:00 Yeah, I think so too. I mean, it's, it's, it's surprising to me how many things that we do on a daily basis that pattern matching can help. It's just, it's just, and that's what this was, that's what this deep learning stuff does for us. It's just this fine-tuned pattern matching that, that you can fire up without really that much effort. I think the biggest challenge is getting the data right, quite frankly. Well, exactly. And let's talk about that. So, I mean, you know, you're an old school storage guy like me.

Starting point is 00:15:31 This is a really weird data set, right? It's huge and it needs high performance occasionally at various spots. And it needs to be able to like dump a massive amount of data in here and then take that out and dump a different set of data in there. I mean, it's just a really weird storage application. So during the training opportunity, you're feeding all this data into this model. The model is making computational changes to its network, and then you're doing it again, and you're probably feeding it in anywhere from 20 to 100 times the same data, randomized.
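
The access pattern described above, the same files read many times over in a different order on each pass, can be sketched in a few lines of Python. This is an illustrative model of the read order only; the file names, the count of 1,000 files, and the choice of 20 passes (the low end of the 20 to 100 range mentioned) are assumptions for the example.

```python
import random

def epoch_read_order(files, epochs, seed=0):
    """Yield (epoch, file) pairs: every file once per epoch, reshuffled each epoch."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    for epoch in range(epochs):
        order = list(files)
        rng.shuffle(order)  # new random order for every pass over the data
        for name in order:
            yield epoch, name

# Hypothetical training set of 1,000 small image files.
files = [f"img_{i:04d}.jpg" for i in range(1000)]

reads = list(epoch_read_order(files, epochs=20))
total_reads = len(reads)  # 20 epochs x 1,000 files = 20,000 reads
print(total_reads)
```

From the storage system's point of view, this is why the workload looks neither purely sequential nor like OLTP: every file is read in full, repeatedly, but never in a predictable order.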

Starting point is 00:16:08 And, you know, to some extent, it's sequential, but it's not. You know, it's small picture files or small text images or, in the case of InfoSight, it's probably, you know, large status dumps from a storage system or something like that. And yeah, it is a lot of data, but it's not like the old days where we're processing, we're updating, you know, this OLTP paradigm. No, that's not it. This is more like, give me a bunch of small files as fast as you possibly can while I'm training. Then I don't want to talk to you again until the next time we do a training run, you know, to a large extent, because now it's all inferencing. And yeah, maybe I'll record what happened, my inferencing activities and stuff, but. It is, it's just such a weird application because we've, I don't, is there any application in the data center that you can think

Starting point is 00:16:58 of that we, that has been like this? I mean, I guess, you know, maybe a data lake? Yeah, yeah, big data types of things, data, you know, batch processing of old, not really, just to a large extent with input and output kinds of stuff. But this is input many times, right, over and over again, randomized, and then you're actually, the output is this model, this set of weights associated with a neural net that you're creating. Well, one of the, one of the things too, is that, um, there's been a big demand for sort of snapshotting the data set along with the, the model, because basically, you know, you're, you're doing this training, and then you need to be able to save that data set so that you can do the same

Starting point is 00:17:41 training with the same data set again, if you should change the model, or maybe you just need to go back and look. Right. You want to snapshot the model. You have to save the model. You have to save the data with the model. So you can, if you have a compliance problem with the model, it's not properly doing something or it's biased or something like that, you want to be able to go back to the data, see why it got to that point. And if you're going to do any inferencing with the model, you have to save the model in any fashion. You have to save it in some form

Starting point is 00:18:06 that you can run an inferencing engine on. TensorFlow has got this thing called TFLite, Google, obviously. And effectively what they're doing is you can take a model trained on TensorFlow and convert it into a TFLite model that you can run on a Raspberry Pi or an Android or Arduino if you want something like that. So this is for IoT applications and stuff. But sooner or later, you have to take the model, that inferencing engine, and deploy it someplace.
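
The earlier point about saving the data with the model, so that you can go back and see why the model got to where it did, can be sketched as storing a fingerprint of the training set alongside the weights. This is a minimal Python illustration under assumed names: the `dataset_fingerprint` and `snapshot` helpers, the JSON layout, and the sample records are all hypothetical, and a real system would more likely rely on storage-level snapshots of the full data set.

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Hash the training records in a stable order to identify the exact data set."""
    digest = hashlib.sha256()
    for record in sorted(records):
        digest.update(record.encode("utf-8"))
        digest.update(b"\n")  # separator so adjacent records cannot blur together
    return digest.hexdigest()

def snapshot(model_weights, records):
    """Bundle model weights with a fingerprint of the data that produced them."""
    return json.dumps(
        {
            "weights": model_weights,
            "dataset_sha256": dataset_fingerprint(records),
            "record_count": len(records),
        },
        sort_keys=True,
    )

# Hypothetical training records and weights.
records = ["title one", "title two", "title three"]
saved = snapshot([0.12, -0.4, 0.9], records)

# Later: confirm that a candidate data set is the one the model was trained on.
restored = json.loads(saved)
same_data = restored["dataset_sha256"] == dataset_fingerprint(list(reversed(records)))
print(same_data)
```

Because the records are sorted before hashing, the check is independent of the order in which the data happens to be read back, while any added, removed, or altered record changes the fingerprint.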

Starting point is 00:18:38 Yeah. Yeah. And that's, I think what we're going to start seeing is as this technology makes more waves, you know, we're going to start seeing special purpose, you know, special purpose processing units. I mean, Apple is building those into their A-series CPU. I definitely think that we're going to see that. But, you know, that doesn't need to be heavy, like you're pointing out. Like it can run, I mean, absolutely. I've actually run stuff like that for Home Assistant on the Raspberry Pi to do, you know, like pattern

Starting point is 00:19:09 recognition, like, is it a bicycle? Is it a man? You know, that kind of thing. And it works. Even on that little low powered ARM CPU, because the model was already created, right? Yeah, the hard lifting is all done during the model training and stuff like that. After that, it's a pretty straightforward process. Yeah. Yeah. So, you know, we're starting to hear as well, you know, a lot of the storage companies talking about how they're going to leverage their products in this AI space. I know that, you know, just off the top of my head, I've heard that from Pure Storage. I've heard that from, you know, VMware with, you know, vSAN and stuff. Yeah, with vSAN. You know, I certainly have heard that from Dell. Do you buy it? Do you buy it that we need special purpose storage systems? I don't think you need special purpose storage systems. I mean, these are, you know, to a large extent, these are file systems. This is file system based data and,

Starting point is 00:20:02 you know, Pure FlashBlade, sure. Would you use that as a, you know, as an input or, you know, someplace to gather the data and supply it to an AI engine? Sure, why not? You could use Qumulo, you could use S3, you could use just about anything that can support objects or files. You know, the big question, a lot of the storage stuff that I've written in the past about AI is, you know, you want to keep that expensive hardware, GPU, Cerebras, Graphcore, whatever, busy. In order to do that, you have to feed it data all the time. Those sorts of things might require, you know, more enterprise grade types of storage services and stuff like that. But can you do it without? I think so.

Starting point is 00:20:45 Well, I feel like some of these companies kind of lucked out. For the longest time, we've had these object stores. They were kind of like a solution looking for a problem. And I remember when Pure introduced FlashBlade, a lot of us were scratching our heads like, okay, that's awesome, but what's it for? And then suddenly AI comes along and is like, I need all your file and object. And Pure is like, ooh, ooh, ooh, right here, right here, we got that. Yeah, yeah. It was an interesting play for them, and it certainly made a good fit from that perspective. And are there other solutions out there that could do this sort of thing?

Starting point is 00:21:16 Absolutely. It's obviously a unique architecture, and it's got advantages from that perspective. But, you know, scale-out file has existed for years. Oh yeah, absolutely. You look at, like we said, you know, you look at Qumulo, you look at PowerScale, you know, these systems are really, you know, solid as well. And I do think that they're going to be seeing some uptake in this new AI world.

Starting point is 00:21:41 Yeah. If not already, absolutely. Well, thanks a lot, Ray. You know, it's been great to talk to you. Great to catch up. You know, I always enjoy listening to Greybeards on Storage, because like I said, I mean, you guys, you know, you've got a lot of experience, but you're not stuck in the past. It's not like you're like, you know, IBM invented everything and the mainframe's all, you know. I mean, you're like, oh, hey, show me the new stuff. I'd like to see the new stuff. Yeah, we like to try to understand what's going on in the new space and stuff like that. And lately, we've been on sort of a new technology trend in the podcast.

Starting point is 00:22:08 We'll go back to normal someday. We'll see. We don't have to. That's right. You know, the technology is changing so rapidly. It's amazing. Those of you listening, thank you very much for joining us as well. Please do subscribe to the Utilizing AI podcast where we have conversations like this

Starting point is 00:22:26 with folks like Ray all the time. So Ray, once again, can you remind us, where can we find content from you on the topic of AI? So I do a lot of AI writing in the rayonstorage.com blog. And again, I'm a Greybeards on Storage podcaster. And occasionally we talk about AI storage concerns there as well. And thanks for having me, Steve. Great. Thank you very much. Those of you listening,

Starting point is 00:22:50 please do subscribe, rate, and review us on iTunes since that, I know everybody says it really does help us, but it really, really does help us, you know, get listeners. We would be glad to have you. And you can find more content like this as well at gestaltit.com.
