Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 04: Is Enterprise IT Operations Really Ready for AI with @RayLucchesi
Episode Date: September 15, 2020
Stephen Foskett is joined by Ray Lucchesi, an expert on enterprise IT infrastructure and operations. Ray has seen many technologies come and go, but he's impressed by AI. Why does he think it's more reality than hype and how does he think it will affect the datacenter going forward? How are product vendors using AI and ML technology today, from storage to security to systems management? What will AI mean to datacenter infrastructure and the future of CPU and GPU hardware? Stephen Foskett can be found at GestaltIT.com and on Twitter @SFoskett. Ray Lucchesi can be found online at SilvertonConsulting.com and on Twitter @RayLucchesi.
This episode features:
Stephen Foskett, publisher of Gestalt IT and organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.
Ray Lucchesi, President of Silverton Consulting. Find Ray on Twitter at @RayLucchesi.
Date: 09/15/2020
Tags: @SFoskett, @RayLucchesi
Transcript
Welcome to Utilizing AI, a podcast about enterprise adoption of machine learning and artificial
intelligence technologies. I'm your host, Stephen Foskett from Gestalt IT. You can find me writing
at gestaltit.com and you can find me on Twitter at @SFoskett.
And I'm Ray Lucchesi. I'm a Greybeards on Storage podcaster. I'm also a Ray on Storage
blogger, and I'm at silvertonconsulting.com. And my Twitter ID is @RayLucchesi.
Thanks, Ray. Honestly, it's really great to have you here because I think that in a way,
you embody the topic of utilizing AI more than many of our guests because you've been involved
in enterprise IT for a long time, and you've seen a lot of different technologies come and go.
And yet one of the things that I always enjoy about your blog and
about your podcast and just talking to you generally
is that you don't seem stuck in the past in terms of, you know, this is how it
should be or this is how it was. It always seems like you have an open mind
about new technologies. And, you know, you've been writing a lot about AI.
I try to. You know, I've been
following AI for, gosh, 30, 40-plus years. I've been really interested in it off and on during the
AI winters and AI summers and all that stuff. It's really emerged over the last decade or so as a
solution for pattern matching and that sort of thing, in a way that it never
had before. So it's kind of exciting from my perspective to see what it can and can't do.
And I try to keep up with the technology as it emerges.
Exactly. And I think that by being involved in enterprise tech for
a long time, you've seen how it really is. And one of my concerns, honestly, about artificial intelligence and machine learning is that a lot of the pitches don't seem all that practical.
A lot of the things that people are saying would require massive business model changes: basically, transform your whole way of doing business and use AI to predict future growth and all this. I mean, these are really business
kinds of things. But when we look at the IT space, my question is, do you see this technology
coming to IT operations as well? Or is it really a line of business technology?
I see it both. I mean, to a large extent, from a line of business perspective,
you know, depending on the application, depending on the environment that they have,
if there's something that might benefit from a more accurate pattern recognition and that sort
of stuff, AI could be a significant benefit. But, you know, Google and other organizations have
shown that they can utilize AI to reduce power consumption, to reduce air cooling, those sorts of things. So there's an operational aspect to AI as well. It's harder, I would say, from an IT operations perspective. It's a question of matching: it's a pattern matching machine today.
If you can define something that requires better pattern matching in IT
operations, then this would be a natural for it.
Yeah. And that to me is the natural first
application of this. And I think that's really where we're seeing it. Things like
processing log files. I think it's just a total slam dunk, right?
Yeah, because you've got, you know, you're looking for patterns out of logs to see what's
going bad or what's in the process of going bad.
You know, if you're doing some preventive maintenance kinds of things, if you want to
try to, let's say you've got a thousand servers out there with, you know, 10,000 disks or
something like that, you can utilize AI to try to ascertain
which one of those disks is going bad
and get rid of it or swap it out before it happens,
before it fails and stuff like that.
I think those are natural types of situations
that you could use this for.
But again, it's what's a reasonable pattern recognition
application of the technology.
If you can come up with something like that
and you've got the data, I think AI is a natural solution. AI deep learning today, right?
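To make the disk example concrete, here is a minimal sketch of flagging suspect drives across a fleet. It is not a deep learning model and not how any vendor does it: it swaps in a simple z-score outlier test as a stand-in for the pattern recognition Ray describes. The function name `flag_suspect_disks`, the disk IDs, and the error counts are all hypothetical.

```python
from statistics import mean, pstdev

def flag_suspect_disks(error_counts, threshold=3.0):
    """Return IDs of disks whose error count is a high outlier versus the fleet.

    error_counts maps a (hypothetical) disk ID to a per-disk error counter,
    for example reallocated sectors pulled from logs.
    """
    values = list(error_counts.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Every disk looks the same, so nothing stands out.
        return []
    return sorted(
        disk for disk, count in error_counts.items()
        if (count - mu) / sigma > threshold
    )

# Made-up fleet: 100 disks with a low error count, one with a very high count.
fleet = {f"disk{i:04d}": 2 for i in range(100)}
fleet["disk0042"] = 250
suspects = flag_suspect_disks(fleet)
print(suspects)
```

A trained model would look at many signals over time rather than one counter, but the shape of the job is the same: collect data from every device, score each one, and act on the outliers before they fail.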
Yeah, and I guess it depends on what you mean by AI, right?
Right, right.
But let's say deep learning and, you know, machine learning applications. Certainly in storage,
a lot of companies are using or claiming to use AI.
So let's talk about that.
You know, storage management, you know, you got things like the InfoSight.
HPE's InfoSight, and there's a number of other generations of that across a number of other vendors.
I think so.
If you look at AI, if you look at InfoSight to a large extent, it cut its teeth on preventive maintenance of the solutions that it has. So it's trying to
identify problems before they occur and have, you know, some activity or some action that a
customer can take or service can take to make sure it doesn't affect their operations. Again,
it's pure pattern recognition. So they're getting massive amounts of data from every one of these systems out in the field.
They've got this database out there.
They can see when a problem actually does occur.
And then they can try to feed that sort of pattern information
into a deep learning model and go forward with it.
And, you know, the challenge with AI is you have to continue to train it.
It's not something you can train once and leave. You have to continue to train it.
So you're taking all this data in.
Every time it sees something new, you're going to have to feed that back into it.
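The "keep training it" point can be sketched as an online update loop. This is only an illustration under stated assumptions: it uses a tiny perceptron, not the deep learning models discussed here, and the `update` function, features, and labels are made up for the example.

```python
def update(weights, features, label, lr=0.1):
    """One online perceptron step: adjust weights only when the prediction is wrong."""
    score = sum(w * x for w, x in zip(weights, features))
    predicted = 1 if score > 0 else 0
    error = label - predicted
    return [w + lr * error * x for w, x in zip(weights, features)]

# Initial training pass over a small stream of (features, label) pairs.
weights = [0.0, 0.0]
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 0.0], 1)]
for features, label in stream:
    weights = update(weights, features, label)

# Later the field reports a pattern the model has not seen; feed it back in.
weights_after_new_data = update(weights, [0.0, 1.0], 1)
print(weights, weights_after_new_data)
```

The model is never finished: each new observation from the field is another call to the same update step, which is why the training hardware has to stay available.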
And I think that that's one of those things that presents a lot of challenges on sort of an operational basis, but also on a technical basis. Because if you're going to continue to train it, then that means that you need hardware that is capable of doing the training.
And I know that you've looked into that, like what it takes.
Yeah, yeah, absolutely. You know,
in this recent post I did on machine learning performance, one of the top
players there was NVIDIA. They had their latest DGX A100 solution there.
They had over 1,400 GPUs running this MiniGo algorithm, learning how to play Go
to 50 percent effectivity. Not everybody's going to have that. It can take a lot of
time and effort to do this well, and the hardware requirements can be significant.
It depends on the data, too.
I mean, God, you know, I did a mini AI algorithm
on my blog post titles.
I'm trying to figure out which ones were more popular and why they're more popular, and have some way of predicting popularity of blog posts based on titles.
And I can do it on a single GPU without too much of a problem.
I've got a couple more I can fire up to do other stuff.
But, you know, I've got fewer than 1,000 blog posts, so there's fewer than 1,000 titles, maybe 10 or 12 words each. It's not a lot of data. With InfoSight,
every Nimble Storage system out there is generating literally thousands of data items every minute,
if not every second, and they're sending that back to HPE. You know,
it's a massive repository of data. It's a perfect fit for AI, but
providing the hardware to run all that stuff through is significant.
Yeah, absolutely.
Do you think that in the future,
do you think that enterprise IT infrastructure,
I mean, do you think the data center is gonna have
like an AI corner that's sitting there
running these applications?
I think that, absolutely.
I mean, today, you know, you're starting to see
the introduction of Dell servers that are focused on GPUs,
VxRail has a GPU configuration.
There are a number of solutions out there that are focused on, you know, providing computational
technologies there that can be deployed for either video displays or for AI machine learning.
And once you've got that, so let's say, I don't know, you have a
thousand-person organization running VDI, you've probably got some servers out there with GPUs in
them. During the day, they're devoted to VDI sorts of solutions. At night, if that even exists for you,
because you might be worldwide, fire up some AI stuff.
Well, that's an interesting thought. I was
actually thinking of that when you were talking about the GPUs, because a lot of companies do currently have a lot of GPUs for VDI
applications. And it's a similar technology, really.
It is exactly the same technology. I mean,
like I said when we were talking before, I took my crypto mine, which was three GPUs, and
I converted it to an AI workstation, because, you know, it just wasn't there anymore.
Yeah. Yeah. So do you think, though,
that this is going to be something, I mean, you
talked about the 24/7. Is this going to outgrow the VDI footprint?
Is this going to be something where you're going to have a whole other world of
enterprise tech, you know, just a whole other vertical that's AI?
There are a number of hardware-only solutions.
Cerebras and Graphcore and those kinds of guys
are creating special processing units
strictly for neural network processing
or, you know, deep learning kinds of things.
Those things exist.
I mean, if you look at what NVIDIA did with their A100,
it's focused to a large extent on machine learning workloads.
I'm sure it can do GPU kinds of stuff,
but it's been touted as an AI workstation environment.
You know, Google has their TPU.
There's special hardware that's available.
It's very costly.
In my mind, the Cerebras thing is the size of
a 16-inch wafer or something like that.
It's got 40,000 cores
on the thing. Okay, that
can do that, but it's
a specialized solution.
A lot depends.
Absolutely. A lot depends on
the amount of data you're working with and what
you're trying to do. If it's a one-off or a two-off thing, use the GPUs you've got in-house and have at it.
If it's going to be something you're going to be doing thousands of data items from every solution you have in the field every second, then you need something a little bit more beefy to work with this stuff.
Well, it's interesting that you say that, because you and I were both at the Intel Xeon Scalable launch, and I don't know if you remember, but one of the
things that they talked about was basically Xeon as a deep learning platform.
Intel, honestly, well, my feeling is that they went in the wrong direction initially
by trying to make CPUs with just
massive vector calculators, you know, AVX-512.
And then they totally went the other way and they realized, oh wait,
what do you mean, you only want eight bit? We can do eight bit. And suddenly
they're talking about the new, new, which is, you know,
low resolution.
They have their own deep learning library
that they've supplied,
which is lower resolution than 32- and 64-bit floating point.
And can it do the job?
I think it can.
It doesn't seem like 64 bit resolution
is required for most AI opportunities.
And 16 is probably adequate and eight might work.
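As a rough illustration of what "eight might work" means, the sketch below squeezes floating point weights into signed 8-bit integers and measures what is lost. It is a generic symmetric quantization scheme, not Intel's library; the function names and the sample weights are assumptions for the example.

```python
def quantize_int8(weights):
    """Symmetric linear quantization of floats to signed 8-bit integers."""
    # One scale for the whole list; fall back to 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Map 8-bit integers back to approximate float values."""
    return [v * scale for v in quantized]

weights = [0.82, -0.41, 0.003, -1.27, 0.5]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each restored weight is off by at most half a quantization step (`scale / 2`). Whether that error is tolerable depends on the model, which is the open question in the conversation.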
And in fact, this last blog post
I wrote on machine
learning performance, this was solutions that were learning MiniGo, how to play Go
and stuff like that. An Intel Xeon system, Copper Lake 6 or something like that, came in like
number six or seven on the list. 14,000, no, 1,400 A100 GPUs was first.
These guys, so that was like,
it took 20 seconds to do this.
And there was eight Copper Lake servers
with maybe four CPUs each or cores each
or something like that.
And it took like 500 seconds.
So you can do it.
It's not impossible.
But that was their pitch. And I remember at the time we
were all kind of scratching our heads like, are you kidding me? You can't compete with GPUs.
But their pitch was, Xeon may not be able to compete with GPUs, but Xeon can compete with no
GPUs. In other words, if you're not going to buy GPUs, then the Xeon can take on a lot of that workload. You know, you can kind of
repurpose your virtual infrastructure or your on-premises
cloud to do the deep learning calculations. And then you don't have to buy those GPUs.
Yeah.
And to me, that was like, oh, I get what you guys are saying. You're not trying to say
that Xeon is going to beat the A100. You're trying to say that not having the A100 is a viable alternative.
Yeah, yeah.
I mean, you have to use their library.
You have to do some conversion activity of your model.
But they claim it's all TensorFlow compatible.
And if that's the case, then you can do it.
And honestly, I think that, like you said, I mean, as well, maybe it's not as fast, but maybe that's a way that companies might not need to have.
Because I'm just imagining this, I mean, like my background in data center comes from when there used to be lots of verticals in the data center.
Like there was the telecom rack.
There was the finance rack.
There was the HR rack.
There was the manufacturing rack, and all these kind of different things. Everybody had their own
vertically integrated stack of equipment. And then we kind of virtualized and commoditized
the data center with VMware, and more applications could
share the same equipment. But I'm just thinking, okay, well, if it needs GPUs, then that's a specialized piece of hardware. That means we're not going to be able to
commoditize that.
Yeah. Well, I mean, obviously there are more than just NVIDIA
in the GPU game, and that's one solution. A lot of these environments do VDI and have GPUs;
they're just maybe not using them all day long, depending on the size of the organization, that sort of thing. So that's an option. And if you
want, you can use Xeon CPUs to do it with the proper libraries and all that stuff. It's
another step. It's not a big deal, from my perspective, to make that conversion. I haven't tried that
logic, but that's not saying I couldn't do it.
Well, then the other thing, though, is that, you know, Intel, yeah,
they maybe got caught a little bit on the back foot when it came to
the GPU dominance of deep learning,
but they're no fools and they've been working really hard. And in fact,
just in August they announced a whole range of
GPGPU platforms that they can now produce.
They'll be rolling those out. Like I said, they've got new instructions in the Xeon.
I do think that this is something that maybe is going to be part of the future data center.
Yeah, I think so too. I mean, it's surprising to me how many things
that we do on a daily basis that pattern matching can help.
And that's what this deep learning stuff does for us. It's just this
fine-tuned pattern matching that you can fire up without really that much effort. I think
the biggest challenge is getting the data right, quite frankly.
Well, exactly.
And let's talk about that.
So, I mean, you know, you're an old school storage guy like me.
This is a really weird data set, right?
It's huge and it needs high performance occasionally at various spots.
And it needs to be able to dump a massive amount of data in here
and then take that out and dump a different
set of data in there. I mean, it's just a really weird storage application.
So during the
training opportunity, you're feeding all this data into this model. The model
is making computational changes to its network, and then you're doing it again.
You're probably feeding it the same data anywhere from 20 to 100 times, randomized.
And, you know, to some extent it's sequential, but it's not.
You know, it's small picture files or small text images or in the case of InfoSight, it's probably, you know, large status dumps from a storage system or something like that. And yeah, it is a lot of data,
but it's not like the old days where we're processing, we're updating, you know,
this OLTP paradigm. No, that's not it. This is more like, give me a bunch of small
files as fast as you possibly can while I'm training. Then I don't want to talk to you again
until the next time we do a training run, to a large extent, because now it's all inferencing. And yeah, maybe I'll
record what happened, my inferencing activities and stuff, but.
It's just such a weird
application. Is there any application in the data center that you can think
of that has been like this? I mean, I guess, you know, maybe a data lake?
Yeah, yeah, big data types of things,
batch processing of old. Not really, though; that was to a large extent just input and
output kinds of stuff. But this is input many times, right, over and over again, randomized,
and then you're actually, the output is this model, this set of weights associated with a
neural net that you're creating.
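The access pattern described here, the same files read many times in a different order each pass, can be sketched in a few lines. This is a simplified illustration; `epoch_read_order` and the file names are hypothetical, and a real training framework would wrap this in its own data loader.

```python
import random

def epoch_read_order(sample_ids, epochs, seed=0):
    """Yield (epoch, order) pairs: every epoch reads every sample once, reshuffled."""
    rng = random.Random(seed)
    for epoch in range(epochs):
        order = list(sample_ids)
        rng.shuffle(order)
        yield epoch, order

# Eight made-up small image files, read 20 times over.
samples = [f"img_{i:03d}.jpg" for i in range(8)]
orders = [order for _, order in epoch_read_order(samples, epochs=20)]
print(orders[0])
print(orders[1])
```

From the storage system's point of view, every pass touches the whole data set, but the order looks random, which is why it behaves like neither a pure sequential scan nor classic transaction processing.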
Well, one of the, one of the things too, is that, um, there's been a big demand for sort of snapshotting the data set along with the,
the model, because basically, you know, you're, you're doing this training,
and then you need to be able to save that data set so that you can do the same
training with the same data set again,
if you should change the model,
or maybe you just need to go back and look. Right. You want to snapshot the model. You have
to save the model. You have to save the data with the model. So you can, if you have a compliance
problem with the model, it's not properly doing something or it's biased or something like that,
you want to be able to go back to the data, see why it got to that point. And if you're going to
do any inferencing with the model, you have to save the model in any fashion.
You have to save it in some form
that you can run an inferencing engine on.
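One way to keep the data tied to the model, sketched under assumptions: record a fingerprint of the exact training data alongside the saved weights. The manifest layout, the `fingerprint` and `training_manifest` names, and the sample records are invented for this example; a storage snapshot would do the heavy lifting in practice.

```python
import hashlib
import json

def fingerprint(records):
    """SHA-256 over the training records, in order, as a data set identifier."""
    digest = hashlib.sha256()
    for record in records:
        digest.update(record.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

def training_manifest(dataset_records, weights, model_name):
    """Bundle the model weights with a fingerprint of the data that produced them."""
    return {
        "model": model_name,
        "dataset_sha256": fingerprint(dataset_records),
        "num_records": len(dataset_records),
        "weights": list(weights),
    }

data = ["title one,12", "title two,40"]  # made-up training records
manifest = training_manifest(data, [0.1, -0.2], "demo-model")
saved = json.dumps(manifest, sort_keys=True)
print(saved)
```

If a compliance or bias question comes up later, the fingerprint shows whether the data set on hand is the one the model was actually trained on.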
TensorFlow has got this thing called TensorFlow Lite, TFLite,
from Google, obviously.
And effectively what they're doing
is you can take a model trained on TensorFlow
and convert it into a TFLite model
that you can run on a Raspberry Pi or an Android
or Arduino or something like that. So this is for IoT applications and stuff. But sooner or later, you have to take the model and that inferencing engine and deploy it someplace.
Yeah. Yeah. And that's, I think, what we're going to start seeing as this technology makes more waves.
You know, we're going to start seeing special purpose
processing units.
I mean, Apple is building those into their A-series CPU.
I definitely think that we're going to see that.
But, you know, that doesn't need to be heavy, like you're pointing out.
Like it can run, I mean, absolutely.
I've actually run stuff like that for Home Assistant on the Raspberry Pi to do, you know, pattern
recognition, like, is it a bicycle? Is it a man? That kind of thing. And it works,
even on that little low-powered ARM CPU, because the model was already created, right?
Yeah, the hard lifting is all done during the model training and stuff like that. After that, it's a pretty straightforward process.
Yeah. Yeah. So we're starting to hear as well a lot of the storage companies talking about how they're going to leverage their products in this AI space.
I know that, just off the top of my head, I've heard that from Pure Storage. I've heard that from VMware with vSAN and stuff. I certainly have heard that from Dell.
Do you buy it? Do you buy it that we need special purpose storage systems?
I don't think you need special purpose storage systems. I mean,
to a large extent, these are file systems. This is file system based data.
Pure FlashBlade, sure. Would you use
that as an input or someplace to gather the data and supply it to an
AI engine? Sure, why not? You could use Qumulo, you could use S3, you could use just about anything
that can support objects or files. You know, the big question, in a lot of the storage stuff that I've written in the past about AI, is that you want to keep that expensive hardware, GPU, Cerebras, Graphcore, whatever, busy.
In order to do that, you have to feed it data all the time.
Those sorts of things might require, you know, more enterprise grade types of storage services and stuff like that.
But can you do it without?
I think so.
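Keeping the expensive hardware busy usually comes down to reading ahead. The sketch below shows the idea with a background thread and a bounded queue; `prefetching_loader` and `fake_read` are hypothetical names, and `fake_read` stands in for whatever file or object store actually holds the data.

```python
import queue
import threading

def prefetching_loader(read_batch, num_batches, depth=4):
    """Yield batches while a background thread reads ahead from storage."""
    buffer = queue.Queue(maxsize=depth)  # bounded, so read-ahead cannot run away
    done = object()                      # sentinel marking the end of the data

    def producer():
        for i in range(num_batches):
            buffer.put(read_batch(i))    # blocks when the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()              # blocks only if storage falls behind
        if item is done:
            return
        yield item

def fake_read(i):
    """Stand-in for a storage read; returns a tiny made-up batch."""
    return [i] * 3

batches = list(prefetching_loader(fake_read, num_batches=5))
print(batches)
```

If the consumer ever waits on `buffer.get()`, storage is the bottleneck; if the producer is always blocked on `buffer.put()`, the storage is fast enough and ordinary file or object storage will do.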
Well, I feel like some of these companies kind of lucked out. For the longest time,
we've had these object stores. They were kind of like a solution looking for a problem.
And I remember when Pure introduced FlashBlade, a lot of us were scratching our heads like,
okay, that's awesome, but what's it for? And then suddenly AI comes along and is like,
I need all your file and object. And Pure is like, ooh, ooh, ooh, right here, right here, we got that.
Yeah, yeah.
It was an interesting play for them, and it certainly made a good fit from that perspective.
And are there other solutions out there that could do this sort of thing?
Absolutely.
It's obviously a unique architecture, and it's got advantages from that perspective.
But, you know, scale-out file has existed for years.
Oh yeah, absolutely. Like we said, you look at Qumulo,
you look at PowerScale. These systems are really
solid as well.
And I do think that they're going to be seeing some uptake in this new AI
world.
Yeah. If not already, absolutely.
Well, thanks a lot, Ray. It's been great to
talk to you. Great to catch up. I always enjoy listening to Greybeards on Storage,
because like I said, you guys have got a lot of experience, but you're not
stuck in the past. It's not like you're like, you know, IBM invented everything and the
mainframe's all, you know. You're like, oh, hey, show me the new stuff. I'd like
to see the new stuff.
Yeah, we like to try to understand what's going on in the new space and
stuff like that. And lately, we've been on sort of a new technology trend in the podcast.
We'll go back to normal someday.
We'll see.
We don't have to.
That's right.
You know, the technology is changing so rapidly.
It's amazing.
Those of you listening, thank you very much for joining us as well.
Please do subscribe to the Utilizing AI podcast where we have conversations like this
with folks like Ray all the time.
So Ray, once again, can you remind us,
where can we find your content on the topic of AI?
So I do a lot of AI writing on the rayonstorage.com blog.
And again, I'm a Greybeards on Storage podcaster.
And occasionally we talk about AI storage concerns
there as well.
And thanks for having me, Steve.
Great. Thank you very much. Those of you listening,
please do subscribe, rate, and review us on iTunes since, I know everybody says it
really does help us, but it really, really does help us get listeners. We would be
glad to have you. And you can find more content like this as well at gestaltit.com.