Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 3x28: Revisiting Utilizing AI Season 3
Episode Date: April 25, 2022

Frederic Van Haren and Stephen Foskett look back on all the subjects covered during Season 3 of Utilizing AI. The podcast covered many topics, from religious and ethical implications of AI to the technology that enables machine learning, but one topic that stands out is data science. If data is the key to AI, then the collection, management, organization, and sharing of data is a critical element of making AI projects possible. We also continue our "three questions" tradition by bringing in open-ended questions from Rich Harang of Duo Security, Sunil Samel of Akridata, Adi Gelvan of Speedb, Bin Fan of Alluxio, Professor Katina Michael, and David Kanter of MLCommons.

Three Questions:

Stephen's Question: Can you think of an application for ML that has not yet been rolled out but will make a major impact in the future?

Frederic's Question: What market is going to benefit the most from AI technology in the next 12 months?

Rich Harang, Senior Technical Lead, Duo Security: In an alternate timeline where we didn't develop automatic differentiation and put it on top of GPUs, so this entire deep learning hardware family that we depend on now never got invented, what would the dominant AI/ML technology be and what would have been different?

Sunil Samel, VP of Business Development, Akridata: How will new technologies like AI help marginalized members of our communities: folks like senior citizens, minorities, people with disabilities, and veterans trying to reenter civilian life?

Adi Gelvan, CEO and Co-Founder of Speedb: What do you think the risks of AI are, and what is your recommended solution?

Bin Fan, Founding Member, Alluxio: I'm wondering if AI can help with a humanitarian crisis happening in the future?

Katina Michael, Professor, School for the Future of Innovation in Society, Arizona State University: If AI was to self-replicate, what would be the first thing it would do?
David Kanter, Executive Director of MLCommons: What's a problem in the AI world where you are held back by the lack of good publicly available data?

Hosts: Frederic Van Haren, Founder at HighFens Inc., Consultancy & Services. Connect with Frederic on Highfens.com or on Twitter at @FredericVHaren. Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.

Date: 4/25/2022 Tags: @SFoskett, @FredericVHaren
Transcript
I'm Stephen Foskett.
I'm Frederic Van Haren.
And this is Utilizing AI.
Welcome to another episode of Utilizing AI,
the podcast about enterprise applications for machine learning,
deep learning, data science, and other artificial intelligence topics.
In the past, we've gone through a lot of different things here on Utilizing AI.
And now to wrap up our third season before our AI Field Day event, I wanted to take a moment and just look back at season three.
Frederic, it's been a long season.
We've had a lot of episodes recorded and shared.
What do you make of it?
I really like it.
Like you said, we had many sessions, but the reality is I do like the variety.
We talked about hardware.
We talked about methodologies, DevOps, DataOps.
We talked about bias, all kinds of bias, religious, marketing.
And we even had the state of AI and the enterprise by Manoj, who works for Deloitte, which gave us a great view on where the enterprise is.
So, like I said, I really do like the variety and I find it really difficult to find one or two sessions that really stuck out because they were all quite good. Yeah, I agree. I love the curveball sessions that we do,
like the religious and ethical aspects of AI with Leon Adato. That one was absolutely amazing.
I even, you know, kind of thinking back as well, you know, we had some really interesting
discussions about the future of work and impact of AI on the third world.
Things like that that you might not come to think about here in an enterprise setting.
But we all have to think about those things, don't we?
Right, we do.
Yeah, and it's important.
I think that's the whole point of AI.
It's about learning and learning new things. And
that's also one of the reasons, you know, on the side where I like to travel, because that's where
you learn new things, right? And by having those great sessions, we kind of better understand where
AI is going and what AI can do for us, and also what AI will do for us that we don't want it to do.
Exactly. And so we talked, for example, about the invisible workers behind the algorithms and the
impact of the algorithms. But of course, also a lot of nuts and bolts about how it works. So we
talked about data infrastructure, a lot about data. How does that strike you? I mean, I know that you have a background in data science and databases.
It seems like data is the key to AI.
Right. I mean, once upon a time, you could say that the source code was the IP
and that data was kind of just used for testing.
Now it's completely flipped around, meaning that the source code is actually all open source.
Most of the frameworks are coming from the open source world, but it's the data that's now your IP.
So whoever owns the right data and the right amount of data has a foot forward on the competition. And I think that's also one of the reasons that a lot of enterprises are kind of
forced into AI because of the competitiveness and that they have data. But if you don't use
or consume that data, then it's as if you don't have the data at all.
Yeah, true. And I think that that's actually the case with a lot of enterprises where they've been consuming or collecting a lot of data on the anticipation that it would be valuable, you know, filling up data warehouses, data lakes, you know, whatever, you know, with all this information, and then hoping that they could make use of it. And I think that they see machine learning as a way to leverage that data.
But how practical is that?
I mean, I think that it's challenging to take existing data sets and make something, you
know, spin them into gold, right?
Right.
I mean, data management is still the number one limiting factor to success in AI. And one of the reasons for that is because in the early days of AI,
enterprises were told, collect a lot of data.
And what people forgot to tell is, once you collected it,
you have to do something with it.
And so a lot of people started piling up a lot of data.
And in the end, they realized,
how do we apply those ML and DL methodologies
on that data? And then it's the wake-up call, right? So what do we do? Not all data is equal.
They might have some duplicate data, some data that still needs some cleaning. But I think we're
getting there. I think with the focus more on data and data management,
I do feel that people understand that data is key and that they have to have proper methodologies to
handle data. Yeah, absolutely. And that does seem to be one of the recurring themes that we've had here.
Another theme, of course, is sort of just the nuts-and-bolts aspect
with regard to data of sort of where do we store this?
How do we store it?
How do we make it perform?
How do we make it accessible to,
whether it's for training or inferencing?
And a lot of discussion this season
focused on those sort of nuts and bolts
storage aspects.
Right.
I think one thing I really learned in season three is that, yes, source code is being shared
through open source, but also now there are hubs or data hubs for data.
And certainly in the last session we had with David Kanter, where he was talking about benchmarking
and methodologies to apply to benchmarking, he also said that the open source availability
of data and the ability to have data ready to benchmark is really key.
Yeah. And that was actually one of the highlight episodes for me was talking to him about how we're
going to be doing, how do we know whether things are performing or not? And, you know, I think that it was interesting
that it was a very pragmatic discussion. It wasn't, oh, well, you know, we got that covered.
It was much more, yeah, this is a challenge. And this is one of those things that there's a lot of
solutions to. And we're not sure how we know how things are performing.
Right.
One of the big challenges is that there's a lot of open source tools that are changing really rapidly.
And then there is the new hardware technology.
So the amount of possibilities of combinations, if you want, between all of them is very difficult
to figure out on your own.
And then once somebody figured it out, how do you replicate it?
I think one thing I see in the AI world a lot is that a lot of the advancements are
the result of repeating what somebody else did.
Maybe repeating is not the right word.
Maybe the ability to reproduce somebody else's process.
And in order to be able to reproduce it,
you need to have some kind of a benchmark
to know, you know, how do we,
if I can reproduce it,
what does it do to my system?
And what are the components I need to do to improve?
So I do think, you know,
benchmarking is important
from an absolute standpoint,
but also from a relative standpoint
towards, you know,
newer technologies and other people in the same market.
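That distinction between absolute and relative benchmarking can be sketched in a few lines. This is a toy illustration with invented numbers, not any real benchmark suite: an absolute score is just a raw measurement, while a relative score normalizes against a reference system so results from different setups can be compared.

```python
# Toy sketch of absolute vs. relative benchmarking (all numbers invented).
# An absolute score is a raw measurement; a relative score normalizes it
# against a reference system so results are comparable across setups.

def relative_score(measured_s, reference_s):
    # >1.0 means faster than the reference system, <1.0 means slower
    return reference_s / measured_s

runs = {"my_cluster": 95.0, "vendor_claim": 60.0}  # seconds per training epoch
reference = 120.0                                  # reference system's epoch time

scores = {name: relative_score(t, reference) for name, t in runs.items()}
# my_cluster is about 1.26x the reference; vendor_claim is 2.0x
```

The absolute numbers tell you what your system does; the relative scores are what let you reproduce someone else's result and know whether your setup is in the same ballpark.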
Yep, absolutely. And not just for, you know, sort of shoot out reasons, but just to make
sure that the solutions that you're getting are going to be performing appropriately. Yeah, so another discussion focused around the changing nature of training versus inferencing, whether it was BrainChip talking about training at the edge, whether it was, you know, Bin Fan talking about the challenges of finding data sets and integrating those, or whether it was talking
about just basically how are ML tools evolving? How do you see this? I mean, is it still the old
paradigm of sort of you train in the back room and then you roll it out into production,
or are we going to see a more dynamic push and pull?
Well, I think the personalization of AI forces you to kind of combine the training and the inference components in a fast loop.
In a fast loop, I mean, where you have the ability to react in real time. And so the split concept of training and inference really worked out well when you had zero understanding or very little understanding of the AI problem you were trying to solve.
So for example, for us in speech recognition in the early days, we were really trying to boil
the ocean. And the way we did that was to separate the training piece from the
inference piece and to collect as much data as we could on the training side, and then hopefully
come out with a model that everybody else would then be able to use. While in the world today,
it's completely different in the sense that there are basic models for speech recognition. So you don't have to reinvent the
wheel. What you do have to do is to personalize the AI. And so you do that by taking the existing
model and to adapt that model to your own personal experience. So examples of that are automated
attendants or assistants, Waze, Amazon, where they kind of start to learn your individual
behavior. And I do think that forces the concept of training and inference into a single cycle.
Now, that being said, it's not that easy, right? The way you train and the hardware and the technologies and methodologies
you use to train data can be significantly different from your inference approach. But,
you know, I think AI is really evolving. And I think that should be one of the goals is to kind
of close the loop on training and inference and to benefit people more from a personalized version.
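That idea of a shared base model plus per-user adaptation can be shown with a deliberately tiny toy example. Everything here is invented for illustration (the model is a one-variable linear fit, not a real speech system): the shared weight is trained once for everyone and frozen, and personalization fits only a small per-user piece against that user's own data.

```python
# Toy sketch of personalization: a base model trained for everyone,
# then a small per-user adaptation fitted on that user's own data.

def base_model(x, w=2.0, b=0.0):
    # The shared model everyone starts from: y = w*x + b
    return w * x + b

def personalize(user_xs, user_ys, w=2.0):
    # Freeze the shared weight w; fit only a per-user offset.
    # Closed-form least-squares bias: b = mean(y - w*x)
    residuals = [y - w * x for x, y in zip(user_xs, user_ys)]
    return sum(residuals) / len(residuals)

def mse(xs, ys, w, b):
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# This user's behavior is offset from the population average: y = 2x + 1
user_xs = [0.0, 1.0, 2.0, 3.0]
user_ys = [1.0, 3.0, 5.0, 7.0]

b_user = personalize(user_xs, user_ys)             # fitted offset: 1.0
err_base = mse(user_xs, user_ys, 2.0, 0.0)         # base model error: 1.0
err_personal = mse(user_xs, user_ys, 2.0, b_user)  # adapted model error: 0.0
```

The point of the sketch is the shape of the loop, not the math: the expensive shared training happens once, while the cheap per-user step can be rerun continuously as new interactions come in, which is what pulls training and inference into one cycle.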
So another topic that came up quite a lot in season three was hardware.
We talked with NVIDIA, we talked with Cerebras, Habana, BrainChip, as I mentioned a moment ago.
What's your take on hardware and the changing nature of
machine learning? Well, I think one of the challenges is that, well, first of all,
it's all about math. Well, if I'm simplifying, it's all about math, right? And if you go to
billions and trillions of parameters, it basically means you have to do a lot of calculations. And so the technology that will survive today is the one that lets you find a shortcut.
So even if you can cut off or shave off a few microseconds per calculation,
if you multiply that with billions and trillions, that gives you a lot of time saved.
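As a rough back-of-the-envelope (the numbers below are made up purely to show the scale effect, not measurements of any real accelerator):

```python
# If a hardware shortcut shaves a couple of microseconds off each calculation,
# the savings only matter because of the enormous operation counts involved.
saved_per_op_s = 2e-6            # assume 2 microseconds saved per calculation
ops = 1e12                       # assume a trillion calculations in a training run
saved_s = saved_per_op_s * ops   # 2,000,000 seconds saved in total
saved_days = saved_s / 86_400    # roughly 23 days of compute time
```

A per-operation saving that is invisible on its own turns into weeks of wall-clock time once it is multiplied across a whole training run.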
And the challenge is that those shortcuts are really different for different types of technologies, you know,
for analyzing binary videos or binary audio or text.
The methodologies are different.
So I do feel there is a need
to have different types of hardware
that are optimized for those shortcuts.
Now, in the reality, as time goes by,
we better understand how to solve those problems.
So we might not have to use a mathematical shortcut,
but we might have to use a logical shortcut
And then there is the topic of scale, right? So a lot of the hardware
vendors do provide a processing unit of some kind, being a CPU or a GPU or an
FPGA, but what the market is demanding now is scale. And so the best way to explain it is if you have a cycle
and it takes you six months to generate a model,
you will ask yourself the question,
how can I reduce that from six months to six weeks or six days?
What do I need to do?
And so you need to be able to cluster that hardware.
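Why clustering helps, but rarely linearly, can be sketched with an Amdahl's-law-style model. The serial fraction and worker counts below are invented for illustration: some fraction of the cycle (data prep, checkpointing, inter-node communication) does not parallelize, and it puts a floor under how short the cycle can get.

```python
# Amdahl-style model of scaling out a training pipeline: a fraction of the
# cycle stays serial no matter how much hardware you cluster together.

def training_time(base_months, n_workers, serial_fraction):
    parallel_part = (1.0 - serial_fraction) / n_workers
    return base_months * (serial_fraction + parallel_part)

base = 6.0  # a six-month model-generation cycle on one node

t5 = training_time(base, 5, 0.05)    # 1.44 months: roughly "six weeks"
t64 = training_time(base, 64, 0.05)  # ~0.39 months: diminishing returns
```

Under this toy model, five workers get you from six months to roughly six weeks, but even 64 workers cannot get you to six days: the 5% serial fraction caps the cycle at about 0.3 months no matter how much hardware you add.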
And that's also a problem that hardware vendors are trying to solve,
is how do I make it scalable?
Like Cerebras goes for a much larger silicon
while NVIDIA sticks with its own GPU format
but provides a lot of hardware around it
like with Mellanox networking
and NVLink as a way to communicate between the
GPUs. And so they're trying to push scalability from that perspective. So I do think that we
haven't seen the end of it. And actually another thing that strikes me, and this relates to what
you just said, is that it's not all about hardware either.
We heard at NVIDIA's GTC this year, we've certainly heard loud and clear from Intel and from Habana Labs that the APIs and the integration of this hardware into the software ecosystem are actually much more important to getting this stuff in the hands of data scientists and machine
learning engineers and application developers.
I know that Intel has been doing a great job with their oneAPI, and we've seen a lot of
sort of pre-built models, pre-cooked zoo-style offerings
where it's all about making this stuff available
and easy to use
and sort of hiding the complexities of the infrastructure.
Like I said, at GTC,
NVIDIA pointed out that they have more people
working on software than hardware now, which is, I think, a good reflection of the importance of integrating this stuff and not just assuming that it's going to work.
Right. It's the ecosystem, right?
You don't have to have the fastest hardware.
You don't have to have the best data.
You don't have to have the best algorithms.
But if you have a, let's call it what people call it, an efficient data pipeline, that can get you pretty far.
And again, it's also about optimizing, right?
So you learn as you go and you make some changes. And I think that's one thing that many enterprises still don't understand is that they think they need to buy the best and the fastest hardware or storage for their solution.
And then the fastest and the best network.
And then the same thing with the compute side.
And then hoping that all these pieces together will actually work really well.
The reality is that those components will not be fine-tuned.
And so you don't get the efficiency of a well-defined ecosystem.
And really, if you're in the business of building models as fast as you can and innovate, it's
really challenging.
I mean, the market changes really quickly all the time,
so you have to make decisions when you cut over to a new algorithm or new piece of hardware.
But here's a good thing. You have plenty of options, right? You have no excuses not to do AI,
I guess. You just need to find the right options and come up with a strategy
that works out for you and your enterprise.
Absolutely.
And I think that that's really the core of it
is that even though we spend so much time
talking about all this new cool,
new hardware, new software, new models,
new approaches,
none of it matters unless it's actually serving the needs of the business. And that came through loud and clear, as you said,
when we talked with some of the folks who were actually there on the front lines implementing
machine learning. So to wrap up the season, I thought it would be fun, since we don't have a
guest on here, we've been doing these three questions at the end of each episode, I thought it would be fun to ask
ourselves some of the questions and have some of the guests join us here to ask us some of the
questions. So to kick it off, I'm going to ask you one and then you can ask me one, and we'll see
where that goes. So first off, Frederic, I'm going to ask you one of the questions that I've been asking guests all season long.
And that is, quite simply, can you think of an application for machine learning that has not yet been rolled out, but is going to have a big impact on the future?
Yes, I think there are many applications out there that haven't seen the light yet.
And I do think that's because AI is penetrating a variety of markets.
For example, there was a market I never thought AI was going to be used,
which was reviewing resumes, and it's happening.
So is there one particular one right now?
I don't know.
I think in the end, everybody will fall for AI, I guess. So what market will benefit the most
from AI in the next 12 months? Oh boy, what market will benefit the most from AI in the next 12
months? Okay, I am going to be controversial and a bit obnoxious. I'm going to say black hat security. In other words, basically, I see a huge trend in attackers using machine
learning and AI to find holes to fuzz attacks and to get into
systems. And I feel like that's been building and building,
especially since we've had
so much machine learning on the white hat side of things, you know, on the defensive side of
things. I think that the big story is going to be basically the bad guys using AI to attack networks
and systems. And that's going to be really horrible, but they're going to benefit from the technology.
So now I thought it'd be fun to invite some of our previous podcast guests to ask us some of their three questions. So first off, we've got a question from Rich Harang, a senior technical
lead at Duo Security. In an alternate timeline where we didn't develop auto differentiation and put it on top of GPUs, and so this entire deep learning hardware family that we depend on right now never got invented or was invented many years later,
what would the dominant AI slash ML technology be and what would have been different?
So I think that without the invention of GPUs and hardware accelerators, we would still be
in a world that was really CPU-centric, in the sense that it wasn't all about data but more about CPU.
So I do think we would have the Crays and the Apollos
of the world being a lot more successful as opposed to being run out of business,
and have data centers filled with large compute environments as opposed to
data environments. And we also have a question from Sunil Samel from Akridata, who is the VP
of Business Development. I'm Sunil Samel from Akridata. And the question I am wrestling with is how will new technologies like AI or
what's coming up, Metaverse, how will these help marginalized members of our community? These are
folks like senior citizens, minority groups, people with disabilities, veterans trying to
reenter civilian life. So I'm really excited about the prospects of artificial intelligence technology
helping marginalized, you know, disabled, differently abled people. This is something
we actually talked about recently on the Brain Chip podcast. We talked about the many ways in
which machine learning could, for example, help someone to see or help someone to hear.
Or, you know, even things like, you know, maybe you have long COVID and you can't smell anymore.
And you could have a system that could detect spoiled food.
You could have a system that is constantly watching if someone falls down so they don't have to push a button,
that it's actually kind of monitoring them
and assisting them in that way.
I think that it could be really tremendously,
tremendously beneficial.
And we also talked about this on the podcast as well,
that AI can help, for example, with pain management
by helping people to understand their own sensory inputs, you know, by having a computer offload.
So, frankly, I think that this is going to be a tremendous, tremendous market for personal AI.
The next question we have comes from Adi Gelvan, CEO and co-founder of Speedb.
Hello, I'm Adi. I'm the CEO and co-founder of Speedb.
My question for you is,
what do you think the risks of AI are
and what recommended solution do you have for it?
Well, I think the risk is obviously bias, right?
I think that that would be my number one issue.
It's not technology.
It's maybe malicious people, but I definitely would say a bias is the biggest issue.
And a solution, I don't know.
I mean, it's easy to say control your data.
I mean, we're talking now about data lineage, meaning where's the data coming from and being able to
show where the data is coming from in a model. So I would say let's maybe do what we can do to
monitor where the data is coming from. Next, we have a question from Bin Fan, who is a founding member of Alluxio.
Hi, I'm Bin Fan, founding member from Alluxio.
I'm wondering if there's any way AI can help for a humanitarian crisis happening in the future.
This is a really interesting aspect. And there are so many different things
that happen to people, war, pestilence, famine. How about we take one of them? How about we take
famine? I think that there is a great potential for AI to help improve crop yields, especially in marginal places.
We've already seen a lot of this technology getting out there where, for example, sensors
are, even remote sensors, even satellite-based sensors are monitoring rainfall and making
sure that farmers know where to water more or less.
And I think that that's really going to be something that AI can help.
Basically improving crop yields, helping us grow crops in marginalized places, and helping us to adapt to climate change.
I think all of these things are things that AI can really help with and help to avoid humanitarian crises.
The next question comes from a memorable episode: Katina Michael, a professor in the School
for the Future of Innovation in Society at Arizona State University.
Katina Michael from Arizona State University. And my question is, if AI was to self-replicate,
what would be the first thing it would do?
That's a good question. What is the first thing it would do? I think the first thing it would do is try
to find ways to learn, because I think AI realizes that learning is the only way to make progress. So
I think the first thing it would do is learn as much as it can. Now, the next question we have is from David Kanter,
who is the executive director at MLCommons. Hi, this is David Kanter. I'm the executive
director of MLCommons. And my one question for you is, what is a problem in the AI world
where you are held back by the lack of good publicly available data?
I'm looking forward to hearing the answer.
I think it's hard to find an ML problem that's not held back by the lack of good publicly available data.
Everything from autonomous driving to medical applications, in most cases, either the data sets are incomplete, or the data
sets are biased through the wrong, you know, limited collection, limited availability,
or they're proprietary and hidden. And that's one reason that I so love what companies
like MLCommons are doing, to try to broaden the availability of
these data sets. I think that this is one of those things where you can't have enough good data,
and yet we really don't have enough good data. So is it a cop-out to say everywhere?
I guess I'm going to say that, everywhere. Well, thanks so much for joining us.
This is the wrap-up episode for season three of Utilizing AI.
As I mentioned, we've got our AI Field Day event,
May 16th through 18th.
And if you'd like to be part of that, please reach out.
Or if you'd like to be part of Utilizing AI
or AI Field Day in the future,
just reach out to host at utilizingai.com.
We'd love to hear from you.
Frederic, before we hop off, where can people follow you
and connect with you on Enterprise AI?
Well, they can find me on Twitter and LinkedIn as Frederic Van Haren.
And as for me, you can find me at techfieldday.com
or gestaltit.com. You can find me as well on the Utilizing AI podcast and the Gestalt IT Rundown, which is a weekly news program. We're available on most podcast applications as well as on YouTube.
This podcast is brought to you by gestaltit.com, your home for IT coverage from across the enterprise.
For show notes and more episodes, go to utilizing-ai.com or find us on Twitter at utilizing underscore AI.
Thanks for joining and we'll see you next time.