Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 15: Applying the Lessons of Data Science to Artificial Intelligence with @YvesMulkers
Episode Date: December 1, 2020. We begin by taking a look at the world of data and analytics in the enterprise. Data architects like Yves have been involved in enterprise IT applications for decades, but the world really took off with the advent of data warehouses and the field of data science. Now AI and ML are impacting the field in many ways, and we discuss how this world has changed. Data scientists come from a statistics background, while modelers come from software engineering. How do the tools interact and intersect? What is Yves excited about and what frightens him? How does the infrastructure support all this? We finish with a look at what the future looks like: We will see a lot of evolution in science and medicine for data and ML, and this technology will be found everywhere in the datacenter and the cloud. Key questions covered include the following: How did the world of data and analytics evolve in the enterprise? How has AI/ML impacted the job of the data scientist? How can the data modelers, software developers, and IT Ops work together? Episode Hosts and Guests Yves Mulkers, Data & Analytics Strategist. Find Yves on Twitter at @YvesMulkers. Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen’s writing at GestaltIT.com and on Twitter at @SFoskett. Date: 12/1/2020 Tags: @SFoskett, @YvesMulkers
Transcript
Welcome to Utilizing AI, the podcast about enterprise applications for machine learning,
deep learning, and other artificial intelligence topics.
Each episode brings experts in enterprise infrastructure together to discuss applications
of AI in today's data center.
Today we're discussing the collision between data science and AI.
First, let's meet our guest, Yves Mulkers.
Hi, I'm Yves Mulkers. I'm a data strategist,
and I blog about my passion for data on 7wdata.be. And you can find me, of course,
on Twitter at @YvesMulkers or on LinkedIn. Thanks, Yves. And I'm Stephen Foskett,
organizer of Tech Field Day and publisher of Gestalt IT. You can find me on Twitter at S Foskett.
So, Yves, you and I met a few years back when we did a data-focused Tech Field Day event called Data Field Day.
And I met you through the whole wider data community.
And many of us in IT operations aren't as familiar with the world of data analytics, data science, big data,
all those things. So I wonder if maybe we can start out by talking a little bit about
what is that world all about? What is your world as a data strategist? And how does that apply to
the enterprise? Well, as a data strategist, I look at it on a higher level: I look at the data
assets that companies have in place and at the use cases, how they can get benefits
out of that, understanding their business and
transforming that data into means to help them in their decision support. On the other side,
as a data architect, I go way back, I mean 30 years, when we were building databases, in times when we were
speaking of management information systems. That's in fact reporting:
financial reporting, management reporting, operational types of reporting. We made a big
evolution thanks to Kimball and Inmon, people who identified modeling techniques to optimize
the querying performance of the data
warehouses where you stored your data, and which reflect in a natural way how we look at the world.
We identify products, invoices, customers, and by defining those holistic entities, which are
well known, we were able to tie everything together with common dimensions. And this is an
aspect that carried on into data science and artificial intelligence. In artificial
intelligence and data science, these techniques help us, in fact, in defining those holistic models.
And that's the evolution I saw, and the analogy over the years as well.
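The Kimball-style approach Yves describes, fact tables tied together through shared, conformed dimensions, can be sketched in a few lines. This is an illustrative example with made-up table and column names (not from the episode), using pandas to stand in for a warehouse query:

```python
import pandas as pd

# Conformed dimension: customers, shared across subject areas.
dim_customer = pd.DataFrame({
    "customer_key": [1, 2, 3],
    "customer_name": ["Acme", "Globex", "Initech"],
    "region": ["EU", "US", "US"],
})

# Fact table: invoices, keyed to the customer dimension.
fact_invoice = pd.DataFrame({
    "invoice_id": [101, 102, 103, 104],
    "customer_key": [1, 1, 2, 2],
    "amount": [250.0, 100.0, 400.0, 50.0],
})

# A typical star-schema query: join the fact table to its dimension,
# then aggregate along a dimension attribute.
report = (
    fact_invoice.merge(dim_customer, on="customer_key")
                .groupby("region", as_index=False)["amount"].sum()
)
print(report)
```

Because the dimension is conformed, the same `dim_customer` table could join against an orders fact, a support-tickets fact, and so on, which is what ties the holistic entities together.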
Indeed, and I would think that the world of data science would be intrinsically linked
with the world of AI, since essentially, if you take it apart, AI in the enterprise
really is about making sense of data, making sense of data inputs. And that's
really what data scientists have been trying to do for decades: make sense of it. Now you've got,
I guess, the robot brains to help you do this. How do you suppose AI is going, or how has it
affected the data space, and how will it affect it in the future?
Well, it definitely has. What I saw, and am quite happy about, is that data management,
which is the foundation for building all these solutions, I mean the models and the solutions,
has gotten much more attention in the last two years. That's a trend I see happening. I mean,
I've been discussing with
people in the field for more than 20 years, doing data management and being frustrated
that data was not high on the agenda of most of the CEOs or the CIOs.
People started to understand that if we feed in crappy data, we don't get performing
results, we get crappy results as well.
That's an important part of what we saw happening in that area. I think the reason it gets
more attention is also because data science, and the models that run in an automated way, help us in
optimizing our data management: for example, looking at missing data, doing profiling.
These are all data management techniques that help us in producing better data and better results as well.
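A minimal sketch of the kind of automated profiling Yves is describing: missing-data rates and column statistics surfaced the moment you open a dataset. The table here is hypothetical, and real profiling tools (ydata-profiling, Great Expectations, and the like) go much further:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Automated profile: per-column dtype, missing rate, and cardinality."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "distinct": df.nunique(),
    })

# Hypothetical customer table with gaps in it.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "email": ["a@x.be", None, "c@x.be", None, "e@x.be"],
    "last_invoice_id": [101, None, 103, 104, 105],
})

print(profile(customers))

# The "20% of customers have no invoice" style of observation
# becomes one line instead of weeks of manual analysis:
no_invoice_pct = customers["last_invoice_id"].isna().mean() * 100
print(f"{no_invoice_pct:.0f}% of customers have no invoice")
```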
So the new ways of managing your data help us in the data management. Indeed. And it would seem that it would be incredibly useful to you to be able to
work with artificial intelligence, machine learning concepts when assessing data. But
I have a suspicion that the fields are not yet fully integrated. Is that true? I mean,
are there still different people doing ML work and data
science work? Yeah, exactly. Exactly. I think you put that right. And it comes from the fact that we
have different skill sets. Data scientists were more related to the statistics
world; that's how they built their models. They were not so into the relational database.
One thing that we had in common
was having access to the operational systems,
but they stored their data in a different way,
while in the traditional world,
we were still working on the data warehouses
and pulling our data from there.
Whereas data scientists,
just for that specific use case,
pulled together the data.
They were doing exactly the same, cleansing the data, preparing the data, manipulating the data so it would fit their model.
So the preparation part data scientists are doing is the same as in the traditional world.
But if they run a model, it's organized, the data is organized in a different way.
And therefore, they needed different compute, different storage and capabilities of the resources.
Now, with the cloud, where you have all these capabilities available on demand,
that's where they mostly and faster went to the cloud to deploy and develop their models
as we were still in the traditional world working in on-prem systems.
And the need for different calculations, or the power to calculate these models,
meant that some separation stayed between the
different teams, although 80% of the work we are doing is kind of the same thing.
I see it now emerging and merging a bit more together. If I saw architectural
designs, or when we made architectural designs, it was the traditional way.
Then you blend in the analytics.
And if you were lucky, they blended the analytics into the data warehouse, or vice versa.
But there was never one system that allowed you to do the different types of calculations or loads.
Another thing that occurs to me as well is, you know, we've talked on the podcast before about the various tools that are being used in the, you know, ML Ops space.
And although those tools say, you know, they have the word data in there quite a lot, it occurs to me that perhaps those tools aren't all that integrated with the workflow of the data scientist. In other words, you know, you've got sort of another
situation here where the data model folks who maybe are more software developers or, you know,
more enthusiasts about artificial intelligence and machine learning, those folks might not have
much understanding of the world of data science and specifically the workflow of data science.
You know, what is, I guess, in a nutshell,
what is the workflow for data analytics?
How do you collect, you know, organize
and store data for use?
And how does that affect machine learning workloads?
Well, there is a big difference. Back in the days of the
data warehouse, it was the same challenge: you couldn't run your workflow on the complete
data set that you had available. For example, if you were working in development, you had a subset
of what was available in production. If, for example, you build up a new data warehouse and
you had to load 20 years
of financial history data, you couldn't do that overnight. With the new systems in place,
you can run on the complete data set. Data science worked in the same way: they took a sample of the
complete data set, built their models, and then ran them in production, only to see that these models
weren't performing in the same way. They got different results because the data distribution was different,
and they got different insights.
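The sampling problem Yves describes, a model built on a development subset meeting a very different distribution in production, can be made concrete with a toy example. The numbers below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Full "production" dataset: invoice amounts from two customer segments.
small = rng.normal(loc=100, scale=10, size=90_000)    # 90% small orders
large = rng.normal(loc=1_000, scale=50, size=10_000)  # 10% large orders
full = np.concatenate([small, large])

# A development sample that accidentally over-represents large orders,
# e.g. because it was pulled from one convenient source system.
biased_sample = np.concatenate([small[:5_000], large[:5_000]])

print(f"full-set mean: {full.mean():>8.1f}")           # ~190
print(f"sample mean:   {biased_sample.mean():>8.1f}")  # ~550

# A model (or even a simple threshold) tuned on the sample is
# calibrated to a very different distribution than production.
```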
An advantage is what we see now
that you have the power of the machines
that allow you to run on the full set
and in an acceptable time,
not waiting for two weeks
before you get results of your models,
but you get these results in five minutes
if you have performant infrastructure in
place. And that's where I see things coming together: the preparation, and the suffering of
working on a subset of the data instead of the real data. Let's follow that thread for a moment,
actually: the idea that machine learning opens new doors to data science, and specifically that
by bringing in artificial intelligence processing, you can do more with the data.
What excites you about this technology? Why do you feel that this is a positive for data science?
Well, I look at it as a big supporter of the traditional things that we were doing: doing everything ourselves manually, writing SQL scripts, doing profiling, doing data cleansing.
If you have these models, and they run against your data, and you give them the way they should respond and correct, they can take a lot of the laborsome, very hard, tedious work out of the hands of data scientists.
And therefore, they become more productive in building these models.
They have more time for analyzing the models and bringing real value through their knowledge, much like the discussion about robots taking the tedious work out of the hands of
manual laborers. That's what I see happening now with machine learning and artificial intelligence:
they help us in optimizing the way we handle data.
Yeah. So is this a case though of sort of the tools driving the field, or is this a case where you are adopting the tools?
And I'm not trying to insinuate
that this isn't such a good idea,
but is this something you had been longing
to be able to do in the data science field,
or is this a new door that's opened
and you're sort of exploring
what you could do with these tools?
I think you put it exactly right.
I've really been longing for so long to be able to do that:
doing profiling in an automated way, always available.
If I open up a database, for example, it can tell me, on a meta level,
your data looks like this.
So I know exactly that 20% of the customers don't have an invoice, and I can
start questioning already what is happening in there. Before, I had to spend two weeks analyzing
the data before I understood what was in there. So it's not the tools driving this; it's the longing for
these tools that make me more productive. And for me, data science was, 10 years back,
a complete new world.
And I started looking into what is the analogy and what are the differences between the two worlds.
And I saw that if these algorithms and modeling tools can help me in optimizing the traditional way of working, I can become so much more proficient in what I'm doing.
I mean, it's now 10 years later.
There is a big evolution being made,
but it's still going pretty slowly at that level.
But it goes in two ways.
I mean, the tools support what we do,
and we are wanting to have some more tools that make us productive.
So you see that that is happening.
And that's great to hear, because one of the things, again, that we've talked about
on the podcast is the fact that people tend to be too optimistic
about AI. We talked about that, for example, with Josh Fidel a few weeks ago when we were talking about the many applications of AI, and the fact that people may just take these things and run with them: you know, wow, the machine can generate convincing text.
That means we should have it generate all the text. From your field, what are the guardrails that are being put on ML
applications by, you know, competent data scientists instead of just, you know, people
sort of running off the, you know, getting excited about the technology? How do you ensure that the
result that you're getting is a valuable result instead of just applying
the technology for the technology's sake?
Well, yeah, traditionally you do that with governance and policies, where you try to guide that.
With artificial intelligence,
it's a bit harder.
We try to build models as a copy of how we humans respond to certain events.
And that's a part where you have to be careful.
You don't need to use AI for each and everything.
You need to have the use case and say, in this area, it has its sense.
It helps in optimizing.
For example, one of the applications is conversational chatbots.
That helps if, for example, a help desk gets, about 80% of the time, exactly the same questions: what is the status of my order?
When will it be delivered? These are typical questions that can be easily answered by machine learning or artificial intelligence systems that look into your system and get these responses back.
And it's using speech to text, text to speech. It looks into the databases and then builds you a
contextual answer in that respect. That's where it has its sense. But like you said,
I've been in those fields as well. Wow, excited about the technology and then starting to use it
for everything. What I've learned back in the days, first try to build it yourself manually.
Then you understand how it should work and then you can optimize it and automate it.
And that's still a very good thing to understand.
First, learn to understand what you're trying to solve and then put technology to that.
And those are the kind of core values you need to have embedded
as an artificial intelligence developer
or solution provider.
That's very important to have those kind of ethical skills
in that area.
Yeah, indeed.
And I think that that's really critical
and I wouldn't expect anything less from you.
I mean, that really is an architect's perspective, right?
To say, first think about what we want, then think about what
we can do or can't do with the tools at hand, and then apply those tools.
I mean, that's the very methodical approach somebody like a data architect would bring
to it.
Unfortunately, that's not the approach
that everyone is taking. Are there things that you're seeing, especially in your field, that
do frighten you a little bit about how people are using this technology? Yeah. It's when
you start rambling off on your laptop, for example: you find some models, you develop a model
in Python, and you don't really know what is coming out of that.
So I think you take the people that dare to build these models and just say, okay, let's do it, and put them together with the architects, who say, okay, have you thought of this and this and this, together with the policymakers. But make sure there is a balance between all the three or the
four stakeholders in building those types of solutions. I mean, if you leave it to the
policymakers, I think in the next 100 years we won't have a working model. So it's kind
of that experimentation that needs to be available or allowed in a controlled manner. And that's bringing those multiple skills
together in one team that can build that, or educating people as much as possible in the dangers:
if you have these models, what are they capable of doing, and what wrong conclusions
can they draw from the model or suggest to you?
So that's something we have to be very careful about.
Indeed.
And you have to be careful with that with all aspects of technology.
I mean, that's the bottom line.
I mean, again, that's, you know, things that folks like us have been focusing on for years.
But, I mean, it's the old standby, and I talked about technological determinism in a previous episode as well:
the idea that the tools will tend to drive an outcome instead of the desired
outcome driving the tools.
I want to shift gears, though, and get back to something you talked about maybe about
10 minutes ago, which is the infrastructure aspect and, you know, sort of the practical
aspects here. So we are developing and deploying
machine learning tools. We are developing and deploying, you know, data warehouses,
data lakes, whatever you want to call them. You know, how does this impact infrastructure?
How, you know, how does IT need to change in order to support these workloads?
Well, I mostly put it in a simple way: they need to be faster. They need to be more agile. I mean,
I've been in so many places where you say we need this new field from the operational system
because business needs to report upon that. And it takes another three months before it becomes available in your organization.
Or maybe we did a new promotion
for the new products we launched
and we want to have so many simulations.
So, typically, maybe we do 1,000 a day,
but for this specific action,
we need to do 1 million a day.
So on the infrastructure part,
having that scalable infrastructure available
where you can
switch it on, have it available for this specific use case, and then put it back
to a halt. That's something we expect from infrastructure as well:
that we don't have to think about the infrastructure as such. I mean, I come from the times when we
looked at the bits and the bytes and the headers and the size of a table, and how
to tweak indexes, to make it all more performant. If you don't have to think about that, and your
infrastructure is that smart for you, you can really focus on your solution. And that's
something where I think we put a lot of stress on the infrastructure
as well: if you see you don't have enough storage capacity, you can just switch in new storage
that you have available to do your simulations or whatever.
Yeah. And we've been talking about this. We talked about this with Chris Grundemann,
the impact on the network and the fact that
we need high performance, low latency networks.
We also have talked in the past about infrastructure overall.
Surprisingly, given my background, we haven't talked very much about storage, but we probably will.
And the truth is that, if I can kind of dive in on that, the storage world
is definitely seeing possibilities in the world of machine learning.
And specifically, you know, you have companies developing basically high performance, scalable
object stores to feed machine learning workloads.
So that's one thing.
Another thing we're seeing, and this is
something that we talked about on the episode with Matt Bryson about sort of the various companies
that are working on AI, is the impact of AI on the cloud: whether AI inferencing is going to
be done in the cloud, the impact of things like data gravity and the movement of data between on-premises and the cloud, and whether that's going to sort of foreclose that ability for us. How do you come to grips with the limits of IT infrastructure and how that
puts a boundary on what you can do as a data scientist? Yeah, exactly. I mean, it's a limitation.
It's a boundary. I mean, you have an idea, and the systems need to follow at the speed
of thought. That's how I address it most of the time.
And if that's not the case, you need to carefully plan:
okay, we want to do this.
And then the next month, this is what becomes available.
And then we can move, and move again.
And you miss a lot of opportunity in the market.
So I'm quite happy to see that infrastructure definitely is catching up
with virtualization, with hybrid clouds, with multi-clouds as well.
From the architecture perspective, we're quite happy to see that this becomes really transparent
if you think about privacy: that, for example, you are only allowed to store your data on
European servers, or can easily move it into the area where you're allowed to store that
personal data. So I see very much a positive evolution, and especially that the
vendors understand what is happening in that world. The same goes
for the traditional hardware or infrastructure vendors: they provide
their on-prem service in a more cloud-like approach, where you can rent their infrastructure,
so you don't have that discussion anymore of CapEx and OpEx. That's another evolution
I see happening as well, which only allows us to be more flexible and have more
capability to, yeah, build models and software at the speed of thought and follow up with the
business. So are you excited? Do you, I guess, think about where this is going? Do you anticipate
that businesses are going to be processing machine
learning models and churning through data and everything? Do you anticipate that being something
that happens in the data center, or do you anticipate that it would happen in the cloud?
I heard kind of both things from you just now. Well, there's still a big
evolution of moving everything to the clouds.
But what we see is that there is still a big issue when you need to move your data out of the clouds.
So we see a lot of solutions coming out that offer a hybrid cloud type of solution.
We see IoT coming in,
so you need to have that calculation at the edge as well.
So you're pushing, in fact, the models towards where the data is,
instead of always centralizing the data and then doing your model execution.
That's what I see happening. And I see a trend more and more
towards really having the models be mobile, moving them where they need to be, so you
don't need to move the data all over the place all the time. And it aligns with
one of the first systems I ever saw. I mean, if we were registering in an accounting system
at the utmost level of detail, the transaction line,
we already calculated the aggregations at that time,
just for optimization.
And I see the same approaches coming back now,
saying, okay, we do that calculation where the data is.
So it's more virtualization of the data
that is happening these days.
And again, if I can dive in from an architecture and infrastructure enthusiast perspective,
I do hear a lot about that. So we have, you know, composable infrastructure, a solution that's been looking for a problem for a while. It seems like a good idea to be able to have
hardware that can be flexibly reconfigured on the fly. That solution has not yet found the
ideal problem, but we're starting to see companies, for example, Liqid, talking about
using composable infrastructure as a solution to the AI question. Similarly, we also see sort of disaggregated compute
and moving compute closer to storage.
So, you know, compute and enabled storage solutions,
those are being proposed as a way to do data processing.
I'm not sure they'll be able to do ML processing,
but again, we talk to companies that are talking about it. So, put on your time traveler's hat and
step into your time machine. Where do you suppose the fields of data science and artificial
intelligence are going to be in the enterprise 10 years from now? Well, I think we're moving
pretty fast, although I sometimes feel that we're not going fast enough.
But in 10 years, I think we will have a big evolution in where these systems are going.
Especially in the financial world and the medical world, we will see a lot of evolution in where that is going.
So these are the industries I'm keeping a close eye on,
to see where it's evolving.
And I think they are big demanders as well of where this is going.
If I see the evolution, I always think of my first steps
when I started in IT, when we had the Commodore 64,
and my huge record collection, which just fit in three BASIC listings.
And now I pick up my phone and I
have Spotify, with all possible music stored somewhere, so miniaturized. If you see
that that only happened in the last 30 years, I'm very positive and hopeful and excited
to see where technology will bring us in the next 10 years. You can see already the evolution from big data 10 years ago
to where we are now with the evolution of machine learning
and artificial intelligence.
So it will keep on going,
and evolving at an even higher speed.
So I heard you say, you know, science and medicine.
Do you think that ML is just going to be sort of a standard component of everything in the data center?
Or do you think that that's not going to happen?
I see that happening more and more.
I mean, we kind of weeded out which algorithms are best for which type of problem. You have already the AutoML solutions as well,
where you just run your model against various libraries
and see what is the best outcome
and combine various algorithms together.
So that's what I see.
You see that in various autonomous databases already,
where that is being used.
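The AutoML pattern Yves mentions, running the same data through several algorithms and keeping the best performer, reduces at its simplest to a cross-validated bake-off. A minimal scikit-learn sketch (a real AutoML system would also search hyperparameters and build ensembles):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate algorithms to run the data through.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score each candidate with 5-fold cross-validation and keep the best.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)

for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:>20}: {s:.3f}")
print(f"best model: {best}")
```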
You talked as well about
having AI on the chipset itself.
So I think, yeah, if we can make these algorithms perform
much better and much faster,
definitely they will go onto the chipset
and make even the hardware smarter,
not only the data models in themselves.
That's, you know, it is an optimistic future, but it seems to be the direction that many of
these companies are hoping to take it. I mean, certainly, again, you know, back to the episode
with Matt Bryson, we talked about, we probably name dropped 20 companies that are working on
putting AI literally everywhere in the IT stack. So we'll see if that happens.
Thank you so much for joining us today, Yves.
It's always great to catch up with you.
And I really do look forward to hearing, you know,
sort of the data architect perspective on machine learning.
Where can people connect with you to follow your thoughts
on enterprise AI, data science, and other topics?
Stephen, they can follow me on my blog, 7wdata.be,
where we curate the hottest trends on enterprise AI, machine learning,
everything related to the data field as such.
Or they can follow my Twitter handles, @YvesMulkers or @7wdata.
Or they can connect with me on LinkedIn.
Great. And thank you very much.
If you'd like to connect with me,
you can find me on Twitter at sfoskett.
And I'd love to hear what you think about this podcast.
Thanks for listening to Utilizing AI.
If you enjoyed this discussion,
remember to subscribe, rate, and review the show on iTunes
since that really does help our visibility.
This podcast is brought to you by gestaltit.com,
your home for IT coverage across the enterprise.
For show notes and more episodes, go by gestaltit.com, your home for IT coverage across the enterprise. For show notes and more episodes,
go to utilizing-ai.com,
or you can find us on Twitter at utilizing underscore AI.
Thanks, and we'll see you next time.