Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 15: Applying the Lessons of Data Science to Artificial Intelligence with @YvesMulkers
Episode Date: December 1, 2020. We begin by taking a look at the world of data and analytics in the enterprise. Data architects like Yves have been involved in enterprise IT applications for decades, but the world really took off with the advent of data warehouses and the field of data science. Now AI and ML are impacting the field in many ways, and we discuss how this world has changed. Data scientists come from a statistics background, while modelers come from software engineering. How do the tools interact and intersect? What is Yves excited about and what frightens him? How does the infrastructure support all this? We finish with a look at what the future looks like: We will see a lot of evolution in science and medicine for data and ML, and this technology will be found everywhere in the datacenter and the cloud. Key questions covered include the following: How did the world of data and analytics evolve in the enterprise? How has AI/ML impacted the job of the data scientist? How can the data modelers, software developers, and IT Ops work together? Episode Hosts and Guests Yves Mulkers, Data & Analytics Strategist. Find Yves on Twitter at @YvesMulkers. Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen’s writing at GestaltIT.com and on Twitter at @SFoskett. Date: 12/1/2020 Tags: @SFoskett, @YvesMulkers
Transcript
Welcome to Utilizing AI, the podcast about enterprise applications for machine learning,
deep learning, and other artificial intelligence topics.
Each episode brings experts in enterprise infrastructure together to discuss applications
of AI in today's data center.
Today we're discussing the collision between data science and AI.
First, let's meet our guest, Yves Mulkers.
Hi, I'm Yves Mulkers. I'm a data strategist,
and I blog about my passion for data on 7wdata.be. And you can find me, of course,
on Twitter at @YvesMulkers or on LinkedIn. Thanks, Yves. And I'm Stephen Foskett,
organizer of Tech Field Day and publisher of Gestalt IT. You can find me on Twitter at S Foskett.
So, Yves, you and I met a few years back when we did a data-focused Tech Field Day event called Data Field Day.
And I met you through the whole wider data community.
And many of us in IT operations aren't as familiar with the world of data analytics, data science, big data,
all those things. So I wonder if maybe we can start out by talking a little bit about
what is that world all about? What is your world as a data strategist? And how does that apply to
the enterprise? Well, as a data strategist, I look at it on a higher level: I look at the data
assets that companies have in place and at the use cases, how they can get benefits
out of that, understanding their business and
transforming that data into means to help them in their decision support. On the other side,
as a data architect, I go way back, I mean 30 years, when we were building databases, in times when we were
speaking of management information systems. That's in fact reporting:
financial reporting, management reporting, operational types of reporting. We made a big
evolution thanks to Kimball and Inmon, people who identified modeling techniques to optimize
the querying performance of the data
warehouses where you stored your data, and which reflect in a natural way how we look at the world.
We identify products, invoices, customers, and by defining those holistic entities, which are
well known, we were able to tie everything together with common dimensions. And this is an
aspect that carried on into data science and artificial intelligence. In artificial
intelligence and data science, these techniques help us, in fact, in defining those holistic models.
And that's the evolution I saw, and the analogy over the years as well.
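The Kimball-style approach Yves describes, fact tables tied together through shared, conformed dimensions, can be sketched in a few lines. This is an illustrative example with made-up table and column names (not from the episode), using pandas to stand in for a warehouse query:

```python
import pandas as pd

# Conformed dimension: customers, shared across subject areas.
dim_customer = pd.DataFrame({
    "customer_key": [1, 2, 3],
    "customer_name": ["Acme", "Globex", "Initech"],
    "region": ["EU", "US", "US"],
})

# Fact table: invoices, keyed to the customer dimension.
fact_invoice = pd.DataFrame({
    "invoice_id": [101, 102, 103, 104],
    "customer_key": [1, 1, 2, 2],
    "amount": [250.0, 100.0, 400.0, 50.0],
})

# A typical star-schema query: join the fact table to its dimension,
# then aggregate along a dimension attribute.
report = (
    fact_invoice.merge(dim_customer, on="customer_key")
                .groupby("region", as_index=False)["amount"].sum()
)
print(report)
```

Because the dimension is conformed, the same `dim_customer` table could join against an orders fact, a support-tickets fact, and so on, which is what ties the holistic entities together.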
Indeed, and I would think that the world of data science would be intrinsically linked
with the world of AI, since essentially, if you take it apart, AI in the enterprise
really is about making sense of data, making sense of data inputs. And that's
really what data scientists have been trying to do for decades: make sense of it. Now you've got,
I guess, the robot brains to help you do this. How do you suppose AI is going, or how has it
affected the data space, and how will it affect it in the future?
Well, it definitely has. What I saw, and am quite happy about, is that data management,
which is the foundation for building all these solutions, I mean the models and the solutions,
has gotten much more attention in the last two years. That's a trend I see happening. I mean,
I've been discussing with
people in the field for more than 20 years, doing data management and being frustrated
that data was not high on the agenda of most of the CEOs or the CIOs.
People started to understand that if we feed in crappy data, we don't get performing
results, we get crappy results as well.
That's an important part of what we saw happening in that area. I think the reason it gets
more attention is also because data science, and the models that run in an automated way, help us in
optimizing our data management: for example, looking at missing data, doing profiling.
These are all data management techniques that help us in producing better data and better results as well.
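A minimal sketch of the kind of automated profiling Yves is describing: missing-data rates and column statistics surfaced the moment you open a dataset. The table here is hypothetical, and real profiling tools (ydata-profiling, Great Expectations, and the like) go much further:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Automated profile: per-column dtype, missing rate, and cardinality."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "distinct": df.nunique(),
    })

# Hypothetical customer table with gaps in it.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "email": ["a@x.be", None, "c@x.be", None, "e@x.be"],
    "last_invoice_id": [101, None, 103, 104, 105],
})

print(profile(customers))

# The "20% of customers have no invoice" style of observation
# becomes one line instead of weeks of manual analysis:
no_invoice_pct = customers["last_invoice_id"].isna().mean() * 100
print(f"{no_invoice_pct:.0f}% of customers have no invoice")
```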
So the new ways of managing your data help us in the data management. Indeed. And it would seem that it would be incredibly useful to you to be able to
work with artificial intelligence, machine learning concepts when assessing data. But
I have a suspicion that the fields are not yet fully integrated. Is that true? I mean,
are there still different people doing ML work and data
science work? Yeah, exactly. Exactly. I think you put that right. And it comes from the fact that we
have different skill sets. Data scientists were more related to the statistics
world; that's how they built their models. They were not so into the relational database.
One thing that we had in common
was having access to the operational systems,
but they stored their data in a different way,
while in the traditional world,
we were still working on the data warehouses
and pulling our data from there.
Whereas data scientists,
just for that specific use case,
pulled together the data.
They were doing exactly the same, cleansing the data, preparing the data, manipulating the data so it would fit their model.
So the preparation part data scientists are doing is the same as in the traditional world.
But if they run a model, it's organized, the data is organized in a different way.
And therefore, they needed different compute, different storage and capabilities of the resources.
Now, with the cloud, where you have all these capabilities available on demand,
that's where they mostly and faster went to the cloud to deploy and develop their models
as we were still in the traditional world working in on-prem systems.
And the need for different calculations, or the power to calculate these models,
meant that some separation stayed between the
different teams, although 80% of the work we are doing is kind of the same thing.
I see it now emerging and merging a bit more together. If I saw architectural
designs, or when we made architectural designs, it was the traditional way.
Then you blend in the analytics.
And if you were lucky, they blended the analytics into the data warehouse, or vice versa.
But there was never one system that allowed you to do the different types of calculations or loads.
Another thing that occurs to me as well is, you know, we've talked on the podcast before about the various tools that are being used in the, you know, ML Ops space.
And although those tools say, you know, they have the word data in there quite a lot, it occurs to me that perhaps those tools aren't all that integrated with the workflow of the data scientist. In other words, you know, you've got sort of another
situation here where the data model folks who maybe are more software developers or, you know,
more enthusiasts about artificial intelligence and machine learning, those folks might not have
much understanding of the world of data science and specifically the workflow of data science.
You know, what is, I guess, in a nutshell,
what is the workflow for data analytics?
How do you collect, you know, organize
and store data for use?
And how does that affect machine learning workloads?
Well, there is a big difference. Back in the days of the
data warehouse, it was the same challenge: you couldn't run your workflow on the complete
data set that you had available. For example, if you were working in development, you had a subset
of what was available in production. If, for example, you build up a new data warehouse and
you had to load 20 years
of financial history data, you couldn't do that overnight. With the new systems in place,
you can run on the complete data set. Data science worked in the same way: they took a sample of the
complete data set, built their models, and then ran them in production, only to see that these models
weren't performing in the same way. They got different results because the data distribution was different,
and they got different insights.
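The sampling problem Yves describes, a model built on a development subset meeting a very different distribution in production, can be made concrete with a toy example. The numbers below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Full "production" dataset: invoice amounts from two customer segments.
small = rng.normal(loc=100, scale=10, size=90_000)    # 90% small orders
large = rng.normal(loc=1_000, scale=50, size=10_000)  # 10% large orders
full = np.concatenate([small, large])

# A development sample that accidentally over-represents large orders,
# e.g. because it was pulled from one convenient source system.
biased_sample = np.concatenate([small[:5_000], large[:5_000]])

print(f"full-set mean: {full.mean():>8.1f}")           # ~190
print(f"sample mean:   {biased_sample.mean():>8.1f}")  # ~550

# A model (or even a simple threshold) tuned on the sample is
# calibrated to a very different distribution than production.
```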
An advantage is what we see now
that you have the power of the machines
that allow you to run on the full set
and in an acceptable time,
not waiting for two weeks
before you get results of your models,
but you get these results in five minutes
if you have performant infrastructure in
place. And that's where I see things coming together: the preparation, and the suffering of
working on a subset of the data instead of the real data. Let's follow that thread for a moment,
actually: the idea that machine learning opens new doors to data science, and specifically that
by bringing in artificial intelligence processing, you can do more with the data.
What excites you about this technology? Why do you feel that this is a positive for data science?
Well, I look at it as a big supporter of the traditional things that we were doing: doing everything ourselves manually, writing SQL scripts, doing profiling, doing data cleansing.
If you have these models, and they run against your data, and you give them the way they should respond and correct, they can take a lot of the laborsome, very hard, tedious work out of the hands of data scientists.
And therefore, they become more productive in building these models.
They have more time for analyzing the models and bringing real value through their knowledge, much like the discussion about robots taking the tedious work out of the hands of
manual laborers. That's what I see happening now with machine learning and artificial intelligence:
they help us in optimizing the way we handle data.
Yeah. So is this a case though of sort of the tools driving the field, or is this a case where you are adopting the tools?
And I'm not trying to insinuate
that this isn't such a good idea,
but is this something you had been longing
to be able to do in the data science field,
or is this a new door that's opened
and you're sort of exploring
what you could do with these tools?
I think you put it exactly right.
I've really been longing for so long to be able to do that:
doing profiling in an automated way, always available.
If I open up a database, for example, it can tell me, on a meta level,
your data looks like this.
So I know exactly that 20% of the customers don't have an invoice, and I can
start questioning already what is happening in there. Before, I had to spend two weeks analyzing
the data before I understood what was in there. So it's not the tools driving this; it's the longing for
these tools that make me more productive. And for me, data science was, 10 years back,
a complete new world.
And I started looking into what is the analogy and what are the differences between the two worlds.
And I saw that if these algorithms and modeling tools can help me in optimizing the traditional way of working, I can become so much more proficient in what I'm doing.
I mean, it's now 10 years later.
There is a big evolution being made,
but it's still going pretty slowly at that level.
But it goes in two ways.
I mean, the tools support what we do,
and we are wanting to have some more tools that make us productive.
So you see that that is happening.
And that's great to hear, because one of the things, again, that we've talked about
on the podcast is the fact that people tend to be too optimistic
about AI. We talked about that, for example, with Josh Fidel a few weeks ago when we were talking about the many applications of AI, and the fact that people may just take these things and run with them: you know, wow, the machine can generate convincing text.
That means we should have it generate all the text. From your field, what are the guardrails that are being put on ML
applications by, you know, competent data scientists instead of just, you know, people
sort of running off the, you know, getting excited about the technology? How do you ensure that the
result that you're getting is a valuable result instead of just applying
the technology for the technology's sake?
Well, yeah, traditionally you do that with governance and policies, where you try to guide that.
With artificial intelligence,
it's a bit harder.
We try to build models as a copy of how we humans respond to certain events.
And that's a part where you have to be careful.
You don't need to use AI for each and everything.
You need to have the use case and say, in this area, it has its sense.
It helps in optimizing.
For example, one of the applications is conversational chatbots.
That helps if, for example, a help desk gets, about 80% of the time, exactly the same questions: what is the status of my order?
When will it be delivered? These are typical questions that can be easily answered by machine learning or artificial intelligence systems that look into your system and get these responses back.
And it's using speech to text, text to speech. It looks into the databases and then builds you a
contextual answer in that respect. That's where it has its sense. But like you said,
I've been in those fields as well. Wow, excited about the technology and then starting to use it
for everything. What I've learned back in the days, first try to build it yourself manually.
Then you understand how it should work and then you can optimize it and automate it.
And that's still a very good thing to understand.
First, learn to understand what you're trying to solve and then put technology to that.
And those are the kind of core values you need to have embedded
as an artificial intelligence developer
or solution provider.
That's very important to have those kind of ethical skills
in that area.
Yeah, indeed.
And I think that that's really critical
and I wouldn't expect anything less from you.
I mean, that really is an architect's perspective, right?
To say, first think about what we want, then think about what
we can do or can't do with the tools at hand, and then apply those tools.
I mean, that's the very methodical approach somebody like a data architect would bring
to it.
Unfortunately, that's not the approach
that everyone is taking. Are there things that you're seeing, especially in your field, that
do frighten you a little bit about how people are using this technology? Yeah. It's when
you start rambling off on your laptop, for example: you find some models, you develop a model
in Python, and you don't really know what is coming out of that.
So I think you take the people that dare to build these models and just say, okay, let's do it, and put them together with the architects, who say, okay, have you thought of this and this and this, together with the policymakers. But make sure there is a balance between all the three or the
four stakeholders in building those types of solutions. I mean, if you leave it to the
policymakers, I think in the next 100 years we won't have a working model. So it's kind
of that experimentation that needs to be available or allowed in a controlled manner. And that's bringing those multiple skills
together in one team that can build that, or educating people as much as possible in the dangers:
if you have these models, what are they capable of doing, and what wrong conclusions
can they draw from the model or suggest to you?
So that's something we have to be very careful about.
Indeed.
And you have to be careful with that with all aspects of technology.
I mean, that's the bottom line.
I mean, again, that's, you know, things that folks like us have been focusing on for years.
But, I mean, it's the old standby, and I talked about technological determinism in a previous episode as well:
the idea that the tools will tend to drive an outcome instead of the desired
outcome driving the tools.
I want to shift gears, though, and get back to something you talked about maybe about
10 minutes ago, which is the infrastructure aspect and, you know, sort of the practical
aspects here. So we are developing and deploying
machine learning tools. We are developing and deploying, you know, data warehouses,
data lakes, whatever you want to call them. You know, how does this impact infrastructure?
How, you know, how does IT need to change in order to support these workloads?
Well, I mostly put it in a simple way: they need to be faster. They need to be more agile. I mean,
I've been in so many places where you say we need this new field from the operational system
because business needs to report upon that. And it takes another three months before it becomes available in your organization.
Or maybe we did a new promotion
for the new products we launched
and we want to have so many simulations.
So, typically, maybe we do 1,000 a day,
but for this specific action,
we need to do 1 million a day.
So on the infrastructure part,
having that scalable infrastructure available
where you can
switch it on, have it available for this specific use case, and then put it back
to a halt. That's something we expect from infrastructure as well:
that we don't have to think about the infrastructure as such. I mean, I come from the times when we
looked at the bits and the bytes and the headers and the size of a table, and how
to tweak indexes, to make it all more performant. If you don't have to think about that, and your
infrastructure is that smart for you, you can really focus on your solution. And that's
something where I think we put a lot of stress on the infrastructure
as well: if you see you don't have enough storage capacity, you can just switch in new storage
that you have available to do your simulations or whatever.
Yeah. And we've been talking about this. We talked about this with Chris Grundemann,
the impact on the network and the fact that
we need high performance, low latency networks.
We also have talked in the past about infrastructure overall.
Surprisingly, given my background, we haven't talked very much about storage, but we probably will.
And the truth is that, if I can kind of dive in on that, the storage world
is definitely seeing possibilities in the world of machine learning.
And specifically, you know, you have companies developing basically high performance, scalable
object stores to feed machine learning workloads.
So that's one thing.
Another thing we're seeing, and this is
something that we talked about on the episode with Matt Bryson about sort of the various companies
that are working on AI, is the impact of AI on the cloud: whether AI inferencing is going to
be done in the cloud, the impact of things like data gravity and the movement of data between on-premises and the cloud, and whether that's going to sort of foreclose that ability for us. How do you come to grips with the limits of IT infrastructure and how that
puts a boundary on what you can do as a data scientist? Yeah, exactly. I mean, it's a limitation.
It's a boundary. I mean, you have an idea, and the systems need to follow at the speed
of thought. That's how I address it most of the time.
And if that's not the case, you need to carefully plan:
okay, we want to do this.
And then the next month, this is what becomes available.
And then we can move, and move again.
And you miss a lot of opportunity in the market.
So I'm quite happy to see that infrastructure definitely is catching up
with virtualization, with hybrid clouds, with multi-clouds as well.
From the architecture perspective, we're quite happy to see that this becomes really transparent
if you think about privacy: that, for example, you are only allowed to store your data on
European servers, or can easily move it into the area where you're allowed to store that
personal data. So I see very much a positive evolution, and especially that the
vendors understand what is happening in that world. The same goes
for the traditional hardware or infrastructure vendors: they provide
their on-prem service in a more cloud-like approach, where you can rent their infrastructure,
so you don't have that discussion anymore of CapEx and OpEx. That's another evolution
I see happening as well, which only allows us to be more flexible and have more
capability to, yeah, build models and software at the speed of thought and follow up with the
business. So are you excited? Do you, I guess, think about where this is going? Do you anticipate
that businesses are going to be processing machine
learning models and churning through data and everything? Do you anticipate that being something
that happens in the data center, or do you anticipate that it would happen in the cloud?
I heard kind of both things from you just now. Well, there's still a big
evolution of moving everything to the clouds.
But what we see is that there is still a big issue when you need to move your data out of the clouds.
So we see a lot of solutions coming out that offer a hybrid cloud type of solution.
We see IoT coming in,
so you need to have that calculation at the edge as well.
So you're pushing, in fact, the models towards where the data is,
instead of always centralizing the data and then doing your model execution.
That's what I see happening. And I see a trend more and more
towards really having the models be mobile, moving them where they need to be, so you
don't need to move the data all over the place all the time. And it aligns with
one of the first systems I ever saw. I mean, if we were registering in an accounting system
at the utmost level of detail, the transaction line,
we already calculated the aggregations at that time,
just for optimization.
And I see the same approaches coming back now,
saying, okay, we do that calculation where the data is.
So it's more virtualization of the data
that is happening these days.
And again, if I can dive in from an architecture and infrastructure enthusiast perspective,
I do hear a lot about that. So we have, you know, composable infrastructure, a solution that's been looking for a problem for a while. It seems like a good idea to be able to have
hardware that can be flexibly reconfigured on the fly. That solution has not yet found the
ideal problem, but we're starting to see companies, for example, Liqid, talking about
using composable infrastructure as a solution to the AI question. Similarly, we also see sort of disaggregated compute
and moving compute closer to storage.
So, you know, compute and enabled storage solutions,
those are being proposed as a way to do data processing.
I'm not sure they'll be able to do ML processing,
but again, we talk to companies that are talking about it. So, put on your time traveler's hat and
step into your time machine. Where do you suppose the fields of data science and artificial
intelligence are going to be in the enterprise 10 years from now? Well, I think we're moving
pretty fast, although I sometimes feel that we're not going fast enough.
But in 10 years, I think we will have a big evolution in where these systems are going.
Especially in the financial world and the medical world, we will see a lot of evolution in where that is going.
So these are the industries I'm keeping a close eye on,
to see where it's evolving.
And I think they are big demanders as well of where this is going.
If I see the evolution, I always think of my first steps
when I started in IT, when we had the Commodore 64,
and my huge record collection, which just fit in three BASIC listings.
And now I pick up my phone and I
have Spotify, with all possible music stored somewhere, so miniaturized. If you see
that that only happened in the last 30 years, I'm very positive and hopeful and excited
to see where technology will bring us in the next 10 years. You can see already the evolution from big data 10 years ago
to where we are now with the evolution of machine learning
and artificial intelligence.
So it will keep on going,
and evolving at an even higher speed.
So I heard you say, you know, science and medicine.
Do you think that ML is just going to be sort of a standard component of everything in the data center?
Or do you think that that's not going to happen?
I see that happening more and more.
I mean, we kind of weeded out which algorithms are best for which type of problem. You have already the AutoML solutions as well,
where you just run your model against various libraries
and see what is the best outcome
and combine various algorithms together.
So that's what I see.
You see that in various autonomous databases already,
where that is being used.
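The AutoML pattern Yves mentions, running the same data through several algorithms and keeping the best performer, reduces at its simplest to a cross-validated bake-off. A minimal scikit-learn sketch (a real AutoML system would also search hyperparameters and build ensembles):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate algorithms to run the data through.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score each candidate with 5-fold cross-validation and keep the best.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)

for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:>20}: {s:.3f}")
print(f"best model: {best}")
```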
You talked as well about
having AI on the chipset itself.
So I think, yeah, if we can make these algorithms perform
much better and much faster,
definitely they will go onto the chipset
and make even the hardware smarter,
not only the data models in themselves.
That's, you know, it is an optimistic future, but it seems to be the direction that many of
these companies are hoping to take it. I mean, certainly, again, you know, back to the episode
with Matt Bryson, we talked about, we probably name dropped 20 companies that are working on
putting AI literally everywhere in the IT stack. So we'll see if that happens.
Thank you so much for joining us today, Yves.
It's always great to catch up with you.
And I really do look forward to hearing, you know,
sort of the data architect perspective on machine learning.
Where can people connect with you to follow your thoughts
on enterprise AI, data science, and other topics?
Stephen, they can follow me on my blog, 7wdata.be,
where we curate the hottest trends on enterprise AI, machine learning,
everything related to the data field as such.
Or they can follow my Twitter handles, @YvesMulkers or @7wdata.
Or they can connect with me on LinkedIn.
Great. And thank you very much.
If you'd like to connect with me,
you can find me on Twitter at sfoskett.
And I'd love to hear what you think about this podcast.
Thanks for listening to Utilizing AI.
If you enjoyed this discussion,
remember to subscribe, rate, and review the show on iTunes
since that really does help our visibility.
This podcast is brought to you by gestaltit.com,
your home for IT coverage across the enterprise.
For show notes and more episodes, go by gestaltit.com, your home for IT coverage across the enterprise. For show notes and more episodes,
go to utilizing-ai.com,
or you can find us on Twitter at utilizing underscore AI.
Thanks, and we'll see you next time.