Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 2x24: MLOps Is About Quality Not Technology with Steph Locke

Episode Date: June 15, 2021

MLOps is similar to DevOps but focused on ML, aiming to improve the quality of delivery for artificial intelligence applications. In this episode, Stephen Foskett discusses MLOps with Steph Locke, CEO of Nightingale HQ. DevOps is very much a cultural shift for software development, while MLOps in practice tends to be more of a team sport, bringing together software developers, data scientists, machine learning experts, and IT infrastructure and operations. Another benefit of MLOps is the improvement in efficiency that results from having all these diverse groups collaborate on application development and deployment.

Three Questions

How long will it take for a conversational AI to pass the Turing test and fool an average person?
Are there any jobs that will be completely eliminated by AI in the next five years?
How big can ML models get? Will today's hundred-billion parameter models look small tomorrow, or have we reached the limit?

Guests and Hosts

Steph Locke, Data Scientist and CEO of Nightingale HQ. Connect with Steph on LinkedIn or on Twitter @TheStephLocke.
Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.

Date: 6/15/2021
Tags: @SFoskett, @TheStephLocke, @NightingaleHQAI

Transcript
Starting point is 00:00:00 Welcome to Utilizing AI, the podcast about enterprise applications for machine learning, deep learning, and other artificial intelligence topics. Each episode brings experts in enterprise infrastructure together to discuss applications of AI in today's data center. Today, we're discussing the topic of MLOps and the fact that MLOps really is all about the operations and not so much about the technology. So first, let's meet our guest, Steph Locke. Hi Stephen, I'm Steph. I run Nightingale HQ. We help manufacturers adopt artificial intelligence, and my background is as a data scientist, across business intelligence, kind of the more basic end, through to ML engineering.
Starting point is 00:00:53 So through that, I've done a significant amount of work on how we operationalize machine learning and build out effective data science teams to be able to deliver quality products more effectively. And that has kind of helped me bring more of this into the technical community. So I've been speaking, organizing conferences, and developing user groups and even global conference brands around data science and getting things into production for more than a decade now. Glad to be here and to talk about MLOps. Well, thank you so much. It's great to have you. One of the things you didn't mention
Starting point is 00:01:39 was that you were a guest, one of the delegates at our recent AI Field Day event, and it was great to have you there. I'm Stephen Foskett, organizer of Tech Field Day and publisher of Gestalt IT. You can find me on most social media networks at sfoskett, and also you can find our weekly news rundown every Wednesday at gestaltit.com. So, Steph, one of the big topics in AI and ML is MLOps. And, of course, I think a lot of people sort of intuitively realize that there's some relationship between the whole DevOps movement and MLOps. And maybe they assume that MLOps is just DevOps for ML. I wonder if you can start off by just giving us a little bit of an overview. What is MLOps to
Starting point is 00:02:26 you? So machine learning ops, or operations, is kind of DevOps, but focused in on ML. So DevOps really is the process of improving how we deliver quality code and maintain that out in the real world. It's about being effective and doing more as an engineering team than you would otherwise, and having real quality in there. And machine learning operations is pretty similar, but there's a unique set of challenges involved, because it's not like building software where, you know, you compile your code pretty quickly, you drop some files, and you put it in. So the machine learning operations side of things really tries to focus in on the challenges
Starting point is 00:03:16 of a machine learning community and how we can integrate that with some of the broader DevOps pieces, the classics of continuous integration and deployment being vital among them. So is it fair to say that MLOps is just DevOps for ML, or are there some aspects of it that are unique and different? I think they are different. So DevOps, as it was originally conceived, is much more of a broader structural and cultural piece. It's not so much a process piece, whereas machine learning operations tends to be a bit more on the practical side, around how we solve some of the engineering challenges of the quality, compliance, and speed of delivery into production. So it's kind of really the DevOps movement distilled into the specific domain of machine learning.
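Editor's note: the continuous integration piece Steph mentions can be made concrete with a small quality gate that a CI job runs before a candidate model is allowed to progress. This is a minimal sketch, not anything prescribed in the episode; the dataset, model, and 0.90 threshold are all illustrative assumptions.

```python
# Hypothetical CI quality gate: train a candidate model and fail the build
# if it does not clear an agreed accuracy threshold.
import sys

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

ACCURACY_THRESHOLD = 0.90  # agreed with the business, checked on every commit


def main() -> int:
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = LogisticRegression(max_iter=5000)
    model.fit(X_train, y_train)

    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"candidate model accuracy: {accuracy:.3f}")

    # A non-zero exit code fails the CI job and blocks deployment.
    return 0 if accuracy >= ACCURACY_THRESHOLD else 1


if __name__ == "__main__":
    sys.exit(main())
```

In a real pipeline the gate would also cover data validation and a comparison against the currently deployed model, but the exit-code pattern captures the quality focus being discussed here.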
Starting point is 00:04:20 And that aspect is important, because one of the things that I've noticed about machine learning application development that's quite different from regular software development is that it tends to include software developers and software engineers, but of course it also includes data scientists and, in many cases, machine learning and artificial intelligence experts, which brings more people to the table. So whereas DevOps tends to very much be focused on developers and the needs of developers, rolling out code into production, and platforms and so on, I feel like MLOps is kind of a bigger table with more chairs at it. And does that really change things? So depending on who you ask, they'll say DevOps is for everyone. That doesn't really typically translate from that big cultural piece into effective cross-functional teams in many places, so you still see things like DataOps as well, for helping
Starting point is 00:05:13 data-specific people think about and apply DevOps principles in their roles and capabilities, and it's kind of the same with machine learning. You're totally right that it is a team sport. There are not many data scientists and machine learning engineers who can do everything from the database design and infrastructure work to get something running, to creating a feature store, to building that model, to turning it into an API that gets monitored and has authentication and logging, and build that end-to-end solution.
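Editor's note: that API hand-off is often where data scientists and software engineers meet, so a minimal sketch of what the wrapper can look like may help. This assumes a scikit-learn model serialized with joblib and a hypothetical API-key scheme; the file names and header are illustrative, not anything specified in the episode.

```python
# Hypothetical serving wrapper around a trained model: logging, a simple
# API-key check, and a prediction endpoint.
import logging
from typing import List

import joblib
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model-api")

API_KEY = "change-me"                 # in practice, injected as a secret
model = joblib.load("model.joblib")   # artifact produced by the training pipeline

app = FastAPI()


class PredictRequest(BaseModel):
    features: List[float]


@app.post("/predict")
def predict(request: PredictRequest, x_api_key: str = Header(default="")):
    if x_api_key != API_KEY:
        raise HTTPException(status_code=401, detail="invalid API key")
    prediction = model.predict([request.features])[0]
    logger.info("prediction served: %s", prediction)  # feeds monitoring and alerting
    return {"prediction": float(prediction)}
```

Run with an ASGI server such as uvicorn; the point is simply that authentication and logging live next to the model rather than being bolted on later by a different team.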
Starting point is 00:05:46 It really is much more of that kind of cross-functional capability, and as such, getting that process and some of those underpinning technologies of MLOps right really helps reduce the frictions and misunderstandings that can come from four or five different people sitting around the table with their own jargon, own ways of doing things, and own key goals. And for me, with a background in IT infrastructure, I found that DevOps was actually a really compelling and exciting opportunity because, well, for one thing, a lot of infrastructure engineers really understand that they are missing out on the bigger picture by only focusing on the platforms and the infrastructure that are being used and not really seeing what it's being used for. I remember when I was a systems administrator years ago at some big companies, banks and oil companies and so on, I remember feeling a real sense of frustration that I could put my hands on a server, a storage device, or a network device, and I knew that it was roughly being used by
Starting point is 00:06:57 this particular group. Oh, this is downstream operations, you know, but I had no idea what application was running on it. I didn't even know if it was test or production, because frankly, we had the same SLAs for all of them. And I felt like I was missing out and could have contributed more. And I think one area of DevOps that gets overlooked is the fact that it can benefit the IT infrastructure people, not just the developers. And I think that what you're saying as well is that MLOps can benefit everyone who's coming to that table by bringing all these different backgrounds together. Is that right? Definitely. It turns it into not how do I do my bit, but how do we, as a pipeline and a process and a group of people, make things happen that add value to the business. So no individual becomes a cost center. Everyone's adding value. So having everybody being able to
Starting point is 00:07:54 contribute, very transparently, to the delivery of something that's going to add value to the business, whether it reduces your overheads by optimizing your real estate or helps customers find the right thing more effectively. You're adding value. And by being part of that pipeline and going, here's the bit I deliver, that really helps turn it into a very different conversation than I'm keeping the lights on. Another thing that I've noticed when it comes to AI applications is that you're using additional types of hardware as well. And that has been a big challenge for a lot of IT ops people, because frankly, they're unfamiliar with a lot of these hardware paradigms.
Starting point is 00:08:37 So if you're looking at, for example, an NVIDIA EGX system, you know, it looks like a server to a server guy, but then you're kind of looking under the hood and you're like, what is that thing exactly? What are the unique elements here? And I think in a way that shows how the hardware reflects the people, you know, because where DevOps is focused predominantly on the traditional pillars of server, storage, and network, with MLOps you have to factor in different kinds of special application processing units and so on. And that, I think, can be a little off-putting to IT operations. But is the hardware aspect, the infrastructure aspect, off-putting to data scientists and software developers as well? Oh, definitely. I think of myself as a software gal. I've put together my
Starting point is 00:09:33 own machine, my own desktop, and got all the GPUs and everything in. But I still remember doing a little bit of IT hardware sales and having to know about things like front-side bus speeds and different versions of PCIe and things like that. It's actually a very difficult field, and networking and infrastructure are becoming increasingly vital to large-scale distributed machine learning. So it was really great at AI Field Day to be able to see some of the organizations who are helping change how infrastructure, servers, and things like that get put together to support these increasing demands from data scientists. And lots of us can overcome our weaknesses in knowledge areas to be better and know more, but
Starting point is 00:10:29 having the right people in the room and helping each other talk to each other about the same challenge from their different perspectives can be a much stronger proposition. So have the IT infrastructure person who understands these racks of GPUs that can now be switched between all the different virtual machines for the distributed training, whilst the machine learning engineer says, I need to be able to do this kind of process over this amount of time, and here are the performance requirements I need, and have them work collaboratively on a solution. And that leads me to another aspect of DevOps that I think has been really beneficial. And that is, by creating a communications channel between the different parties involved, it has allowed people to be more efficient in terms of spending, basically. Whereas once there was a hard wall and it was just sort of, no, we need more infrastructure, just go get it.
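Editor's note: the requirements conversation Steph describes above is easier when it is written down in a form both sides can review. This is a purely hypothetical sketch of such a spec; the fields and numbers are invented for illustration and are not from the episode.

```python
# Hypothetical training-job spec that an ML engineer and an infrastructure
# engineer could review together. All fields and values are illustrative.
from dataclasses import dataclass


@dataclass
class TrainingJobSpec:
    job_name: str
    gpus: int                 # accelerators requested
    gpu_memory_gb: int        # per-GPU memory needed for the chosen batch size
    max_runtime_hours: float  # the "amount of time" budget
    dataset_size_gb: int      # drives storage and network throughput needs
    preferred_location: str   # "on-prem" or "cloud": the CapEx/OpEx trade-off


demand_forecast_job = TrainingJobSpec(
    job_name="demand-forecast-retrain",
    gpus=4,
    gpu_memory_gb=24,
    max_runtime_hours=6.0,
    dataset_size_gb=800,
    preferred_location="on-prem",
)

print(demand_forecast_job)
```

Whether this lives in a ticket, a YAML file, or a scheduler's job definition matters less than the fact that both perspectives can see and challenge the same numbers.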
Starting point is 00:11:31 Now, I think that there's a bit of a communications channel there, and people realize that this is still ongoing, but perhaps you need to improve efficiency. You need to start thinking about the impacts of the choices you're making. I know that this is a big thing in cloud computing as well, where it's very, very easy to overspend. And in machine learning, I think that's also a really key aspect because, frankly, inferencing and processing capacity are precious. And having enough GPU to do the job is not a question of
Starting point is 00:12:09 overspending as much as availability. You have to basically have communication between the teams in order to help the machine learning folks realize sort of what they've got to work with. It's not a question of, okay, what do you need? We'll give you anything. It's very much a question of, what are the boundaries in which we have to work, and how can we build our model to work effectively within those boundaries? I think that's a very important development, right? Yeah. And it's so critical right now, at a time when everybody is scrambling for GPUs because of the boom in cryptocurrency, that if we want to be able to do deep learning effectively within an organization, we do need to have that hybrid on-premises and cloud strategy around where we're going to acquire
Starting point is 00:13:09 hardware and treat it as a capital expenditure, and when we're going to put it onto the cloud and make it an operational expenditure, and how we make that trade-off. And that is totally something that is a collaborative effort, because as you say, you can't just chuck more RAM at it. You can't put another memory stick in GPUs; they're much more complicated, it's much more of a difficult proposition to put those into servers and retrofit them, and they're getting increasingly expensive. In particular, we're seeing some of the things like the Intel Movidius GPU-style USB sticks, where you can make something like a Raspberry Pi cluster with a load of GPU sticks in it, at a much lower total cost of ownership than a kind of traditional appliance-style server, and be
Starting point is 00:14:06 able to achieve the goal. So there are some areas where knowing about hardware and having the right people in the room can really give you an alternative solution that increases your speed to get that kind of long-term capability without needing to really chuck tons and tons of cash at it. Well, another aspect as well is that machine learning applications, as we've seen again and again, both here on the podcast as well as at the Field Day and in the MLOps community and so on, are not just being rolled out in data center environments. These are being rolled out at the edge, even in IoT and industrial situations and settings. And this, of course, presents another challenge to the whole mindset, because, not to cast
Starting point is 00:14:55 stones too hard, but, you know, data scientists can get a little ivory tower. And when you have to think about, well, my data is going to be living and working and existing out there in the real world, it's very, very important to be able to have that communication of how is this really going to be rolled out? What is it really going to look like in practice? And how is that going to impact the decisions that I'm making here in my development? Yeah, the embedded systems and IoT is such a huge way that we can add value as data scientists and also be much more privacy conscious as well. So being able to do things at the edge means that we can start, you know, really keeping things like COVID social distance monitoring details right in the workplace and not disclose that information elsewhere, for instance. So it is a critical capability and it's where actually machine learning ops and the whole kind of
Starting point is 00:16:05 robust pipeline comes in, because data scientists, you're right, are mostly hired from an academic background rather than an industry background, and they do not have a lot of software engineering skills, and they don't necessarily know about the differences in the architectures that their models are going to be deployed on. So being able to build that robust pipeline, where the data scientist builds something effective on maybe a big training set with a complex model, and then the pipeline compiles that down into the right format to go onto the Android device or onto the newest
Starting point is 00:16:49 CCTV kit with a minor GPU on it, and then tests it for performance, stress tests it, and automatically deploys it, is going to be the kind of thing that more and more data scientists are going to need to do. And it's also where a lot of infrastructure folks are going to have to upskill their knowledge. You know, we're putting embedded systems in so many more things. It's now no longer expected to just be the province of, you know, the beardy guys who know C and can kind of write practically at assembly level; more and more things need to be deployed, and we need to be able to do that within the broader IT community.
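Editor's note: the "compiles that down into the right format" step Steph describes is typically a model-conversion stage in the pipeline. The sketch below uses TensorFlow Lite as one example of such a target format; the episode does not name a specific toolchain, and the paths are illustrative.

```python
# Hypothetical pipeline step: convert a trained TensorFlow SavedModel into a
# TensorFlow Lite artifact suitable for an Android device or other edge
# hardware. Input and output paths are illustrative.
import tensorflow as tf

SAVED_MODEL_DIR = "models/defect_detector"       # produced by the training stage
TFLITE_PATH = "artifacts/defect_detector.tflite"

converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable post-training quantization
tflite_model = converter.convert()

with open(TFLITE_PATH, "wb") as f:
    f.write(tflite_model)

print(f"wrote {len(tflite_model)} bytes to {TFLITE_PATH}")
```

In the kind of pipeline described above, the performance and stress tests would then run against this converted artifact, not the original training-time model, before anything is deployed automatically.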
Starting point is 00:17:51 So that brings me to another thought. When it comes to MLOps, certainly we're bringing more people to this table. This metaphorical table includes more chairs. But are there areas that are not yet represented at this table that ought to be? So, for example, one of the things that I can think of is that in a lot of these, MLOps, DevOps, AIOps, DataOps, all of this, I don't hear anybody talking about the business application users. I suppose that their needs are represented by the product managers or the software developers, but is that really true? I'm a firm believer that we can't build something if we don't have the right domain knowledge for the area in which we're going to build it. So having the subject matter experts from the respective departments where we're building that value is critical, at least to me. I also think that one of the areas that we're quite underrepresented in,
Starting point is 00:18:40 and it does keep me up at night kind of thing, is that we do not have enough discussions around security right now, around machine learning. Data scientists, coming back to that kind of academic stereotype: production systems, software engineering systems, are a far cry from how they do things. And even the classics like SQL injection attacks: a data scientist makes an API, puts that into a microservices environment
Starting point is 00:19:13 on some Kubernetes cluster. How are they going to know to check for SQL injection attacks and things like that?
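Editor's note: the SQL injection worry is concrete, and it is exactly the kind of thing the code reviews Steph calls for just below would catch. A minimal illustration using Python's built-in sqlite3 module; the table, column, and values are invented for the example.

```python
# Illustrative only: why string-built SQL inside a model-serving API is risky,
# and the parameterized alternative a code review should insist on.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (customer_id TEXT, value REAL)")
conn.execute("INSERT INTO features VALUES ('c-001', 0.42)")

user_supplied_id = "c-001' OR '1'='1"   # hostile input a public API might receive

# Vulnerable: the input is pasted straight into the SQL text.
unsafe_query = f"SELECT value FROM features WHERE customer_id = '{user_supplied_id}'"
print("unsafe:", conn.execute(unsafe_query).fetchall())   # returns rows it should not

# Safer: the driver binds the value as data, never as SQL.
safe_rows = conn.execute(
    "SELECT value FROM features WHERE customer_id = ?", (user_supplied_id,)
).fetchall()
print("safe:", safe_rows)   # returns nothing for the hostile input
```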
Starting point is 00:19:54 So we really need to have more value added across that end-to-end stream, where we have subject matter experts opining about, like, what features do they know are important? What quirks are there in the data? What processes generate it that mean you have to be a little bit careful? What are the hidden assumptions that we have when we build this model? And then compliance and security and software engineering senior leaders who can help perform code reviews and improve the rigor and quality that the software gets built at. It's all literally taking a village. So, you know, in summary, I think that one of the things that strikes me again about DevOps generally, and MLOps specifically, is that it really is a hopeful message. It means that we can start developing and deploying applications
Starting point is 00:20:18 in a way that is more all-encompassing, more likely to succeed. And I think that that is a real hopeful development. Are there aspects that are less hopeful, more concerning to you? Are there things that you wish that were happening that are not? One of the things coming from a more data engineering
Starting point is 00:20:42 business intelligence background is that I still see a lot of people trying to reinvent the wheel on the data side. It's still one of the hardest parts of the data science process: getting the right data there to build a model, and then making that right data available at the point of inference.
Starting point is 00:21:04 And you see people doing things like source controlling multi-terabyte data sets and things like that. And there's a movement on being able to register data sets, kind of manage versions and things like that. But it's still a big area of concern for a lot of the people involved, because if we get the wrong data, we make the wrong conclusions. If we don't store the data appropriately, we get data breaches. And when we store the data, if we just keep accumulating it, it's an infrastructure challenge. And we need to get that sorted.
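Editor's note: the "register data sets, manage versions" idea Steph mentions can be as simple as committing a small manifest of content hashes and metadata instead of the terabytes themselves. This is a tool-agnostic sketch; the manifest layout is invented for illustration, and dedicated data-versioning tools take the same idea much further.

```python
# Hypothetical dataset registration: record a content hash and metadata in a
# small JSON manifest that is source controlled, while the data itself stays
# in bulk storage. The manifest layout is invented for illustration.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def register_dataset(data_path: str, manifest_path: str = "datasets.json") -> dict:
    digest = hashlib.sha256()
    with open(data_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)

    entry = {
        "name": Path(data_path).name,
        "sha256": digest.hexdigest(),
        "size_bytes": Path(data_path).stat().st_size,
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }

    manifest = json.loads(Path(manifest_path).read_text()) if Path(manifest_path).exists() else []
    manifest.append(entry)
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return entry


# Example (hypothetical file): register_dataset("data/sensor_readings_2021_06.parquet")
```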
Starting point is 00:22:00 When we're talking about it from the angle of interacting with the software, we need to have that kind of data lineage map across our entire software engineering estate, to be able to go, yes, we can use this field from this data, because yes, that is in the application that's going to use it. So the data really is a fundamental part of doing machine learning effectively at scale in an organization. And it is not yet anywhere near a solved problem. And lots of us should be working together on it. In a way, it sounds a little bit pie in the sky. I mean, this is what we've been working toward throughout the history of software development. And I think that, unfortunately, there's also been a history of not taking care of these areas.
Starting point is 00:22:38 But, you know, again, from my perspective, I feel like this is really a hopeful and positive movement because frankly, if you just dismiss it and say, oh, that's never going to happen, well, then it's certainly never going to happen. But if you are focused on it, if you recognize that you need to start getting these constituencies together to address these challenges, there's a much greater chance that you'll actually be able to achieve that. So I really feel that this is a, like I said, a hopeful message, a positive message, something that can really help to improve
Starting point is 00:23:10 the quality, the efficiency, the time to deploy ML applications. What's your summary of this question of MLOps? What do you think the take-home message is? Data scientists, machine learning engineers are scarce resources. We need to put the right people, processes, and tools in place around data scientists to be able to deliver as much value out of these people as possible. So that's taking lessons from every other field that we possibly can on being effective and applying that to the unique challenges of the data scientist and machine learning engineer. And MLOps is a great way for us to keep that focus on how do we be better tomorrow than we were today at doing data science at scale.
Starting point is 00:24:09 Well, thank you very much. I think that that's absolutely true. And I think that folks like yourself are definitely part of the solution, if not, you know, solving the problem already. So thank you for being there and for helping to explain the world of MLOps to us. As we usually end our podcast, it's time for the three questions. You ready for this? So to new listeners, Steph has not been prepared for these. She doesn't know what the questions are, but it's supposed to be fun. And we'll look forward to some amusing, interesting, and perhaps thought-provoking answers to our open-ended questions. So we did have you on the podcast before where you answered three different questions,
Starting point is 00:24:47 so I picked three other ones. So let's see what we get here. First, I'm going to ask you, speaking of jobs and job roles and so on, can you think of any job roles, any jobs that people have today, that are going to be completely eliminated by AI in five years? I'd almost argue the data scientist is going to be obsolete. So as we said, right, data scientists are a scarce resource.
Starting point is 00:25:18 It's a complicated set of processes at the moment, from data to building models to getting them into production. We're democratizing so much of that capability, and working on automating it, that longer term more people will be data and ML savvy and able to do the job even if they're not classically trained data scientists, and we'll be able to build much more with a smaller amount of dedicated resource for this. Everybody should be data engineers.
Starting point is 00:25:55 All right, question number two. When do you think we will have a verbal conversational AI that can pass the Turing test and fool the average person into thinking they're talking to somebody else? Probably not long now. So GPT-3 is pretty effective at generating great text that is compelling, and we're seeing neural machine learning models being trained to produce ever more human-like voices. So not only will we get the classic talking-to-a-chatbot type of Turing test pass, but also engaging with somebody as a person. We're seeing a lot of that with some of the pieces around developing AI influencers on the internet and building social media AI celebrities. Yeah, but of course, we've also seen a lot of these
Starting point is 00:26:56 projects that haven't happened yet and that we're still waiting for success. So we'll see when it happens. Finally, how big do you think ML models can get? So today we have 100 billion parameter models. Will that look small tomorrow or have we reached some kind of limit? I think there's a practical limit to this. So already with GPT-3 and the size of it, there's only a few companies in the world
Starting point is 00:27:23 who are able to operationalize that level of compute power and make it available. So OpenAI exclusively licensed commercial applications of the underlying model to Microsoft because there are so few businesses that can do it. We might see increased gains more broadly. More and more people are involved, we're getting more data, and we're starting to get the compute power to be able to train broader models. But that comes with a huge price and trade-off for, to a certain extent, very little gain in accuracy. So most business
Starting point is 00:28:07 decision makers won't say, I'm going to give you an extra $100,000 of GPU to be able to get me this extra capability, because they're only going to make $20,000 a year extra if they improve on that. So the marginal benefits of bigger and bigger models won't be cost effective for most organizations. One thing I am interested in is how we can start using quantum computing for machine learning and how we can start using it at minimum around things like simulations to be able to greatly increase the sophistication of our models through using some quantum indeterminacy. Wow, unfortunately this is the three questions part so I don't have time to ask you to get into that but maybe we can have you come back and talk about quantum computing in the future. I will have to recommend somebody because this is a field I'm just fangirling right now rather than being anywhere near a leader in,
Starting point is 00:29:16 but there are some amazing people working on helping bring quantum computing to most businesses right now. Well, thank you so much for joining us today. Where can people connect with you and follow your thoughts on enterprise AI and other topics? And is there something you're particularly proud of that you've done recently? So I'm on Twitter as @TheStephLocke. I'm on LinkedIn, and I love to follow and engage with people. And on June 22nd, in association with Quest, I'm delivering a webinar on kind of this topic of there being too many something-ops.
Starting point is 00:29:53 So DevOps, DataOps, MLOps, and how this fits into the concept of developer velocity, which is how we can be more effective at delivering value more broadly inside the organization to deliver top-line business results. Well, thank you very much. I look forward to that and to following you and continuing this conversation. As for me, you can find me on Twitter at SFoskett. You can also find me, as I mentioned, here on the Utilizing AI podcast, as well as on our On-Premise IT Roundtable podcast, which I promise we're using premise correctly in that
Starting point is 00:30:30 sentence. And of course, every Wednesday at the Gestalt IT News Rundown at gestaltit.com. So thank you for joining us for the Utilizing AI podcast. If you enjoyed this discussion, remember to subscribe, rate, and review since that really does help, especially if you're doing it on iTunes. And please do share this show with your friends and mention it in other areas of the MLOps community. This podcast is brought to you by gestaltit.com,
Starting point is 00:30:57 your home for IT coverage from across the enterprise. For show notes and more episodes, go to utilizing-ai.com or find us on Twitter at utilizing underscore AI. Thanks, and we'll see you next week.
