Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 2x27: How the ML Community Has Evolved in 2021 with Demetrios Brinkmann and David Aponte

Episode Date: July 6, 2021

The MLOps community has grown dramatically recently, with security, a data-centric approach, ethical implications, and a growing and diverse community rising in 2021. In this episode, MLOps Community managers Demetrios Brinkmann and David Aponte join Steph Locke and Stephen Foskett to discuss what has changed over the last year. It seems that a new ML company is launching every week, and the MLOps Community provides a great way to learn about these. We are also seeing a push and pull between open source and cloud platforms, and concern about lock-in and technical debt. Data science and machine learning are merging, with a greater focus on data quality and quantity when training models.

Three Questions
1. Is MLOps a lasting trend or just a step on the way to ML and DevOps becoming normal?
2. Can you think of an application for ML that has not yet been rolled out but will make a major impact in the future?
3. How big can ML models get? Will today's hundred-billion parameter models look small tomorrow, or have we reached the limit?

Companies Mentioned: Microsoft, Tecton, Scale AI

Links: Sarah Williams talk, D. Sculley interview

Guests and Hosts
Demetrios Brinkmann, Community Coordinator at MLOps Community. Connect with Demetrios on LinkedIn, on Twitter at @DPBrinkm, or at mlops.community.
David Aponte, Community Coordinator at MLOps Community. Connect with David on LinkedIn.
Steph Locke, Data Scientist and CEO of Nightingale HQ. Connect with Steph on LinkedIn or on Twitter @TheStephLocke.
Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.

Tags: @SFoskett, @TheStephLocke, @DPBrinkm, @MLOpsCommunity

Transcript
Starting point is 00:00:00 Welcome to Utilizing AI, the podcast about enterprise applications for machine learning, deep learning, and other artificial intelligence topics. Each episode brings in experts in enterprise infrastructure to discuss applications of AI in today's data center. Today, we're discussing the changes that we've seen over the last year in MLOps and machine
Starting point is 00:00:26 learning and the artificial intelligence community. First, let's meet our guests, Demetrios Brinkmann and David Aponte. Hello, everyone. I'm Demetrios Brinkmann. I'm one of the main organizers of the MLOps community, which is a community that has about 5,000 people in it now. We're running 5,000 strong and we love to talk about everything MLOps. So if you are interested and you enjoy MLOps, there's probably something in there for you. And my name is David Aponte. I work as a software engineer at Microsoft and a bit of a data engineer and ML engineer working in the MLOps space. I'm also one of the organizers for the MLOps community. Please feel free to reach out to me on LinkedIn. We're so happy to be here.
Starting point is 00:01:15 I'm Steph Locke, CEO of Nightingale HQ. We help manufacturers adopt artificial intelligence, from putting in things like invoice processing through to custom defect detection models. So MLOps is a big part of how I try to help manufacturers. And I'm Stephen Foskett, publisher of Gestalt IT and organizer of Tech Field Day, including the AI Field Day, which just happened a few weeks ago. So the reason that we wanted Demetrios and David back on here is because, frankly, the MLOps community is the primary community for machine learning and AI practitioners that I know of. It's a wonderful, amazing, vibrant community. And frankly, it's pretty huge. There's a ton of people on there. And I think maybe that took you by surprise, Demetrios. I don't know if you ever expected that. But that gives you two a chance to really have your finger on the pulse of the changes that are happening in AI and ML. And since that's kind of the topic here, I wonder if you guys can help us.
Starting point is 00:02:31 Demetrios, what have you seen over the last year overall, sort of like meta trends in MLOps and AI? So first of all, thanks for having us back. It's an honor to be invited back for what I think is our third time, so it's really cool, and also a huge honor to be at the AI Field Day. That was a blast. I got a bunch of cool schwag to show off to all my friends and family around here. But anyway, when it comes to MLOps and what we've seen, I think David will be able to talk more in depth about this. But as for me, here is what I notice, and what I try to actively bring more of to the community by having meetups and podcasts around these different subjects. One thing that's huge is security. It's not just enough to put your model
Starting point is 00:03:26 into production now and then monitor it. It's also, how are you going to securely do that? And what kind of threats are there when it comes to machine learning and having a machine learning model out there? So that is a huge one. And then the other one is being more data-centric. You probably have seen Andrew Ng talk about this, and it's becoming more of a thing day by day. The more that he talks about it and other people talk about it, the bigger it becomes. And by data-centric, we're talking about making sure that the data you're collecting is the right data,
Starting point is 00:04:09 making sure that it's being cleaned properly. And we're putting out a whole podcast series on the data layer: making sure that you have data access, or that the right people have the right access to the right data, and not just everyone has access to whatever data they want. You can have data poisoning problems. As I said, collecting the right data, but data collection goes so much further beyond that, because if you're going out and you're collecting data, you want to make sure that you know what you're going to use that data for before you just go and mindlessly collect everything that you can, which has kind of been the default up until now. We also recently talked with the infamous D. Sculley, of the ML Test Score and the high-interest credit card of technical debt of machine learning,
Starting point is 00:05:08 those two papers, which are pretty, pretty well received within the MLOps community and are kind of standard reading. And we asked him what's changed over the past five years that he's been around this space. He wrote that first paper five years ago, or almost six years ago. And what he said was, well, in that first paper, the high-interest credit card of technical debt of machine learning, there was only one paragraph about ethics. And now it's blown up into its own thing. It's a huge piece of MLOps and just machine learning in general. And so that is something else that I want to point out: ethics is huge. I have another podcast on that because I love it so much. And I think it is
Starting point is 00:05:57 so important that we talk about the different implications. That's definitely something that's come up again and again on this podcast as well: the ethical implications of the choices that you're making when you're creating machine learning models, and the data that you've selected to train them on, and so on. How about you, David? What have you seen change over the last year in ML? Yeah, I think just exactly what Demetrios said. The only other thing I would add is community. I think the fact that we now have a whole community
Starting point is 00:06:31 dedicated to a kind of specialized subfield within machine learning is already a huge change. And D. Sculley spoke about this as well. There wasn't a community to talk about some of these things. There wasn't a place to go and ask questions. Now we have a channel, a very active channel called MLOps Questions Answered. If you want to know what people are dealing with, what challenges people are facing, and what they are doing to solve
Starting point is 00:07:01 those challenges, that's the place to go. You'll see lots of questions around finding an abstraction between the model and the data, hence a lot of feature store chat, security, compliance. Some people are trying to scale out. Some people are just starting. And so one of the shifts that I've seen, like Demetrios is getting at as well, is, just like he said, a focus on the data.
Starting point is 00:07:26 And this is not anything new. This has been around for some time. And so, it's funny, and not to say anything bad about Andrew Ng, I always mispronounce his name, but people who have been working in the ML space have known for a long time that it's always been about the data. But I guess because the focus was on the serving side, the model side, you kind of never talk about that. But it doesn't mean that practitioners don't think that's important. And if you really look at a lot of the day-to-day work of a machine learning engineer, I would argue that data is a big part of that. Making use out of that, cleaning that, generating features from that. That's what you really spend most of your time doing. So anyways, just wanted to say that there is a big shift in the focus to things like data, data meshes, new patterns of organizing that data, which is really interesting.
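To make that day-to-day data work concrete, here is a minimal sketch of the kind of programmatic quality checks a practitioner might run before training, written in Python with pandas. The column names, file path, and thresholds are hypothetical, purely for illustration.

```python
import pandas as pd

def validate_training_data(df: pd.DataFrame) -> list[str]:
    """Run basic quality checks on a DataFrame before it is used for training."""
    problems = []
    # Completeness: flag columns with too many missing values.
    for col in df.columns:
        null_rate = df[col].isna().mean()
        if null_rate > 0.05:  # hypothetical 5% tolerance
            problems.append(f"{col}: {null_rate:.1%} missing")
    # Uniqueness: duplicated rows can silently skew training.
    dup_count = int(df.duplicated().sum())
    if dup_count:
        problems.append(f"{dup_count} duplicate rows")
    # Range check on a hypothetical numeric feature.
    if "age" in df.columns and not df["age"].between(0, 120).all():
        problems.append("age: values outside the expected [0, 120] range")
    return problems

# Fail fast in a pipeline step rather than training on bad data.
issues = validate_training_data(pd.read_parquet("features.parquet"))
if issues:
    raise ValueError("Data quality checks failed: " + "; ".join(issues))
```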
Starting point is 00:08:13 But yeah, generally speaking, I think it's just those two things. There's a community having a conversation about common problems. And now that community is very focused on data, not just the model part, not just the algorithm, but how do we actually get good quality data and make use out of that? And I'll jump in real fast. Another thing that has changed: it feels to me like there is a new company every other day
Starting point is 00:08:43 that is trying to do something in the MLOps space. And so of course, since we are the MLOps community and we're in the center of this microcosm, maybe I see it more because a lot more people are approaching us to talk to us or to see how they can better leverage the community for their cause. But that's something that's really interesting. The space is very, very hot. And I always joke about this with David and our other co-host Vishnu: I have a blog post pending that I really want to write, and it's called Please Don't Start an MLOps Company, because there are too many right now. Obviously, that's a joke; please go out and do it if
Starting point is 00:09:32 you feel like it is useful. But we see the same pattern happening over and over. Someone says, hey, at my last company we were having trouble with this, so we built this tool. And now I'm going to go out and create a company around that, because there are probably other people that are having trouble with that same problem, and this tool can be useful for them. So what I've been seeing a lot of is these tools that are coming from a big organization or from a technology-first company, but they're trying to solve the problem that that company had
Starting point is 00:10:13 and then seeing if the market will take it and if others have that problem too. It's an excellent point. I mean, in any kind of emerging technology area, we see a really huge amount of fragmentation, of companies trying to home in on generalizable solutions to big problems, and then hopefully at some point we start seeing consolidation. You know, you'll see those big MLOps vendor infographics, and it's just overwhelming. Are there kind of key players in the market who you think are gaining significant
Starting point is 00:10:58 traction right now, consolidating things, or just generally making it easier for your MLOps community to build that more robust machine learning capability inside their business? I'll let David answer this one, because I'm very biased: A, there are a lot of great companies that are sponsoring the community, and B, there are some companies that I've been looking to invest in. And so I don't know if I can give you a straight answer on that, but David can. I don't know if I can either. One, you know, like I said, I work at Microsoft. We are a very large cloud provider interested in that space as well. So yes, it's going to be difficult, but I'm going to give the cheap answer and say the cloud providers. And the reason why I think that they are going to, I mean, it's just a common pattern that they've done in the past anyway.
Starting point is 00:11:53 So, for example, take Google: if there's a good open source solution, they'll have a software-as-a-service implementation of that where they take care of the infrastructure, they take care of the security aspects, they take care of the deployment. All you have to do is press a couple buttons and configure it to whatever it is you need, and they'll stand that up for you. An example of that has been Kubeflow Pipelines. There's a lot of tools that I've noticed these companies end up supporting as a part of their stack. Kafka is another one; Azure has a managed instance of that. And that's the trend.
Starting point is 00:12:30 And so my guess would be that it would also happen in the MLOps space, where these solutions eventually are just going to get simpler and simpler. I envision a low-code drag-and-drop environment for a lot of machine learning. I already see that, particularly within the Azure cloud provider. There are a lot of things that I'm using to this day that are actually drag-and-drop. And I think it's great. It makes my life a lot easier. And when the cloud provider takes care of the platform as a service, the VMs, the network, the compute, that makes things easier for me.
Starting point is 00:13:07 I can just focus on the application side of things, configuring it in a way that works to solve the specific problems that I'm working on, versus me handling the end-to-end stack, which I know a lot of machine learning engineers and data engineers have had to deal with, especially at smaller companies where you're essentially wearing all those hats. I don't see that as being a sustainable model. I think that over time, we're going to see more specialization, more focus in certain areas. I think that's going to enable people to be more innovative. But I also see that meaning that the big responsibilities are going to be offloaded to these service providers, these cloud service providers, so to speak, that are going to make it just a nice abstraction on top of all that,
Starting point is 00:13:50 something that you can easily integrate into what you're doing. The downside to that is that it's going to be very expensive. So because the cost is always going to be there, I see open source not going anywhere. There's going to be continuous innovation in the open source space, trying to find similar solutions, if not the same solutions, but with an open source implementation. So while there may be, let's say, an Azure service for a feature store one day, there will still, hopefully, be an open source implementation of that. We already see that model, for example, with Tecton, one of our sponsors, as the enterprise implementation, and then Feast,
Starting point is 00:14:23 their open source implementation. And I see that model being very successful, because it's attractive to get you started with something that's free. And then when you say, oh, I want more scale, I want more things, and we don't have the team to do that, we'll pay you for that. So I see that trend happening.
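For readers who haven't seen the enterprise-plus-open-source pattern David describes, this is roughly what defining features looks like in open-source Feast. It's a sketch modeled on Feast's own quickstart; the entity, file path, and feature names here are hypothetical, and exact class and argument names vary across Feast versions.

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# The entity is the business object the features describe.
driver = Entity(name="driver", join_keys=["driver_id"])

# Where the raw feature values live; here a local parquet file.
driver_stats = FileSource(
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
)

# A feature view groups related features and ties them to their source,
# so training and serving code can request them by name instead of
# re-deriving them from raw data.
driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=driver_stats,
)
```

The same definitions can then power both offline training queries and low-latency online lookups, which is the abstraction between the model and the data that the community keeps asking about.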
Starting point is 00:15:08 And that's just, again, my bias. I do work at a larger company, but even working at a smaller company previously, where we did everything from scratch, I found more pain points than benefits to doing everything yourself. And so my feeling is that we're going to offload that to people who are really good at it, to companies where that's what they do best. And we're just going to focus on what we do best, where our particular competitive advantage lies. And that's what I think companies should focus on, actually, because they can really excel in that area, versus trying to do that and be a cloud provider and develop new state-of-the-art tools and develop state-of-the-art machine learning applications. That just seems like too much to be successful at in the long term. I will add to that, too, that when it comes to the cloud providers, there are a few things that people we've interviewed have said that have been very useful for me in thinking about the space. One was Noah Gift. And he said that if you're getting started, like David was talking about, just go bet on the cloud provider, because it's
Starting point is 00:15:53 going to be there. It's going to give you 80% of what you need, and it's going to give it to you easily. The thing that I've seen in the community, though, is that once you get to a certain point and that 80% is no longer good enough, it is a real headache to try and get even a minimal gain. For every extra 1%, you're just going to have to work really hard to break out of the chains of what the cloud provider is trying to give you. And so we've seen this; there's an incredible post that we just saw in the community talking about the downsides, or the pains, of using SageMaker and how people have used it. And they just want to pull their hair out, because they want to get that other 20%. They want to unlock that. The 80% that SageMaker gives you
Starting point is 00:16:46 is not enough. And they want to get to that next percentage, but it's really hard to do because of all of the opinions that SageMaker takes. And so once you get to that point, I feel like there is a benefit to looking outside the box and starting to take on best-of-breed tools, rather than relying so heavily on, or just basically packaging up, everything from the cloud provider and saying, all right, I'm going to go with the whole SageMaker suite, or whatever it is on Azure or GCP. And like David was saying, one of the biggest complaints that people had was that in SageMaker it is like four times more expensive for the instances, even though it's the same instance, it's an AWS instance. And so why
Starting point is 00:17:41 should it be so much more expensive just because you're using SageMaker? And this was one thing that people love to talk about, but I'm not sure yet if it is really that big of a game changer for people. I think it's more in everyone's heads, but there it is: vendor lock-in. Nobody wants vendor lock-in, but then sometimes people just say, well, you know what? It's going to get us that 80% really quick, so I don't care; later on, we'll figure it out. And so the vendor lock-in is huge. Everyone has a big fear of it, but then they go behind the curtains and they do it anyway. So I don't know about the vendor lock-in. I'm still unsure how to take that.
Starting point is 00:18:25 But that is a big argument that you'll hear. I think that's just a great point. And I just want to add that that's a challenge in software engineering in general, with respect to finding the right set of abstractions. You want it to meet all the needs of your users, to solve all those problems, to have the features that they ask for.
Starting point is 00:18:44 But there's always going to be new things that you just cannot do. And like you said, it'll get you 80% of the way. And if part of your requirements is to have a POC quickly, to have something that will get up and running as fast as possible, and then you can iterate to make it something that's more configurable, more customized to the domain, I think that's often a wise strategy, because otherwise you may end up spending a year building something that isn't up to par with what you really need it to be. And maybe you don't have the right set of team members to do that. Building infrastructure is a very challenging thing to do, not to mention the machine learning side of things. So when you're trying to do all of that, I think you end up wasting more money, more talent, and more time.
Starting point is 00:19:32 It's very expensive to hire a machine learning engineer nowadays. And so you have to also consider that cost as well, the manpower. But if you're interested in investing in that space, if you're interested in the long term, where you want something of your own, maybe something that you can give back to the community, let's say through an open source implementation, then I think that is an important requirement. And then, like you said, to avoid that vendor lock-in, which is a real thing: you don't want to be too dependent upon someone external to what it is that you're working on. But there's that flip side of the coin, where if your competitive advantage is in something
Starting point is 00:20:08 completely different, let's say in the drug discovery space, for example, which I was working in last, then I don't know if it makes sense to invest a whole lot of money in being excellent in this other area as well. And I just find that it's going to be very difficult to do all of those as well as you could. Yeah, one of the things I found kind of the difference between SageMaker and the Azure Machine Learning solution is that the Azure ML approach to MLOps is basically that you add extra lines of code to log to the cloud what you're doing. So you can use your general framework and just kind of bolt Azure ML on, compared to AWS, where you have to do everything a little bit more closed. And that kind of approach actually helps people who are still doing things on their laptops and/or looking at doing things inside their own data centers.
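A minimal sketch of the "add a few lines to log" pattern Steph describes, using the open-source MLflow tracking API, which Azure ML workspaces can act as a backend for (Azure ML's native run logging follows the same shape). The hyperparameter, metric, and training function here are placeholders, not part of any real project.

```python
import random

import mlflow

def train_one_epoch() -> float:
    """Placeholder for a real training step; returns a fake validation score."""
    return random.uniform(0.7, 0.9)

# Inside an Azure ML job, the MLflow tracking URI is typically preconfigured;
# running locally, results simply land in a local ./mlruns store.
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)  # hypothetical hyperparameter
    for epoch in range(10):
        mlflow.log_metric("val_accuracy", train_one_epoch(), step=epoch)
```

The training loop itself stays framework-agnostic; only the logging calls tie it to the tracking service, which is the low-friction approach being contrasted with more all-in platforms.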
Starting point is 00:21:13 Are you seeing people taking kind of a cost-benefit approach to working on-prem with ML, or are most people in the cloud? I know I'm cloud by default. Yeah, I would say most people are going to the cloud. I think everything is going to the cloud. I think even the everyday things that consumers use are going to the cloud. I think there is still a need, though, to bridge that gap. So for example, again, I'm sorry to talk about Microsoft, but I know that Microsoft has something called Azure Arc, which allows you to integrate with your on-prem environment. And I know that the other cloud providers have that as well. Some clients just have such specific requirements that it doesn't make sense to have them on the cloud, and so I still think there's a need to support that. But the trend, in my opinion, from what I've seen, is going towards the cloud, as that's what most people are doing anyway. So it's kind of a hop-on-the-bandwagon type of thing. So one of the other things that you brought up, David, and I want
Starting point is 00:22:30 to try to make sure that we get to this before we run out of time here, is the whole datafication of ML, or the MLfication of data. And to me, especially with Steph here as well, because that's kind of your background too, it seems that, you know, I'm not sure what's happening here, but it seems like data science and ML are merging. Is there such a thing as data without ML, or ML without data? No, I think a model has its learned behavior from the data. It is a function of the data. So it's impossible to have machine learning without data, specifically with a learning algorithm, right? Something that's updating based on whatever the data contains. That's not going anywhere. But what is
Starting point is 00:23:22 interesting, though, is finding new ways to make it more efficient with all this data that doesn't have labels, for example. That's a very important research area, because a lot of companies don't lack data; they lack good data, data that is able to be used for machine learning specifically. So, you mentioned the data being looked at with a focus on machine learning. Yes, for example, like Demetrios was bringing up, there's that new course that Andrew put out. I think there's a competition associated with it where you have a fixed model and the goal is to improve the data set, which is kind of the opposite of what you would normally do. And I think that's really
Starting point is 00:24:05 interesting, because there are a lot of challenges there. So if we can learn new things to make that data better for machine learning applications, that will make the lives of a lot of people a lot easier. And it will also open up the space of what's possible with that data. There's sometimes too much of it, and you don't know what to do with it. And more data, if it's not good, isn't necessarily better. You still need data that is specific to your machine learning application. And that's what data science and data engineering are really focused on right now. Not just the infrastructure of getting it from all these sources
Starting point is 00:24:45 and getting it to the machine learning in real time, which is also a trend, not necessarily new, but something that is becoming more and more important, but also doing that in a way that's easy for the data scientists, so that they can experiment and iterate quickly. There's a need for abstractions
Starting point is 00:25:02 that allow people to work effectively, but also there's a need to do that securely, in a way that's auditable. You need to know where the data came from. How did this transformation lead to this result? And how does that affect the actual outputs of the model? That is still an open-ended question that I think a lot of people struggle with. I struggle with that.
Starting point is 00:25:22 How do I know how one relates to the other? And I think there's a lot of interest and research as well in understanding that relationship between the data and the models. And whole courses are coming out where they're teaching about a data-centric approach towards machine learning. And like I said earlier,
Starting point is 00:25:39 there's nothing necessarily novel about that. People who have been in this machine learning space have known that data is where most of the challenges are and where most of the gains will come from: better data, better features. But if you're already focused on that space and now you need to get it shipped to your customers, then it becomes, how do I serve this? That's where most people think of MLOps, but I think MLOps should capture
Starting point is 00:26:12 that data component, as a lot of that is, again, machine learning specific. And it's not just for experiments. It's also for production. Some of those things are actually going to be used and will affect users. And so I think there's a need to have better governance and better abstractions that are easier for both the data scientist that's doing that and also the ML engineer, maybe, that needs to ship that. And even the senior leadership that wants to have a better understanding of where things are coming from and where the issues are. So yeah, sorry, that's my long-winded answer to that. Just on that real fast: there are a few things that we've heard, that I think the collective community, and even people out there that know a little bit about machine learning and AI, have heard so
Starting point is 00:27:08 much that it's a cliché now: the data scientist on their laptop who then tries to productionize something and it doesn't work. Or that 80% of a data scientist's time is spent cleaning data or working with the data. And the other one is something like X percent of the machine learning or data products that get produced don't actually make it into production. That being said, I feel like David mentioned something along those lines, and it's good to see that we're starting to take more advantage of this idea of the data and how important it is. And I wanted to also mention that there are some really cool startups doing stuff in this space. It took a while for me to actually understand why this was MLOps, because it didn't really feel
Starting point is 00:28:08 like it at first. The company that comes to mind right away, and I have no affiliation with them, they don't sponsor the community or anything, is Superb AI. They're trying to do a little bit of what Scale AI is doing, where they're labeling your data, but they're doing it in an automated fashion. So that is really cool to see. That is something where you think, wow, okay, if that could work, like David was saying, then hopefully you can have, in an automated way, the ability to make sure that the data is high quality and make sure that you are able to get
Starting point is 00:28:57 these better predictions downstream, because you have that higher quality data that you're entering in. So that's just my little quick note on that. And of course, when we talk about data quality, this leads us to what's been brought up again and again by many guests, including the two of you, including Steph, including a lot of the guests that we've had here, which is the whole challenge of bias in AI. And before we go, I think that we would be remiss if we didn't bring that up as well. Is there a prospect here? I mean, how are we going to deal with this problem?
Starting point is 00:29:37 That's a hard question to answer. I'm not sure if right now we know how we're going to deal with that problem. And there was a pretty funny meme that I saw. And of course, explaining a meme never does it justice, but it's something along the lines of, let's use a machine learning model to get rid of the human bias. And then someone underneath it says, okay, and so where'd you get the data from? And who was messing with that data? How did they train? Like, how did they label that data? Was there no bias involved in that? So from what I've seen, what I've realized is that there are so many places and so many ways that bias can creep in. And we just have to accept that it's there and really try to recognize that we are going to have it.
Starting point is 00:30:34 And so gaining diverse viewpoints from many different fields and many different stakeholders is one way that's going to help. But also, it's a nasty beast: making sure a model is non-biased is a very difficult nut to crack. And I don't know if I have the right answer for that. But something interesting that someone told me the other day on the AI ethics podcast is, if we don't have the right answer, at least we can start asking the right questions. So we should try to figure out what the right questions would be: what does it look like if we have less bias? How do we get there? What does less bias even mean? So that's all I can say about it. And I'll just add to that that I think one of the most important steps is actually framing the question, like, what exactly are we trying to do? And
Starting point is 00:31:37 by doing that, you get a lot of that legwork done. But yeah, just to piggyback off what Demetrios said, I think that, one, we need more humans in the loop. There's a need to have domain experts interact with the technologists and be a part of building that: generating the data, qualifying the data, validating that data. There's still a need for people. You can never eliminate the human factor from machine learning. And like what Demetrios is saying, we joke about it, but using machine learning for machine learning is a real thing, and people are doing that. And I would be worried about that as well, because we need humans. We need that judgment. That's very hard to replicate. It's very hard to build that into an algorithm, to encode
Starting point is 00:32:26 that. You still need that human intuition of what bias is. And again, part of the challenge is defining what exactly qualifies as bias in a particular domain. Two, as we've been getting at, more and better data quality tools, practices, and processes will also help mitigate issues of bias. Because from my understanding, from a data science perspective, bias is inherent in the data, because we aren't always aware of the data generation process. We don't know where the data came from or what its distribution really is; we are approximating it and modeling it with our best judgments and some tools that are imperfect.
Starting point is 00:33:12 So if we're generating more data with imperfect tools and qualifying it with imperfect tools, I think there's a need to improve that. And that's a known problem. There are even arguments within the statistics field about how to actually approximate a data set, you know, like the Bayesians and the frequentists. But anyways, the point is that data quality is always going to help. If you can improve
Starting point is 00:33:34 the quality of the data, you will help mitigate bias, again, depending on what you mean by that. So I would say those two things: more humans in the loop, integrated into these machine learning pipelines, and continued work on data quality. Oh, I'm sorry, and thirdly, interpretability. That helps, because if you can understand how your model is using something, you can maybe better understand where a decision is coming from. And that will help you, not necessarily fully understand where the bias is, but get closer to understanding how you got to where you got to.
Starting point is 00:34:13 And that auditability, the ability to reproduce things, going back to a previous step and reproducing something more downstream, that's something that's also going to really make a difference. Because if you can explain to a stakeholder why this model made these predictions, or what data sets went into this model, that's just going to make that process go a lot smoother, since one of the first challenges is understanding where things came from. Well, thanks a lot for that. Before we get to our three questions, though, Steph, I want to give you a chance. Do you have any last words about the topic of data quality and bias?
Starting point is 00:35:11 As you say, bias is an artifact of a whole host of things that generate the data that we work with. We can also be working more closely with business experts around that legacy process, because if you think about most machine learning departments, they're less than five years old. But the data they might be working on, maybe in the financial sector, could be up to 150 years old on economics. So there are all sorts of historic processes that go into that, changing definitions, and really that data stewardship and understanding of the processes is key to being aware of bias, because, as Demetrios said, you're never going to get rid of it, but you need to know where the dragons are on your map to know where to be more cautious about it. And I think we'll get there. It's a collaborative effort. One thing that I will just mention, too, that I thought was super useful when looking at this
Starting point is 00:36:10 was when I interviewed a woman named Sarah Williams. She told me about how she creates maps; she works with a lot of visual data. So she'll create different maps, or she'll create different statistics, but in a visual sense. And she told me how it's really easy to lie with that. And I imagine you all may have heard of the book How to Lie with Statistics, and there's another one, I think it's called How to Lie with Data, that kind of thing. And so what she said she does, to try and make sure that her maps do not have any bias, or the least amount of bias possible, is she has a diverse range of stakeholders,
Starting point is 00:37:06 like she said, or like I mentioned before. And then she will also go to the people that the different maps are talking about, and she'll interview the actual people. So for example, she was talking about how they had a map that said X percent of people in these neighborhoods experience X percent more lung cancer. So it's more likely for people living in this neighborhood to get lung cancer for some reason; that's what they found with this data. And so instead of just publishing that, she went to that neighborhood and she talked to people on the ground, and she asked them: does this look like something that could be possible? Do you know people that are getting lung cancer? So that it's not just a way to get the data and then publish
Starting point is 00:38:04 it and show, okay, here's some correlation, now let's make it visual and then show people, so that it's easier to digest. No, she went out and made sure that she was able to get firsthand experiences and confirmations from the people that this data is talking about. And I thought that was a really interesting way of mitigating bias in what she was doing. Well, thank you guys very much for this wonderful conversation. Always a pleasure to have you. But before we go, we do have a tradition here at the end of the Utilizing AI podcast where I ask three questions, unexpected questions without warning, and we'll just see how you guys react.
Starting point is 00:38:50 Let's treat this as a jump ball between our guests. If you want to jump in and answer it, feel free. So here we go. First up, I have a new question, brand new, it just came to me while we were talking. And here it is. Maybe I'm going to throw this one at Demetrios. Is MLOps a lasting trend, or is this just a step on the way to ML and DevOps becoming normally how things are done? Ooh, this is something that I think about quite a bit, obviously, because I'm doing a lot with
Starting point is 00:39:28 the MLOps community, and I hope it is a lasting trend, but there is something to be said. Maybe a new piece of software comes out and it just eats up the need for MLOps completely. Or, like some have said in our meetups, it's just going to be DevOps. You're going to have DevOps, and when you say DevOps, it's going to mean that you deal with data, because data is becoming so ubiquitous. So I'm not sure. It is totally possible that it might get eaten up by something else. I hope not, just because we've got this community. And DevOps is those things, but machine
Starting point is 00:40:29 learning is that and a lot more. There's the math. There's the understanding of the experimental side of things, someone who knows how to iterate. And those skill sets are very different from a software engineer's. As an engineer, I focus on these concrete deliverables, and I like that. I don't see a DevOps person fully taking on all that responsibility. I think there's still going to be a need for people that have specialties and can understand a specific part of that problem. But that being said, I don't know if more categories is better, so if we can consolidate some things, I think that will help. But I do think that the need for a special set of skills will remain. And because of that, MLOps will remain. Just like Liam Neeson: a special set of skills. All right, next up,
Starting point is 00:41:26 how about this? Can either of you think of an application for machine learning that has not yet been rolled out? So it doesn't exist yet, but you're like, man, machine learning would make that great, and it'll have a major impact in the future. What will ML be used for in the future that you haven't seen yet? That's a good one. Steph, how about you first, since you haven't spoken? I'm punting that to you. Uh, I already answered this question on my podcast; you're the guest. Okay. Oh man, okay, putting you on the spot. Demetrios, got anything? Yeah, good, good. Yeah, it's super hard. It's really hard, because we don't know how machine learning is being used in all its different forms and all its different iterations right now. Like, I just look around my room and I think, I have a garden outside,
Starting point is 00:42:20 and that's one area that would be amazing. But I know there are people doing stuff with machine learning and gardening and vertical gardens and all of that right now in robotics. So I can't say what would be a really interesting one that hasn't been mentioned before. Yeah, one thing that is interesting is the use of machine learning in music production, because there's a certain amount of AI that is allowed, or is okay, when it comes to music production, and then you still want a little bit of the human involved. And so I look at that percentage and I think,
Starting point is 00:43:06 oh, that's an interesting one to look at. But again, it's nothing new; it's not something that we haven't thought about, and it is already actually happening. So I can't give you a good answer on that. That's a good question. Yeah, a couple of things came to mind. So, things involving devices, the internet of things,
Starting point is 00:43:24 more things being connected. That means more data is being collected; we're getting data from places that we historically weren't able to receive data from. And because there's new data, I think new data sets will allow us to find new interesting problems to model. So I think that's going to be an interesting domain. The other thing I was thinking about was, oh, one of the biggest things that I think machine learning is really good for is augmenting human intelligence, supporting things that we are already really good at.
Starting point is 00:43:54 So you mentioned the creative side of things. I think that's actually going to be really interesting. We already see AI or machine learning based creations in art, and I think that's going to be really interesting to explore. Like, what else can we use to narrow down the search space? In a field where there is a historically huge search space, I think machine learning can be really useful to narrow that down, to augment what we already know and to make things a lot easier. So that's a bit of a catch-all, but yeah,
Starting point is 00:44:25 I think, I guess, yeah, fields that involve a huge amount of possibilities, where it's just too large to explore on our own. So yeah, that's my cheap answer. Sorry, Stephen. Okay. Final question here. How big can ML models get? We've got hundred-billion parameter models now. Will these look small tomorrow, or have we reached some kind of limit? I think, yeah, it's like with chips, right? There's going to be a physical limit at some point. I think people have discussed this. I forget the name of that law. Moore's law, I think it is, if I'm understanding correctly. But yeah, I do think that it's just not sustainable. Compute is not going to be free, and storage is not going to be free.
Starting point is 00:45:13 So we're still going to need to find ways to make that more efficient. So I think the trend is actually going to be making them smaller. And even now you see lots of things regarding that, quantization and, you know, the need to make things smaller. So how big can they get? I guess as big as we have storage. The cloud is essentially infinite, but I don't think that's where things are going to go. I think the goal is to try to make them smaller and smaller. More efficient is what I think people are mostly interested in. Yeah, and along those lines of creating more efficient and better models that don't need to be so gigantic, one thing that a lot of people don't bring up is the cost to the planet. That never
Starting point is 00:46:09 comes up when you're talking about the different models and, wow, look at how incredible this is. But, like David said, yeah, we have infinite compute and we have infinite storage, but do we really want to use all of that and see how infinite it really is? So that's something that I also think needs to be kept in mind as we move forward. Yeah, that's a really good point. Do we want to use infinite energy to see where we can go with this thing? Well, thank you guys so much. It's great to have you here on the Utilizing AI podcast. Before we go, where can people connect with you and follow your thoughts on enterprise AI and other topics? And is there something that you've done recently that you're
Starting point is 00:46:54 particularly proud of? So David, let's go with you first. Yeah. So yeah, first off, thank you so much for having us again. It was an absolute pleasure. And I think the best place to reach me is on LinkedIn. So, David Aponte, you can just find me. And something that I've worked on that I'm proud of, I'll just say this: my work at BenevolentAI was a great learning experience. I helped set up a lot of the machine learning infrastructure there, and I know that the team there is still working hard on that. A lot of those learnings have benefited me, and I've been able to bring them into my new work at Microsoft. So I am proud of the work that I've done. As for where I left that off, I'm not sure; I'd need to reach out to my coworkers. But I am very proud of the work that I've done in machine learning infrastructure, as it's been fun to work at the intersection of
Starting point is 00:47:48 the research side of things and the production side of things. And the system design reviews... Oh yeah, sorry, yeah, that is something. Thank you for bringing that up. We are doing something very cool that I am very excited about, where we're going to be reviewing system designs at many different companies. So for example, we have one coming out soon on Pinterest, where we're going to go over their real-time image similarity architecture. And we're going to be going through these system designs and really breaking them down in a way that practitioners can use to learn that skill set of system design and architecture,
Starting point is 00:48:20 but also as a starting point for, maybe, systems that they need to build in their workplace. And so it's going to be a bit of a master class in system design. So stay tuned for that. I am very excited for that. Me, Vishnu, and Demetrios are working very hard on that. Yeah, and that's something that David has spearheaded and has been doing a really great job on. I think that right now what we're seeing from the MLOps community is a dire need to get more technical, because what you see when it comes to MLOps right now is so many different articles that just explain what MLOps is, or maybe toy projects that people have. But this is really productionizing at a large scale in large companies, and I think it's super cool. So I'm also proud of that. You can reach me
Starting point is 00:49:15 on LinkedIn too, or in the MLOps community Slack. I'm very active there. And I think that's it for us. Well, thank you. And I do want to say, the MLOps Community Podcast, if you're not subscribed to it, definitely go subscribe to that. It's a great, great podcast. How about you, Steph? Anything exciting in your life? I've just finished up a series on developer velocity, looking at how businesses can do data, machine learning, and software engineering as a key value driver inside an organization, based off a lot of research from Microsoft and McKinsey. And I cover some of the topics that we discussed today in that series, like will MLOps be eaten, and there being too many something-Ops.
Starting point is 00:50:12 I also provide a great use case for getting a tools budget, which I know some businesses and tech teams can struggle with. And that's available at bit.ly slash developer velocity. So those are available online. Well, thank you so much. And thank you for joining us here on your first stint as the co-host of Utilizing AI. It's wonderful to have you here. And thank you everyone for joining us.
Starting point is 00:50:39 As for me, I'm really excited that we just wrapped our second AI Field Day event. We will have another one in the future. We would love to have you join us at that event. Just go to techfieldday.com, click on delegates, and you can submit your name to become a delegate or click on sponsors, and you can learn how the companies sponsor and present at those events. Also, of course, we would love to have you join us here on the Utilizing AI podcast. And I will finally give one last shout out to the MLOps community, which really opened a lot of doors to me. And I really appreciate the time and effort that David, Demetrios, and everybody else puts into that. So thank you very much for joining us.
Starting point is 00:51:19 If you enjoyed this discussion, please do subscribe to the podcast. It really does help to have more subscribers. We're basically on every podcast thing out there. This podcast is brought to you by gestaltit.com, your home for IT coverage from across the enterprise. For show notes and more episodes, go to utilizing-ai.com, or you can follow us on Twitter at utilizing underscore AI. Thanks, and we'll see you next time.
