Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 2x27: How the ML Community Has Evolved in 2021 with Demetrios Brinkmann and David Aponte
Episode Date: July 6, 2021
The MLOps community has grown dramatically recently, with security, a data-centric approach, ethical implications, and a growing and diverse community rising in 2021. In this episode, MLOps Community managers Demetrios Brinkmann and David Aponte join Steph Locke and Stephen Foskett to discuss what has changed over the last year. It seems that a new ML company is launching every week, and the MLOps Community provides a great way to learn about these. We are also seeing a push and pull between open source and cloud platforms, and concern about lock-in and technical debt. Data science and machine learning are merging, with greater focus on data quality and quantity when training models.
Three Questions:
Is MLOps a lasting trend or just a step on the way for ML and DevOps becoming normal?
Can you think of an application for ML that has not yet been rolled out but will make a major impact in the future?
How big can ML models get? Will today's hundred-billion parameter model look small tomorrow or have we reached the limit?
Companies Mentioned: Microsoft, Tecton, Scale AI
Sara Williams Talk
D. Sculley interview
Guests and Hosts:
Demetrios Brinkmann, Community Coordinator at MLOps Community. Connect with Demetrios on LinkedIn, on Twitter at @DPBrinkm, or at mlops.community.
David Aponte, Community Coordinator at MLOps Community. Connect with David on LinkedIn.
Steph Locke, Data Scientist and CEO of Nightingale HQ. Connect with Steph on LinkedIn or on Twitter @TheStephLocke.
Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.
Date: 7/6/2021
Tags: @SFoskett, @TheStephLocke, @DPBrinkm, @MLOpsCommunity
Transcript
Welcome to Utilizing AI,
the podcast about enterprise applications for machine learning,
deep learning, and other artificial intelligence topics.
Each episode brings in experts in
enterprise infrastructure to discuss
applications of AI in today's data center.
Today, we're discussing the changes that we've seen over
the last year in MLOps and machine
learning and the artificial intelligence community. First, let's meet our guests, Demetrios Brinkmann
and David Aponte. Hello, everyone. I'm Demetrios Brinkmann. I'm one of the main organizers of the
MLOps community, which is a community that has about 5,000 people in it now. We're 5,000
strong and we love to talk about everything MLOps. So if you are interested and you enjoy
MLOps, I think there's probably something in there for you.
And my name is David Aponte. I work as a software engineer at Microsoft, and a bit of a data engineer
and ML engineer working in the MLOps space. I'm also one of the organizers for the MLOps community.
Please feel free to reach out to me on LinkedIn and we're so happy to be here.
I'm Steph Locke, CEO of Nightingale HQ. We help manufacturers adopt artificial intelligence from putting in things like invoice processing through to custom defect detection models.
So MLOps is a big part of where I try to help manufacturers. And I'm Stephen Foskett, publisher of Gestalt IT and organizer of Tech Field Day, including the AI Field Day, which just happened a few weeks ago.
So the reason that we wanted Demetrios and David back on here is because, frankly, the MLOps community is the primary community for machine learning and AI practitioners that I know of. It's a wonderful, amazing,
vibrant community. And frankly, it's pretty huge. There's a ton of people on there. And I think
maybe that took you by surprise, Demetrios. I don't know if you ever expected that.
But that gives you two a chance to really have your finger on the pulse of the changes that are
happening in AI and ML.
And since that's kind of the topic here, I wonder if you guys can help us.
Demetrios, what have you seen over the last year overall, sort of like meta trends in MLOps and AI?
So first of all, thanks for having us back. It's an honor to be invited back for i think this is
our third time so it's really cool and also a huge honor to be at the ai field day that was
a blast i got a bunch of cool schwag to show off to all my friends and family around here but
anyway when it comes to ml ops and what we've seen, I think David will
be able to talk more in depth about this. But as for me, what I notice is and what I try to actively
bring more of to the community by having meetups and having podcasts around these different
subjects. One thing that's huge is security. It's not just enough to put your model
into production now and then monitor it. It's also how are you going to securely do that? And
what kind of threats are there when it comes to machine learning and having a machine learning
model out there? So that is a huge one. And then the other one is being more data-centric.
You have probably seen Andrew Ng talk about this, and it's becoming more
of a thing day by day.
The more that he and other people talk about it, the bigger it becomes.
And by data-centric, we're talking about making sure
that the data you're collecting is the right data,
making sure that it's being cleaned properly.
And we're putting out a whole podcast series on the data layer,
making sure that you have data access
or the right people have the right access to the right data
and not just everyone has access to
whatever data they want. You can have data poisoning problems. And, as I said, collecting the
right data matters, but data collection goes so much further beyond that, because
if you're going out and you're collecting data, you want to make sure that you know what you're going to use that data for before you just go and mindlessly collect everything that you can, which has kind of been the default up until now. We also recently interviewed the infamous D. Sculley, author of the ML Test Score paper and "Machine Learning: The High Interest Credit Card of Technical Debt,"
those two papers, which are pretty well received within the MLOps community and are
kind of standard reading. And we asked him what's changed over the past five years that he's been
around this space, and he wrote that first paper five
or six years ago, almost. And what he said was, well, in that first paper, the high
interest credit card of technical debt one, there was only one paragraph
about ethics. And now it's blown up into its own thing. It's a huge piece of MLOps
and just machine learning in general. And so that is something else that I want to point out that
ethics is huge. I have another podcast on that because I love it so much. And I think it is
so important that we talk about the different implications.
That's definitely something that's come up again and again on this
podcast as well is the ethical implications of the choices that you're making when you're creating
machine learning models and the data that you've selected to train them on and so on.
How about you, David? What have you seen change in ML over the last year?
Yeah, I think just exactly what Demetrios said.
The only other thing I would add is community.
I think the fact that we have now a whole community
dedicated to a kind of like specialized subfield
within machine learning already is a huge change.
And D. Sculley spoke about this as well.
Like there wasn't a community
to talk about some of these things.
There wasn't a place to go and ask questions. Now we have a channel, a very active channel, called MLOps
Questions Answered, and it has so many. If you want to know what people are dealing with,
what challenges people are facing, and what they're doing to solve
those challenges, that's the place to go. You'll see lots of questions around finding an abstraction
between the model and the data, hence a lot of feature store chat,
security, compliance.
Some people are trying to scale out.
Some people are just starting.
And so one of the shifts that I've seen,
like Demetrios is getting at as well,
is just like he said, a focus on the data.
And this is not anything new. This has been around for some time. And so it's funny,
not to say anything bad about Andrew Ng, I always mispronounce his name, but people who have been
working in the ML space have known for a long time that it's always been about the data.
But I guess because the focus was on the serving side, the model side, you kind of never talk about that.
But it doesn't mean that practitioners don't think that that's important.
And if you really look at a lot of the day-to-day work of a machine learning engineer, I would argue that data is a big part of that.
Making use out of that, cleaning that, generating features from that.
That's what you really spend most of your time doing. So anyways, just wanted to say that there is a big shift in the focus, things like data, data meshes, new patterns of organizing that data, which is really interesting.
But yeah, generally speaking, I think it's just those two things. There's a community
having a conversation about common problems. And now that community is very focused on data,
not just the model part, not just the algorithm,
but how do we actually get good quality data
and make use out of that?
And I'll jump in real fast.
Another thing that has changed is that it feels to me like there is a new company every other day
that is trying to do something in the MLOps space.
And of course, since we are the MLOps community and we're in the center of this microcosm,
maybe I see it more, because a lot more people are approaching us to try and talk to us
or see how they can better leverage the community for their cause.
But that's something that's really interesting. The space is very, very hot. And I always joke
about this with David and our other co-host Vishnu: I have a blog post pending
that I really want to write, and it's called "Please Don't Start an MLOps Company."
Because there are too many right now, obviously, like, that's a joke, please go out and do it if
you feel like it is useful. But we see the same pattern happening over and over. Someone says,
hey, at my last company, we were having trouble with this, we built this tool. And now I'm going to
go out and create a company around that because there's probably other people that are having
trouble with that same problem. And this tool can be useful for them. So what I've been seeing a lot
of is these tools that are coming from a big organization
or from a technology-first company,
but they're trying to solve the problem
that that company had
and then seeing if the market will take it
and if others have that problem too.
It's an excellent point.
I mean, in any kind of emerging technology area,
we see a really huge amount of
fragmentation of companies trying to hone in on generalizable solutions, big problems,
and then we start hopefully at some point seeing consolidation. You know, you'll see those big MLOps vendor-landscape infographics, and it's just
overwhelming. Are there kind of key players in the market who you think are gaining significant
traction right now, consolidating things, or just generally making it easier for your MLOps community to
build that more robust machine learning capability inside their business? I'll let
David answer this one because I'm very biased: (a) there are a lot of great companies that are
sponsoring the community, and (b) there are some companies that I've been looking to invest in. And so I don't know if I can give you a straight answer on that, but David can.
I don't know if I can either. One, you know, like I said, I work at Microsoft. We are a very large,
you know, cloud provider interested in that space as well. So yes, it's going to be difficult,
but you know, I'm going to give the cheap answer and say the cloud providers.
And the reason why I think that they are going to, I mean, it's just a common pattern that they've done in the past anyway.
So, for example, with Google, there's a good open source solution,
and they'll have a software-as-a-service implementation of that where they take care of the infrastructure,
they take care of the security aspects, they take care of the deployment. All you have to do is
press a couple of buttons and configure it to whatever it is you need, and they'll
stand that up for you. Examples of that have been, you know, Kubeflow Pipelines. There's a lot of
tools that I noticed that these companies end up supporting as a part of their stack. Kafka is another one.
Azure has a managed instance of that.
And it's like, that's the trend.
And so my guess would be that it will also happen in the MLOps space, where these
solutions eventually are just going to get simpler and simpler.
I envision a low-code drag and drop environment for a lot of machine learning.
I already see that, particularly within the Azure cloud provider.
There's a lot of things that I'm using to this day that are actually drag-and-drop.
And I think it's great.
It makes my life a lot easier.
And when the cloud provider takes care of the platform as a service, the VMs, the network, the compute, that makes things easier for me.
I can just focus on the application side of things, configuring it to a way that
it works to solve the specific problems that I'm working on versus me handling the end-to-end stack,
which I know a lot of machine learning engineers, data engineers have had to deal with,
especially at smaller companies where you're essentially wearing all those hats.
I don't see that as being a sustainable model. I think that over time, we're going to see
more specialization, more focus in certain areas. I think that's going to enable people to be more
innovative. But I also see that meaning that the big responsibilities are going to be offloaded
to these service providers, these cloud service providers, so to speak, that are going to make it just a nice abstraction on top of all that,
something that you can easily integrate into what you're doing. The downside to that is that it's
going to be very expensive. So because the cost is always going to be there, I see open source
not going anywhere. There's going to be continuous innovation in the open source space, trying to
find similar solutions, if not the same solutions, but as open source implementations. So while there may be,
let's say an Azure service for a feature store, for example, one day, there still will be hopefully
an open source implementation of that. We already see that model with Tecton,
one of our sponsors: Tecton is the enterprise implementation, and Feast is
their open source implementation. And I see that model being very successful because it's attractive to get you started
with something that's free. And then when you see, oh, I want more scale, I want more things,
and we don't have the team to do that, we'll pay you for that. So I see that trend happening.
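As a concrete illustration of the enterprise-versus-open-source feature store split described here, below is a minimal sketch of what reading features from the open source Feast project can look like at serving time. The feature view, feature names, and entity key are hypothetical placeholders, not anything discussed in the episode.

```python
# Rough sketch of reading features from an open source feature store (Feast).
# Feature view names, feature names, and the entity key below are hypothetical.
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at a local Feast feature repository

# Fetch the latest feature values for one entity at serving time.
features = store.get_online_features(
    features=[
        "driver_stats:trips_today",      # hypothetical "feature_view:feature" names
        "driver_stats:avg_rating",
    ],
    entity_rows=[{"driver_id": 1001}],   # hypothetical entity key
).to_dict()

# The returned dict maps feature names to lists of values, one per entity row.
print(features)
```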
And that's just, again, my bias, I do work at a larger company,
but even working at a smaller company previously, where we did everything from scratch, I found
more pain points than benefits to doing everything yourself. And so my feeling is
that we're going to offload that to people who are really good at it,
to companies where that's what they do best. And we're just going to focus on what we
do best, where our particular competitive advantage lies. And that's what I think companies should
focus on actually, because they can really excel in that area versus trying to do that and be a
cloud provider and develop new state-of-the-art tools and develop state-of-the-art machine learning
applications. That just seems like too much to be successful in the long term.
I will add to that too, that when it comes to the cloud providers, there's a few things that I think
the people that we interviewed have said that have been very useful for me to look at and to
think of the space. And one was Noah Gift. And he said that if you're
getting started, like David was talking about, just go bet on the cloud provider because it's
going to be there. It's going to give you 80% of what you need, and it's going to give it to you
easily. The thing that I've seen in the community, though, is that once you get to a certain point and that 80% is no longer
good enough, it is a real headache to try and get even a minimal gain,
even an extra 1%. You're just going to have to work really hard to break out of the chains of what the
cloud provider is trying to give you. And so we've seen, there's an incredible post that
we just saw in the community. And it was talking about like the downsides or the pains of using
SageMaker and how people have used it. And they just want to pull their hair out because they
want to get that other 20%. They want to unlock that. The 80% that SageMaker gives you
is not enough. And they want to get to that next percentage, but it's really hard to do because of
all of the opinions that SageMaker takes. And so once you get to that point, I feel like
there is a benefit of trying to look outside of the box for what you have and start taking on the best of
breed and not using so much, not relying so heavily, or just basically packaging up everything
from the cloud provider and saying, all right, I'm going to go with the whole SageMaker suite
or whatever it is, Azure or GCP. And like David was saying, one of the
biggest complaints that people had was in SageMaker, it is like four times more expensive
for the instances, even though it's the same instance, it's an AWS instance. And so why
should it be so much more expensive just because you're using SageMaker? And this was one
thing that people love to talk about, but I'm not sure yet if it is really that big of a game
changer for people. I think it's more in people's heads. It's in everyone's head: the
vendor lock-in. And so nobody wants vendor lock-in, but then sometimes people just say,
well, you know what? It's going to get us that 80% really quick. So I don't care. Like later on,
we'll figure it out. And so the vendor lock-in is huge. Everyone has a big fear of it, but then
they go behind the curtains and they do it anyway. So I don't know about the vendor lock-in. I'm
still unsure how to take that.
But that is a big argument that you'll hear.
I think that's just a great point.
And I just want to add to that,
that that's a challenge in software engineering in general
with respect to finding the right set of abstractions.
You want it to meet all the needs of your users
to solve all those problems,
to have the features that they ask for.
But there's always going to be new things that you just cannot do.
And like you said, there's like, it'll get you 80% of the way.
And if part of your requirements is to have a POC quickly, to have something that will
get up and running as fast as possible, and then you can iterate to make it
more configurable, more customized to the domain, I think that's often a wise strategy, because otherwise you may end up spending a year building something that isn't up to par with what you really need it to be.
And maybe you don't have the right set of team members to do that.
Building infrastructure is a very challenging thing to do, not to mention the machine learning side of things.
So when you're trying to do all of that, I think you end up wasting more money, more talent, and more time.
It's very expensive to hire a machine learning engineer nowadays.
And so you have to also consider that cost as well, the manpower.
But if you're interested in investing in that space, if you're interested in
the long term, where you want something of your own, maybe something that you can give back to
the community, let's say through an open source implementation, then I think that that is an
important requirement. And then, like you said, to avoid that vendor lock-in, which is a real thing,
you don't want to be too dependent on something external to what you're
working on. But there's the flip side of the coin: if your competitive advantage is in something
completely different, let's say in the drug discovery space, which is where I was working
last, then I don't know if it makes sense to invest a whole lot of money into being
excellent in this other area as well. And I just find that it's going to be very difficult to do
all of those, you know, as well as you could.
Yeah, one of the differences I've found between SageMaker and the Azure Machine Learning solution is that the Azure ML approach to MLOps is basically that you add extra lines of code to log to the cloud what you're doing.
So you can use your general framework and just use Azure ML as the backbone,
compared to AWS, where you have to do everything in a little bit more of a closed way.
And that kind of approach actually helps people who are still doing things on their laptops, or looking at doing things inside their own data centers.
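A minimal sketch of the "add extra lines of code to log what you're doing" pattern Steph describes, shown here with the MLflow tracking API purely as an example (Azure ML can act as a tracking backend, but the same idea applies to other experiment trackers). The dataset and model are arbitrary stand-ins.

```python
# Minimal sketch: instrumenting an ordinary training script with tracking calls.
# MLflow is used only as an illustration of the pattern; the dataset and model
# choices here are arbitrary.
import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():                       # everything inside is one logged run
    mlflow.log_param("n_estimators", 100)      # record the configuration you used
    model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    mlflow.log_metric("accuracy", acc)         # record the result for later comparison
```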
Are you seeing people taking kind of a cost benefit approach to working on-prem with ML or are most people in the cloud? I know I'm cloud by default.
Yeah, I would say most people are going to the cloud. I think everything is going to the cloud.
I think even the everyday things that consumers use is going to the cloud. I think there is
still a need though to bridge that gap. So for example, again, I'm sorry to talk about Microsoft, but I know that Microsoft has something called Azure Arc, which allows you to integrate with your on-prem.
And I know that the other cloud providers have that as well. Some organizations just have such specific requirements with their clients that it
doesn't make sense to have them on the cloud. And so for that, I still think there's a need to
support on-prem, but the trend, in my opinion, from what I've seen, is going towards the cloud, as
that's what most people are doing anyway. So it's kind of a hop-on-the-bandwagon type of thing. So one of the other things that you brought up, David, and I want
to try to make sure that we get to this before we run out of time here is the whole datafication
of ML or the MLfication of data. And to me, especially with Steph here as well, because
that's kind of your background too, it seems that, you know, I'm not sure what's happening here, but it seems like data science and ML are merging.
Is there such a thing as data without ML or ML without data?
No, I think a model learns its behavior from the data.
It is a function of the data. So it's impossible to have machine learning without data, specifically
with a learning algorithm, right? Something that's updating
based on whatever the data contains. That's not going anywhere. But what is
interesting though, is finding new ways to make it more efficient with all this data that doesn't have labels, for example.
That's a very important research area, because a lot of companies don't lack data.
They lack good data, data that is able to be used for machine learning specifically.
So you mentioned the data kind of being framed with a focus on machine learning. Yes: for example,
Demetrios was bringing up that new course that Andrew put out. I think there's a competition
associated with it where you have a fixed model and the goal is to improve the data set, which
is kind of the opposite of what you would normally do. And I think that's really
interesting because there is a lot of challenges there. So if we can learn new things to make that
data better for machine learning applications, that will make the lives of a lot of people
a lot easier. And it will also open up the space of what's possible with that data. There's
sometimes too much that you don't know
what to do with. And more data, if it's not good, isn't necessarily better. You still need data that
is specific to your machine learning application. And that's where the data science and the data
engineering is really focused on right now. Not just the infrastructure, getting it from all
these sources
and getting it to the machine learning in real time,
which is also a new trend,
not necessarily new,
but something that is becoming more and more important,
but also doing that in a way
that's easy for the data scientists
so that they can experiment and iterate quickly.
There's a need for abstractions
that allow people to work effectively,
but also there's a need to do that securely
in a way that it's auditable.
Like you need to know what sources the data came from.
How did this transformation lead to this result?
And how does that affect the actual outputs of the model?
That is still an open-ended question
that I think a lot of us, myself included, struggle with.
How do I know how one relates to the other?
And I think there's a lot of interest
and research as well in understanding that relationship
between the data and the models.
And a whole course is coming out
where they're teaching about
a data-centric approach towards machine learning.
And like I said earlier,
there's nothing necessarily novel about that.
People who have been in this machine learning space
have known that data is where most of the challenges are and where most of the gains will come from:
better data, better features. But if you're already focused on that space and now you need to
get it shipped to your customers, then it becomes: how do I serve this? That's where
most people's thinking of MLOps starts, but I think MLOps should capture that data component
as a lot of that is, again, machine learning specific.
And it's not just for experiments.
It's also for production.
Some of those things are actually going to be used and affecting users.
And so I think there's a need to have better governance
and better abstractions that are easier for both the data scientist that's
doing that and also the ML engineer maybe that needs to ship that. And even the senior leadership
that wants to have a better understanding of where things are coming from and where the issues are.
So yeah, sorry, that's my long-winded answer to that.
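A rough sketch of the "fix the model, improve the data" idea mentioned above: the model class and hyperparameters stay constant, and only successive versions of the dataset change. The dataset-loading helper below is hypothetical, standing in for however data versions are actually stored and loaded.

```python
# Sketch of data-centric iteration: the model is frozen, only the data changes.
# load_dataset_version() is a hypothetical helper standing in for however you
# version and load your training data (files, a feature store, a labeling tool).
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def evaluate_data_version(X_train, y_train, X_val, y_val):
    # Same model class, same hyperparameters every time; only the data varies.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return accuracy_score(y_val, model.predict(X_val))

# for version in ["v1_raw", "v2_relabels_fixed", "v3_dedup_balanced"]:
#     X_train, y_train, X_val, y_val = load_dataset_version(version)  # hypothetical
#     print(version, evaluate_data_version(X_train, y_train, X_val, y_val))
```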
Just on that real fast, there are a few things that we've all heard. I think the collective community, and even people out there that know a little bit about machine learning and AI, have heard these so much that they're cliches now. Like the data scientist on their laptop, and then they try
to productionize something and it doesn't work. Or 80% of the data scientist's time is spent
cleaning data or working with the data. And the other one is something like X amount of
the machine learning or data products that get produced don't actually make it into
production. And so, that being said, I don't remember exactly why I started down that path,
but I feel like David mentioned something along those lines. And it's good to see
that we're starting to take more advantage of this idea of the data
and how important it is. And I wanted to just mention also that there are some really cool
startups that are doing stuff in this space. And it took a while for me to actually understand why this was MLOps, because it didn't really feel
like it at first. The company that comes to mind right away,
and I have no affiliation with them, they don't sponsor the community or anything, is Superb AI.
They're trying to do a little bit of what Scale AI is doing, where they're labeling your data, but they're doing it in an automated fashion.
So that is really cool to see.
Like that is something where you think, wow, okay, then if that could work, like David was saying, you can boost,
hopefully you can have in an automated way,
the ability to make sure that the data is high quality
and make sure that you are able to get
these better predictions downstream
because you have that higher quality data that you're entering in.
So that's just my little quick note on that.
And of course, when we talk about data quality, this leads us to what's been brought up again
and again by many guests, including the two of you, including Steph, including a lot of
the guests that we've had here, which is the whole challenge
of bias in AI. And I think before we go, I think that we would be remiss if we didn't bring that
up as well. Is there a prospect here? I mean, how are we going to deal with this problem?
That's a hard question to answer. I'm not sure if right now we know how we're going to deal with that problem.
And there was a pretty funny meme that I saw. And of course, like explaining a meme never does it
justice, but it's something along the lines of let's use a machine learning model to get rid of
the human bias. And then someone underneath it says, okay, and so where'd you
get the data from? And who was messing with that data? How did they train? Like, how did they label
that data? Was there no bias involved in that? So that's what I've seen. And what I've realized is
there are so many places and so many ways that bias can creep in. And we just have to accept that it's there
and really try to recognize that we are going to have it.
And so gaining the diverse viewpoints
from many different fields and many different stakeholders
is one way that's going to help. But also, yeah, it's a nasty beast; making sure a system is not biased is a very difficult nut to crack. And I don't know if I have the right answer for that. But what is interesting
that someone told me the other day on the AI ethics podcast is, if we don't have the right
answer, at least we can start asking the right questions. And so trying to figure out what the
right questions would be around what it looks like if we have less bias. How do we get there? What does that even mean, less bias? Right. So
that's all I can say about it. And I'll just add that I think that's actually one of the
most important steps: actually framing the question, like what exactly are we trying to do?
By doing that, you get a lot of that legwork done. But yeah, just to piggyback off what Demetrios said, I think that
one, more humans in the loop. There's a need to have domain experts interact with the technologists
and be a part of them building that, generating the data, qualifying the data, validating that
data. There's still a need for people. You can never eliminate the human factor from machine
learning. And like what Demetrios is saying, we joke about that,
but using machine learning for machine learning is a real thing and people are doing that.
And I would be worried about that as well, because we need humans. We need that judgment. That's
very hard to replicate. It's very hard to build that into an algorithm, to encode
that. You still need that human intuition of what bias is. And again, part of the challenge is
defining what we mean by that: what exactly qualifies as bias in this particular domain.
Two, and I think we've been getting at this: better data quality tools, practices, and processes will also help
mitigate issues of bias.
Because from my understanding, from a data science perspective, bias is inherent in the data,
because we don't always have, or aren't aware of, the data generation process.
We don't know where it came from or what its true distribution is; we are approximating and modeling it with our best judgments
and some tools that are imperfect.
So if we're generating more data with imperfect tools
and qualifying them with imperfect tools,
I think there's a need to improve that.
And that's a known problem.
And there are even arguments within the statistics field
about how to actually approximate a data set, you know,
like the Bayesians and the frequentists.
But anyway, the point is that data quality is always going to help. If you can improve
the quality of the data, you will help mitigate bias. And again, depending on what you mean by
that. So I would say those two things, more humans in the loop, being integrated into these machine learning
pipelines and continued work with data quality.
Oh, I'm sorry.
And thirdly, the interpretability.
That helps because if you can understand how your model is using something, you can maybe
better understand where this decision is coming from. And that will help you, again, not necessarily maybe fully understand where the bias is,
but get you closer to understanding how you got to where you got to.
And that auditability, the ability to reproduce things, going back to a previous step and
reproducing something more downstream, that's something that's also going
to really make a difference. Because if you can explain to a stakeholder why this model made these
predictions or what data sets went into this model, that's just going to make that process
go a lot smoother, since one of the first challenges is understanding where things came from.
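As one illustration of the interpretability point above, a library such as SHAP can attribute a model's predictions to individual input features. This is a generic sketch with an arbitrary dataset and model, not a tool endorsed in the episode.

```python
# Rough sketch: attributing a model's predictions to input features with SHAP.
# SHAP is just one example of an interpretability library; the dataset and model
# here are arbitrary stand-ins.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

data = load_breast_cancer()
model = GradientBoostingClassifier().fit(data.data, data.target)

explainer = shap.Explainer(model, data.data)   # builds an explainer around the model
shap_values = explainer(data.data[:50])        # per-feature attributions for 50 rows
# shap.plots.beeswarm(shap_values)             # optional: visualize which features drive predictions
```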
Well, thanks a lot for that. Before we get to our three questions, though, Steph,
I want to give you a chance. Do you have any last words about the topic of data quality and bias?
As you say, bias is an artifact of a whole host of things that generate the data we work with. We can also be working more closely with business experts around that legacy process,
because if you think about most machine learning departments, they're less than five years
old. But the data they might be working with, maybe in the financial sector, could be economic data that's up to
150 years old.
So there's all sorts of historic processes that go into that, changing definitions, and
really, that data stewardship and understanding of the processes is key to being aware of bias, because as Demetrios said,
you're never going to get rid of it, but you need to know where the dragons are on your map to know
where to be more cautious. And I think we'll get there. It's a collaborative effort. One thing that I will just mention too
that I thought was super useful when looking at this
was when I interviewed a woman named Sarah Williams.
She told me about how she creates maps;
what she does is a lot of work with visual data. And so she'll create different
maps, or she'll create different statistics, but in a visual sense. And she told me how
it's really easy to lie with that. And I imagine you all may have heard of the book How to Lie with Statistics. And
there's another one, I think it's called How to Lie with Data, that kind of thing. And so
what she said she does to try and make sure that her maps have no bias, or as little
bias as possible, is she has a diverse range of stakeholders,
like I mentioned before. And then she also will go to the people that
the different maps are talking about, and she'll interview the actual people. So for example, she was talking about how they had a map where it said X amount of people
or X percent of people in these neighborhoods experience X percent more lung cancer.
So it's more likely for people living in this neighborhood to get lung cancer for some
reason.
That's what they found with this data. And so instead of just publishing that, she went to that neighborhood and she talked to people
on the ground and she asked them, does this look like something that could be possible? Do you know
people that are getting lung cancer? So that it's not just a way to get the data and then publish
it, show, okay, here's some correlation. Now let's
put it on to a visual, let's make it visual and then let's show people so that it's easier to
digest. No, she went out and she made sure that she's able to get the firsthand experiences and
confirmations from the people that this data is talking about.
And I thought that was a really interesting way of mitigating bias within what she was doing.
Well, thank you guys very much for this wonderful conversation. Always a pleasure to have you. But
before we go, we do have a tradition here at the end of the Utilizing AI podcast where I ask three
questions, unexpected questions without warning, and we'll just see how you guys react.
Let's treat this as a jump ball between our guests.
If you want to jump in and answer it, feel free.
So here we go.
So first up, I have a new question, brand new, just came to me while we were talking.
And here it is.
Maybe I'm going to throw this one at Demetrios. Is MLOps a lasting trend, or is this just a step on the way to ML and DevOps becoming how things are normally done?
Ooh, this is something that I think about quite a bit, obviously, because I'm doing a lot with
the MLOps community and I hope it is a lasting trend, but there is something to be said. Maybe
a new software comes out and it just eats up the need for MLOps completely. Or like some have said
in our meetups, that it's just going to be DevOps.
You're going to have DevOps, and when you say DevOps, it's going to mean that
you deal with data too, because data is becoming so ubiquitous.
So I'm not sure.
It is totally possible that it might get eaten up by something else.
I hope not, just because we've got this community. And MLOps does include those things, but machine
learning is that and a lot more. There's the math. There's the understanding of the experimental side
of things, someone who knows how to iterate. And those skill sets are very different from a
software engineer. As an engineer, I focus on these concrete deliverables. And I like that. I don't see, you know, a DevOps person fully like, you know,
taking on all that responsibility. I think there's still going to be a need for people that have
specialties and can understand a specific part of that problem. But that being said, I think it will do some good if things become more like,
I don't know if more categories is better. So if we can consolidate some things, I think that will
help. But I do think that the need for a special set of skills will still remain. And because of
that, MLOps will still remain. Just like Liam Neeson, special set of skills. All right, next up,
how about this? Can either of you think of an application for machine learning that has not
yet been rolled out? So it doesn't exist yet, but you're like, man, machine learning would make that
great and it'll have a major impact in the future. What will ML be used for in the future that you
haven't seen yet? That's a good one. Steph, how about you first, since you haven't spoken? I'm punting that to you.
I already answered this question on my podcast; you're the guest.
Okay, oh man, okay. Putting you on the spot: Demetrios, got anything? Yeah, good, good. Yeah, it's super hard.
It's really hard because we don't know how machine learning is being used in all its different forms and all its different iterations right now.
Like I just look around my room and I think I have a garden outside
and that's one area that would be amazing.
But I know there are people that are doing stuff
with machine learning and gardening and vertical gardens and all of that right now in robotics.
So I can't say what would be a really interesting one that hasn't been mentioned
before. Yeah, one thing that is interesting is the use of machine learning in music production,
because there's a certain amount of AI that is allowed, or is okay, when it
comes to music production, and then you still want a little bit of the human involved. And so I
look at that percentage and I think,
oh, that's an interesting one to look at.
But again, it's nothing new
and it's not something that we haven't thought about
or it's not already actually happening.
So I can't give you a good answer on that.
That's a good question.
Yeah, I think the only thing, a couple of things came to mind.
So things involving devices, the internet of things,
more things being connected.
That means more data is being collected; we're getting data from places that we
historically weren't able to receive data from. And because there's new data, I think new data sets
will allow us to find new interesting problems to model. So I think that's going to be an
interesting domain. The other thing I was thinking about was, oh, so one of the biggest things
that I think machine learning is really good for
is augmenting human intelligence,
supporting things that we are already really good at.
So you mentioned the creative side of things.
I think that's actually going to be really interesting.
We already see AI or machine learning based creations in art, and I
think that's going to be really interesting to explore. Like, what else can we use to narrow down
the search space? So a field where there is a historically huge search space, I think machine
learning can be really useful there to narrow that down, to augment what we already know and to make
things a lot easier. So that's a bit of a catch-all, but yeah,
I guess, fields whose search space is just too large to explore on our
own. So yeah, that's my cheap answer. Sorry, Stephen. Okay. Final question here. How big can
ML models get? We've got a hundred billion parameter models now. Will these look small
tomorrow, or have we reached some kind of limit? I think, yeah, it's like with chips, right?
There's going to be a physical limit at some point. I think people have discussed this. I
forget the name of that law. Moore's law, I think it is, if I'm understanding correctly.
But yeah, I do think that it's just not sustainable.
Compute is not going to be free, and storage is not going to be free.
So we're still going to need to find ways to make that more efficient. So I think the trend is
actually going to be making them smaller. And even now you see lots of work on quantization; there's still a need to make things smaller.
So how big can they get? I guess as big as we have storage.
So the cloud is essentially infinite, but I don't think that's where things are going to go.
I think the goal is to try to make them smaller and smaller.
More efficient is what I think people are mostly interested in.
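One example of the "make models smaller" direction alluded to here is post-training quantization; below is a rough sketch using PyTorch dynamic quantization, with a placeholder model rather than anything from the discussion.

```python
# Rough sketch of one "make the model smaller" technique: post-training dynamic
# quantization in PyTorch. The tiny model here is an arbitrary stand-in.
import torch
import torch.nn as nn

model = nn.Sequential(          # placeholder float32 model
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

# Convert Linear layers to int8 weights; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)       # same interface, smaller weights
```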
Yeah, and along those lines of creating more efficient and better models that don't need to be so gigantic,
one thing that a lot of people don't think about is the cost to the planet. And so that never
comes up when you're talking about the different models and, wow, look at how incredible this is.
But these things, like David said, yeah, we have infinite compute and we have infinite storage,
but do we really want to use all of that
and see how infinite it really is? So that's something that also I think needs to be kept in mind as we
move forward. Yeah, that's a, that's a really good point. Yeah. Do we want to use infinite energy to
see where we can go with this thing? So, well, thank you guys so much. It's great to have you here on the
Utilizing AI podcast. Before we go, where can people connect with you and follow your thoughts
on enterprise AI and other topics? And is there something that you've done recently that you're
particularly proud of? So David, let's go with you first. Yeah. So yeah, first off, thank you
so much for having us again. It was an absolute pleasure. And I think the best place to reach me is on LinkedIn. So David Aponte, you can just find me. And something that I've worked on
that I'm proud of, I think I'll just say this, in my work at Benevolent AI was a great learning
experience. I helped set up a lot of the machine learning infrastructure there. And I know that
the team there is still working hard on that. A lot of those learnings have benefited me and I've been able to bring into
my new work at Microsoft. So I am proud of the work that I've done. As for where that left off,
I'm not sure; I'd need to reach out to my coworkers. But I am very proud of the work
that I've done in machine learning infrastructure, as it's been fun to work at the intersection of the research side of
things and the production side of things.
And the system design reviews that...
Oh yeah. Sorry. Yeah, that is something.
Thank you for bringing it up.
We are doing something very cool that I am very excited about where we're
going to be reviewing system designs at many different companies.
So for example, we have one coming out soon at Pinterest where we're going to go over their real time image similarity architecture.
And we're going to be going through these system designs and really breaking it down into a way that practitioners can use that to learn that skill set of system design and architecture,
but also as like a starting point for maybe systems that they need to build in their workplace.
And so it's going to be a bit of like a master class in system design.
So stay tuned for that. I am very excited for that.
Vishnu, Demetrios, and I are working very hard on that.
Yeah, and that's something that David has spearheaded and has been really doing a great job on. I think that right now what we're seeing from the MLOps community
is a dire need to get more technical because what you see when it comes to MLOps right now
is so many different articles that just explain what MLOps is or maybe toy projects that people have. But this is really like productionizing at a large
scale in large companies. And I think it's super cool. So I'm also proud of that. You can reach me
on LinkedIn too or in the MLOps community Slack. I'm very active there. And I think that's it for us. Well, thank you.
And I do want to say the MLOps Community Podcast, if you're not subscribed to it, definitely go subscribe to that.
It's a great, great podcast.
How about you, Steph?
Anything exciting in your life? I've just finished up a series on developer velocity, looking at how businesses can do
data, machine learning, and software engineering as a key value driver inside an organization,
based off a lot of research from Microsoft and McKinsey. And I cover some of the topics that we discussed today, like: will MLOps be eaten?
There being too many "something ops", and things like that, in that series.
I also provide a great use case for getting a tools budget, which I know some businesses and tech teams can struggle with.
And that's available at bit.ly slash developer velocity.
So those are available online.
Well, thank you so much.
And thank you for joining us here on your first stint
as the co-host of Utilizing AI.
It's wonderful to have you here.
And thank you everyone for joining us.
As for me, I'm really excited
that we just wrapped our second AI Field Day event. We will have another
one in the future. We would love to have you join us at that event. Just go to techfieldday.com,
click on delegates, and you can submit your name to become a delegate or click on sponsors,
and you can learn how the companies sponsor and present at those events. Also, of course,
we would love to have you join us here on the Utilizing AI podcast. And I will finally give one last shout out to the MLOps community, which really opened a lot of doors to me.
And I really appreciate the time and effort that David, Demetrios, and everybody else puts into that.
So thank you very much for joining us.
If you enjoyed this discussion, please do subscribe to the podcast.
It really does help to have more subscribers.
We're basically on every podcast thing out there. This podcast is brought to you by
gestaltit.com, your home for IT coverage from across the enterprise. For show notes and more
episodes, go to utilizing-ai.com, or you can follow us on Twitter at utilizing underscore AI.
Thanks, and we'll see you next time.