Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 2x23: Overcoming the Obstacles of AI Application Development with Snorkel AI
Episode Date: June 8, 2021

Developers of AI applications face many obstacles, but the chief challenge is simply that these are different from traditional software development projects. 85% of businesses say they are looking to adopt AI, but a similar percentage of data science projects never reach production. Too many organizations approach AI application development the same way they approach other software projects. Another issue is focusing on the machine learning model rather than the data set that will be used. Devang Sachdev of Snorkel AI suggests being data-focused instead, and reducing and optimizing models instead of continually expanding the number of parameters. Another issue is the manual process of developing training data, which is time-consuming and error-prone. Finally, we must consider a process of iteration over models and training data to ensure quality. Machine learning is an excellent tool, but it requires a re-think of how a company approaches software development.

Three Questions

Is it possible to create a truly unbiased AI?
Can you think of an application for ML that has not yet been rolled out but will make a major impact in the future?
How big can ML models get? Will today's hundred-billion-parameter models look small tomorrow, or have we reached the limit?

Guests and Hosts

Devang Sachdev, VP of Marketing at Snorkel AI. Connect with Devang on LinkedIn or on Twitter at @DevangSachdev.
Chris Grundemann, Gigaom Analyst and Managing Director at Grundemann Technology Solutions. Connect with Chris at ChrisGrundemann.com or on Twitter at @ChrisGrundemann.
Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.

Date: 6/08/2021 Tags: @SFoskett, @ChrisGrundemann, @SnorkelAI, @DevangSachdev
Transcript
Welcome to Utilizing AI, the podcast about enterprise applications for machine learning,
deep learning, and other artificial intelligence topics.
Each episode brings in experts in enterprise infrastructure to discuss applications of
AI in today's data center.
Today, we're discussing the obstacles of AI application development.
First, let's meet our guest, Devang Sachdev.
Hey, everyone.
My name is Devang.
I am the VP of Marketing at Snorkel AI.
And you can find me on Twitter, on LinkedIn,
or on Gmail at Devang Sachdev.
So yeah, just look for that and you can find me there.
And I'm Chris Grundemann,
part-time co-host of the Utilizing AI podcast, full-time consultant and content creator. You can learn more at
chrisgrundemann.com. And of course, I'm Stephen Foskett, full-time host of Utilizing AI, also
organizer of Tech Field Day and publisher of Gestalt IT. You can find me on Twitter and most
other social media networks at SFoskett. Devang, when we
were talking previously, we were talking about some of the challenges of actually developing
AI applications and the fact that this is one of the things that's holding back the development of
AI in the enterprise. Maybe we can start off with just a bit of an overview from you about what are
the challenges that companies face when developing new applications that use AI technology? Yeah, you bet. First of all, Stephen and Chris, thank you so much for
having me. And thanks to your listeners as well. I'm a big fan of the show and I've found a lot of
insightful conversations on utilizing AI podcasts. So it's a pleasure and a privilege to be here.
So the topic today, we're talking about obstacles for AI development. Just taking a step back, looking at the big picture,
if things trend to be the way they are going right now, the AI industry is expected to pass
half a trillion dollars in terms of its market size by 2024. That means that this would be probably one of
the fastest growing technologies ever, right? In fact, there's a huge adoption in terms of
AI technologies. A recent survey from O'Reilly pointed out that 85% of organizations in 2020
have mentioned that they are looking to adopt AI. And this is up from just 24% in 2019, right?
So great momentum in terms of folks wanting to put this amazing technology to use.
Yet at the same time, we have information, we have data, and even anecdotally, we know
that most, specifically 87%, of data science projects never reach production.
So why is there such a big disconnect between the aspirations of an AI practitioner, the
aspirations of the IT community, the developer community, and putting or utilizing AI, mind
the pun, in real life?
So there are a few different areas where I believe the challenges lie.
And as part of being at Snorkel, I've had the privilege to work with some of the largest
organizations in the world, that is, two of the top three US banks, several government
agencies, large telecommunication providers, insurance providers, and just see from a vantage point
that in spite of having large pools of talent, in spite of having spent millions of dollars,
if not more, even the largest organizations in the world run into some of the same issues
that organizations that are new to AI run into as well.
And it really comes down to the approach
that most AI practitioners are taking.
On one end, AI practice in a lot of places
is looked at as an extension of software development,
as a tack-on to software development.
I think we need to look at this with a different lens,
with even a fresh outlook to say
that possibly AI development,
even though it's software, is significantly different.
And then second, there's an emphasis,
or has been an emphasis, rightfully so, on developing AI while keeping model first,
or taking a model-first approach, which means spending a lot of energy and focus on really
choosing and developing the model. But we see a shift taking place where rather than the focus
being on model-first development, the focus is more and more becoming the data that is required to train the model,
or as we call it, the data-centric approach
to AI development.
So in reality, those are the two big areas.
I would love to drill down a bit further
into each of these, but in a nutshell,
those are the challenges that I see from my vantage point.
Yeah, so just to kind of maybe restate that
a little bit differently when I'm looking at this
from, you know, in terms of a company that wants to develop artificial intelligence, whether it's for internal applications or for customer applications, you know, there's really kind of the data itself, obviously, the algorithms, and then, you know, the combination of data and algorithms becomes models. And so obviously each of those could be a pinch point.
And there's, you know, storage issues for the data itself.
There's compute and processing issues for training the models themselves.
And so it sounds like you're saying that the shift from being model first to being data
first has shifted where some of those bottlenecks line up.
Obviously, in some cases, right, and not that it's completely easy to just wave a credit
card around and have more GPUs or more CPUs or more of whatever, but a lot of those are definitely
solvable problems.
So I wonder if this is angling towards how we use data and does that actually present
a bigger challenge than just looking at
model first and stacking more, you know, GPUs in a data center? Yeah, I think you hit it on the
head right there, Chris, which is the two things that have taken place in the last decade, specifically
in the last five years, right? I had spent some part of my career at NVIDIA, where I was a product manager and
engineering lead for GPU computing, which is now fueling a lot of GPU-based compute.
And I was just looking at the most recent releases, NVIDIA being the leader in the space,
that compute is definitely getting faster and faster. If you look
at the Ampere architecture, which is the latest one from NVIDIA, that offers five petaflops of
performance in one box. And this was unheard of, you know, even just in 2013 when I was there,
we had a deployment for my product at Oak Ridge Supercomputing Lab, and reaching a petaflop of performance was like a massive feat.
Seven, eight years later, we're putting this in a box.
So with every architecture, and it's not just NVIDIA, it's even more so with specialized hardware from Google or SambaNova Systems or Graphcore,
we are getting two to three times faster than the previous generation
at the very least, and even more so. So compute is becoming much more accessible and available,
and especially with cloud, the accessibility challenge is much lower. And then on the other
end, the models themselves are getting bigger and bigger, more complex, but also more available and more open source.
So 2018, we were all amazed when the first BERT came out
from Google, a transformer-based model.
It had 340 million parameters.
And that was like mind-blowing.
And a year later, OpenAI publishes GPT-2,
which is, I believe, one and a half billion parameters.
And we're like, wow, we're in the billion parameter range now.
And the very next year, GPT-3 comes out, 175 billion parameters, right?
So these models are becoming increasingly complex, and the compute is becoming increasingly
faster.
That means that the models are getting more data hungry,
or the application development process is getting more data hungry.
So Google in 2020, early 2020,
they published a chatbot called Meena.
This was based on the LaMDA architecture
that they talked about.
And it had 2.6 billion parameters,
but it was trained using 40 billion words.
It was an NLP-oriented chatbot.
A corpus of 40 billion words that's labeled is just unthinkable, especially when you're working on practical applications.
And while Meena was a great proof of concept, it's probably fueling some of Google's technology as
well. Most organizations don't have that luxury, right? Most organizations are looking to do very specific, like you said, business or customer
facing or internal facing enterprise applications.
And in order to build those applications, the data that they need is often private data,
meaning they can't ship it to someone third party to label it and get it back.
Or the data is complex. So it's,
you know, electronic medical records or it's financial records, maybe it's network analysis,
things that require subject matter experts to look and really help the machine understand what is
what. And then lastly, data constantly changes. So whatever you train your model with is a snapshot of that
data. When it goes in the wild,
the data is constantly changing. And if the pace of change of your data is really rapid,
you need to think through your training data collection or creation process for it to be a
lot more scalable. So particularly those two things are yielding this bottleneck of data. And then at the end of the day, we're also seeing that a bigger model is not necessarily a better model. Bigger models are definitely harder to train. There's value, as a research accomplishment, in creating a larger and larger model. But in a practical setting, would you rather have one very large, complex model
that's 90% accurate, or would you rather have 10 smaller models that are 99% accurate and take
a fraction of the cost to develop, train, and deploy? I think a private agency or an enterprise
would go for the latter option, the latter option just being a harder one to implement as we move into production-level settings.
I think one of the challenges here is that there's sort of a disincentive to optimization simply because so many companies are working at, as you mentioned,
developing faster and faster, you know, ever more capable GPUs and ML processors, and additionally, the components
that surround those.
So we've talked on the show here about the rapid development of network and IO connectivity
and flexibility in terms of performance.
And we've talked about development in terms of better and better
storage capacity. We've certainly talked about the development of processors and chips. And this
serves as a disincentive because frankly, if you can build a model bigger and bigger and the
hardware can handle it, why not? I think that that's maybe the mindset that some people go into these things with. Is
that right? Most definitely. Having been on the other side, where building the next
architecture was the grand challenge and being able to beat benchmarks was the grand challenge,
I think that is still very important and required for the industry to move forward.
I mean, we're looking at the horizon. We're looking at quantum computing.
We're looking at next generation of even hardware technologies.
That's certainly going to aid in moving technology forward.
But at the same time, when it comes to AI practice,
it's a lot more than just putting components together, right? It's a matter of
building a discipline around it. And we've done this very successfully over several iterations
with just pure software development. If you remember, there used to be the waterfall ways
of doing software development, and that's transitioned to agile, and now we have scrum
teams. And if you think about a typical software scrum team,
it usually has a project manager or program manager or product manager who interfaces with
the business, collects your requirements, translates that into some sort of an architecture
with the help from a technical lead or an architect, hands it off to maybe a few developers.
And you might have some QA folks
who will QA that software that they've built.
And then you hand it off to your DevOps team
or your IT team.
And then they get to put all the goodness
that they've built in the infrastructure
to deploy this application.
But this team looks very different
when it comes to AI development.
You know, I remember reading somewhere that Jeff Bezos likes to say that, you know, our development teams should only be as big as can be fed with two pizzas.
I think with AI development teams, you might need a food truck to be able to feed the team, because there are a few other players involved too, right?
There's a data scientist.
There's often data engineers.
So like you said, you are developing faster and faster compute.
You are developing bigger, faster storage.
So you need folks like data engineers to craft pipelines
on top of that infrastructure to be able to move data from data lakes
out into application development platforms,
or from those developed applications
into warehouses,
and even put that warehouse data to use too.
You would probably have folks
like machine learning engineers
who are involved in this pipeline.
You could have folks
from machine learning operations team
in addition to your DevOps team who are looking at monitoring your models once they've been put into production.
You're looking at interpretability or auditability. So this whole domain of how we do AI development
is still very nascent. And in some sense, it's quite exciting, because we get to learn from our
mistakes, or from what has worked for us in software development, and contrast it and ask what's going to work for us in this new paradigm of AI development and what's not.
In a way, it reminds me of what happened with software development generally when what had been once seen as something of an art became much more of a practice, a business
practice. And, you know, it's a maturity in a way that you go from people experimenting with
machine learning and experimenting with models and trying to make something bigger, better,
more interesting. And then you go to basically, how do we do something productive and practical? And
that's kind of the theme of the podcast, after all. So is that how you see it?
You're totally right that software has almost become everybody's business. It's no longer
a segment in the market. It's just everybody does software. Every company is a software company.
And there are particular practices that we put in place in order for us to be able to
get here, right?
One, I think we've done a really good job as just practitioners to be able to decompose
applications, right?
And so we were able to look at this larger business problem and say, what are the underlying
apps that we need to develop in order to solve for this larger business problem?
And then underneath that particular application, we're very good at saying,
what are the functions that we need to now develop as microservices or what are microservices
available out there in the market that we can implement and put to use so that we have this
full application system available to us? And if you were to think about breaking down
a larger software application into microservices, the job becomes relatively easy because there's a
dedicated team, there's a dedicated focus, there's a dedicated QA effort, and then you're testing for
local correctness or local quality, and then you're applying the same techniques to a global quality,
and that's how you get your applications published. When it comes to AI development,
just to contrast that a little bit, decomposition of AI apps is still very nascent.
So one of the classical examples of an AI task would be classification. So if you think about it, you are trying to classify
job codes: you're an expense software company and you're analyzing how different job codes map to different
expenses coming from different individuals. You are looking through the Bureau of Labor Statistics'
published job codes. There are about 800 job codes published on that list.
And your AI application's task would be to classify individual receipts, or the individuals themselves, into one of these 800 classes. That's a practical example, right? A similar example could be classifying companies. There's a published list of standard industry codes, about 1,200 items long. So this is a classic problem. And when it comes to real production apps, you're looking at these very
fine, shades-of-gray distinctions between some of these classes, versus research, where you're
looking at very broad classes and very few classes. So now the question becomes, is there a way
to decompose this classification challenge from an AI point of view? Yes, there is. There's a lot
of research that's been published on decomposing applications as well.
One way to do it is to build multiple classifiers.
But here's the rub, right?
When you're building multiple classifiers,
you have to make sure that any mistakes
that each of these classifiers make
are going to be independent of each other.
If they feed into each other,
then as a whole,
you're going to have a lower quality application.
And the worst part is that then you're playing the game of whack-a-mole, trying to discover
which of my classifier is going wrong where, rather than being able to very precisely pinpoint
that this is the area where I need to spend more time fixing my application.
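One hedged way to decompose a job-code classifier like this is a coarse-to-fine cascade, sketched below in Python with scikit-learn. The toy data, the group split, and the helper predict_job_code are assumptions for illustration, not Snorkel's or the speaker's actual approach. Note how a mistake in the coarse stage feeds directly into the fine stage: that is the dependent-error, whack-a-mole risk described above.

```python
# Hypothetical coarse-to-fine decomposition of a large multi-class classifier (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Assumed toy data: text descriptions, a coarse occupation group, and a fine-grained job code.
train_texts = [
    "staff registered nurse, night shift",
    "family nurse practitioner",
    "backend software engineer",
    "mobile app developer",
]
train_groups = ["healthcare", "healthcare", "tech", "tech"]
train_codes = ["29-1141", "29-1171", "15-1252", "15-1254"]  # illustrative codes

# Stage 1: a coarse classifier routes each record to a major group.
coarse = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
coarse.fit(train_texts, train_groups)

# Stage 2: one fine-grained classifier per group, trained only on that group's records.
fine = {}
for group in set(train_groups):
    idx = [i for i, g in enumerate(train_groups) if g == group]
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit([train_texts[i] for i in idx], [train_codes[i] for i in idx])
    fine[group] = clf

def predict_job_code(text: str) -> str:
    group = coarse.predict([text])[0]      # a mistake here propagates into the next stage,
    return fine[group].predict([text])[0]  # which is the dependent-error problem described above

print(predict_job_code("pediatric nurse practitioner"))
```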
So it sounds like the approaches to the actual classifying of the data, obviously,
but just more generally, maybe the approaches to
how to get data into a model, are the main obstacle, right? And enabling the right data
to get into the model, but also in an unbiased way and at the right fidelity, I would guess, is
just as important. Yes. So again, thinking this through, just contrasting software development versus AI development: Chris, if you were writing software, you would sit down and say,
what is it logically that I'm trying to accomplish?
And then you'll pick a language
based on your personal preference
or organizational preference
or some specific function that you want to write
like a front-end development or back-end development.
But at the end of the day,
your major input as a human being is logic
and your output is code.
With AI development, things have changed, right?
Things have changed because your input is actually the data, the training data that you are crafting,
curating, creating, and then you are giving this data to a model and you're letting that model,
especially deep learning models, they discover solution spaces on their own. That's why they're
so powerful. You're giving these examples to the model
and then the model is able to develop code
that is then used to make the decisions that it makes.
And we hope that they make them as accurately
as we want them to.
So training data on its own
has now become the interface to write code.
Question is, how do you generate this training data? Today, you generate this
training data almost like you used to generate code through punch cards, right? You would punch
every single card with every single instruction, you would put it in this deck, and then the machine
would read it. And, you know, imagine doing something like that today, you know, we wouldn't
be flying on planes, or we wouldn't be talking over Zoom if that's what we were doing. So to me, the way we're generating training data
today is in the punch card ages, right? We're looking at every single data point. We're saying,
should I label this data point A or should I label this B or should I label this C?
Or one of the other 797 classes? I do need to generate data to show my model what each class looks like, right?
And not only do we need to show one example, we need to show several examples.
And the more examples we show, the better the model gets, right?
So there's the manual way of labeling, which is both the current technique and the current
blocker. Again, a lot of organizations that are just getting started rely on this approach,
whether they do it in-house or through third party. And with Snorkel, particularly with the
research that we've done, we focused on developing what's called a programmatic way of creating training data.
So what does programmatic way do?
So number one, instead of taking each
and every data point by hand,
you are looking at what are the heuristics
that a subject matter expert understands.
What are the different rules or intuition
that we have about the data?
And then how do we use a simple tool,
whether it's a UI based tool or a code based tool
to translate and encode those heuristics, that information,
those rules, that intuition into simple,
what we call labeling functions,
to then generate a training dataset.
Now this training dataset might have some noisy data,
as we call it, and might not be very precise.
But then you take it through this loop of iteration:
you train your model to begin with
using this particular training data set,
then you inspect that model,
and then you come back and iterate,
not just on your model,
which is what people typically do,
but also on your training data.
So at this point, you are not only showing your model more examples,
you're also showing it better examples
because you have a full iteration loop,
just like we have with software development.
You're able to yield a much more accurate, higher-quality model.
Because of this iterative cycle,
you're no longer doing things in operational silos.
You're doing this as a team, a collaborative team that works on one platform.
So you're able to publish models more rapidly.
Because you're not looking at every single individual data point, you're able to do labeling at scale, but do it in a private way.
So keeping your data private or even having proxy data so that you're not looking
at actual data, which might be for compliance reasons. We know for government use cases,
that's definitely a case and more so with new consumer data protection compliance requirements
as well. And then at the end of the day, because you have your labeling done through software techniques, when your data changes or when your business objectives change, even when the model is in production, you can come back and adapt your application relatively quickly, rather than having to relabel all your data sets by hand from the beginning. So the training data bottleneck is real, but at the same time, rather than focusing
just on the model and iterating and trying to tweak the model, focusing on the data is important.
And there are techniques like programmatic labeling, weak supervision that can help you
adopt a data-centric approach rather than being stuck behind manual labeling as well.
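As a minimal sketch of what programmatic labeling can look like, here is an example using the open-source Snorkel library's labeling-function API. The job-title scenario, the title column, and the specific heuristics are hypothetical, not the speaker's production pipeline or Snorkel Flow's internals. The pattern is: encode subject matter heuristics as labeling functions, apply them to get a matrix of noisy votes, and let a label model combine those votes into probabilistic training labels that you then iterate on alongside the model.

```python
# Hypothetical sketch: programmatic labeling with the open-source Snorkel library.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, SOFTWARE_DEV, REGISTERED_NURSE = -1, 0, 1  # two of the many job-code classes

@labeling_function()
def lf_title_engineer(x):
    # Encode a subject matter expert's heuristic: "engineer"/"developer" suggests a software job code.
    return SOFTWARE_DEV if any(k in x.title.lower() for k in ("engineer", "developer")) else ABSTAIN

@labeling_function()
def lf_title_nurse(x):
    # Another heuristic: the word "nurse" suggests a nursing job code.
    return REGISTERED_NURSE if "nurse" in x.title.lower() else ABSTAIN

# Assumed toy DataFrame with a free-text `title` column (illustrative only).
df_train = pd.DataFrame({"title": ["Senior Software Engineer", "Registered Nurse, ICU", "Account Manager"]})

# Apply the labeling functions to produce a (num_examples x num_LFs) matrix of noisy votes.
applier = PandasLFApplier(lfs=[lf_title_engineer, lf_title_nurse])
L_train = applier.apply(df=df_train)

# The label model learns to weigh and combine the noisy votes into probabilistic training labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=200, seed=123)
probs = label_model.predict_proba(L=L_train)

# Iteration loop: train an end model on `probs`, inspect its errors, then come back and
# add or refine labeling functions, iterating on the training data rather than only on the model.
```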
Yeah, a lot of that's really amazing and really powerful.
I do want to roll back to something you said in the middle there and just kind of underline
it because I think it's a fairly big statement that could be hidden and just slide away,
which is that programmers used to apply logic to create code, and now programmers in a machine learning, artificial intelligence world,
are providing data or specifically labeled data and training data to create code.
And that's a big paradigm shift to me as far as how we are creating code
and what programming means and how all this even works at a very fundamental level.
And so to me, that ties in
with a lot of the things you said earlier
about how big the team has become
and there's all these new roles.
And I think that one of the biggest obstacles
it sounds like to AI application development
is not just that we need to buy more pizza,
but that we really need to rethink
the way we're developing applications
to move into a machine learning world.
Is that fair and accurate? That's 100% accurate. And that paradigm shift is going to
come easy to some organizations that have good data practices that are orienting themselves to
a data scientist-led development rather than a pure programmer-based
development. Because the criteria that you're using to develop an AI application are quite
different, right? You're thinking about what is going to be the eventual quality of this application.
With software development, or traditional ways, it's a little binary, right? When you look at the performance of a software application,
it's either: does it do the job or does it not do the job? But with AI applications,
it's not that binary. It's 99% accurate or it can be 50% accurate. And both are okay,
depending upon what are you trying to do, right? If you're building a recommendation engine,
it's okay to have some inaccuracy, but if you are building some life critical or mission
critical application, you better make sure that it is more accurate than not. And if it is
inaccurate, you know when it is inaccurate, and you have mechanisms to detect that inaccuracy
and then present a different, alternative path. You have things like
interpretability, which is why did it take this action? You have challenges with just data
cleansing in general, because it's not as if all the data that you have is ready to be put to
machine learning use. So for some organizations, this shift is going to be easy. For others,
it needs to be intentional. But the sooner that they orient themselves with this mindset,
the quicker they'll be able to achieve success.
Well, thank you very much for that.
I'm wondering as we kind of near the end here,
is there one takeaway message
that you'd like to deliver to the audience
on how they can improve AI application development?
Yeah, you know, machine learning is a
fantastic tool. It's one of the many in your development tool belt, but it definitely requires
a rethink or a reframe of how you're approaching software development. It does have a different
cast or an additional cast of characters, and more so, fundamentally, the approach is to be more data-centric
and training data-oriented rather than just logic- or model-oriented,
which has worked great for legacy development.
But as we're moving forward in this new paradigm,
being data-centric will help you accelerate your efforts.
And Chris, what do you think?
Is that practical for companies?
Do you think they're going to be able to do that?
I think they have to.
And I definitely like the approach that Devang got into a little bit there that Snorkel's
taking with this programmatic approach to labeling data, which I think actually makes
it more accessible to more folks to be able to get this right versus trying to do everything manually themselves. Well, thank you so much. Now, before we go, before we sign off,
let me quickly jump into the fun lightning round here. We've got three questions for Devang,
and none of these are things that he's been warned about, though if he listens to the podcast,
he might have heard them before. I picked three of them based on the topics of our conversation here and also based on what I would
love to hear from him. And I've added a new question. New question, new question. So let's
jump right into it. First of all, one of the things that comes up on utilizing AI quite a lot
is bias in data sets and models. And I'm wondering, do you think that it's possible to create a truly unbiased AI?
There's an academic answer to this, and then there's a practical answer to this.
I believe AI should continue and will continue to be human-driven.
And humans inherently are biased creatures as much as we don't want to be.
But rather than thinking about building AI that is completely unbiased,
I think we should think about what are the ways in which we can detect bias and act on it.
So rather than saying we are going to prevent bias from the beginning, it should be more about bias management than prevention.
All right. I like that answer.
Next up, can you think of one application of machine learning that has not yet been rolled out, but will have a major impact in the future?
And maybe this is a bit of a challenge because I'm putting you on the spot here, but
is there something you said, you know what, machine learning would be really good at that?
Oh, so many things come to my mind. Childcare, spousal satisfaction, self-spousal care. No, I mean, I think more practically, if I can just get a good
meal built using some smart technology that can read my mind and say, you know, today's Tuesday in June, and you must be craving a fresh pasta salad,
and that fresh pasta salad is produced for me, I think that would be my ideal ML aspiration.
You know, that's such a good idea, isn't it? Especially if it knew what ingredients you had
on hand, and it could say, you know what, you haven't had vegetable soup in
a while. How about we make that today? Yeah, I'll definitely lose some pounds if AI were to feed me.
All right, finally, you've inspired a new question I'm going to ask people in the future,
so I'll kick it off with you. And that is, how big can ML models get? Will today's 100 billion parameter models
look small tomorrow or have we reached a limit?
We have in no way reached a limit.
I think they're going to get much, much bigger
before they get any smaller.
I'll give you a funny anecdote.
So I was sitting with one of NVIDIA's customers back in the day.
I think it was 2006, 2007.
I was a young engineer.
I was a proud engineer.
I helped build this GPU that had 1 billion transistors.
And I was very excited.
And I was telling the customer all about how we were able to fit a billion transistors
in this little tiny one inch by one inch semiconductor chip. And if you look at, I don't even know how many billions of transistors are
in a GPU anymore, but I'm sure that they are over several hundred billion, maybe somewhere in that
magnitude. But in the span of 15 years, we've gone from a billion to not just more transistors in a single GPU or a single device, but also, you know, all the devices that are connected together and how many transistors they collectively represent.
And we've done this at a physical level, right?
Like this is actual real things that we have developed.
So when it comes to models, it's still software.
It's all virtual. So for me,
if history tells me anything, we have barely scratched the surface, and get ready for models in
the trillions and the gazillions of parameters, and numerous models of that size, and then also
billions of smaller models. I want to be the first to develop a gazillion
parameter model. So Devang, thank you so much for joining us today. Where can people connect
with you and follow your thoughts on enterprise AI and other topics? Yeah, really easy. Devang
Sachdev, you can find me on LinkedIn, Twitter, or you can also find me on Gmail, which is
devangsachdev@gmail.com.
Great. Thanks. And how about you, Chris? What are you into these days?
Yeah, having great conversations on LinkedIn. Also, you can follow me on Twitter at ChrisGrundemann or check out the website for kind of everything else, chrisgrundemann.com.
And as for me, I'm pretty excited that we just pulled off our second AI Field Day event.
If you go to youtube.com slash tech field day, you'll find
the video recordings of all the presentations from AI Field Day number one and number two.
And of course, AI Field Day number three will come next year. So thank you so much for joining us for
the Utilizing AI podcast. If you've enjoyed this discussion, remember to subscribe, rate and review
the show. That really does help. And please do share it with your friends and colleagues. This podcast is brought to you by
gestaltit.com, your home for IT coverage from across the enterprise. For show notes and more
episodes, go to utilizing-ai.com or find us on Twitter at utilizing underscore AI.
Thanks for joining us and we'll see you next time.