Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 2x23: Overcoming the Obstacles of AI Application Development with Snorkel AI
Episode Date: June 8, 2021

Developers of AI applications face many obstacles, but the chief challenge is simply that these are different from traditional software development projects. 85% of businesses say they are looking to adopt AI, but a similar percentage of data science projects never reach production. Too many organizations approach AI application development the same way they approach other software projects. Another issue is focusing on the machine learning model rather than the data set that will be used. Devang Sachdev of Snorkel AI suggests being data-focused instead, and reducing and optimizing models instead of continually expanding the number of parameters. Another issue is the manual process of developing training data, which is time-consuming and error-prone. Finally, we must consider a process of iteration over models and training data to ensure quality. Machine learning is an excellent tool, but it requires a re-think of how a company approaches software development.

Three Questions

Is it possible to create a truly unbiased AI?
Can you think of an application for ML that has not yet been rolled out but will make a major impact in the future?
How big can ML models get? Will today's hundred-billion-parameter models look small tomorrow, or have we reached the limit?

Guests and Hosts

Devang Sachdev, VP of Marketing at Snorkel AI. Connect with Devang on LinkedIn or on Twitter at @DevangSachdev.
Chris Grundemann, Gigaom Analyst and Managing Director at Grundemann Technology Solutions. Connect with Chris at ChrisGrundemann.com or on Twitter at @ChrisGrundemann.
Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.

Date: 6/08/2021 Tags: @SFoskett, @ChrisGrundemann, @SnorkelAI, @DevangSachdev
Transcript
Welcome to Utilizing AI, the podcast about enterprise applications for machine learning,
deep learning, and other artificial intelligence topics.
Each episode brings in experts in enterprise infrastructure to discuss applications of
AI in today's data center.
Today, we're discussing the obstacles of AI application development.
First, let's meet our guest, Devang Sachdev.
Hey, everyone.
My name is Devang.
I am the VP of Marketing at Snorkel AI.
And you can find me on Twitter, on LinkedIn,
or on Gmail at Devang Sachdev.
So yeah, just look for that and you can find me there.
And I'm Chris Grundemann,
part-time co-host of the Utilizing AI podcast, full-time consultant and content creator. You can learn more at
chrisgrundemann.com. And of course, I'm Stephen Foskett, full-time host of Utilizing AI, also
organizer of Tech Field Day and publisher of Gestalt IT. You can find me on Twitter and most
other social media networks at SFoskett. Devang, when we
were talking previously, we were talking about some of the challenges of actually developing
AI applications and the fact that this is one of the things that's holding back the development of
AI in the enterprise. Maybe we can start off with just a bit of an overview from you about what are
the challenges that companies face when developing new applications that use AI technology? Yeah, you bet. First of all, Stephen and Chris, thank you so much for
having me. And thanks to your listeners as well. I'm a big fan of the show and I've found a lot of
insightful conversations on utilizing AI podcasts. So it's a pleasure and a privilege to be here.
So the topic today, we're talking about obstacles for AI development. Just taking a step back, looking at the big picture,
if things trend to be the way they are going right now, the AI industry is expected to pass
half a trillion dollars in terms of its market size by 2024. That means that this would be probably one of
the fastest growing technologies ever, right? In fact, there's a huge adoption in terms of
AI technologies. A recent survey from O'Reilly pointed out that 85% of organizations in 2020
have mentioned that they are looking to adopt AI. And this is up from just 24% in 2019, right?
So great momentum in terms of folks wanting to put this amazing technology to use.
Yet at the same time, we have information, we have data, and even anecdotally, we know
that most, specifically 87%, of data science projects never reach production.
So why is there such a big disconnect between the aspirations of an AI practitioner, the
aspirations of the IT community, the developer community, and putting or utilizing AI, mind
the pun, in real life?
So there are a few different areas where I believe the challenges lie.
And as part of being at Snorkel, I've had the privilege to work with some of the largest
organizations in the world, that is, two of the top three US banks, several government
agencies, large telecommunication providers, insurance providers, and just see from a vantage point
that in spite of having large pools of talent, in spite of having spent millions of dollars,
if not more, even the largest organizations in the world run into some of the same issues
that organizations that are new to AI run into as well.
And it really comes down to the approach
that most AI practitioners are taking.
On one end, AI practice in a lot of places
is looked at as an extension of software development,
as a tack-on to software development.
I think we need to look at this with a different lens,
with even a fresh outlook to say
that possibly AI development,
even though it's software, is significantly different.
And then second, there's an emphasis,
or has been an emphasis, rightfully so, on developing AI while keeping model first,
or taking a model-first approach, which means spending a lot of energy and focus on really
choosing and developing the model. But we see a shift taking place where rather than the focus
being on model-first development, the focus is more and more becoming the data that is required to train the model,
or as we call it, the data-centric approach
to AI development.
So in reality, those are the two big areas.
I would love to drill down a bit further
into each of these, but in a nutshell,
those are the challenges that I see from my vantage point.
Yeah, so just to kind of maybe restate that
a little bit differently when I'm looking at this
from, you know, in terms of a company that wants to develop artificial intelligence, whether it's for internal applications or for customer applications, you know, there's really kind of the data itself, obviously, the algorithms, and then, you know, the combination of data and algorithms becomes models. And so obviously each of those could be a pinch point.
And there's, you know, storage issues for the data itself.
There's compute and processing issues for training the models themselves.
And so it sounds like you're saying that the shift from being model first to being data
first has shifted where some of those bottlenecks line up.
Obviously, in some cases, right, and not that it's completely easy to just wave a credit
card around and have more GPUs or more CPUs or more of whatever, but a lot of those are definitely
solvable problems.
So I wonder if this is angling towards how we use data and does that actually present
a bigger challenge than just looking at
model first and stacking more, you know, GPUs in a data center? Yeah, I think you hit it on the
head right there, Chris, which is the two things that have taken place in the last decade, specifically
in the last five years, right? I had spent some part of my career at NVIDIA, where I was a product manager and
engineering lead for GPU computing, which is now fueling a lot of GPU-based compute.
And I was just looking at the most recent releases, NVIDIA being the leader in the space,
that compute is definitely getting faster and faster. If you look
at the Ampere architecture, which is the latest one from NVIDIA, that offers five petaflops of
performance in one box. And this was unheard of, you know, even just in 2013 when I was there,
we had a deployment for my product at Oak Ridge Supercomputing Lab, and reaching a petaflop of performance was like a massive feat.
Seven, eight years later, we're putting this in a box.
So with every architecture, and it's not just NVIDIA, it's even more so with specialized hardware from Google or SambaNova Systems or Graphcore,
we are getting two to three times faster than the previous generation
at the very least, and even more so. So compute is becoming much more accessible and available,
and especially with cloud, the accessibility challenge is much lower. And then on the other
end, the models themselves are getting bigger and bigger, more complex, but also more available and more open source.
So 2018, we were all amazed when the first BERT came out
from Google, a transformer-based model.
It had 340 million parameters.
And that was like mind-blowing.
And a year later, OpenAI publishes GPT-2,
which is, I believe, one and a half billion parameters.
And we're like, wow, we're in the billion parameter range now.
And the very next year, GPT-3 comes out, 175 billion parameters, right?
So these models are becoming increasingly complex, and the compute is becoming increasingly
faster.
That means that the models are getting more data hungry,
or the application development process is getting more data hungry.
So Google in 2020, early 2020,
they published a chatbot called Meena.
This was based on the LaMDA architecture
that they talked about.
And it had 2.6 billion parameters,
but it was trained using 40 billion words.
It was an NLP-oriented chatbot.
A corpus of 40 billion words that's labeled is just unthinkable, especially when you're working on practical applications.
And while Meena was a great proof of concept, it's probably fueling some of Google's technology as
well. Most organizations don't have that luxury, right? Most organizations are looking to do very specific, like you said, business or customer
facing or internal facing enterprise applications.
And in order to build those applications, the data that they need is often private data,
meaning they can't ship it to someone third party to label it and get it back.
Or the data is complex. So it's,
you know, electronic medical records or it's financial records, maybe it's network analysis,
things that require subject matter experts to look and really help the machine understand what is
what. And then lastly, data constantly changes. So whatever you train your model with is a snapshot of that
data. When it goes in the wild,
the data is constantly changing. And if the pace of change of your data is really rapid,
you need to think through your training data collection or creation process for it to be a
lot more scalable. So particularly those two things are yielding this bottleneck of data. And then at the end of the day, we're also seeing that a bigger model is not necessarily a better model. Bigger models are definitely harder to train. There's value, as a research accomplishment, in creating a larger and larger model. But in a practical setting, would you rather have one very large, complex model
that's 90% accurate, or would you rather have 10 smaller models that are 99% accurate and take
a fraction of the cost to develop, train, and deploy? I think a private agency or an enterprise
would go for the latter option, the latter option just being a harder one to implement as we move into production-level settings.
I think one of the challenges here is that there's sort of a disincentive to optimization simply because so many companies are working at, as you mentioned,
developing faster and faster, you know, ever more capable GPUs and ML processors, and additionally, the components
that surround those.
So we've talked on the show here about the rapid development of network and IO connectivity
and flexibility in terms of performance.
And we've talked about development in terms of better and better
storage capacity. We've certainly talked about the development of processors and chips. And this
serves as a disincentive because frankly, if you can build a model bigger and bigger and the
hardware can handle it, why not? I think that that's maybe the mindset that some people go into these things with. Is
that right? Most definitely. Having been on the other side, where building the next
architecture was the grand challenge and being able to beat benchmarks was the grand challenge,
I think that is still very important and required for the industry to move forward.
I mean, we're looking at the horizon. We're looking at quantum computing.
We're looking at next generation of even hardware technologies.
That's certainly going to aid in moving technology forward.
But at the same time, when it comes to AI practice,
it's a lot more than just putting components together, right? It's a matter of
building a discipline around it. And we've done this very successfully over several iterations
with just pure software development. If you remember, there used to be the waterfall ways
of doing software development, and that's transitioned to agile, and now we have scrum
teams. And if you think about a typical software scrum team,
it usually has a project manager or program manager or product manager who interfaces with
the business, collects your requirements, translates that into some sort of an architecture
with the help from a technical lead or an architect, hands it off to maybe a few developers.
And you might have some QA folks
who will QA that software that they've built.
And then you hand it off to your DevOps team
or your IT team.
And then they get to put all the goodness
that they've built in the infrastructure
to deploy this application.
But this team looks very different
when it comes to AI development.
You know, I remember reading somewhere that Jeff Bezos likes to say that, you know, our development teams should only be as big as can be fed with two pizzas.
I think with AI development teams, you might need a food truck to be able to feed the team, because there are a few other players involved too, right?
There's a data scientist.
There's often data engineers.
So like you said, you are developing faster and faster compute.
You are developing bigger, faster storage.
So you need folks like data engineers to craft pipelines
on top of that infrastructure to be able to move data from data lakes
out into application development platforms,
or from those developed applications
into warehouses,
and even put that warehouse data to use too.
You would probably have folks
like machine learning engineers
who are involved in this pipeline.
You could have folks
from machine learning operations team
in addition to your DevOps team who are looking at monitoring your models once they've been put into production.
You're looking at interpretability or auditability. So this whole domain of how we do AI development
is still very nascent. And in some sense, it's quite exciting, because we get to learn from our
mistakes, or from what has worked for us in software development, and contrast it and ask what's going to work for us in this new paradigm of AI development and what's not.
In a way, it reminds me of what happened with software development generally when what had been once seen as something of an art became much more of a practice, a business
practice. And, you know, it's a maturity in a way that you go from people experimenting with
machine learning and experimenting with models and trying to make something bigger, better,
more interesting. And then you go to basically, how do we do something productive and practical? And
that's kind of the theme of the podcast, after all. So is that how you see it?
You're totally right that software has almost become everybody's business. It's no longer
a segment in the market. It's just everybody does software. Every company is a software company.
And there are particular practices that we put in place in order for us to be able to
get here, right?
One, I think we've done a really good job as just practitioners to be able to decompose
applications, right?
And so we were able to look at this larger business problem and say, what are the underlying
apps that we need to develop in order to solve for this larger business problem?
And then underneath that particular application, we're very good at saying,
what are the functions that we need to now develop as microservices or what are microservices
available out there in the market that we can implement and put to use so that we have this
full application system available to us? And if you were to think about breaking down
a larger software application into microservices, the job becomes relatively easy because there's a
dedicated team, there's a dedicated focus, there's a dedicated QA effort, and then you're testing for
local correctness or local quality, and then you're applying the same techniques to a global quality,
and that's how you get your applications published. When it comes to AI development,
just to contrast that a little bit, decomposition of AI apps is still very nascent.
So one of the classical examples of an AI task would be classification. So if you think about it, you are trying to classify
job codes: you're an expense software company and you're analyzing how different job codes map to different
expenses coming from different individuals. You are looking through the Bureau of Labor Statistics'
published job codes. There are about 800 job codes published on that list.
And your AI application's task would be to classify individual receipts, or the individuals themselves, into one of these 800 classes. That's a practical example, right? A similar example could be classifying companies. There's a published list of standard industry codes, about 1,200 items long. So this is a classic problem. And when it comes to real production apps, you're looking at these very
fine, shades-of-gray distinctions between some of these classes, versus research, where you're
looking at very broad classes and very few classes. So now the question becomes, is there a way
to decompose this classification challenge from an AI point of view? Yes, there is. There's a lot
of research that's been published on decomposing applications as well.
One way to do it is to build multiple classifiers.
But here's the rub, right?
When you're building multiple classifiers,
you have to make sure that any mistakes
that each of these classifiers make
are going to be independent of each other.
If they feed into each other,
then as a whole,
you're going to have a lower quality application.
And the worst part is that then you're playing the game of whack-a-mole, trying to discover
which of my classifier is going wrong where, rather than being able to very precisely pinpoint
that this is the area where I need to spend more time fixing my application.
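One hedged way to decompose a job-code classifier like this is a coarse-to-fine cascade, sketched below in Python with scikit-learn. The toy data, the group split, and the helper predict_job_code are assumptions for illustration, not Snorkel's or the speaker's actual approach. Note how a mistake in the coarse stage feeds directly into the fine stage: that is the dependent-error, whack-a-mole risk described above.

```python
# Hypothetical coarse-to-fine decomposition of a large multi-class classifier (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Assumed toy data: text descriptions, a coarse occupation group, and a fine-grained job code.
train_texts = [
    "staff registered nurse, night shift",
    "family nurse practitioner",
    "backend software engineer",
    "mobile app developer",
]
train_groups = ["healthcare", "healthcare", "tech", "tech"]
train_codes = ["29-1141", "29-1171", "15-1252", "15-1254"]  # illustrative codes

# Stage 1: a coarse classifier routes each record to a major group.
coarse = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
coarse.fit(train_texts, train_groups)

# Stage 2: one fine-grained classifier per group, trained only on that group's records.
fine = {}
for group in set(train_groups):
    idx = [i for i, g in enumerate(train_groups) if g == group]
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit([train_texts[i] for i in idx], [train_codes[i] for i in idx])
    fine[group] = clf

def predict_job_code(text: str) -> str:
    group = coarse.predict([text])[0]      # a mistake here propagates into the next stage,
    return fine[group].predict([text])[0]  # which is the dependent-error problem described above

print(predict_job_code("pediatric nurse practitioner"))
```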
So it sounds like the approaches to the actual classifying of the data, obviously,
but just more generally, maybe the approaches to
how to get data into a model, are the main obstacle, right? And enabling the right data
to get into the model, but also in an unbiased way and at the right fidelity, I would guess, is
just as important. Yes. So again, thinking this through, just contrasting software development versus AI development: Chris, if you were writing software, you would sit down and say,
what is it logically that I'm trying to accomplish?
And then you'll pick a language
based on your personal preference
or organizational preference
or some specific function that you want to write
like a front-end development or back-end development.
But at the end of the day,
your major input as a human being is logic
and your output is code.
With AI development, things have changed, right?
Things have changed because your input is actually the data, the training data that you are crafting,
curating, creating, and then you are giving this data to a model and you're letting that model,
especially deep learning models, they discover solution spaces on their own. That's why they're
so powerful. You're giving these examples to the model
and then the model is able to develop code
that is then used to make the decisions that it makes.
And we hope that they make them as accurately
as we want them to.
So training data on its own
has now become the interface to write code.
Question is, how do you generate this training data? Today, you generate this
training data almost like you used to generate code through punch cards, right? You would punch
every single card with every single instruction, you would put it in this deck, and then the machine
would read it. And, you know, imagine doing something like that today, you know, we wouldn't
be flying on planes, or we wouldn't be talking over Zoom if that's what we were doing. So to me, the way we're generating training data
today is in the punch card ages, right? We're looking at every single data point. We're saying,
should I label this data point A or should I label this B or should I label this C?
Or one of the other 797 classes? I do need to generate data to show my model what each class looks like, right?
And not only do we need to show one example, we need to show several examples.
And the more examples we show, the better the model gets, right?
So there's the manual way of labeling, which is both the current technique and the current
blocker. Again, a lot of organizations that are just getting started rely on this approach,
whether they do it in-house or through third party. And with Snorkel, particularly with the
research that we've done, we focused on developing what's called a programmatic way of creating training data.
So what does programmatic way do?
So number one, instead of taking each
and every data point by hand,
you are looking at what are the heuristics
that a subject matter expert understands.
What are the different rules or intuition
that we have about the data?
And then how do we use a simple tool,
whether it's a UI based tool or a code based tool
to translate and encode those heuristics, that information,
those rules, that intuition into simple,
what we call labeling functions,
to then generate a training dataset.
Now this training dataset might have some noisy data,
as we call it, and might not be very precise.
But then you take it through this loop of iteration:
you train your model to begin with
using this particular training data set,
then you inspect that model,
and then you come back and iterate,
not just on your model,
which is what people typically do,
but also on your training data.
So at this point, you are not only showing your model more examples,
you're also showing it better examples
because you have a full iteration loop,
just like we have with software development.
You're able to yield a much more accurate, higher-quality model.
Because of this iterative cycle,
you're no longer doing things in operational silos.
You're doing this as a team, a collaborative team that works on one platform.
So you're able to publish models more rapidly.
Because you're not looking at every single individual data point, you're able to do labeling at scale, but do it in a private way.
So keeping your data private or even having proxy data so that you're not looking
at actual data, which might be for compliance reasons. We know for government use cases,
that's definitely a case and more so with new consumer data protection compliance requirements
as well. And then at the end of the day, because you have your labeling done through software techniques, when your data changes or when your business objectives change, even when the model is in production, you can come back and adapt your application relatively quickly, rather than having to relabel all your data sets by hand from the beginning. So the training data bottleneck is real, but at the same time, rather than focusing
just on the model and iterating and trying to tweak the model, focusing on the data is important.
And there are techniques like programmatic labeling, weak supervision that can help you
adopt a data-centric approach rather than being stuck behind manual labeling as well.
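As a minimal sketch of what programmatic labeling can look like, here is an example using the open-source Snorkel library's labeling-function API. The job-title scenario, the title column, and the specific heuristics are hypothetical, not the speaker's production pipeline or Snorkel Flow's internals. The pattern is: encode subject matter heuristics as labeling functions, apply them to get a matrix of noisy votes, and let a label model combine those votes into probabilistic training labels that you then iterate on alongside the model.

```python
# Hypothetical sketch: programmatic labeling with the open-source Snorkel library.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, SOFTWARE_DEV, REGISTERED_NURSE = -1, 0, 1  # two of the many job-code classes

@labeling_function()
def lf_title_engineer(x):
    # Encode a subject matter expert's heuristic: "engineer"/"developer" suggests a software job code.
    return SOFTWARE_DEV if any(k in x.title.lower() for k in ("engineer", "developer")) else ABSTAIN

@labeling_function()
def lf_title_nurse(x):
    # Another heuristic: the word "nurse" suggests a nursing job code.
    return REGISTERED_NURSE if "nurse" in x.title.lower() else ABSTAIN

# Assumed toy DataFrame with a free-text `title` column (illustrative only).
df_train = pd.DataFrame({"title": ["Senior Software Engineer", "Registered Nurse, ICU", "Account Manager"]})

# Apply the labeling functions to produce a (num_examples x num_LFs) matrix of noisy votes.
applier = PandasLFApplier(lfs=[lf_title_engineer, lf_title_nurse])
L_train = applier.apply(df=df_train)

# The label model learns to weigh and combine the noisy votes into probabilistic training labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=200, seed=123)
probs = label_model.predict_proba(L=L_train)

# Iteration loop: train an end model on `probs`, inspect its errors, then come back and
# add or refine labeling functions, iterating on the training data rather than only on the model.
```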
Yeah, a lot of that's really amazing and really powerful.
I do want to roll back to something you said in the middle there and just kind of underline
it because I think it's a fairly big statement that could be hidden and just slide away,
which is that programmers used to apply logic to create code, and now programmers in a machine learning, artificial intelligence world,
are providing data or specifically labeled data and training data to create code.
And that's a big paradigm shift to me as far as how we are creating code
and what programming means and how all this even works at a very fundamental level.
And so to me, that ties in
with a lot of the things you said earlier
about how big the team has become
and there's all these new roles.
And I think that one of the biggest obstacles
it sounds like to AI application development
is not just that we need to buy more pizza,
but that we really need to rethink
the way we're developing applications
to move into a machine learning world.
Is that fair and accurate? That's 100% accurate. And that paradigm shift is going to
come easy to some organizations that have good data practices that are orienting themselves to
a data scientist-led development rather than a pure programmer-based
development. Because the criteria that you're using to develop an AI application are quite
different, right? You're thinking about what is going to be the eventual quality of this application.
With software development, or traditional ways, it's a little binary, right? When you look at the performance of a software application,
it's either: does it do the job or does it not do the job? But with AI applications,
it's not that binary. It's 99% accurate or it can be 50% accurate. And both are okay,
depending upon what are you trying to do, right? If you're building a recommendation engine,
it's okay to have some inaccuracy, but if you are building some life critical or mission
critical application, you better make sure that it is more accurate than not. And if it is
inaccurate, you know when it is inaccurate, and you have mechanisms to detect that inaccuracy
and then present a different, alternative path. You have things like
interpretability, which is why did it take this action? You have challenges with just data
cleansing in general, because it's not as if all the data that you have is ready to be put to
machine learning use. So for some organizations, this shift is going to be easy. For others,
it needs to be intentional. But the sooner that they orient themselves with this mindset,
the quicker they'll be able to achieve success.
Well, thank you very much for that.
I'm wondering as we kind of near the end here,
is there one takeaway message
that you'd like to deliver to the audience
on how they can improve AI application development?
Yeah, you know, machine learning is a
fantastic tool. It's one of the many in your development tool belt, but it definitely requires
a rethink or a reframe of how you're approaching software development. It does have a different
cast or an additional cast of characters, and more so, fundamentally, the approach is to be more data-centric
and training data-oriented rather than just logic- or model-oriented,
which has worked great for legacy development.
But as we're moving forward in this new paradigm,
being data-centric will help you accelerate your efforts.
And Chris, what do you think?
Is that practical for companies?
Do you think they're going to be able to do that?
I think they have to.
And I definitely like the approach that Devang got into a little bit there that Snorkel's
taking with this programmatic approach to labeling data, which I think actually makes
it more accessible to more folks to be able to get this right versus trying to do everything manually themselves. Well, thank you so much. Now, before we go, before we sign off,
let me quickly jump into the fun lightning round here. We've got three questions for Devang,
and none of these are things that he's been warned about, though if he listens to the podcast,
he might have heard them before. I picked three of them based on the topics of our conversation here and also based on what I would
love to hear from him. And I've added a new question. New question, new question. So let's
jump right into it. First of all, one of the things that comes up on utilizing AI quite a lot
is bias in data sets and models. And I'm wondering, do you think that it's possible to create a truly unbiased AI?
There's an academic answer to this, and then there's a practical answer to this.
I believe AI should continue and will continue to be human-driven.
And humans inherently are biased creatures as much as we don't want to be.
But rather than thinking about building AI that is completely unbiased,
I think we should think about what are the ways in which we can detect bias and act on it.
So rather than saying we are going to prevent bias from the beginning, it should be more about bias management than prevention.
All right. I like that answer.
Next up, can you think of one application of machine learning that has not yet been rolled out, but will have a major impact in the future?
And maybe this is a bit of a challenge because I'm putting you on the spot here, but
is there something you said, you know what, machine learning would be really good at that?
Oh, so many things come to my mind. Childcare, spousal satisfaction, self-spousal care. No, I mean, I think more practically, if I can just get a good
meal built using some smart technology that can read my mind and say, you know, today's Tuesday in June, and you must be craving a fresh pasta salad,
and that fresh pasta salad is produced for me, I think that would be my ideal ML aspiration.
You know, that's such a good idea, isn't it? Especially if it knew what ingredients you had
on hand, and it could say, you know what, you haven't had vegetable soup in
a while. How about we make that today? Yeah, I'll definitely lose some pounds if AI were to feed me.
All right, finally, you've inspired a new question I'm going to ask people in the future,
so I'll kick it off with you. And that is, how big can ML models get? Will today's 100 billion parameter models
look small tomorrow or have we reached a limit?
We have in no way reached a limit.
I think they're going to get much, much bigger
before they get any smaller.
I'll give you a funny anecdote.
So I was sitting with one of NVIDIA's customers back in the day.
I think it was 2006, 2007.
I was a young engineer.
I was a proud engineer.
I helped build this GPU that had 1 billion transistors.
And I was very excited.
And I was telling the customer all about how we were able to fit a billion transistors
in this little tiny one inch by one inch semiconductor chip. And if you look at, I don't even know how many billions of transistors are
in a GPU anymore, but I'm sure that they are over several hundred billion, maybe somewhere in that
magnitude. But in the span of 15 years, we've gone from a billion to not just more transistors in a single GPU or a single device, but also, you know, all the devices that are connected together and how many transistors they collectively represent.
And we've done this at a physical level, right?
Like this is actual real things that we have developed.
So when it comes to models, it's still software.
It's all virtual. So for me,
if history tells me anything, we have barely scratched the surface, and get ready for models in
the trillions and the gazillions of parameters, and numerous models of that size, and then also
billions of smaller models. I want to be the first to develop a gazillion
parameter model. So Devang, thank you so much for joining us today. Where can people connect
with you and follow your thoughts on enterprise AI and other topics? Yeah, really easy. Devang
Sachdev, you can find me on LinkedIn, Twitter, or you can also find me on Gmail, which is
devangsachdev@gmail.com.
Great. Thanks. And how about you, Chris? What are you into these days?
Yeah, having great conversations on LinkedIn. Also, you can follow me on Twitter at ChrisGrundemann or check out the website for kind of everything else, chrisgrundemann.com.
And as for me, I'm pretty excited that we just pulled off our second AI Field Day event.
If you go to youtube.com slash tech field day, you'll find
the video recordings of all the presentations from AI Field Day number one and number two.
And of course, AI Field Day number three will come next year. So thank you so much for joining us for
the Utilizing AI podcast. If you've enjoyed this discussion, remember to subscribe, rate and review
the show. That really does help. And please do share it with your friends and colleagues. This podcast is brought to you by
gestaltit.com, your home for IT coverage from across the enterprise. For show notes and more
episodes, go to utilizing-ai.com or find us on Twitter at utilizing underscore AI.
Thanks for joining us and we'll see you next time.