Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 3x28: Revisiting Utilizing AI Season 3
Episode Date: April 25, 2022

Frederic Van Haren and Stephen Foskett look back on all the subjects covered during Season 3 of Utilizing AI. The podcast covered many topics, from religious and ethical implications of AI to the technology that enables machine learning, but one topic that stands out is data science. If data is the key to AI, then the collection, management, organization, and sharing of data is a critical element of making AI projects possible. We also continue our "three questions" tradition by bringing in open-ended questions from Rich Harang of Duo Security, Sunil Samel of Akridata, Adi Gelvan of Speedb, Bin Fan of Alluxio, Professor Katina Michael, and David Kanter of MLCommons.

Three Questions:

Stephen's Question: Can you think of an application for ML that has not yet been rolled out but will make a major impact in the future?

Frederic's Question: What market is going to benefit the most from AI technology in the next 12 months?

Rich Harang, Senior Technical Lead, Duo Security: In an alternate timeline where we didn't develop automatic differentiation and put it on top of GPUs, so this entire deep learning hardware family that we depend on now never got invented, what would the dominant AI/ML technology be and what would have been different?

Sunil Samel, VP of Business Development, Akridata: How will new technologies like AI help marginalized members of our communities: folks like senior citizens, minorities, people with disabilities, and veterans trying to reenter civilian life?

Adi Gelvan, CEO and Co-Founder of Speedb: What do you think the risks of AI are, and what is your recommended solution?

Bin Fan, Founding Member, Alluxio: I'm wondering if AI can help with a humanitarian crisis happening in the future?

Katina Michael, Professor, School for the Future of Innovation in Society, Arizona State University: If AI was to self-replicate, what would be the first thing it would do?
David Kanter, Executive Director of MLCommons: What's a problem in the AI world where you are held back by the lack of good publicly available data?

Hosts: Frederic Van Haren, Founder at HighFens Inc., Consultancy & Services. Connect with Frederic on Highfens.com or on Twitter at @FredericVHaren. Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.

Date: 4/25/2022 Tags: @SFoskett, @FredericVHaren
Transcript
I'm Stephen Foskett.
I'm Frederic Van Haren.
And this is Utilizing AI.
Welcome to another episode of Utilizing AI,
the podcast about enterprise applications for machine learning,
deep learning, data science, and other artificial intelligence topics.
In the past, we've gone through a lot of different things here on Utilizing AI.
And now to wrap up our third season before our AI Field Day event, I wanted to take a moment and just look back at season three.
Frederic, it's been a long season.
We've had a lot of episodes recorded and shared.
What do you make of it?
I really like it.
Like you said, we had many sessions, but the reality is I do like the variety.
We talked about hardware.
We talked about methodologies, DevOps, DataOps.
We talked about bias, all kinds of bias, religious, marketing.
And we even had the state of AI and the enterprise by Manoj, who works for Deloitte, which gave us a great view on where the enterprise is.
So, like I said, I really do like the variety and I find it really difficult to find one or two sessions that really stuck out because they were all quite good. Yeah, I agree. I love the curveball sessions that we do,
like the religious and ethical aspects of AI with Leon Adato. That one was absolutely amazing.
I even, you know, kind of thinking back as well, you know, we had some really interesting
discussions about the future of work and impact of AI on the third world.
Things like that that you might not come to think about here in an enterprise setting.
But we all have to think about those things, don't we?
Right, we do.
Yeah, and it's important.
I think that's the whole point of AI.
It's about learning and learning new things. And
that's also one of the reasons, you know, on the side where I like to travel, because that's where
you learn new things, right? And by having those great sessions, we kind of better understand where
AI is going and what AI can do for us, and also what AI will do for us that we don't want it to do.
Exactly. And so we talked, for example, about the invisible workers behind the algorithms and the
impact of the algorithms. But of course, also a lot of nuts and bolts about how it works. So we
talked about data infrastructure, a lot about data. How does that strike you? I mean, I know that you have a background in data science and databases.
It seems like data is the key to AI.
Right. I mean, once upon a time, you could say that the source code was the IP
and that data was kind of just used for testing.
Now it's completely flipped around, meaning that the source code is actually all open source.
Most of the frameworks are coming from the open source world, but it's the data that's now your IP.
So whoever owns the right data and the right amount of data has a foot forward on the competition. And I think that's also one of the reasons that a lot of enterprises are kind of
forced into AI because of the competitiveness and that they have data. But if you don't use
or consume that data, then it's as if you don't have the data at all.
Yeah, true. And I think that that's actually the case with a lot of enterprises where they've been consuming or collecting a lot of data on the anticipation that it would be valuable, you know, filling up data warehouses, data lakes, you know, whatever, you know, with all this information, and then hoping that they could make use of it. And I think that they see machine learning as a way to leverage that data.
But how practical is that?
I mean, I think that it's challenging to take existing data sets and make something, you
know, spin them into gold, right?
Right.
I mean, data management is still the number one limiting factor to success in AI. And one of the reasons for that is because in the early days of AI,
enterprises were told, collect a lot of data.
And what people forgot to tell is, once you collected it,
you have to do something with it.
And so a lot of people started piling up a lot of data.
And in the end, they realized,
how do we apply those ML and DL methodologies
on that data? And then it's the wake-up call, right? So what do we do? Not all data is equal.
They might have some duplicate data, some data that still needs some cleaning. But I think we're
getting there. I think with the focus more on data and data management,
I do feel that people understand that data is key and that they have to have proper methodologies to
handle data. Yeah, absolutely. And that does seem to be one of the recurring themes that we've had here.
Another theme, of course, is sort of just the nuts-and-bolts aspect
with regard to data of sort of where do we store this?
How do we store it?
How do we make it perform?
How do we make it accessible to,
whether it's for training or inferencing?
And a lot of discussion this season
focused on those sort of nuts and bolts
storage aspects.
Right.
I think one thing I really learned in season three is that, yes, source code is being shared
through open source, but also now there are hubs or data hubs for data.
And certainly in the last session we had with David Kanter, where he was talking about benchmarking
and methodologies to apply to benchmarking, he also said that the open source availability
of data and the ability to have data ready to benchmark is really key.
Yeah. And that was actually one of the highlight episodes for me was talking to him about how we're
going to be doing, how do we know whether things are performing or not? And, you know, I think that it was interesting
that it was a very pragmatic discussion. It wasn't, oh, well, you know, we got that covered.
It was much more, yeah, this is a challenge. And this is one of those things that there's a lot of
solutions to. And we're not sure how we know how things are performing.
Right.
One of the big challenges is that there's a lot of open source tools that are changing really rapidly.
And then there is the new hardware technology.
So the amount of possibilities of combinations, if you want, between all of them is very difficult
to figure out on your own.
And then once somebody figured it out, how do you replicate it?
I think one thing I see in the AI world a lot is that a lot of the advancements are
the result of repeating what somebody else did.
Maybe repeating is not the right word.
Maybe the ability to reproduce somebody else's process.
And in order to be able to reproduce it,
you need to have some kind of a benchmark
to know, you know, how do we,
if I can reproduce it,
what does it do to my system?
And what are the components I need to do to improve?
So I do think, you know,
benchmarking is important
from an absolute standpoint,
but also from a relative standpoint
towards, you know,
newer technologies and other people in the same market.
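That distinction between absolute and relative benchmarking can be sketched in a few lines. This is a toy illustration with invented numbers, not any real benchmark suite: an absolute score is just a raw measurement, while a relative score normalizes against a reference system so results from different setups can be compared.

```python
# Toy sketch of absolute vs. relative benchmarking (all numbers invented).
# An absolute score is a raw measurement; a relative score normalizes it
# against a reference system so results are comparable across setups.

def relative_score(measured_s, reference_s):
    # >1.0 means faster than the reference system, <1.0 means slower
    return reference_s / measured_s

runs = {"my_cluster": 95.0, "vendor_claim": 60.0}  # seconds per training epoch
reference = 120.0                                  # reference system's epoch time

scores = {name: relative_score(t, reference) for name, t in runs.items()}
# my_cluster is about 1.26x the reference; vendor_claim is 2.0x
```

The absolute numbers tell you what your system does; the relative scores are what let you reproduce someone else's result and know whether your setup is in the same ballpark.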
Yep, absolutely. And not just for, you know, sort of shoot out reasons, but just to make
sure that the solutions that you're getting are going to be performing appropriately. Yeah, so another discussion focused around the changing nature of training versus inferencing, whether it was BrainChip talking about training at the edge, whether it was, you know, Bin Fan talking about the challenges of finding data sets and integrating those, or whether it was talking
about just basically how are ML tools evolving? How do you see this? I mean, is it still the old
paradigm of sort of you train in the back room and then you roll it out into production,
or are we going to see a more dynamic push and pull?
Well, I think the personalization of AI forces you to kind of combine the training and the inference components in a fast loop.
In a fast loop, I mean, where you have the ability to react in real time. And so the split concept of training and inference really worked out well when you had zero understanding or very little understanding of the AI problem you were trying to solve.
So for example, for us in speech recognition in the early days, we were really trying to boil
the ocean. And the way we did that was to separate the training piece from the
inference piece and to collect as much data as we could on the training side, and then hopefully
come out with a model that everybody else would then be able to use. While in the world today,
it's completely different in the sense that there are basic models for speech recognition. So you don't have to reinvent the
wheel. What you do have to do is to personalize the AI. And so you do that by taking the existing
model and to adapt that model to your own personal experience. So examples of that are automated
attendants or assistants, Waze, Amazon, where they kind of start to learn your individual
behavior. And I do think that forces the concept of training and inference into a single cycle.
Now, that being said, it's not that easy, right? The way you train and the hardware and the technologies and methodologies
you use to train data can be significantly different from your inference approach. But,
you know, I think AI is really evolving. And I think that should be one of the goals is to kind
of close the loop on training and inference and to benefit people more from a personalized version.
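That idea of a shared base model plus per-user adaptation can be shown with a deliberately tiny toy example. Everything here is invented for illustration (the model is a one-variable linear fit, not a real speech system): the shared weight is trained once for everyone and frozen, and personalization fits only a small per-user piece against that user's own data.

```python
# Toy sketch of personalization: a base model trained for everyone,
# then a small per-user adaptation fitted on that user's own data.

def base_model(x, w=2.0, b=0.0):
    # The shared model everyone starts from: y = w*x + b
    return w * x + b

def personalize(user_xs, user_ys, w=2.0):
    # Freeze the shared weight w; fit only a per-user offset.
    # Closed-form least-squares bias: b = mean(y - w*x)
    residuals = [y - w * x for x, y in zip(user_xs, user_ys)]
    return sum(residuals) / len(residuals)

def mse(xs, ys, w, b):
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# This user's behavior is offset from the population average: y = 2x + 1
user_xs = [0.0, 1.0, 2.0, 3.0]
user_ys = [1.0, 3.0, 5.0, 7.0]

b_user = personalize(user_xs, user_ys)             # fitted offset: 1.0
err_base = mse(user_xs, user_ys, 2.0, 0.0)         # base model error: 1.0
err_personal = mse(user_xs, user_ys, 2.0, b_user)  # adapted model error: 0.0
```

The point of the sketch is the shape of the loop, not the math: the expensive shared training happens once, while the cheap per-user step can be rerun continuously as new interactions come in, which is what pulls training and inference into one cycle.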
So another topic that came up quite a lot in season three was hardware.
We talked with NVIDIA, we talked with Cerebras, Habana, BrainChip, as I mentioned a moment ago.
What's your take on hardware and the changing nature of
machine learning? Well, I think one of the challenges is that, well, first of all,
it's all about math. Well, if I'm simplifying, it's all about math, right? And if you go to
billions and trillions of parameters, it basically means you have to do a lot of calculations. And so the technology that will survive today is the one that lets you find a shortcut.
So even if you can cut off or shave off a few microseconds per calculation,
if you multiply that with billions and trillions, that gives you a lot of time saved.
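As a rough back-of-the-envelope (the numbers below are made up purely to show the scale effect, not measurements of any real accelerator):

```python
# If a hardware shortcut shaves a couple of microseconds off each calculation,
# the savings only matter because of the enormous operation counts involved.
saved_per_op_s = 2e-6            # assume 2 microseconds saved per calculation
ops = 1e12                       # assume a trillion calculations in a training run
saved_s = saved_per_op_s * ops   # 2,000,000 seconds saved in total
saved_days = saved_s / 86_400    # roughly 23 days of compute time
```

A per-operation saving that is invisible on its own turns into weeks of wall-clock time once it is multiplied across a whole training run.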
And the challenge is that those shortcuts are really different for different types of technologies, you know,
for analyzing binary videos or binary audio or text.
The methodologies are different.
So I do feel there is a need
to have different types of hardware
that are optimized for those shortcuts.
Now, in the reality, as time goes by,
we better understand how to solve those problems.
So we might not have to use a mathematical shortcut,
but we might have to use a logical shortcut
And then there is the topic of scale, right? So a lot of the hardware
vendors do provide a processing unit of some kind, being a CPU or a GPU or an
FPGA, but what the market is demanding now is scale. And so the best way to explain it is if you have a cycle
and it takes you six months to generate a model,
you will ask yourself the question,
how can I reduce that from six months to six weeks or six days?
What do I need to do?
And so you need to be able to cluster that hardware.
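Why clustering helps, but rarely linearly, can be sketched with an Amdahl's-law-style model. The serial fraction and worker counts below are invented for illustration: some fraction of the cycle (data prep, checkpointing, inter-node communication) does not parallelize, and it puts a floor under how short the cycle can get.

```python
# Amdahl-style model of scaling out a training pipeline: a fraction of the
# cycle stays serial no matter how much hardware you cluster together.

def training_time(base_months, n_workers, serial_fraction):
    parallel_part = (1.0 - serial_fraction) / n_workers
    return base_months * (serial_fraction + parallel_part)

base = 6.0  # a six-month model-generation cycle on one node

t5 = training_time(base, 5, 0.05)    # 1.44 months: roughly "six weeks"
t64 = training_time(base, 64, 0.05)  # ~0.39 months: diminishing returns
```

Under this toy model, five workers get you from six months to roughly six weeks, but even 64 workers cannot get you to six days: the 5% serial fraction caps the cycle at about 0.3 months no matter how much hardware you add.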
And that's also a problem that hardware vendors are trying to solve,
is how do I make it scalable?
Like Cerebras goes for a much larger silicon
while NVIDIA sticks with its own GPU format
but provides a lot of hardware around it
like with Mellanox networking
and NVLink as a way to communicate between the
GPUs. And so they're trying to push scalability from that perspective. So I do think that we
haven't seen the end of it. And actually another thing that strikes me, and this relates to what
you just said, is that it's not all about hardware either.
We heard at NVIDIA's GTC this year, we've certainly heard loud and clear from Intel and from Habana Labs that the APIs and the integration of this hardware into the software ecosystem are actually much more important to getting this stuff in the hands of data scientists and machine
learning engineers and application developers.
I know that Intel has been doing a great job with their oneAPI, and we've seen a lot of
sort of pre-built models, pre-cooked zoo-style offerings
where it's all about making this stuff available
and easy to use
and sort of hiding the complexities of the infrastructure.
Like I said, at GTC,
NVIDIA pointed out that they have more people
working on software than hardware now, which is, I think, a good reflection of the importance of integrating this stuff and not just assuming that it's going to work.
Right. It's the ecosystem, right?
You don't have to have the fastest hardware.
You don't have to have the best data.
You don't have to have the best algorithms.
But if you have a, let's call it what people call it, an efficient data pipeline, that can get you pretty far.
And again, it's also about optimizing, right?
So you learn as you go and you make some changes. And I think that's one thing that many enterprises still don't understand is that they think they need to buy the best and the fastest hardware or storage for their solution.
And then the fastest and the best network.
And then the same thing with the compute side.
And then hoping that all these pieces together will actually work really well.
The reality is that those components will not be fine-tuned.
And so you don't get the efficiency of a well-defined ecosystem.
And really, if you're in the business of building models as fast as you can and innovate, it's
really challenging.
I mean, the market changes really quickly all the time,
so you have to make decisions when you cut over to a new algorithm or new piece of hardware.
But here's a good thing. You have plenty of options, right? You have no excuses not to do AI,
I guess. You just need to find the right options and come up with a strategy
that works out for you and your enterprise.
Absolutely.
And I think that that's really the core of it
is that even though we spend so much time
talking about all this new cool,
new hardware, new software, new models,
new approaches,
none of it matters unless it's actually serving the needs of the business. And that came through loud and clear, as you said,
when we talked with some of the folks who were actually there on the front lines implementing
machine learning. So to wrap up the season, I thought it would be fun, since we don't have a
guest on here, we've been doing these three questions at the end of each episode, I thought it would be fun to ask
ourselves some of the questions and have some of the guests join us here to ask us some of the
questions. So to kick it off, I'm going to ask you one and then you can ask me one, and we'll see
where that goes. So first off, Frederic, I'm going to ask you one of the questions that I've been asking guests all season long.
And that is, quite simply, can you think of an application for machine learning that has not yet been rolled out, but is going to have a big impact on the future?
Yes, I think there are many applications out there that haven't seen the light yet.
And I do think that's because AI is penetrating a variety of markets.
For example, there was a market I never thought AI was going to be used,
which was reviewing resumes, and it's happening.
So is there one particular one right now?
I don't know.
I think in the end, everybody will fall for AI, I guess. So what market will benefit the most
from AI in the next 12 months? Oh boy, what market will benefit the most from AI in the next 12
months? Okay, I am going to be controversial and a bit obnoxious. I'm going to say black hat security. In other words, basically, I see a huge trend in attackers using machine
learning and AI to find holes to fuzz attacks and to get into
systems. And I feel like that's been building and building,
especially since we've had
so much machine learning on the white hat side of things, you know, on the defensive side of
things. I think that the big story is going to be basically the bad guys using AI to attack networks
and systems. And that's going to be really horrible, but they're going to benefit from the technology.
So now I thought it'd be fun to invite some of our previous podcast guests to ask us some of their three questions. So first off, we've got a question from Rich Harang, a senior technical
lead at Duo Security. In an alternate timeline where we didn't develop auto differentiation and put it on top of GPUs, and so this entire deep learning hardware family that we depend on right now never got invented or was invented many years later,
what would the dominant AI slash ML technology be and what would have been different?
So I think that without the invention of GPUs and hardware accelerators, we would still be
in a world that was really CPU-centric, in the sense that it wasn't all about data but more about CPU.
So I do think we would have the Crays and the Apollos
of the world being a lot more successful as opposed to being run out of business,
and have data centers filled with large compute environments as opposed to
data environments. And we also have a question from Sunil Samel from Akridata, who is the VP
of Business Development. I'm Sunil Samel from Akridata. And the question I am wrestling with is how will new technologies like AI or
what's coming up, Metaverse, how will these help marginalized members of our community? These are
folks like senior citizens, minority groups, people with disabilities, veterans trying to
reenter civilian life. So I'm really excited about the prospects of artificial intelligence technology
helping marginalized, you know, disabled, differently abled people. This is something
we actually talked about recently on the Brain Chip podcast. We talked about the many ways in
which machine learning could, for example, help someone to see or help someone to hear.
Or, you know, even things like, you know, maybe you have long COVID and you can't smell anymore.
And you could have a system that could detect spoiled food.
You could have a system that is constantly watching if someone falls down so they don't have to push a button,
that it's actually kind of monitoring them
and assisting them in that way.
I think that it could be really tremendously,
tremendously beneficial.
And we also talked about this on the podcast as well,
that AI can help, for example, with pain management
by helping people to understand their own sensory inputs, you know, by having a computer offload.
So, frankly, I think that this is going to be a tremendous, tremendous market for personal AI.
The next question we have comes from Adi Gelvan, CEO and co-founder of Speedb.
Hello, I'm Adi. I'm the CEO and co-founder of Speedb.
My question for you is,
what do you think the risks of AI are
and what recommended solution do you have for it?
Well, I think the risk is obviously bias, right?
I think that that would be my number one issue.
It's not technology.
It's maybe malicious people, but I definitely would say a bias is the biggest issue.
And a solution, I don't know.
I mean, it's easy to say control your data.
I mean, we're talking now about data lineage, meaning where's the data coming from and being able to
show where the data is coming from in a model. So I would say let's maybe do what we can do to
monitor where the data is coming from. Next, we have a question from Bin Fan, who is a founding member of Alluxio.
Hi, I'm Bin Fan, founding member from Alluxio.
I'm wondering if there's any way AI can help for a humanitarian crisis happening in the future.
This is a really interesting aspect. And there are so many different things
that happen to people, war, pestilence, famine. How about we take one of them? How about we take
famine? I think that there is a great potential for AI to help improve crop yields, especially in marginal places.
We've already seen a lot of this technology getting out there where, for example, sensors
are, even remote sensors, even satellite-based sensors are monitoring rainfall and making
sure that farmers know where to water more or less.
And I think that that's really going to be something that AI can help.
Basically improving crop yields, helping us grow crops in marginalized places, and helping us to adapt to climate change.
I think all of these things are things that AI can really help with and help to avoid humanitarian crises.
The next question comes from a memorable episode: Katina Michael, a professor in the School
for the Future of Innovation in Society at Arizona State University.
Katina Michael from Arizona State University. And my question is, if AI was to self-replicate,
what would be the first thing it would do?
That's a good question. What is the first thing it would do? I think the first thing it would do is try
to find ways to learn, because I think AI realizes that learning is the only way to make progress. So
I think the first thing it would do is learn as much as it can. Now, the next question we have is from David Kanter,
who is the executive director at MLCommons. Hi, this is David Kanter. I'm the executive
director of MLCommons. And my one question for you is, what is a problem in the AI world
where you are held back by the lack of good publicly available data?
I'm looking forward to hearing the answer.
I think it's hard to find an ML problem that's not held back by the lack of good publicly available data.
Everything from autonomous driving to medical applications, in most cases, either the data sets are incomplete, or the data
sets are biased through the wrong, you know, limited collection, limited availability,
or they're proprietary and hidden. And that's one reason that I so love what companies
like MLCommons are doing, to try to broaden the availability of
these data sets. I think that this is one of those things where you can't have enough good data,
and yet we really don't have enough good data. So is it a cop-out to say everywhere?
I guess I'm going to say that, everywhere. Well, thanks so much for joining us.
This is the wrap-up episode for season three of Utilizing AI.
As I mentioned, we've got our AI Field Day event,
May 16th through 18th.
And if you'd like to be part of that, please reach out.
Or if you'd like to be part of Utilizing AI
or AI Field Day in the future,
just reach out to host at utilizingai.com.
We'd love to hear from you.
Frederic, before we hop off, where can people follow you
and connect with you on Enterprise AI?
Well, they can find me on Twitter and LinkedIn as Frederic Van Haren.
And as for me, you can find me at techfieldday.com
or gestaltit.com. You can find me as well on the Utilizing AI podcast and the Gestalt IT Rundown, which is a weekly news program. We're available on most podcast applications as well as on YouTube.
This podcast is brought to you by gestaltit.com, your home for IT coverage from across the enterprise.
For show notes and more episodes, go to utilizing-ai.com or find us on Twitter at utilizing underscore AI.
Thanks for joining and we'll see you next time.