Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 3x27: Benchmarking AI with MLPerf

Episode Date: April 12, 2022

How fast is your machine learning infrastructure, and how do you measure it? That's the topic of this episode, featuring David Kanter of MLCommons, Frederic Van Haren, and Stephen Foskett. MLCommons is focused on making machine learning better for everyone through metrics, datasets, and enablement. The goal for MLPerf is to come up with a fair and representative benchmark to allow the makers of ML systems to demonstrate the performance of their solutions. They focus on real data from a reference ML model that defines correctness, review the performance of a solution, and post the results. MLPerf started with training, then added inferencing, which is the focus for users of ML. We must also consider factors like cost and power use when evaluating a system, along with a reliable benchmark.

Links:
MLCommons.org
Connect-Converge.com

Three Questions:
Frederic: Is it possible to create a truly unbiased AI?
Stephen: How big can ML models get? Will today's hundred-billion-parameter model look small tomorrow, or have we reached the limit?
Andy Hock, Cerebras: What AI application would you build or what AI research would you conduct if you were not constrained by compute?

Guests and Hosts:
David Kanter is the Executive Director of MLCommons. You can connect with David on Twitter at @TheKanter and on LinkedIn. You can also send David an email at david@mlcommons.org.
Frederic Van Haren, Founder at HighFens Inc., Consultancy & Services. Connect with Frederic on Highfens.com or on Twitter at @FredericVHaren.
Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.

Tags: @SFoskett, @FredericVHaren

Transcript
Starting point is 00:00:00 I'm Stephen Foskett. I'm Frederic Van Haren. And this is the Utilizing AI podcast. Welcome to another episode of Utilizing AI, the podcast about enterprise applications for machine learning, deep learning, data science, and other artificial intelligence topics. In many of our podcasts, we've talked about different frameworks used to build AI applications. We've talked about the impact of AI on the world.
Starting point is 00:00:28 But one of the topics that often comes up is performance, because after all, this is IT. And one of the things that you start thinking about when you're thinking about performance is, how do I even know how fast my system is? Right. I totally agree. I think the challenge in AI is you're always trying to improve your model. So you need some kind of a baseline. And the best baseline you can come up with is to define some metrics that make sense to you. And the metric by itself then relies on a bunch of tools. And I think standardizing those tools will help not only yourself, but also the AI
Starting point is 00:01:06 community. Yeah, absolutely. And that's why I'm really excited to introduce our guest today, David Cantor. David, welcome to the show. Thank you. It's a pleasure to be here. So David, I know that you have been deeply involved in the machine learning and AI space for a long time, but I know that you are keenly aware, especially of AI metrics and MLPerf. Maybe you can give us a little bit of background there on how this all came about. Yeah. Well, so by way of introduction, I'm the executive director of ML Commons, which organizes and runs the MLPerf benchmarks. I got involved about two and a half years ago.
Starting point is 00:01:51 And before that, I had been paying a lot of attention to computer architecture, semiconductors, microprocessor design, and all with an eye towards performance. And so it was very natural. I got involved in some AI-related projects and became really curious. So that's how I ended up getting involved. And initially I was sort of the group secretary. They liked me enough that they promoted me to lead a project. And then they liked me so much that they kicked me upstairs and said, why don't you just run the whole org? So that's sort of how I got involved. And for those who don't know, ML Commons is really focused around how we can make machine learning better for everyone. And we sort of look at that through three lenses or pillars. One is metrics, which, you know, we just talked about before, right? How do you, you know, if you start measuring things appropriately, then you can start moving the whole industry in the right direction.
Starting point is 00:02:48 And that's sort of the concept there. And MLPerf is probably the set of metrics that we're known for best. We also like to produce large open data sets. So we released two big data sets in the speech area, one a speech recognition data set and the other a keyword spotting data set last year. And then also of interest to sort of the IT community and sort of maybe the more ML ops side, we work on projects that take friction out of the ML landscape. And in particular, we have one that makes reproducing AI training experiments a little bit easier. So that's a little bit of
Starting point is 00:03:25 background about me and what we do. But yeah, performance is absolutely of a keen and historical interest to me. So how do organizations start with MLPerf? It's not as simple as downloading it, running it in your environment and then hoping you get some kind of a number back, right? So what's kind of the process to get started? Yeah. So I think that gets into some of the challenges that are somewhat unique to AI and ML. And if I think about folks from an IT audience, both performance and capacity planning are
Starting point is 00:04:02 sort of interchangeable. And the MLPerf benchmarks, our goal is to produce something that is sort of a fair and representative playing field for primarily vendors and solution providers to demonstrate their performance. Now, the way it works is ML has some unique characteristics that make this challenging. For starters, you need to use real data, right? And I'm sure as you can, you can run with fake data, but oftentimes that won't actually tell you something useful.
Starting point is 00:04:47 So we have to use real data, which is one particular challenge. And essentially the way it works is you come up with a reference ML model, which is what we do, that sort of defines correctness. So to pick an example that listeners might be familiar with, BERT is a very common model for natural language processing. You know, anytime someone talks about transformers or attention or GPT-3, you know, those are all related to BERT. And so we define a model that is sort of, you know, you can think of it as the platonic ideal of BERT. And then we tell everyone, hey, go take something that is mathematically equivalent to that,
Starting point is 00:05:34 train it, send us the logs and results, and then we'll verify to make sure that the performance is correct. Now, and then we post all of this on our website once it's done. So the nice thing is that you actually get, not only do you have this reference model, which is not really optimized for speed, but the people who submit the benchmarks have their own optimized models that they're posting that we publish on our website. And so you, as an ML expert, can look at those for inspiration. You can download them and just reuse them, or you could look at them and say, you know, this technique is really applicable to what I want to do, but not this one over here, right? And so, you know, from an IT perspective, I think you can look at the results and learn a lot, but they also provide, in some sense, a set of best practices that you might be able to adopt to help your company run better. And I think that's very exciting because that's,
Starting point is 00:06:30 you know, something that, that a lot of benchmarks are closed source and don't necessarily have this easy reproducibility. Right. And I do think that, that customers can look at, at MLPerf as two types of baseline. One is the internal baseline, is how does it improve internally against their own setup? And then the second one is, like you mentioned, where they start to compare with what other customers have done. I think the challenge with the external comparison is that the data might be different
Starting point is 00:07:03 and the hardware might be different. So to use a comparison, it's not like an apples to apples comparison, right? So is that why you're also working on data sets and with ML Cube and the best practices to kind of reduce the differences or are the two not related? So, yeah, I think you've identified two really good points there, right? Which is, and I'll give you a, actually a very concrete example of that, which is I think the best way to use MLPerf is different enterprises are gonna have different needs.
Starting point is 00:07:37 So I ran into an organization that does healthcare and they said that actually their favorite thing was BERT because one of the big challenges that they face is they're trying to produce AI systems that can look at medical records and sort of help parse them, right? So it's natural language processing. It's close to BERT, but the data is different, right? They want to be looking at, you know, their entire collection of medical records, not the Wikipedia data set that we use. So there are likely to be some differences there, but it's, you know run BERT? Can you run MLPerf BERT where we've agreed as an industry collectively what that means, right? And then you could also say, you know, if you're a big enough customer, you might say, hey, go use my data or send me something for a proof of concept.
Starting point is 00:08:38 But, you know, our goal was to get a nice broad set of benchmarks that represent many of the commercial areas out there. So we have, you know, recommendation, right, which is, you know, commonly used to figure out, you know, what should we show you on Netflix? What ad should we show you next? You know, we've got natural language processing, a lot of vision, et cetera, that will, so, you know, object detection, right, which, you know, you might, so very interesting for all sorts of things from, you know, retail anti-theft solutions all the way to, you know, detecting pedestrians and autonomous vehicles. And so, you know, we built up this nice set of benchmarks, but I think the really important thing is getting back to sort of what you said about metrics. Before MLPerf existed, people would describe performance and capacity in all different ways.
Starting point is 00:09:33 They might use flops or time to train or whatever. And I think a big part of this is that by aligning everyone on what it means to be faster or better, we get the whole industry moving in the right direction. You know, imagine if we were to talk about cars and, you know, someone comes up to you and says, well, my car can turn a corner in a quarter of a second. Someone else says, well, I can drive it 400 miles an hour. You know, the answer is if I'm going to Whole Foods in San Francisco, neither one of those are actually tremendously useful because I can't go 400 miles an hour in San Francisco. And I shouldn't be going through a stop sign that quick around a corner. Yeah, I don't recommend that, by the way. Note to listeners. And that's exactly what I'm getting at. And I love to hear you say this because so often in
Starting point is 00:10:20 the IT industry, we focus on these ridiculous synthetic numbers, and especially in machine learning. I mean, I was watching GTC this week, and there was a whole lot of talk about numbers, sort of artificial, random numbers about how we can process this many of this and this many of that. And it's like, well, that's awesome. But what is that actually going to do to my model? You know, what is that going to do in practice? And I think that we hardware nerds kind of understand that there's a lot of factors that go into the overall performance of a system. And the performance of a GPU is important, but not the whole picture. Yeah. Yeah, I think that's right. And, you know, MLPerf is very much all the different flavors of MLPerf. And I should, you know, I should be clear that we actually have benchmarks for everything from like tiny IoT systems that are, you know,
Starting point is 00:11:18 burning a couple of microwatts all the way up to, you know, things that run on the world's, like literally the world's largest supercomputer, Fugaku at Ryken. And these are full system benchmarks, right? If you think about training, you've got host processors, you've got potentially accelerators, you've got networking to tie them all together, and there's a lot of software.
Starting point is 00:11:42 And so one of the things, MLPerf is both a very powerful and flexible tool. Like we've seen submissions where it is a cloud vendor submitting something that, you know, they want to show off, this is what my solution can do if you go and rent it in the cloud. Now it's a different system than, you know, what you might buy on-prem, whether from an OEM or, you know, a hardware vendor directly like NVIDIA. But, you know, I don't think ultimately at the end of the day, right, it's about, they were making a statement about you can do this with our system. And that's just, you know, sort of a different equation, right? You know, when you buy a system for on-prem,
Starting point is 00:12:28 right, you're buying the management. And that can oftentimes be tremendously valuable for you, but sometimes you just want to plug and play and rent things by the hour, right? And those are all viable options. Yeah. And modern ML systems as well, they're getting increasingly complex. Again, what we're seeing now at these industry shows, what we're seeing the vendors introducing are not very typical computers. It's not the fact of, oh, I bought a fast CPU, I bought a fast GPU, I bought fast memory, I put it in the thing, and this is how it does. I mean, these are systems that are very, very complicated. They can include multiple nodes, They can include a rack full of nodes with disaggregated components connecting with various different interconnects. Really, the only thing
Starting point is 00:13:12 that matters is how fast is it going to do the work? And again, that's why I love the idea of using real-world data running through a real-world predictable model and making sure that the results are correct and then showing, look, it did the thing in this amount of time. That's just, it reflects all of these things, whether it's, as you mentioned, in the cloud or a disaggregated system, or even as you say, like a single monolithic system, even a small one. Yeah. No, I mean, there's a lot of challenges there. And I mean, I think, you know, coming from a hardware background myself, you know, one of the things that I've actually taken tremendous joy in is actually getting to learn a lot more about the software side. Right. And, you know, I think with hardware, there's always speeds and feeds and,
Starting point is 00:14:01 and it's exciting, right. You know, we've got this latest generation and it gives us these, you know, concrete features more. But, you know, software is a huge part of the story. And, you know, just as a good example, like when you're talking about inference, there's the speed at which you can do things, but it's also, well, how are you actually using it? What's the deployment scenario, right? So there's, you know, a classic IT thing is you have online systems and offline batch systems, right? So, you know, I think about, you know, the bills I get at the end of the month,
Starting point is 00:14:36 and there's a bunch of, you know, from a bank, let's say, right, they're going to go in and analyze and make some recommendations and maybe do some analysis that is offline. And the way that works is actually quite a bit different than what an online recommendation system might do, where you've got to get back to the user in tens of milliseconds in order to not lose their attention. And so that's one of those places where, is the online versus batch directly ML related? Not so much. It definitely has an impact, but you have to start thinking about all of these things in order to really accurately reflect the wide variety of problems that we all face. Performance is always a problem because it's hiding the inefficiencies, right? If you look at AI and you look purely at the hardware, a lot of people have issues with the inefficiencies of a GPU. It's very difficult to keep a GPU busy all the time, right? And that's where the combination of the hardware and the software, because it's the software that can eliminate the inefficiencies and improve your cycle. So I agree that the metric that makes most sense is where it's more like the solution where you have the hardware and the software all combined as opposed to just hardware, right? The raw numbers don't make a lot of sense.
Starting point is 00:15:58 You did talk a little bit about inference. So does MLPerf also handle inference? Yeah. So just for a brief dive into history, we got started working on training. That is obviously the first step, right? Before you have anything to do inference with, you've got training. And those are really big compute jobs. I think one of the things that was mentioned is we have to measure performance for systems that are as large as several thousand nodes. And I think for some of our HPC benchmarks,
Starting point is 00:16:30 which is sort of training for scientific computation problems, that might actually scale up to 16 or 20,000 nodes or possibly even larger. You know, if you look at Fugaku, right, I believe that's about 150,000 nodes. And I think they ultimately ran some benchmarks on half of it. So this is very large scale systems that you use to, right, to define the model that you will ultimately place into production. And once we had sort of worked through training, then actually the natural next step was to start focusing on inference. And, you know, inference is not as computationally demanding, right? Sort of by definition, right? When you're training, you're going through all of these iterative examples, you need huge amounts of data, and you're doing both the forward pass and the backward pass and the weight updates.
Starting point is 00:17:20 But for inference, it's just the forward pass. And usually the performance needs for inference come from the scale of the usage more often, right? You know, I that model, you know, regardless of what it is and operating on new data. And so that was actually the project that I helped to lead with four of my fantastic colleagues was looking at inference. And so we tried to use many of the same networks and data sets that we could for training, but then there's a lot of different deployment scenarios, right? There's the online server scenario. You have to figure out what are the right latencies for responsiveness. You need the right accuracy metrics to keep everyone honest. You know, if you have something that can predict spam at, you know, 99% accuracy, you know, that's good. But if it's only doing 60%, that's actually not a usable
Starting point is 00:18:26 system. So accuracy becomes a very intrinsic part of all ML discussions. But then there's also, you've got scenarios like maybe you're taking in multiple camera streams to do security monitoring, right? Or you've got offline processing. So how do you think about that? And coming up with the agreed upon rules and metrics and just helping to really align folks there. Yeah, I think ML Perf or inference is becoming more important for the simple reason that CPUs traditionally were on the inference side and this is changing rapidly.
Starting point is 00:19:03 And on top of that, in my conversations with nvidia i hear that a lot of customers are buying the same hardware same type of hardware for inference and and and training meaning that they just use the software to delegate the workloads to the same device but one could be production the one could be could. So I do think that ML Perf for inference is definitely going to ramp up. Yeah. And I think in some ways the target population is potentially a little bit different as well. Well, the audience really, right? The people who are going to be doing training are generally going to be ML engineers, right?
Starting point is 00:19:44 But then it's taking those know, taking those trained models and putting them into production. And that's actually a fairly complicated and extensive pipeline, right? There's all sorts of things you might want on provenance and model tagging and, you know, how do you detect drift over time? I mean, there's a ton of practical issues around deployment that you need to work through before you get your inference model in production. And actually, I mean, I think that's one area where, you know, the industry has a lot of work to do to make that as sort of push button as possible to sort of accelerate from training to inference to deployment and sort of the full life cycle there. But yeah, you have a lot more options on inference in part because,
Starting point is 00:20:33 as you said, most people do it on CPUs today. There are accelerators out there, but you also can do inference on a variety of devices. We've got inference benchmarks for smartphones, for IoT devices, for data centers, and increasingly we're seeing inference being done in more and more platforms. So it's a very exciting time and I hope it'll be a very useful tool. One of the things that comes to mind as well with benchmarks generally is that a benchmark really only matters when you take it in context of the system that you're running it on. So, you know, you can build a benchmark queen that isn't usable in the real
Starting point is 00:21:11 world, or you can't afford, or it sucks down too much power. Do you see people doing that? Do you see people talking about performance per power or performance per cost when they're evaluating these systems, or are they really just looking for the best, quickest outcome? Yeah. So that's actually a really good question. And I think it's a little bit different for training and for inference. So I actually, one of the things that I had the pleasure of leading was once we did inference, we decided that inference was very appropriate to couple to power measurement, right? In a sense, to provide some normalization, right? And, you know, part of it is that if you look at the inference solutions that we got in our first round, I think the difference in power consumption was something like a factor of a thousand or more, maybe a
Starting point is 00:22:07 factor of 10,000, right? You know, so everything from under one watt to, you know, some of these big servers, as you say, can be multiple kilowatts. And, you know, you don't want a multi-kilowatt system running under your desk. The noise alone would probably drive you mad. So we do have, it's optional, the ability to do power measurement when you run inference. I think from a buyer standpoint, cost absolutely matters, but this is one of the areas where I actually think we, both a blessing and a curse. So we made a very conscious decision to avoid talking about cost in part because there's a lot of factors that go into it that are subtle or hard to control. One is how big a customer are you? You can imagine that Walmart and your corner grocery store get slightly different rates when they're quoted by Dell, for starters. But then the other thing is, even if you look in the cloud,
Starting point is 00:23:06 as I'm sure many people have noticed, the cost of power actually varies quite considerably between a place like Oregon that has a lot of relatively affordable hydroelectric power and maybe an island. And so we ultimately decided that there were enough sort of flexible factors in there that it was going to be hard to consistently measure. But things like power are more informed by physics. And the laws of physics cannot be bent by marketing departments conveniently, right? That is the responsibility of the engineering department. And so we felt like that was a good approximation.
Starting point is 00:23:43 And ultimately, a serious buyer is going to be able to talk to their vendors and get price quotes anyways. So I think one of the challenges I've seen with other benchmarks is if you start including pricing information, you start to see people benchmarking systems that are constructed in very strange fashions. And so we kind of wanted to avoid that. Does that make sense? Yeah, it does. that are constructed in very strange fashions. And so we kind of wanted to avoid that. Does that make sense?
Starting point is 00:24:09 Yeah, it does. There's one question I get all the time and I don't have a good answer for it. And the question is always, how long does it take to run MLPerf to have reasonable results, right? And it's really an open-ended question and I usually try to avoid to answer the question. But since you're here, why not ask the question?
Starting point is 00:24:29 Yeah, no, absolutely. So for MLPerf training, our general target is that we want the benchmark run time in an unoptimized sort of, like the reference models, should take about a week to run. And now that sounds really long, but again, that's not very well optimized. And this is something that we want to scale up to large systems and be reasonably representative, right? If you have a training job that really only takes you like one minute on a home PC, it might be important. You might be running it frequently, but it's already pretty well optimized. That's well under the wait time where I'm going to go and get a cup of coffee. So you kind of have a solved problem there.
Starting point is 00:25:20 Now, I think the effort on MLPerf is really from the optimization side, and that can take a considerable period of time. But actually running it shouldn't be too difficult. When I see some of our, I think some of the systems that have submitted are now training these models that normally would take a week in like under a minute because they've managed to scale it up to 4,000 nodes. And, you know, we've seen the progress in software and hardware over time. I mean, one of my favorite, I did some analysis of the data and one of the things that I thought was really cool
Starting point is 00:25:57 is in our first round of MLPerf, we didn't really see any systems that were for training were larger than about 200 processors. Now it's routinely up to two to 4,000. And then if you look at sort of like the same hardware with just software tuning, we've seen improvements on the order of 20, 30, 40% up to 2X over time. And so I think that actually really illustrates one of the points we had talked about before. It's not just about hardware. It's about software and the full system.
Starting point is 00:26:31 How do you compare MLPerf to some of the other benchmarks out there? I mean, AIBench and DeepBench and all the rest of these. Some of those are more micro benchmark focused. Some of them are more application focused. MLPerf seems to be really focused on sort of the performance of the components. How do you compare that? Right. So I think MLPerf intrinsically is a full system benchmark. And there are some benchmarks out there that are, you know, focused at a specific component level. But we very much wanted to incorporate the full system aspects. And, you know, there's even some work that we're
Starting point is 00:27:05 looking at to potentially for inference, for data center inference in particular, to start incorporating sort of the network, right? Because if you think about how these things actually work, right, you're going to get some data in across the network and then that drives your inference query, which you then need to process and get right back out to the network. So, you know, I'd say it's a full system benchmark. I think, you know, the things that we have going for us is, you know, everything's open source. It's got very broad industry support. There are other benchmarks out there that are doing some great work. And the other thing I should say is MLPerf, you know, didn't start from a blank slate. Like we built on top of some work at Baidu and Stanford and really the way it got started. And this is goes somewhat to our philosophy is we got everyone in the room circa,
Starting point is 00:27:59 you know, I think 2018, 2019, who had worked on ML benchmarks. And we said, look, none of us individually have come up with the right solution, but if we all get together in the room, we can build on what we did before and get something that will really hit the nail on the head. And that's really what happened with ML Perf. And so it's been really exciting to see sort of that community grow and expand over time. But yeah, I mean, there are a lot of other folks who are doing benchmarking in other ways, and they might be measuring different things that are just as valuable. And, you know, there's also breadth of workloads. I think one of the things that we're keenly aware of is, you know, the reality is the best benchmark is whatever
Starting point is 00:28:45 you're actually doing. But, you know, in order to make something that's generalizable, we have to sacrifice a little bit of specificity. And so, you know, I totally understand some folks who are out there and saying, hey, okay, you know, maybe those existing workloads are maybe not exactly what I need. Here's what I want my benchmark to be. And I totally understand that. The other thing that's important to keep in mind is ML changes really rapidly over time. You know, and I know, Frederick, this is probably closer to your alley. But one of our, I remember one of our first benchmarks was we had both a recursive neural network, recurrent neural network, and a transformer based for translation. And at the time, we sort of said,
Starting point is 00:29:38 look, we would like a recurrent neural network in there, number one, because that's an important workload. And it's more mature. But this was right at the dawn of transformers, of BERT and attention. And we said, but we think that having a transformer-based model is more future-proof. And lo and behold, a couple, basically a year later, we said, well, look, that recurrent translator is no longer state-of-the-art and we really need to retire it. So I think that's another thing that you see, which is there may be benchmarks that might be using older networks for posterity's sake, because it is potentially valuable or useful information as well.
Starting point is 00:30:21 Yeah, software goes really fast. I mean, I agree with software. People can expect that it goes really fast. I have a customer who just deployed a large cluster with A100s and they listened to the announcement at GT the things that's in the Hopper series is transformer specific instruction set, which is going to be really interesting to see more of this hardware software co-design where, you know, you really look at a big problem and, and start to optimize it from the Silicon up. And I think, you know, that is, uh, uh, frankly, a very promising approach. I think we're seeing a lot more of that, you know, from many different folks and, you know, you even see it from, uh, organizations that don't make their own silicate. And I think Facebook and other folks have talked about how they will tailor their systems towards their workloads, even if they're using components from third parties. So I think there's a lot of opportunity there. Yeah, I think that's what AI is all about, right?
Starting point is 00:31:45 It's continuous change and continuous learning. I think sometimes people forget about the learning piece, right? That change is guaranteed. If you don't change, you automatically will be behind. But I think tools like MLPerf and others, and even MLCube where you have best practices, are really great tools to create a baseline. I think that's one thing some people don't understand is you don't buy an AI cluster,
Starting point is 00:32:13 you don't tune it and then walk away from it, right? It's a living thing. The moment you deploy it, that's when it all starts and workloads will change. You will have more data. The whole process will change. You need to continuously adapt. And I think having tools that help you make you aware that those things change is very key. Yeah, absolutely. And that's actually, I think, an underappreciated aspect. One of the reasons why MLPerf is open is, yes, it is a tool for vendors to articulate their performance to customers and help that communication pathway. But it's also a tool where if you are building software, you might integrate our benchmarks into your regression tests. Like, do you want to figure out if you're, you know, how is this compiler feature going to do? Or does my compiler have
Starting point is 00:33:03 good support? Or as I put together this network system, how is it working? And so it's about aligning the entire industry sort of from the sales, the customer, the research and academics, everyone all together. And I think ultimately being open source, being an industry standard, being open to everyone gives you a lot more leverage in all of those different dimensions. Right. Yeah. And I think, you know, getting ML perf numbers from vendors is kind of twisted in the sense that every vendor will turn ML perf
Starting point is 00:33:40 around so that they can say that they're the number one, right? So again, it's an interesting baseline, but it's not really, you know, it's not something that will work for you as well. That's right. And I mean, you have to pick the right benchmark within an MLPerf that's going to be hopefully representative. But I think one of the other things is we're also very committed to making sure that the benchmarks stay representative and up-to-date. A classic example is AlexNet is the neural network that finally beat humans on ImageNet. And today, it's not a very interesting network, but always interested in hearing from is from customers that know about AI that are deploying it, like what is important to them? Like I always want to be getting that feedback so that what we do is relevant and valuable. Thanks for this whole conversation, David. It's been really interesting. And I could talk all day long about these things. But we've reached the point in our podcast where we shift gears. And it's time to hit you with three unexpected questions. So this tradition started last season, and it's been a lot of fun.
Starting point is 00:34:57 You're going to get a question from Frederic, a question from me, and then a question from a previous guest on the podcast. So, Frederic, you go ahead. So, my question is: is it possible to create a truly unbiased AI? Ooh, that's a really good one. And so, when you say unbiased, you mean with respect to, say, having bias about whether the speaker is male or female? Right. Ethnic, religious, female, male. Oh, that's a really tough one.
Starting point is 00:35:49 So I. So I will say that I am not a core expert in this area. I think a lot of our problems intrinsically stem from data and how the AI is often deployed, right? You could take, even if you had a perfectly unbiased AI system, humans are very good at finding ways to use the tools in ways that they maybe shouldn't be. And so I look at it, I can absolutely see a problem with data. And this is part of why we do our data sets, right? If you look at, we have lots of data that is generated from rich Western countries, because that's economically significant. We have much less for other parts of the world. So I think that's, maybe we can, I think improving the data that is available and open is a very good way to do it, but we also need to think about how we would use AI in a less biased way. How's that for an answer?
Starting point is 00:36:35 Good. Thanks. All right. My question is, how big can ML models get? Will today's 100 billion parameter model look small tomorrow, or have we reached some kind of limit? So I am a firm believer in the bigger is better camp in terms of parameter count. I think, you know, we're seeing people looking at trillion or multi-trillion parameter models. And I think that one of the areas that we will get driven in is bigger and bigger models that are sparser and sparser. So that even if you have, you know, a hundred trillion or larger models that you don't necessarily need to do all of that computation, right? There's a lot of attention being paid to a mixture of experts and other sort of sparse techniques that lets you get the benefit of model size without necessarily
Starting point is 00:37:31 the full computational cost. But if you look at the research, it's very clear that as models get bigger, you tend to get better accuracy and lower losses. And part of the magic of AI has been that there are certain inflection points where, once we pass them, things that were not possible suddenly become possible, whether it's beating a human at image recognition, or maybe one day reliably detecting pedestrians around cars. And the other thing to keep in mind is that when it comes to these inflection points, we need to advance over the state of the art; we don't need to be perfect, right? I think about autonomous driving.
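The sparse routing idea described above, running only a few experts per input so compute grows far more slowly than parameter count, can be sketched in a few lines. This is an illustrative toy with made-up dimensions and random weights, not any production mixture-of-experts implementation:

```python
# Toy top-k mixture-of-experts layer: of E experts, only K run per input,
# so the per-input compute is roughly K/E of the equivalent dense cost
# even as the total parameter count (all E experts) grows.
import numpy as np

rng = np.random.default_rng(0)

D, E, K = 8, 4, 2  # feature dim, number of experts, experts used per input

# Each "expert" is just a small dense layer here; weights are random
# placeholders standing in for trained parameters.
experts = [rng.standard_normal((D, D)) for _ in range(E)]
router = rng.standard_normal((D, E))  # gating weights (learned in a real model)

def moe_forward(x):
    """Route input x to its top-K experts and blend their outputs."""
    logits = x @ router                     # score every expert: shape (E,)
    topk = np.argsort(logits)[-K:]          # indices of the K highest-scoring experts
    gates = np.exp(logits[topk] - logits[topk].max())
    gates /= gates.sum()                    # softmax over the selected experts only
    # Only K of the E expert matmuls actually execute.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, topk))

x = rng.standard_normal(D)
y = moe_forward(x)
print(y.shape)  # (8,)
```

Real systems such as the mixture-of-experts work referenced in the conversation add batching, load balancing across experts, and distributed placement on top of this basic gating pattern.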
Starting point is 00:38:13 And if a 100-trillion-parameter model is what's needed in order to safely detect pedestrians and prevent collisions, then I absolutely think we should go there, because there are so many lives that we can potentially save. Excellent. Well, thanks for that. And now, as promised, we're going to be using a question from a previous guest. This question comes from Andy Hock of Cerebras Systems. Andy, take it away. Hello, my name is Andy Hock. I'm the vice president and head of product at Cerebras Systems. My question is: what AI application would you build, or what AI research would you conduct, if you were not constrained by compute? First of all, it's great to get a question from Andy, who I've had the pleasure of working with through MLPerf and who has been a great
Starting point is 00:39:13 supporter. I would say that there are two areas I really look at as potentially being fairly transformative, and I look at both breadth and impact, right? There are a lot of technologies where you can have tremendous breadth, but the impact is not very deep. And then there are technologies where the impact is really deep, say, improving surgery: if a patient can walk again, that's a huge impact, but it's hard to scale. And so one of the things I look at today as being very interesting is the medical domain, which I think has the opportunity to get both that breadth and depth, although there are a lot of regulatory barriers beyond just what we can do with our current computation. And then I'd say
Starting point is 00:40:05 I'm certainly very excited about autonomous vehicles as well. And then a third would be the ability to translate or work with any language. To me, that's one of those classic things; I'm not a Star Trek fan, but I believe the universal translator was one of those things from Star Trek. And the idea of taking things that 20 or 30 years ago were regarded as magical and beyond belief and making them ordinary, that's really cool. Well, thanks so much for those answers.
Starting point is 00:40:40 It's great to catch you off guard and get a little insight from you as well. We look forward to hearing what your question for a future guest might be. And if our listeners would like to play this game, you can just send an email to host at utilizing-ai.com, and we'll record a question from you. So, David, thank you so much for joining us today. Where can people connect with you and follow your thoughts? Or is there something that you've done recently you want to call attention to? Yeah, so you can find me online.
Starting point is 00:41:11 I am The Cantor, K-A-N-T-E-R, at Twitter. I'm also on LinkedIn. And you can reach me via email, david at mlcommons.org. And as an additional call to action, I'd say come visit our website. We're a very open and welcoming community. You can join our public mailing list. And as an added incentive, on April 6th, we posted the latest results from MLPerf Inference, measuring performance for data center and edge systems, mobile for smartphone, tablet, and notebooks, and Tiny, which measures performance for IoT systems.
Starting point is 00:41:48 So there's a lot to check out that you might find interesting. Please give me a shout. And especially if you're interested in using the MLPerf benchmarks or understanding how to use them or have ideas for them, please reach out. We're very open and want to help the community. Well, thanks very much for that. Frederick, what's new in your world? Yeah, so I recently published an article on AI in the enterprise. So hopefully people will like it. Still working on a startup around data management and providing services around data management, designing large scale AI clusters, hopefully with MLPerf as well.
Starting point is 00:42:26 You can find me on LinkedIn and Twitter as Frederick V. Heron. And as for me, I've been working hard on planning AI Field Day. So we are coming up here with an AI event where we'll have lots of presentations and discussions, hopefully from some of the folks you've heard from here this season on utilizing AI. So if you'd like to learn more about that or get in touch and get connected, you can go to techfieldday.com.
Starting point is 00:42:53 Thank you very much for listening to the utilizing AI podcast. If you enjoyed this discussion, you can find it on most podcast platforms. And while you're there, give us a rating and a review since it would help our visibility. This podcast is brought to you by gestaltit.com, your home for IT coverage from across the enterprise. For show notes and more episodes, go to utilizing-ai.com or find us on Twitter at utilizing underscore AI. Thanks for joining us and we'll see you next time.
