Orchestrate all the Things - AI chips in the real world: interoperability, constraints, cost, energy efficiency, models

Episode Date: February 3, 2021

As it turns out, the answer to the question of how to make the best of AI hardware may not be solely, or even primarily, related to hardware. Today's episode features Determined AI CEO and founder, Evan Sparks. Sparks is a PhD veteran of Berkeley's AMPLab with a long track record of accurate predictions in the chip market. We talk about an interoperability layer for disparate hardware stacks, ONNX and TVM -- two ways to solve similar problems, AI constraints and energy efficiency, and infusing knowledge in models. Article published on ZDNet.

Transcript
Starting point is 00:00:00 Welcome to the Orchestrate All the Things podcast. I'm George Anadiotis and we'll be connecting the dots together. Today's episode features Determined AI CEO and founder, Evan Sparks. Sparks is a PhD veteran of Berkeley's AMPLab with a long track record of accurate predictions in the chip market. And as it turns out, the answer to the question of how to make the best of AI hardware may not be solely or even primarily related to hardware. I hope you will enjoy the podcast.
Starting point is 00:00:33 If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn and Facebook. So yeah, thanks again for taking the time, Evan. And a reasonable first step to take when having this conversation is actually to ask people to say a few words about themselves and their background and what brings them into this topic. So, how come you got interested in the AI chips landscape? Yeah, absolutely. So happy to introduce myself. So I'm Evan Sparks. I'm co-founder and CEO of a company called Determined AI.
Starting point is 00:01:11 We're an early-stage software startup, and we build software really to help data scientists and machine learning engineers accelerate their workloads and workflows. So help them build AI applications faster. So you can think of us as kind of a software infrastructure layer that sits kind of underneath the TensorFlows and the PyTorches of the world and above the various chips and accelerators. So NVIDIA GPUs and many of the other interesting ones that are kind of coming on the market as well, Google TPUs and so on. And prior to starting the term of AI, I was a graduate student, a PhD student in the AMP lab at UC Berkeley, where I focused on kind of distributed systems for large-scale machine learning.
Starting point is 00:01:53 And at Berkeley, I had the opportunity to work with some pretty interesting folks on the architecture kind of side of the world. Dave Patterson, in particular, was a co-author of mine of the world. Dave Patterson in particular was a, you know, a co-author of mine and a PI and, you know, really from working with Dave, it was really interesting because he was the one who was sort of banging the drum about Moore's Law being dead and custom silicon really being the only hope we had for continued growth in the space really early on. And I kind of drank the Kool-Aid a little bit. And I think we fast forward almost 10 years now and we're seeing a lot of that kind of come to bear
Starting point is 00:02:39 in the sort of AI chip market. So that's a bit of an extended introduction. And a very good one one if I may add at the show, because it sounds like in a way you're coming from maybe the opposite direction from the one I'm coming to in this one. So you have more of a hardware background than I do. I got interested, and you got there definitely earlier than I did. So I got interested in this market kind of as an offset, let's say, of my interest in machine learning and training models and inference and all of that. So you were definitely earlier than I were. And it also sounds like what you do with your company is very, very much closely related to the hardware layer. And I also read with interest an article you
Starting point is 00:03:34 have written for the IEEE Spectrum, in which you basically argued that this whole renaissance of custom chips is, well, maybe good for hardware and good for AI in general, but it also poses a danger, let's say, for people who are responsible for training these models because, well, it kind of forces them to over-specialize or to pick sides, if you want to call it that. So, you also mentioned a couple of the obvious, I guess, names in your introduction like NVIDIA or Intel or Cloud vendors as well. So, I was wondering how would you like to start to go through this conversation? One way I was thinking we could do
Starting point is 00:04:20 it was maybe start, let's say, with NVIDIA, then go to Intel and Qualcomm, which is also in the news lately for having just completed or started actually an interesting acquisition. Or the other way you could look at it, well, taking the lens of someone who's developing models. Do you have a preference? Yeah, I mean, I think I would rather take the perspective of somebody who's developing models, if that's cool with you. Yeah, yeah, sure. Okay. Yeah, go ahead. Sorry, after you. Okay, I was just going to ask that. So if that's the way you were going to dissect it, I guess one way to start would be to start with what kind of brings all of these different custom chips and architectures together.
Starting point is 00:05:19 And to the best of my knowledge, at least, the only thing I can think of would be ONNX. So you can use different architectures or software to train your models, but you can always rely on ONNX to deploy that for inference. And do you have an opinion on ONNX? Yeah, so ONyx is a very interesting initiative. I think if you go back to the genesis of Onyx, it came out of Facebook originally. And the reason Onyx was developed in the first place was really because Facebook had a very disparate training and inference stack. They had developed PyTorch internally
Starting point is 00:06:11 and model developers loved PyTorch, especially in comparison with TensorFlow. The eager execution mode, we can go on and on about the software properties that they liked, but suffice to say, they really felt it was a much more productive experience when they were in that sort of model development side. And yet at the same time, at that time anyway, the bulk of at least deep learning models that Facebook was running in production were computer vision models that were running backed by CAFE. So CAFE is not a name we talk about a ton these days, but was really one of the groundbreaking deep learning libraries, particularly for computer vision in sort of the early days of that taking over.
Starting point is 00:07:01 And when I say that taking over, I mean like five years ago, right? Not centuries ago. And so, you know, Facebook had this mandate that said something like, research can be done in whatever language you want, PyTorch, but production deployment of these models has to be in Cafe 2. And so that led to the need for this intermediate layer that would translate between the model architectures that were output in PyTorch land and input into CAFE.
Starting point is 00:07:27 And they started to see, hey, this is a really good general idea. It's not too different from things that we've seen in compilers in the past where we've got something like an intermediate representation between multiple high-level languages and started to plug things in with multiple languages at the source and multiple frameworks at the destination. I would say ONIX is very interesting. It's certainly matured over the last two years. But I think just generally speaking in this space, the software landscape is moving so
Starting point is 00:08:03 fast that it's hard for anything yet to become really, really a standard. And so I think Onyx is a great step in that direction, but I don't think it is totally there yet. You know, the other project that really comes to mind in this space is TVM. And, you know, slightly different approach to, I would say, similar problems, but also something that is gaining traction, particularly among the chip vendors, is my understanding. That's not, you know, my business. I think the folks at OctoML can talk to you a bit more about that. But there's certainly some interesting standards starting to emerge. Okay, yeah, it's definitely interesting for me to hear about TVM because, well, I didn't know about it up until like a minute ago when you mentioned that, and I guess it may also
Starting point is 00:08:52 be the case for many other people. So if you would be so kind to expand a little bit on that. So what does it do exactly? Where did it come from? And why do you think it's interesting? Yeah, so TVM is really interesting. Where did it come from? And why do you think it's interesting? Yeah, so TVM is really interesting. I mean, it is a research project out of the University of Washington that now has a commercial kind of effort behind it in OctoML. And really the goals of TVM
Starting point is 00:09:29 are about sort of similar to the goals of ONIX in making it possible to compile deep learning models into what they call minimum deployable modules. And kind of similarly from a compiler's kind of perspective to be able to automatically optimize these models for different pieces of target hardware. And so, you know, like I said, it's a relatively new project, but that said, it's got a pretty strong open source community behind it as well. And I'd say, you know, there are a whole bunch of folks
Starting point is 00:10:05 that would like to see it become a standard. You know, it being open source is a huge piece of that. But I think that in general, hardware vendors, and I mentioned this in my sort of IEEE article, hardware vendors not named NVIDIA are likely to want more openness and a way to enter into the market. And they're looking for kind of a narrow interface to implement. Now, that might be down at the sort of bytecode level, the way I think of TVM. And it's not exactly that. TVM is more at the level of sort of graph operations and so on. But it might even be at a higher level than that.
Starting point is 00:10:47 And I think, again, we're still early in kind of figuring out exactly where that interface is. And there might be multiple of these kind of touch points where software developers building higher level frameworks can integrate. So, you know, TVM is certainly a project I would watch if watching closely in the space. You did mention graph operations, and I was going to ask you precisely about that, because to the best of my knowledge, ONNX is also leveraging the graph of operations to work. So do you, can you pinpoint, well, first of all, is my understanding correct? And second, if that is indeed the case, would you be able to pinpoint the
Starting point is 00:11:31 differences in how these two frameworks leverage the graph structure? Yeah, so I think Onyx takes an approach where it's a little bit more or a little bit less opinionated, where there are certain kernels that a framework developer must implement, things like convolution and matrix multiply and so on. And as long as you implement this subset of operations,
Starting point is 00:12:12 then ONIX will kind of walk your network structure recursively and translate that into its intermediate representation. And then finally, on the other side, framework developers figure out how to translate those operations into the target environment. TVM works a little bit more like a compiler in that it separates concerns and the language or the intermediate structure you have to generate is a little bit lower level, so differentiable kind of functions. And this intermediate representation then gets optimized through a passive kind of automatic compilation that they call auto-TVM, and then gets translated, again, to a target backend.
Starting point is 00:12:58 And really what I would say between these two frameworks is TVM is a bit lower level and ONIX is a bit higher level. And there are just some sort of tradeoffs associated with that. I think TVM has the potential to be perhaps a little bit more general. But the downside of that is that the scope of the optimizations it can make are potentially more narrow than what would be available in ONIX because ONIX would know a bit more about the sort of program structure. But, you know, again, these early projects, I would say, and I think they will both learn from each other over time, and I don't see them as kind of immediate competitors at the moment, just two ways to solve kind of similar problems. Okay.
Starting point is 00:13:45 But for someone who trains models, you know, using any kind of framework that has compatibility with either of those, either TVM or ONIX, at the end of the day, does it really make a difference? I mean, it's not something that they will actually have to get their hands dirty with, is it? I would hope not.
Starting point is 00:14:08 And so I think that, to me, that's a theme that I really hope to see emerge over the next couple of years, which is really strong separation of concerns between the various stages of sort of model development. So I think that, you know, for example, there are a whole bunch of systems out there for preparing your data for training and making it high performance and compact data structures and so on. That is a different kind of stage in the process, different workflow than actually thinking
Starting point is 00:14:43 the experimentation that goes into sort of model training and model development. And as long as you get your data in kind of those right formats, while you're in model development, it should not matter what, you know, upstream data system you're doing. Similarly, pertinent to our conversation, I think it shouldn't matter what training hardware you're running on, whether it's GPUs or CPUs or exotic accelerators
Starting point is 00:15:07 from Intel, Havana, or some company that we haven't heard of yet, right? I think that as long as the developer is developing in these kind of high-level languages and the accelerator is under the hood doing their job and giving them exactly that acceleration, that shouldn't matter. And then similarly, the degree to which model developers are worried about the hardware or the frameworks that they're using to deploy to that hardware, like TVM or ONIX, it should be mostly, in my mind, about how is that hardware going to satisfy my application constraints. So I'll give you an example. We work with a medical devices company that has hardware out in the field
Starting point is 00:15:52 that's old version CPUs and GPUs. They're not going to issue a hardware upgrade just so that they can run slightly more accurate models. Instead, their problem is almost the inverse. It's how do I get the most accurate model that can run on this particular hardware? And so they might start with a huge model and employ techniques like quantization and distillation to fit that hardware. Now, right now for them, it's a, well, we help them do that semi-automatically, but it's historically been a fairly manual process. And what we'd like to see over time is a way for users to specify,
Starting point is 00:16:28 these are my design constraints, these are my deployment constraints, have to fit in this memory footprint, have to be at least this accurate. One, perhaps help me automatically select the hardware that is going to best fit that task. Or if the hardware is fixed, help me automatically select the model that is going to most accurately kind of fit into that performance profile. And so, you know, that is some of the capabilities of what we do at Determined are anchored on that idea. So we have some facilities for neural architecture search that help enable some of this. This stuff is pretty early days, and even the research landscape is evolving rapidly.
Starting point is 00:17:13 And so what's interesting is if you can centralize the sort of techniques that you might be using to solve these problems, you can make the life of that end developer who's doing that model development much easier while serving them the latest and greatest from research over time. And that's a little bit different than traditional software just because the landscape is changing so quickly. Yeah, you touched on a number of interesting points, some of which I intended to ask you about anyway. So one of those was real-world constraints, basically, and that was
Starting point is 00:17:52 also mentioned in one of the other contributions I saw from you in that Wired article about, well, something which is, and rightfully so, if I may say so, in the headlines these days. So, the environmental impact of training models at scale and, well, I guess more training than deploying, because inference traditionally doesn't take as much energy, but definitely training. So, unless you're Google or any of the hyperscalers, that's a very real cost that you have to take into account. And it seems to me that it's less highlighted than it should be. So you proposed a kind of reverse engineering approach in a way. So these are my constraints.
Starting point is 00:18:43 I wouldn't want, for example, to spend more than XYZ in training. And so what can I do with this? How do you approach that? You mentioned that you have something in the works there or some kind of, maybe a little bit ad hoc at this point, but some kind of approach there? Yeah, so there's kind of, you raise an interesting point. There is this, so what I was talking about earlier was real-world constraints to model deployment for inference, which is deployed to mobile phones and edge devices and so on.
Starting point is 00:19:15 Or even not in those situations might be things like I'm running an online ad network and I need to minimize latency for my fraud detection models because I only have a fraction of a second to respond to some bid requests, something like that. That is one family of constraints. The other family of constraints that you just asked about, I think, is really around on the training side. And as I've pointed out before, as you point out, the cost of training has grown exponentially. In fact, the folks at OpenAI released a pretty interesting article now two years ago saying that the cost of training was up 300,000 times. That is, the cost of getting a state-of-the-art AI model was up 300,000 times over the course of kind of the previous six or seven years. And that trend has continued.
Starting point is 00:20:12 Just last year, the GPT-3 model that OpenAI released, you know, and this is a, they're an independent lab. They're not a hyperscaler. They are well-funded, but this GPT-3 language model, most of the estimates I've read, and we've done the calculations ourselves as well, put the cost of training that model somewhere in the sort of $7 to $12 million range to train a single model. Just an insane amount of computation, of energy, of money required to do that. And to your point, most of us mortals don't have that. Most of our organizations aren't willing to invest that much in a sort of speculative R&D project. And so what we need are tools that help our users reason about this cost, assign quotas, and so on. From the perspective of our particular software and technology, we do offer tools to allow people to scale their training.
Starting point is 00:21:04 At the heart of it, that's a lot of what we do because of that kind of computational cost. We can't be waiting two years for a model to converge. Our users want to be waiting a day, two days. That means they have to leverage parallelism, but the downside of that is the costs go up. So what do we do? Well, we allow them to set limits in terms of how much they would like to spend on a particular thing. And then on top of that, we have AutoML capabilities that are specifically designed for this regime, where the cost of training a single model might be extremely expensive. And things like hyperparameter tuning and neural architecture search feel like a pipe dream to most of the people that we talk to because, hey, training a single model, I'm measuring in maybe not millions of dollars, but certainly hundreds or thousands of dollars.
Starting point is 00:21:54 And training hundreds or thousands of those models to find one that might work well, that gets prohibitively expensive very quickly. So our technology, which is based on active learning, provides a way of saying, in the same budget I might use to train 10 models to convergence, I want to explore a space of potentially 1,000 models. And we don't train all of those models all the way up to convergence. We stop early via some principled mechanisms. And that allows people to get the benefit of exploration without sort of breaking the bank. That's something that ships in our product today as part of our hyperparameter optimization and neural architecture search features.
Starting point is 00:22:35 But we think that this concept in general of sort of mapping computation to cost and utilization of the resources and controls around that is going to be essential for most organizations that want to adopt these technologies, particularly if the scale and the growth rate of that compute increases over the next several years, as we think it is likely to. There's a number of, well, that's an interesting approach. And I would also add and ask your opinion about a couple of others as well. One is, well, the general tendency. People have suggested that not everyone, as you also pointed out,
Starting point is 00:23:19 not everyone has the means that the Googles of the world have to train these kinds of models. But what they could do instead is, well, kind of build on a specialization of these pre-trained models, something like Distilled Bird, for example. So what do you think of this approach? I think it's an amazing way of tackling this problem, right? I think, you know, if you think about this from a computer science perspective, in computer science, when we're talking about algorithms and analysis of these algorithms, we often talk about amortization of costs and letting, you know, one piece of the algorithm do a whole bunch of work and then a bunch of sort of leaf nodes of that algorithm doing much less work to take advantage of that. The same concept applies here when it comes to things like distillation and even just sort of fine-tuning or transfer learning. These are all similar approaches where you let the big guys,
Starting point is 00:24:25 the Facebooks and the Googles of the world, do the big training on huge quantities of data with billions of parameters and spending their hundreds of GPU years on a problem. And then instead of starting from scratch, you take those models and maybe use them to form embeddings that you're going to use for downstream tasks. And the simplest example is lop off the last layer of ImageNet and train
Starting point is 00:24:53 just one last layer of logistic regression on the features that are emitted from the standard weight or of like a ResNet-50, and you can build an alarmingly good classifier out of that. We've seen that this is true, as you mentioned, in NLP with Distilbert and many, many other kinds of applications. And so we see this deployed at our customers very frequently. Anybody who's dealing with visual data often will start with one of these really strong backbones, say a ResNet-50, and train with those pre-trained weights as a starting point. They might fine-tune those weights, they might fine-tune the last layer, but generally
Starting point is 00:25:38 speaking, they're not going to go and sort of reinvent the wheel when it comes to coming up with good image features. Same thing in NLP. Where it gets tricky is when the modality of the data that people are training on is totally different than what's available in sort of these, you know, the types of data that Google and Facebook and so on are training. And so, for example, we work with a pharmaceutical company that is working on four-dimensional, high-resolution microscopy images where the channels aren't even RGB anymore, they're something different. And there, unfortunately, an ImageNet backbone isn't going to do you much. And so you have to find other ways of coming up with representations
Starting point is 00:26:23 of the underlying data. There are ways you can do this by bootstrapping on some supervised learning task. We also see folks start to take advantage of semi-supervised learning for these tasks. But again, I think a lot of folks are recognizing that this could be a major cost if they don't get ahead of it. And so I'm excited as a computer scientist to see the sort of the algorithmic enhancements that are coming down the pipe to tackle this problem.
Starting point is 00:26:52 Okay, and the other approach that I wanted to ask you about, I think you potentially kind of hinted at some point of your elaboration, is, well, there's a fancy name for it, which is neuro-symbolic AI, which is, I guess, basically comes down in your terminology to inductive biases. So, kind of distilling more a priori knowledge into your models, or kind of a hybrid approach to AI that kind of tries to marry the good older fashioned symbolic AI with a learning. And by doing that, you may be able to, among other things, cut down on your training costs. Are you aware of that? And do you have an opinion on that? So I'm not super familiar with the details of neural symbolic AI in particular. But what I will say is some of the successes I have seen in AI outside of sort of the mainstream
Starting point is 00:27:54 kind of commodity, call it commodity use cases like NLP and vision where, you know, it's okay, there's a benchmark that everybody agrees on. There are standard data sets. We all know what the problem is, image classification or object detection or language translation, where you start to specialize a little bit more. Some of the greatest lifts we have seen is where you get a domain expert to infuse their knowledge of the underlying, say, physical phenomenon that is going on to talk about the, to more effectively kind of model the world. The example I like to draw here is kind of go back to Newtonian physics. If you set a neural net that was, you know, speed forward, a neural net with 100 parameters,
Starting point is 00:28:52 an XY coordinates of the trajectory of an ax flying through the air, and you asked it to predict where it's going to be in a second, and that was how you approach the problem. The neural net will probably, with enough examples, learn pretty well an approximation of the function of interest and be able to predict with a high degree of accuracy. But if you just tell the model a little bit more about the problem. Say, I expect the position of this ax in the air
Starting point is 00:29:30 to behave kind of quadratically as a function of time. Something like infusing the application with a little bit more knowledge of the physical world. The amount of data you need, the accuracy, et cetera, the amount of data is going to go way down. The accuracy is going to go way up and we're going to see some gravitational constants start to emerge as maybe one feature of the network or some combination of features. And to abandon sort of that thinking, I think it would be irresponsible. And so I'm excited anytime I see or hear about new approaches that infuse that sort of physical modeling of the world with the sort of approaches that we're doing with
Starting point is 00:30:13 neural networks. Neural networks are great. They're super powerful function approximators. But if I tell the computer a little bit more about what that function is, Hopefully I can save everybody, say, a few million bucks in compute and get models that more accurately represent the world. Great. Great. Thanks. I guess we're out of time. It went by in a snap. So thanks. Lots of interesting insights. Not all of them necessarily related to AI achieves, but well, who cares? It was interesting anyway. Hopefully, it was good. I had a lot of fun with this, George. I really appreciate you taking the time.
Starting point is 00:30:57 I hope you enjoyed the podcast. If you like my work, you can follow Link Data Orchestration on Twitter, LinkedIn, and Facebook.
