Orchestrate all the Things - AI chips in the real world: interoperability, constraints, cost, energy efficiency, models
Episode Date: February 3, 2021
As it turns out, the answer to the question of how to make the best of AI hardware may not be solely, or even primarily, related to hardware. Today's episode features Determined AI CEO and founder, Evan Sparks. Sparks is a PhD veteran of Berkeley's AMPLab with a long track record of accurate predictions in the chip market. We talk about an interoperability layer for disparate hardware stacks, ONNX and TVM -- two ways to solve similar problems, AI constraints and energy efficiency, and infusing knowledge in models.
Article published on ZDNet
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
Today's episode features Determined AI CEO and founder, Evan Sparks.
Sparks is a PhD veteran of Berkeley's AMPLab
with a long track record of accurate predictions in the chip market.
And as it turns out, the answer to the question of how to make the best of AI hardware may
not be solely or even primarily related to hardware.
I hope you will enjoy the podcast.
If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn and Facebook.
So yeah, thanks again for taking the time, Evan.
And a reasonable first step to take when having this
conversation is actually to ask people to say a few words about themselves and their background
and what brings you into this topic, kind of. So, how come you got interested in the AI chips
landscape? Yeah, absolutely. So happy to introduce myself.
So I'm Evan Sparks.
I'm co-founder and CEO of a company called Determined AI.
We're an early-stage software startup, and we build software really to help data scientists
and machine learning engineers accelerate their workloads and workflows.
So help them build AI applications faster. So you can think of us as
kind of a software infrastructure layer that sits kind of underneath the TensorFlows and the
PyTorches of the world and above the various chips and accelerators. So NVIDIA GPUs and many of the
other interesting ones that are kind of coming on the market as well, Google TPUs and so on.
And prior to starting Determined AI, I was a graduate student, a PhD student in the AMPLab at UC Berkeley,
where I focused on kind of distributed systems for large-scale machine learning.
And at Berkeley, I had the opportunity to work with some pretty interesting folks
on the architecture kind of side of the world.
Dave Patterson, in particular, was a co-author of mine
and a PI. And, you know, really from working with Dave, it was really interesting because he was
the one who was sort of banging the drum about Moore's Law being dead and custom silicon really being the only hope we had
for continued growth in the space really early on. And I kind of drank the Kool-Aid
a little bit. And I think we fast forward almost 10 years now and
we're seeing a lot of that kind of come to bear
in the sort of AI chip market. So that's a bit of an extended introduction.
And a very good one, if I may add, because it sounds like, in a way, you're
coming from maybe the opposite direction from the one I'm coming from. So you have more of
a hardware background than I do, and you got there definitely earlier than I did. I got interested in this
market kind of as an offshoot, let's say, of my interest in machine learning and training models
and inference and all of that. And it also
sounds like what you do with your company is very, very much
closely related to the hardware layer. And I also read with interest an article you
have written for the IEEE Spectrum, in which you basically argued that this whole renaissance of
custom chips is, well, maybe good for hardware and good for
AI in general, but it also poses a danger, let's say, for people who are responsible
for training these models because, well, it kind of forces them to over-specialize or
to pick sides, if you want to call it that.
So, you also mentioned a couple of the obvious, I guess,
names in your introduction like NVIDIA or Intel or Cloud vendors as well. So, I was wondering
how would you like to start to go through this conversation? One way I was thinking we could do
it was maybe start, let's say, with NVIDIA, then go to Intel and Qualcomm, which is also in
the news lately for having just completed or started actually an interesting acquisition.
Or the other way you could look at it, well, taking the lens of someone who's developing
models. Do you have a preference? Yeah, I mean, I think I would rather take the perspective of
somebody who's developing models, if that's cool with you. Yeah, yeah, sure. Okay. Yeah, go ahead.
Sorry, after you. Okay, I was just going to ask that. So if that's the way you were going to dissect it,
I guess one way to start would be to start with what kind of brings
all of these different custom chips and architectures together.
And to the best of my knowledge, at least,
the only thing I can think of would be ONNX. So you can
use different architectures or software to train your models, but you can always rely on ONNX to
deploy that for inference. And do you have an opinion on ONNX? Yeah, so ONyx is a very interesting initiative. I think if you go back to the genesis of Onyx,
it came out of Facebook originally.
And the reason ONNX was developed in the first place
was really because Facebook had a very disparate training and inference stack.
They had developed PyTorch internally
and model developers loved PyTorch,
especially in comparison with TensorFlow.
The eager execution mode, we can go on and on about the software properties
that they liked, but suffice to say, they really felt it was a much more productive experience when they were in that sort of model development side.
And yet at the same time, at that time anyway, the bulk of at least the deep learning models that Facebook was running in production were computer vision models that were running backed by Caffe2.
So Caffe is not a name we talk about a ton these days,
but was really one of the groundbreaking deep learning libraries,
particularly for computer vision in sort of the early days of that taking over.
And when I say that taking over, I mean like five years ago, right? Not centuries ago.
And so, you know, Facebook had this mandate that said something like,
research can be done in whatever language you want, PyTorch,
but production deployment of these models has to be in Caffe2.
And so that led to the need for this intermediate layer
that would translate between the model architectures
that were output in PyTorch land and input
into Caffe2.
And they started to see, hey, this is a really good general idea.
It's not too different from things that we've seen in compilers in the past where we've
got something like an intermediate representation between multiple high-level languages and
started to plug things in with multiple languages at the source and
multiple frameworks at the destination.
I would say ONNX is very interesting.
It's certainly matured over the last two years.
But I think just generally speaking in this space, the software landscape is moving so
fast that it's hard for anything yet to become really, really a standard.
And so I think ONNX is a great step in that direction, but I don't think it is totally there yet.
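For a sense of what that PyTorch-to-ONNX handoff looks like in practice, here is a minimal sketch; the model, input shape, and file name are placeholders rather than anything from Facebook's actual pipeline.

```python
# Minimal sketch: export a PyTorch model to the ONNX intermediate representation.
# The model (torchvision ResNet-18), input shape, and file name are placeholders.
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)  # example input used to trace the graph

torch.onnx.export(
    model,
    dummy_input,
    "resnet18.onnx",          # serialized graph of ONNX operators
    input_names=["input"],
    output_names=["output"],
    opset_version=11,
)
# The exported file can then be loaded by any ONNX-compatible runtime
# (onnxruntime, TensorRT, and so on) for inference on a different backend.
```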
You know, the other project that really comes to mind in this space is TVM.
And, you know, slightly different approach to, I would say, similar problems, but also something that is gaining traction, particularly among the chip vendors, is my understanding.
That's not, you know, my business.
I think the folks at OctoML can talk to you a bit more about that.
But there's certainly some interesting standards starting to emerge. Okay, yeah, it's definitely interesting for me to hear about TVM because, well, I didn't
know about it up until like a minute ago when you mentioned that, and I guess it may also
be the case for many other people.
So if you would be so kind to expand a little bit on that.
So what does it do exactly?
Where did it come from?
And why do you think it's interesting?
Yeah, so TVM is really interesting. I mean, it is a research project out of the University of Washington
that now has a commercial kind of effort behind it in OctoML. And really the goals of TVM
are sort of similar to the goals of ONNX
in making it possible to compile deep learning models
into what they call minimum deployable modules.
And kind of similarly from a compiler's kind of perspective to be able to automatically
optimize these models for different pieces of target hardware.
And so, you know, like I said, it's a relatively new project, but that said, it's got a pretty
strong open source community behind it as well.
And I'd say, you know, there are a whole bunch of folks
that would like to see it become a standard.
You know, it being open source is a huge piece of that.
But I think that in general, hardware vendors,
and I mentioned this in my sort of IEEE article,
hardware vendors not named NVIDIA are likely to want more openness and a way to
enter into the market. And they're looking for kind of a narrow interface to implement.
Now, that might be down at the sort of bytecode level, the way I think of TVM. And it's not
exactly that. TVM is more at the level of sort of graph operations and so on. But it might even be at a higher level than that.
And I think, again, we're still early in kind of figuring out exactly where that interface is.
And there might be multiple of these kind of touch points where software developers building higher level frameworks can integrate.
So, you know, TVM is certainly a project I would be watching closely
in the space.
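To make the compile-for-a-target idea concrete, here is a rough sketch of TVM's open-source Relay flow; the file name, input name, and target are placeholders, and a real deployment would typically also run auto-tuning.

```python
# Minimal sketch of TVM's open-source flow: import a model into Relay,
# optimize it, and compile it for a target backend.
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("resnet18.onnx")
shape_dict = {"input": (1, 3, 224, 224)}

# Translate the ONNX graph into Relay, TVM's graph-level intermediate representation.
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

target = "llvm"  # generic CPU backend; could be "cuda", "metal", etc.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# `lib` is a compiled module that TVM's lightweight runtime can load on the target device.
```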
You did mention graph operations, and I was going to ask you precisely about that, because
to the best of my knowledge, ONNX is also leveraging the graph of operations to work.
So do you, can you pinpoint, well, first of all, is my understanding correct? And second,
if that is indeed the case, would you be able to pinpoint the
differences in how these two frameworks leverage the graph
structure? Yeah, so I think
ONNX
takes an approach where it's a little bit less opinionated,
where there are certain kernels that a framework developer must implement,
things like convolution and matrix multiply and so on.
And as long as you implement this subset of operations,
then ONNX will kind of walk your network structure recursively and translate that into its intermediate representation.
And then finally, on the other side,
framework developers figure out how to translate those operations
into the target environment.
TVM works a little bit more like a compiler in that it separates concerns
and the language or the intermediate structure you have to generate is a little bit lower level,
so differentiable kind of functions.
And this intermediate representation then gets optimized through a pass of automatic optimization that they call AutoTVM, and then gets translated, again, to a target backend.
And really what I would say between these two frameworks is TVM is a bit lower level
and ONNX is a bit higher level.
And there are just some sort of tradeoffs associated with that.
I think TVM has the potential to be perhaps a little bit more general.
But the downside of that is that the scope of the optimizations it can make are potentially
more narrow than what would be available in ONNX, because ONNX would know a bit more about the sort of program structure. But, you know, again,
these are early projects, I would say, and I think they will both learn from each other over time,
and I don't see them as kind of immediate competitors at the moment, just two ways to solve kind of similar problems. Okay.
But for someone who trains models,
you know, using any kind of framework
that has compatibility with either of those,
either TVM or ONNX,
at the end of the day, does it really make a difference?
I mean, it's not something that they will actually
have to get their hands dirty with, is it?
I would hope not.
And so I think that, to me, that's a theme that I really hope to see emerge over the next couple of years,
which is really strong separation of concerns between the various stages of sort of model development. So I think that, you know, for example,
there are a whole bunch of systems out there
for preparing your data for training
and making it high performance
and compact data structures and so on.
That is a different kind of stage in the process,
different workflow than actually thinking
the experimentation that goes into sort of model training
and model development.
And as long as you get your data in kind of those right formats,
while you're in model development, it should not matter what, you know,
upstream data system you're using.
Similarly, pertinent to our conversation,
I think it shouldn't matter what training hardware you're running on,
whether it's GPUs or CPUs or exotic accelerators
from Intel Habana, or some company that we haven't heard of yet, right?
I think that as long as the developer is developing in these kind of high-level languages and
the accelerator is under the hood doing their job and giving them exactly that acceleration,
that shouldn't matter. And then similarly, the degree to which model developers are worried about the hardware
or the frameworks that they're using to deploy to that hardware, like TVM or ONNX, it should
be mostly, in my mind, about how is that hardware going to satisfy my application constraints.
So I'll give you an example. We work with a medical devices company
that has hardware out in the field
that's running older-generation CPUs and GPUs.
They're not going to issue a hardware upgrade
just so that they can run slightly more accurate models.
Instead, their problem is almost the inverse.
It's how do I get the most accurate model that can run on this particular hardware? And so they might
start with a huge model and employ techniques like quantization and distillation to fit that
hardware. Now, right now for them, it's a, well, we help them do that semi-automatically, but it's
historically been a fairly manual process. And what we'd like to see over time is a way for users to specify,
these are my design constraints, these are my deployment constraints,
have to fit in this memory footprint, have to be at least this accurate.
One, perhaps help me automatically select the hardware that is going to best fit that task.
Or if the hardware is fixed, help me automatically select the model that is going to most accurately kind of fit into that performance profile.
And so, you know, some of the capabilities of what we do at Determined are anchored on
that idea.
So we have some facilities for neural architecture search that help enable
some of this. This stuff is pretty early days, and even the research landscape is evolving rapidly.
And so what's interesting is if you can centralize the sort of techniques that you might be using to
solve these problems, you can make the life of that end developer who's doing that
model development much easier while serving them the latest and greatest from research
over time.
And that's a little bit different than traditional software just because the landscape is changing
so quickly.
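As one example of the kind of technique mentioned here, the following is a minimal sketch of post-training dynamic quantization in PyTorch; the toy model and layer sizes are placeholders, not the medical-device workload.

```python
# Minimal sketch of post-training dynamic quantization in PyTorch, one way of
# squeezing a model onto fixed hardware. The toy model and sizes are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
).eval()

# Swap Linear layers for int8 dynamically quantized equivalents: a smaller
# memory footprint and faster CPU inference, at some (usually small) accuracy cost.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```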
Yeah, you touched on a number of interesting points, some of which I intended
to ask you about anyway. So one of those was real-world constraints, basically, and that was
also mentioned in one of the other contributions I saw from you in that Wired article about, well,
something which is, and rightfully so, if I may say so, in the headlines these days. So, the environmental impact of
training models at scale and, well, I guess more training than deploying, because inference
traditionally doesn't take as much energy, but definitely training. So, unless you're Google
or any of the hyperscalers, that's a very real cost that you have to take into account.
And it seems to me that it's less highlighted than it should be.
So you proposed a kind of reverse engineering approach in a way.
So these are my constraints.
I wouldn't want, for example, to spend more than XYZ in training. And so what can I do with this?
How do you approach that?
You mentioned that you have something in the works there
or some kind of, maybe a little bit ad hoc at this point,
but some kind of approach there?
Yeah, so there's kind of, you raise an interesting point.
There is this, so what I was talking about earlier was real-world constraints to model deployment for inference,
which is deployed to mobile phones and edge devices and so on.
Or even outside of those situations, it might be things like, I'm running an online ad network
and I need to minimize latency for my fraud detection models
because I only have a fraction of a second to respond to some bid requests, something like that.
That is one family of constraints.
The other family of constraints that you just asked about, I think, is really around on the training side.
And as I've pointed out before, as you point out, the cost of training has grown exponentially.
In fact, the folks at OpenAI released a pretty interesting article now two years ago saying that the cost of training was up 300,000 times. That is, the cost of getting a state-of-the-art AI model was up 300,000 times over the course of kind of the previous six or seven years.
And that trend has continued.
Just last year, the GPT-3 model that OpenAI released, you know, and this is a, they're an independent lab. They're not a hyperscaler. They are well-funded, but this GPT-3 language
model, most of the estimates I've read, and we've done the calculations ourselves as well,
put the cost of training that model somewhere in the sort of $7 to $12 million range to train a single model.
Just an insane amount of computation, of energy, of money required to do that.
And to your point, most of us mortals don't have that.
Most of our organizations aren't willing to invest that much in a sort of speculative R&D project.
And so what we need are tools that help our users reason about this cost, assign quotas, and so on.
From the perspective of our particular software and technology, we do offer tools to allow people to scale their training.
At the heart of it, that's a lot of
what we do because of that kind of computational cost. We can't be waiting two years for a model
to converge. Our users want to be waiting a day, two days. That means they have to leverage
parallelism, but the downside of that is the costs go up. So what do we do? Well, we allow them to
set limits in terms of how much they would like to spend on a particular thing.
And then on top of that, we have AutoML capabilities that are specifically designed for this regime,
where the cost of training a single model might be extremely expensive.
And things like hyperparameter tuning and neural architecture search feel like a pipe dream to most of the people that we talk to because, hey, training a single model, I'm measuring in maybe not millions of dollars, but certainly hundreds or thousands of dollars.
And training hundreds or thousands of those models to find one that might work well, that gets prohibitively expensive very quickly. So our technology, which is based on active learning, provides a way of saying, in the same budget I might use to train 10 models to convergence, I want to explore a space of potentially 1,000 models.
And we don't train all of those models all the way up to convergence.
We stop early via some principled mechanisms.
And that allows people to get the benefit of exploration
without sort of breaking the bank.
That's something that ships in our product today
as part of our hyperparameter optimization
and neural architecture search features.
But we think that this concept in general
of sort of mapping computation to cost
and utilization of the resources and controls around that is going to be essential for most organizations that want to adopt these technologies,
particularly if the scale and the growth rate of that compute increases over the next several years, as we think it is likely to.
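To illustrate the explore-many-but-stop-most-early idea in the abstract, here is a toy successive-halving sketch; it is not Determined AI's implementation, and the training function is a stand-in for real, expensive training.

```python
# Toy sketch of budget-aware search: try many configurations, stop most of them
# early, and reinvest the saved budget in the promising ones.
import random

def train_briefly(config, epochs):
    """Placeholder: pretend to train `config` for `epochs` and return a validation score."""
    return random.random() + 0.1 * epochs * config["lr"]

def successive_halving(configs, rounds=3, epochs_per_round=1):
    survivors = list(configs)
    for _ in range(rounds):
        scored = sorted(
            ((train_briefly(c, epochs_per_round), c) for c in survivors),
            key=lambda pair: pair[0],
            reverse=True,
        )
        # Keep only the top half of configurations each round.
        survivors = [c for _, c in scored[: max(1, len(scored) // 2)]]
    return survivors[0]

candidates = [{"lr": random.uniform(1e-4, 1e-1)} for _ in range(16)]
best = successive_halving(candidates)
print("best config:", best)
```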
There's a number of, well, that's an interesting approach.
And I would also add and ask your opinion about a couple of others as well.
One is, well, the general tendency.
People have suggested that not everyone, as you also pointed out,
not everyone has the means that the Googles of the world have to train these kinds of models. But what they could do instead is, well, kind of build on a specialization of these pre-trained models,
something like DistilBERT, for example.
So what do you think of this approach?
I think it's an amazing way of tackling this problem, right? I think, you know, if you think about this from a
computer science perspective, in computer science, when we're talking about algorithms and analysis
of these algorithms, we often talk about amortization of costs and letting, you know,
one piece of the algorithm do a whole bunch of work and then a bunch of sort of leaf nodes of that algorithm doing much less work to take advantage of that.
The same concept applies here when it comes to things like distillation and even just sort of fine-tuning or transfer learning. These are all similar approaches where you let the big guys,
the Facebooks and the Googles of the world,
do the big training on huge quantities of data
with billions of parameters
and spending their hundreds of GPU years on a problem.
And then instead of starting from scratch,
you take those models and maybe use them
to form embeddings that you're going to use
for downstream tasks. And the simplest example is, lop off the last layer of an ImageNet-trained model and train
just one last layer of logistic regression on the features that are emitted from the standard weights
of, like, a ResNet-50, and you can build an alarmingly good classifier out of that.
We've seen that this is true, as you mentioned, in NLP with DistilBERT
and many, many other kinds of applications.
And so we see this deployed at our customers very frequently.
Anybody who's dealing with visual data often will start with one of these
really strong backbones, say a ResNet-50, and train with those pre-trained weights as a starting
point. They might fine-tune those weights, they might fine-tune the last layer, but generally
speaking, they're not going to go and sort of reinvent the wheel when it comes to coming up
with good image features. Same thing in NLP.
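A minimal sketch of the "lop off the last layer" recipe described here, assuming a torchvision ResNet-50 backbone; the class count and optimizer settings are placeholders for whatever the downstream task needs.

```python
# Minimal sketch: start from pretrained ImageNet weights, freeze the backbone,
# and train only a new classification head on your own labels.
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet50(pretrained=True)
for param in model.parameters():
    param.requires_grad = False  # keep the pretrained image features fixed

num_classes = 5  # placeholder: number of labels in your own dataset
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new trainable head

# Only the new head's parameters are optimized; layers can be unfrozen later to fine-tune.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```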
Where it gets tricky is when the modality of the data that people are training on is totally different than what's available in sort of these, you know, the types of data
that Google and Facebook and so on are training on.
And so, for example, we work with a pharmaceutical company that is working on
four-dimensional, high-resolution microscopy images where the channels aren't even RGB anymore,
they're something different. And there, unfortunately, an ImageNet backbone isn't
going to do you much good. And so you have to find other ways of coming up with representations
of the underlying data.
There are ways you can do this by bootstrapping on some supervised learning task.
We also see folks start to take advantage of semi-supervised learning for these tasks.
But again, I think a lot of folks are recognizing that this could be a major cost if they don't
get ahead of it.
And so I'm excited as a computer scientist
to see the sort of the algorithmic enhancements
that are coming down the pipe to tackle this problem.
Okay, and the other approach that I wanted to ask you about,
I think you potentially kind of hinted at some point of your elaboration,
is, well, there's a fancy name for it, which is neuro-symbolic AI, which, I guess, basically
comes down in your terminology to inductive biases. So, kind of distilling more a priori
knowledge into your models, or kind of a hybrid approach to AI that kind of tries to marry
the good old-fashioned symbolic AI with learning. And by doing that, you may be able to,
among other things, cut down on your training costs. Are you aware of that? And do you have
an opinion on that? So I'm not super familiar with the details of neuro-symbolic AI in particular. But what I will say is, some of the successes I have seen in AI have been outside of sort of the mainstream
kind of commodity, call it commodity use cases like NLP and vision where, you know, it's
okay, there's a benchmark that everybody agrees on. There are standard data sets.
We all know what the problem is, image classification or object detection or language translation,
where you start to specialize a little bit more.
Some of the greatest lifts we have seen are where you get a domain expert to infuse their knowledge of the underlying, say, physical phenomenon that is going on,
to more effectively kind of model the world.
The example I like to draw here is to kind of go back to Newtonian physics. If you fed a neural net that was, you know, feed-forward,
a neural net with 100 parameters,
the XY coordinates of the trajectory
of an axe flying through the air, and
you asked it to predict where it's going to be in a second, and that was
how you approached the problem.
The neural net will probably, with enough examples,
learn pretty well an approximation of the function of interest
and be able to predict with a high degree of accuracy.
But what if you just tell the model a little bit more about the problem? Say, I expect the position of this axe in the air
to behave kind of quadratically as a function of time.
Something like infusing the application
with a little bit more knowledge of the physical world.
The amount of data you need is going to go way down,
the accuracy is going to go way up, and we're going to see some gravitational constant start to
emerge as maybe one feature of the network or some combination of features. And to abandon sort
of that thinking, I think, would be irresponsible. And so I'm excited anytime I see or hear about new approaches that infuse
that sort of physical modeling of the world with the sort of approaches that we're doing with
neural networks. Neural networks are great. They're super powerful function approximators.
But if I tell the computer a little bit more about what that function is, hopefully I can save everybody, say, a few million bucks in compute and get
models that more accurately represent the world.
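As a toy version of the trajectory example, here is a sketch that bakes the quadratic-in-time assumption into the model and recovers a gravitational constant from synthetic, made-up data.

```python
# Toy version of the trajectory example: instead of a generic function
# approximator, assume position is quadratic in time and fit the coefficients.
import numpy as np

g = 9.81
t = np.linspace(0.0, 2.0, 50)
y = 10.0 + 15.0 * t - 0.5 * g * t**2 + np.random.normal(0.0, 0.05, t.shape)

# Fit y = a*t^2 + b*t + c; the leading coefficient should come out near -g/2.
a, b, c = np.polyfit(t, y, deg=2)
print(f"estimated g ~ {-2 * a:.2f} m/s^2")
```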
Great. Great. Thanks. I guess we're out of time. It went by in a snap. So thanks. Lots
of interesting insights. Not all of them necessarily related to AI chips,
but well, who cares? It was interesting anyway.
Hopefully, it was good. I had a lot of fun with this, George. I really appreciate you
taking the time.
I hope you enjoyed the podcast. If you like my work, you can follow Linked Data Orchestration
on Twitter, LinkedIn, and Facebook.