Orchestrate all the Things - Bringing Deep Learning to your hardware of choice, one DeciNet at a time. Featuring Deci CEO / Co-founder Yonatan Geifman
Episode Date: February 16, 2022
Training deep learning models is costly and hard, but not as much as deploying and running them in production. Deci wants to help address that. Article published on ZDNet.
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
Training deep learning models is costly and hard,
but not as much as deploying and running them in production.
Deci wants to help address that.
I hope you will enjoy the podcast.
If you like my work, you can follow Linked Data Orchestration
on Twitter, LinkedIn and Facebook. And yeah, the typical way to start is by me asking you to say a few words about yourself, Yonatan, and your background, and the founder story, let's say, for Deci.
Actually, no, I'll take it back. My first question is going to be on pronunciation, actually.
So how do you pronounce it? Is it Deci or Deci?
Deci.
Okay, okay, thank you.
Yeah, the founder story for Deci.
Yeah, so after completing my PhD in computer science at the Technion, I started Deci together with my PhD advisor, Professor Ran El-Yaniv, and another co-founder who is a longtime friend, Jonathan Elial. We started Deci because, during my PhD studies, while me and Ran were also working at Google, we saw how deep learning is hardly getting into production, for various reasons. One of the things we saw was that many, many companies were focused on making those algorithms more scalable, so they can run better in production environments. We saw it at Google scale on one hand, and we also had some peers in industry who were struggling to take models into production.
And also at that time, we saw a lot of hardware companies trying to build better chips to run inference at scale.
So we thought maybe we could do something better in the model design area, in order to make those algorithms more scalable, more efficient, and able to run better in production environments.
And this is basically how we started.
We looked for an automated approach to design models that are more efficient,
more efficient in their structure and in how they interact with the underlying hardware in production time.
And we got to a technology that is called neural architecture search.
That is a technology that automatically designs the structure of a neural network with respect to several constraints or objectives.
For example, how can we get a model that is both accurate and runs fast on the hardware? That was the early days of AutoNAC, building an automated algorithm to design neural network structures that satisfy multiple objectives in the optimization. And from there, we ended up with a deep learning development platform, which helps in the model design and training phases in order to get better performance in the production phase.
Okay. Can I ask you, would you be able to share a few key milestones, let's say,
and facts about the company? So I think you started that in 2019 and you have already gotten one funding round, if I'm correct.
And can you share, for example, also like how many people work for the company at this point?
Yes, sure. So at the moment, we are 40 people in the company.
Most of them are based in Israel, but some of them are in the US mainly for go-to-market. We started in October 2019.
Immediately a few months after that,
we closed our seed round of $9.1 million.
Around February 2020, we launched our community tier
SaaS platform and opened it to the public
to use our technology on a community tier. In September, we closed the $21 million A round led by Insight Partners. And recently we released an open source library called SuperGradients, for people to train deep learning models more easily: a training library that provides all the well-known models from open source and academic papers, which can be easily reproduced and trained on top of our open source.
So those are kind of the main milestones that I can tell. Before that, about six months ago, we announced the first version of DeciNets. DeciNets are our models that are optimized for GPU and CPU, for various computer vision and NLP tasks. And we are now announcing a new version of DeciNets that are optimized for classification on CPUs.
So that is roughly the story of the company from inception to where it is now, including traction with Fortune 500 companies among our customers, which obviously gave us the traction to raise our A round. And we are expanding our go-to-market since the last round, which is what we are mostly focused on at the moment, as the product and technology are now mature enough to serve Fortune 500 companies, alongside a self-serve offering for the long tail of the mid-market and smaller organizations that can take our software and use it for their needs.
Okay, thank you. Thanks for sharing. So, yeah, I have to admit that I wasn't familiar with the company previously.
So I did a little bit of basic looking around, let's say, to try and figure out the key premise behind the concept.
And based on what you described so far, I think that I've got it right: the key premise seems to be, let's say, algorithmic optimization of machine learning models to boost their performance.
By the way, you mentioned specifically deep learning.
Are you focusing specifically on deep learning or are you also serving other types of models as well?
So we are focused at the moment on deep learning, mainly on computer vision and NLP. But maybe first let's understand the problem. Deep learning models are very computationally intensive in two ways. One of them is the training process of building those models. And the second aspect is how to take those models to production at scale, or to make them work on low-power edge devices.
And when you look at how to do those, you understand that you need a different approach to developing those models, because every data scientist knows that in order to get better accuracy in deep learning, you can take larger models and train them for a bit more time with a bit more data, and you will get better results. But this created a kind of divergence between the need for accuracy and the need for speed in deep learning models: as you go larger, it's easier to get better results, but then you struggle more and more to take those models into production.
And what Deci is promising is to solve that dual optimization problem, by providing you with the platform and tools to build models that are both accurate and fast, so efficient, in production. And this is the problem, and the solution that Deci offers for that problem.
Okay. All right.
Another question I wanted to ask you about your technology.
Obviously, it sounds like it's proprietary.
I was wondering if you have any patents or pending patents around that.
Yeah, so we have several patents on different components of the company's core technology. AutoNAC is the core algorithm that drives our technology, which is a neural architecture search algorithm, an algorithm that searches for neural network structures that satisfy several optimization objectives or constraints. For example, we want to build a model that will reach some level of accuracy, but we also want it to stay within some level of latency.
And solving that optimization problem requires more than manual tweaking of existing neural architectures; it requires an algorithm that can design a specialized model for specialized use cases. Solving this problem requires being aware of the data and the machine learning task that we want to solve, and also being aware of the production hardware that we want to deploy that model on, in order to optimize the specific performance, latency, and throughput on that type of hardware.
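The constrained search AutoNAC performs is proprietary, but to make the general idea of hardware-aware neural architecture search concrete, here is a minimal, illustrative sketch of a random search over a toy architecture space under a latency constraint. The search space, the accuracy and latency estimators, and the budget are all made up for illustration; they are not Deci's.

```python
import random

# Toy search space: each candidate architecture is described by depth, width, and input resolution.
SEARCH_SPACE = {
    "depth": [2, 4, 8, 16],
    "width": [32, 64, 128, 256],
    "resolution": [160, 192, 224],
}

def estimate_accuracy(arch):
    # Stand-in for training/evaluating a candidate, or for a learned accuracy predictor.
    return (0.50
            + 0.04 * SEARCH_SPACE["depth"].index(arch["depth"])
            + 0.03 * SEARCH_SPACE["width"].index(arch["width"])
            + 0.02 * SEARCH_SPACE["resolution"].index(arch["resolution"]))

def measure_latency_ms(arch):
    # Stand-in for benchmarking the candidate on the actual target hardware.
    return 0.2 * arch["depth"] * (arch["width"] / 64) * (arch["resolution"] / 224) ** 2

def search(latency_budget_ms, n_trials=200, seed=0):
    """Return the most accurate sampled architecture that respects the latency budget."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        arch = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        if measure_latency_ms(arch) > latency_budget_ms:
            continue  # violates the latency constraint on the target hardware
        acc = estimate_accuracy(arch)
        if best is None or acc > best[0]:
            best = (acc, arch)
    return best

print(search(latency_budget_ms=2.0))
```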
Yeah, initially I had the impression that this optimization process, let's say, was sort of custom. Something like clients, users providing their existing models that they have pre-trained, and then you somehow optimizing those. Apparently, it seems like this is not the case. It seems like you have something like a library of pre-trained DeciNets that people can use.
Would you be able to share what kind of areas those pre-trained DeciNets cover?
Yeah, so at the moment DeciNets cover computer vision and NLP applications.
Let's start with computer vision.
There are three main tasks in computer vision,
namely classification, object detection, and semantic segmentation. But as I mentioned,
the optimization problem is also hardware-aware. So we have multiple types of DeciNets for each task, tackling different levels of performance, which is the trade-off between accuracy and latency, for example,
on multiple types of hardware. So we have dozens of models pre-optimized for customers
to use in a completely self-serve offering,
ranging from various computer vision tasks to NLP tasks on any type of hardware to be deployed in production.
Okay. Since the models are pre-trained, how do people, how do users get to customize them to work specifically for their use cases and for their datasets?
So this relies on the library that we released as open source, called SuperGradients. They take DeciNets together with SuperGradients, which is a training library that enables them to fine-tune or customize the models, any DeciNet that they want, and train it to their needs: adapt it to their datasets and run a training cycle on their data, in order to get performance on the data or the tasks that they are trying to solve.
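As a rough illustration of that fine-tuning step, a minimal transfer-learning sketch in plain PyTorch is shown below. The actual SuperGradients API differs (it wraps the training loop, losses, and recipes for you), and the model choice, dataset path, class count, and hyperparameters here are purely illustrative.

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Illustrative only: a generic fine-tuning loop, not the SuperGradients API.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a pre-trained backbone and replace its head for the custom task.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 10)  # e.g. 10 custom classes
model = model.to(device)

train_ds = datasets.ImageFolder(
    "path/to/custom_dataset/train",  # hypothetical dataset location
    transform=transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ]),
)
train_loader = torch.utils.data.DataLoader(train_ds, batch_size=32, shuffle=True)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

model.train()
for epoch in range(5):  # a short fine-tuning run on the customer's data
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```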
Okay, I see. So they have to rely on using this library as well.
Yes, so the library, connected with DeciNets, is our offering for the training phase, for building those algorithms based on the repository of pre-trained models like DeciNets.
Mm-hmm. All right, I see. Another thing that caught my eye, again trying to
figure out, let's say, your value proposition, basically, was
something, some statement I read on your website,
which read something like delivering models that outperform any other hardware or software optimization technology.
And to me, that's really a little bit strange because, well, obviously, there's nothing
wrong.
On the contrary, it's a very good call to optimize models the way that you do. But that doesn't necessarily mean that, you know,
this is the optimal optimization strategy,
I may say that it sounds a bit strange using optimal twice,
but I think you get the point.
And by that, I mean that, you know,
there may well be a case where you may get good
or even better results if you just switch your hardware to something that works more effectively. And, you know, as a practitioner, I guess you would probably also agree with that. At the same time, you said earlier yourself that you target all sorts of different hardware infrastructure. So I'm just wondering, what's the takeaway from all that?
Yeah, so maybe I will explain a little bit how the development lifecycle looks in deep learning, and how this offering is connected to that development lifecycle. Usually, especially in edge applications, the hardware is set or selected in advance. For example, if we develop a medical device or an autonomous vehicle, we know what type of hardware we have on that edge device. Or if we develop a mobile application, we know that we need to support a wide range of iPhones and other mobile devices. So in edge applications the hardware is set, and given that hardware that you need to run your models on, you want to get the maximum performance with the level of accuracy that you need. So this is one aspect: the hardware is set, and for that hardware you want to optimize the models to run as well as possible on it.
On the other hand, in the cloud, you have a wide variety of hardware
that you can use, hardware types that you can use in order to run your models.
In that respect, what Deci is offering is a recommendation and benchmarking tool in our SaaS offering, where you can compare latency, throughput, and cloud cost across the various cloud instances and hardware types.
And based on that comparison, you can say,
okay, I want to optimize the model to run on CPU
or GPU looks better to me.
Let's optimize the model to run on GPU.
So even if you switch to better hardware, you can get even better performance by using Deci for the target hardware that you are now using. So I think the questions of hardware selection and model selection are orthogonal, in the sense that whatever hardware you choose, you can also optimize the model to run better on that hardware with Deci. So this is how we see it at Deci.
Okay, I see. So it makes
more sense now because apparently you're
referring specifically to the
inference part, not the training part
and this is why you refer
to the hardware being
set, which is true obviously.
Normally you can't go around and change
the type of processors that you have on the edge device and things like that.
I was more thinking about training, but in your case that you have...
Actually, that's also kind of a follow-up question.
So you mentioned previously how people can customize your pre-trained models, your pre-trained DeciNets.
And where do you see them doing that typically?
So if we think about a machine learning task, usually what we see is that each model belongs to a family, and a family is a set of models where each one can be larger or smaller, spanning a range of accuracy levels and latency levels. If, for example, we take the well-known ResNet family, we have ResNet-18, ResNet-34, ResNet-50, ResNet-101; each one of them has more layers. It will usually be more accurate than the previous one, but it will also run slower at inference time. So one of the first things to do is to select the right point on that accuracy-latency trade-off for the specific application that you're working on, selecting out of the family of models that we provide. And by the way, we provide the same with DeciNets. We have DeciNets one to five, or one to twelve. Each one of them gradually improves in accuracy, is larger, and also runs a little bit slower at inference. So based on that family, you can choose your sweet spot. So the first step is to choose which DeciNet to use, which one will be best for the application.
The next thing to do is to optimize the structure of the input and the output. For example, people use different image resolutions. Some of them use RGB images, some use grayscale images in computer vision, and some also use depth maps. So the next thing to do is to adapt the input and output of the model to suit the machine learning problem.
And the last thing to do in the customization is to train it. If you have some clever loss functions or data augmentation techniques, you can bring them to DeciNets, inject those additions into the training script on top of SuperGradients, and leverage those as well in order to get better accuracy for your model, because we are not replacing the data scientists. So any insights that the data scientist has on the given problem and how to train models for it should also be utilized when training the DeciNets model.
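To illustrate the input/output adaptation step with a rough sketch: adapting a pre-trained backbone to, say, grayscale input and a different number of output classes might look like the PyTorch snippet below. This is a generic example, not Deci's tooling; the layer names apply to torchvision's ResNet, and the class count and channel count are illustrative.

```python
import torch.nn as nn
from torchvision import models

def adapt_resnet(num_classes: int, in_channels: int = 1):
    """Adapt a pre-trained ResNet to non-RGB input and a custom classification head."""
    model = models.resnet18(pretrained=True)

    # Input adaptation: replace the first conv so it accepts `in_channels` instead of 3 RGB channels.
    model.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)

    # Output adaptation: replace the classifier head to match the task's number of classes.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

model = adapt_resnet(num_classes=4, in_channels=1)  # e.g. grayscale images, 4 classes
```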
OK, excuse me. OK, I see. So let's come then to what you are about to announce. You're releasing a new model as far as I got it, and also I think some benchmarks to go with that. The benchmarks, from what I saw, are specifically targeted at CPUs, and the key point there seems to be that, by applying this type of optimization, CPUs are now able to run models that they were previously unable to, at least that's my summary of it. I'll let you explain it in your own words, but I do have some questions.
Sure. So a few months ago we announced the first family of DeciNets optimized for NVIDIA GPUs and NVIDIA Jetson, and now what we are announcing is a DeciNets family for CPU, specifically for the Cascade Lake CPU that is widely used in the cloud. This is a family of 12 models for image classification that span from around 70% accuracy on ImageNet up to models that reach almost 85% accuracy on ImageNet, models that start from sub-millisecond latency on CPU, which is very, very fast, up to models that take something like eight milliseconds to run on a CPU. And those models are creating a new efficient frontier of the accuracy-latency trade-off.
You can imagine a graph where on the y-axis, we have the accuracy, and on the x-axis, we
have the latency.
And we can position each and every model on that curve.
And when we do it with all the open source models like ResNet, EfficientNet, and other models like RegNet, MobileNet, and so on, we can see a kind of trade-off, an optimal trade-off between latency and accuracy: as you go for faster models with lower latency, we see that the models are less accurate. And when we put those DeciNets on that curve, what we see is a significant new efficient frontier that outperforms, in the combination of latency and accuracy, every model that exists in open source.
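To make the efficient-frontier idea concrete, here is a small illustrative sketch that extracts the Pareto frontier from measured (latency, accuracy) points. The model names and numbers are invented, not Deci's benchmark results.

```python
# Each entry: (model name, latency in ms, ImageNet top-1 accuracy). All values invented for illustration.
measurements = [
    ("model-a", 1.2, 0.740),
    ("model-b", 2.5, 0.780),
    ("model-c", 2.4, 0.760),
    ("model-d", 4.8, 0.792),
    ("model-e", 1.5, 0.792),
]

def pareto_frontier(points):
    """Keep only models for which no other model is both at least as fast and at least as accurate."""
    frontier = []
    for name, lat, acc in points:
        dominated = any(
            other_lat <= lat and other_acc >= acc and (other_lat, other_acc) != (lat, acc)
            for _, other_lat, other_acc in points
        )
        if not dominated:
            frontier.append((name, lat, acc))
    return sorted(frontier, key=lambda p: p[1])  # sort by latency, e.g. for plotting the curve

for name, lat, acc in pareto_frontier(measurements):
    print(f"{name}: {lat:.1f} ms, {acc:.1%} top-1")
```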
So if we take an example, comparing to EfficientNet-B1, we see a model like DeciNet-3 that gets the same accuracy, but instead of running in something like 4.8 milliseconds per instance, can run below two, something like one and a half milliseconds per prediction of one instance. So this is a very significant boost in performance, and all of this happens while preserving the prediction accuracy, in this case on the ImageNet classification problem, but this could be adapted to any use case of a customer.
Okay, I see. Another follow-up
question I had on that was, obviously you ran some benchmarks, and whenever there are benchmarks involved, one of the first questions people ask is whether those benchmarks are available and reproducible by third parties, and so on.
Yes. So all these models will be out and publicly available on our SaaS platform,
which you can sign up for on our website, and they will be demonstrated there in our model hub, which is a model repository provided on our platform.
In terms of benchmarks, I must say that there are a lot of ways to benchmark machine learning models, in terms of what you include in the baseline and what you include in the performance measurement of the optimized model. Usually what we do is make it as apples-to-apples as we can. So in these benchmarks, for example, we used the graph compiler called OpenVINO, which is provided by Intel as open source, to compile and quantize all the models, both the baselines and the DeciNets. So the comparison is as apples-to-apples as possible, in order not to inject into the comparison any performance boost that is given by any open source or standard tools.
So everything is compared with a very strict benchmarking technique, including OpenVINO and quantization. And those results can be reproduced and demonstrated on top of our SaaS platform, which you can sign up for from our website.
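As a rough illustration of what a strict latency benchmark involves (warm-up, a fixed input, many repetitions, percentile reporting), here is a generic timing harness. It deliberately leaves out the OpenVINO compilation and quantization steps described above, and the `infer` callable is a stand-in for any compiled model's predict function, not a Deci or Intel API.

```python
import statistics
import time

import numpy as np

def benchmark_latency(infer, input_shape=(1, 3, 224, 224), warmup=50, iters=500):
    """Time a single-instance inference callable and report latency statistics in milliseconds."""
    x = np.random.rand(*input_shape).astype(np.float32)

    for _ in range(warmup):  # warm-up: exclude one-time initialization costs from the measurement
        infer(x)

    latencies_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        infer(x)
        latencies_ms.append((time.perf_counter() - start) * 1000)

    latencies_ms.sort()
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "p50_ms": latencies_ms[len(latencies_ms) // 2],
        "p95_ms": latencies_ms[int(len(latencies_ms) * 0.95)],
    }

# Usage: run the same harness on the baseline and the optimized model, both compiled
# with the same toolchain, so the comparison stays apples to apples.
# stats = benchmark_latency(compiled_model_infer_fn)
```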
Okay, thank you. You also mentioned Intel and it's also part of the announcement and I think you have a partnership
going on with Intel. I wanted to ask you specifically about that. So how did this partnership come along, what's the motivation basically for both parties, and how does it weave into this announcement, because you're specifically targeting CPUs here, and how does it fit into the strategy for both parties?
Yes.
So this announcement is not part of the collaboration with Intel as such, but the tools that we're using are by Intel, the hardware is by Intel, and the collaboration with Intel is tackling some other aspects of working together, which I will be happy to discuss.
So we have a long-standing partnership with Intel, starting from collaborating on MLPerf and submitting our models together with Intel to boost the state-of-the-art performance in MLPerf. MLPerf is a widely used benchmark for the performance of machine learning models that happens twice a year. So that was the first step in the collaboration with Intel. The second step was a partnership agreement with Intel's sales organization, whereby Intel sells Deci's solution to its customers in order to optimize the performance of their machine learning models running in production.
And we have another thing that is baking and will be announced soon, but I cannot refer to it yet; maybe we will talk about it in another chapter of your podcast in a few months. But we are working on a new announcement with Intel that we will share very soon, I believe.
So this is the partnership with Intel.
How both companies benefit from that partnership is a good question.
I think that what Intel is trying to give to their customers
is better performance on top of their CPU.
And the hardware is already fixed. So what Intel can do in software is build their own software stack solutions like OpenVINO, but they can also partner with companies that enable additional improvement at the algorithmic layer, which is currently, in some respects, beyond the scope of what Intel's products are looking into.
So this is a partnership that is a win-win for both sides,
Intel getting a solution for their customers
that is at the algorithmic level, and Deci getting some go-to-market assistance from Intel to reach their customers.
So this is the nature of that partnership.
I must say that Deci is not collaborating only with Intel, but also with companies like HPE and AWS. So we have a wide range of existing partnerships, and also some partnerships that are in the process of being established with various types of hardware
manufacturers, cloud providers, and OEMs that sell data centers and servers
for machine learning.
Okay.
Another question I had regarding actually the results
that you're about to announce has to do, I guess,
with that trade-off that you also mentioned yourself.
So performance versus size, and I would actually include another element in this equation, let's say: total cost of ownership and total cost of operation, from end to end. It seems like the process of getting someone to use DeciNets would be, well, first they would have to custom train the model that fits their needs, and then they would deploy that and, well, run inference for as long as they need.
Do you have any indication, any feeling, let's say, of what
the performance, the trade-offs involved there would be and what would the end result be in
terms of what's the most economical, let's say, solution depending on different parameters?
And also taking into account the different deployment options in terms of hardware.
That's a very good question.
I think that first we should understand the difference
between training and inference in the amount of workload.
So while training is a more expensive task to run, it is done once in a while, but inference happens all the time, and it scales linearly with the production workload, which is the amount of data, the number of customers, or you name it, while training scales linearly with the number of models or the number of data scientists. So if we need to think about what we would like to optimize, the training or the inference, the answer in 99% of the cases is definitely the inference, as it involves a much more significant, heavy workload.
In terms of the total cost of ownership, you benefit a lot from reducing the amount of cloud or data center usage by using more efficient networks. And usually there are two ways to do it. One option is to build from scratch on DeciNets. You do your development cycle on DeciNets and you get the results faster; it shortens all the trial-and-error iterations the data scientists need to do in order to find the right architecture and the right hyperparameters.
So this is one option. The second option is to switch to DeciNets after already running a model in production. At the retraining point, when you want to retrain the model, you simply switch to DeciNets, and from then on you run with a DeciNet replacing the model in production. So you benefit a lot from running a model that is significantly more efficient in production, and your cost of switching is only the one-time training of building that DeciNet instead of your existing model, which is usually one to two training cycles until you fully customize the DeciNet to your needs, reach your accuracy level, et cetera. So I think that this cost diminishes compared to the huge amount of savings that you can have in production.
And by the way, when considering edge deployment, running on the edge you can usually reduce the amount of hardware you need, or you can work with lower-end hardware and eliminate the need to upgrade the hardware for future releases of the product, and stuff like this.
Yeah, I think that's probably kind of common knowledge, let's say, for practitioners that indeed, as you said, in most of the cases,
the total cost of ownership and operation is much more influenced by inference than it is by
training. Well, unless you're training something like GPT-3, for example, but that's probably
a very special case. But also, if we analyze GPT-3, estimates say that the cloud cost to train a GPT-3 model is $14 million. We assume that if you have done so, you have such a huge workload that it can surpass that $14 million training cost at some point at inference. So optimizing GPT-3 is also something that makes sense if your inference workload is large enough, because GPT-3 is a model that is very expensive for inference as well, let's remember. So as the models get larger and the training gets longer, the inference is also affected by that.
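As a back-of-the-envelope illustration of that break-even argument, using the $14 million training figure mentioned above as the one-time cost, and purely hypothetical serving costs, you can estimate how much inference traffic it takes before savings on inference outweigh a training-sized expense:

```python
# Illustrative break-even calculation; the per-query serving costs below are invented, not measured.
training_cost = 14_000_000            # estimated cloud cost to train a GPT-3-class model ($)
cost_per_1k_queries_baseline = 0.40   # hypothetical serving cost before optimization ($)
cost_per_1k_queries_optimized = 0.20  # hypothetical serving cost after a 2x inference speedup ($)

saving_per_query = (cost_per_1k_queries_baseline - cost_per_1k_queries_optimized) / 1000
break_even_queries = training_cost / saving_per_query

print(f"Savings per query: ${saving_per_query:.6f}")
print(f"Queries needed before inference savings exceed the training-sized cost: {break_even_queries:,.0f}")
```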
Yeah, yeah, true.
Another thing that, through the conversation, is dawning on me about the approach you're taking: well, first, obviously, the value proposition is starting to become clearer, let's say, especially if you take into account the fact that you mentioned earlier that you can switch to a DeciNet when retraining an existing model. That's something that was not clear to me, I have to say, from the beginning. And I think it's a very crucial parameter, because it means that you don't have to train from scratch and reinvent everything from scratch based on DeciNets.
The other thing that occurred to me is that the approach you're taking seems to me a lot like TinyML, basically, like you're trying to cut down the size of the models you're deploying to make them more efficient. And obviously you're aware of TinyML, and I wonder if you have any ties with the organization and the initiative.
Yes, so basically TinyML is mostly about running on microcontrollers, and we are looking at that problem in a wider scope, including how it can impact cloud cost. We just talked about GPT-3, which does not have any relation to TinyML. So we look at a wider scope than TinyML, which only looks at how to enable machine learning to run on microcontrollers.
So we also look at running on modern CPUs and modern GPUs, and we enlarge the problem to the more general question of how AI can be more efficient. And I think that when you go to the broader scope, we have to consider preserving the accuracy, or getting the accuracy level that is sufficient for the application. Because in TinyML, people understand that they have to compromise on accuracy and get something that can run on the device, and the major challenge is being able to run on that microcontroller. But when thinking about running in the cloud, people don't want to compromise on accuracy in order to get better cost. So one of Deci's challenges, and one of the key characteristics of the technology, is being able not to compromise on accuracy in order to get that performance boost in production.
Okay, I see. The other thing I felt like what you do may fall under is the whole AutoML landscape.
Again, I wonder if that's accurate or not, but actually, seeing your offering, the closest thing that came to my mind was what another company called Noton does as well.
And I know them from TinyML and this is why I asked you about that as well.
They seem to be applying a similar kind of logic.
So optimizing the architecture of a neural network, achieving a lower footprint and all of that. And there's other companies in that space that people have tried to put all under the same landscape, let's say.
How do you see, where do you see yourselves, let's say, fitting? And what do you think of this space in general?
So, generally speaking, I see getting better performance as a multi-layer problem. The bottom layer is choosing the right hardware, or getting a performance boost at the hardware level. The next level is the graph compiler level, where you see solutions that are provided by the hardware manufacturers, like TensorRT by NVIDIA and OpenVINO by Intel. You see the ONNX open source supported by Microsoft, and we can also see commercial solutions like OctoML, which is commercializing TVM. On top of that, we have model compression techniques like pruning and quantization; those are widely covered in open source repositories, and some companies are also trying to commercialize those solutions. But we are working at a different level, the level of neural architecture search, which is redesigning, or helping data scientists to design, the models to get better latency at the same accuracy.
And that's kind of the differentiation from other companies and solutions that are out
there in the model optimization space.
And because we are doing that at the model level, we are actually providing an end-to-end platform for building, optimizing, and deploying deep learning models as our offering. Because it's not enough to build a model whose architecture is efficient; you also need to know how to take it into production effectively in order to benefit from the performance gains that you saw in the lab.
And that's not an easy part at all.
So we help companies in the entire process from building the algorithms
until taking them to production in an end-to-end platform.
And this is our offering and how we differentiate from other solutions
in the optimization stack.
Okay, by the way, speaking of platform, you mentioned earlier that you also have a community
tier, so I was wondering if you could say a few words about the business model basically and the
different tiers that are available.
Yes, absolutely. So the community tier is built on top of two components of our product. One of them is the SuperGradients open source, and the second one is the Deci Lab. The Deci Lab is a SaaS platform that enables you to benchmark and do runtime optimization of your models, based on various types of graph compilers, and also take them to production with our inference runtime engine SDK. So this is end-to-end: from building the model on SuperGradients, to optimizing it in the Lab, to taking it to production with the runtime engine, called Infery, that is provided in the Lab. And you can benefit from an end-to-end development lifecycle of deep learning models for free.
On top of that, we have the commercial offering, which is basically a paid tier of the platform that lets you use DeciNets, lets you use AutoNAC for custom optimization of specific models that are not supported by DeciNets, and gives you some benefits of the runtime engine, which has some sophisticated optimization techniques for production usage. So those are all in the commercial layer, and the business model for that is a subscription business model.
Okay, good. Thank you. Yeah, I think we covered quite a few topics actually from the deeply
technical to the quite abstract and the business around that. So I think we're probably good. The one
thing I would like to ask you before we wrap up is, you kind of hinted at
something already, but any ideas about future plans and roadmap ahead? So I can
say that at the beginning I talked about Deci supporting computer vision and NLP, but for most of the talk I've given benchmarks for computer vision only. So in the oven, we have our NLP offering baking, and we'll be able to announce it very soon.
And for early access,
you can reach out and hear more.
I hope you enjoyed the podcast.
If you like my work,
you can follow Linked Data Orchestration on
Twitter, LinkedIn, and Facebook.