Orchestrate all the Things - Bringing Deep Learning to your hardware of choice, one DeciNet at a time. Featuring Deci CEO / Co-founder Yonatan Geifman

Episode Date: February 16, 2022

Training deep learning models is costly and hard, but not as much as deploying and running them in production. Deci wants to help address that. Article published on ZDNet.

Transcript
Welcome to the Orchestrate All the Things podcast. I'm George Anadiotis and we'll be connecting the dots together. Training deep learning models is costly and hard, but not as much as deploying and running them in production. Deci wants to help address that. I hope you will enjoy the podcast. If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn and Facebook. And yeah, the typical way to start is by me asking you to say a few words about yourself,
Yonatan, and your background and the founder story, let's say, for Deci. Actually, no, I'll take it back. My first question is going to be on pronunciation, actually. So how do you pronounce it, Deci or Desi? Deci. Okay, okay, thank you. Yeah, the founder story for Deci. Yeah, so after completing my PhD in computer science at the Technion, together with my PhD advisor, Professor Ran El-Yaniv, and another co-founder who was a longtime friend, Joe, we started Deci. While I was still doing my PhD studies,
and while we both worked, me and Ran, at Google, we saw how deep learning was hardly getting into production for various reasons. One of the things we saw was that many companies were focused on making those algorithms more scalable so that they could run better in production environments. We saw it at Google scale on one hand, and we also had some peers in industry who were struggling to take models into production. And also at that time, we saw a lot of hardware companies trying to build better chips to run inference at scale. So we thought maybe we could do something better on the model design side, in order to make those algorithms more scalable, more efficient, and able to run better in production environments.
And this is basically how we started. We looked for an automated approach to design models that are more efficient, more efficient in their structure and in how they interact with the underlying hardware at production time. And we got to a technology called neural architecture search, a technology that automatically designs the structure of a neural network with respect to several constraints or objectives: for example, how can we get a model that is both accurate and runs fast on the hardware. And that was the early days of AutoNAC, building an automated algorithm to design neural network structures that reach multiple objectives in the optimization.
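To make the idea concrete, here is a toy sketch of constraint-aware architecture search in Python. AutoNAC itself is proprietary and far more sophisticated, so the search space, the accuracy and latency formulas, and the latency budget below are purely illustrative stand-ins for real training runs and hardware measurements.

```python
import random

def sample_architecture():
    # Hypothetical search space: depth, width and kernel size are made up.
    return {
        "depth": random.choice([2, 3, 4]),
        "width": random.choice([32, 64, 128]),
        "kernel": random.choice([3, 5]),
    }

def evaluate(arch):
    # Stand-ins for real measurements: in practice accuracy comes from
    # training/evaluation and latency from benchmarking on the target hardware.
    accuracy = 0.70 + 0.02 * arch["depth"] + 0.0005 * arch["width"]
    latency_ms = 0.5 * arch["depth"] * (arch["width"] / 32) * (arch["kernel"] / 3)
    return accuracy, latency_ms

def search(latency_budget_ms=4.0, trials=200):
    # Keep the most accurate candidate that satisfies the latency constraint.
    best = None
    for _ in range(trials):
        arch = sample_architecture()
        acc, lat = evaluate(arch)
        if lat <= latency_budget_ms and (best is None or acc > best[1]):
            best = (arch, acc, lat)
    return best

print("best candidate:", search())
```

The essential shape is the same, though: candidates are scored on more than one objective, and the hardware constraint is part of the search rather than an afterthought.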
And from there, we ended up with a deep learning development platform, which helps in the model design phase and in training the models, in order to get better performance in the production phase. Okay. Can I ask you, would you be able to share a few key milestones, let's say, and facts about the company? So I think you started in 2019, and you have already gotten one funding round, if I'm correct. And can you also share, for example, how many people work for the company at this point? Yes, sure. So at the moment, we are 40 people in the company. Most of them are based in Israel, but some of them are in the US, mainly for go-to-market. We started in October 2019.
Very soon after that, a few months later, we closed our seed round of $9.1 million. Around February 2020, we launched our community tier SaaS platform and opened it to the public to use our technology on a community tier. In September, we closed the $21 million A round led by Insight Partners. And recently, about a month ago, we released an open source library called SuperGradients for people to train deep learning models more easily, a training library that provides all the well-known open source and academic papers
and models, which can be easily reproduced and trained on top of our open source. So those are kind of the main milestones that I can tell. Before that, about six months ago, we announced the first version of DeciNets. DeciNets are our models that are optimized for GPU and CPU, for various computer vision and NLP tasks. And we are now announcing a new version of DeciNets that is optimized for classification on CPUs. So that is roughly the story of the company from inception to where the company is now, including traction with Fortune 500 companies
among our customers, traction that obviously enabled us to raise our A round. And we have been expanding our go-to-market since the last round, which is what we are mostly focused on at the moment, as the product and technology are now mature enough to serve Fortune 500 companies, and also a self-serve offering for the long tail of the mid-market and smaller organizations that can take our software and use it for their needs. Okay, thank you, thanks for sharing. So, yeah, I have to admit that I wasn't familiar with the company previously.
So I did a little bit of basic looking around, let's say, to try and figure out the key premise behind the concept. And based on what you described so far, I think that I've got it right: the key premise seems to be, let's say, algorithmic optimization of machine learning models to boost their performance. By the way, you mentioned specifically deep learning. Are you focusing specifically on deep learning, or are you also serving other types of models as well? So we are focused at the moment on deep learning, mainly on computer vision and NLP. But maybe first let's understand the problem. Deep learning models are very computationally intensive in two ways: one of them is the training process of building those models, and the second aspect is how to take those models to production at scale, or to make them work on low-power edge devices.
And when you look at how to do those, you understand that you need a different approach to developing those models, because every data scientist knows that in order to get better accuracy in deep learning, you can take larger models and train them for a little bit more time with a little bit more data, and you will get better results. But this created a kind of divergence between the need for accuracy and the need for speed of deep learning models: when you go larger, it's easier to get better results, but then you struggle more and more to take those models into production.
And what Deci is promising is solving that dual optimization problem by providing you with the platform and tools to build models that are both accurate and fast, so efficient, in production. This is the problem, and that is the solution that Deci offers for it. Okay. All right. Another question I wanted to ask is about your technology.
Obviously, it sounds like it's proprietary. I was wondering if you have any patents or pending patents around that. Yeah, so we have several patents on different components of the core technology of the company. AutoNAC is the core algorithm that drives our technology, which is a neural architecture search algorithm, an algorithm that searches for structures of neural networks that satisfy several optimization objectives or constraints.
For example, we want to build a model that will reach some level of accuracy, but we also want it to stay within some level of latency. And solving that optimization problem requires more than manual tweaking of existing neural architectures; it requires an algorithm that can design a specialized model for specialized use cases. Solving this problem requires being aware of the data and the machine learning task that we want to solve, and also being aware of the production hardware that we want to deploy that model on, in order to optimize the specific performance,
latency and throughput, on that type of hardware. Yeah, initially I had the impression that this optimization process, let's say, was sort of custom, so something like clients, users providing, let's say, their existing pre-trained models and then you somehow optimizing those. Apparently, it seems like this is not the case. It seems like you have something like a library of pre-trained DeciNets that people can use. Would you be able to share what kind of areas those pre-trained DeciNets cover?
Yeah, so at the moment DeciNets cover computer vision and NLP applications. Let's start with computer vision. There are three main tasks in computer vision, namely classification, object detection, and semantic segmentation. But as I mentioned, the optimization problem is also hardware aware. So we have multiple types of DeciNets for each task, tackling different levels of performance, which is the trade-off between accuracy and latency, for example, on multiple types of hardware. So we have dozens of models pre-optimized for customers
to use in a completely self-serve offering, ranging from various computer vision tasks to NLP tasks, on any type of hardware to be deployed in production. Okay. Since the models are pre-trained, how do users get to customize them to work specifically for their use cases and for their datasets? So this relies on the library that we released as open source, called SuperGradients. They take DeciNets together with SuperGradients, which is a training library that enables them to fine-tune or customize the models, any DeciNet that they want, train it to their needs, adapt it to their datasets, and run a training cycle on their data, in order to get performance on the data or the tasks that they are trying to solve.
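In code, that customization step amounts to fine-tuning a pretrained network on your own data. The exact SuperGradients API is not shown in the conversation and may vary by version, so here is the same idea sketched with plain PyTorch and torchvision instead (torchvision 0.13+ weights API assumed; the backbone, class count and hyperparameters are placeholders):

```python
import torch
import torch.nn as nn
from torchvision import models

# A pretrained backbone standing in for a DeciNet checkpoint.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Swap the classification head to match your own classes.
num_classes = 10  # placeholder
model.fc = nn.Linear(model.fc.in_features, num_classes)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_one_epoch(loader):
    # loader: a torch.utils.data.DataLoader over your own dataset.
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```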
Okay. Okay. I see. So they have to rely on using this library as well. Yes, so the library connected with DeciNets is our offering for the training phase, for building those algorithms based on the repository of pre-trained models like DeciNets. Mm-hmm. All right, I see. Another thing that caught my eye, again trying to figure out, let's say, your value proposition, basically, was a statement I read on your website, which read something like delivering models that outperform the
advantages of any other hardware or software optimization technology. And to me, that sounds a little bit strange because, well, obviously, there's nothing wrong with it. On the contrary, it's a very good call to optimize models the way that you do. But that doesn't necessarily mean that, you know, this is the optimal optimization strategy. I know it sounds a bit strange using optimal twice, but I think you get the point.
And by that, I mean that, you know, there may well be a case where you may get good or even better results if you just switch your hardware to something that works more effectively, and, you know, as a practitioner, I guess you would probably also agree with that. At the same time, you said earlier yourself that you target all sorts of different hardware infrastructure, so I'm just wondering, what's the takeaway from all that? Yeah, so maybe I will explain a little bit how the development lifecycle looks in deep learning and how Deci's offering is connected to that development lifecycle. Usually, especially in edge applications,
the hardware is set or selected in advance. For example, if we develop a medical device or an autonomous vehicle, we know what type of hardware we have on that edge device. Or if we develop a mobile application, for example, we know that we need to support a wide range of iPhones and other mobile devices. So in edge applications the hardware is set, and given that hardware that you need to run your models on, you want to get the maximum performance with the level of accuracy that you need. So this is one aspect: the hardware is set, and for that hardware you want to optimize the models to run as well as possible
on that hardware. On the other hand, in the cloud, you have a wide variety of hardware types that you can use in order to run your models. For that aspect, what Deci is offering is a recommendation and benchmarking tool in our SaaS offering, where you can compare latency, throughput, and cloud cost across the various cloud instances and hardware types. And based on that comparison, you can say,
okay, I want to optimize the model to run on CPU, or, GPU looks better to me, let's optimize the model to run on GPU. So even if you switch to better hardware, you can get even better performance by using Deci for the target hardware that you are now using. So I think that the question of hardware selection and model selection is orthogonal, in the sense that when you choose your hardware, you can also optimize the model
to run better on that hardware with Deci. So this is how we see it at Deci. Okay, I see. So it makes more sense now, because apparently you're referring specifically to the inference part, not the training part, and this is why you refer to the hardware being set, which is true, obviously.
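As an aside, the cloud-side recommendation described above boils down to a comparison along these lines; the instance names, latencies and prices below are entirely hypothetical, where in practice the numbers would come from benchmarking the actual model on the actual instances.

```python
# Hypothetical single-stream benchmark results and on-demand prices.
candidates = {
    "cpu-instance": {"latency_ms": 6.0, "price_per_hour": 0.34},
    "gpu-instance": {"latency_ms": 1.8, "price_per_hour": 0.53},
}

latency_sla_ms = 10.0  # the application's latency budget

def cost_per_million(info):
    # Sequential, single-stream estimate: requests served per hour.
    requests_per_hour = 3_600_000 / info["latency_ms"]
    return info["price_per_hour"] / requests_per_hour * 1_000_000

for name, info in candidates.items():
    if info["latency_ms"] <= latency_sla_ms:
        print(f"{name}: ~${cost_per_million(info):.2f} per 1M requests")
```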
Normally you can't go around and change the type of processors that you have on the edge device and things like that. I was more thinking about training, but in your case... Actually, that's also kind of a follow-up question. So you referred previously to how people can customize your pre-trained models, your pre-trained DeciNets. Where do you see them doing that, typically? So if we think about a machine learning task, usually what we see is that each model has a family,
and a family is a set of models where each one could be larger or smaller, spanning a range of accuracy levels and latency levels. So if, for example, we take the well-known ResNet family, we have ResNet-18, ResNet-34, ResNet-50, ResNet-101, and each one of them has more layers. It will usually be more accurate than the previous one, but it will also run slower at inference time. So one of the first things to do is to select the right point on that accuracy-latency trade-off for the specific application that you're
working on, selecting out of the family of models that we provide. And by the way, we provide the same with DeciNets: we have DeciNets one to five, or one to twelve, and each one of them gradually improves in accuracy, gets larger, and also runs a little bit slower at inference. So based on that family, you can choose your sweet spot. So the first step is to choose which DeciNet to use, which one will be best for the application.
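That first step, picking the sweet spot on the accuracy-latency trade-off, is essentially a constrained choice over a family of models. A minimal sketch, with made-up variant names and numbers rather than actual DeciNet results:

```python
# (accuracy, latency_ms) per model variant; numbers are illustrative only.
family = {
    "variant-1": (0.71, 0.9),
    "variant-2": (0.75, 1.6),
    "variant-3": (0.79, 2.8),
    "variant-4": (0.82, 4.9),
}

def pick(latency_budget_ms):
    # Most accurate variant that still meets the latency budget.
    feasible = {k: v for k, v in family.items() if v[1] <= latency_budget_ms}
    return max(feasible, key=lambda k: feasible[k][0]) if feasible else None

print(pick(3.0))  # -> variant-3
```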
The next thing to do is to optimize the structure of the input and the output. For example, people use different image resolutions. Some of them use RGB images, some use grayscale images in computer vision, and some also use depth maps. So the next thing to do is to adapt the input and output of the model to suit the machine learning problem. And the last thing to do in the customization is to train it.
And if you have some clever loss functions or data augmentation techniques, you can inject those additions into the training script on top of SuperGradients and leverage them as well in order to get better accuracy for your model, because we are not replacing the data scientists. So any insight that the data scientist has on the given problem and on how to train the models should also be utilized when training the DeciNets model. OK, I see. So let's come then to what you are about to announce. You're releasing a new model, as far as I got it, and also, I think, some benchmarks to go with it,
which, from what I saw, are specifically targeted at CPUs, and the point there seems to be that, well, by applying this type of optimization, CPUs are now able to run models that they were previously unable to. At least that's my summary of it, and I'll let you explain it in your own words. So I do have some questions. Sure, so a few months ago we announced the first family of DeciNets, optimized for NVIDIA GPUs and NVIDIA Jetson, and now what we are announcing is a DeciNets
family for CPU, and specifically for the Cascade Lake CPU that is widely used in the cloud. It is a family of 12 models for image classification that span from around 70% accuracy on ImageNet up to models that reach almost 85% accuracy on ImageNet, models that start from sub-millisecond latency on CPU, which is very, very fast, up to models that take something like eight milliseconds to run on a CPU. And those models create a new efficient frontier of the accuracy-latency trade-off. You can imagine a graph where on the y-axis we have the accuracy, and on the x-axis we
have the latency. And we can position each and every model on that curve. And when we do it with all the open source models like ResNet, EfficientNet, and other models like RegNet, MobileNet, and those, we can see a kind of trade-off, an optimal trade-off between latency and accuracy: as you go for faster models with lower latency, we see that the models are less accurate. And when we put those DeciNets on that curve, what we see is a significant new efficient frontier that outperforms, in the combination of latency and accuracy,
each model that exists in open source. So to take an example, compared to EfficientNet-B1, we see a model like DeciNet 3 getting the same accuracy, but instead of running in something like 4.8 milliseconds per instance, it can run below two, something like one and a half milliseconds per prediction of one instance. So this is a very significant boost in performance, and all of this is happening while preserving the prediction accuracy, in this case on the ImageNet classification problem, but this could be adapted to any use case of a customer. Okay, I see. Another follow-up question I had on that was, obviously you ran some benchmarks, and whenever there are benchmarks involved, one of the first questions people ask is whether those benchmarks are available and reproducible by third parties and so on. Yes. So all these models will be out and publicly available on our SaaS platform,
which you can sign up for on our website, and they will be demonstrated over there in our model hub, which is a model repository provided on our platform. In terms of benchmarks, I must say that there are a lot of ways to benchmark machine learning models: what you include in the baseline, what you include in the performance measurement of the optimized model. And usually what we do is make it as apples-to-apples as we can. So in these benchmarks, for example, we used the graph compiler called OpenVINO, provided by Intel as open source, to compile and quantize all the models, both the baselines and the DeciNets, so that the comparison is as apples-to-apples as possible, in order not to inject any performance boost given by any open source or standard tools into the comparison.
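A sketch of what such an apples-to-apples measurement can look like, assuming both models have been exported to ONNX and that the OpenVINO 2022+ Python runtime is available (exact API details may differ by version); the key point is that the baseline and the optimized model go through the identical compile-and-time path:

```python
import time
import numpy as np
from openvino.runtime import Core  # assumes the OpenVINO 2022.x Python API

def benchmark(model_path, input_shape=(1, 3, 224, 224), runs=200):
    core = Core()
    compiled = core.compile_model(core.read_model(model_path), device_name="CPU")
    x = np.random.rand(*input_shape).astype(np.float32)

    for _ in range(20):       # warm-up, excluded from timing
        compiled([x])
    start = time.perf_counter()
    for _ in range(runs):     # timed runs
        compiled([x])
    return (time.perf_counter() - start) / runs * 1000  # ms per inference

# Hypothetical file names; both models are compiled and timed the same way.
for path in ("baseline.onnx", "optimized.onnx"):
    print(path, f"{benchmark(path):.2f} ms")
```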
So everything is compared with a very strict benchmarking technique, including OpenVINO and quantization. And those results can be reproduced and demonstrated on top of our SaaS platform, which you can sign up for from our website. Okay, thank you. You also mentioned Intel, and it's also part of the announcement, and I think you have a partnership going on with Intel. I wanted to ask you specifically about that. So how did this partnership come along, what's the motivation basically for both parties, and how does it weave into this announcement, because you're specifically targeting CPUs here,
and I guess, how does it fit into the strategy for both parties? Yes. So this announcement is not part of the collaboration with Intel as such, but the tools that we're using are by Intel, the hardware is by Intel, and the collaboration with Intel is tackling some other aspects of working together, which I will be happy to discuss. So we have a very long partnership with Intel, starting from collaborating on MLPerf and submitting our models together
with Intel to boost the state-of-the-art performance in MLPerf. MLPerf is a widely used benchmark for the performance of machine learning models that happens twice a year. So that was the first step in the collaboration with Intel. The second step was a partnership agreement with the sales organization of Intel, where Intel sells Deci's solution to their customers in order to optimize the performance of their machine learning models running in production. And we have another thing that is baking and will be announced soon, but I cannot refer to it yet; maybe we will talk about it in another chapter of your podcast in a few months,
but we are working on a new announcement with Intel that we will share very soon, I believe. So this is the partnership with Intel. How both companies benefit from that partnership is a good question. I think that what Intel is trying to give their customers is better performance on top of their CPUs. And the hardware is already fixed.
So what Intel can do in software is build their own software stack solutions like OpenVINO, but they can also partner with companies that enable additional improvements at the algorithmic layer, which is currently, in some aspects, beyond the scope that Intel's products are looking into.
So this is a partnership that is a win-win for both sides: Intel getting a solution for their customers at the algorithmic level, and Deci getting some go-to-market assistance from Intel to get to their customers. So this is the nature of that partnership. I must say that Deci is not collaborating only with Intel, but also with companies like HPE and AWS. So we have a wide range of existing partnerships, and also some partnerships that are in the process of being established, with various types of hardware
manufacturers, cloud providers, and OEMs that sell data centers and servers for machine learning. Okay. Another question I had, regarding actually the results that you're about to announce, has to do, I guess, with that trade-off that you also mentioned yourself, performance versus size. And I would actually include another element in this equation, let's say total cost of ownership and total cost of operation
from end to end. It seems like the process, let's say, of getting someone to use DeciNets would be, well, first they would have to custom train the model that fits their needs, and then they would deploy that and, well, run inference for as long as they need. Do you have any indication, any feeling, let's say, of what the trade-offs involved there would be, and what the end result would be in terms of the most economical, let's say, solution depending on different parameters, and also taking into account the different deployment options in terms of hardware? That's a very good question.
I think that first we should understand the difference between training and inference in the amount of workload. While training is the more expensive task, it is done once in a while. But inference is happening all the time, and it scales linearly with the production workload, which is the amount of data,
number of customers, or you name it, while training scales linearly with the number of models or the number of data scientists. So if we need to think about what we would like to optimize, the training or the inference, the answer in 99% of the cases is definitely the inference, as it has the much more significant, heavy workload.
In terms of the total cost of ownership, you benefit a lot by reducing the amount of cloud or data center usage through more efficient networks. And usually there are two ways to do it. One option is to build from scratch on DeciNets: you do your development cycle on DeciNets, you get the results faster, and it shortens all the trial-and-error iterations the data scientists need to do in order to find the right architecture and the right hyperparameters. So this is one option.
The second option is to switch to DeciNets after already running a model in production. At the retraining point, when you want to retrain the model, you simply switch to a DeciNet, and from then on you run with the DeciNet replacing the model in production. So you benefit a lot from running
a model that is significantly more efficient in production, and your switching cost is only the one-time training of building that DeciNet instead of your existing model, which is usually one to two training cycles until you fully customize the DeciNet to your needs, reach your accuracy level, et cetera. So I think that this cost diminishes compared to the huge amount of savings that you can have in production. And by the way, when considering edge deployment, when running on the edge you can usually reduce the amount of hardware you need, or you can work with low-end hardware and eliminate the need to upgrade the hardware for future releases of the product and things like that.
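To make that trade-off concrete, a back-of-the-envelope break-even calculation with entirely hypothetical numbers: the one-time cost of retraining onto a more efficient model is recouped once the accumulated per-inference savings exceed it.

```python
# All figures are hypothetical, for illustration only.
retraining_cost = 2_000.0        # one or two training cycles to switch models ($)
baseline_cost_per_1m = 60.0      # inference cost per million requests ($)
optimized_cost_per_1m = 20.0     # after switching to a more efficient model ($)

savings_per_1m = baseline_cost_per_1m - optimized_cost_per_1m
break_even_requests = retraining_cost / savings_per_1m * 1_000_000
print(f"break-even after ~{break_even_requests:,.0f} requests")  # ~50,000,000
```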
Yeah, I think that's probably kind of common knowledge, let's say, for practitioners, that indeed, as you said, in most cases the total cost of ownership and operation is much more influenced by inference than it is by training. Well, unless you're training something like GPT-3, for example, but that's probably a very special case. But also, if we analyze GPT-3, estimates say that the cloud cost to train a GPT-3 model is $14 million. We can assume that if you have done so, you have such a huge workload that it will pass that $14 million training cost at some point at inference. So optimizing GPT-3 is also something that makes sense if your inference workload is large enough,
because, let's remember, GPT-3 is a model that is very expensive for inference as well. So as models get larger and training gets longer, inference is also affected by that. Yeah, yeah, true. Another thing that, through the conversation, is dawning on me is that, well, first, obviously, the value proposition is starting to become clearer, let's say, especially if you take into account the fact that you mentioned earlier that, well, you can switch to a DeciNet when retraining an existing model.
That's something that was not clear to me, I have to say, from the beginning. And I think it's a very crucial parameter, because it means that you don't have to train from scratch and reinvent everything from scratch based on DeciNets. The other thing that dawned on me is that the approach you're taking seems to me a lot like TinyML, basically, like you're trying to cut down the size of the models you're deploying to make them more efficient. And obviously you're aware of TinyML, and I wonder if you have any ties with the organization and the initiative. Yes, so basically TinyML is mostly about running on microcontrollers,
and we are looking at that problem in a wider scope, including how it can also impact cloud cost. We just talked about GPT-3, which has no relation to TinyML. So we look at a wider scope than TinyML, which only looks at how to enable machine learning to run on microcontrollers. We also look at running on modern CPUs, modern GPUs, and
we enlarge that problem to the more general one of how AI can be more efficient. And I think that when you go to the broader scope, you have to consider preserving the accuracy, or getting the accuracy level that is sufficient for the application. Because in TinyML, people understand that they have to compromise on accuracy to get something that can run on the device, and the major challenge is being able to run on that microcontroller. But when thinking about running in the cloud, people don't want to compromise on accuracy in order to get better cost. So one of Deci's challenges, and one of the key characteristics of the technology, is being able to not compromise on accuracy in order to get that performance boost in production. Okay, I see. The other thing that I felt what you do may fall under is the whole AutoML landscape.
Again, I wonder if that's accurate or not, but actually, seeing your offering, the closest thing that came to my mind was what another company called Noton does as well. I know them from TinyML, and this is why I asked you about that as well. They seem to be applying a similar kind of logic, so optimizing the architecture of a neural network, achieving a lower footprint and all of that. And there are other companies in that space that people have tried to put all under the same landscape, let's say. Where do you see yourselves fitting, and what do you think of this space in general? So, generally speaking, I see getting better performance as a multi-layer problem. The bottom layer is choosing the right hardware, or getting a performance boost at the hardware level. The next level is the graph compiler level, where you see solutions provided by the hardware manufacturers, like TensorRT by NVIDIA, OpenVINO by
Intel. You see the ONNX open source supported by Microsoft. And we can also see commercial solutions like OctoML, which is commercializing TVM. On top of that, we have the model compression techniques like pruning and quantization; those are widely covered in open source repositories, and some companies are also trying to commercialize those solutions. But Deci is working at a different level, the level of neural architecture search, which is redesigning, or helping data
scientists design, the models to get better latency at the same accuracy. And that's kind of the differentiation from other companies and solutions that are out there in the model optimization space. And because we are doing that at the model level, we are actually providing an end-to-end platform for building, optimizing, and deploying deep learning models as our offering. Because it's not enough to build a model whose architecture is efficient; you also need to know how to take it into production effectively in order to benefit from the performance gains that you saw in the lab.
And that's not an easy part at all. So we help companies in the entire process, from building the algorithms to taking them to production, in an end-to-end platform. And this is our offering and how we differentiate from other solutions in the optimization stack. Okay, by the way, speaking of the platform, you mentioned earlier that you also have a community tier, so I was wondering if you could say a few words about the business model, basically, and the different tiers that are available. Yes, absolutely. So the community tier is built on
top of two components of our products. One of them is the SuperGradients open source, and the second one is the Deci Lab. The Deci Lab is a SaaS platform that enables you to benchmark and do runtime optimization for your models based on various types of graph compilers, and also take them to production with our inference runtime engine SDK. So this is an end-to-end flow, from building the model with SuperGradients, to optimizing it in the Lab, and taking it to production with the runtime engine, called Infery, that is provided in the Lab. And you can benefit from an end-to-end development lifecycle of deep learning models for free.
On top of that, we have the commercial offering, which is basically a paid tier of the platform that lets you use DeciNets, lets you use AutoNAC for custom optimization of specific models that are not supported by DeciNets, and gives you some benefits of runtime engines that have some sophisticated optimization techniques for production usage. Those are all in the commercial layer, and the business model for that is a subscription business model.
Okay, good. Thank you. Yeah, I think we covered quite a few topics, actually, from the deeply technical to the quite abstract, and the business around that. So I think we're probably good. The one thing I would like to ask you before we wrap up is, you kind of hinted at something already, but any ideas about future plans and the roadmap ahead? So I can say that at the beginning I talked about Deci supporting computer vision and NLP, but for most of the talk I gave benchmarks for computer vision only. So in the oven, we have our NLP offering baking, and we'll be able to announce it very soon.
And for early access, you can reach out and hear more. I hope you enjoyed the podcast. If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.
