Orchestrate all the Things - AI chip startup NeuReality introduces its NR1-P object-oriented hardware architecture. Featuring CEO and co-founder Moshe Tanach
Episode Date: May 5, 2021. NeuReality targets deep learning inference workloads on the edge, aiming to reduce CAPEX and OPEX for infrastructure owners. The AI chip space is booming, with innovation coming from a slew of startups in addition to the usual suspects. You may never have heard of NeuReality before, but it seems likely you'll be hearing more about it after today. NeuReality is a startup founded in Israel in 2019. Today it has announced NR1-P, which it dubs a novel AI-centric inference platform. That's a bold claim for a previously unknown company, and a very short time to arrive there -- even if it is the first of more implementations to follow. Article published on ZDNet
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
NeuReality targets deep learning inference workloads on the edge,
aiming to reduce capital and operational expenses for infrastructure owners.
The AI chip space is booming, with innovation coming from a slew of startups
in addition to the usual suspects.
You may never have heard of NeuReality before, but it seems likely you'll be hearing more
about it after today.
NeuReality is a startup founded in Israel in 2019.
Today it has announced NR1-P, which it dubs a novel AI-centric inference platform.
That's a bold claim for a previously unknown company and a very short time to arrive there,
even if it's just the first
of more implementations to follow.
We connected with
NeuReality CEO and co-founder
Moshe Tanach to find out more.
I hope you enjoyed the podcast.
If you like my work, you can follow
Linked Data Orchestration on Twitter,
LinkedIn, and Facebook.
So thanks for having me, George. My name is Moshe Tanach. I'm the co-founder and CEO of
NeuReality. I have more than 20 years of experience in semiconductor and system solutions,
from compute and wireless all the way to data center networking and storage. I've been in smaller companies, startups like
DesignArt Networks, which was acquired by Qualcomm.
Actually, our architecture
made it to Qualcomm's first
5G solutions for phones.
I was also at larger-scale semiconductor companies
like Intel and Marvell.
I led the Wi-Fi product line at Intel,
and in my last role at Marvell, until the end of 2018,
I was head of product definition and architecture
for the networking systems on chips.
I always believe that systems and semiconductors should be designed from the outside to the
inside.
You need to understand the system.
If you can build a system, as Qualcomm is doing: they're building a phone and a base station
in order to make the best chips for phones.
And this is actually what we're doing at NeuReality.
This is how we came to a fully functional system in only nine months since we started.
We started back in 2019. We raised the money in the middle of 2020 and we already have the working system.
We are a team of three co-founders, Yossi, Tzvika and myself.
Yossi and I, Yossi Kasus, have known each other for many years. We worked on a complex
high-performance network processor for Cisco.
He was heading the VLSI efforts at EZchip, an Israeli company that was later acquired by Mellanox.
And I was leading the VLSI on the Marvell side, and Marvell had manufactured the chip for Cisco.
And the IP of the network processor was delivered by EZchip. So we had long nights, a very tough journey.
And in tough journeys, great friendships are formed
and you learn to work together.
We know how to do that.
Yossi is the Vice President of VLSI in the company.
Tzvika was the vice president of back-end at Mellanox; this is where
we met him, after EZchip was acquired by Mellanox. Yossi led the BlueField SmartNIC product line development, and Tzvika was responsible for all the back-end methodologies and production.
But actually they knew each other from much before: 20 years ago, when they were at the Technion, our best engineering university in Israel, they played soccer together. So they have a long history.
A word about how we came to start NeuReality? Yes?
Yeah, that's interesting. And one of the things that you mentioned, which also caught my attention, was the fact that you seem to have accomplished a lot in a relatively short, well, not relatively, really short period, actually.
So I think you said it was nine months since the company was founded that you were able to actually come up with and announce the architecture.
And you also sort of connected that to your philosophy and your way of developing the
chips.
And I was also wondering whether you actually, how do you do that exactly? So whether you do the manufacturing in-house or whether you just design the chip and then
you outsource manufacturing as many other companies do?
Our first approach leaned heavily on FPGA solutions.
And this is where the partnership with Xilinx became very, very important.
Xilinx is not just programmable logic and FPGAs anymore. When you look at how their
advanced FPGAs are built today, they are a system on a chip. They have ARM processors inside. In their latest Versal ACAP technology,
they also integrated an array of VLIW engines that you can program. And together with them,
we could build a 16-card server, a 4U chassis that is very powerful. We implemented our
hardware inside their FPGA, so we didn't have to fabricate anything; we just built
the chassis. We purchased FPGA cards from Xilinx, and together with them we came up with an inference engine that
is autonomous and is implemented inside the FPGA. The path to our own server on the chip is still ahead of us.
We're focused on building the chip today and we're going to introduce it early next year.
But it allowed us to develop a vehicle that helps our customers to test it,
integrate it into their software. When you build a system differently,
suddenly a world of opportunity opens for you, and the software ecosystem in the data center needs
to test how to integrate it, and it allows us to do that before the chip is ready.
Okay, okay, I see that makes sense and I guess that also explains what, you know, from the
outside seemed like a quite close relationship with Xilinx. So up to the point at least where you're able to develop your own system on
a chip, as you mentioned, I guess you're kind of dependent on Xilinx and their FPGAs
to do precisely what you described, to sort of go early to market and enable
customers to test how it's working in practice.
Exactly, yes, yes.
And this is how Xilinx likes to work.
They provide the hardware infrastructure
that can be modified.
You can write your own hardware, burn it into the FPGA,
write your software, run it on their embedded ARMs.
It's a wonderful vehicle for development and for real products.
You know, base stations lean heavily on FPGAs,
so the flexibility is maintained and you can update.
And the same goes here.
Xilinx were also pioneers in investing in neural net processing on FPGAs.
They have nice achievements in the market. They were the natural partner for us and they really
supported everything we do. We believe in heterogeneous compute systems and chips, and it seemed obvious to partner together and serve the market that way.
Yeah, I see. Yeah, and it's also quite obvious from just reading the former affiliations of people
who are currently involved in NeuReality that,
yeah, as you mentioned in your introduction as well,
you have quite some experience gathered together over there.
And so I wanted to ask you about something
that caught my eye in your previous announcement.
It was February 2021, if my memory serves me right, when you announced your
previous round of funding. And with that, you also announced an executive move. So
Dr. Naveen Rao, who was formerly general manager of Intel's AI product group, joined your board.
And I was wondering, really, whether there is any connection with FPGAs there,
because FPGAs, at this point, seem to be quite central to what you do.
And we know that Intel was also quite invested in them at some point. So is there any connection there, and how
did that appointment come to be?
Naveen announced that he was leaving Intel.
And a year ago, the relationship started.
And I remember Naveen telling me, you know,
I saw so many deep learning activities around the world.
And he was intrigued by our fresh view.
This is how he called it. I like your fresh view on how inference
solutions should be developed. The world was focusing on training for a long time. The
problem of training models, of shortening the time to train, has pulled a lot of innovation, and we ended up with very expensive compute pods that have
excellent results in training models. But when you want to push AI to be used in real
life applications, you need to care about the usage of the model and not the training of it.
And when you try to leverage
and utilize an expensive pod,
the resulting cost of every AI operation
stays very high.
And it's hard to solve the two problems together.
So Naveen was very much intrigued
about how we suggest to
change this game, to change the system architecture that supports inference. He was a big fan
of two paths for two purposes. Even inside Intel, he had two product lines, one for training
and one for inference. So it was obvious to him that inference should be solved differently. And when he saw that, you
know, we danced together for a while, and when I offered him to join the board,
he happily accepted and even invested his own money. So for us, you know,
Naveen is a big asset, both as an AI luminary that the industry is looking at and someone who has great relationships in the industry.
And, you know, he was the founder of Nervana,
which was acquired by Intel in 2016.
So he has a lot of experience in startups.
And for us, he's just one of the team and he helps us in anything we need.
Okay, thank you.
And actually, your answer served quite well
because it also answered a question I had as well,
which was about why inference, basically.
So why did you choose to only target inference?
And I guess you kind of covered that.
There's also the fact that you consider this aspect, I wouldn't say more important,
but the more interesting one. Another question I had was that, since you're currently using FPGAs,
I guess that, de facto, you're targeting inference in the data center.
When, in the future, you move to your own system on a chip,
will you also target inference on the edge? I guess that should probably be within your goals, right?
Yes, it is. At the moment, we do not intend to integrate our technology into the device, the edge device. Edge devices need even more optimized solutions, especially designed for
the needs of the device. You need to do things in microwatts, milliwatts, or
less than 50 milliwatts. But there's the pendulum of compute: we've been in a trend of pushing more and
more compute to the cloud, but we're starting to see the pendulum coming back. And if you look at the Microsoft-AT&T deal to build mini data centers across the U.S. in AT&T facilities to bring more compute power closer to the edge,
many IoT devices will not be able to embed AI capabilities because of cost and power.
So they will need a closer compute server to serve them. Going all the way to the cloud and back
introduces high latency for some applications,
and very high bandwidth going to the cloud and back.
So we're going to see more and more edge nodes,
edge servers installed in your access point at home, residential gateways, 5G base stations.
This is where the cost and power pressure is even higher.
So we're building an autonomous device. The difference in our solution is that the other deep learning accelerators out there, which do a very good job in
offloading the neural net processing from the application, are PCI devices. They must
be installed in a whole server that costs a lot. It's a CPU-centric solution where the CPU is the
center of the system and it offloads things, it runs
the driver of the device. In our case, that's not the case. Our device is a network device.
It connects directly to the network. It's autonomous. We have moved all the data path
functions into it, so they don't need to run in software. We remove this bottleneck and we eliminate
the need for additional devices that connect us to the network, and all this translates to the
lowest AI inference operation cost, both in capital expense and operational expense. I guess building it in an FPGA is very well
suited for the data center. It can also be installed in places where power is less of an issue,
like 5G base stations and such. When we have the SoC, we'll have two flavors: one for the data
center, and another one with lower cost and power for edge nodes, closer to near-edge solutions.
Yeah, okay, yeah thanks for clarifying. So yeah, it makes sense so you're not going to be
targeting embedded, you're not going to be producing embedded chips or targeting devices on the edge per se, but rather, as you mentioned, smaller scale data centers closer to the edge.
Yeah, you know, maybe another word on that: the main object here is the AI compute engine.
We call it object-oriented hardware.
We've been using object-oriented software for a long time and it changed the way we
code things.
We wrap the main object with the functions that it needs.
Well, it's time to develop hardware that does the same.
If you want to invest in AI compute engines,
make it the main thing.
The system is not the main thing.
The main thing is the AI compute.
You want it to be a managed resource
that the data center orchestrator can manage.
You want it to be wrapped with the network functions
and data path functions
and management and control functions,
but they can't be more expensive than
the object itself. This is what we do at NeuReality. We wrap the AI compute engine,
which is the most important thing, with the right functions: AI hypervisor, network engine,
data path functions, and make it ideal in terms of efficiency, cost and power.
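To make the object-oriented analogy concrete, here is a minimal, purely illustrative Python sketch of the idea: the AI compute engine is the central object, and the network, scheduling and management functions wrap it on the device itself rather than living in host software. All class and method names are hypothetical, not NeuReality's API.

```python
# Illustrative sketch of "object-oriented hardware": the AI compute engine is
# the main object, wrapped by the functions it needs (network termination,
# queuing/scheduling) on the device itself, with no host CPU in the loop.

class AIComputeEngine:
    """The 'main object': the resource you actually want to invest in."""
    def run(self, job: bytes) -> bytes:
        return job  # stand-in for neural net execution


class NetworkEngine:
    """Terminates inference requests arriving directly from the network."""
    def receive(self) -> bytes:
        return b"request"


class AIHypervisor:
    """Queues, schedules and load-balances jobs across compute engines."""
    def schedule(self, job: bytes, engines: list) -> AIComputeEngine:
        return engines[hash(job) % len(engines)]


class InferenceServerOnChip:
    """Wraps the compute engines with the functions they need."""
    def __init__(self, engines: list):
        self.engines = engines
        self.network = NetworkEngine()
        self.hypervisor = AIHypervisor()

    def serve_one(self) -> bytes:
        job = self.network.receive()
        engine = self.hypervisor.schedule(job, self.engines)
        return engine.run(job)


server = InferenceServerOnChip([AIComputeEngine() for _ in range(4)])
print(server.serve_one())
```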
Actually, I was going to ask you precisely about that, just trying to figure out the
differentiation basically in your architecture from what I've been able to read and from
what you've mentioned so far.
So one point I picked up was how you said that part of what you do is basically eliminate
the host CPU in your design.
You said that most architectures are CPU-centric and your design
kind of bypasses that. So you get a speedup by eliminating unnecessary input/output, basically.
And I was wondering if that's the only point of differentiation,
but what you just mentioned kind of makes me guess that it's not.
So I was wondering if you could quickly refer to your differentiation factor.
And by the way, if you could also mention
if you have any patents on that pending or not.
Yeah, so if I need to recap
the main differentiation point
between a server on a chip and a PCI device, it is the fact that we have moved
functions like queue management. You have a server and you're serving multiple
client applications that are running in various virtual instances in the cloud.
So you need to queue all the requests that are coming
from these clients into your server.
You need to schedule the next job in line.
You need to load balance between your resources.
All this is done on the CPU today.
All this is running in expensive software, running on x86.
It's not even running on ARM yet, which would optimize the cost;
the instruction cost of software running on x86 is very expensive, and you want to keep up with it.
You know, we're seeing servers with eight or sixteen deep learning accelerators in one box. You must use the most expensive AMD or Intel solution out there
to keep up with the data throughput and run all these functions.
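As a rough illustration of the host-side work being described here (queuing requests from many clients, scheduling the next job, load balancing across accelerators), the sketch below shows the kind of software loop that runs on the x86 host in a conventional accelerator server. It is not NeuReality's code; the driver call is a placeholder.

```python
import queue
import threading

# Simplified sketch of the host software loop in a conventional CPU-centric
# accelerator server: the CPU queues client requests, schedules the next job,
# load-balances across PCIe accelerators, and calls the device driver for
# every request. This is the layer NeuReality argues should move into hardware.

NUM_ACCELERATORS = 8
request_q: queue.Queue = queue.Queue()

def run_on_accelerator(accel_id: int, payload: bytes) -> None:
    """Placeholder for the vendor driver / DMA submission to a PCIe device."""
    pass

def worker(accel_id: int) -> None:
    while True:
        payload = request_q.get()              # scheduling: next job in line
        run_on_accelerator(accel_id, payload)  # CPU drives the device per request
        request_q.task_done()

# One host thread per accelerator: the x86 CPU stays in the loop for everything.
for i in range(NUM_ACCELERATORS):
    threading.Thread(target=worker, args=(i,), daemon=True).start()

request_q.put(b"inference request")            # requests arriving from clients
request_q.join()
```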
If you need to do some media processing before you use the neural net processor,
and that processing runs on x86, many times it limits the utilization
and lowers it to less than 20 percent. You have deep
learning accelerators sitting in the server and doing nothing. So when you approach it in a much
more scalable way, as we did, you have an open, streamlined path to each one of the engines,
from the network directly to the engine, and you're running all this queue
management, scheduling, load balancing, and driving of the processing elements in
hardware, so you can do it in a parallel compute nature and not serially, like in a
single-threaded CPU. Another important thing is the communication protocol that you're using.
You see a lot of inference solutions, like NVIDIA's, using REST APIs: very expensive
networking, not only on the server side, but also on the client side.
You're paying a lot for getting into the Linux stack to send and receive the request and the response.
We have other schemes of doing it.
We haven't published it publicly yet,
but you will hear about it. Let's talk in a few months
and we'll be able to share more.
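For context on the cost being described, here is what a REST-style inference call typically looks like from the client side: every request pays for JSON encoding plus a full trip through the HTTP and Linux networking stacks on both ends. The endpoint and model name below are generic placeholders, not any specific vendor's API.

```python
import json
import urllib.request

# A generic REST-style inference request: JSON body, HTTP POST, full TCP/HTTP
# stack traversal on client and server for every single inference call.
# Endpoint and model name are hypothetical placeholders.

payload = json.dumps({"inputs": [[0.1, 0.2, 0.3, 0.4]]}).encode("utf-8")
req = urllib.request.Request(
    "http://inference-server.example:8000/v1/models/resnet50:predict",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:  # pays the Linux stack cost per request
    result = json.loads(resp.read())
print(result)
```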
So again, once you build the system correctly,
once you map the system flows end to end,
you can innovate on each one of them.
And it's not always very expensive to do it.
You just need to build the device correctly, design your software stack correctly, and
you get the gain.
Lastly, another thing: I mentioned the cloud ecosystem.
You want this to be a managed resource.
Elasticity is something very important in the data center.
In one hour of the day, you need more compute.
In a different hour, you need less.
You need to move functions from one instance to the other.
Today, existing deep learning accelerators are out of the equation. They don't help in solving that. All the Kubernetes connection, the communication
with the orchestrator, all this is done on the CPU that is hosting these deep learning accelerators.
We had to integrate these functions into the device,
and we're not just running them on software.
The data path-oriented functions are offloaded to our hardware engine,
which is also programmable, but it solves things at much lower cost,
with much lower latency.
Another issue, once you have an autonomous device
with a single DRAM and an embedded memory,
you don't need to copy things between the host
and the deep learning accelerator.
You have a single memory space, and you're
saving a lot on elastic buffers, on the latency of copying data from one memory to the
other, and on the power consumption of the
expensive DRAM that we use today. So you're just solving the problem
right, opening up a world of opportunities, and you just
need to collect them and take advantage of them.
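To illustrate the copies being referred to, here is a small PyTorch sketch of the conventional host-plus-PCIe-accelerator flow, using a CUDA GPU as a stand-in accelerator: the input is copied from host DRAM to device memory and the result copied back for every request, which is exactly the traffic a single shared memory space removes. This is an illustration of the conventional pattern, not NeuReality's stack.

```python
import torch

# Conventional host + PCIe accelerator flow (CUDA GPU as a stand-in): every
# request is copied host DRAM -> device DRAM, computed, then copied back.
# A single shared memory space on an autonomous device avoids these copies.

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 10).eval().to(device)

def infer(request_batch: torch.Tensor) -> torch.Tensor:
    x = request_batch.to(device, non_blocking=True)  # host -> device copy
    with torch.no_grad():
        y = model(x)                                 # compute on the accelerator
    return y.to("cpu")                               # device -> host copy

out = infer(torch.randn(32, 1024))
print(out.shape)
```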
Okay, so it sounds like basically what you
are trying to do is sort of optimize
the design on the hardware level from
the ground up basically and do all these micro or not so micro
optimizations around all the components, adding every optimization you can find.
But you also mentioned software at some point,
so I guess there are some optimizations in the software part of the system as well,
and that is something that interests me; more specifically, how you interface with the
machine learning framework, basically. So how does that work? How does deploying models on your, well, FPGAs for the time being,
but SoC in the future, how will that work?
Well, we're trying to make the user experience very easy.
Our first, or our widest, customer base today: if you look at where inference is done today,
and it's going to change in the next two, three years, but today it's in the cloud, and not only
in the cloud, it's in the public cloud for first-party use. There's first-party use and third-party use. The first-party use is what Amazon and Google and Alibaba
and Microsoft are doing for their own use,
for search acceleration, for Alexa acceleration, et cetera.
This is where most of the inference run in data center.
We're going to see more and more of these inference services,
both in platform as a service or software as a service,
be sold
to customers, software companies,
and other data centers around the world that are privately owned. I can't come to
a developer and give him a golden egg but ask him to change the way he
codes his software. So we must be compliant with the frameworks that he is
using. Our toolchain will be able to accept models that were
trained in PyTorch or TensorFlow or any other framework.
And you know, with the penetration of ONNX in the world,
it's also become easier because you can always go through the intermediate representation of ONNX
before you map it into your device. So this is something we have to solve.
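As a minimal sketch of the first half of that flow, here is a standard PyTorch-to-ONNX export; the subsequent step of mapping the ONNX graph onto the NR1-P is NeuReality's own toolchain and is not public, so it only appears here as a comment.

```python
import torch

# Standard export of a trained PyTorch model to ONNX, the intermediate
# representation mentioned above. The mapping of the ONNX graph onto the
# inference device (FPGA prototype today, SoC later) is vendor tooling and
# is only indicated by the final comment.

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(16, 10),
).eval()

dummy_input = torch.randn(1, 3, 224, 224)  # example input shape
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=13)

# A vendor toolchain would consume model.onnx from here and compile it for the
# target device.
```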
And we're going to deliver that even today when you can take a model and map
it to the FPGA implementation. The next evolution of compute offload is a pipeline offload.
So today, if you're doing face recognition, for instance, and your images are compressed,
you need to do JPEG decoding, resizing of the image, quantization
in the software above it that's running on the host, and then offload only the neural net
processing to the device. In the GPU case, you can try and offload some of these functions to the GPU,
but everything is managed and the pipeline is run from the software above it.
The future of AI compute offload
is to completely offload the pipeline.
For that, our toolchain also supports
a compute graph representation
that represents all the pipelines that the developer had developed.
And he can develop it in Python or C++, you just need to add the directives to our toolchain
and we'll be able to offload the complete pipeline.
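Below is a hedged sketch of the kind of end-to-end pipeline being described, written as ordinary Python with off-the-shelf libraries (Pillow, NumPy, ONNX Runtime). Today each stage runs in host software and only the final inference call hits an accelerator; the pipeline-offload idea is to hand this whole compute graph to the device. The directive or annotation syntax NeuReality's toolchain would use has not been published, so none is shown, and the model and image file names are placeholders.

```python
import numpy as np
from PIL import Image
import onnxruntime as ort

# End-to-end pre-processing + inference pipeline, all in host software today:
# JPEG decode -> resize -> normalize (quantization would slot in here) -> run
# the neural net. Pipeline offload means handing this whole graph to the
# device instead of only the final session.run call.

session = ort.InferenceSession("model.onnx")  # e.g. the model exported earlier
input_name = session.get_inputs()[0].name

def face_recognition_pipeline(jpeg_path: str) -> np.ndarray:
    img = Image.open(jpeg_path).convert("RGB")     # JPEG decode
    img = img.resize((224, 224))                   # resize
    x = np.asarray(img, dtype=np.float32) / 255.0  # normalize
    x = x.transpose(2, 0, 1)[None, ...]            # HWC -> NCHW, add batch dim
    return session.run(None, {input_name: x})[0]   # neural net inference

print(face_recognition_pipeline("face.jpg").shape)
```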
So software is the key. You're touching a very sensitive thing.
You know, today we're 20 engineers and half of them are software.
I don't know many semiconductor companies that started so evenly split between software and hardware, and I'm seeing
us growing much more on the software side than on the hardware side. It's definitely something
we're going to invest in more and more.
User experience is everything.
And for us, the users are the developers that want to map their trained model to our target.
Yeah, that makes sense, actually.
And I was kind of assuming that you'd probably rely
either on ONNX or perhaps TVM
as a kind of intermediate layer.
So as we're coming closer to wrapping up, let's do that by mentioning,
well, basically what the roadmap is.
So what is the current status?
So I take it it's not exactly proof of concept,
but potentially deployed at a select number of
customers at the moment. And I wonder if you can mention any names, or if not names, any industries
that your solution is currently being used in, and what's the roadmap going forward?
The road going forward is, just as we've done in the last year, getting closer
to our best engagements on the customer side and on the partner side. You know,
it's public: Xilinx and us made it public that we are partnering
together.
But we have two more partnerships that unfortunately I cannot publicly talk about yet.
In terms of customers, there are three swim lanes we invested in: the hyperscalers and next-wave cloud service providers; the
solution providers that build data centers for the military, for governments, for the financial
industry;
and last but not least, the OEMs and ODMs that are very knowledgeable about how to build
the hardware, what's needed.
Different data centers have different compliance levels and power envelopes, etc.
And we're very close to them.
Almost 70% of our engagements have already seen the prototype.
We'll soon have it available for remote access
in one of the companies in the US
that are going to integrate our servers
into their data center.
And a lot of the focus has just moved to developing the system on a chip that we're going to
introduce next year. This is where we're going. We're already looking at a round A to fund the tape-out and production.
And we're going to engage closer with customers,
not only to enable design wins,
but also to get the most feedback before we tape out.
I hope you enjoyed the podcast.
If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.