Orchestrate all the Things - AI chip startup NeuReality introduces its NR1-P object-oriented hardware architecture. Featuring CEO and co-founder Moshe Tanach
Episode Date: May 5, 2021. NeuReality targets deep learning inference workloads on the edge, aiming to reduce CAPEX and OPEX for infrastructure owners. The AI chip space is booming, with innovation coming from a slew of startups in addition to the usual suspects. You may never have heard of NeuReality before, but it seems likely you'll be hearing more about it after today. NeuReality is a startup founded in Israel in 2019. Today it has announced NR1-P, which it dubs a novel AI-centric inference platform. That's a bold claim for a previously unknown company, and a very short time to arrive there -- even if it is the first of more implementations to follow. Article published on ZDNet
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
NeuReality targets deep learning inference workloads on the edge,
aiming to reduce capital and operational expenses for infrastructure owners.
The AI chip space is booming, with innovation coming from a slew of startups
in addition to the usual suspects.
You may never have heard of NeuReality before, but it seems likely you'll be hearing more
about it after today.
NeuReality is a startup founded in Israel in 2019.
Today it has announced NR1-P, which it dubs a novel AI-centric inference platform.
That's a bold claim for a previously unknown company and a very short time to arrive there,
even if it's just the first
of more implementations to follow.
We connected with
NeuReality CEO and co-founder
Moshe Tanach to find out more.
I hope you enjoyed the podcast.
If you like my work, you can follow
Linked Data Orchestration on Twitter,
LinkedIn, and Facebook.
So thanks for having me, George. My name is Moshe Tanach. I'm the co-founder and CEO of
NeuReality. I have more than 20 years of experience in semiconductor and system solutions,
from compute and wireless all the way to data center networking and storage. I've been in smaller companies, startups like
DesignArt Networks, which was acquired by Qualcomm.
Actually, our architecture
made it to Qualcomm's first
5G solutions for phones.
I was also at larger-scale semiconductor companies
like Intel and Marvell.
I led the Wi-Fi product line at Intel,
and in my last role at Marvell, until the end of 2018,
I was head of product definition and architecture
for the networking systems on chips.
I always believe that systems and semiconductors should be designed from the outside to the
inside.
You need to understand the system.
If you can build a system, as Qualcomm is doing: they're building a phone and a base station
in order to make the best chips for phones.
And this is actually what we're doing at NeuReality.
This is how we came to a fully functional system in only nine months since we started.
We started back in 2019. We raised the money in the middle of 2020 and we already have the working system.
We are a team of three co-founders, Yossi, Tzvika and myself.
Yossi and I, Yossi Kasus, have known each other for many years. We worked on a complex
high-performance network processor for Cisco.
He was heading the VLSI efforts at EZchip, an Israeli company that was later acquired by Mellanox.
And I was leading the VLSI on the Marvell side, and Marvell had manufactured the chip for Cisco.
And the IP of the network processor was delivered by EZchip. So we had long nights, a very tough journey.
And in tough journeys, great friendships are formed
and you learn to work together.
We know how to do that.
Yossi is the Vice President of VLSI in the company.
Tzvika was the vice president of back-end at Mellanox; this is where
we met him, after EZchip was acquired by Mellanox. Yossi led the BlueField SmartNIC product line development, and Tzvika was responsible for all the back-end methodologies and production.
But actually they knew each other from much before: 20 years ago, when they were at the Technion, our best engineering university in Israel, they played soccer together. So they have a long history.
A word about how we came to start NeuReality? Yes?
Yeah, that's interesting. And one of the things that you mentioned, which also caught my attention, was the fact that you seem to have accomplished a lot in a relatively short, well, not relatively, really short period, actually.
So I think you said it was nine months since the company was founded that you were able to actually come up with and announce the architecture.
And you also sort of connected that to your philosophy and your way of developing the
chips.
And I was also wondering whether you actually, how do you do that exactly? So whether you do the manufacturing in-house or whether you just design the chip and then
you outsource manufacturing as many other companies do?
Our first approach leaned heavily on FPGA solutions.
And this is where the partnership with Xilinx became very, very important.
Xilinx is not just programmable logic and FPGAs anymore. When you look at how their
advanced FPGAs are built today, they are a system on a chip. They have ARM processors inside. In their latest Versal ACAP technology,
they also integrated an array of VLIW engines that you can program. And together with them,
we could build a 16-card server, a 4U chassis that is very powerful. We implemented our
hardware inside their FPGA, so we didn't have to fabricate anything; we just built
the chassis. We purchased FPGA cards from Xilinx, and together with them we came up with an inference engine that
is autonomous and is implemented inside the FPGA. The path to our own server on the chip is still ahead of us.
We're focused on building the chip today and we're going to introduce it early next year.
But it allowed us to develop a vehicle that helps our customers to test it,
integrate it into their software. When you build a system differently,
suddenly a world of opportunity opens for you, and the software ecosystem in the data center needs
to test how to integrate it, and it allows us to do that before the chip is ready.
Okay, okay, I see that makes sense and I guess that also explains what, you know, from the
outside seemed like a quite close relationship with Xilinx. So up to the point at least where you're able to develop your own system on
a chip, as you mentioned, I guess you're kind of dependent on Xilinx and their FPGAs
to do precisely what you described, to sort of go early to market and enable
customers to test how it's working in practice.
Exactly, yes, yes.
And this is how Xilinx likes to work.
They provide the hardware infrastructure
that can be modified.
You can write your own hardware, burn it into the FPGA,
write your software, run it on their embedded ARMs.
It's a wonderful vehicle for development and for real products.
You know, base stations lean heavily on FPGAs,
so the flexibility is maintained and you can update.
And the same goes here.
Xilinx were also pioneers in investing in neural net processing on FPGAs.
They have nice achievements in the market. They were the natural partner for us and they really
supported everything we do. We believe in heterogeneous compute systems and chips, and it seemed obvious to partner together and serve the market that way.
Yeah, I see. Yeah, and it's also quite obvious from just reading the former affiliations of people
who are currently involved in NeuReality that,
yeah, as you mentioned in your introduction as well,
you have quite some experience gathered together over there.
And so I wanted to ask you about something
that caught my eye in your previous announcement.
It was February 2021, if my memory serves me right, when you announced your
previous round of funding. And with that, you also announced an executive move. So
Dr. Naveen Rao, who was formerly general manager of Intel's AI product group, joined your board.
And I was wondering, really, whether there is any connection with FPGAs there,
because FPGAs, at this point, seem to be quite central to what you do.
And we know that Intel was also quite invested in them at some point. So is there any connection there, and how
did that appointment come to be?
Naveen announced that he was leaving Intel.
And a year ago, the relationship started.
And I remember Naveen telling me, you know,
I saw so many deep learning activities around the world.
And he was intrigued by our fresh view.
This is how he called it. I like your fresh view on how inference
solutions should be developed. The world was focusing on training for a long time. The
problem of training models, of shortening the time to train, has pulled a lot of innovation, and we ended up with very expensive compute pods that have
excellent results in training models. But when you want to push AI to be used in real
life applications, you need to care about the usage of the model and not the training of it.
And when you try to leverage
and utilize an expensive pod,
the resulting cost of every AI operation
stays very high.
And it's hard to solve the two problems together.
So Naveen was very much intrigued
about how we suggest to
change this game, to change the system architecture that supports inference. He was a big fan
of two paths for two purposes. Even inside Intel, he had two product lines, one for training
and one for inference. So it was obvious to him that inference should be solved differently. And when he saw that, you
know, we danced together for a while, and when I offered him to join the board,
he happily accepted and even invested his own money. So for us, you know,
Naveen is a big asset, both as an AI luminary that the industry is looking at and someone who has great relationships in the industry.
And, you know, he was the founder of Nervana,
which was acquired by Intel in 2016.
So he has a lot of experience in startups.
And for us, he's just one of the team and he helps us in anything we need.
Okay, thank you.
And actually, your answer served quite well
because it also answered a question I had as well,
which was about why inference, basically.
So why did you choose to only target inference?
And I guess you kind of covered that.
There's also the fact that you consider this aspect, I wouldn't say more important,
but the more interesting one. Another question I had was that, since you're currently using FPGAs,
I guess that, de facto, you're targeting inference in the data center.
When, in the future, you move to your own system on a chip,
will you also target inference on the edge? I guess that should probably be within your goals, right?
Yes, it is. At the moment, we do not intend to integrate our technology into the device, the edge device. Edge devices need even more optimized solutions, especially designed for
the needs of the device. You need to do things in microwatts, milliwatts, or
less than 50 milliwatts. But there's the pendulum of compute: we've been in a trend of pushing more and
more compute to the cloud, but we're starting to see the pendulum coming back. And if you look at the Microsoft-AT&T deal to build mini data centers across the U.S. in AT&T facilities to bring more compute power closer to the edge,
many IoT devices will not be able to embed AI capabilities because of cost and power.
So they will need a closer compute server to serve them. Going all the way to the cloud and back
introduces high latency for some applications,
and very high bandwidth going to the cloud and back.
So we're going to see more and more edge nodes,
edge servers installed in your access point at home, residential gateways, 5G base stations.
This is where the cost and power pressure is even higher.
So we're building an autonomous device. The difference in our solution is that the other deep learning accelerators out there, which do a very good job in
offloading the neural net processing from the application, are PCI devices. They must
be installed in a whole server that costs a lot. It's a CPU-centric solution where the CPU is the
center of the system and it offloads things, it runs
the driver of the device. In our case, that's not the case. Our device is a network device.
It connects directly to the network. It's autonomous. We have moved all the data path
functions into it, so they don't need to run in software. We remove this bottleneck and we eliminate
the need for additional devices that connect us to the network, and all this translates to the
lowest AI inference operation cost, both in capital expense and operational expense. I guess building it in an FPGA is very well
suited for the data center. It can also be installed in places where power is less of an issue,
like 5G base stations and such. When we have the SoC, we'll have two flavors: one for the data
center, and another one with lower cost and power for edge nodes, closer to near-edge solutions.
Yeah, okay, yeah thanks for clarifying. So yeah, it makes sense so you're not going to be
targeting embedded, you're not going to be producing embedded chips or targeting devices on the edge per se, but rather, as you mentioned, smaller scale data centers closer to the edge.
Yeah, you know, maybe another word on that: the main object here is the AI compute engine.
We call it object-oriented hardware.
We've been using object-oriented software for a long time and it changed the way we
code things.
We wrap the main object with the functions that it needs.
Well, it's time to develop hardware that does the same.
If you want to invest in AI compute engines,
make it the main thing.
The system is not the main thing.
The main thing is the AI compute.
You want it to be a managed resource
that the data center orchestrator can manage.
You want it to be wrapped with the network functions
and data path functions
and management and control functions,
but they can't be more expensive than
the object itself. This is what we do at NeuReality. We wrap the AI compute engine,
which is the most important thing, with the right functions: AI hypervisor, network engine,
data path functions, and make it ideal in terms of efficiency, cost and power.
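To make the object-oriented analogy concrete, here is a minimal, purely illustrative Python sketch of the idea: the AI compute engine is the central object, and the network, scheduling and management functions wrap it on the device itself rather than living in host software. All class and method names are hypothetical, not NeuReality's API.

```python
# Illustrative sketch of "object-oriented hardware": the AI compute engine is
# the main object, wrapped by the functions it needs (network termination,
# queuing/scheduling) on the device itself, with no host CPU in the loop.

class AIComputeEngine:
    """The 'main object': the resource you actually want to invest in."""
    def run(self, job: bytes) -> bytes:
        return job  # stand-in for neural net execution


class NetworkEngine:
    """Terminates inference requests arriving directly from the network."""
    def receive(self) -> bytes:
        return b"request"


class AIHypervisor:
    """Queues, schedules and load-balances jobs across compute engines."""
    def schedule(self, job: bytes, engines: list) -> AIComputeEngine:
        return engines[hash(job) % len(engines)]


class InferenceServerOnChip:
    """Wraps the compute engines with the functions they need."""
    def __init__(self, engines: list):
        self.engines = engines
        self.network = NetworkEngine()
        self.hypervisor = AIHypervisor()

    def serve_one(self) -> bytes:
        job = self.network.receive()
        engine = self.hypervisor.schedule(job, self.engines)
        return engine.run(job)


server = InferenceServerOnChip([AIComputeEngine() for _ in range(4)])
print(server.serve_one())
```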
Actually, I was going to ask you precisely about that, just trying to figure out the
differentiation basically in your architecture from what I've been able to read and from
what you've mentioned so far.
So one point I picked up was how you said that part of what you do is basically eliminate
the host CPU in your design.
You said that most architectures are CPU-centric and your design
kind of bypasses that. So you get a speedup by eliminating unnecessary input/output, basically.
And I was wondering if that's the only point of differentiation,
but what you just mentioned kind of makes me guess that it's not.
So I was wondering if you could quickly refer to your differentiation factor.
And by the way, if you could also mention
if you have any patents on that pending or not.
Yeah, so if I need to recap
the main differentiation point
between a server on a chip and a PCI device, it is the fact that we have moved
functions like queue management. You have a server and you're serving multiple
client applications that are running in various virtual instances in the cloud.
So you need to queue all the requests that are coming
from these clients into your server.
You need to schedule the next job in line.
You need to load balance between your resources.
All this is done on the CPU today.
All this is running in expensive software, running on x86.
It's not even running on ARM yet, which would optimize the cost;
the instruction cost of software running on x86 is very expensive, and you want to keep up with it.
You know, we're seeing servers with eight or sixteen deep learning accelerators in one box. You must use the most expensive AMD or Intel solution out there
to keep up with the data throughput and run all these functions.
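As a rough illustration of the host-side work being described here (queuing requests from many clients, scheduling the next job, load balancing across accelerators), the sketch below shows the kind of software loop that runs on the x86 host in a conventional accelerator server. It is not NeuReality's code; the driver call is a placeholder.

```python
import queue
import threading

# Simplified sketch of the host software loop in a conventional CPU-centric
# accelerator server: the CPU queues client requests, schedules the next job,
# load-balances across PCIe accelerators, and calls the device driver for
# every request. This is the layer NeuReality argues should move into hardware.

NUM_ACCELERATORS = 8
request_q: queue.Queue = queue.Queue()

def run_on_accelerator(accel_id: int, payload: bytes) -> None:
    """Placeholder for the vendor driver / DMA submission to a PCIe device."""
    pass

def worker(accel_id: int) -> None:
    while True:
        payload = request_q.get()              # scheduling: next job in line
        run_on_accelerator(accel_id, payload)  # CPU drives the device per request
        request_q.task_done()

# One host thread per accelerator: the x86 CPU stays in the loop for everything.
for i in range(NUM_ACCELERATORS):
    threading.Thread(target=worker, args=(i,), daemon=True).start()

request_q.put(b"inference request")            # requests arriving from clients
request_q.join()
```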
If you need to do some media processing before you use the neural net processor,
and that processing runs on x86, many times it limits the utilization
and lowers it to less than 20 percent. You have deep
learning accelerators sitting in the server and doing nothing. So when you approach it in a much
more scalable way, as we did, you have an open, streamlined path to each one of the engines,
from the network directly to the engine, and you're running all this queue
management, scheduling, load balancing, and driving of the processing elements in
hardware, so you can do it in a parallel compute nature and not serially, like in a
single-threaded CPU. Another important thing is the communication protocol that you're using.
You see a lot of inference solutions, like NVIDIA's, using REST APIs: very expensive
networking, not only on the server side, but also on the client side.
You're paying a lot for getting into the Linux stack to send and receive the request and the response.
We have other schemes of doing it.
We haven't published it publicly yet,
but you will hear about it. Let's talk in a few months
and we'll be able to share more.
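For context on the cost being described, here is what a REST-style inference call typically looks like from the client side: every request pays for JSON encoding plus a full trip through the HTTP and Linux networking stacks on both ends. The endpoint and model name below are generic placeholders, not any specific vendor's API.

```python
import json
import urllib.request

# A generic REST-style inference request: JSON body, HTTP POST, full TCP/HTTP
# stack traversal on client and server for every single inference call.
# Endpoint and model name are hypothetical placeholders.

payload = json.dumps({"inputs": [[0.1, 0.2, 0.3, 0.4]]}).encode("utf-8")
req = urllib.request.Request(
    "http://inference-server.example:8000/v1/models/resnet50:predict",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:  # pays the Linux stack cost per request
    result = json.loads(resp.read())
print(result)
```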
So again, once you build the system correctly,
once you map the system flows end to end,
you can innovate on each one of them.
And it's not always very expensive to do it.
You just need to build the device correctly, design your software stack correctly, and
you get the gain.
Lastly, another thing: I mentioned the cloud ecosystem.
You want this to be a managed resource.
Elasticity is something very important in the data center.
In one hour of the day, you need more compute.
In a different hour, you need less.
You need to move functions from one instance to the other.
Today, existing deep learning accelerators are out of the equation. They don't help in solving that. All the Kubernetes connection, the communication
with the orchestrator, all this is done on the CPU that is hosting these deep learning accelerators.
We had to integrate these functions into the device,
and we're not just running them on software.
The data path-oriented functions are offloaded to our hardware engine,
which is also programmable, but it solves things at much lower cost,
with much lower latency.
Another issue, once you have an autonomous device
with a single DRAM and an embedded memory,
you don't need to copy things between the host
and the deep learning accelerator.
You have a single memory space, and you're
saving a lot on elastic buffers, on the latency of copying data from one memory to the
other, and on the power consumption of the
expensive DRAM that we use today. So you're just solving the problem
right, opening up a world of opportunities, and you just
need to collect them and take advantage of them.
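To illustrate the copies being referred to, here is a small PyTorch sketch of the conventional host-plus-PCIe-accelerator flow, using a CUDA GPU as a stand-in accelerator: the input is copied from host DRAM to device memory and the result copied back for every request, which is exactly the traffic a single shared memory space removes. This is an illustration of the conventional pattern, not NeuReality's stack.

```python
import torch

# Conventional host + PCIe accelerator flow (CUDA GPU as a stand-in): every
# request is copied host DRAM -> device DRAM, computed, then copied back.
# A single shared memory space on an autonomous device avoids these copies.

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 10).eval().to(device)

def infer(request_batch: torch.Tensor) -> torch.Tensor:
    x = request_batch.to(device, non_blocking=True)  # host -> device copy
    with torch.no_grad():
        y = model(x)                                 # compute on the accelerator
    return y.to("cpu")                               # device -> host copy

out = infer(torch.randn(32, 1024))
print(out.shape)
```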
Okay, so it sounds like basically what you
are trying to do is sort of optimize
the design on the hardware level from
the ground up basically and do all these micro or not so micro
optimizations around all the components, adding every optimization you can find.
But you also mentioned software at some point,
so I guess there are some optimizations in the software part of the system as well,
and that is something that interests me; more specifically, how you interface with the
machine learning framework, basically. So how does that work? How does deploying models on your, well, FPGAs for the time being,
but SoC in the future, how will that work?
Well, we're trying to make the user experience very easy.
Our first, or our widest, customer base today: if you look at where inference is done today,
and it's going to change in the next two, three years, but today it's in the cloud, and not only
in the cloud, it's in the public cloud for first-party use. There's first-party use and third-party use. The first-party use is what Amazon and Google and Alibaba
and Microsoft are doing for their own use,
for search acceleration, for Alexa acceleration, et cetera.
This is where most of the inference run in data center.
We're going to see more and more of these inference services,
both in platform as a service or software as a service,
be sold
to customers, software companies,
and other data centers around the world that are privately owned. I can't come to
a developer and give him a golden egg but ask him to change the way he
codes his software. So we must be compliant with the frameworks that he is
using. Our toolchain will be able to accept models that were
trained in PyTorch or TensorFlow or any other framework.
And you know, with the penetration of ONNX in the world,
it's also become easier because you can always go through the intermediate representation of ONNX
before you map it into your device. So this is something we have to solve.
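As a minimal sketch of the first half of that flow, here is a standard PyTorch-to-ONNX export; the subsequent step of mapping the ONNX graph onto the NR1-P is NeuReality's own toolchain and is not public, so it only appears here as a comment.

```python
import torch

# Standard export of a trained PyTorch model to ONNX, the intermediate
# representation mentioned above. The mapping of the ONNX graph onto the
# inference device (FPGA prototype today, SoC later) is vendor tooling and
# is only indicated by the final comment.

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(16, 10),
).eval()

dummy_input = torch.randn(1, 3, 224, 224)  # example input shape
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=13)

# A vendor toolchain would consume model.onnx from here and compile it for the
# target device.
```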
And we're going to deliver that even today when you can take a model and map
it to the FPGA implementation. The next evolution of compute offload is a pipeline offload.
So today, if you're doing face recognition, for instance, and your images are compressed,
you need to do JPEG decoding, resizing of the image, quantization
in the software above it that's running on the host, and then offload only the neural net
processing to the device. In the GPU case, you can try and offload some of these functions to the GPU,
but everything is managed and the pipeline is run from the software above it.
The future of AI compute offload
is to completely offload the pipeline.
For that, our toolchain also supports
a compute graph representation
that represents all the pipelines that the developer had developed.
And he can develop it in Python or C++, you just need to add the directives to our toolchain
and we'll be able to offload the complete pipeline.
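Below is a hedged sketch of the kind of end-to-end pipeline being described, written as ordinary Python with off-the-shelf libraries (Pillow, NumPy, ONNX Runtime). Today each stage runs in host software and only the final inference call hits an accelerator; the pipeline-offload idea is to hand this whole compute graph to the device. The directive or annotation syntax NeuReality's toolchain would use has not been published, so none is shown, and the model and image file names are placeholders.

```python
import numpy as np
from PIL import Image
import onnxruntime as ort

# End-to-end pre-processing + inference pipeline, all in host software today:
# JPEG decode -> resize -> normalize (quantization would slot in here) -> run
# the neural net. Pipeline offload means handing this whole graph to the
# device instead of only the final session.run call.

session = ort.InferenceSession("model.onnx")  # e.g. the model exported earlier
input_name = session.get_inputs()[0].name

def face_recognition_pipeline(jpeg_path: str) -> np.ndarray:
    img = Image.open(jpeg_path).convert("RGB")     # JPEG decode
    img = img.resize((224, 224))                   # resize
    x = np.asarray(img, dtype=np.float32) / 255.0  # normalize
    x = x.transpose(2, 0, 1)[None, ...]            # HWC -> NCHW, add batch dim
    return session.run(None, {input_name: x})[0]   # neural net inference

print(face_recognition_pipeline("face.jpg").shape)
```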
So software is the key. You're touching a very sensitive thing.
You know, today we're 20 engineers and half of them are software.
I don't know many semiconductor companies that started so evenly split between software and hardware, and I'm seeing
us growing much more on the software side than on the hardware side. It's definitely something
we're going to invest in more and more.
User experience is everything.
And for us, the users are the developers that want to map their trained model to our target.
Yeah, that makes sense, actually.
And I was kind of assuming that you'd probably rely
either on ONNX or perhaps TVM
as a kind of intermediate layer.
So as we're coming closer to wrapping up, let's do that by mentioning,
well, basically what the roadmap is.
So what is the current status?
So I take it it's not exactly proof of concept,
but potentially deployed at a select number of
customers at the moment. And I wonder if you can mention any names, or if not names, any industries
that your solution is currently being used in, and what's the roadmap going forward?
The road going forward is, just as we've done in the last year, getting closer
to our best engagements on the customer side and on the partner side. You know,
it's public: Xilinx and us made it public that we are partnering
together.
But we have two more partnerships that unfortunately I cannot publicly talk about yet.
In terms of customers, there are three swim lanes we invested in: the hyperscalers and next-wave cloud service providers; the
solution providers that build data centers for the military, for governments, for the financial
industry;
and last but not least, the OEMs and ODMs that are very knowledgeable about how to build
the hardware, what's needed.
Different data centers have different compliance levels and power envelopes, etc.
And we're very close to them.
Almost 70% of our engagements have already seen the prototype.
We'll soon have it available for remote access
in one of the companies in the US
that are going to integrate our servers
into their data center.
And a lot of the focus has just moved to developing the system on a chip that we're going to
introduce next year. This is where we're going. We're already looking at a round A to fund the tape-out and production.
And we're going to engage closer with customers,
not only to enable design wins,
but also to get the most feedback before we tape out.
I hope you enjoyed the podcast.
If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.