In The Arena by TechArena - Unpacking AI Inference at the Edge with Untether AI
Episode Date: January 29, 2025
Untether AI's Bob Beachler explores the future of AI inference, from energy-efficient silicon to edge computing challenges, MLPerf benchmarks, and the evolving enterprise AI landscape.
Transcript
Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators
and our host, Alison Klein.
Now let's step into the arena.
Welcome to In the Arena.
My name is Alison Klein, and today I am really excited
because Bob Beachler from Untether AI is in the studio with me. Welcome to the program,
Bob. How are you doing? I'm doing great, Alison. Thanks for the opportunity. Looking forward to
catching up with you. Bob, I know you and I have talked a bunch before, but why don't we just start
with an introduction of Untether and what you're delivering in the AI arena?
Sure.
So at Untether, we were really founded to solve inference compute in AI.
So unlike training, which gets a lot of the press and heat and light right now, we
know that inference is going to be a much larger marketplace because it's going to run
24/7, 365.
Depending upon which market research you look at, it's going to be anywhere from three to four times larger in terms of AI acceleration silicon for inference than for training.
So the company focused on that first: how do you run AI inference as energy efficiently as possible? And it came up with a novel architecture, what we
call at-memory compute, so that we're minimizing data movement and maximizing
compute performance for AI workloads. That was really the basis of the company. We started
shipping our first generation of silicon in 2021. And this year we're moving to production with our second
generation of silicon. So we've got now almost seven years of experience in terms of what does
it take to deliver AI inference solutions in the marketplace. Now you started with a conversation
about focus on inference, which I think makes a lot of sense, but can you give us a sense of the
landscape you're seeing?
You know, there's been so much attention on data center training over the last year.
Give us a sense of what the infrastructure landscape looks like and where in the compute
landscape inference is happening. Sure. And you made a good point. You know,
training, and particularly when you start talking about very large language models,
training is a data center application, something you offload to the cloud.
It's non-deterministic, meaning you don't know how long it's going to take you to finish training your model.
And it requires a lot of compute. Inference, on the other hand, you want done quickly, and you want to run it a
multiplicity of times, depending upon how many users you have. And so with inference,
yes, it's used in the data center, but it's also on-prem. It's also at the edge. It's also at the
endpoint. These models are being deployed everywhere, from your cell phone to large, gigawatt-scale data centers.
So unlike training, which is really reserved as a data center type of application due to the characteristics of it,
once you move to inference, it's being deployed everywhere.
So you have different levels of compute, different power consumption requirements, different latency throughput requirements.
And so inference is much broader and larger than you would see with just a training approach.
You know, what's interesting about that is that you've got so many different environments where
inference is happening. And I know, Bob, that you guys really specialize in delivering energy-efficient
solutions for the edge. Can you tell us a little bit about the specific requirements
in edge computing that you're seeing?
Sure.
And the difference between training and inference is that
you want your inference to be deterministic.
And generally for the edge, by definition, it needs to be low latency.
If you can afford to send the data to the cloud,
have the cloud crunch on it,
and get the data back, even though you don't know how long it's going to take and sometimes your connectivity
is off, those types of things, then by definition that type of application is not sensitive to
latency, and therefore you don't need to put it on the edge. But the applications that are being
done either on-prem or on the edge are the ones that
care about latency and determinism. So I'm on a factory floor and I've got things whizzing by
on a conveyor belt that I need to be able to identify. And I can't wait; these things
are going to keep moving, so I can't afford to go to the cloud. Or I'm in an autonomous vehicle,
right? I can't wait
to, you know, send this wirelessly to the cloud and get an answer. I need to know, is that a
pedestrian? Is that a baby buggy? Or is that a bag blowing across, right? So by definition, you know,
these applications are where we focus: providing that low latency. And then you get into,
based upon the application, what's my budget? And by budget I mean:
what's my power budget? How much thermal heat can I dissipate? And so depending on who you talk to
and the application, it'll change. So, you know, let's take the autonomous vehicle as an example.
While they have more energy available, they do need to cool those compute resources, and every watt you use on
compute is less range for an electric vehicle. So they care about energy efficiency as well as
latency. On a factory floor, as an example, it may be a harsher environment and therefore you
want to keep things cool. So you can't afford to put a data-center-class chip on your factory floor, because it runs too
hot. Same thing with vision-guided robotics. Anything that moves, you have these power,
thermal, and latency constraints. And so 100% of the time when we talk to our customers, that's what they care about.
How quickly can you get me the result? How many results can you do in a minute? And how much energy are you
taking to get this done? And then that translates to the data center. Because if you're running,
for example, a 40-megawatt data center, and I can get three to six times the compute
in that same 40 megawatts compared with what I'm using today, that's tens of millions of dollars in energy costs alone that I'm saving.
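To put rough numbers on that claim, here is a minimal back-of-the-envelope sketch in Python. The electricity rate and the 3x efficiency factor are assumptions for illustration, not figures from the conversation.

    # Rough energy-cost sketch for the 40 MW example above.
    # The electricity rate and the 3x efficiency factor are illustrative assumptions.
    facility_power_mw = 40.0               # facility power draw from the example
    price_per_kwh = 0.08                   # assumed industrial electricity rate, USD/kWh
    hours_per_year = 24 * 365              # inference runs 24/7, 365

    annual_kwh = facility_power_mw * 1_000 * hours_per_year
    annual_cost = annual_kwh * price_per_kwh

    # If a more efficient accelerator delivers 3x the compute per watt,
    # today's workload needs roughly one third of the energy.
    efficiency_gain = 3.0
    savings = annual_cost * (1 - 1 / efficiency_gain)

    print(f"Annual energy cost at 40 MW: ${annual_cost:,.0f}")
    print(f"Approximate savings at {efficiency_gain:.0f}x efficiency: ${savings:,.0f}")

At these assumed rates the savings come out to roughly $19 million a year, consistent with the "tens of millions" figure Bob cites.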
That makes a lot of sense. So everybody's talking about the build-out and how much energy these data centers are consuming, and they're doing it predicated on technologies
like GPUs and CPUs that have evolved but haven't had a radical transformation in decades.
Now, you know, you've talked about the fact that you've got a couple of generations of silicon.
You know, you are in a unique position in the AI silicon startup space in that you
actually have silicon in the market. Tell us about your portfolio and how the market has responded to
it. Sure. So it's one thing to say I have a piece of silicon and it can do a network, but
it's your software stack and your SDK that says, okay, I can take a multiplicity of different
networks. You can change the network. You can make it faster. You can make it more accurate
and the software supports it, gets it onto the silicon, and has it running at full performance
in an optimized manner. Over six years of
engagements with customers, we've learned so much, and we continue to evolve our software in order to
adapt to that. Our silicon is baked. It's great. It's good. But I can get a 2x improvement in
throughput or a 50% reduction in latency just on software optimization. So we're constantly updating the software.
In fact, we release it every three months.
We put a new release out to our customers.
Now, how does software fit into this?
Obviously, software optimization has got to be a big part of dialing this in.
Yeah, it's a huge part.
So let's talk about that a little bit.
I'll say 100% of neural networks today are trained on NVIDIA GPUs.
That's not quite right.
You know, there's some TPUs at Google, there's Trainium at AWS, but in
general, everything runs on NVIDIA.
Our job is to take that network that was trained and optimized for NVIDIA and have
it run efficiently and optimally on our silicon.
So we start with that in the machine learning framework, whether it's TensorFlow or PyTorch.
And what we do is avoid CUDA entirely. CUDA is the layer of low-level optimizations that NVIDIA talks about;
you've heard of the CUDA moat. Well, our strategy is: don't mess with CUDA. We're just going to take the
trained network and then optimize it for our silicon. That includes quantizing it to what I call inference-friendly data types, you know,
a lower-precision floating point or integer, then mapping it to the low-level kernels that run on
the thousands of RISC processors that we have on our silicon, physically allocating the network,
making sure that it fits on the
silicon in an optimal manner, minimizing the transfer of data, and then ultimately having
a programming file that programs up the chip and the runtime then sends the data to the chip. The
chip does the calculation of the neural network and gives the result back. That's the simplified
version. The longer answer is that we have more software engineers
than we have silicon designers.
That's how important the software is.
Because without the software working, it doesn't matter
how good your silicon is.
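For readers who want to picture the flow Bob outlines (quantize, map to kernels, allocate, emit a programming file), here is a small hypothetical sketch in Python. The helper names, kernel labels, and file format are invented for illustration; this is not Untether AI's actual SDK or toolchain.

    # Hypothetical offline compilation flow, loosely following the steps described:
    # quantize -> map to kernels -> allocate -> emit a programming artifact.
    # All names and formats here are illustrative, not Untether AI's SDK.
    import json
    import numpy as np

    def quantize_int8(weights):
        """Symmetric per-tensor int8 quantization, an 'inference-friendly' data type."""
        scale = float(np.max(np.abs(weights))) / 127.0 or 1.0
        q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
        return q, scale

    def compile_model(layers):
        """Assign each layer a (made-up) kernel and memory bank, then serialize."""
        program = []
        for i, (name, weights) in enumerate(layers):
            q, scale = quantize_int8(weights)
            program.append({
                "layer": name,
                "kernel": "matmul_int8",     # placeholder kernel name
                "memory_bank": i % 8,        # toy physical allocation
                "scale": scale,
                "weights": q.tolist(),
            })
        return json.dumps(program)           # stands in for the programming file

    # Toy usage: two fully connected layers with random float32 weights.
    rng = np.random.default_rng(0)
    model = [("fc1", rng.standard_normal((4, 4), dtype=np.float32)),
             ("fc2", rng.standard_normal((4, 2), dtype=np.float32))]
    artifact = compile_model(model)
    print(f"Programming artifact: {len(artifact)} bytes, {len(model)} layers")

The value Bob points to sits in those middle steps: how well the compiler quantizes and allocates the network across the chip is what produces the software-only throughput and latency gains he describes.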
Now, Bob, I know that we talk about performance and performance efficiency,
but in the AI space, MLCommons has produced a number of different
benchmarks, both for the data center and the edge.
Can you talk a little bit about the performance results that you've gotten on their benchmarks and
which benchmarks in particular customers should be looking at for these inference platforms?
Yeah, you bet. So I'm a big fan of MLPerf and MLCommons, and I've been with Untether now coming
up on five years. But prior to that, I was at a couple of other AI startups when MLCommons
and MLPerf were just an idea. It's great because it is an unbiased, peer-reviewed benchmark. So
when we submitted our benchmark, we got inquiries from all of our peers, AMD, NVIDIA, the other
people that were submitting, and we got to see their results. They got to see our results and
we poked at it. And that's what makes it a good quality benchmark. It's the peer review. It's
also the fact that they measure not just throughput, but also accuracy, meaning there's an accuracy
threshold that you have to meet because quite honestly, a number of AI startups are using
funky data types or weird analog computing technologies that sacrifice
accuracy. And that's why you don't see them submitting to MLPerf because there is an
accuracy bar that if you don't meet it, you don't get to submit. So throughput, accuracy,
and then in the edge category, they measure latency. I talked about that, right? How latency
is really important on the edge. So that's a
measurement that they do. And then finally, there's a separate section, both in the data center and
in the edge benchmarks for power consumption, where rather than looking at just the published
TDPs of the different accelerators, you're actually measuring at the wall socket, the power coming out
of the wall to the server that is housing the accelerators.
So when we submitted, we submitted in the power category, both for the data center,
where we were up against NVIDIA, you know, H200 SXM-class accelerator systems. And then
we also submitted power in the edge category. And unfortunately, there weren't other people
submitting in the edge power category. But when we talk to customers, that's what they look at.
MLPerf is kind of the gold standard, and it covers the different types of benchmarks,
whether it's vision applications, NLP, or generative AI, at the data center and at the edge:
throughput, latency, and power consumption.
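As an aside on what the wall-socket measurement buys you, the sketch below, again hypothetical and in Python, turns a benchmark run's query count and wall-power samples into an efficiency figure a buyer could compare across submissions. The numbers are made up, and real MLPerf power runs use a standardized measurement harness rather than this simple average.

    # Toy efficiency calculation from wall-socket power samples (illustrative only).
    power_samples_w = [612.4, 615.1, 609.8, 618.0, 611.3]   # wall power readings, watts
    run_seconds = 600.0                                      # length of the benchmark run
    completed_queries = 1_200_000                            # queries finished in that run

    avg_power_w = sum(power_samples_w) / len(power_samples_w)
    throughput_qps = completed_queries / run_seconds
    queries_per_joule = throughput_qps / avg_power_w         # i.e., queries per watt-second

    print(f"Average wall power: {avg_power_w:.1f} W")
    print(f"Throughput: {throughput_qps:.1f} queries/s")
    print(f"Efficiency: {queries_per_joule:.2f} queries/joule")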
Now, obviously, 2024 was a lot about large language models and training, but we're headed into 2025. A lot of folks are saying that
enterprise adoption is going to start hockey sticking upwards. How do you see the market in
2025? And are we still looking at the conversations around the token wars or what are we going to be
talking about? Let's set the token wars aside just for a second, because I do have a comment on that. But
on the enterprise, there was a lot of, you know, heat and light generated by generative AI. And
we saw a lot of enterprises, their IT teams and their research teams got, I wouldn't say
distracted, but they wanted to see what that was all about. What can I do with Llama 2 70 billion? What can I do with ChatGPT? And so they went off and investigated
that and kind of lost a little bit of focus about what they were doing prior to that, which was,
how do I use AI in my enterprise? Now I'm seeing in 2025, people going back to that. It's like,
okay, I've taken a look at these generative AI,
large language models, but for my business application, what can I use? What's going to
make my company more efficient? And that's going to be natural language processing for knowledge
discovery. Perhaps it's some chatbots for customer experience and customer support.
But I'm also seeing, in the industrial case, you know,
vision systems on the factory floor and vision-guided robotics. So I think 2025 is going to be back
to the deployment of AI. And it's not just 70-billion or 405-billion-parameter models. It's a
vision model: I'm doing object detection or I'm doing semantic segmentation. It's an NLP model
that's maybe doing knowledge discovery. So we're going to see that coming back, I think,
in the enterprises as they look to really utilize AI to improve their efficiencies.
Now, getting to the token wars, I find it interesting because certainly we're seeing
that some of my compatriot startups in particular are chasing the Artificial Analysis benchmark:
I have the highest throughput for Llama 3 405B.
But what they're missing is that it's outrageously expensive to stand up those systems. So when you look at a Groq system or a Cerebras system or a SambaNova system,
it's measured in the number of racks deployed. So we're talking about the entry point alone
is $5 to $10 million just to spin up a large language model on one of those systems.
And a lot of that was driven by how their architectures were created:
they're not very efficient at running these very large models, so they have to shard them across
many, many pieces of silicon, and it becomes incredibly expensive. And that's why they pivoted
to tokens as a service. But in my mind, that's competing with your own customers,
because I talk to data center customers. I want to sell to them. I don't want to compete with them.
You know, we're not going to stand up our own cloud, use our own silicon,
and become a tokens-as-a-service provider. We're enabling technology for those data centers.
Now, thank you for sharing that on the token wars; I think it's going to be really interesting to follow in '25,
and we keep writing about it on the Tech Arena. But when you look at the enterprise space,
are there any particular verticals that you see moving faster than others, and things that we
should be looking for? Where do I begin? We're a general-purpose AI accelerator. So on any given day, I'll talk to
someone who's looking to stand up a large language model. I'll talk to someone who's looking at doing
semantic segmentation for a vision network. I'll talk to someone who's doing object detection for
garbage sorting. There isn't any one vertical, because AI is just becoming so pervasive and it's being used in a lot of different aspects
of industry. So certainly a lot of activity in autonomous vehicles, vision-guided robotics,
because if it wasn't for AI, you couldn't have those systems. This AI summer
that we're living in started in 2012 with the ImageNet Challenge and
AlexNet; it solved a problem that people couldn't solve before. We can now make perception systems that are
better than humans by using AI. That's what opens up autonomous vehicles and vision-guided robotics
and the autonomous factory floor and smart cities and smart retail. So we see a lot of that. And like
I said, in the enterprise, I think we're still seeing knowledge discovery. Can I take my corpus
of information about my company and fine-tune a language model, so that a new employee can go in,
quickly query it, and find the information they need in a large company? So we're certainly seeing that.
The other thing that we're seeing is kind of the regionalization of data centers. So as an example,
this is public knowledge. We've been working with the Ola group in India. They have an AI group
called Krutrim. And at Ola Krutrim, they're standing up India-specific language models and AI for the Indian market.
And we're working with them, along with Arm, on creating their next data center AI accelerator.
So it's a three-way partnership between ourselves, Arm, and Ola Krutrim.
We're going to supply the AI acceleration solution, a data-center-class part, not an edge-class part.
So this is our next-generation silicon.
And then they're working with Arm on the actual CPU silicon.
And they're going to homegrow the whole thing for a regional, India-specific data center.
We're also seeing that in some of the Middle Eastern countries, where they have the resources and the dollars to stand up very large, megawatt- and gigawatt-scale data centers.
And they're looking for silicon solutions for that.
That's awesome.
Now, obviously, this is an exciting year for the AI space, but it's also an exciting year
for Untether.
Tell us what's coming up in 2025 from you and what are you excited about?
Yeah, so, you know, 2025 is a super exciting year for us.
As I mentioned, with the speedAI family, that's moving to full production, generating revenue, winning customers.
And so what we're doing there, like I said, the silicon's done.
We're going through the production qualification process, but there are still software opportunities.
So we've got a new technology that we're going to add into our compiler for our
Q1 release. And we have a roadmap. Like I said, every three months, we're improving the software
stack. So that's on the current generation, which is getting to production and continuing to
optimize the software stack for speedAI. And then I mentioned the data center application,
that silicon's in design. This is
always the funny thing with a silicon company. You're deploying something that you designed and
taped out 12 months ago, and you're already in design for your next generation. You have to.
So we've got some of our company working on the next generation design, and we're giving them
input about what we've learned from our previous two generations.
But meanwhile, we're also productizing the current silicon and the software.
So you're always on that treadmill.
That's pretty awesome.
I can't wait to see more. I know that the listeners tuning in are going to want to learn more about what we've been talking about.
So where can they go to find out more information and engage your team?
Yeah, as usual, the best place to start is at the website.
So just look up Untether AI.
That's one place.
Certainly, I'm available on LinkedIn.
I think the last time I checked I had 5,000 followers or something like that, but I'm always open
to more.
And if people want to reach out to me, that's a great conduit because I'm publicly available
there.
Awesome.
Well, thank you so much, Bob.
I know you're a busy guy.
It was great spending some time with you.
Hey, it's great catching up with you, Alison.
I really appreciate it.
Thanks for joining the Tech Arena.
Subscribe and engage at our website, thetecharena.net.
All content is copyright by The Tech Arena.