In The Arena by TechArena - Unpacking AI Inference at the Edge with Untether AI
Episode Date: January 29, 2025
Untether AI's Bob Beachler explores the future of AI inference, from energy-efficient silicon to edge computing challenges, MLPerf benchmarks, and the evolving enterprise AI landscape.
Transcript
Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators
and our host, Alison Klein.
Now let's step into the arena.
Welcome to In the Arena.
My name is Alison Klein, and today I am really excited
because Bob Beachler from Untether AI is in the studio with me. Welcome to the program,
Bob. How are you doing? I'm doing great, Alison. Thanks for the opportunity. Looking forward to
catching up with you. Bob, I know you and I have talked a bunch before, but why don't we just start
with an introduction of Untether and what you're delivering in the AI arena?
Sure.
So at Untether, we were really founded to solve inference compute in AI.
So unlike training, which gets a lot of the press and heat and light right now, we
know that inference is going to be a much larger marketplace because it's going to run
24/7, 365.
Depending upon which market research you look at, it's going to be anywhere from three to four times larger in terms of AI acceleration silicon for inference than for training.
So the company focused on that first: how do you run AI inference as energy efficiently as possible? And it came up with a novel architecture, what we
call at-memory compute, so that we're minimizing data movement and maximizing
compute performance for AI workloads. That was really the basis of the company. We started
shipping our first generation of silicon in 2021. And this year we're moving to production with our second
generation of silicon. So we've got now almost seven years of experience in terms of what does
it take to deliver AI inference solutions in the marketplace. Now you started with a conversation
about focus on inference, which I think makes a lot of sense, but can you give us a sense of the
landscape you're seeing?
You know, there's been so much attention on data center training over the last year.
Give us a sense of what the infrastructure landscape looks like and where in the compute
landscape inference is happening. Sure. And you made a good point. You know,
training, and particularly when you start talking about very large language models,
training is a data center application, something you offload to the cloud.
It's non-deterministic, meaning you don't know how long it's going to take you to finish training your model.
And it requires a lot of compute. Inference, on the other hand, you want done quickly, and you want to run it a
multiplicity of times, depending upon how many users you have. And so with inference,
yes, it's used in the data center, but it's also on-prem. It's also at the edge. It's also at the
endpoint. These models are being deployed everywhere, from your cell phone to large, gigawatt-scale data centers.
So unlike training, which is really reserved as a data center type of application due to the characteristics of it,
once you move to inference, it's being deployed everywhere.
So you have different levels of compute, different power consumption requirements, different latency throughput requirements.
And so inference is much broader and larger than you would see with just a training approach.
You know, what's interesting about that is that you've got so many different environments where
inference is happening. And I know, Bob, that you guys really specialize in delivering energy-efficient
solutions for the edge. Can you tell us a little bit about the specific requirements
in edge computing that you're seeing?
Sure.
And the difference between training and inference is that
you want your inference to be deterministic.
And generally for the edge, by definition, it needs to be low latency.
If you can afford to send the data to the cloud,
have the cloud crunch on it,
and get the data back, even though you don't know how long it's going to take and sometimes your connectivity
is off, those types of things, then by definition that type of application is not sensitive to
latency, and therefore you don't need to put it on the edge. But the applications that are being
done either on-prem or on the edge are the ones that
care about latency and determinism. So I'm on a factory floor and I've got things whizzing by
on a conveyor belt that I need to be able to identify. And I can't wait; these things
are going to keep moving, so I can't afford to go to the cloud. Or I'm in an autonomous vehicle,
right? I can't wait
to, you know, send this wirelessly to the cloud and get an answer. I need to know, is that a
pedestrian? Is that a baby buggy? Or is that a bag blowing across, right? So by definition, you know,
these applications are where we focus: providing that low latency. And then you get into,
based upon the application, what's my budget? And by budget I mean:
what's my power budget? How much thermal heat can I dissipate? And so depending on who you talk to
and the application, it'll change. So, you know, let's take the autonomous vehicle as an example.
While they have more energy available, they do need to cool those compute resources, and every watt you use on
compute is less range for an electric vehicle. So they care about energy efficiency as well as
latency. On a factory floor, as an example, it may be a harsher environment and therefore you
want to keep things cool. So you can't afford to put a data-center-class chip on your factory floor, because it runs too
hot. Same thing with vision-guided robotics. Anything that moves, you have these power,
thermal, and latency constraints. And so 100% of the time when we talk to our customers, that's what they care about.
How quickly can you get me the result? How many results can you do in a minute? And how much energy are you
taking to get this done? And then that translates to the data center. Because if you're running,
for example, a 40-megawatt data center, and I can get three to six times the compute
in that same 40 megawatts compared with what I'm using today, that's tens of millions of dollars in energy costs alone that I'm saving.
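To put rough numbers on that claim, here is a minimal back-of-the-envelope sketch in Python. The electricity rate and the 3x efficiency factor are assumptions for illustration, not figures from the conversation.

    # Rough energy-cost sketch for the 40 MW example above.
    # The electricity rate and the 3x efficiency factor are illustrative assumptions.
    facility_power_mw = 40.0               # facility power draw from the example
    price_per_kwh = 0.08                   # assumed industrial electricity rate, USD/kWh
    hours_per_year = 24 * 365              # inference runs 24/7, 365

    annual_kwh = facility_power_mw * 1_000 * hours_per_year
    annual_cost = annual_kwh * price_per_kwh

    # If a more efficient accelerator delivers 3x the compute per watt,
    # today's workload needs roughly one third of the energy.
    efficiency_gain = 3.0
    savings = annual_cost * (1 - 1 / efficiency_gain)

    print(f"Annual energy cost at 40 MW: ${annual_cost:,.0f}")
    print(f"Approximate savings at {efficiency_gain:.0f}x efficiency: ${savings:,.0f}")

At these assumed rates the savings come out to roughly $19 million a year, consistent with the "tens of millions" figure Bob cites.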
That makes a lot of sense. So everybody's talking about the build-out and how much energy these data centers are consuming, and they're doing it predicated on technologies
like GPUs and CPUs that have evolved but haven't had a radical transformation in decades.
Now, you know, you've talked about the fact that you've got a couple of generations of silicon.
You know, you are in a unique position in the AI silicon startup space in that you
actually have silicon in the market. Tell us about your portfolio and how the market has responded to
it. Sure. So it's one thing to say I have a piece of silicon and it can do a network, but
it's your software stack and your SDK that says, okay, I can take a multiplicity of different
networks. You can change the network. You can make it faster. You can make it more accurate
and the software supports it, gets it onto the silicon, and has it running at full performance
in an optimized manner. Over six years of
engagements with customers, we've learned so much, and we continue to evolve our software in order to
adapt to that. Our silicon is baked. It's great. It's good. But I can get a 2x improvement in
throughput or a 50% reduction in latency just on software optimization. So we're constantly updating the software.
In fact, we release it every three months.
We put a new release out to our customers.
Now, how does software fit into this?
Obviously, software optimization has got to be a big part of dialing this in.
Yeah, it's a huge part.
So let's talk about that a little bit.
I'll say 100% of neural networks today are trained on NVIDIA GPUs.
That's not quite right.
You know, there's some TPUs at Google, there's Trainium at AWS, but in
general, everything runs on NVIDIA.
Our job is to take that network that was trained and optimized for NVIDIA and have
it run efficiently and optimally on our silicon.
So we start with that in the machine learning framework, whether it's TensorFlow or PyTorch.
And what we do is avoid CUDA entirely. CUDA is the layer of low-level optimizations that NVIDIA talks about;
you've heard of the CUDA moat. Well, our strategy is: don't mess with CUDA. We're just going to take the
trained network and then optimize it for our silicon. That includes quantizing it to what I call inference-friendly data types, you know,
a lower-precision floating point or integer, then mapping it to the low-level kernels that run on
the thousands of RISC processors that we have on our silicon, physically allocating the network,
making sure that it fits on the
silicon in an optimal manner, minimizing the transfer of data, and then ultimately having
a programming file that programs up the chip and the runtime then sends the data to the chip. The
chip does the calculation of the neural network and gives the result back. That's the simplified
version. The longer answer is that we have more software engineers
than we have silicon designers.
That's how important the software is.
Because without the software working, it doesn't matter
how good your silicon is.
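For readers who want to picture the flow Bob outlines (quantize, map to kernels, allocate, emit a programming file), here is a small hypothetical sketch in Python. The helper names, kernel labels, and file format are invented for illustration; this is not Untether AI's actual SDK or toolchain.

    # Hypothetical offline compilation flow, loosely following the steps described:
    # quantize -> map to kernels -> allocate -> emit a programming artifact.
    # All names and formats here are illustrative, not Untether AI's SDK.
    import json
    import numpy as np

    def quantize_int8(weights):
        """Symmetric per-tensor int8 quantization, an 'inference-friendly' data type."""
        scale = float(np.max(np.abs(weights))) / 127.0 or 1.0
        q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
        return q, scale

    def compile_model(layers):
        """Assign each layer a (made-up) kernel and memory bank, then serialize."""
        program = []
        for i, (name, weights) in enumerate(layers):
            q, scale = quantize_int8(weights)
            program.append({
                "layer": name,
                "kernel": "matmul_int8",     # placeholder kernel name
                "memory_bank": i % 8,        # toy physical allocation
                "scale": scale,
                "weights": q.tolist(),
            })
        return json.dumps(program)           # stands in for the programming file

    # Toy usage: two fully connected layers with random float32 weights.
    rng = np.random.default_rng(0)
    model = [("fc1", rng.standard_normal((4, 4), dtype=np.float32)),
             ("fc2", rng.standard_normal((4, 2), dtype=np.float32))]
    artifact = compile_model(model)
    print(f"Programming artifact: {len(artifact)} bytes, {len(model)} layers")

The value Bob points to sits in those middle steps: how well the compiler quantizes and allocates the network across the chip is what produces the software-only throughput and latency gains he describes.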
Now, Bob, I know that we talk about performance and performance efficiency,
but in the AI space, MLCommons has produced a number of different
benchmarks, both for the data center and the edge.
Can you talk a little bit about the performance results that you've gotten on their benchmarks and
which benchmarks in particular customers should be looking at for these inference platforms?
Yeah, you bet. So I'm a big fan of MLPerf and MLCommons, and I've been with Untether now coming
up on five years. But prior to that, I was at a couple of other AI startups when MLCommons
and MLPerf were just an idea. It's great because it is an unbiased, peer-reviewed benchmark. So
when we submitted our benchmark, we got inquiries from all of our peers, AMD, NVIDIA, the other
people that were submitting, and we got to see their results. They got to see our results and
we poked at it. And that's what makes it a good quality benchmark. It's the peer review. It's
also the fact that they measure not just throughput, but also accuracy, meaning there's an accuracy
threshold that you have to meet because quite honestly, a number of AI startups are using
funky data types or weird analog computing technologies that sacrifice
accuracy. And that's why you don't see them submitting to MLPerf because there is an
accuracy bar that if you don't meet it, you don't get to submit. So throughput, accuracy,
and then in the edge category, they measure latency. I talked about that, right? How latency
is really important on the edge. So that's a
measurement that they do. And then finally, there's a separate section, both in the data center and
in the edge benchmarks for power consumption, where rather than looking at just the published
TDPs of the different accelerators, you're actually measuring at the wall socket, the power coming out
of the wall to the server that is housing the accelerators.
So when we submitted, we submitted in the power category, both for the data center,
where we were up against NVIDIA, you know, H200 SXM-class accelerator systems. And then
we also submitted power in the edge category. And unfortunately, there weren't other people
submitting in the edge power category. But when we talk to customers, that's what they look at.
MLPerf is kind of the gold standard, and it covers the different types of benchmarks,
whether it's vision applications, NLP, or generative AI, at the data center and at the edge:
throughput, latency, and power consumption.
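As an aside on what the wall-socket measurement buys you, the sketch below, again hypothetical and in Python, turns a benchmark run's query count and wall-power samples into an efficiency figure a buyer could compare across submissions. The numbers are made up, and real MLPerf power runs use a standardized measurement harness rather than this simple average.

    # Toy efficiency calculation from wall-socket power samples (illustrative only).
    power_samples_w = [612.4, 615.1, 609.8, 618.0, 611.3]   # wall power readings, watts
    run_seconds = 600.0                                      # length of the benchmark run
    completed_queries = 1_200_000                            # queries finished in that run

    avg_power_w = sum(power_samples_w) / len(power_samples_w)
    throughput_qps = completed_queries / run_seconds
    queries_per_joule = throughput_qps / avg_power_w         # i.e., queries per watt-second

    print(f"Average wall power: {avg_power_w:.1f} W")
    print(f"Throughput: {throughput_qps:.1f} queries/s")
    print(f"Efficiency: {queries_per_joule:.2f} queries/joule")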
Now, obviously, 2024 was a lot about large language models and training, but we're headed into 2025. A lot of folks are saying that
enterprise adoption is going to start hockey sticking upwards. How do you see the market in
2025? And are we still looking at the conversations around the token wars or what are we going to be
talking about? Let's set the token wars aside just for a second, because I do have a comment on that. But
on the enterprise, there was a lot of, you know, heat and light generated by generative AI. And
we saw a lot of enterprises, their IT teams and their research teams got, I wouldn't say
distracted, but they wanted to see what that was all about. What can I do with Llama 2 70 billion? What can I do with ChatGPT? And so they went off and investigated
that and kind of lost a little bit of focus about what they were doing prior to that, which was,
how do I use AI in my enterprise? Now I'm seeing in 2025, people going back to that. It's like,
okay, I've taken a look at these generative AI,
large language models, but for my business application, what can I use? What's going to
make my company more efficient? And that's going to be natural language processing for knowledge
discovery. Perhaps it's some chatbots for customer experience and customer support.
But I'm also seeing, in the industrial case, you know,
vision systems on the factory floor and vision-guided robotics. So I think 2025 is going to be back
to the deployment of AI. And it's not just 70-billion or 405-billion-parameter models. It's a
vision model: I'm doing object detection or I'm doing semantic segmentation. It's an NLP model
that's maybe doing knowledge discovery. So we're going to see that coming back, I think,
in the enterprises as they look to really utilize AI to improve their efficiencies.
Now, getting to the token wars, I find it interesting because certainly we're seeing
that some of my compatriot startups in particular are chasing the Artificial Analysis benchmark:
I have the highest throughput for Llama 3 405B.
But what they're missing is that it's outrageously expensive to stand up those systems. So when you look at a Groq system or a Cerebras system or a SambaNova system,
it's measured in the number of racks deployed. So we're talking about the entry point alone
is $5 to $10 million just to spin up a large language model on one of those systems.
And a lot of that was driven by how their architectures were created:
they're not very efficient at running these very large models, so they have to shard them across
many, many pieces of silicon, and it becomes incredibly expensive. And that's why they pivoted
to tokens as a service. But in my mind, that's competing with your own customers,
because I talk to data center customers. I want to sell to them. I don't want to compete with them.
You know, we're not going to stand up our own cloud, use our own silicon,
and become a tokens-as-a-service provider. We're enabling technology for those data centers.
Now, thank you for sharing that on the token wars; I think it's going to be really interesting to follow in '25,
and we keep writing about it on the Tech Arena. But when you look at the enterprise space,
are there any particular verticals that you see moving faster than others, and things that we
should be looking for? Where do I begin? We're a general-purpose AI accelerator. So on any given day, I'll talk to
someone who's looking to stand up a large language model. I'll talk to someone who's looking at doing
semantic segmentation for a vision network. I'll talk to someone who's doing object detection for
garbage sorting. There isn't any one vertical, because AI is just becoming so pervasive and it's being used in a lot of different aspects
of industry. So certainly a lot of activity in autonomous vehicles, vision-guided robotics,
because if it wasn't for AI, you couldn't have those systems. This AI summer
that we're living in started in 2012 with the ImageNet Challenge and
AlexNet; it solved a problem that people couldn't solve before. We can now make perception systems that are
better than humans by using AI. That's what opens up autonomous vehicles and vision-guided robotics
and the autonomous factory floor and smart cities and smart retail. So we see a lot of that. And like
I said, in the enterprise, I think we're still seeing knowledge discovery. Can I take my corpus
of information about my company and fine-tune a language model, so that a new employee can go in,
quickly query it, and find the information they need in a large company? So we're certainly seeing that.
The other thing that we're seeing is kind of the regionalization of data centers. So as an example,
this is public knowledge. We've been working with the Ola group in India. They have an AI group
called Krutrim. And at Ola Krutrim, they're standing up India-specific language models and AI for the Indian market.
And we're working with them, along with Arm, on creating their next data center AI accelerator.
So it's a three-way partnership between ourselves, Arm, and Ola Krutrim.
We're going to supply the AI acceleration solution, a data-center-class part, not an edge-class part.
So this is our next-generation silicon.
And then they're working with Arm on the actual CPU silicon.
And they're going to homegrow the whole thing for a regional, India-specific data center.
We're also seeing that in some of the Middle Eastern countries, where they have the resources and the dollars to stand up very large, megawatt- and gigawatt-scale data centers.
And they're looking for silicon solutions for that.
That's awesome.
Now, obviously, this is an exciting year for the AI space, but it's also an exciting year
for Untether.
Tell us what's coming up in 2025 from you and what are you excited about?
Yeah, so, you know, 2025 is a super exciting year for us.
As I mentioned, with the speedAI family, that's moving to full production, generating revenue, winning customers.
And so what we're doing there, like I said, the silicon's done.
We're going through the production qualification process, but there are still software opportunities.
So we've got a new technology that we're going to add into our compiler for our
Q1 release. And we have a roadmap. Like I said, every three months, we're improving the software
stack. So that's on the current generation, which is getting to production and continuing to
optimize the software stack for speedAI. And then I mentioned the data center application,
that silicon's in design. This is
always the funny thing with a silicon company. You're deploying something that you designed and
taped out 12 months ago, and you're already in design for your next generation. You have to.
So we've got some of our company working on the next generation design, and we're giving them
input about what we've learned from our previous two generations.
But meanwhile, we're also productizing the current silicon and the software.
So you're always on that treadmill.
That's pretty awesome.
I can't wait to see more. I know that the listeners tuning in are going to want to learn more about what we've been talking about.
So where can they go to find out more information and engage your team?
Yeah, as usual, the best place to start is at the website.
So just look up Untether AI.
That's one place.
Certainly, I'm available on LinkedIn.
I think the last time I checked I had 5,000 followers or something like that, but I'm always open
to more.
And if people want to reach out to me, that's a great conduit because I'm publicly available
there.
Awesome.
Well, thank you so much, Bob.
I know you're a busy guy.
It was great spending some time with you.
Hey, it's great catching up with you, Alison.
I really appreciate it.
Thanks for joining the Tech Arena.
Subscribe and engage at our website, thetecharena.net.
All content is copyright by The Tech Arena.