SemiWiki.com - Podcast EP336: How Quadric is Enabling Dramatic Improvements in Edge AI with Veer Kheterpal
Episode Date: March 20, 2026
Daniel is joined by Dr. Veer Kheterpal. Veer has founded three technology companies and possesses full-stack expertise spanning software to silicon across edge and datacenter applications. Currently, he is the CEO & co-founder of Quadric, a semiconductor IP licensing company that delivers the blueprints for efficient, flexible AI processors to a wide range of customers designing chips for varied applications.
Transcript
Hello, my name is Daniel Nenni, founder of SemiWiki, the open forum for semiconductor professionals.
Welcome to the Semiconductor Insiders podcast series.
My guest today is Veer Kheterpal.
Veer has founded three technology companies and has a full-stack expertise spanning software to silicon across edge and data center applications.
Currently, he is the CEO and co-founder of Quadric, a semiconductor IP licensing company that delivers the blueprints for efficient, flexible AI processors to a wide range of customers designing chips for varied applications.
Welcome to the podcast, Veer.
Thanks, Daniel.
Thanks for having me on board.
Very excited to be here.
Yeah, it's a pleasure to meet you.
I know you, of course, but let's get a little bit of background.
Can you tell us how Quadric came about?
Interesting and long story.
I'll keep it short, though.
I come from Carnegie Mellon and founded a company in chip design software a long time ago,
almost two decades now.
It was acquired fairly quickly by this company called PDF Solutions here in the Valley.
And PDF is where I met my co-founders, Daniel and Nigel, almost 15 years ago.
We built two companies together.
We were co-founders of this Bitcoin mining company here in the Valley in 2013.
Bitcoin was like $30 at the time.
And we were building these ASICs, deploying them in data centers in large quantities.
And then that company was eventually acquired by Coinbase on the back of some blockchain products that we did there.
After that, we looked at the ecosystem.
Our backgrounds really are computing infrastructure, compiler, software, hardware.
And in 2017, we looked at autonomous vehicles and robotics.
Really, I think we were way too early in the effort,
but we envisioned the kind of AI roadmap NVIDIA was laying out.
They were talking then, nine years ago, about what they're doing today.
So they laid the foundation for what they were going to do in the data center.
No one was really talking about how AI gets deployed or leveraged on the edge,
which is outside the data center on your laptop in a robot, in a car.
And we looked at that ecosystem, extremely bespoke efforts, broken software,
constrained architectures.
And we said, well, where is the CUDA for the edge?
No one's really talking about it.
And that laid the foundation of what we have today.
We've gone on, created a new architecture here,
an inference native AI processor.
The foundation is that it's extremely programmable and software-controlled,
which means as AI evolves, this processor keeps up;
all developers have to do is reprogram it, compile new code, and run on it.
As we've built Quadric, we've raised a couple of rounds of funding, Series A, B, and C, along the way,
proved the architecture in silicon, and then chose the IP licensing model, exactly like ARM.
It's based on the observation that our tech is extremely widely applicable.
And we focus on the software layer,
let our customers really build chips
and deploy them in their verticals.
And right around Series C, this is November, December of last year,
we noticed that LLM-driven AI capabilities,
reasoning capabilities, took a very big jump.
And here I am three months later, pretty much
running my personal workflows all through an AI agent.
Across the board in the company, we're leveraging AI for many, many things, way beyond coding.
It's shocking what you can achieve in terms of what's possible today.
And this all came about very, very quickly.
Interesting.
You said you run 80% of your CEO workflows through AI agents.
What does that actually look like day to day?
And how has it changed how you run the company?
It takes a lot of mechanical point and click out of your workflow.
So you pull up this agent and you say, catch me up, right?
And it's wired to understand who you are.
It has a lot of context on the data you have access to.
As the CEO, I have access to all data, financials, projections, investor communications,
all of that.
And with that context, the catch-me-up
will read new email messages.
It'll read my Slack messages.
It'll look at even my texts because there's a lot of conversation happening one-on-one with a lot of strategic contacts over text messages.
And it can synthesize all of that, look at my view of the company, and give suggestions on priority.
So all the mechanical part of me assembling information
across all of this, taking a step back, and saying, what are the big things I need to go after today?
It's doing all of that for me, leaving the decision
to me, which is what I want to do anyway, in terms of
what are the two key big things I need to be driving. And then I do that three times a day,
which is, catch me up again on new emails, new messages. And it doesn't stop there.
Once there's something to do, some goal to achieve, which is a proposal, a contract discussion,
it can launch fairly deep work right there where it can pull additional information across the board
from company management information systems that we have and then help create PowerPoint presentations,
help create proposal decks, which are way better than what you can do in that amount of time, right?
I can now crank out six proposals a day, have four investor conversations, which are fairly
detailed and specific to our goals as a company, all within a day which would take weeks before.
And I'm not talking about firing people and replacing them with this thing.
It is a massive, massive personal leverage tool right now for me.
Wow, that's amazing.
So, agentic AI is the hot topic in cloud software right now.
In your terms, what does agentic mean when it moves to the edge, you know, to devices that don't have a data center behind them?
So let's contrast this a little bit, right, for our listeners to understand.
You know, we've all used ChatGPT, right?
You open the browser, you give it a prompt, it responds.
You can talk to it about anything, really, right?
An agent, at the very fundamental level, you can prompt and drive.
It can be autonomous.
That's a choice.
But the very fundamental difference is it has a goal.
It has a personality that it works towards.
And the second thing between a chatbot and an agent,
I'm drawing this contrast because that's the easiest
to sort of understand:
with a chatbot, you get an answer.
You can go away.
You can come back.
An agent, on the other hand, will look at its goal,
look at all the resources it has access to.
And I'm talking access to emails and tools and software
and information systems you've connected it to.
It can make those tool calls.
It will apply reasoning and find ways to achieve that goal.
And towards that, it could be sitting and writing emails.
It can decide, hey, I should find out this information from this person,
and it will send off an email asking those questions until it can achieve that goal.
So functionally, that's what is different.
And when I say the agent has access to resources, it lives outside the cloud.
It can be on your laptop, has access to your local network, local computers,
any resources you have it connected to, cameras, databases,
and it really can leverage all of that,
which is really hard to do in the cloud.
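The loop being described here, a goal plus reasoning over connected tools, with tool calls repeated until the goal is met, can be sketched in a few lines of Python. Everything below (the `reason` step, the `send_email` tool, the addresses) is an illustrative placeholder, not Quadric's or any real vendor's API:

```python
# Minimal goal-driven agent loop: reason over available tools, call one,
# record the result, repeat until the goal is achieved or a step budget
# runs out. All names here are hypothetical stand-ins.

def run_agent(goal, tools, reason, max_steps=20):
    """reason(goal, history, tools) returns (action, args),
    where action is a tool name or "done"."""
    history = []
    for _ in range(max_steps):
        action, args = reason(goal, history, tools)
        if action == "done":
            return args                   # goal achieved
        result = tools[action](*args)     # tool call: email, search, code...
        history.append((action, args, result))
    return None                           # step budget exhausted

# Toy demo: the "reasoner" emails a contact once, then declares success.
inbox = []
tools = {"send_email": lambda to, q: inbox.append((to, q)) or "sent"}

def reason(goal, history, tools):
    if not history:
        return "send_email", ("alice@example.com", "What is the Q3 number?")
    return "done", f"finished after {len(history)} tool call(s)"

result = run_agent("get the Q3 number", tools, reason)  # one email, then done
```

A production agent would replace the toy `reason` function with an LLM call and the `tools` dict with real connectors (email, Slack, databases), but the control structure is the same.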
There is one more thing I'd like to point out here,
for the more semiconductor processor savvy.
The compute profile, when you think of these two cases, right,
is very different.
An agent is continuously consuming tokens,
continuously taking actions.
It's sustained inference over hours,
versus a burst of inference when you send a message to ChatGPT in the cloud,
and then you get an answer and you go away, right?
So very foundational structural difference
between these two types of inference workloads.
So cloud inference is built around many users sharing a GPU.
On a device,
agentic AI is the opposite: one agent running 24/7
on dedicated silicon.
How does that difference in usage pattern change
what the hardware needs look like?
Great question.
So, you know, when you think about multi-tenancy, right, everything changes.
We've seen that between GPUs, like what NVIDIA has, and architectures like Groq, the one that
they acquired recently. It's precisely that difference: are you designing your silicon,
your software stack, your memory access patterns, all of it, with a single
batch in mind or not, right? That's very foundational, how GPUs differ versus how edge architectures,
like ours, are optimized for single-batch continuous inference. Like our customers using our architecture
in a vehicle for an ADAS function, a driver-assist feature, right, it is continuously pushing camera
frames. There are no breaks. There is no bursty nature to that
workload, right? And that foundation gives rise to very different architectural choices when designing
everything from the ground up. And so cloud AI architectures are like a restaurant kitchen,
right? Cook many things at the same time, evict one workload and introduce another.
Edge-focused ones, or personal agents like these, are a personal chef.
Right? You're cooking all the time, just for you, never stop, you know, catering to that.
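The single-batch point can be made concrete with back-of-envelope arithmetic intensity. In batch-1 LLM decode, every weight is read from memory once per generated token, so the FLOPs available per byte moved are tiny and the workload is bandwidth-bound; cloud batching amortizes that weight traffic across users. The numbers below are illustrative, not measurements of any particular chip:

```python
# Back-of-envelope arithmetic intensity of LLM decode vs. batch size.
# Illustrative model: 7B parameters stored in FP16 (2 bytes per weight).
# Each decode step does ~2 FLOPs per weight per sequence in the batch,
# while the weights are read from memory once and shared by the batch.

PARAMS = 7e9
BYTES_PER_WEIGHT = 2  # FP16

def arithmetic_intensity(batch_size):
    flops = 2 * PARAMS * batch_size          # work per decode step
    bytes_moved = PARAMS * BYTES_PER_WEIGHT  # weight traffic, shared by batch
    return flops / bytes_moved               # FLOPs per byte

edge = arithmetic_intensity(1)    # single agent on a device: 1 FLOP/byte
cloud = arithmetic_intensity(64)  # multi-tenant cloud batch: 64 FLOPs/byte
```

At roughly 1 FLOP per byte the chip is waiting on memory almost all the time, which is why batch-1 edge architectures optimize for bandwidth and on-chip reuse rather than raw peak FLOPs.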
Interesting. So where are you seeing real customer pull for on-device agentic AI today? You know,
what industries are going to be moving first, if you were to put your bets on it?
We have customers already that licensed our IP last year. We have not announced these
customers yet, but they are going to be designing this capability right into the
laptop, using chips designed with our IP, capable of running dense 7-billion to 30-billion-parameter models.
And if you think about the capabilities of those models, they have evolved
significantly. You can do a lot with a local 30-billion-parameter model these days, and only leverage
the cloud hyper-intelligent model for planning or architectural decisions.
And so the use case there is your agents running inference.
Most of the tokens are happening locally on your laptop.
And then use your, you know, Claude or OpenAI subscription to leverage the cloud one wherever necessary.
Right.
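That local-first, cloud-for-planning split can be sketched as a simple router. The keyword heuristic, model stubs, and names below are all hypothetical; a real deployment would use an actual local runtime and a cloud API:

```python
# Hypothetical hybrid router: keep most inference on a local model and
# escalate only planning-level prompts to a cloud frontier model.

PLANNING_HINTS = ("plan", "roadmap", "architecture", "strategy")

def route(prompt, local_model, cloud_model):
    """Return (where_it_ran, response)."""
    if any(hint in prompt.lower() for hint in PLANNING_HINTS):
        return "cloud", cloud_model(prompt)   # rare, expensive, smarter
    return "local", local_model(prompt)       # most tokens stay on-device

# Toy stand-ins for a local 30B model and a cloud frontier model.
local = lambda p: f"[local-30B] {p}"
cloud = lambda p: f"[cloud-frontier] {p}"

where1, _ = route("summarize today's sensor log", local, cloud)  # local
where2, _ = route("draft our product strategy", local, cloud)    # cloud
```

The interesting design choice is the routing policy itself; in practice it might be the local model deciding when it needs help, rather than a fixed keyword list.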
And so personal laptops is one use case.
The other one I strongly believe will be industrial manufacturing.
Similar chipsets go out there, but imagine a machine in a machine room of a big manufacturing plant, and an agent right there, running on a small laptop or a PC, that can write emails to support, observe the machine, write code, and analyze sensor data, right? The whole thing inverts: instead of you streaming the data to the cloud and processing it later,
it's like a little tiny brain diagnosing problems, looking at production data, emailing support,
emailing the management of the factory, and really interacting at that level, right?
That's what I mean by the foundational difference between, again, highlighting agent versus a chatbot.
So those two, I think, are big.
They are happening now.
Behind that, there is robotics, which requires a lot of similar capabilities.
It's not there yet in terms of volumes and how fast it's going to deploy.
Then security and defense will be a whole other area.
I mean, we just had this flare up last week.
It's ongoing between, you know, the DoD and Anthropic.
And so, you know, it's exactly about this use case of an AI agent.
Now, the DoD one we're talking about is more running in the cloud, but that use case is already here and it's happening.
Yeah, we're definitely at the tip of the iceberg here.
So the software tool chain has been a big challenge for AI chip startups.
And so how does Quadric solve the programmability problem for developers building agentic applications?
Great question.
So this one, right, if you look at the history of computing, not the recent AI inference
accelerators, but even go back to DSPs, go back to the early days of x86,
the place where many, many, many architectures died, there was even data flow attempted
on CPUs a long time ago, was compilers, right? If you cannot write code on it easily,
express developer intent, and compile and run it efficiently, it's over, right? It's extremely
hard to scale that. What we did here right from the start, I'm talking 2018, was create a programming
model along with the hardware architecture. So making decisions in hardware which enable certain
constructs at a commonly used C++ level, and creating the C++ for the architecture, which is
a power user's tool, right, to program hardware. And we
designed the hardware alongside that, across three versions of the architecture.
We have now reached version three where the maturity of the C++ has hit certain levels
where a lot of external developers write code on our architecture.
And then we have recently added a Python layer as well, where you come in and, you know,
familiar Python-looking code can compile and go down.
Short answer to your question is, you know, hardware-software co-design, or hardware-compiler
co-design, is what we did here from the beginning, and we focused 80% of our investment in engineering
into that compiler as we developed the hardware. And when you think about agentic workloads,
right, what's happening here? Just recall I mentioned, there are these continuous inference tokens
which are running reasoning, and that reasoning is determining what tools and what resources it has access to and executing on those tasks.
And that's an extremely dynamic environment. It could be invoking another model to analyze audio.
It could be invoking another model to analyze a camera stream or continue running the tokens to write some software to do the analysis.
And I have all of this actually already running.
And so I see these use cases.
And in that dynamic environment, right, ask yourself, like, what kind of inference processor
do I need?
Does it have to be only transformer or only vision?
Or should it be capable of different types of models where a new model next week can
be loaded into the toolbox of this agent, right?
And that really implies the programmability
and flexibility of that inference hardware is paramount if you're going to be running
agentic workloads on it.
That's what makes me really excited.
We made the programmability bet a long time ago, have matured software and hardware around
that concept.
And the big, big, big use cases just arrived one quarter ago.
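The "toolbox" idea, where a new model can be loaded in next week without reworking the agent, implies a runtime where models are registered and dispatched dynamically. A minimal sketch, with all class and model names invented for illustration:

```python
# Sketch of a model "toolbox": models for different modalities can be
# registered, replaced, or added at runtime without changing the agent's
# dispatch logic. Names and stub models are illustrative only.

class ModelToolbox:
    def __init__(self):
        self._models = {}

    def register(self, modality, model_fn):
        """Add or replace a model at runtime, e.g. next week's new model."""
        self._models[modality] = model_fn

    def infer(self, modality, data):
        if modality not in self._models:
            raise KeyError(f"no model loaded for {modality!r}")
        return self._models[modality](data)

toolbox = ModelToolbox()
toolbox.register("audio", lambda d: f"transcript of {d}")
toolbox.register("vision", lambda d: f"objects in {d}")

seen = toolbox.infer("vision", "camera frame 42")

# A week later: a new modality drops in, with no dispatcher changes.
toolbox.register("code", lambda d: f"analysis of {d}")
analyzed = toolbox.infer("code", "sensor_parser.py")
```

On fixed-function inference hardware, each new `register` call would effectively require new silicon; on a programmable processor it is just a recompile, which is the bet being described.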
Great.
So final question, Veer, and this is kind of a vision question: what does the semiconductor industry need to get right in the next two to three years to make on-device agentic AI a reality?
I think one of the things that's near and dear to me personally is ease of use.
I think that generally applies.
Let's not mess that up, right?
You put out silicon, you put out an architecture.
If it's hard to get anything running on it, that's a problem; these days, people expect, you know, to get a lot of things done at a voice command.
And convenience is a huge factor.
So making absolutely sure the silicon, the hardware, is easy to use.
Without that, you will die.
Now, you know, that's the baseline, I think.
The big thing, if you look at these types of models, is the memory wall, right? It's a well-known problem. LLMs are
memory-bound. But if you look at the last 18 months of what went down, right, hardware improvements,
process improvements, which is TSMC delivering on their roadmap and NVIDIA, you know, scaling the GPU on top of it,
that has not been the big foundation of performance improvements.
You know, about 10x has come from those improvements in the last about 24 months,
across process, across, you know, better packaging or what have you, HBM improvements.
But the big one, which is about 23 to 25x improvement, 25 times, has come from software,
which is, you know, things like FlashAttention, FP8, and now
NVIDIA and the frontier labs looking at FP4.
Speculative decoding and quantization techniques have all contributed way more than hardware has, right?
And so that contrast, right, is very stark if you think about it:
much more improvement coming from software as opposed to hardware per year.
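The software-side gains cited here, FP8/FP4 and quantization, attack the memory wall directly: in memory-bound decode, tokens per second scale roughly with bandwidth divided by model bytes, so fewer bits per weight means proportionally faster decode. A back-of-envelope sketch with illustrative numbers, not benchmarks:

```python
# Why quantization attacks the memory wall: for memory-bound decode,
# throughput ~ bandwidth / model_bytes, so halving bits-per-weight
# roughly doubles ideal tokens/sec. Illustrative numbers only.

def model_gigabytes(params, bits_per_weight):
    return params * bits_per_weight / 8 / 1e9

def ideal_decode_speedup(bits_from, bits_to):
    """Ideal memory-bound speedup from requantizing the weights."""
    return bits_from / bits_to

fp16_gb = model_gigabytes(7e9, 16)     # 7B model in FP16: 14.0 GB of weights
fp4_gb = model_gigabytes(7e9, 4)       # same model at 4 bits: 3.5 GB
speedup = ideal_decode_speedup(16, 4)  # ~4x ideal decode throughput
```

Real-world gains come in below the ideal, since activations, KV cache, and accuracy-recovery overheads all take a cut, but the direction matches the software-over-hardware trend described here.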
And what that implies going forward is that model architecture improvements, foundational new types of models, will be a massive driver of performance.
As for bringing down the memory wall, yes, there are efforts out there on analog, purely in-memory techniques.
I think that can change the game if it comes through.
But the more feasible approaches, the ones that are happening now, are happening in software:
new model architectures, new types of quantization techniques.
And that implies the hardware that needs to intercept and deliver on the software that developers
are going after has to be relatively programmable, to allow for that innovation to occur, right?
If you tie down hardware to specific attention, specific types of number formats, that locks everything up,
which is why, if you look at the NVIDIA roadmap from the last eight years, they have been leading on one key factor,
which is all the different number formats that they offer during training.
They started with FP32 and have been absolutely offering all kinds of things,
because that's what gives developers the freedom to build, experiment, and deliver
on the types of models and new architectures, which has come through in terms of performance, right?
Like I said, 25, 30X improvements from software and 15X from hardware in the last 24 months.
Well, that's amazing.
It's a pleasure to meet you.
I'm a big fan, and hopefully we can catch up again and get another update
later on this year. Awesome. Thank you, guys. That concludes our podcast. Thank you all for listening,
and have a great day.
