Software Misadventures - Pete Warden - On launching "AI in a Box" and building a hardware edge AI company - #24
Episode Date: October 23, 2023

What's "AI in a Box"? Pete Warden joins the show to share a new project he recently launched that encapsulates language transcription/translation and question-answering capabilities into a wallet-sized board running locally without internet, as well as stories and learnings from building his new company, Useful Sensors, after 7 years of leading the TensorFlow Mobile project at Google. Pete is the CEO of Useful Sensors. After founding his own company Jetpac and selling it to Google in 2014, he became a staff research engineer at Google, where he led the TensorFlow Mobile team. Pete is also the author of two well-received books, "Public Data Sources" and "Big Data Glossary", and builder of OpenHeatMap.

Show Notes:
AI in a Box crowdfunding: https://www.crowdsupply.com/useful-sensors/ai-in-a-box
Pete's blog: https://petewarden.com/
Useful Sensors: https://usefulsensors.com/

Stay in touch:
✉️ Subscribe to our newsletter: https://softwaremisadventures.com
👋 Send feedback or say hi: hello@softwaremisadventures.com

Segments:
[0:00:00] Failing and trying again, experiential learning
[0:03:13] AI-in-a-Box demo
[0:07:28] Animatronics?
[0:10:12] Privacy and trust
[0:12:04] Talk to your appliances
[0:15:22] How to fit the LLM into such a small chip?
[0:16:50] Quantization
[0:20:07] Disposable ML frameworks
[0:24:33] Updating the model on shipped hardware
[0:28:34] LLMs with specialized domain knowledge
[0:30:08] Founding Useful Sensors
[0:37:07] Scaling training vs. inference
Transcript
Welcome to the Software Misadventures podcast.
We are your hosts, Ronak and Guang.
As engineers, we are interested in not just the technologies,
but the people and the stories behind them.
So on this show, we try to scratch our own itch
by sitting down with engineers, founders, and investors
to chat about their path, lessons they have learned,
and of course,
the misadventures along the way. So like many of you, Ronak and I have been loosely following
the news about large language models, or LLMs. Recently, we came across this pretty cool demo
video that shows an LLM running on a wallet-sized board that's doing real-time language translation,
as well as question answering. Kind of like ChatGPT, but running locally, so he doesn't need Wi-Fi.
We had so many questions, like how do you fit LLMs that have billions of parameters
onto such a small chip?
What's the accuracy like?
What are the use cases?
And where do I get one?
If these questions sound interesting, then stick around,
because in this episode, we're talking to the guy behind it all, Pete Warden.
So you might know Pete from his role leading the TensorFlow mobile project at Google.
But after seven years, recently he started his own company called Useful Sensors.
And the demo video I mentioned is from a project they recently launched called AI in a Box.
Without further ado, let's get into the conversation.
So on your blog, right under the name, you have a subtitle that reads,
Ever tried, ever failed, no matter, try again, fail again, fail better.
Is this a motto of yours?
Yeah, and it's a quote from Beckett, the playwright, and it really stuck with me because what I found is
I don't know how to do things in a particularly smart or intelligent way,
sort of better than anybody else. That's very humble of you.
No, I mean, it's an honest assessment. But what I can do is see if I can get the iteration actually going. I mean, anybody who's done any software engineering or data science knows you try something, it doesn't quite work. But if you can make that cycle fast, you can actually learn really fast. You know, in ML terms, if you can bring that epoch time down, then you actually have a chance to figure stuff out. So it's really trying to capture that idea of getting those iterations in, not being afraid, trying to set things up so you aren't betting everything on one big attempt the first time you try something, really trying to keep iterating on it and getting that iteration time down, and just being comfortable with figuring things out. I think it was Thomas Edison who had a quote about how he tried a thousand different types of light bulbs, and he was excited because he found 999 ways not to make a light bulb, and that was really useful.
What's your take on experiential learning? I remember first hearing about that when someone was kind of jokingly saying, oh yeah, I'm an experiential learner, in that I'm too dumb to take good advice, so I need to make the mistakes myself. When I heard that, I was like, oh my gosh, that's me.
Yeah, I mean, there's definitely part of that. You know, I need to touch the stove myself to really figure out if I'm going to get burned. And I usually do. And a big part of it is actually trying to listen to all the advice I can get. But a lot of advice is contradictory, and a lot of advice is very dependent on the particular circumstances and the particular context. So you've got "look before you leap," but "he who hesitates is lost." You often find yourself in that sort of situation. So a lot of the challenge is either digging in with the person and understanding the details of how they came to their conclusions, or just trying to set things up so that you can course-correct very quickly if you start to see signals that things are headed in the wrong direction.
I like that. I like that.
Okay, so coming back,
so a few weeks ago,
you launched this AI in a Box.
Can you tell us more?
Yeah, so I'll actually give you a little demo.
Oh, wow.
Let's go.
That's awesome.
Oh, wow.
As we are speaking, it's actually transcribing it and showing it on the
screen. That's so cool.
Wow.
And it's actually understanding my speech, which I'm very impressed by, because Ronak can't do that a lot of the time.
Very true.
And I know, for example, I have European friends who have to put on an American accent to get their GPS systems to understand them.
And one of them was actually saying, yeah, I just put my finger on my nose.
To which my wife was like, we don't sound like that.
Wait, that's actually a bit rude because you're not American.
I know.
Wait a minute.
See, I'm just trying to offend everybody here.
Hey, show them the cool demo.
They'll be impressed.
They'll forget what you said.
And this is all based on,
we've taken OpenAI's Whisper model
and we've turned it into a real-time speech-to-text,
you know, because it takes,
it's a batch model that takes 30 seconds of audio.
So by default, you'd be waiting sort of 30 seconds
and doing a chunk at a time.
We've managed to accelerate it significantly on this.
It's all running locally, I should mention, on this little AI in a Box, a sort of credit-card-sized board that's very similar to an Orange Pi. And I'm so excited about being able to do speech-to-text as a local utility that you can have lying around.
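A rough sketch of that chunked approach, using the open-source whisper and sounddevice Python packages rather than Useful Sensors' accelerated implementation; the model size and chunk length here are arbitrary assumptions:

```python
import sounddevice as sd
import whisper

# Small model keeps latency manageable on modest hardware (assumed choice).
model = whisper.load_model("tiny.en")
SAMPLE_RATE = 16000      # Whisper expects 16 kHz audio
CHUNK_SECONDS = 5        # short chunks instead of Whisper's default 30 s window

def stream_transcribe():
    while True:
        # Record a short chunk of microphone audio.
        audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                       samplerate=SAMPLE_RATE, channels=1, dtype="float32")
        sd.wait()
        # Whisper pads/trims internally to 30 s; feeding short chunks trades
        # some accuracy at chunk boundaries for responsiveness.
        result = model.transcribe(audio.flatten(), fp16=False)
        print(result["text"], flush=True)

if __name__ == "__main__":
    stream_transcribe()
```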
I'll do one more demo as well.
Oh, yes, please.
We also have, as well as the speech-to-text, if I say, go into chatty mode,
it now goes into a large language model that's also running locally.
And then if I ask it, are you going to destroy humanity?
This is always one of my favorite prompts.
Oh, wow.
See, that is slightly terrifying.
I mean, the demo is incredible.
The responses at the end, yes, it's very interesting.
Yeah, it's like, oh, and by the way, I may intervene.
But just to show you a practical use case
that actually came up yesterday when I was talking to a customer.
If I ask it, what's the most common problem with a fridge?
Oh, wow.
Interesting.
The reason that was interesting was I was talking to a company that was involved in warranties for home appliances.
And the idea of having a user manual that's actually embedded in the device could save money on their call centers and also potentially save them having to send people out to repair devices. Suddenly that large language model goes from being this kind of abstract thing that you use a webpage to interact with to something that's actually able to understand natural language, able to ask you questions to help you diagnose a problem. I'm just really excited by being able to do speech-to-text to get the text, and then being able to use large language models to understand natural language. I think that's going to be a game changer.
Okay.
This is very exciting.
So there's three topics I want to touch on.
So one is, so use cases.
The other one is sort of the specs, right?
Like how does this compare to, you know,
the latency and all that.
And then the last part is deployment,
like how do we update?
Okay, so maybe starting on the use cases.
So can you give me one use case
that you've found very useful?
And then one use case that's like hilariously silly
or, you know, just bad.
Yeah, so one of the things I really want to be able to do
and kind of the idea behind Useful Sensors is I just want to be able to look at a lamp, say "on," and have the lamp come on.
You know, that should just work.
We have all of these like Alexa and Siri and all this other very elaborate voice control systems, but they don't work like we work.
You know, they don't work like interactions with people.
I just want to be able to build a really simple interaction with these everyday objects.
So I think that's the most useful.
Yeah, for the fun use cases, there have been people building animatronics for theme parks. So, you know, those things on rides, or things that leap out at you in haunted houses.
Yeah.
And being able to do speech to text.
Oh, interesting.
And, you know, actually interact with these things. I love the idea of being able to have one of these creepy monsters, or even, was it Chuck E. Cheese that has those animatronic bands?
You can actually talk and interact with them.
I think that's a whole new level of creepiness that I'm here for.
So the demo that you showed was really cool. And I think one thing which was obviously very different
from how you interact with assistants,
like let's say Google or Alexa,
there was no trigger word.
Like you weren't waking up the device.
You just said what you wanted to say
and the device responded.
And I was reading about AI in a Box,
and this is all happening locally, right?
Yeah, it's all running with no network connection.
So that's one of the kind of amazing things.
And a lot of what we're doing with, for example, the lamp use case is we have this little Person Sensor, which is this little board here, which you can buy for around $10 on SparkFun. And this tells you whether somebody's looking at you. It has a camera and a microcontroller which is running an ML model. And so you've got that social mimicry of actually being able to tell when somebody's connecting with your gaze. Because we really wanted it to be similar to the way that you interact with people, and that is such a key dynamic, you know, when somebody's talking to you because they're looking at you.
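The episode doesn't go into the sensor's wire protocol, but the "metadata only" idea amounts to something like the sketch below: the host polls the board over I2C and gets back a few bytes saying whether a face is present and facing the camera, and no image data ever crosses the bus. The address, register, and payload layout here are hypothetical placeholders, not the real Person Sensor protocol.

```python
import struct
from smbus2 import SMBus

I2C_ADDR = 0x62      # hypothetical peripheral address
RESULT_REG = 0x00    # hypothetical register holding the latest detection
PAYLOAD_LEN = 4      # hypothetical payload: num_faces, is_facing, confidence, reserved

def read_gaze_metadata(bus_num: int = 1) -> dict:
    """Poll the sensor and return only metadata; no pixels are ever exposed."""
    with SMBus(bus_num) as bus:
        raw = bus.read_i2c_block_data(I2C_ADDR, RESULT_REG, PAYLOAD_LEN)
    num_faces, is_facing, confidence, _ = struct.unpack("4B", bytes(raw))
    return {"faces": num_faces,
            "looking_at_me": bool(is_facing),
            "confidence": confidence}

if __name__ == "__main__":
    print(read_gaze_metadata())
```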
Yeah, that is incredible. I think one of the aspects that is super important to call out is the privacy piece, because I know many of my family members who don't work in tech, they use these assistants all the time.
And they would come and say, hey, you work in tech.
Do you know if this thing listens to us?
Because I spoke about a random thing with a friend of mine and I saw an ad on one of the platforms.
So it seems like it's listening to me.
It doesn't matter how much you tell them.
Yes, it's secure and they're not listening like you think they are.
They don't tend to believe that.
Yeah, because people don't trust big tech companies. You know, we've burned through a lot of goodwill over the last decade or two, and they're rightly skeptical. And what it comes down to, like when I was at Google and had the same experience, I would be able to say, yeah, no, I know that code. We're not listening to you when the assistant is off. But I can't prove it, because it's all just in a massive code base inside Google. So what we're trying to do with the on-device stuff and with these little sensors is build systems where the privacy can actually be checked by a third party, like Underwriters Laboratories, Consumer Reports, somebody else who can actually look at the equivalent of an ingredients list for the data that's being shared, and confirm that these little subsystems aren't capable of sharing the camera feed, for example. You can just get information about whether somebody's looking at the device, like the metadata. So that's a really important driver for what we're doing too.
Coming back to the use case piece, one of the things I read about AI in a Box was you can connect a device to it which can take keyboard input. So as you're speaking, it can become more of a transcriber or note-taker, where you can have a document with all your notes. That seems super cool. And one of the things I was chatting with my wife about last night, she didn't know about AI in a Box at the time, and she's like, one of the amazing use cases for AI could be where you have this device at home and you can talk to all your appliances. You could say, turn on the stove, or turn on the microwave for 30 seconds, without having to go and press the button. So do you see this use case going in that direction, where it becomes the control box to connect to various things?
Yeah, and actually one of the things we're trying to do is
make it cheap enough and small enough that each individual appliance can have its own voice
you know, so instead of having a centralized box like we do with AI in a Box, you know, my dream
is to get it down to 50 cents for speech to text.
And then we can actually have, you know, the microwave can know when you're looking at it and it can just listen out for that voice.
And the other thing that would be really nice there is it should work as soon as you plug it in.
Like most of these connected appliances never actually get connected to Wi-Fi because it's such a pain.
So having something that just works out of the box
is a really important value for us too.
One thing that surprised me when I first saw it is just how small it is. You hear about the billions of parameters that are associated with LLMs, and then you have to deploy them on servers. So one question is, how are you guys able to shrink it down to that size? And then the second, I guess, is how this compares to Alexa or Siri. The advantage there, obviously, is that you know what's going on under the hood, and you can actually access a lot of the stuff. But is the model also superior, right? In terms of how it's doing inference.
Yeah, I mean, it's, it's hard to compare because they're doing the inference in the cloud.
And I will say...
Right, right, right.
Sorry, that was a dumb question, you're right.
Yeah.
And you know, I don't want to particularly,
because, you know, I worked very closely
with the Google speech team.
I wouldn't be surprised if their models
actually beat us in, you know,
in some of the word error rate scores and things like that
And so I don't want to, you know, set ourselves up for that sort of competition, because there have been so many smart people on all of those teams. You know, they've had hundreds of engineers working for years, and this is something that has been put together by a small startup. But what I will say is, in practice,
it's actually worked really well. And a lot of that is down to OpenAI's new approach with Whisper,
where instead of taking a lot of labeled speech data, which is very expensive and time consuming
to produce, they've actually just taken sort of semi-structured data off the web, but lots
and lots of it to produce their transcription engine.
So that's just to give you an idea. But we also believe that this is by far the best quality solution
that's outside of those, you know, Google, Amazon, and Apple.
And how are you guys able to deploy the model on such a small chip?
So a lot of it comes down to, you know, there's no single magic bullet.
And we are building on top of a lot of other open source teams work.
But we have, for example, the useful transformers framework that lets us use the NPU that's present on this Rockchip board,
on this Rockchip SoC.
Sorry, the NPU is like a neural processing unit of sorts?
Yeah, sorry, I should have spelled that out.
And so that lets us run twice as fast as any of the other solutions
we've seen on this device and really helps us get that latency down,
which is absolutely key.
And so it really is just a lot of stuff around quantization,
around acceleration, around all these different ways
to kind of deploy and squeeze stuff in.
And on the large language model side,
we talk about billions and billions of parameters,
but if you think about that as eight bits per weight parameter, that's a few gigabytes.
So, you know, in the scale of things, even for single-board computers, they're large language models by machine learning terms, but they're actually well within the capacity of fairly low-end hardware to run.
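The back-of-envelope arithmetic behind that, with an assumed 3-billion-parameter model as an example (not a spec of the model on the box):

```python
# Weight memory = parameter count * bits per parameter / 8 bits per byte.
params_billion = 3   # assumed example size for a "small" LLM
for bits in (16, 8, 4):
    gigabytes = params_billion * 1e9 * bits / 8 / 1e9
    print(f"{params_billion}B params at {bits}-bit -> ~{gigabytes:.1f} GB of weights")
```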
Can you speak a little bit on quantization? I know that's a topic that you're quite passionate about.
Yeah, definitely. I've been all in on quantization. I was actually doing a retrospective for ICCV on quantization, going back through the history, and I've been working on it, I think, since 2013, initially 8-bit. But there's been so much work now around 4-bit and 2-bit that, you know, the whole field has exploded.
And a big part of that is because a lot of these large language models are memory-bound.
So this is something new.
Convolution networks are usually compute-bound.
So you really just have to throw more arithmetic units
at the problem to speed them up.
But because large language models are essentially just doing these big fully connected layers through transformers, and they don't really have large batches, it ends up that you can usually estimate the speed of a model by looking at the DRAM transfer speed, and that's the main limiting factor.
So that means quantization actually helps speed things up because you reduce the memory traffic
and you instead do more with the unpacking compute logic.
So it's been really interesting seeing how it's gone from being something that's just about meeting these kind of constraints in terms of how much memory and storage you have to actually being a latency and throughput thing.
And you can also get a lot more elaborate with the quantization schemes.
So they aren't just linear encodings; they can use lookup tables or complex functions. So yeah, unfortunately I've not been as hands-on as I would like over the last year or two, but it's been fantastic seeing everything that's happened in the field.
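A worked version of the "speed is DRAM bandwidth" estimate; both the model size and the bandwidth figure below are illustrative assumptions, not measurements of the board:

```python
# If decoding one token means streaming the weights from DRAM once,
# tokens/second is roughly bandwidth / bytes_of_weights.
params = 3e9             # assumed ~3B-parameter model
dram_bandwidth = 8e9     # assumed ~8 GB/s for a low-end SoC

for bits in (16, 8, 4):
    weight_bytes = params * bits / 8
    print(f"{bits}-bit weights: ~{dram_bandwidth / weight_bytes:.1f} tokens/sec")
```

Halving the bits roughly doubles the token rate, which is the sense in which quantization has become a latency and throughput tool rather than just a storage one.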
I remember for convolution, so for vision-task networks, a lot of times when you quantize, you have to retrain the model, right? Because it actually results in different things. Does that still hold for language models as well?
Actually, I've seen a lot of work
where people have been able to go down to four bits,
for example, without doing retraining
and sometimes even two bits.
There is some accuracy loss,
but it's not unworkable.
And what I've seen is the accuracy improves when you are doing quantization aware training.
But the fact that you can quantize these models sometimes without having to do that retraining,
it kind of makes me think that there's a lot of redundancy.
You know, like they're over-parameterized, or however you want to call it. There's sort of room to represent the same information with fewer parameters, because being able to compress it well is usually a sign that there's a lot of redundancy there. So anyway, that's just a gut feeling.
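As one concrete instance of quantizing after the fact, here is a minimal post-training dynamic quantization sketch in PyTorch; the toy two-layer model is a stand-in for a transformer block's fully connected layers, not anything shipped on the box:

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer block's fully connected layers (assumed sizes).
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).eval()

# Post-training dynamic quantization: weights stored as int8, no retraining or
# calibration data needed. Quantization-aware training would instead simulate
# int8 arithmetic during fine-tuning to claw back accuracy.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 1024)
print(quantized(x).shape)  # same interface, roughly 4x smaller Linear weights
```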
So, the last part, in terms of deployment: when new models come out, how do you see the, what is it, throwaway models? Sorry, oh, disposable, sorry. Yeah, so are you able to just grab these and deploy them on the box as things advance?
Yes, but what I see happening a lot more is with traditional ML frameworks
like TensorFlow and PyTorch,
you expect to be able to just keep the framework the same
and just bring in new models as data files,
essentially representing the architecture and the weights.
If you look at things like GGML and what we're doing with Useful Transformers, there's a lot more work involved in supporting a new model from the framework side. Like, they don't claim to be ready for you to bring an arbitrary model and have it just run. So there's actually some coding work involved to bring in new models. And the reason I think that's worth it is that models are actually changing much more slowly. Or another way of looking at it is
a lot of different tasks are now able to be solved
using one of a handful of different models.
So putting in extra effort to support models in code is a worthwhile investment, especially on the inference side.
How do you weigh that trade-off? In this case, if you have a specific
framework that works with a set of models or a single model, you can have a new use case really
quickly deployed. But then at a company who has a bunch of these, one of the questions that is often asked is
if every use case uses a different framework, then the maintenance cost, the cost to change
them, update them, any instrumentation that you need to add to those frameworks becomes
increasingly more expensive as you have N plus one. So how do you see that being traded off with
these disposable frameworks?
Yeah, and that's a really good question. And it's something that haunted us at Google and is the reason that TensorFlow was the sort of rallying point around which we tried to get everybody to use the same framework for those maintenance reasons.
I think that we might have to think about reusability
at a different level.
So if you imagine instead of the machine learning
sort of architecture and weights and things
being the bit you share,
maybe we can actually have underlying
matrix multiply implementations,
you know, the old-school BLAS GEMM function for matrix multiply,
maybe that's the point at which we expect frameworks
to share common implementations.
And then everything else can be done in a much more idiosyncratic way for different models. Or maybe it's something like the cuDNN library interface, and that's the common layer that all of these libraries share. So I think we just have to be more imaginative
and more flexible on the infrastructure side to support this,
because it clearly makes a lot of sense
for application and product people.
If it's less work to write your own framework to run a model than it is to figure out how to use this kind of massive crawling horror of a framework that includes everything, then that's really a sign that we haven't done our jobs right.
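A toy illustration of reuse at the GEMM level: the only shared primitive is a matrix multiply, and everything above it is short, model-specific "disposable" code. This sketches the idea, not the actual design of Useful Transformers or GGML:

```python
import numpy as np

def gemm(a, b):
    """The shared primitive every framework would agree on."""
    return a @ b

def attention(x, wq, wk, wv, wo):
    """Hard-coded single-head attention for one specific model, built only on gemm()."""
    q, k, v = gemm(x, wq), gemm(x, wk), gemm(x, wv)
    scores = gemm(q, k.T) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return gemm(gemm(weights, v), wo)

d = 64
x = np.random.randn(10, d)
wq, wk, wv, wo = (np.random.randn(d, d) for _ in range(4))
print(attention(x, wq, wk, wv, wo).shape)  # (10, 64)
```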
Going back to the deployment piece for a second. So this AI in a Box, it's a complete solution, so like you said, you plug it in, ready to go. Have you figured out, once you have clients who have these devices, if and when you need to, let's say, add a new model or use something else, how that would be deployed to these edge devices?
Yeah, we'll send them new hardware. Now, that's a bit of a glib answer,
but especially the way that we think about this is,
if you think about a temperature sensor
or a pressure sensor or an accelerometer,
you don't expect to be able to do software updates on them.
Right.
They work the way they work out of the box,
and you build a system around their, you know, limitations and capabilities.
And we as software engineers, you know, it makes all of us nervous not being able to update code because it's like, oh, you know, what if we ship a bug?
But for things like speech to text, I think that we come up with, you know, the best solution that we can at a particular time.
People then build a system around that, and it would often actually be pretty disruptive to make changes, because you have to redo the QA, you have to do things like check the security and privacy all over again if somebody has given a certification. And usually these devices also aren't under a subscription model, which is what we're used to on the web and for phones and things like that. You know, if your light switch is able to understand you, you don't want it relying on update servers that may get hacked over its 20- or 30-year lifetime. Imagine being on the hook for maintenance for decades.
Yeah, it's terrifying. So for a lot of the companies we work with, having something that just doesn't have that liability of updatability is really important for them. Like, they want something that works. And then, you know, if something really new and big comes along, that's a big change, they want people to buy next year's model.
So another follow-up on customizability.
So if, say, someone has a private corpus of text that they want the model to be able to reference, say cooking recipes that they really like, how would that work in this context?
Yeah, we actually have one of our interns, who's based at UT Austin doing his PhD, working on how to create, we often call them tiny large language models, that are actually able to be trained on very specialized knowledge.
So, you know, like I was talking about with appliances
and having user manuals built in to the appliance that you can talk to,
you know, there you'd feed in the user manual, or maybe the customer support guide, as the source. And in your case, you might feed in all these recipes as the source, and then have the large language model able to answer questions about it. I'll share some links after the podcast to Evan's work, because I think that
there's a big opportunity to have something, you know, maybe it's only hundreds of millions of
parameters that encode this kind of specialized knowledge. You know, another thing I'm really
interested in is, you know, if you think about home improvement stores, I would love to put one
of these boxes on every pillar so you can walk up to it and press a button and ask, okay, where are the nails?
Oh, yes, please.
And in that case, you really want to be able to just feed in a CSV file, a spreadsheet of all of the parts and where they exist.
So, yeah, I see a massive amount of potential in being able to set up large language models with this kind of specialized domain knowledge.
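The simplest no-training version of this is just to put the document into the prompt of whatever local model is running; a minimal sketch (an illustration only, not Evan's research, and `generate` is a hypothetical stand-in for the box's LLM runtime):

```python
def answer_from_manual(manual_text: str, question: str, generate) -> str:
    """Answer a question using only the supplied manual, via prompt stuffing."""
    prompt = (
        "You are the built-in help system for a home appliance.\n"
        "Answer using only the manual below.\n\n"
        f"MANUAL:\n{manual_text}\n\n"
        f"QUESTION: {question}\nANSWER:"
    )
    return generate(prompt)  # hypothetical callable wrapping the local LLM
```

The trade-off is context length: a full user manual or parts spreadsheet may not fit in a small model's window, which is where the fine-tuned "tiny large language model" approach comes in.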
And is it possible to do it in such a way that it doesn't involve retraining, where you'd need a lot of compute in order to do it?
We do think that there are ways to do this with fine-tuning.
I think that, you know, how much training is involved is still a research question.
But we don't expect you to have to have, like, OpenAI's sort of GPU cluster in order to be able to do this.
That's our hope.
I see.
And have you guys explored, so I read earlier about where you can do kind of a multi-shot learning, which seems a bit janky, but you're literally just like, hey, step one is taking in all this corpus, and then step two is, I'm going to ask you questions about it. Have you guys tested that out? Is that usable?
You know, I haven't actually tried that yet.
That's kind of similar in idea, I think, to some of Evan's work.
Ah, cool.
Yeah, so, you know, when I share that, hopefully that will help.
I'll check out the link.
Honestly, I'm CEO now, so I'm kind of out of my depth on a lot of the technical questions.
Just nodding and smiling in meetings.
So speaking of that, a nice pivot.
So after seven years of
building TensorFlow Mobile at Google, you founded
the company Useful Sensors at the
start of last year. What made you
decide to do that?
It's really hard to put out hardware at Google, because there's such a reputational risk. You know, if you think about something like Google Glass, there was such a backlash to that. At any big company, it's a lot easier to open source software and models, or even to spin up something as a new app or new website, but bringing out something that's actually under the Google umbrella, that's a piece of hardware, requires a lot more approvals because of the potential reputational impact. So, you know, I really wanted to build
things like the AI box and these little sensors and things like that as a way of getting out
these ML capabilities into the hands of people who would not necessarily want to particularly
learn ML, but would want to benefit from some of the things that it can do. So really the only way to do that was to do it as a startup.
So you were a CTO before you joined Google.
You were a tech lead at Google, and now you're transitioning to a CEO role.
And as you were saying, you're meeting with clients,
you're thinking more about the business than being more hands-on.
How has that transition been for you?
Oh, really hard.
Say more, say more.
Yeah.
Well, my, you know, when things get tricky, my default is to start trying to code up a
solution.
And that's not the right thing to do, especially, for example, in business development. We recently got our first full-time sales employee, Terry, and he's only been with us for two months, but I've already learned so much about working with business customers and B2B from being with him.
And he's made a massive difference to the whole company.
And, you know, looking back now, I see that for a lot of my first 12 months, I was doing meetings with companies and there was interest, but I was not driving things to actually get contracts closed, get stuff over the line, really do all of the specialized work that's involved in business development and sales to make this stuff happen. So, yeah.
And, you know, hats off to one of our VCs, Mike Dauber at Amplify, who has been super supportive, but was also extremely persuasive around, you've got to get somebody on the sales side, Pete.
And I was, you know, I honestly was like, you know, I was listening, but
you know, it wasn't at the level
of urgency that it should have been.
And he really managed to get me
to do that. And it's
been a revelation.
I think fundraising is another part that would have been new compared to being at a company. There again, you're asking for funding, but with projects like TensorFlow, it's already funded, so the conversation you're having is very different. So what is that like?
I mean, one nice thing was that, you know,
I spent probably a year in like 2010
going around every single VC in the Valley
and getting told no.
So I actually had, you know, been through that before we got the funding that we needed for Jetpac, my first startup, in the end. And so I actually have spent
a bunch of time on the fundraising before. And luckily, I've actually got to know a bunch of
these over the years. And most of the VCs that we're working with are people that I've known for more than a decade. So that has been really fantastic. It's been wonderful working with them. And, you know, we are in a weird area because we're AI, which is, you know,
has a lot of momentum behind it,
but also doing hardware,
which nobody in the Valley
pretty much is familiar with.
And, you know, it's very rare
to have startups that are doing hardware.
So we're in this really interesting position
from the fundraising side.
Well, absolutely. I think, as you mentioned, on the hardware side, doing hardware is one of the
hardest things. Software scales much better. And you see this in venture funding, you see this in
just the number of startups trying to do this. And all of the things like labor is expensive,
A-B testing is hard, so many challenges with hardware, a number of things that can go wrong. So how are you thinking about just being able to monetize the products that you build over
time? And any advice for other engineers and some of our listeners who might be thinking about this
on how they can think about it if they want to do something in the hardware space?
I mean, I guess the first advice would be don't.
I saw that coming. I saw that coming.
Yes, I thought you might.
I mean, it is. And hats off to, you know, I'm coming at this as fundamentally a software
engineer, and I'm learning about hardware; I can barely solder. And, you know, we've had a fantastic team who've made it work, but just the time scales involved with hardware are mind-boggling if you're coming from the software world. You know, sort of six to twelve months just to put the initial version together, waiting for months on factories, trying to get all of the components together. It's an entirely new world of problems that you're exposing yourself to. So if at all possible, try and find some way to at least mock up your stuff on the web. Like, we've actually been emulating what our devices do as web demos so that we can at least share them with prospective clients
a lot more easily.
Advice one, don't.
That's an important one.
I know we are running up against time, so I want to capture a few questions before we go.
On your blog, by the way, you're a prolific writer.
And you mentioned on your blog, you think through writing.
I think it's incredible.
I want to talk a whole lot about that, but maybe another time.
In one of your posts, you captured something someone told you:
training costs scale with the number of researchers,
inference costs scale with the number of users.
And you made a prediction there.
Can you tell us more about this?
Yeah, so it's this idea that if machine learning is successful, even though training an individual
model might involve hundreds of GPUs for weeks, that model, once it's in production, might be reaching hundreds of millions of users for years. Even though the individual inference cost, for each call to the model, is a lot smaller than training on an epoch of data, the number of users and the amount of time that it's going to be used for means that
the amount of compute that's going to be spent on inference is going to be much, much larger than
the amount of compute that you spend on training. And a whole bunch of interesting kind of things
flow from that. I guess one of your predictions there was NVIDIA might not stay as is for much longer considering
how just the ecosystem is evolving as we have these more use cases that come in for inference
Yeah, and honestly, that was me being a bit spicy. There was something in my cereal that morning. But it was this idea that as researchers, we really need convenience to be able to experiment with different models. And you also need the absolute lowest latency and highest throughput you can get, because the limitation almost all the time is how long the model takes to train, which is where NVIDIA wipes the floor with everybody. So for training, they are going to be the kings for the foreseeable future. But for inference, because the total costs for inference are going to be so much higher, that's where this idea of disposable ML frameworks, and doing custom coding to support that use case, becomes a lot more sensible. It's worth investing a whole bunch of engineer time to take a model. For example, I've heard something like ChatGPT takes four cents per call to return a result.
So if you imagined like, you know, hundreds of millions of API calls, you know, a month
or something like that, suddenly it becomes totally sensible to take a bunch of engineers
and just say, hey, take this particular model, even if we only use it for a
year or two years, the amount of money we're going to be spending on inference is so high.
It's totally worth a bunch of highly paid engineers, optimizing and profiling and doing
special-purpose stuff to speed things up. You know, it will pay off very quickly when you're spending that much. And part of that might be moving over to something that's not a GPU to execute this stuff.
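Putting rough numbers on that, using the hearsay four-cents-per-call figure and an assumed call volume:

```python
cost_per_call = 0.04            # the ~4 cents/call figure quoted above (hearsay)
calls_per_month = 100_000_000   # "hundreds of millions of API calls a month" (assumed)
months = 12
print(f"~${cost_per_call * calls_per_month * months:,.0f} per year on inference")  # ~$48,000,000
# Next to that, a team of engineers hand-optimizing one model pays for itself quickly.
```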
And I think cheaper inference also results in more use cases like AI in a Box, which enables a whole lot of economies of scale.
Exactly.
Pete, this has been awesome.
Thank you so much for joining us.
Before we let you go, is there anything else you would like to share with our listeners?
No, thank you so much for having me on.
This has been fantastic.
Awesome.
Thank you so much for joining.
And I think you have a crowdfunding campaign for AI in a Box.
We'll make sure we link that in the show notes
and we highly encourage our listeners to go check it out.
And thanks so much again, Pete.
Okay, thanks so much for the chat.
Thanks, Guang.
Hey, thank you so much for listening to the show.
You can subscribe wherever you get your podcasts
and learn more about us at softwaremisadventures.com.
You can also write to us at hello@softwaremisadventures.com. We would love to hear from you. Until next time,
take care.