Software Misadventures - Pete Warden - On launching "AI in a Box" and building a hardware edge AI company - #24
Episode Date: October 23, 2023

What's "AI in a Box"? Pete Warden joins the show to share a new project he recently launched that encapsulates language transcription/translation and question-answering capabilities into a wallet-sized board running locally without internet, as well as stories and learnings from building his new company, Useful Sensors, after 7 years of leading the TensorFlow Mobile project at Google. Pete is the CEO of Useful Sensors. After founding his own company Jetpac and selling it to Google in 2014, he became a staff research engineer at Google, where he led the TensorFlow Mobile team. Pete is also the author of two well-received books, "Public Data Sources" and "Big Data Glossary", and builder of OpenHeatMap.

Show Notes:
AI in a Box crowdfunding: https://www.crowdsupply.com/useful-sensors/ai-in-a-box
Pete's blog: https://petewarden.com/
Useful Sensors: https://usefulsensors.com/

Stay in touch:
✉️ Subscribe to our newsletter: https://softwaremisadventures.com
👋 Send feedback or say hi: hello@softwaremisadventures.com

Segments:
[0:00:00] Failing and trying again, experiential learning
[0:03:13] AI-in-a-Box demo
[0:07:28] Animatronics?
[0:10:12] Privacy and trust
[0:12:04] Talk to your appliances
[0:15:22] How to fit the LLM into such a small chip?
[0:16:50] Quantization
[0:20:07] Disposable ML frameworks
[0:24:33] Updating the model on shipped hardware
[0:28:34] LLMs with specialized domain knowledge
[0:30:08] Founding Useful Sensors
[0:37:07] Scaling training vs. inference
Transcript
Welcome to the Software Misadventures podcast.
We are your hosts, Ronak and Guang.
As engineers, we are interested in not just the technologies,
but the people and the stories behind them.
So on this show, we try to scratch our own itch
by sitting down with engineers, founders, and investors
to chat about their path, lessons they have learned,
and of course,
the misadventures along the way. So like many of you, Ronak and I have been loosely following
the news about large language models, or LLMs. Recently, we came across this pretty cool demo
video that shows an LLM running on a wallet-sized board that's doing real-time language translation,
as well as question answering. Kind of like ChatGPT, but running locally, so he doesn't need Wi-Fi.
We had so many questions, like how do you fit LLMs that have billions of parameters
onto such a small chip?
What's the accuracy like?
What are the use cases?
And where do I get one?
If these questions sound interesting, then stick around,
because in this episode, we're talking to the guy behind it all, Pete Warden.
So you might know Pete from his role leading the TensorFlow mobile project at Google.
But after seven years, recently he started his own company called Useful Sensors.
And the demo video I mentioned is from a project they recently launched called AI in a Box.
Without further ado, let's get into the conversation.
So on your blog, right under the name, you have a subtitle that reads,
Ever tried, ever failed, no matter, try again, fail again, fail better.
Is this a motto of yours?
Yeah, and it's a quote from Beckett, the playwright, and it really stuck with me because what I found is
I don't know how to do things in a particularly smart or intelligent way,
sort of better than anybody else. That's very humble of you.
No, I mean, it's an honest assessment. But what I can do is see if I can get the iteration actually going. I mean, anybody who's done any software engineering or data science knows you try something, it doesn't quite work. But if you can make that cycle fast, you can actually learn really fast. You know, in ML terms, if you can bring that epoch time down, then you actually have a chance to figure stuff out. So it's really trying to capture that idea of getting those iterations in, not being afraid, trying to set things up so you aren't betting everything on one big attempt the first time you try something, really trying to keep iterating on it and getting that iteration time down, and just being comfortable with figuring things out. I think it was Thomas Edison who had a quote about how he tried a thousand different types of light bulbs, and he was excited because he found 999 ways not to make a light bulb, and that was really useful.
What's your take on experiential learning? I remember first hearing about that when someone was kind of jokingly saying, oh yeah, I'm an experiential learner, in that I'm too dumb to take good advice, so I need to make the mistakes myself. When I heard that, I was like, oh my gosh, that's me.
Yeah, I mean, there's definitely part of that. You know, I need to touch the stove myself to really figure out if I'm going to get burned. And I usually do. And a big part of it is actually trying to listen to all the advice I can get. But a lot of advice is contradictory, and a lot of advice is very dependent on the particular circumstances and the particular context. So you've got "look before you leap," but "he who hesitates is lost." You often find yourself in that sort of situation. So a lot of the challenge is either digging in with the person and understanding the details of how they came to their conclusions, or just trying to set things up so that you can course-correct very quickly if you start to see signals that things are headed in the wrong direction.
I like that. I like that.
Okay, so coming back,
so a few weeks ago,
you launched this AI in a Box.
Can you tell us more?
Yeah, so I'll actually give you a little demo.
Oh, wow.
Let's go.
That's awesome.
Oh, wow.
As we are speaking, it's actually transcribing it and showing it on the
screen. That's so cool.
Wow.
And it's actually understanding my speech, which I'm very impressed by, because Ronak can't do that a lot of the time.
Very true.
And I know, for example, I have European friends who have to put on an American accent to get their GPS systems to understand them.
And one of them was actually saying, yeah, I just put my finger on my nose.
To which my wife was like, we don't sound like that.
Wait, that's actually a bit rude because you're not American.
I know.
Wait a minute.
See, I'm just trying to offend everybody here.
Hey, show them the cool demo.
They'll be impressed.
They'll forget what you said.
And this is all based on,
we've taken OpenAI's Whisper model
and we've turned it into a real-time speech-to-text,
you know, because it takes,
it's a batch model that takes 30 seconds of audio.
So by default, you'd be waiting sort of 30 seconds
and doing a chunk at a time.
We've managed to accelerate it significantly on this.
It's all running locally, I should mention, on this little AI in a Box, a sort of credit-card-sized board that's very similar to an Orange Pi. And I'm so excited about being able to do speech-to-text as a local utility that you can have lying around.
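A rough sketch of that chunked approach, using the open-source whisper and sounddevice Python packages rather than Useful Sensors' accelerated implementation; the model size and chunk length here are arbitrary assumptions:

```python
import sounddevice as sd
import whisper

# Small model keeps latency manageable on modest hardware (assumed choice).
model = whisper.load_model("tiny.en")
SAMPLE_RATE = 16000      # Whisper expects 16 kHz audio
CHUNK_SECONDS = 5        # short chunks instead of Whisper's default 30 s window

def stream_transcribe():
    while True:
        # Record a short chunk of microphone audio.
        audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                       samplerate=SAMPLE_RATE, channels=1, dtype="float32")
        sd.wait()
        # Whisper pads/trims internally to 30 s; feeding short chunks trades
        # some accuracy at chunk boundaries for responsiveness.
        result = model.transcribe(audio.flatten(), fp16=False)
        print(result["text"], flush=True)

if __name__ == "__main__":
    stream_transcribe()
```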
I'll do one more demo as well.
Oh, yes, please.
We also have, as well as the speech-to-text, if I say, go into chatty mode,
it now goes into a large language model that's also running locally.
And then if I ask it, are you going to destroy humanity?
This is always one of my favorite prompts.
Oh, wow.
See, that is slightly terrifying.
I mean, the demo is incredible.
The responses at the end, yes, it's very interesting.
Yeah, it's like, oh, and by the way, I may intervene.
But just to show you a practical use case
that actually came up yesterday when I was talking to a customer.
If I ask it, what's the most common problem with a fridge?
Oh, wow.
Interesting.
The reason that was interesting was I was talking to a company that was involved in warranties for home appliances.
And the idea of having a user manual that's actually embedded in the device could save money on their call centers and also potentially save them having to send people out to repair devices. Suddenly that large language model goes from being this kind of abstract thing that you use a webpage to interact with to something that's actually able to understand natural language, able to ask you questions to help you diagnose a problem. I'm just really excited by being able to do speech-to-text to get the text, and then being able to use large language models to understand natural language. I think that's going to be a game changer.
Okay.
This is very exciting.
So there's three topics I want to touch on.
So one is, so use cases.
The other one is sort of the specs, right?
Like how does this compare to, you know,
the latency and all that.
And then the last part is deployment,
like how do we update?
Okay, so maybe starting on the use cases.
So can you give me one use case
that you've found very useful?
And then one use case that's like hilariously silly
or, you know, just bad.
Yeah, so one of the things I really want to be able to do
and kind of the idea behind Useful Sensors is I just want to be able to look at a lamp, say "on," and have the lamp come on.
You know, that should just work.
We have all of these like Alexa and Siri and all this other very elaborate voice control systems, but they don't work like we work.
You know, they don't work like interactions with people.
I just want to be able to build a really simple interaction with these everyday objects.
So I think that's the most useful.
Yeah, for the fun use cases, there have been people building animatronics for theme parks. So, you know, those things on rides, or things that leap out at you in haunted houses.
Yeah.
And being able to do speech to text.
Oh, interesting.
And, you know, actually interact with these things. I love the idea of being able to have one of these creepy monsters, or even, was it Chuck E. Cheese that has those animatronic bands?
You can actually talk and interact with them.
I think that's a whole new level of creepiness that I'm here for.
So the demo that you showed was really cool. And I think one thing which was obviously very different
from how you interact with assistants,
like let's say Google or Alexa,
there was no trigger word.
Like you weren't waking up the device.
You just said what you wanted to say
and the device responded.
And I was reading about AI in a Box,
and this is all happening locally, right?
Yeah, it's all running with no network connection.
So that's one of the kind of amazing things.
And a lot of what we're doing with, for example, the lamp use case is we have this little Person Sensor, which is this little board here, which you can buy for around $10 on SparkFun. And this tells you whether somebody's looking at you. It has a camera and a microcontroller which is running an ML model. And so you've got that social mimicry of actually being able to tell when somebody's connecting with your gaze. Because we really wanted it to be similar to the way that you interact with people, and that is such a key dynamic, you know, when somebody's talking to you because they're looking at you.
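The episode doesn't go into the sensor's wire protocol, but the "metadata only" idea amounts to something like the sketch below: the host polls the board over I2C and gets back a few bytes saying whether a face is present and facing the camera, and no image data ever crosses the bus. The address, register, and payload layout here are hypothetical placeholders, not the real Person Sensor protocol.

```python
import struct
from smbus2 import SMBus

I2C_ADDR = 0x62      # hypothetical peripheral address
RESULT_REG = 0x00    # hypothetical register holding the latest detection
PAYLOAD_LEN = 4      # hypothetical payload: num_faces, is_facing, confidence, reserved

def read_gaze_metadata(bus_num: int = 1) -> dict:
    """Poll the sensor and return only metadata; no pixels are ever exposed."""
    with SMBus(bus_num) as bus:
        raw = bus.read_i2c_block_data(I2C_ADDR, RESULT_REG, PAYLOAD_LEN)
    num_faces, is_facing, confidence, _ = struct.unpack("4B", bytes(raw))
    return {"faces": num_faces,
            "looking_at_me": bool(is_facing),
            "confidence": confidence}

if __name__ == "__main__":
    print(read_gaze_metadata())
```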
Yeah, that is incredible. I think one of the aspects that is super important to call out is the privacy piece, because I know many of my family members who don't work in tech, they use these assistants all the time.
And they would come and say, hey, you work in tech.
Do you know if this thing listens to us?
Because I spoke about a random thing with a friend of mine and I saw an ad on one of the platforms.
So it seems like it's listening to me.
It doesn't matter how much you tell them.
Yes, it's secure and they're not listening like you think they are.
They don't tend to believe that.
Yeah, because people don't trust big tech companies. You know, we've burned through a lot of goodwill over the last decade or two, and they're rightly skeptical. And what it comes down to, like when I was at Google and had the same experience, I would be able to say, yeah, no, I know that code. We're not listening to you when the assistant is off. But I can't prove it, because it's all just in a massive code base inside Google. So what we're trying to do with the on-device stuff and with these little sensors is build systems where the privacy can actually be checked by a third party, like Underwriters Laboratories, Consumer Reports, somebody else who can actually look at the equivalent of an ingredients list for the data that's being shared, and confirm that these little subsystems aren't capable of sharing the camera feed, for example. You can just get information about whether somebody's looking at the device, like the metadata. So that's a really important driver for what we're doing too.
Coming back to the use case piece, one of the things I read about AI in a Box was you can connect a device to it which can take keyboard input. So as you're speaking, it can become more of a transcriber or note-taker, where you can have a document with all your notes. That seems super cool. And one of the things I was chatting with my wife about last night, she didn't know about AI in a Box at the time, and she's like, one of the amazing use cases for AI could be where you have this device at home and you can talk to all your appliances. You could say, turn on the stove, or turn on the microwave for 30 seconds, without having to go and press the button. So do you see this use case going in that direction, where it becomes the control box to connect to various things?
Yeah, and actually one of the things we're trying to do is
make it cheap enough and small enough that each individual appliance can have its own voice
you know, so instead of having a centralized box like we do with AI in a Box, you know, my dream
is to get it down to 50 cents for speech to text.
And then we can actually have, you know, the microwave can know when you're looking at it and it can just listen out for that voice.
And the other thing that would be really nice there is it should work as soon as you plug it in.
Like most of these connected appliances never actually get connected to Wi-Fi because it's such a pain.
So having something that just works out of the box
is a really important value for us too.
One thing that surprised me when I first saw it is just how small it is. You hear about the billions of parameters that are associated with LLMs, and then you have to deploy them on servers. So one question is, how are you guys able to shrink it down to that size? And then the second, I guess, is how this compares to Alexa or Siri. The advantage there, obviously, is that you know what's going on under the hood, and you can actually access a lot of the stuff. But is the model also superior, right? In terms of how it's doing inference.
Yeah, I mean, it's, it's hard to compare because they're doing the inference in the cloud.
And I will say...
Right, right, right.
Sorry, that was a dumb question, you're right.
Yeah.
And you know, I don't want to particularly,
because, you know, I worked very closely
with the Google speech team.
I wouldn't be surprised if their models
actually beat us in, you know,
in some of the word error rate scores and things like that
And so I don't want to, you know, set ourselves up for that sort of competition, because there have been so many smart people on all of those teams. You know, they've had hundreds of engineers working for years, and this is something that has been put together by a small startup. But what I will say is, in practice,
it's actually worked really well. And a lot of that is down to OpenAI's new approach with Whisper,
where instead of taking a lot of labeled speech data, which is very expensive and time consuming
to produce, they've actually just taken sort of semi-structured data off the web, but lots
and lots of it to produce their transcription engine.
So that's just to give you an idea. But we also believe that this is by far the best quality solution
that's outside of those, you know, Google, Amazon, and Apple.
And how are you guys able to deploy the model on such a small chip?
So a lot of it comes down to, you know, there's no single magic bullet.
And we are building on top of a lot of other open source teams work.
But we have, for example, the useful transformers framework that lets us use the NPU that's present on this Rockchip board,
on this Rockchip SoC.
Sorry, the NPU is like a neural processing unit of sorts?
Yeah, sorry, I should have spelled that out.
And so that lets us run twice as fast as any of the other solutions
we've seen on this device and really helps us get that latency down,
which is absolutely key.
And so it really is just a lot of stuff around quantization,
around acceleration, around all these different ways
to kind of deploy and squeeze stuff in.
And on the large language model side,
we talk about billions and billions of parameters,
but if you think about that as eight bits per weight parameter, that's a few gigabytes.
So, you know, in the scale of things, even for single-board computers, they're large language models by machine learning terms, but they're actually well within the capacity of fairly low-end hardware to run.
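The back-of-envelope arithmetic behind that, with an assumed 3-billion-parameter model as an example (not a spec of the model on the box):

```python
# Weight memory = parameter count * bits per parameter / 8 bits per byte.
params_billion = 3   # assumed example size for a "small" LLM
for bits in (16, 8, 4):
    gigabytes = params_billion * 1e9 * bits / 8 / 1e9
    print(f"{params_billion}B params at {bits}-bit -> ~{gigabytes:.1f} GB of weights")
```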
Can you speak a little bit on quantization? I know that's a topic that you're quite passionate about.
Yeah, definitely. I've been all in on quantization. I was actually doing a retrospective for ICCV on quantization, going back through the history, and I've been working on it, I think, since 2013, initially 8-bit. But there's been so much work now around 4-bit and 2-bit that, you know, the whole field has exploded.
And a big part of that is because a lot of these large language models are memory-bound.
So this is something new.
Convolution networks are usually compute-bound.
So you really just have to throw more arithmetic units
at the problem to speed them up.
But because large language models are essentially just doing these big fully connected layers through transformers, and they don't really have large batches, it ends up that you can usually estimate the speed of a model by looking at the DRAM transfer speed, and that's the main limiting factor.
So that means quantization actually helps speed things up because you reduce the memory traffic
and you instead do more with the unpacking compute logic.
So it's been really interesting seeing how it's gone from being something that's just about meeting these kind of constraints in terms of how much memory and storage you have to actually being a latency and throughput thing.
And you can also get a lot more elaborate with the quantization schemes.
So they aren't just linear encodings; they can use lookup tables or complex functions. So yeah, unfortunately I've not been as hands-on as I would like over the last year or two, but it's been fantastic seeing everything that's happened in the field.
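A worked version of the "speed is DRAM bandwidth" estimate; both the model size and the bandwidth figure below are illustrative assumptions, not measurements of the board:

```python
# If decoding one token means streaming the weights from DRAM once,
# tokens/second is roughly bandwidth / bytes_of_weights.
params = 3e9             # assumed ~3B-parameter model
dram_bandwidth = 8e9     # assumed ~8 GB/s for a low-end SoC

for bits in (16, 8, 4):
    weight_bytes = params * bits / 8
    print(f"{bits}-bit weights: ~{dram_bandwidth / weight_bytes:.1f} tokens/sec")
```

Halving the bits roughly doubles the token rate, which is the sense in which quantization has become a latency and throughput tool rather than just a storage one.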
I remember for convolution, so for vision-task networks, a lot of times when you quantize, you have to retrain the model, right? Because it actually results in different things. Does that still hold for language models as well?
Actually, I've seen a lot of work
where people have been able to go down to four bits,
for example, without doing retraining
and sometimes even two bits.
There is some accuracy loss,
but it's not unworkable.
And what I've seen is the accuracy improves when you are doing quantization aware training.
But the fact that you can quantize these models sometimes without having to do that retraining,
it kind of makes me think that there's a lot of redundancy.
You know, like they're over-parameterized, or however you want to call it. There's sort of room to represent the same information with fewer parameters, because being able to compress it well is usually a sign that there's a lot of redundancy there. So anyway, that's just a gut feeling.
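As one concrete instance of quantizing after the fact, here is a minimal post-training dynamic quantization sketch in PyTorch; the toy two-layer model is a stand-in for a transformer block's fully connected layers, not anything shipped on the box:

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer block's fully connected layers (assumed sizes).
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).eval()

# Post-training dynamic quantization: weights stored as int8, no retraining or
# calibration data needed. Quantization-aware training would instead simulate
# int8 arithmetic during fine-tuning to claw back accuracy.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 1024)
print(quantized(x).shape)  # same interface, roughly 4x smaller Linear weights
```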
So, the last part, in terms of deployment: when new models come out, how do you see the, what is it, throwaway models? Sorry, oh, disposable, sorry. Yeah, so are you able to just grab these and deploy them on the box as things advance?
Yes, but what I see happening a lot more is with traditional ML frameworks
like TensorFlow and PyTorch,
you expect to be able to just keep the framework the same
and just bring in new models as data files,
essentially representing the architecture and the weights.
If you look at things like GGML and what we're doing with Useful Transformers, there's a lot more work involved in supporting a new model from the framework side. Like, they don't claim to be ready for you to bring an arbitrary model and have it just run. So there's actually some coding work involved to bring in new models. And the reason I think that's worth it is that models are actually changing much more slowly. Or another way of looking at it is
a lot of different tasks are now able to be solved
using one of a handful of different models.
So putting in extra effort to support models in code is a worthwhile investment, especially on the inference side.
How do you weigh that trade-off? In this case, if you have a specific
framework that works with a set of models or a single model, you can have a new use case really
quickly deployed. But then at a company who has a bunch of these, one of the questions that is often asked is
if every use case uses a different framework, then the maintenance cost, the cost to change
them, update them, any instrumentation that you need to add to those frameworks becomes
increasingly more expensive as you have N plus one. So how do you see that being traded off with
these disposable frameworks?
Yeah, and that's a really good question. And it's something that haunted us at Google and is the reason that TensorFlow was the sort of rallying point around which we tried to get everybody to use the same framework for those maintenance reasons.
I think that we might have to think about reusability
at a different level.
So if you imagine instead of the machine learning
sort of architecture and weights and things
being the bit you share,
maybe we can actually have underlying
matrix multiply implementations,
you know, the old-school BLAS GEMM function for matrix multiply,
maybe that's the point at which we expect frameworks
to share common implementations.
And then everything else can be done in a much more idiosyncratic way for different models. Or maybe it's something like the cuDNN library interface, and that's the common layer that all of these libraries share. So I think we just have to be more imaginative
and more flexible on the infrastructure side to support this,
because it clearly makes a lot of sense
for application and product people.
If it's less work to write your own framework to run a model than it is to figure out how to use this kind of massive crawling horror of a framework that includes everything, then that's really a sign that we haven't done our jobs right.
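A toy illustration of reuse at the GEMM level: the only shared primitive is a matrix multiply, and everything above it is short, model-specific "disposable" code. This sketches the idea, not the actual design of Useful Transformers or GGML:

```python
import numpy as np

def gemm(a, b):
    """The shared primitive every framework would agree on."""
    return a @ b

def attention(x, wq, wk, wv, wo):
    """Hard-coded single-head attention for one specific model, built only on gemm()."""
    q, k, v = gemm(x, wq), gemm(x, wk), gemm(x, wv)
    scores = gemm(q, k.T) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return gemm(gemm(weights, v), wo)

d = 64
x = np.random.randn(10, d)
wq, wk, wv, wo = (np.random.randn(d, d) for _ in range(4))
print(attention(x, wq, wk, wv, wo).shape)  # (10, 64)
```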
Going back to the deployment piece for a second. So this AI in a Box, it's a complete solution, so like you said, you plug it in, ready to go. Have you figured out, once you have clients who have these devices, if and when you need to, let's say, add a new model or use something else, how that would be deployed to these edge devices?
Yeah, we'll send them new hardware. Now, that's a bit of a glib answer,
but especially the way that we think about this is,
if you think about a temperature sensor
or a pressure sensor or an accelerometer,
you don't expect to be able to do software updates on them.
Right.
They work the way they work out of the box,
and you build a system around their, you know, limitations and capabilities.
And we as software engineers, you know, it makes all of us nervous not being able to update code because it's like, oh, you know, what if we ship a bug?
But for things like speech to text, I think that we come up with, you know, the best solution that we can at a particular time.
People then build a system around that, and it would often actually be pretty disruptive to make changes, because you have to redo the QA, you have to do things like check the security and privacy all over again if somebody has given a certification. And usually these devices also aren't under a subscription model, which is what we're used to on the web and for phones and things like that. You know, if your light switch is able to understand you, you don't want it relying on update servers that may get hacked over its 20- or 30-year lifetime. Imagine being on the hook for maintenance for decades.
Yeah, it's terrifying. So for a lot of the companies we work with, having something that just doesn't have that liability of updatability is really important for them. Like, they want something that works. And then, you know, if something really new and big comes along, that's a big change, they want people to buy next year's model.
So another follow-up on customizability.
So if, say, someone has a private corpus of text that they want the model to be able to reference, say cooking recipes that they really like, how would that work in this context?
Yeah, we actually have one of our interns, who's based at UT Austin doing his PhD, working on how to create, we often call them tiny large language models, that are actually able to be trained on very specialized knowledge.
So, you know, like I was talking about with appliances
and having user manuals built in to the appliance that you can talk to,
you know, there you'd feed in the user manual, or maybe the customer support guide, as the source. And in your case, you might feed in all these recipes as the source, and then have the large language model able to answer questions about it. I'll share some links after the podcast to Evan's work, because I think that
there's a big opportunity to have something, you know, maybe it's only hundreds of millions of
parameters that encode this kind of specialized knowledge. You know, another thing I'm really
interested in is, you know, if you think about home improvement stores, I would love to put one
of these boxes on every pillar so you can walk up to it and press a button and ask, okay, where are the nails?
Oh, yes, please.
And in that case, you really want to be able to just feed in a CSV file, a spreadsheet of all of the parts and where they exist.
So, yeah, I see a massive amount of potential in being able to set up large language models with this kind of specialized domain knowledge.
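The simplest no-training version of this is just to put the document into the prompt of whatever local model is running; a minimal sketch (an illustration only, not Evan's research, and `generate` is a hypothetical stand-in for the box's LLM runtime):

```python
def answer_from_manual(manual_text: str, question: str, generate) -> str:
    """Answer a question using only the supplied manual, via prompt stuffing."""
    prompt = (
        "You are the built-in help system for a home appliance.\n"
        "Answer using only the manual below.\n\n"
        f"MANUAL:\n{manual_text}\n\n"
        f"QUESTION: {question}\nANSWER:"
    )
    return generate(prompt)  # hypothetical callable wrapping the local LLM
```

The trade-off is context length: a full user manual or parts spreadsheet may not fit in a small model's window, which is where the fine-tuned "tiny large language model" approach comes in.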
And is it possible to do it in such a way that it doesn't involve retraining, where you'd need a lot of compute in order to do it?
We do think that there are ways to do this with fine-tuning.
I think that, you know, how much training is involved is still a research question.
But we don't expect you to have to have, like, OpenAI's sort of GPU cluster in order to be able to do this.
That's our hope.
I see.
And have you guys explored, so I read earlier about where you can do kind of a multi-shot learning, which seems a bit janky, but you're literally just like, hey, step one is taking in all this corpus, and then step two is, I'm going to ask you questions about it. Have you guys tested that out? Is that usable?
You know, I haven't actually tried that yet.
That's kind of similar in idea, I think, to some of Evan's work.
Ah, cool.
Yeah, so, you know, when I share that, hopefully that will help.
I'll check out the link.
Honestly, I'm CEO now, so I'm kind of out of my depth on a lot of the technical questions.
Just nodding and smiling in meetings.
So speaking of that, a nice pivot.
So after seven years of
building TensorFlow Mobile at Google, you founded
the company Useful Sensors at the
start of last year. What made you
decide to do that?
It's really hard to put out hardware at Google, because there's such a reputational risk. You know, if you think about something like Google Glass, there was such a backlash to that. At any big company, it's a lot easier to open source software and models, or even to spin up something as a new app or new website, but bringing out something that's actually under the Google umbrella, that's a piece of hardware, requires a lot more approvals because of the potential reputational impact. So, you know, I really wanted to build
things like the AI box and these little sensors and things like that as a way of getting out
these ML capabilities into the hands of people who would not necessarily want to particularly
learn ML, but would want to benefit from some of the things that it can do. So really the only way to do that was to do it as a startup.
So you were a CTO before you joined Google.
You were a tech lead at Google, and now you're transitioning to a CEO role.
And as you were saying, you're meeting with clients,
you're thinking more about the business than being more hands-on.
How has that transition been for you?
Oh, really hard.
Say more, say more.
Yeah.
Well, my, you know, when things get tricky, my default is to start trying to code up a
solution.
And that's not the right thing to do, especially, for example, in business development. We recently got our first full-time sales employee, Terry, and he's only been with us for two months, but I've already learned so much about working with business customers and B2B from being with him.
And he's made a massive difference to the whole company.
And, you know, looking back now, I see that for a lot of my first 12 months, I was doing meetings with companies and there was interest, but I was not driving things to actually get contracts closed, get stuff over the line, really do all of the specialized work that's involved in business development and sales to make this stuff happen. So, yeah.
And, you know, hats off to one of our VCs, Mike Dauber at Amplify, who has been super supportive, but was also extremely persuasive around, you've got to get somebody on the sales side, Pete.
And I was, you know, I honestly was like, you know, I was listening, but
you know, it wasn't at the level
of urgency that it should have been.
And he really managed to get me
to do that. And it's
been a revelation.
I think fundraising is another part that would have been new compared to being at a company. There again, you're asking for funding, but with projects like TensorFlow, it's already funded, so the conversation you're having is very different. So what is that like?
I mean, one nice thing was that, you know,
I spent probably a year in like 2010
going around every single VC in the Valley
and getting told no.
So I actually had, you know, been through that before we got the funding that we needed for Jetpac, my first startup, in the end. And so I actually have spent
a bunch of time on the fundraising before. And luckily, I've actually got to know a bunch of
these over the years. And most of the VCs that we're working with are people that I've known for more than a decade. So that has been really fantastic. It's been wonderful working with them. And, you know, we are in a weird area because we're AI, which is, you know,
has a lot of momentum behind it,
but also doing hardware,
which nobody in the Valley
pretty much is familiar with.
And, you know, it's very rare
to have startups that are doing hardware.
So we're in this really interesting position
from the fundraising side.
Well, absolutely. I think, as you mentioned, on the hardware side, doing hardware is one of the
hardest things. Software scales much better. And you see this in venture funding, you see this in
just the number of startups trying to do this. And all of the things like labor is expensive,
A-B testing is hard, so many challenges with hardware, a number of things that can go wrong. So how are you thinking about just being able to monetize the products that you build over
time? And any advice for other engineers and some of our listeners who might be thinking about this
on how they can think about it if they want to do something in the hardware space?
I mean, I guess the first advice would be don't.
I saw that coming. I saw that coming.
Yes, I thought you might.
I mean, it is. And hats off to, you know, I'm coming at this as fundamentally a software
engineer, and I'm learning about hardware; I can barely solder. And, you know, we've had a fantastic team who've made it work, but just the time scales involved with hardware are mind-boggling if you're coming from the software world. You know, sort of six to twelve months just to put the initial version together, waiting for months on factories, trying to get all of the components together. It's an entirely new world of problems that you're exposing yourself to. So if at all possible, try and find some way to at least mock up your stuff on the web. Like, we've actually been emulating what our devices do as web demos so that we can at least share them with prospective clients
a lot more easily.
Advice one, don't.
That's an important one.
I know we are running up against time, so I want to capture a few questions before we go.
On your blog, by the way, you're a prolific writer.
And you mentioned on your blog, you think through writing.
I think it's incredible.
I want to talk a whole lot about that, but maybe another time.
In one of your posts, you captured something someone told you:
training costs scale with the number of researchers,
inference costs scale with the number of users.
And you made a prediction there.
Can you tell us more about this?
Yeah, so it's this idea that if machine learning is successful, even though training an individual
model might involve hundreds of GPUs for weeks, that model, once it's in production, might be reaching hundreds of millions of users for years. Even though the individual inference cost, for each call to the model, is a lot smaller than training on an epoch of data, the number of users and the amount of time that it's going to be used for means that
the amount of compute that's going to be spent on inference is going to be much, much larger than
the amount of compute that you spend on training. And a whole bunch of interesting kind of things
flow from that. I guess one of your predictions there was NVIDIA might not stay as is for much longer considering
how just the ecosystem is evolving as we have these more use cases that come in for inference
Yeah, and honestly, that was me being a bit spicy. There was something in my cereal that morning. But it was this idea that as researchers, we really need convenience to be able to experiment with different models. And you also need the absolute lowest latency and highest throughput you can get, because the limitation almost all the time is how long the model takes to train, which is where NVIDIA wipes the floor with everybody. So for training, they are going to be the kings for the foreseeable future. But for inference, because the total costs for inference are going to be so much higher, that's where this idea of disposable ML frameworks, and doing custom coding to support that use case, becomes a lot more sensible. It's worth investing a whole bunch of engineer time to take a model. For example, I've heard something like ChatGPT takes four cents per call to return a result.
So if you imagined like, you know, hundreds of millions of API calls, you know, a month
or something like that, suddenly it becomes totally sensible to take a bunch of engineers
and just say, hey, take this particular model, even if we only use it for a
year or two years, the amount of money we're going to be spending on inference is so high.
It's totally worth a bunch of highly paid engineers, optimizing and profiling and doing
special-purpose stuff to speed things up. You know, it will pay off very quickly when you're spending that much. And part of that might be moving over to something that's not a GPU to execute this stuff.
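Putting rough numbers on that, using the hearsay four-cents-per-call figure and an assumed call volume:

```python
cost_per_call = 0.04            # the ~4 cents/call figure quoted above (hearsay)
calls_per_month = 100_000_000   # "hundreds of millions of API calls a month" (assumed)
months = 12
print(f"~${cost_per_call * calls_per_month * months:,.0f} per year on inference")  # ~$48,000,000
# Next to that, a team of engineers hand-optimizing one model pays for itself quickly.
```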
And I think cheaper inference also results in more use cases like AI in a Box, which enables a whole lot of economies of scale.
Exactly.
Pete, this has been awesome.
Thank you so much for joining us.
Before we let you go, is there anything else you would like to share with our listeners?
No, thank you so much for having me on.
This has been fantastic.
Awesome.
Thank you so much for joining.
And I think you have a crowdfunding campaign for AI in a Box.
We'll make sure we link that in the show notes
and we highly encourage our listeners to go check it out.
And thanks so much again, Pete.
Okay, thanks so much for the chat.
Thanks, Guang.
Hey, thank you so much for listening to the show.
You can subscribe wherever you get your podcasts
and learn more about us at softwaremisadventures.com.
You can also write to us at hello@softwaremisadventures.com. We would love to hear from you. Until next time,
take care.