Embedded - 390: Irresponsible At the Time
Episode Date: October 21, 2021
Tyler Hoffman joined us to discuss the issues associated with embedded devices at consumer scale. We talked about firmware update, device management, and remote diagnostics for millions of devices. Tyler is a co-founder at Memfault (memfault.com), a company that works on IoT dashboards and embedded tools. (We will invite Tyler back to talk about embedded tools but someone was preparing a lecture on firmware update and device management.) Tyler writes for Memfault’s Interrupt blog, which has excellent advice including the mentioned article about Defensive Programming. You can also find him and Memfault on Twitter: @ty_hoff, @Memfault. Elecia is teaching Making Embedded Systems at ClasspertX, a high-quality MOOC with video lectures, quizzes, exercises, synchronous discussion classes, and a portfolio-worthy final project. The alpha cohort starts in early November and the course will run again in Q1 2022.
Transcript
Welcome to Embedded. I'm Elecia White, alongside Christopher White. Today we'll be discussing
why I hate the term Internet of Things. Wait, no, we'll be discussing the management of
distributed systems with Memfault's Tyler Hoffman.
Hey, Tyler. Welcome.
Hello.
Could you tell us about yourself?
Yeah, for sure.
Yeah, I'm Tyler Hoffman.
I am generally an embedded firmware engineer, but now, apparently like Chris, I'm mostly writing Python and building Memfault's backend services,
data infrastructure to manage our device management platform and diagnostics tools.
Before then, I was a firmware engineer at Pebble and Fitbit, where I constantly found myself
doing more developer tools and infrastructure than writing firmware.
All right. We will have questions about, well, writing firmware and managing it and all of that.
But first we're going to do a topic.
Narrow topic.
Lightning round, where we ask you short questions and we want short answers. Are you ready?
I am.
Okay, easy one. Favorite fictional robot?
Wall-E.
IoT, edge devices, or distributed systems?
Ooh, IoT, edge devices.
Who had a better smartwatch, Fitbit, Pebble, or Apple?
Pebble.
It's an easy one.
Preferred code editing tool?
Now, PyCharm.
It's great.
CMake, make, or something else?
CMake, but I don't know it super well.
Open source software, yes or no?
Yes.
Complete one project or start a dozen?
Finish two, 80% of the way.
If you were teaching a course about embedded systems, what three topics should you definitely cover?
Unit testing, debugging, and build systems.
Okay, I have a late-breaking question for you.
Have you ever ridden the Boilermaker Express?
I never did, actually. I hopped aboard it when it was stationary, but never while it was moving.
Follow-up: where did the name Boilermaker come from?
I mean, I'm going to guess here. I mean, I do know it was from the men who worked on trains and railroads.
I'm asking real-time questions that are coming into me from a fellow Purdue alum. So, that's if anybody's wondering what the heck is going on, that's what those questions are for.
I went to Purdue for undergrad for the listeners.
Okay, we're going to go back to the course thing because that was kind of important.
I'm sorry, I forgot that you're trying to get everybody to do your homework for you.
That wasn't it. Okay, so listeners, sorry, Tyler, it's just going to take a second. Listeners, I am teaching
a course for a company called Classpert. That's like "class" and "expert" had a little word together
and called it Classpert. I like it. And
it's about embedded systems. It goes through my book. It has a whole bunch of extra stuff.
I'm doing videos. I'm doing all kinds of lectures. There'll be mentors and real-time discussions.
Projects.
Projects. And I'll put a link in the show notes, but I hope you check it out. The
first class is going to be kind of small because, let's face it,
I haven't done this before, but the Classpert folks seem to really have their act together
and, let's face it, my logo for them is awesome.
Okay, sorry, Tyler, back to you.
Debugging, unit testing, and what was the other one?
Build systems.
Well, all right. I think that's where we're going to head for the whole show.
Recently on Twitter, I asked about IoT management for non-cell phone devices like BLE or ZigBee with a backhaul cell phone or coordinator, non-Linux Ethernet
devices.
And I wanted to know what platforms people use and what they like and what you'd suggest
for a new small company entering the IoT space.
Do you have an answer to that?
I think we all have somewhat strong opinions to that.
I didn't get any response.
I mean, on Twitter, I was so surprised.
But yes, I have strong opinions, mostly in the, oh, God, get me out of here opinion.
But you actually are in that space.
We're in that space.
And my guess as to why people did not respond to you would be because no one has a very strong or confident or
probably even right answer to that question, because I feel like a lot of them are,
you know, mediocre at best, a lot of these systems. In terms of what platforms we've seen
people use. So yeah, so we work with a lot of customers at Memfault. We talked to a lot of
engineers. I have never talked to more embedded systems engineers in my entire life than I have over the last two or three years.
Zephyr, Mynewt, FreeRTOS, and the Espressif IDF are the ones that come up most commonly in the customers that we talk to.
But those are the devices.
I was looking for what happens after you get to 10 sitting in your
offices or in your closet somewhere and not necessarily a very small embedded device.
And I know that's what you're looking for. Yes. Non-Linux Ethernet devices.
Exactly. Right. And so AWS has one, right? You can use FreeRTOS with AWS IoT, and Microsoft bought ThreadX, the RTOS there.
And Espressif has their own cloud backend that they want to use as well, or they want
people to use.
I wouldn't say all of them are good.
And they weren't written to be usable, especially by engineers or people who
don't know exactly how to use these systems to begin with.
Why is this such a hard problem? Is
it because you're taking a step beyond just firmware to now having an understanding of
networking and software as a service kinds of things? Do you have to make that kind of a jump in expertise, or is it
that nobody has made a real kind of turnkey, "okay, this is easy, we will do everything for you" kind
of solution?
I mean, the ones writing the firmware are very much not the people writing the backhaul
services, and I don't know if they talk to each other often enough. And that's, you know,
I'm sure both of you working at previous companies doing embedded systems, that's probably true,
is the firmware engineers very rarely talk to the cloud engineers, I believe.
I know that was true for the last two companies that I worked for.
I think it's worse at some of the big companies. Azure and Amazon both have IoT offerings that really do seem to be written for software engineers working on computers, not written for firmware engineers trying to squeak out one last byte of RAM.
Exactly right. And especially trying to, like, do SSL connections and HTTPS with, like, you know,
64K, or for some of our customers, like 32K of RAM. Like, it's just not happening.
Yeah, I think you're
right that that's a huge piece of it: some of the things you must conform to don't really fit.
Yeah, and they were never intended to fit. And the other thing that's also tough
with a lot of those platforms that exist today
is they assume that these devices have infinite power,
pretty much infinite resources,
and they have a constant and stable internet connection
to these systems.
And that's very rarely the case
unless you are literally a computer in a closet, you know, running Linux.
I've worked on two big distributed systems such that I've had to get involved with both the software and the hardware.
And one was ShotSpotter, where we had dozens and dozens of sensors in each covered city, and we had dozens and dozens of covered cities.
And every day we wanted to know, well, was there a sensor that didn't have its heartbeat, that didn't check in, which meant its radio or power was down?
Was there a sensor that had a fault or didn't hear anything and therefore probably had something wrong.
And I mean, once you get up to like a thousand sensors, it becomes hard.
And we did it with Visual Basic, querying SQL tables in Excel and color coding.
In Excel too?
In Excel.
In your defense, AWS didn't exist.
None of this stuff existed back then.
That's true.
I mean, it was 2007, 8, 9-ish.
That's kind of still what they want you to do though.
They're going to put all your data,
they're just going to export your data to a CSV file in S3
and they're going to tell you to do it yourself.
That's all they're going to provide.
But one of the other problems with that, I mean, the communication was part of it, but for some of the devices, we were on a cell modem.
And so every byte you sent back actually cost money.
And so we didn't want to do a heartbeat every minute because that actually adds up to a lot of money every day.
But no back end, no... what is this called, device management, IoT management, takes into account the need for small data updates.
And so, yeah, can I pitch Memfault really quickly, or just, like, say what we're attacking? Right. It's like, we are... so yes. What we did at Pebble and
Fitbit: you know, at Pebble we built our own. It was very simple. Our devices connected through a phone
and every so often reported back through that phone to a
very scalable Python application written on Heroku. Honestly, that's how we got most of our data back.
At Fitbit, massive systems. I'm sure both of you have some history on that and how those are built.
But yeah, very complex systems, but completely homegrown. And why we wanted to build Memfault was because we kept seeing this problem over and
over again. We're like, no matter what company we went to, we were going to have to build this system
or, you know, shoehorn one of these larger systems into a hardware product, an embedded system,
again. And me, Chris, and François were just like, we can't do that again. Like we,
we don't want to solve this problem for the third or the fourth time. And so that's Memfault. And
that is, it's like, we're getting in, I would say more so device management. I think everyone
defines it differently, which I guess is also part of this conversation. I see it as kind
of three or more things. It's, like, provisioning: it's giving the device, you know,
some sort of certificates or device serial that you, you know, put on it on the factory assembly line.
It is knowing whether that device is alive and how well it's doing. And then it's also
pushing new updates to those devices. I think those are the three things for,
for device management. Memfault does very well, in my opinion, the OTA delivery and the
monitoring and diagnostics. We do not have yet, maybe, any sort of provisioning services,
security keys. We're not doing those things yet, which I think is the one thing that AWS IoT
maybe does well, but it's also very confusing.
How do you do over-the-air updates if you don't have security keys programmed in manufacturing?
For our customers, we assume they are going to do that themselves.
So we are basically saying, bring your own system.
We don't, you know... I think other companies are attacking it in the way that, like,
you need to use our platform.
You need to use our chips that we
provide you, you know, they're like ten dollars apiece. And please use our chips, please use our
back end, and you can build your product on top of it, you know. We're just saying, not very scalable.
I mean, it's just not very scalable. But, like, I think a lot of these companies are building it for very large and expensive devices, right?
Like if you're building a tractor or if you're building a big machine on an assembly line, like you don't care about the cost at that time.
But if you're building a wearable device that costs a hundred bucks, you need something that works for that company and, you know,
for that business model. And there's not much there.
One of the problems with supporting provisioning and manufacturing
that I've seen some vendors try to help with ends up with them having the keys.
And that's always been a non-starter for me.
Vendor lock-in.
Yes.
Exactly.
I mean, in the end, if I'm protecting the customer data or protecting my device through secure over-the-air downloads,
I don't really want anyone else to have that information.
Correct. Yes.
And so, yeah, there's no good solution, but I will,
yeah, the comment, like, have you heard of providers that don't give you the private keys if you give them on a device, like the provisioning?
I think so, because sometimes
there are, I don't want to call out anybody, but there are some companies that provide a whole solution from network to dashboard.
And you write a little bit of code for their widget, and you don't really get to know anything else about it.
Got it.
And so you're basically writing software for this thing that exists in the environment
that you're placing it in.
Yeah. And sometimes, I mean, like, over-the-air updates happen kind of
magically, which is terrifying, because you don't really want over-the-air updates to happen
like you want continuous integration with software.
Correct. And especially when it comes to hardware,
because the inevitable and the worst case is you're going to brick units or have issues in the field that you can't possibly handle or want to deal with, basically. They see, like we said, they have that software perspective. It's like, well, how can we make the device a software thing?
How can we make the device just part of the cloud?
And if you write software for it, we own everything that's involved with it, basically.
So it's a difficult balance.
And when you say provisioning, you mean the security piece, not the provisioning that the customer has to do when they get it home and have to connect it.
Correct. Yes.
I mean the certificates and stuff: flashing the device with, like, this is your device serial.
This is your Mac address.
This is your Bluetooth ID.
And this is your security token that like is how you will communicate to anything, but not necessarily like customer onboarding and let's install your first OTA payload and everything.
Is there a different word for what the customers do?
Honestly, I would call it onboarding.
Honestly.
Okay.
I think it's what I've always used.
Yeah. So going back very briefly to the larger systems,
what they try to do and what they're focusing on
is like secure transport.
And in my opinion, a lot of it for OTA updates specifically
is as long as you have secure boot, you're fine.
As long as the payload is signed and you install it,
which I think most of the bootloaders today and the
embedded system platforms that you can use, you're generally going to be fine.
What do I need to know as a firmware engineer about OTA when I'm thinking about these large
distributed systems? Signing and hashing are important, where hashing is the checksum,
but with security, and signing says it really did come from the person it says it came from.
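The distinction between a checksum and a signature can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the key, the image contents, and the function names are hypothetical, and an HMAC (a symmetric keyed tag) stands in for a real signature. A production secure-boot scheme would normally use an asymmetric signature so the device holds only a public key.

```python
import hashlib
import hmac
import zlib

# Hypothetical key for illustration only. HMAC is a symmetric stand-in for a
# real asymmetric signature, where the device would hold only a public key.
SIGNING_KEY = b"example-key-not-for-production"


def crc_of(image: bytes) -> int:
    """Checksum: detects accidental corruption, but anyone can compute it."""
    return zlib.crc32(image)


def sign(image: bytes, key: bytes = SIGNING_KEY) -> bytes:
    """Keyed tag over the image hash: only a key holder can produce it."""
    digest = hashlib.sha256(image).digest()
    return hmac.new(key, digest, hashlib.sha256).digest()


def verify(image: bytes, tag: bytes, key: bytes = SIGNING_KEY) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(image, key), tag)


official = b"official firmware image"
modified = b"third-party firmware with extra fonts"

official_tag = sign(official)

# A third party can always attach a matching CRC to a modified image...
modified_crc = crc_of(modified)
# ...but cannot produce a tag that verifies without the real key.
forged_tag = sign(modified, key=b"attacker-guess")
```

With a CRC-only check the modified image installs as long as its CRC matches itself; with the keyed check, `verify(modified, forged_tag)` is false.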
Yeah.
It's funny.
It's funny?
It has to be 100%.
Sorry, I'm only saying funny because at Pebble, we actually didn't have secure boot.
We didn't have signed payloads.
It was more of a hacker device. And so we just assumed the device connected to the mobile app
and everything was fine.
But I'm thinking back now it's funny because there was a group of people.
They were called Pebble Bits.
And they would modify the Pebble firmware in whatever way they wanted
where they added new fonts,
they built internationalization for, but they basically just like modified our firmware in
very different ways, adding like really cool features. But then you would just click the link
in the mobile app and it would just like automatically push that firmware to the
Pebble, which I thought was fantastic. But like you could install whatever you wanted
on that Pebble device, as long as the CRC matched, which is great when you have a hacker device, and it's great when
you have it on your desk as a developer, but it is not great when, like, the President of the United
States is wearing your smartwatch. At that point, you want a little more security, what do you say? And let's just hope
that every hardware company is making sure that, you know, they are using those
secure practices. All we can do is write on Interrupt that you should do it.
Interrupt is your blog, right?
Yeah. Memfault's founders, the three of us, kind of just were like, we need to write
some content because it doesn't exist. So let's do it ourselves.
And it's a good blog and I have pointed to it and been pointed to it various times. And yet I was
totally unaware of the connection to Memfault or what Memfault did.
Have you considered maybe just a
little more promotion?
There is this... I mean, yeah. So our marketing employee, Colleen,
would love that. There is this fine line that we are trying to balance between aggressive self-promotion and also trying to
build this community on the side of the company that we don't ultimately control. I've seen it
time and time again. And the reason I don't like a lot of the embedded systems communities is the
ones that you find are, like, almost always owned by a company or enterprise.
And, like, the largest and arguably, you know, best LinkedIn group that I found for embedded systems
is, like, blatantly owned by a consultant, like an embedded systems consultancy, and it's just
awful, and they've actually ruined it now. And so we wanted to just not do that. But yes, we should
do a little bit more self-promotion. And now that we, you know, have a very good product that we all
believe in, and we do think almost any hardware company that's building on embedded systems, and
now Android, and soon embedded Linux, like, all of them would benefit from it. And so now we're not
super opposed to it. We actually
just had a meeting last week about how we're going to get some more people to understand
what Memfault is, who are reading Interrupt.
Marketing is really hard. I mean, because there
is that balance between, I did this thing, I think you'll think it's cool that most engineers are hesitant about.
And then there's this, you know what I need?
I need this thing.
And not realizing that somebody else has already built it and done a good job of it.
I don't know how to do that.
I mean, I have that problem with the podcast that I think I should be marketing more.
I think there should be more out there.
Because I do think it's a good thing. And I think people like it, but I don't really want to market. It's no fun and it's, it feels wrong.
Yeah. It always feels like you're advertising to people that don't want to listen. And I mean,
what we've learned a lot is like people people actually want to hear about Memfault
and read more content. And yeah, to tie it back. So what we want Interrupt to become ultimately
is a community of developers that they feel like, you know, it's at least helped by Memfault.
We may provide resources to the community. Eventually it could come into like a more fleshed out website.
That's more of a hub that you kind of hop into
and learn more about embedded systems,
maybe a conference in the future.
But we don't want to be the company that owns it.
If somebody else wants to come in and help us out, great.
And we will provide resources to
it, but that's kind of it. And yeah, like, Embedded.fm should become a community that, you know, if
you two want to stop doing it, it should live on, right?
Yeah. Although if somebody offers us enough
money for the Slack, we will totally sell. But it's going to have to be a lot.
Yeah.
Yeah, $5, $10, maybe $20.
I mean, at least buy you a couple meals in San Francisco, right?
Well, that's going to be more like $100 then.
Yeah, exactly.
Okay, I want to go back to over-the-air programming.
Sorry.
All the way back to that.
Wait, security and hashes for when you get the firmware, or signatures and hashes.
What else, as a firmware engineer, do I need to be thinking about with over-the-air updates?
You mentioned a secure bootloader.
Is that something the vendors are providing now, or is that still something I have to write?
Unless you have specific needs or requirements, generally you're not writing it. I think most
vendors are providing it. They're not great. And so a lot of companies are just using,
you know, if you're using a standard enough chip, a bootloader is probably built for you, whether that's wolfBoot or MCUboot or
Nordic's DFU. And I'm sure Zephyr, you know, they don't necessarily have a bootloader, but they
basically will, like, tell you how to go about doing this well.
And TI has OAD. Why are there
different initials for everybody? This seems like a term we should agree on now.
At least DFU.
I mean, I think most people who I talk to will now use the phrase DFU.
But that's also like the only way that they know how to install firmware too.
So doing the firmware update, over-the-air programming, device
firmware update, over-the-air downloads, whatever it's called, for a few units in your lab is different than deploying to a few million smartwatches.
You don't have to go that far even, but yes.
No, you don't have to go that far.
I mean, that's drastic. Yeah.
But we've done it.
What are the steps?
What are the gotchas?
And what do I need to know as a firmware engineer taking the steps to get from a few devices to consumer production level?
Yes.
So yes, I will answer that in just a moment.
Even before we get there, you asked, what is the requirement? You have to just have an OTA system that works and has a failsafe, or can restore a very, very minimal firmware that knows how to contact, you know, or knows how to
phone home or send out a signal that, you know, some phone passing by will eventually install a
firmware on it. At Pebble, we chose the minimal firmware route. If you booted a firmware and it
failed three times in a row within, I think, the span of 15 minutes, we would boot up into what we call the factory
firmware, which we had tested and hardened for a very long time that you could install a firmware
over Bluetooth and you could factory reset the watch to absolute factory conditions.
And so in my opinion, like that is step one. And that you should build that when you have five devices, or at least you're starting
to build sealed units.
Because if you can JTAG anything, like you're probably not going to care about a reliable
OTA delivery system at that point.
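The fallback rule described above (three failed boots in a row inside roughly 15 minutes sends the device to a hardened factory firmware) can be sketched as follows. This is a minimal illustration, not Pebble's bootloader: the names, the record layout, and the idea of timestamping failures in seconds are all assumptions, and real firmware would keep this state in a retained RAM region or flash rather than a Python object.

```python
from dataclasses import dataclass, field

MAX_FAILED_BOOTS = 3          # consecutive failures before falling back
FAILURE_WINDOW_S = 15 * 60    # failures must all land within this window


@dataclass
class BootRecord:
    """Hypothetical persistent record of recent consecutive boot failures."""
    failed_boot_times: list = field(default_factory=list)


def record_failed_boot(record: BootRecord, now_s: int) -> None:
    """Called when the main firmware crashes before declaring itself healthy."""
    record.failed_boot_times.append(now_s)
    # Only the most recent failures matter for the decision.
    del record.failed_boot_times[:-MAX_FAILED_BOOTS]


def record_healthy_boot(record: BootRecord) -> None:
    """A good boot breaks the 'in a row' streak."""
    record.failed_boot_times.clear()


def choose_image(record: BootRecord, now_s: int) -> str:
    """Pick which image the bootloader should start."""
    recent = [t for t in record.failed_boot_times
              if now_s - t <= FAILURE_WINDOW_S]
    return "factory" if len(recent) >= MAX_FAILED_BOOTS else "main"
```

The design choice worth noting is that a healthy boot clears the streak, so occasional unrelated crashes spread over days never trigger the fallback.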
Getting to millions of devices, that's a whole different ballgame. I think the buzzwords and actually
true words are you need staged rollouts. This is deploying to 10 devices, then 100 devices,
and then 1,000, 10,000, and you scale linearly, basically. And that entire time you are getting data
back from the devices, how are you doing? How's the new firmware behaving?
And are there any new crashes or anything that I should be aware of? That's a whole different
system. We'll talk about that later. But, like, staged rollouts and making sure that you get some form of ping or heartbeat or status after you've installed
a firmware update is probably the most critical thing when you're dealing with the millions of
devices. 'Cause if you update even a thousand, and you just don't hear anything from a device
anymore, like that's when the sirens go off and you press the big red button on the side of your
desk, right?
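A staged rollout gate along these lines could be sketched like this. The stage sizes come from the conversation (10, 100, 1,000, 10,000, each ten times the last); the check-in threshold, the function name, and the returned action strings are hypothetical choices for illustration, not any vendor's API.

```python
# Cohort sizes from the discussion: each stage is ten times the previous one.
STAGES = [10, 100, 1_000, 10_000]

# Hypothetical threshold: fraction of updated devices that must report back.
MIN_CHECKIN_RATIO = 0.99


def decide(stage_index: int, updated: int, checked_in: int,
           new_crash_signatures: int) -> str:
    """Decide what to do with a rollout given what the fleet has reported.

    updated: devices in this stage that installed the new firmware
    checked_in: of those, how many sent a heartbeat afterwards
    new_crash_signatures: crash types not seen on the previous release
    """
    if updated < STAGES[stage_index]:
        return "wait"            # stage is still deploying
    if new_crash_signatures > 0:
        return "halt"            # the big red button
    if checked_in / updated < MIN_CHECKIN_RATIO:
        return "halt"            # devices went silent after updating
    if stage_index + 1 < len(STAGES):
        return "expand"          # move to the next, larger cohort
    return "release_to_all"
```

For example, `decide(1, 100, 97, 0)` halts the rollout because three devices never reported back after updating, which is exactly the silent-device case described above.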
That's when you go to Reddit and see what everybody's complaining about.
Exactly.
You start reading Amazon, you check Reddit, you check Twitter.
Um, yeah, I mean, it's so true.
You, you like hit the nail on the head there.
That's exactly how we felt at Pebble.
Um, as soon as a Reddit thread came up, hey, is version 2.4 broken for anyone else?
We're like, stop everything.
It's weird to get that sort of feedback from customers.
I mean, that's the exact sort of feedback
you desperately don't want.
And yet, if there's an error
that only happens on one out of 100 units,
you're not going to find it in the first couple of stages of rollout,
unless you get lucky.
Or, or unless that person is a very vocal Reddit user for sure. Yes,
absolutely. I mean, it's, it's so, it's so relevant to me as well.
Like I just remembered one of my first tasks at Pebble, and it was so irresponsible at the time.
But I came in, you know, I'm just out of college.
And in my first month, I, you know, did a couple tickets, fixed a couple bugs.
And then they were like, all right, Tyler, like, no one wants to be the release lead for, you know, this version 2.4.
And they were like, you are going to be the one to release this firmware.
And it takes about a month, you know, month and a half.
You're basically working on it full time.
You are triaging every bug that comes in.
You are fixing all the bugs that are easy.
And then you're kind of like making sure that all the other ones that are harder or like,
you know, more specific to engineers, you're making sure that those all get fixed.
You're deploying nightly firmware updates.
And ultimately what it means is you're dealing with one or two people that just break their watches in, like, every which way.
Um, and yeah, that was, that was my first, like, that was my second month at Pebble and it was
super fun. Thankfully, at that point we had logs, we had core dumps, and we had some very minimal metrics coming back from devices. So we had, you know,
battery life and a few heartbeats here and there, so we generally knew how things were going when
we were releasing, even internally. But yeah, like, you have to watch Reddit, honestly. Like, as
soon as something comes up there, it's like, let's pause for a second.
But this release engineer position, or role,
does tend to get passed around because it's not very fun especially if you're the person who has
to make all the versions be right verify do the test ver, make sure the security keys are in the right place, do some documentation for manufacturing, all of these little things.
And then you have to compile the image with the security keys, make sure that you can update the firmware, downgrade the firmware, upgrade the firmware, this whole dance to make sure that it's releasable.
It is a pain, but it's also one of the most important things we have to do.
And it's one of the most, it's the least often thing we do for a lot of us.
And so it's full of mistakes.
Yes.
I mean, how many times have you had to write the checklist?
I've had to modify the checklist and update it plenty of times. I think during my tenure at
Pebble, I think I was the release lead, like, four or five times, and, like, every single release,
something changed or it was out of date. And, you know, I skipped a few steps here and there,
and I only messed up once. I think we deployed, not a bad firmware, but, like, an incorrectly labeled firmware to,
like, a hundred to a thousand people. You know, it got the Git SHA, it got, like, you know, 2.8-abcdefg
instead of 2.8. But that was my one mistake.
That's not so bad.
It's not so bad. No, it
wasn't bad at all. But it was
also just something that, like, somebody will post on Reddit and just be like, hey, what's going
on here? Like, this is not normal. And it just doesn't look great.
Working at LeapFrog on consumer
devices, we didn't have the problem of over-the-air update, but we did have the problem of
releasing to manufacturing with very strict... I mean, they were going to start... They make mask ROMs, and so you can't change them afterwards. And so you have to make sure you get the firmware right. We had to change a version number so that it matched some document, and did it with a hex editor.
Oh gosh.
Because if you recompiled, you had to go through testing again.
But if you just made it match the documentation, it was all fine.
Yeah. Well, sorry, distraction.
And when you rebuild a firmware and you try to ship it to people, you have to go through, like,
depending on how complicated or how sophisticated the company is, like you're either going to say,
all right, well now it needs seven days of soak time or 14 days of soak time.
But what we did at Pebble instead was, like, you track it for a day. And if the battery
life is trending in the right direction, we're like, instead of, you
know, letting every single watch run out of batteries for 14 days and then measuring the
duration that each watch took to die or needed a recharge, we just, like, said, okay,
cool.
Every single device that is out there today running our firmware dropped 7% today.
Okay, great. We're ready to ship the firmware tomorrow. The battery life trends look good rather than waiting 14 days, which I know
many, many other teams basically have a requirement to do that. But if you have to wait that long,
then if there's a really important bug, you have time and you have more people on Reddit
complaining about you.
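The arithmetic behind that shortcut is a linear extrapolation: a device that drops 7% in a day would last about 100 / 7, or roughly 14, days if the trend held. A minimal sketch, assuming the fleet reports one battery-drop figure per device per day; the sample numbers, the function names, and the target are hypothetical, and a real analysis would also have to account for devices that were charged during the day.

```python
from statistics import median


def projected_runtime_days(daily_drop_percent):
    """Extrapolate full-charge runtime from one day of battery drop.

    daily_drop_percent: per-device battery drop over the day, in percent.
    Uses the median so a few outlier devices do not dominate.
    """
    typical_drop = median(daily_drop_percent)
    if typical_drop <= 0:
        raise ValueError("expected a positive daily battery drop")
    return 100.0 / typical_drop


def ready_to_ship(daily_drop_percent, target_days):
    """True if the projected runtime meets the (hypothetical) target."""
    return projected_runtime_days(daily_drop_percent) >= target_days


# Hypothetical one-day sample from devices running the candidate firmware.
fleet_drops = [7.0, 6.5, 7.0, 7.5, 7.0]
print(round(projected_runtime_days(fleet_drops), 1), "days projected")
```

The trade is accuracy for speed: one day of fleet data gives a projection, not a measurement, but it arrives 13 days sooner than a full discharge test.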
You're preaching to the choir.
Yep.
There's this balance of, do I let it go?
And with wearables, with Pebble, with Fitbit, that whole, you did something to make the battery life die.
If you're running on your desk with a unit that has a power supply instead of a battery, you're never going to know that.
Or you're just not connected to the right Android phone from a particular vendor with a particular Bluetooth stack in a particular day of the week. Yes.
And then people complain that their batteries die.
It's, yes.
The number one complaint.
Well, number two complaint.
Number one complaint is probably it doesn't connect or it drops constantly.
Number two is battery life drops or, you know, is terrible.
Okay.
You mentioned monitoring the battery life.
And we've both mentioned heartbeats.
Once I get my firmware out there, what else do I need to know?
And this is where Memfault comes in. Like, this is our bread and butter. It's like, once you get the firmware out, what are you tracking
and what are you making sure looks good? And so, number one, make sure your devices are alive and
reporting anything. Number two, make sure your devices aren't rebooting. I think the simplest thing
you can track for firmware is count the number of times or at least send an event or find some
way to report whether your devices are crashing or resetting or hitting an assert. And then
ideally reporting some piece of information
about how it's asserting or crashing.
And that's usually the program counter
or the link register.
Or if you have a more complex firmware,
you can usually pull the function
or a backtrace basically.
And so at least get those two things.
So you can kind of tell whether this firmware is more crashy
or not than the other ones. Beyond that, now you're kind of searching for trends, like battery life,
you know. And you said heartbeat. That's actually the phrase that I use and Memfault uses for events that happen periodically.
We can talk about this as well.
It's like, how often do you send these periodic heartbeats?
At Pebble, we did it every hour.
And so for every single hour, we would track how much did the battery life drop?
How many ticks or seconds was the CPU active? How many seconds was
the Bluetooth chip on? How many disconnects were there on Bluetooth? How much time was I connected
on Bluetooth for this hour? And before we shipped any firmware at Pebble, you had to at least meet or exceed those certain trends.
So your battery life had to drop less than the previous one or be within acceptable limits.
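That release gate, comparing hourly heartbeat metrics from a candidate firmware against the previous release, might look like the sketch below. The metric names mirror the ones listed in the conversation, but the data structure, the 5% tolerance, and the function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HourlyHeartbeat:
    """One hour of device metrics (field names are illustrative)."""
    battery_drop_pct: float
    cpu_active_s: int
    bt_connected_s: int
    bt_disconnects: int


# Metrics where a larger value is worse, and where a smaller value is worse.
HIGHER_IS_WORSE = ("battery_drop_pct", "cpu_active_s", "bt_disconnects")
LOWER_IS_WORSE = ("bt_connected_s",)


def average(heartbeats, name):
    return sum(getattr(h, name) for h in heartbeats) / len(heartbeats)


def regressions(baseline, candidate, tolerance=0.05):
    """Return the metrics where the candidate is worse than the baseline
    by more than the tolerance (a hypothetical 5% acceptable limit)."""
    issues = []
    for name in HIGHER_IS_WORSE:
        if average(candidate, name) > average(baseline, name) * (1 + tolerance):
            issues.append(name)
    for name in LOWER_IS_WORSE:
        if average(candidate, name) < average(baseline, name) * (1 - tolerance):
            issues.append(name)
    return issues


previous = [HourlyHeartbeat(0.30, 120, 3500, 1)]
same = [HourlyHeartbeat(0.30, 120, 3500, 1)]
flaky_bt = [HourlyHeartbeat(0.31, 125, 3000, 1)]
```

An empty list means the candidate is within acceptable limits; a non-empty list names what regressed, which is the starting point for the kind of investigation described next.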
And if your Bluetooth connection time per hour dropped significantly, like, that is a regression in firmware, and we made a bug, or, you know, there's contention on the CPU,
or, you know, the connection interval change that you made... just like Chris mentioned with Android
phones, like, the connection interval changed and it made a lot of Android phones upset.
Can't tell you how many times we had to change that as well. Yeah, I think I can go on and on about this too. It's something that I've seen
IoT companies not consider. On one hand, all that information is very useful. On the other hand,
if you are a battery-powered device, the more often you send that information, the less often you will manage to live through the whole day or however long your battery is supposed to last.
There's a cost associated with sending those reports.
Do you have a way to balance the trade-off there?
Not one that's not obvious, I guess, right? Like if you,
if you are on a coin cell battery where it's like you have one, it's one and done, you know,
maybe it's even a fixed battery. Like you can't send a heartbeat every minute, like you said
before. Sending a heartbeat every hour or every few hours, or even once a day,
is pretty good, actually. You should be able to do
pretty well. And if you have persistent storage, what we tell a lot of people as well is, like,
store up, you know, a week or two of heartbeats on flash in a compressed format and then send them
up when you're ready. Slight plug there from Memfault as well. Like, we are able to store
plenty of heartbeats, and I think each metric that you track is basically
like six to eight bytes. And if you have, you know, a few K of flash, you can batch up quite a
few heartbeats. And yeah, when you then have a connection, or
maybe even a user, you
know, plugs your device into a wall socket or charges it, then you can send up everything.
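To put rough numbers on "six to eight bytes per metric" and "a few K of flash", here is a small Python sketch of batching heartbeats and draining them only when a link is available. The five-metrics-per-heartbeat figure, the 6-byte key/value encoding, and the `HeartbeatStore` class are assumptions for the example; the sketch packs values but does not compress them.

```python
import struct
from collections import deque

# Assumption: five metrics per heartbeat, each stored as a 16-bit key
# plus a 32-bit value (6 bytes, the low end of the range quoted above).
METRIC = struct.Struct("<HI")
METRICS_PER_HEARTBEAT = 5
HEARTBEAT_BYTES = METRIC.size * METRICS_PER_HEARTBEAT  # 30 bytes


def heartbeats_that_fit(flash_bytes: int) -> int:
    return flash_bytes // HEARTBEAT_BYTES


class HeartbeatStore:
    """Bounded queue standing in for a flash region; oldest entries drop first."""

    def __init__(self, flash_bytes: int):
        self._queue = deque(maxlen=heartbeats_that_fit(flash_bytes))

    def save(self, metrics: dict) -> None:
        """`metrics` maps a 16-bit metric key to a 32-bit value."""
        blob = b"".join(METRIC.pack(k, v) for k, v in sorted(metrics.items()))
        self._queue.append(blob)

    def drain(self, link_up: bool) -> list:
        """Return and clear everything queued, but only when a link is available."""
        if not link_up:
            return []
        batch = list(self._queue)
        self._queue.clear()
        return batch


store = HeartbeatStore(flash_bytes=4096)
for hour in range(200):
    store.save({1: hour, 2: 0, 3: 0, 4: 0, 5: 0})
offline = store.drain(link_up=False)
batch = store.drain(link_up=True)
```

Under these assumed sizes, 4 KiB holds 136 hourly heartbeats, a bit under six days, so a week or two of history needs either more flash, fewer metrics, or real compression.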
Yeah, I've had some devices where it's, once you are plugged in, okay, now just send everything
you've ever wanted to send in the past.
Well, yeah, it's a balance between logging and statistics too. If you can boil stuff
down to statistics, that's a few numbers. Yeah, that's easier to send periodically than, okay, I
have, you know, one megabyte of the day's event logs, I gotta ship all that up there. Both can be useful,
but there's a big trade-off there.
Yes, for sure. And we have found, yeah,
and I think there's a mixture between the two, right?
You have logging, you have metrics,
and then I think everyone has kind of come out
with their own flavor of it.
It's like the compressed logging.
We call it compact logging.
Other people call it hash logging,
but it's basically like,
take this human-readable message, give it an ID.
You can pass a couple arguments and everything is basically stored as, like, uint32s.
And then you send up those.
And that's much more compact and compressed than sending ASCII text.
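A minimal Python sketch of the compact (or hash) logging idea follows: the device sends an ID plus integer arguments, and the server re-renders the text from a table of known format strings. Using CRC32 of the format string as the ID is an assumption for this example; real implementations choose their own ID scheme and usually generate the table at build time.

```python
import struct
import zlib

# Format strings known at build time. On a real project the table would be
# generated by the build system; this one is hand-written for illustration.
FORMATS = [
    "BLE disconnected, reason=%d",
    "Flash write failed at sector %d after %d retries",
]

# ID scheme is an assumption: CRC32 of the format string.
ID_TO_FORMAT = {zlib.crc32(f.encode()): f for f in FORMATS}


def encode_log(fmt: str, *args: int) -> bytes:
    """Device side: emit the ID and arguments as little-endian uint32s."""
    log_id = zlib.crc32(fmt.encode())
    return struct.pack(f"<{1 + len(args)}I", log_id, *args)


def decode_log(payload: bytes) -> str:
    """Server side: look up the format string and re-render the message."""
    words = struct.unpack(f"<{len(payload) // 4}I", payload)
    log_id, args = words[0], words[1:]
    return ID_TO_FORMAT[log_id] % args


wire = encode_log("Flash write failed at sector %d after %d retries", 17, 3)
text = decode_log(wire)
```

Here a 47-character message travels as 12 bytes. The sketch only handles arguments that fit in an unsigned 32-bit integer.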
It's like how Applesoft BASIC used to work.
Oh, really?
And they wonder why I still know the ASCII table pretty darn well.
So there's logging and metrics.
There's the statistics that you were mentioning, which you were calling the heartbeat.
For me, a heartbeat is just anything from the unit, which usually is this little packet of statistics. And you set those because you have battery issues,
but you also, the length of time you don't check in is the length of time it takes for a user
who has changed something on the website
to get that on their watch.
And so this is like if a, if a user clicks
install on the app store and then they're trying to send that down to the watch. Yeah. If you're
only checking in once an hour, doesn't that mean it takes an hour for it to, to check in?
Oh, I mean, this is more for, for diagnostic data that the user, I mean, they opt in, of course,
I think that's generally the trend now, is you opt into all this diagnostic data, and you have to. This is like, the device is basically in
control of sending that data. Yeah. For at least Pebble and Fitbit, it was like, you're on the
phone, let me install an application, you're directly connected to the device, and then that just sends
it over immediately.
Sorry, I was back in cell phone, where if you only said hello, actually, this is in the underwater thing I've been working on.
If you only say hello once an hour and you're only awake that one time to listen, then somebody has to wait an hour for you to get around to say hello again.
Yeah, it's like working on a Mars rover.
Yeah, it's like Mars. I mean, you laugh at this, this is a way for people to implement
their own version of staged rollouts though, right? If a device only wakes up every hour or once every
24 hours and checks in, you know, here's my heartbeat, do you have anything
for me? And that's kind of in the payload, right? Like, you send all your information and
then the server responds like, okay, I got it, and also, here's some things you should know
about the world. A lot of times that's going to be like, here's an OTA payload for you to install.
And if you just, like, release the firmware for 30 minutes and then turn it off, that's pretty much the staged rollout, right?
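The "release it for 30 minutes, then turn it off" trick can be modeled in a few lines of Python. This is a sketch under stated assumptions: devices check in once an hour, their check-in times are spread evenly across the hour, and the `handle_checkin` and `simulate` names and the release dictionary are invented for the example.

```python
CHECKIN_INTERVAL_MIN = 60


def handle_checkin(now_min: int, device_version: str, release: dict) -> dict:
    """Server side: acknowledge the heartbeat and maybe offer an update."""
    response = {"ack": True, "ota": None}
    window_open = release["open_min"] <= now_min < release["close_min"]
    if window_open and device_version != release["version"]:
        response["ota"] = release["version"]
    return response


def simulate(num_devices: int, release: dict, horizon_min: int) -> int:
    """Count devices that pick up the release within the horizon.

    Each device checks in every CHECKIN_INTERVAL_MIN minutes, at an offset
    spread evenly across the interval (an assumption; real fleets are lumpier).
    """
    updated = 0
    for device in range(num_devices):
        offset = device * CHECKIN_INTERVAL_MIN // num_devices
        version = "1.0"
        for now in range(offset, horizon_min, CHECKIN_INTERVAL_MIN):
            reply = handle_checkin(now, version, release)
            if reply["ota"]:
                version = reply["ota"]
        updated += version == release["version"]
    return updated


release = {"version": "2.0", "open_min": 0, "close_min": 30}
updated = simulate(num_devices=1000, release=release, horizon_min=24 * 60)
```

With an hourly check-in and a 30-minute window, half of this simulated fleet ends up on the new version, and the other half never sees it until the window is reopened.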
That's one way to do it.
family who will report bugs directly to us and then go out to the bigger picture, to the
larger audience, even though that means you may have a bias towards different cell phones.
True, very true.
Or environments.
Yes, or environments. I was going to say, that is always one
suggestion as well. We say, like, do your staged rollouts, but also have your internal developers or users, which,
you know, it's usually the company employees. Like, if you're working for a hardware company,
every employee should be required to test your device or use it or wear it. And another thing
I always suggest is, like, if your device is experiencing an issue or asserting or has rebooted, like, when you're doing internal
testing, make that very loud. If you're making, you know, a smart lamp, even the
simplest thing, like, make the lamp flash for like 30 seconds, on and off, and that's
telling the user, this thing probably crashed, please load up
your phone and submit a bug report, you know, internally. And at least at Pebble, if the
device crashed on an internal build, we had a build flag that basically said, pop up this window
if it reset and this is an internal build. It would pop up a screen where you couldn't do
anything else. It was like, your Pebble just reset, please submit a bug, and you had to dismiss it.
We didn't do that, and that was like, we didn't know.
I mean, I pushed for that very hard. We did do it on, I mean, now we're getting into
history, Ionic. I built it on Ionic, but it was only for internal and beta testers.
Not sure I ever saw that happen.
Huh, okay. Yeah, I think it was a build flag, but it was only, I think, if
you opted in as well.
I mean, yeah, okay, so that's the firmware side. That's
some trade-offs on the firmware side and a little bit on the management side. But one of the things
at ShotSpotter and Fitbit was, okay, now that I have thousands or hundreds of thousands of units, these 50 or 100 have had problems.
How much time do I spend each day looking at those problems, or trying to find the root cause, or even finding out about those problems?
Ding, ding, ding. Finding out about those problems is the hardest part, right?
It comes back to, like, millions of devices. Everyone's gonna have a problem, right?
I mean, well, not necessarily everyone,
but at that scale, there will be thousands of bug reports every
single day, right? Like, no doubt about it. Thousands. And yeah, it's generally, my battery life was bad,
and it was probably the user was out of range or something, right? And the other issues will be, my device didn't connect to Wi-Fi or Bluetooth. And it will probably be they have a weird router
or phone and it just doesn't work. In those weeds, there are actually bugs. And then trying to find
those is the hardest part. And if you're starting out on firmware, what I see people do time and time again is, like,
they build a firmware, they capture logs, and they send logs somewhere.
They usually end up in some S3 bucket or on some person's hard drive. And, you know,
when you're doing 20 devices, you can look through those logs generally every single day
and, like, Control-F it or Command-F it, depending on which platform you're on.
And you can build some like really simple Python scripts
that can basically like parse through some logs.
But yeah, like to your point,
when you're doing even a thousand devices or a million,
like no one is going to find the real issues
and especially the new issues that happen, right?
And when you get a new issue,
like if you've seen this issue a bunch
and you've kind of gotten the idea that it happens
and the unit resets and I'm just,
I can't find it in the code, but that's okay.
But when you get the new issue
and you've never seen it before
and you're like, oh, is this the start
of the tidal wave of problems?
How do you bubble those up?
How do you decide what's an important thing to tell people?
Yep. And this is where Memfault really comes into play, honestly. Because, yeah,
quickly to cover this, what are those issues that are going to be very important, right? It's probably going to be your device is crashing, or it's going to be sounding
some alarms, like asserting, or some sort of really bad, like, your device in its heartbeat
is saying, like, bug or issue, or holding up a red flag, right?
Memfault is built in a way that when a device crashes or has a particular log,
it will basically capture a signature of it.
It captures a core dump or it captures a log.
It sends that to our server.
We basically generate a signature of it. And if it's a new signature, we will generate a new ticket. We'll send you an email, we'll send
you a Slack message, and we will show it on the front page to be like, Hey, you know, your firmware,
the firmware version you just updated and pushed out like has a new bug. And if it's one we've
seen before, we will increment a counter. And so it's not this, like, you're not getting a thousand new bug
reports that you have to basically like crawl through. You're just being alerted to the one
or two new ones that you have maybe that day. Um, and to figure out which ones are actually
important, it's probably the ones that are affecting the largest number of devices I would
say, or the CEO's device. Usually those two.
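The "new signature opens a ticket, known signature bumps a counter" flow can be sketched in Python as below. The signature rule used here (reset reason plus the top three frames of a symbolicated backtrace) and the `IssueTracker` class are assumptions for illustration, not a description of how Memfault computes signatures.

```python
import hashlib
from collections import defaultdict


def signature(fault: dict) -> str:
    """Collapse a crash to a stable key.

    Assumption: reason plus the top few symbolicated frames is enough to
    tell issues apart. Real services use more elaborate rules.
    """
    key = fault["reason"] + "|" + "|".join(fault["frames"][:3])
    return hashlib.sha256(key.encode()).hexdigest()[:12]


class IssueTracker:
    def __init__(self):
        self.devices_by_issue = defaultdict(set)
        self.notifications = []

    def ingest(self, device_id: str, fault: dict) -> str:
        sig = signature(fault)
        if sig not in self.devices_by_issue:
            # First sighting: this is the only case that alerts a human.
            self.notifications.append(sig)
        self.devices_by_issue[sig].add(device_id)
        return sig

    def ranked(self) -> list:
        """Issues ordered by number of affected devices, largest first."""
        return sorted(
            ((sig, len(devs)) for sig, devs in self.devices_by_issue.items()),
            key=lambda item: item[1],
            reverse=True,
        )


tracker = IssueTracker()
assert_fault = {"reason": "assert", "frames": ["bt_send", "bt_task", "main"]}
wdt_fault = {"reason": "watchdog", "frames": ["flash_erase", "fs_gc", "main"]}
for dev in ("a", "b", "c"):
    tracker.ingest(dev, assert_fault)
tracker.ingest("a", wdt_fault)
top_issue = tracker.ranked()[0]
```

Four crash reports produce two notifications, and the ranking puts the issue that touches three devices ahead of the one that touches a single device.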
Yep. Yep. The CEO's device is always high importance.
Or the press reviewer.
Or the press reviewer. Exactly. Oh, man. Yeah. We've done that as well, right? Like you put
them into a special cohort of devices, and you do not update their firmware during the release event.
Or if you do, you make sure it's a special build that doesn't do anything fancy.
It's kind of a facade.
No matter what you do, whatever button you press, it goes to the next screen and looks perfect.
We've done it.
Oh, yeah.
It's just a sticker. I remember at Fitbit finding a new issue in the company-wide rollout of a problem.
And realizing I didn't know that person, but since this was important and the bug was whacked, I mean, just crazy, couldn't figure out what it was doing, I actually called and said, okay, so, you know, at blah, blah, blah time.
This was an internal person.
This was an internal person.
Never do this to actual customers.
Oh my gosh, okay.
No, this was an internal person who knew they had...
I went into the customer service database,
found this person's registration.
I just called them at home and said,
hey, I noticed your watch isn't working.
And they were very confused, naturally,
and then looked at the time and then said,
oh, that's when I put it in the dryer.
Oh.
I decided I didn't have to chase that bug anymore.
Yeah.
And actually, that's the whole creepiness of that, especially as you go to customers.
How do you handle those data ethics?
I mean, internal customers and Fitbit was small at that time, but I had the keys to their debug database for a little longer than I should have.
How do you balance the, I need this information versus, oh, this shows the customer was in such and such a place at this time.
So they must be, I don't know.
This is like when the watch that people were running with was showing how the military base was set up.
Right, right.
This Strava, yeah.
There are different types of debug information that you can send from a device, right?
There are hardware metrics,
like what is the readouts from these sensors?
Like, are the sensors reporting faulty information?
I know we tracked some metrics at Pebble
where we would record the max and the min
X, Y, and Z axes from the accelerometer.
And basically what we would verify from that
is like if we just got bogus results
for that hourly heartbeat,
we knew that that accelerometer,
either one is completely
faulty and that product should be replaced or two, like something really weird went,
went wrong during that time. And like, maybe something else, maybe there's a firmware bug.
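The accelerometer min/max check can be sketched as follows in Python. The thresholds and the two "bogus" heuristics (an axis that barely moves all hour, or one pinned at both rails) are assumptions chosen for the example; the transcript only says that implausible ranges flagged a faulty part or a firmware bug.

```python
# Assumptions: raw 16-bit signed readings, and a worn device should see at
# least MIN_SPREAD counts of movement on every axis over an hour.
INT16_MIN, INT16_MAX = -32768, 32767
MIN_SPREAD = 8


class AxisRange:
    """Tracks the min and max seen on one axis during a heartbeat interval."""

    def __init__(self):
        self.lo = INT16_MAX
        self.hi = INT16_MIN

    def update(self, sample: int) -> None:
        self.lo = min(self.lo, sample)
        self.hi = max(self.hi, sample)


def accel_looks_bogus(ranges: dict) -> bool:
    """Flag an interval whose readings are stuck or pinned at the rails."""
    for axis in ranges.values():
        stuck = (axis.hi - axis.lo) < MIN_SPREAD
        railed = axis.lo == INT16_MIN and axis.hi == INT16_MAX
        if stuck or railed:
            return True
    return False


def ranges_for(samples) -> dict:
    """Build per-axis ranges from (x, y, z) samples."""
    ranges = {"x": AxisRange(), "y": AxisRange(), "z": AxisRange()}
    for x, y, z in samples:
        ranges["x"].update(x)
        ranges["y"].update(y)
        ranges["z"].update(z)
    return ranges


healthy = ranges_for([(10, -200, 980), (35, -180, 1010), (-50, -220, 995)])
stuck_z = ranges_for([(10, -200, 0), (35, -180, 0), (-50, -220, 0)])
```

Only six numbers per heartbeat leave the device, and none of them say anything about where the wearer was or what they were doing.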
And so, like, that's not revealing anything private about the user in any way. It's just
hardware data. GPS locations are very, very different. That's where the product is located. At least
for us at Memfault, we tell people explicitly, do not send us that type of
information. Don't send us where people are located, how quickly they're moving, and anything that is
personally identifiable.
What if they need that information for their own device management?
Does that mean they have to split their stream of information?
Generally. And generally they do.
Not many people use Memfault as their primary data pipe. They have some other auxiliary pipe that they basically pipe
all of their product or PII or things that make their product completely function. Like they're
not, where Memfault is currently ingesting, you know, debug and monitoring information and some
sort of configuration management for some devices. A lot of times they even send all of our data to their own servers. And then
they send over the Memfault-specific stuff. They basically pass it over from server to server to
our service. And that's how they keep a lot of that stuff away from us. And yeah, at Pebble,
and at Fitbit too, we captured a lot of data, but I would say not much of it, if any of it at that time, was identifiable. It was just, like, how many times was a flash sector read, written to, or erased? How long did it take? How long was the heart rate task running? Like, these things are critical to debug, but in no way useful information to identify a person or understand what they were doing.
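The server-to-server split described above, where a company receives everything on its own servers and forwards only diagnostics, might look roughly like this in Python. Every field name here is hypothetical, and the allowlist approach is one possible design, not something the transcript prescribes.

```python
# Allowlist of diagnostic keys; everything else stays on the company's own
# servers. All field names here are hypothetical.
DIAGNOSTIC_KEYS = {
    "firmware_version",
    "flash_sector_erases",
    "flash_erase_ms_max",
    "heart_rate_task_ms",
    "battery_drop_pct",
}


def split_payload(payload: dict) -> tuple:
    """Return (diagnostics, product_data) from one device upload."""
    diagnostics = {k: v for k, v in payload.items() if k in DIAGNOSTIC_KEYS}
    product_data = {k: v for k, v in payload.items() if k not in DIAGNOSTIC_KEYS}
    return diagnostics, product_data


upload = {
    "firmware_version": "2.0",
    "flash_sector_erases": 41,
    "heart_rate_task_ms": 5300,
    "gps_lat": 37.77,
    "gps_lon": -122.42,
    "user_email": "someone@example.com",
}
to_diagnostics, kept_in_house = split_payload(upload)
```

An allowlist fails safe: a new field added by the firmware team stays in-house until someone decides it is diagnostic data.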
I have some listener questions, if you don't mind.
Philip Johnston of Embedded Artistry, when I said you were on, I think he was ready to write the whole outline for me.
He asked really good questions.
So let's see. In most orgs I've
worked in, they hesitate to outsource device management and prefer to build it in-house.
Is that simply not invented here syndrome or are there factors with existing services that
drive companies toward that decision? Probably both. I think the most obvious reason why they want to build it
in-house is I think what we talked about earlier. There just doesn't seem to be a great solution out
there, at least for the factory line provisioning that they need to do. Generally, companies are
just going to build that in-house because that's what
they had to do five, 10 years ago anyways.
And the same people are going to be working the lines and they know what to do in terms
of, are there any, yeah.
I mean, in the other existing thing is like, if you're trying to use a device management
tool that you don't know if it's going to exist when your product,
you know, is nearing its end of life or like is going to continue.
Like you're trying to support a product for 10 years.
I think in the consumer space, we, you know, I wish it was longer, but we want a product
to maybe last like two, three, four or five years.
But if you're building a product for government or a city or a sensor that's supposed to stay in the same place
for 20 or 30 years, like, you probably should build that system yourself so that you can, at some point
in time, lock it in a closet and never touch it again, and hopefully it just continues to work
forever. Who knows if AWS is going to want to continue, I mean, probably not Google, but who knows
if these companies are going to want to support their IoT platforms in five or 10 years.
Yeah. I don't know if Google has an IoT device
management system, but I wouldn't consider it. No, they burned me after
Google Reader. I'm never trusting them again. That was it.
Okay. Philip also asked, what are the real challenges with managing a fleet of devices
versus what people think are the challenges, but turn out to be easy?
All right. Two part question. The real challenges are, are what we talked about before. It's,
it's signal from the noise. I think most device management platforms today are truly built for
20 to 100 devices. I think on these dashboards that you see from these
products, when you're comparing device management platforms,
the dashboard that they show is like a green or a red box for all of the devices in your entire fleet.
And you're basically trying to look for like the one red box
and you're like, ooh, this device number 72 is offline.
Like, let me go walk over and see what's up with it
or like call the assembly line, you know, manager
and ask them to go reboot it.
When you're doing thousands, hundreds of thousands,
millions of devices, you're always going to have, like, a thousand of them red if you're, you know, using this sort of device management tool.
And so it becomes, is this number worse on the previous release or worse in the new release?
You know, was there a regression or an improvement? And I do believe
Memfault is getting much better at this. I think we're the only company that I've seen do it,
like easily comparing release to release. So you just upgraded from 1.0 to 2.0. How do your metrics
compare between them? How are your devices behaving? You know, how did the battery life change?
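A release-to-release comparison can be as simple as comparing a fleet-wide summary statistic per version. The Python sketch below uses the median so that a handful of broken devices does not swamp the result; the function name, the 10% tolerance, and the "higher is worse" convention are assumptions for the example.

```python
from statistics import median


def compare_releases(heartbeats: list, metric: str, old: str, new: str,
                     tolerance: float = 0.10) -> dict:
    """Compare the fleet median of one metric between two releases.

    `heartbeats` is a list of dicts with a "version" key. Higher values
    are treated as worse, which suits metrics like battery drop per hour.
    Assumes the old median is non-zero.
    """
    old_values = [h[metric] for h in heartbeats if h["version"] == old]
    new_values = [h[metric] for h in heartbeats if h["version"] == new]
    old_med, new_med = median(old_values), median(new_values)
    change = (new_med - old_med) / old_med
    return {
        "old_median": old_med,
        "new_median": new_med,
        "change": change,
        "regression": change > tolerance,
    }


fleet = [
    {"version": "1.0", "battery_drop_pct": 2.0},
    {"version": "1.0", "battery_drop_pct": 2.5},
    {"version": "1.0", "battery_drop_pct": 2.0},
    {"version": "1.0", "battery_drop_pct": 40.0},  # one noisy outlier
    {"version": "2.0", "battery_drop_pct": 3.0},
    {"version": "2.0", "battery_drop_pct": 3.0},
    {"version": "2.0", "battery_drop_pct": 3.5},
    {"version": "2.0", "battery_drop_pct": 2.5},
]
report = compare_releases(fleet, "battery_drop_pct", old="1.0", new="2.0")
```

In this made-up data the outlier on 1.0 would make the old release look far worse on a mean, while the median shows 2.0 draining about a third more battery per hour.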
Historically, like six months ago, how was the battery life between 1.0 and 2.0? Like all of these things, I just don't believe these device management tools do well, if at all. And yeah,
there's always going to be noise, and there's always going to be a signal. It's just like
trying to figure it out.
I think your statistics there and the noise definitely show
your Fitbit and Pebble background. I mean, that's true on almost everything, that you
have to figure out which of these bugs is important to spend your day on and which of them you have no chance of fixing until
something else happens. But the battery component of wearables is one that just makes it that
much harder. What about the other part of Philip's question? What do people think is difficult,
but it turns out to be easy?
People, companies like to think that their product is actually the hard part.
You know, this, this, we're trying, I mean, I'm just, I'm just naming things randomly. It's like, let's go build a TV remote.
You know what the hardest part is, is building that TV remote.
That's what they think.
And, and it turns out just not to be.
The problem is actually, like, managing the firmware
updates. It's managing customer support, and how do you get customer support to understand the
low-level firmware enough to know what's a real bug and what's not a real bug and what's
just go reset the device? And yeah, I do believe that writing the firmware and building your product is probably the easy part, because you probably hired or trained people to do that.
You have not hired a bunch of people who know how to manage very low-level, very, you know, ancient-like devices in a modern world. And one
of the things that I think people struggle with as well is, like, you don't know what you
don't know. And you probably have many stories about this as well. It's like,
if a firmware engineer from five years ago tried to build a product in the firmware world today,
they'd pull their hair out for sure. They're like, you mean I
have to, like, do what? I have to communicate to phones, routers, secure transport, firmware
updates every single month, every single week, even nightly sometimes? And you have to, like, have a
beautifully crafted touchscreen display, all of it. It's just hard. Not many people,
there's only been so much time we've demanded these sorts of things from these little
low-level devices. And so I think those are the hard parts, because we've not done them before.
We only did them at Pebble because we were really naive. We were like, well,
we think we need these things.
Like we're generally software engineers.
Let's learn how to write some firmware.
And if we can't build or find the tools,
or if we can't find the tools that we needed in the software world,
like building iOS and Android apps,
like we got to build them ourselves because that's what we know is required.
Whereas I think if you build hardware for a living,
you don't know that these software tools are required.
So many of the tools that I've taken part in building weren't designed like you're saying.
They were the effect of 3 a.m. debug sessions. The realization that, oh, we have to monitor
battery life because if we don't, then we don't know
that it's broken.
How do you get engineers to understand
that going, I mean, it's really not
something you worry about when it's on your desk or
when it's in your lab.
But when it turns into enough devices that people go to Reddit, I don't know why I'm
picking on Reddit now.
Because it's noisy.
It's great.
I mean, it's great.
Very, very fanboys and girls.
I only go to, like, the origami channel these days.
It's not a channel, is it? What are the Reddits? Subreddits.
Subreddits. I think I know where your question is going. It's like, how do you then train or get engineers to
understand that like they need to focus on these problems now, not when the customer support tickets come flooding in,
that the battery life is now bad, right?
Because then, as soon as you hear about it that time,
then it takes you months to fix.
And no one wants that two- to three-month debug session.
It's not even the two- to three-month debug session.
It's the not,
we have to fix this problem and figure it out,
but also,
oops,
we really should be tracking this since now we have to have a crash program
to actually do the kind of logging and stuff that we weren't doing before.
Right.
And the bug only took, like, two days to fix,
but now you have your release process so that it doesn't have another bug in
it that causes more problems.
We're all forgetting the fact that you have to reproduce this issue
first as well. You have to understand, and that's, you know, probably the part that, oh man, I mean,
the amount of people that are interns or, you know, sad individuals that
I've talked to where it's like, oh, I've been trying to reproduce a bug for like two weeks and it still hasn't cropped up.
That's the thing with a million devices. If they all run for a day, you can get
a one-in-a-million sort of, yeah, bugs get weird. I've talked a lot about this. A plug for an Interrupt article.
It is one of my favorites.
It is defensive.
It's such a clickbait article, but I love it.
"Defensive Programming - Friend or Foe?"
But it's what I talk about in it
is more of this concept of offensive programming.
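One way to picture the difference between the two styles is the small Python sketch below. It is an interpretation of the idea for illustration, not code from the article: `fw_assert`, `FirmwareAssert`, and `crash_log` are stand-ins for a firmware assert macro, a fault handler, and a crash record that survives reboot.

```python
class FirmwareAssert(Exception):
    """Stands in for a fault handler that records state and reboots."""


crash_log = []  # stand-in for a crash record saved across reboot


def fw_assert(condition: bool, where: str) -> None:
    """Offensive style: stop immediately and leave evidence."""
    if not condition:
        crash_log.append(where)
        raise FirmwareAssert(where)


def set_brightness_defensive(level: int) -> int:
    # Defensive style: quietly clamp bad input, so the caller's bug
    # never shows up in any report.
    return max(0, min(100, level))


def set_brightness_offensive(level: int) -> int:
    # Offensive style: a bad level is a bug somewhere upstream, so
    # fail loudly at the first place it can be detected.
    fw_assert(0 <= level <= 100, "set_brightness: level out of range")
    return level


quiet = set_brightness_defensive(250)
try:
    set_brightness_offensive(250)
    caught = False
except FirmwareAssert:
    caught = True
```

The defensive version hides the caller's mistake behind a plausible value, while the offensive version turns it into a crash record that can be counted and traced across the fleet.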
It's, yes, when you have a million devices, you're going
to get one of every single crash that's in that firmware, pretty much, or, like, one of every single
issue per day. And the goal of that offensive programming is, like, trying to surface as many bugs as possible, as quickly
and as loudly as possible. And what that allows you to do is fix them early and
very quickly, and ideally very easily as well. Yeah, I mean,
if you get to that point, though, you need a lot of systems in
place before that. You need data that the devices are sending you that allows you to track down
exactly what bugs exist, and how did my devices crash, and how did my battery life drop? Like,
what are the different metrics that pertain to battery life and kind of contribute to it? Oh, there's so many more, you know, ant tunnels to talk about
in this topic as well.
But yes, I mean, there's so much. Actually, so I've done
the role where I've monitored the devices. It's not one I'm particularly suited towards.
But I've done it enough that,
especially as products come up and go from 100 inside a company
to a couple, maybe 10,000 outside a company.
After that, I'm just not the right person.
I wouldn't say any firmware engineer really is
because it becomes more of a data science problem.
Is there a new role? Is there a new engineering title for the person who monitors these and tries to prioritize what can happen?
It's called the enthusiastic firmware engineer.
Ah, the intern.
Ah, the under 30 set.
I mean, yeah, I just hit 30 this year.
You can turn off your enthusiasm now.
No, I will never.
But seriously, that is, I mean, if we're going to be honest, that is the role that
need that, that generally takes place, right?
Like I, I very rarely hear about companies hiring a like higher level firmware engineer.
I think that's the role that I took at Pebble. I, like, slowly morphed myself
into, like, higher-level firmware engineer slash Python, you know, Python and web app builder.
Like, I built a lot of web application tools at Pebble. And at Fitbit, I kind of carved my
way into this role after, like, nine months, that was developer productivity tools, where, you know, we built a CLI to kind of
build and manage the firmware locally, and I built some web applications to parse a bunch of the data
the device sent. You know, it parsed a bunch of core dumps, parsed logs, got rid of my really bad
python script which one exactly the one that oh Oh, they tracked the court.
And, but that, that role doesn't exist. It's usually the, the embedded engineer who spends, you know,
some extra nights or, or, or weekends or has done it before or yeah.
Who has done it for a previous company. And thankfully now there
is Memfault. Like, you integrate the SDK and you get most of this data, but
you still need to understand, like, what metrics to capture and what does it mean to have
this metric be different on this release and this release? And that just
happens through socializing and talking to your community and asking, you know, the hard questions
and, you know, you asking these questions on the podcast and hopefully people listening.
Well, and you are right because somebody who wasn't intimately familiar with the firmware
couldn't look at these trends and understand
where the root causes might be. They could write a bug that said battery life is down
in some number of units, but it would take a firmware engineer to say,
oh, those are all iPhones, or those are all Android phones, or those are all units we shipped in the first month, or something.
Well, and it's not just that.
It's somebody who has knowledge enough of the product management,
or the project management.
I always get those confused.
But to see where you are in the feature set,
because maybe you turned on a new power, uh, battery-hogging
feature, and now everybody's using their GPS to track something and they weren't before. Well, then
that's why you're getting, you know, 30% less battery life every day.
So, woohoo, heart rate works. Oh,
now my battery dies. Oh, we shipped that heart rate feature, but you probably shouldn't keep it on all the time.
You also do tools.
I think we're going to have to have him back to do the tools conversation. It's a long conversation.
Well, because I had a lot of questions.
I know, and we're already...
All right.
How much time is it?
We're at an hour and 15 now.
Yeah.
Oh, my gosh.
Sorry.
No, it's great. This is very good. But I do want to talk about tools, and we would not do it justice if we were to try to do
it now.
I'm happy to come back, part two. Oh, there's so much more to talk about.
There's so much. Yeah, and I mean, this whole device management thing is going to become a
bigger problem as we go on forever. It's always going to be bigger and
bigger.
And I'm still going to call them distributed systems, darn it.
It's a good term. I just
haven't, you know, heard that before when talking about embedded devices. I mean, it's not actually
the first one working together. It's not like all the Fitbits are working together. They're all individual systems.
That was never what distributed systems meant.
It isn't?
It doesn't imply a mesh of any kind.
It doesn't?
Tyler, I heard Memfault is hiring.
Would you like to give us more information?
Yes.
Currently, we are hiring for a firmware solutions engineer, and that is building up our SDK,
talking to customers, and generally being an evangelist for the company, and also a data engineer.
All these devices send us a bunch of data.
We have to analyze it, store it, and produce insights and tell people how their devices
are failing or succeeding in the field.
And yeah, we're looking for a data engineer.
And Tyler, do you have any thoughts
you'd like to leave us with?
It's more of a, yes, it's more of a like,
this is what I've learned over the last,
you know, two years in COVID,
but kimchi is very easy to make.
And I suggest everyone try to make some kimchi at home if they like it.
Unexpected, but excellent.
Our guest has been Tyler Hoffman, co-founder of Memfault.
If you'd like to check out their blog, well, it'll be in the show notes.
But if you can't find that, type Interrupt and Memfault together, and you will definitely find it.
Thanks, Tyler.
Yeah, thank you both.
Have a great one.
Thank you to Christopher for producing and co-hosting.
Thank you to our Patreon listener Slack group for questions, in particular, Philip Johnston,
which reminds me, if you've been considering supporting us in Patreon and you want to join
that Slack, now is a really good time as the book club just
started some really cool new stuff. Finally, thank you for listening. You can always contact
us at show at embedded.fm or hit the contact link on embedded.fm. And now a quote to leave you with.
This one's from Jack Kerouac. My fault, my failure, is not in the passions I have, but in my lack of control of them.