Embedded - 390: Irresponsible At the Time

Starting point is 00:00:00 Welcome to Embedded. I'm Elecia White, alongside Christopher White. Today we'll be discussing why I hate the term Internet of Things. Wait, no, we'll be discussing the management of distributed systems with Memfault's Tyler Hoffman. Hey, Tyler. Welcome. Hello. Could you tell us about yourself? Yeah, for sure. Yeah, I'm Tyler Hoffman.

Starting point is 00:00:33 I am generally an embedded firmware engineer, but now, apparently like Chris, I write mostly Python, building Memfault's backend services and data infrastructure to manage our device management platform and diagnostics tools. Before then, I was a firmware engineer at Pebble and Fitbit, where I constantly found myself doing more developer tools and infrastructure than writing firmware. All right. We will have questions about, well, writing firmware and managing it and all of that. But first we're going to do a topic. Narrow topic. Lightning round, where we ask you short questions and we want short answers. Are you ready?

Starting point is 00:01:21 I am. Okay, easy one. Favorite fictional robot? Wall-E. IoT, edge devices, or distributed systems? Ooh, IoT, edge devices. Who had a better smartwatch, Fitbit, Pebble, or Apple? Pebble. It's an easy one.

Starting point is 00:01:43 Preferred code editing tool? Now, PyCharm. It's great. CMake, make, or something else? CMake, but I don't know it super well. Open source software, yes or no? Yes. Complete one project or start a dozen?

Starting point is 00:02:11 Finish two, 80% of the way. If you were teaching a course about embedded systems, what three topics should you definitely cover? Um, unit testing, debugging, and build systems. Okay, I have a late-breaking question for you. Have you ever ridden the Boilermaker Express? I never did, actually. I hopped aboard it when it was stationary, but never while it was moving. Follow-up, where did the name Boilermaker come from? I mean, I'm going to guess here. I mean, I do know it was from the men who worked on trains and railroads. I'm asking real-time questions that are coming into me from a fellow Purdue alum. So, if anybody's wondering what the heck is going on, that's what those questions are for. I went to Purdue for undergrad, for the listeners. Okay, we're going to go back to the course thing because that was kind of important. I'm sorry,

Starting point is 00:03:05 I forgot that, I forgot that you're trying to get everybody to do your homework for you. That wasn't it. Okay, so listeners, sorry, Tyler's just going to take a second. Listeners, I am teaching a course for a company called Classpert. That's like class and expert had a little word together and called it Classpert. I like it. Uh, and it's about embedded systems. It goes through my book. Uh, it has a whole bunch of extra stuff. I'm doing videos, I'm doing all kinds of lectures, there'll be mentors and real-time discussions, projects. Um, and I'll put a link in the show notes, but I hope you check it out. Uh, the first class is going to be kind of small because, let's face it,

Starting point is 00:03:48 I haven't done this before, but the Classpert folks seem to really have their act together and, let's face it, my logo for them is awesome. Okay, sorry, Tyler, back to you. Debugging, unit testing, and what was the other one? Build systems. Well, all right. I think that's where we're going to head for the whole show. Recently on Twitter, I asked about IoT management for non-cell phone devices like BLE or Zigbee with a backhaul cell phone or coordinator, non-Linux Ethernet devices.

Starting point is 00:04:28 And I wanted to know what platforms people use and what they like and what you'd suggest for a new small company entering the IoT space. Do you have an answer to that? I think we all have somewhat strong opinions to that. I didn't get any response. I mean, on Twitter, I was so surprised. But yes, I have strong opinions, mostly in the, oh, God, get me out of here opinion. But you actually are in that space.

Starting point is 00:04:57 We're in that space. And my guess as to why people did not respond to you would be because no one has a very strong or confident or probably even right answer to that question, because I feel like a lot of them are, you know, mediocre at best, a lot of these systems. In terms of what platforms we've seen people use. So yeah, so we work with a lot of customers at Memfault. We talk to a lot of engineers. I have never talked to more embedded systems engineers in my entire life than I have over the last two or three years. Zephyr, Mynewt, FreeRTOS, and the Espressif IDF are the ones that come up most commonly in the customers that we talk to. But those are the devices.

Starting point is 00:05:41 I was looking for what happens after you get to 10 sitting in your offices or in your closet somewhere and not necessarily a very small embedded device. And I know that's what you're looking for. Yes. Non-Linux Ethernet devices. Exactly. Right. And so AWS has one, right? You can use FreeRTOS with AWS IoT, and Microsoft bought ThreadX, the RTOS there. And Espressif has their own cloud backend that they want to use as well, or they want people to use. I wouldn't say all of them are good. And they weren't written to be usable, especially by engineers or people who

Starting point is 00:06:47 don't know exactly how to use these systems to begin with. Why is this such a hard problem? Is it because you're taking a step beyond just firmware to now having an understanding of networking and software as a service kinds of things? Do you have to make that kind of a jump in expertise, or is it that nobody has made a real kind of turnkey, okay, this is easy, we will do everything for you kind of solution? I mean, the ones writing the firmware are very much not the people writing the backhaul services, and I don't know if they talk to each other often enough. And that's, you know, I'm sure both of you working at previous companies doing embedded systems, that's probably true, is the firmware engineers very rarely talk to the cloud engineers, I believe.

Starting point is 00:07:36 I know that was true for the last two companies that I worked for. I think it's worse at some of the big companies. Azure and Amazon both have IoT offerings that really do seem to be written for software engineers working on computers, not written for firmware engineers trying to squeak out one last byte of RAM. Exactly right. And especially trying to, like, do SSL connections and HTTPS with, like, you know, 64K, or for some of our customers, like, 32K of RAM. Like, it's just not happening. Yeah, I think you're right that that's a huge piece of it: some of the things you must conform to don't really fit. Yeah, and they were never intended to fit. And the other thing that's also tough with a lot of those platforms that exist today is they assume that these devices have infinite power,

Starting point is 00:08:31 pretty much infinite resources, and they have a constant and stable internet connection to these systems. And that's very rarely the case unless you are literally a computer in a closet, you know, running Linux. I've worked on two big distributed systems such that I've had to get involved with both the software and the hardware. And one was ShotSpotter, where we had dozens and dozens of sensors in each covered city, and we had dozens and dozens of covered cities. And every day we wanted to know, well, was there a sensor that didn't have its heartbeat, that didn't check in, which meant its radio or power was down?

Starting point is 00:09:17 Was there a sensor that had a fault or didn't hear anything and therefore probably had something wrong. And I mean, once you get up to like a thousand sensors, it becomes hard. And we did it with Visual Basic, querying SQL tables in Excel and color coding. In Excel too? In Excel. In your defense, AWS didn't exist. None of this stuff existed back then. That's true.

Starting point is 00:09:49 I mean, it was 2007, 8, 9-ish. That's kind of still what they want you to do though. They're going to put all your data, they're just going to export your data to a CSV file in S3 and they're going to tell you to do it yourself. That's all they're going to provide. But one of the other problems with that, I mean, the communication was part of it, but for some of the devices, we were on a cell modem. And so every byte you sent back actually cost money.

Starting point is 00:10:20 And so we didn't want to do a heartbeat every minute because that actually adds up to a lot of money every day. But no back end, no, I don't, what is this called? Device management, IoT management, takes into account the need for small data updates. And so, yeah, so can I pitch Memfault really quickly, or just, like, say what we're attacking? Right, it's like, we are, so yes, what we did at Pebble and Fitbit. You know, at Pebble, we built our own. It was very simple. Our devices connected through a phone and every so often reported back through that phone to a very scalable Python application written on Heroku. Honestly, that's how we got most of our data back. At Fitbit, massive systems. Um, I'm sure both of you have some history on that and how those are built,

Starting point is 00:11:16 but yeah, very complex systems, but completely homegrown. And why we wanted to build Memfault was because we kept seeing this problem over and over again. We're like, no matter what company we went to, we were going to have to build this system or, you know, shoehorn one of these larger systems into a hardware product, an embedded system, again. And me, Chris, and François were just like, we can't do that again. Like, we don't want to solve this problem for the third or the fourth time. And so that's Memfault. And that is, it's like, we're getting in, I would say, more so device management. I think everyone defines it differently, which I guess is also part of this conversation. Um, I see it as kind of three or more things. It's like provisioning: it's giving the device, you know,

Starting point is 00:12:04 some sort of certificates or device serial that you, you know, put on it on the factory assembly line. It is knowing whether that device is alive and how well it's doing. And then it's also pushing new updates to those devices. I think those are the three things for device management. Memfault does very well, in my opinion, the OTA delivery and the monitoring and diagnostics. We do not have yet, maybe, any sort of provisioning services, security keys. We're not doing those things yet, which I think is the one thing that AWS IoT maybe does well, but is also very confusing. How do you do over-the-air updates if you don't have security keys programmed in manufacturing? For our customers, we assume they are going to do that themselves.

Starting point is 00:12:55 So we are basically saying, bring your own system. We don't, you know, I think other companies are attacking it in the way that, like, you need to use our platform. You need to use our chips that we provide you, you know, they're like ten dollars apiece, and please use our chips, please use our back end, and you can build your product on top of it. You know, um, we're just saying, not very scalable. I mean, it's just not very scalable. Um, but, like, I think a lot of these companies are building it for very large and expensive devices, right? Like if you're building a tractor or if you're building a big machine on an assembly line, like you don't care about the cost at that time.

Starting point is 00:13:37 But if you're building a wearable device that costs a hundred bucks, you need something that works for that company and, you know, for that business model. And there's not much there. One of the problems with supporting provisioning and manufacturing that I've seen some vendors try to help with ends up with them having the keys. And that's always been a non-starter for me. Vendor lock-in. Yes. Exactly.

Starting point is 00:14:11 I mean, in the end, if I'm protecting the customer data or protecting my device through secure over-the-air downloads, I don't really want anyone else to have that information. Correct. Yes. And so, yeah, there's no good solution, but I will, yeah, the comment, like, have you heard of providers that don't give you the private keys if you give them on a device, like the provisioning? I think so, because sometimes there are, I don't want to call out anybody, but there are some companies that provide a whole solution from network to dashboard. And you write a little bit of code for their widget, and you don't really get to know anything else about it. Got it.

Starting point is 00:15:00 And so you're basically writing software for this thing that exists in the environment that you're placing it in. Yeah, and sometimes, I mean, like, over-the-air updates happen kind of magically, which is terrifying, um, because you don't really want over-the-air updates to happen like you want continuous integration with software. Correct. And especially when it comes to hardware, because the inevitable and the worst case is you're going to brick units or have issues in the field that you can't possibly handle or want to deal with, basically. They see, like we said, they have that software perspective. It's like, well, how can we make the device a software thing? How can we make the device just part of the cloud? And if you write software for it, we own everything that's involved with it, basically. So it's a difficult balance.

Starting point is 00:15:58 And when you say provisioning, you mean the security piece, not the provisioning that the customer has to do when they get it home and have to connect it. Correct. Yes. I mean the certificates and stuff, flashing the device with, like, this is your device serial, this is your MAC address, this is your Bluetooth ID, and this is your security token that, like, is how you will communicate to anything, but not necessarily, like, customer onboarding and let's install your first OTA payload and everything. Is there a different word for what the customers do? Honestly, I would call it onboarding.

Starting point is 00:16:37 Honestly. Okay. I think it's what I've always used. Yeah. So going back very briefly to the larger systems, what they try to do and what they're focusing on is like secure transport. And in my opinion, a lot of it for OTA updates specifically is as long as you have secure boot, you're fine.

Starting point is 00:16:59 As long as the payload is signed and you install it, which I think most of the bootloaders today and the embedded system platforms that you can use support, you're generally going to be fine. What do I need to know as a firmware engineer about OTA when I'm thinking about these large distributed systems? Signing and hashing are important, where hashing is the checksum, but with security, and signing says it really did come from the person I said it came from. Yeah.
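To make the hashing versus signing distinction concrete, here is a minimal sketch. It assumes a shared secret and uses an HMAC as a stand-in for a real signature, because the Python standard library has no asymmetric signing; a production bootloader would more likely verify an asymmetric signature against a public key stored on the device. The names `SIGNING_KEY`, `digest`, `sign`, and `verify` are hypothetical and not taken from any bootloader mentioned in the conversation.

```python
import hashlib
import hmac

# Hypothetical shared secret, standing in for a real signing key.
# Assumption: a real device would hold a public key and check an
# asymmetric signature instead of a shared-secret HMAC.
SIGNING_KEY = b"example-key-not-for-production"


def digest(image: bytes) -> bytes:
    """Hash of the image: detects corruption, but anyone can recompute it."""
    return hashlib.sha256(image).digest()


def sign(image: bytes, key: bytes = SIGNING_KEY) -> bytes:
    """Keyed tag over the hash: only someone holding the key can produce it."""
    return hmac.new(key, digest(image), hashlib.sha256).digest()


def verify(image: bytes, tag: bytes, key: bytes = SIGNING_KEY) -> bool:
    """Accept the image only if the tag matches (constant-time comparison)."""
    return hmac.compare_digest(sign(image, key), tag)


firmware = b"\x00\x01firmware-bytes"
tag = sign(firmware)
tampered = firmware + b"\xff"
```

A plain hash of `tampered` can simply be recomputed by whoever modified the image, which is why a checksum alone says nothing about who produced the firmware; the keyed tag is what fails for a modified image or a different key.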

Starting point is 00:17:29 It's funny. It's funny? It has to be 100%. Sorry, I'm only saying funny because at Pebble, we actually didn't have secure boot. We didn't have signed payloads. It was more of a hacker device. And so we just assumed the device connected to the mobile app and everything was fine. But I'm thinking back now it's funny because there was a group of people.

Starting point is 00:17:57 They were called Pebble Bits. And they would modify the Pebble firmware in whatever way they wanted where they added new fonts, they built internationalization for, but they basically just like modified our firmware in very different ways, adding like really cool features. But then you would just click the link in the mobile app and it would just like automatically push that firmware to the Pebble, which I thought was fantastic. But like you could install whatever you wanted on that Pebble device, as long as the CRC matched, which is great when you have a hacker device, and it's great when

Starting point is 00:18:31 you have it on your desk as a developer, but it is not great when, like, the President of the United States is wearing your smartwatch. At that point you want a little more security. What do you say? And let's just hope that every hardware company is making sure that, you know, they are using those secure practices. All we can do is write on Interrupt that you should do it. Interrupt is your blog, right? Um, yeah, Memfault's founders, the three of us, kind of just were like, we need to write some content because it doesn't exist. So let's do it ourselves. And it's a good blog and I have pointed to it and been pointed to it various times. And yet I was totally unaware of the connection to Memfault or what Memfault did.

Starting point is 00:19:23 Have you considered maybe just a little more promotion? There is this, I mean, yeah. So our marketing employee, Colleen, would love that. Um, there is this fine line that we are trying to balance between aggressive self-promotion and also trying to build this community on the side of the company that we don't ultimately control. I've seen it time and time again. And the reason I don't like a lot of the embedded systems communities is the ones that you find are, like, almost always owned by a company or enterprise. And, like, the largest and arguably, you know, best LinkedIn group that I found for embedded systems is, like, blatantly owned by a consultant, like, an embedded systems consultancy, and it's just

Starting point is 00:20:19 awful, and they've actually ruined it now. And so we wanted to just not do that. Um, but yes, we should do a little bit more self-promotion. And now that we, you know, have a very good product that we all believe in, and we do think almost any hardware company that's building on embedded systems, and now Android, and soon embedded Linux, like, all of them would benefit from it. And so now we're not super opposed to it. We actually just had a meeting last week about how we're going to get some more people who are reading Interrupt to understand what Memfault is. Marketing is really hard. I mean, because there is that balance between, I did this thing, I think you'll think it's cool, that most engineers are hesitant about.

Starting point is 00:21:06 And then there's this, you know what I need? I need this thing. And not realizing that somebody else has already built it and done a good job of it. I don't know how to do that. I mean, I have that problem with the podcast that I think I should be marketing more. I think there should be more out there. Because I do think it's a good thing. And I think people like it, but I don't really want to market. It's no fun and it's, it feels wrong. Yeah. It always feels like you're advertising to people that don't want to listen. And I mean,

Starting point is 00:21:41 what we've learned a lot is like people people actually want to hear about Memfault and read more content. And yeah, to tie it back. So what we want Interrupt to become ultimately is a community of developers that they feel like, you know, it's at least helped by Memfault. We may provide resources to the community. Eventually it could come into like a more fleshed out website. That's more of a hub that you kind of hop into and learn more about embedded systems, maybe a conference in the future. But we don't want to be the company that owns it.

Starting point is 00:22:20 If somebody else wants to come in and help us out, great. And we will provide resources to it, but that's kind of it. And yeah, like, Embedded.fm should become a community that, you know, if you two want to stop doing it, it should live on, right? Yeah, although if somebody offers us enough money for the Slack, we will totally sell. But it's going to have to be a lot. Yeah. Yeah, $5, $10, maybe $20. I mean, at least buy you a couple meals in San Francisco, right? Well, that's going to be more like $100 then. Yeah, exactly.

Starting point is 00:22:55 Okay, I want to go back to over-the-air programming. Sorry. All the way back to that. Wait, security and hashes for when you get the firmware, or signatures and hashes. What else, as a firmware engineer, do I need to be thinking about with over-the-air updates? You mentioned a secure bootloader. Is that something the vendors are providing now, or is that still something I have to write? Unless you have specific needs or requirements, generally you're not writing it. I think most

Starting point is 00:23:32 vendors are providing it. They're not great. And so a lot of companies are just using, you know, if you're using a standard enough chip, a bootloader is probably built for you, whether that's wolfBoot or MCUboot or, um, Nordic's DFU. And I'm sure Zephyr, you know, they don't necessarily have a bootloader, but they basically will, like, tell you how to go about doing this well. And TI has OAD. Why are there different initials for everybody? This seems like a term we should agree on now. At least DFU. I mean, I think most people who I talk to will now use the phrase DFU. But that's also like the only way that they know how to install firmware too.

Starting point is 00:24:17 So doing the firmware update over-the-air programming device, firmware update, over-the-air downloads, whatever it's called, for a few units in your lab is different than deploying to a few million smartwatches. You don't have to go that far even, but yes. No, you don't have to go that far. I mean, that's drastic. Yeah. But we've done it. What are the steps? What are the gotchas?

Starting point is 00:24:50 And what do I need to know as a firmware engineer taking the steps to get from a few devices to consumer production level? Yes. So yes, I will answer that in just a moment. Even before we get there, you asked, what is the requirement? You have to just have an OTA system that works and has a failsafe, or can restore a very, very minimal firmware that knows how to phone home or send out a signal that, you know, some phone passing by will eventually install a firmware on it. At Pebble, we chose the minimal firmware route. If you booted a firmware and it failed three times in a row within, I think, the span of 15 minutes, we would boot up into what we call the factory firmware, which we had tested and hardened for a very long time that you could install a firmware

Starting point is 00:25:52 over Bluetooth and you could factory reset the watch to absolute factory conditions. And so in my opinion, like that is step one. And that you should build that when you have five devices, or at least you're starting to build sealed units. Because if you can JTAG anything, like you're probably not going to care about a reliable OTA delivery system at that point. Getting to millions of devices, that's a whole different ballgame. I think the buzzwords and actually true words are you need staged rollouts. This is deploying to 10 devices, then 100 devices, and then 1,000, 10,000, and you scale linearly, basically. And that entire time you are getting data

Starting point is 00:26:47 back from the devices, how are you doing? How's the new firmware behaving? And are there any new crashes or anything that I should be aware of? That's a whole different system. We'll talk about that later. But like staged rollouts and making sure that you get some form of ping or heartbeat or status after you've installed a firmware update is probably the most critical thing when you're dealing with the millions of devices. Cause if you update even a thousand, um, and you just don't hear anything from a device anymore, like that's when the sirens go off and you press the big red button on the side of your desk, right? That's when you go to Reddit and see what everybody's complaining about.
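The failsafe Tyler describes from Pebble (three failed boots in a row within about 15 minutes, then fall back to a hardened factory firmware) can be sketched roughly as follows. This is only an illustration of that rule, not Pebble's actual bootloader: the class `BootTracker`, its methods, and the constants are hypothetical, and on a real device the failure timestamps would have to live in storage that survives a reset.

```python
MAX_FAILED_BOOTS = 3          # "failed three times in a row"
FAILURE_WINDOW_S = 15 * 60    # "within, I think, the span of 15 minutes"


class BootTracker:
    """Tracks consecutive failed boots and picks which image to start."""

    def __init__(self):
        # Timestamps (seconds) of failed boots since the last good boot.
        # Assumption: persisted across resets on real hardware.
        self.failures = []

    def record_failure(self, now: float) -> None:
        self.failures.append(now)

    def record_success(self) -> None:
        # A good boot breaks the "in a row" streak.
        self.failures.clear()

    def image_to_boot(self, now: float) -> str:
        # Only failures inside the time window count toward the limit.
        recent = [t for t in self.failures if now - t <= FAILURE_WINDOW_S]
        return "factory" if len(recent) >= MAX_FAILED_BOOTS else "main"


tracker = BootTracker()
for crash_time in (0, 60, 120):
    tracker.record_failure(crash_time)
choice = tracker.image_to_boot(now=130)
```

The design point is that the decision uses only a few numbers kept outside the main firmware, so it still works when the main image cannot get far enough to run any of its own recovery code.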

Starting point is 00:27:31 Exactly. You start reading Amazon, you check Reddit, you check Twitter. Um, yeah, I mean, it's so true. You, you like hit the nail on the head there. That's exactly how we felt at Pebble. Um, as soon as a Reddit thread came up, hey, is version 2.4 broken for anyone else? We're like, stop everything. It's weird to get that sort of feedback from customers.

Starting point is 00:27:53 I mean, that's the exact sort of feedback you desperately don't want. And yet, if there's an error that only happens on one out of 100 units, you're not going to find it in the first couple of stages of rollout, unless you get lucky. Or, or unless that person is a very vocal Reddit user for sure. Yes, absolutely. I mean, it's, it's so, it's so relevant to me as well.
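A rough sketch of the staged rollout idea and of why a 1-in-100 bug tends to slip past the early stages. The stage sizes follow the 10, 100, 1,000, 10,000 progression mentioned in the conversation; the 95% check-in threshold and the function names are assumptions chosen for illustration, and the probability assumes each device hits the bug independently.

```python
# Stage sizes from the conversation: each stage is ten times the previous one.
STAGES = [10, 100, 1_000, 10_000]


def chance_of_seeing_bug(devices: int, failure_rate: float) -> float:
    """Probability that at least one of `devices` units hits the bug,
    assuming each unit fails independently with `failure_rate`."""
    return 1.0 - (1.0 - failure_rate) ** devices


def should_continue(installed: int, checked_in: int,
                    min_ratio: float = 0.95) -> bool:
    """Gate for moving to the next stage: enough updated devices must have
    reported a heartbeat. The 0.95 threshold is a hypothetical value."""
    if installed == 0:
        return False
    return checked_in / installed >= min_ratio


for size in STAGES:
    probability = chance_of_seeing_bug(size, failure_rate=0.01)
    print(f"{size:>6} devices: {probability:.1%} chance of seeing the bug")
```

With a 1% failure rate, the first stage of 10 devices has only about a 10% chance of showing the problem at all, and 100 devices about 63%, which matches the point that you will not reliably find it in the first couple of stages unless you get lucky.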

Starting point is 00:28:19 Like I just remembered one of my first tasks at Pebble, and it was so irresponsible at the time. But I came in, you know, I'm just out of college. And in my first month, I, you know, did a couple tickets, fixed a couple bugs. And then they were like, all right, Tyler, like, no one wants to be the release lead for, you know, this version 2.4. And they were like, you are going to be the one to release this firmware. And it takes about a month, you know, month and a half. You're basically working on it full time. You are triaging every bug that comes in.

Starting point is 00:28:50 You are fixing all the bugs that are easy. And then you're kind of like making sure that all the other ones that are harder or like, you know, more specific to engineers, you're making sure that those all get fixed. You're deploying nightly firmware updates. And ultimately what it means is you're dealing with one or two people that just break their watches, like, every which way. Um, and yeah, that was my second month at Pebble and it was super fun. Thankfully, at that point we had logs, we had core dumps, and we had some very minimal metrics coming back from devices. So we had, you know, battery life and a few heartbeats here and there, so we generally knew how things were going when

Starting point is 00:29:34 we were releasing, even internally. Um, but yeah, like, you have to watch Reddit, honestly. Like, as soon as something comes up there, it's like, let's pause for a second. But this release engineer position, or role, does tend to get passed around because it's not very fun, especially if you're the person who has to make all the versions be right, verify, do the tests, make sure the security keys are in the right place, do some documentation for manufacturing, all of these little things. And then you have to compile the image with the security keys, make sure that you can update the firmware, downgrade the firmware, upgrade the firmware, this whole dance to make sure that it's releasable. It is a pain, but it's also one of the most important things we have to do. And it's the thing we do least often, for a lot of us. And so it's full of mistakes.

Starting point is 00:30:42 Yes. I mean, how many times have you had to write the checklist? I've had to modify the checklist and update it plenty of times. I think during my tenure at Pebble, I think I was the release lead, like, four or five times, and, like, every single release, something changed or it was out of date. And, you know, I skipped a few steps here and there, and I only messed up once. I think we deployed, um, not a bad firmware, but, like, an incorrectly labeled firmware to, like, a hundred to a thousand people. You know, it got the Git SHA, it got, like, you know, 2.8-abcdefg instead of 2.8. Um, but that was my one mistake. That's not so bad. It's not so bad. No, it

Starting point is 00:31:23 wasn't bad at all. But it was also just something that, like, somebody will post on Reddit and just be like, hey, what's going on here? Like, this is not normal. Um, and it just doesn't look great. Working at LeapFrog on consumer devices, uh, we didn't have the problem of over-the-air updates, but we did have the problem of releasing to manufacturing with very strict, I mean, they were going to start... They make mask ROMs, and so you can't change them afterwards. And so you have to make sure you get the firmware right. to change a version number so that it matched some document and did it with a hex editor. Oh gosh. Because if you recompiled, you had to go through testing again, but if you just made it match the documentation, it was all fine. Um, yeah, well, sorry, distraction. Uh,

Starting point is 00:32:19 and when you rebuild a firmware and you try to ship it to people, you have to go through like, depending on how complicated or how sophisticated the company is, like you're either going to say, all right, well now it needs seven days of soak time or 14 days of soak time. But if you, what we did at Pebble instead was like, you track it for a day. And if the battery life is trending in the right direction, we're like, instead of, you know, letting every single watch run out of batteries for 14 days and then measuring the duration of that each watch took to die or needed a recharge, we just like said, okay, cool.

Starting point is 00:32:58 Every single device that is out there today running our firmware dropped 7% today. Okay, great. We're ready to ship the firmware tomorrow. The battery life trends look good rather than waiting 14 days, which I know many, many other teams basically have a requirement to do that. But if you have to wait that long, then if there's a really important bug, you have time and you have more people on Reddit complaining about you. You're preaching to the choir. Yep. There's this balance of, do I let it go?
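The battery shortcut described here (watch one day of drain across the fleet instead of waiting 14 days for every watch to die) comes down to a small extrapolation. The sketch below assumes the battery drains roughly linearly, which real cells only approximate; the function names and the half-percent tolerance are hypothetical.

```python
from statistics import median


def projected_days(drop_percent_per_day: float) -> float:
    """Days a full charge would last, assuming linear discharge."""
    if drop_percent_per_day <= 0:
        raise ValueError("daily drop must be positive")
    return 100.0 / drop_percent_per_day


def fleet_daily_drop(drops_percent: list[float]) -> float:
    """Median daily drop across devices, so a few outliers don't dominate."""
    return median(drops_percent)


def battery_ok_to_ship(candidate_drop: float, baseline_drop: float,
                       tolerance: float = 0.5) -> bool:
    """Candidate passes if it drains no more than `tolerance` percentage
    points per day faster than the baseline (tolerance is an assumption)."""
    return candidate_drop <= baseline_drop + tolerance


candidate = fleet_daily_drop([6.5, 7.0, 7.5, 20.0])
days = projected_days(7.0)
```

A 7% drop per day projects to a little over 14 days on a charge, so a single day of fleet data already says whether the trend is in line with the previous release.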

Starting point is 00:33:38 And with wearables, with Pebble, with Fitbit, that whole, you did something to make the battery life die. If you're running on your desk with a unit that has a power supply instead of a battery, you're never going to know that. Or you're just not connected to the right Android phone from a particular vendor with a particular Bluetooth stack in a particular day of the week. Yes. And then people complain that their batteries die. It's, yes. The number one complaint. Well, number two complaint. Number one complaint is probably it doesn't connect or it drops constantly.

Starting point is 00:34:09 Number two is battery life drops or, you know, is terrible. Okay. You mentioned monitoring the battery life. And we've both mentioned heartbeats. Once I get my firmware out there, what else do I need to know? And this is where Memfault comes in. Like, this is our bread and butter. It's like, once you get the firmware out, what are you tracking and what are you making sure looks good? And so number one, make sure your devices are alive and reporting anything. Number two, um, make sure your devices aren't rebooting. I think the simplest thing

Starting point is 00:34:49 you can track for firmware is count the number of times or at least send an event or find some way to report whether your devices are crashing or resetting or hitting an assert. And then ideally reporting some piece of information about how it's asserting or crashing. And that's usually the program counter or the link register. Or if you have a more complex firmware, you can usually pull the function

Starting point is 00:35:18 or a backtrace, basically. And so at least get those two things, so you can kind of tell whether this firmware is more crashy or not than the other ones. Beyond that, now you're kind of searching for trends, like battery life, you know. Um, and you said heartbeat. That's actually the phrase that I use and Memfault uses for events that happen periodically. We can talk about this as well. It's like, how often do you send these periodic heartbeats? At Pebble, we did it every hour.
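As a sketch of the minimal crash report discussed above (a reset or assert count plus the program counter and link register), the record below packs those fields into a few bytes and then tallies which addresses crash most often. The field layout, the reason codes, and the function names are hypothetical and do not describe Memfault's or Pebble's actual format.

```python
import struct
from collections import Counter

# Hypothetical little-endian layout, 11 bytes with no padding:
# reason (uint8), reboot_count (uint16), pc (uint32), lr (uint32)
CRASH_RECORD = struct.Struct("<BHII")

# Hypothetical reason codes.
REASON_ASSERT = 1
REASON_HARD_FAULT = 2
REASON_WATCHDOG = 3


def pack_crash(reason: int, reboot_count: int, pc: int, lr: int) -> bytes:
    """What the device would store in flash or send with its next report."""
    return CRASH_RECORD.pack(reason, reboot_count, pc, lr)


def unpack_crash(blob: bytes) -> dict:
    """What the backend would decode before symbolicating pc and lr."""
    reason, reboot_count, pc, lr = CRASH_RECORD.unpack(blob)
    return {"reason": reason, "reboot_count": reboot_count, "pc": pc, "lr": lr}


def top_crash_sites(records: list[dict], n: int = 3) -> list[tuple[int, int]]:
    """Most common crashing program counters, as (pc, count) pairs."""
    return Counter(r["pc"] for r in records).most_common(n)


blob = pack_crash(REASON_ASSERT, reboot_count=4, pc=0x08001234, lr=0x08000F01)
record = unpack_crash(blob)
```

Comparing `top_crash_sites` between two firmware versions is one simple way to tell whether a release is "more crashy" than the last, and the addresses can be mapped back to functions offline with the build's symbol file.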

Starting point is 00:35:55 And so for every single hour, we would track how much did the battery life drop? How many ticks or seconds was the CPU active? How many seconds was the Bluetooth chip on? How many disconnects were there on Bluetooth? How much time was I connected on Bluetooth for this hour? And before we shipped any firmware at Pebble, you had to at least meet or exceed those certain trends. So your battery life had to drop less than the previous one or be within acceptable limits. And if your Bluetooth connection time per hour dropped significantly, like, that is a regression in firmware, and we made a bug, or, you know, there's contention on the CPU, or, you know, the connection interval change that you made. Just like Chris mentioned with Android phones, like, the connection interval changed and it made a lot of Android phones upset.
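The release gate described here (hourly heartbeat metrics on a new firmware must meet or beat the previous version, within acceptable limits) could look something like the sketch below. The metric names, the 5% tolerance, and the `regressions` function are assumptions for illustration, not Pebble's or Memfault's actual checks.

```python
from statistics import mean

# Hypothetical hourly heartbeat metrics; True means higher is better.
METRICS = {
    "battery_drop_pct": False,
    "cpu_active_s": False,
    "bt_connected_s": True,
    "bt_disconnects": False,
}


def averages(heartbeats: list[dict]) -> dict:
    """Mean of each metric over a set of hourly heartbeats."""
    return {m: mean(hb[m] for hb in heartbeats) for m in METRICS}


def regressions(baseline: list[dict], candidate: list[dict],
                tolerance: float = 0.05) -> list[str]:
    """Metrics where the candidate is worse than the baseline by more than
    `tolerance` (a fraction of the baseline value; 5% is an assumption)."""
    base, cand = averages(baseline), averages(candidate)
    worse = []
    for metric, higher_is_better in METRICS.items():
        allowed = abs(base[metric]) * tolerance
        if higher_is_better:
            regressed = cand[metric] < base[metric] - allowed
        else:
            regressed = cand[metric] > base[metric] + allowed
        if regressed:
            worse.append(metric)
    return worse


old_fw = [{"battery_drop_pct": 0.3, "cpu_active_s": 120,
           "bt_connected_s": 3500, "bt_disconnects": 1}] * 2
new_fw = [{"battery_drop_pct": 0.3, "cpu_active_s": 120,
           "bt_connected_s": 2000, "bt_disconnects": 1}] * 2
flagged = regressions(old_fw, new_fw)
```

In this example only the Bluetooth connected time is flagged, which is the kind of signal that would point at something like a connection interval change before the release goes wider.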

Starting point is 00:36:56 um can't tell you how many times we had to change that as well um yeah i, I think I can go on and on about this too. It's something that I've seen IoT companies not consider. On one hand, all that information is very useful. On the other hand, if you are a battery-powered device, the more often you send that information, the less often you will manage to live through the whole day or however long your battery is supposed to last. There's a cost associated with sending those reports. Do you have a way to balance the trade-off there? Not one that's not obvious, I guess, right? Like if you, if you are on a coin cell battery where it's like you have one, it's one and done, you know, maybe it's even a fixed battery. Like you can't send a heartbeat every minute, like you said

Starting point is 00:37:57 before. Um, sending a heartbeat every hour or every few hours, or even once a day, is pretty good, actually. You should be able to do pretty well. And if you have persistent storage, what we tell a lot of people as well is, like, store up, you know, a week or two of heartbeats on flash in a compressed format and then send them up when you're ready. Slight plug there from Memfault as well. Like, we are able to store plenty of heartbeats, and I think each metric that you track is basically like six to eight bytes. And if you have, you know, a few K of flash, you can batch up quite a few heartbeats. Um, and yeah, when you then have a connection or

Starting point is 00:38:44 maybe even a user, you know, plugs your device into a wall socket or charges it, then you can send up everything. Yeah, I've had some devices where it's, once you are plugged in, okay, now just send everything you've ever wanted to send in the past. Well, yeah, it's a balance between logging and statistics, too. If you can boil stuff down to statistics, that's a few numbers. Yeah. That's easier to send periodically than, okay, I have, you know, one megabyte of the day's event logs, I gotta ship all that up there. Both can be useful, but there's a big trade-off there. Yes, for sure. And we have found, yeah,

Starting point is 00:39:26 and I think there's a mixture between the two, right? You have logging, you have metrics, and then I think everyone has kind of come out with their own flavor of it. It's like the compressed logging. We call it compact logging. Other people call it hash logging, but it's basically like,

Starting point is 00:39:39 take this human-readable message, give it an ID. You can pass a couple of arguments, and everything is basically stored as, like, uint32s. And then you send up those. And that's much more compact and compressed than sending ASCII text. It's like how Applesoft BASIC used to work. Oh, really? And they wonder why I still know the ASCII table pretty darn well. So there's logging and metrics.
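
The compact (or hash) logging idea can be modeled in a few lines of Python. This is only a sketch of the encoding, under the assumption that IDs and arguments are packed as little-endian 32-bit unsigned integers. The table `LOG_FORMATS` and the helpers `encode_log` and `decode_log` are hypothetical names, and on a real device the encoding side would live in the firmware rather than in Python.

```python
import struct

# Hypothetical ID table. In practice it would be generated at build time and
# kept off the device, so the strings never take up flash or radio time.
LOG_FORMATS = {
    1: "BT disconnected, reason=%d",
    2: "Battery at %d%% after %d s",
}


def encode_log(log_id, *args):
    """Pack a log ID and its arguments as little-endian uint32 values."""
    return struct.pack("<%dI" % (1 + len(args)), log_id, *args)


def decode_log(payload):
    """Expand a compact record back into text using the ID table."""
    values = struct.unpack("<%dI" % (len(payload) // 4), payload)
    log_id, args = values[0], values[1:]
    return LOG_FORMATS[log_id] % args


record = encode_log(2, 80, 3600)  # what the device would store or send
text = decode_log(record)         # what the server would show an engineer
```

Here the record is three 32-bit words, which is shorter than the ASCII message it expands to, and the saving grows with longer format strings.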

Starting point is 00:40:15 There's the statistics that you were mentioning, which you were calling the heartbeat. For me, a heartbeat is just anything from the unit, which usually is this little packet of statistics. And you set those because you have battery issues, but you also, the length of time you don't check in is the length of time it takes for a user who has changed something on the website to get that on their watch. And so this is like if a, if a user clicks install on the app store and then they're trying to send that down to the watch. Yeah. If you're only checking in once an hour, doesn't that mean it takes an hour for it to, to check in?

Starting point is 00:40:56 Oh, I mean, this is more for diagnostic data that the user, I mean, they opt in, of course. I think that's generally the trend now, is you opt into all this diagnostic data, and you have to. Um, this is like the device is basically in control of sending that data. Yeah. Um, for at least Pebble and Fitbit, it was like, you're on the phone, let me install an application, you're directly connected to the device, and then that just sends it over immediately. Sorry, I was back in cell phone, where if you only said hello, actually, this is in the underwater thing I've been working on. If you only say hello once an hour and you're only awake that one time to listen, then somebody has to wait an hour for you to get around to say hello again. Yeah, it's like working on a Mars rover. Yeah, it's like Mars. I mean, you laugh at this, this is a way for people to implement

Starting point is 00:41:53 their own version of staged rollouts, though, right? If a device only wakes up every hour or once every 24 hours and checks in, you know, here's my heartbeat, do you have anything for me? And that's kind of in the payload, right? Like, you send all your information, and then the server responds, like, okay, I got it, and also, like, here's some things you should know about the world. Um, a lot of times that's going to be, like, here's an OTA payload for you to install. And if you just, like, release the firmware for 30 minutes and then turn it off, that's pretty much the staged rollout, right? That's one way to do it. family who will report bugs directly to us, and then go out to the bigger picture, to the larger audience, even though that means you may have a bias towards different cell phones. True,

Starting point is 00:42:52 very true. Or environments. Yes, or environments. I was going to say, like, um, that is always one suggestion as well. We say, like, do your staged rollouts, but also have your internal developers or users, which, you know, is usually the company employees. Like, if you're working for a hardware company, every employee should be required to test your device or use it or wear it. And another thing I always suggest is, like, if your device is experiencing an issue or asserting or has rebooted, like, when you're doing internal testing, make that very loud. If you're making, you know, um, a smart lamp, like, you know, even the simplest thing, like, make the lamp flash on and off for like 30 seconds. And that's like telling the user, this thing probably crashed, like, please load up

Starting point is 00:43:45 your phone and submit a bug report you know internally and at least at pebble like if the device crashed on an internal build we had like a build flag that basically said pop up this window if it reset if this is an internal build um it would like pop up a screen that you couldn't do anything else it was like your your pebble just reset please submit a bug and you had to dismiss it um we didn't do that and that was like we didn't know i mean i pushed for that very hard uh we we did we did do it on i mean now we're getting into history ionic i built it on ionic but it was only for internal and beta testers not sure i ever saw that happen huh okay yeah i think it was a build flag um but it was only i think if you opted in as well and there was i mean yeah okay so that's that's the firmware side that's

Starting point is 00:44:32 some trade-offs on the firmware side and a little bit on the management side. But one of the things at ShotSpotter and Fitbit was, okay, now that I have thousands or hundreds of thousands of units, these 50 or 100 have had problems. How much time do I spend each day looking at those problems or trying to find the root cause or even finding out about those problems? Which, ding, ding, ding, finding out about those problems is the hardest part, right? Um, it comes back to, like, millions of devices. Everyone's gonna have a problem, right? Everyone is, I mean, everyone is gonna have a problem. Well, I mean, not necessarily everyone, but at that scale there will be thousands of bug reports every single day, right? Like, no doubt about it. Thousands. Um, and yeah, it's generally my battery life was bad

Starting point is 00:45:39 and it was probably the user was out of range or something, right? And the other issues will be, my device didn't connect to Wi-Fi or Bluetooth. And it will probably be they have a weird router or phone and it just doesn't work. In those weeds, there are actually bugs. And then trying to find those is the hardest part. And if you're starting out on firmware, and what I see people do time and time again is, like, they build a firmware and they capture logs and they send logs somewhere. They usually end up in some S3 bucket or on some person's hard drive. And, you know, when you're doing 20 devices, you can look through those logs generally every single day and, like, Control-F it or Command-F it, depending on which platform you're on. And you can build some, like, really simple Python scripts

Starting point is 00:46:29 that can basically like parse through some logs. But yeah, like to your point, when you're doing even a thousand devices or a million, like no one is going to find the real issues and especially the new issues that happen, right? And when you get a new issue, like if you've seen this issue a bunch and you've kind of gotten the idea that it happens

Starting point is 00:46:49 and the unit resets and I'm just, I can't find it in the code, but that's okay. But when you get the new issue and you've never seen it before and you're like, oh, is this the start of the tidal wave of problems? How do you bubble those up? How do you decide what's an important thing to tell people?

Starting point is 00:47:21 Yep. And this is where Memfault really comes into play, honestly. Because, yeah, quickly to cover this, what are those issues that are going to be very important right it's probably going to be your device is crashing or it's going to be sounding some alarms on like asserting or or some sort of like really bad like your device and its heartbeat is saying like bug or issue or holding up a red flag, right? Memfault is built in a way that when a device crashes or has a particular log, it will basically capture a signature of it. It captures a core dump or it captures a log. It sends that to our server.

Starting point is 00:48:04 We basically generate a signature of it. And if it's a new signature, we will generate a new ticket. We'll send you an email, we'll send you a Slack message, and we will show it on the front page to be like, Hey, you know, your firmware, the firmware version you just updated and pushed out like has a new bug. And if it's one we've seen before, we will increment a counter. And so it's not this, like, you're not getting a thousand new bug reports that you have to basically like crawl through. You're just being alerted to the one or two new ones that you have maybe that day. Um, and to figure out which ones are actually important, it's probably the ones that are affecting the largest number of devices I would say, or the CEO's device. Usually those two.
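
The "new signature opens a ticket, known signature bumps a counter" flow can be sketched as below. How Memfault actually derives a signature is not described in this conversation, so hashing the fault type together with the top few backtrace frames is purely an assumption for illustration, as are the names `crash_signature` and `IssueTracker`.

```python
import hashlib


def crash_signature(fault_type, frames, depth=3):
    """Hash the fault type and the top few backtrace frames into a short signature."""
    key = fault_type + "|" + "|".join(frames[:depth])
    return hashlib.sha256(key.encode()).hexdigest()[:12]


class IssueTracker:
    """Groups incoming crash reports so that only unseen signatures raise an alert."""

    def __init__(self):
        self.counts = {}  # signature -> number of reports seen so far

    def ingest(self, fault_type, frames):
        """Return True if this report opens a new issue, False if it only bumps a counter."""
        sig = crash_signature(fault_type, frames)
        is_new = sig not in self.counts
        self.counts[sig] = self.counts.get(sig, 0) + 1
        return is_new


tracker = IssueTracker()
first = tracker.ingest("assert", ["bt_send", "bt_task", "main"])
repeat = tracker.ingest("assert", ["bt_send", "bt_task", "main"])
other = tracker.ingest("hardfault", ["flash_write", "log_flush", "main"])
```

In this sketch the second report of the same assert does not open a new issue, so a thousand identical crashes would show up as one issue with a count of a thousand rather than a thousand tickets.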

Starting point is 00:48:46 Yep. Yep. The CEO's device is always high importance. Or the press reviewer. Or the press reviewer. Exactly. Oh, man. Yeah. We've done that as well, right? Like, you put them into a special cohort of devices, and you do not update their firmware during the release event. Or if you do, you make sure it's a special build that doesn't do anything fancy. It's kind of a facade. No matter what you do, whatever button you press, it goes to the next screen and looks perfect. We've done it.

Starting point is 00:49:21 Oh, yeah. It's just a sticker. I remember at Fitbit finding a new issue in the company-wide rollout of a problem. And realizing I didn't know that person, but since this was important and the bug was whacked, I mean, just crazy, couldn't figure out what it was doing, I actually called and said, okay, so, you know, at blah, blah, blah time. This was an internal person. This was an internal person. Never do this to actual customers. Oh, my gosh. Okay. No, this was an internal person who knew they had...

Starting point is 00:50:09 I went into the customer service database, found this person's registration. I just called them at home and said, hey, I noticed your watch isn't working. And they were very confused, naturally, and then looked at the time and then said, oh, that's when I put it in the dryer. Oh.

Starting point is 00:50:27 I decided I didn't have to chase that bug anymore. Yeah. And actually, that's the whole creepiness of that, especially as you go to customers. How do you handle those data ethics? I mean, internal customers and Fitbit was small at that time, but I had the keys to their debug database for a little longer than I should have. How do you balance the, I need this information versus, oh, this shows the customer was in such and such a place at this time. So they must be, I don't know. This is like when the watch that people were running with was showing how the military base was set up.

Starting point is 00:51:22 Right, right. This Strava, yeah. There are different types of debug information that you can send from a device, right? There are hardware metrics, like what is the readouts from these sensors? Like, are the sensors reporting faulty information? I know we tracked some metrics at Pebble where we would record the max and the min

Starting point is 00:51:52 X, Y, and Z axes from the accelerometer. And basically what we would verify from that is like if we just got bogus results for that hourly heartbeat, we knew that that accelerometer, either one is completely faulty and that product should be replaced or two, like something really weird went, went wrong during that time. And like, maybe something else, maybe there's a firmware bug.
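
A sanity check on the hourly accelerometer min/max summary might look something like this. What counted as "bogus" at Pebble is not spelled out here, so the three rules below (min above max, an axis that never changed, values outside a signed 16-bit range) and the name `accel_looks_faulty` are assumptions for the sake of the example.

```python
RAW_MIN, RAW_MAX = -32768, 32767  # assumes signed 16-bit raw samples


def accel_looks_faulty(mins, maxs):
    """Flag an hourly (x, y, z) min/max summary that a healthy sensor is unlikely to produce."""
    for lo, hi in zip(mins, maxs):
        if lo > hi:
            return True  # corrupt record: min can never exceed max
        if lo == hi:
            return True  # axis never changed all hour: stuck or disconnected
        if lo < RAW_MIN or hi > RAW_MAX:
            return True  # value outside what the sensor can report
    return False


healthy = accel_looks_faulty((-120, -90, 900), (150, 110, 1100))
stuck_x = accel_looks_faulty((0, -90, 900), (0, 110, 1100))
```

Only the minimum and maximum per axis are needed, so the check works on the same few bytes of heartbeat data and never touches the raw motion samples.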

Starting point is 00:52:13 And so, like, that's not revealing anything private about the user. And anyway, it's just hardware data. Um, GPS locations are very, very different. That's where the product is located. Um, at least for us at Memfault, like, we tell people explicitly, do not send us that type of information. Don't send us where people are located, how quickly they're moving, um, and anything that is personally identifiable. Like, what if they need that information for their own device management? Does that mean they have to split their stream of information? Generally. And generally they do. Not many people use Memfault as their primary data pipe. They have some other auxiliary pipe that they basically pipe

Starting point is 00:53:07 all of their product or PII or things that make their product completely function. Like they're not, where Memfault is currently ingesting, you know, debug and monitoring information and some sort of configuration management for some devices. A lot of times they even send all of our data to their own servers. And then they send over the Memfault-specific stuff. They basically pass it over from server to server to our service. And that's how they keep a lot of that stuff away from us. And yeah, at Pebble, and at Fitbit too, like, we captured a lot of data, but I would say not much of it, if any of it at that time, was, like, identifiable. It was just, like, how many times was a flash sector read or written to or erased? How long did it take? How long was the heart rate task running? Like, these things are critical to debug, but in no way, like, useful information to identify a person or understand what they were doing. I have some listener questions, if you don't mind. Philip Johnston of Embedded Artistry, when I said you were on, I think he was ready to write the whole outline for me.

Starting point is 00:54:20 He asked really good questions. So let's see. In most orgs I've worked in, they hesitate to outsource device management and prefer to build it in-house. Is that simply not invented here syndrome or are there factors with existing services that drive companies toward that decision? Probably both. I think the most obvious reason why they want to build it in-house is I think what we talked about earlier. There just doesn't seem to be a great solution out there, at least for the factory line provisioning that they need to do. Generally, companies are just going to build that in in house because that's what

Starting point is 00:55:05 they had to do five, 10 years ago anyways. And the same people are going to be working the lines and they know what to do in terms of, are there any, yeah. I mean, in the other existing thing is like, if you're trying to use a device management tool that you don't know if it's going to exist when your product, you know, is nearing its end of life or like is going to continue. Like you're trying to support a product for 10 years. I think in the consumer space, we, you know, I wish it was longer, but we want a product

Starting point is 00:55:36 to maybe last like two, three, four or five years. But if you're building a product for government or a city or a sensor that's supposed to stay in the same place for 20 or 30 years like you probably should build that system yourself so that you can at some point in time like lock it in a closet and never touch it again and hopefully it just continues to work forever who knows if aws is going to want to continue i mean probably not google but who knows if these companies are going to want to support their IoT platforms in five or 10 years. Yeah. I don't know if Google has an IoT device management system and I wouldn't, but I wouldn't consider it. No, they burned me after their Google reader. I'm never trusting them again. That was it. That was it.

Starting point is 00:56:21 Okay. Philip also asked, what are the real challenges with managing a fleet of devices versus what people think are the challenges, but turn out to be easy? All right. Two part question. The real challenges are, are what we talked about before. It's, it's signal from the noise. I think most device management platforms today are truly built for 20 to 100 devices. They are built for, I think, on these dashboards that you see from these products that you're basically looking at, you're comparing your device-managed platforms, the dashboard that they show is like a green or a red box for all of the devices in your entire fleet. And you're basically trying to look for like the one red box

Starting point is 00:57:09 and you're like, ooh, this device number 72 is offline. Like, let me go walk over and see what's up with it or like call the assembly line, you know, manager and ask them to go reboot it. When you're doing thousands, hundreds of thousands, millions of devices, like you're always going to have like a thousand of them red if you're you know using this sort of device management tool and so it's it becomes is is this number worse on previous release or worse in the new release you know was there a regression or an improvement? And I don't believe

Starting point is 00:57:45 Memfault is getting much better at this. I think we're the only company that I've seen do it is like easily comparing release to release. So you just upgraded from 1.0 to 2.0. How do your metrics compare between them? How are your devices behaving? You know, how did the battery life change? Historically, like six months ago, how was the battery life between 1.0 and 2.0? Like all of these things, I just don't believe these device management tools do well, if at all. And yeah, there's always going to be noise, and there's always going to be a signal. It's just like trying to figure it out. I think, I i mean that i think your statistics there and the noise definitely show your fitbit and pebble background um i mean that's true on almost everything that you you have to figure out which of these bugs is important to spend your day on and which of them you have no chance of fixing until

Starting point is 00:58:47 something else happens. But the battery component is one of the wearables that is just makes it that much harder. What about the other part of Philip's question? What do people think is difficult, but it turns out to be easy? People, companies like to think that their product is actually the hard part. You know, this, this, we're trying, I mean, I'm just, I'm just naming things randomly. It's like, let's go build a TV remote. You know what the hardest part is, is building that TV remote. That's what they think. And, and it turns out just not to be.

Starting point is 00:59:23 The problem is actually like managing the firmware updates it's managing customer support and how do you get customer support to understand the low-level firmware enough to know like what's a real bug and what's not a real bug and what's just go reset the device um and yeah i i i do believe that writing the firmware and building your product is probably the easy part because you probably hired or trained people to do that. You have not hired a bunch of people who know how to manage and, yeah, manage very low level, very, you know, ancient like devices in a modern world. And, and, and one of the things that I think people, people struggle with as well is like, you don't know what you don't know if you've never, and you probably have many stories about this as well as like, if a firmware engineer from five years ago tried to build a product in the firmware world today,

Starting point is 01:00:22 they'd pull their hair out for sure they're like you mean i have to like do what i have to communicate to phones routers secure transport um firmware updates every single month every single week even nightly sometimes and you have to like have a beautifully crafted like touchscreen display all all of it um it's just hard not many people there's only been so much time we've demanded these sorts of things from from these little low-level devices um and so i think those are the hard parts because we've not done them before um we only did them at pebble because we were really naive we were like well we think we need these things.

Starting point is 01:01:06 Like we're generally software engineers. Let's learn how to write some firmware. And if we can't build or find the tools, or if we can't find the tools that we needed in the software world, like building iOS and Android apps, like we got to build them ourselves because that's what we know is required. Whereas I think if you build hardware for a living, you don't know that these software tools are required.

Starting point is 01:01:27 So many of the tools that I've taken part in building weren't designed like you're saying. They were the effect of 3 a.m. debug sessions. The realization that, oh, we have to monitor battery life because if we don't, then we don't know that it's broken. How do you get engineers to understand that going, I mean, it's really not something you worry about when it's on your desk or when it's in your lab.

Starting point is 01:02:07 But when it turns into enough devices that people go to Reddit, I don't know why I'm picking on Reddit now. Because it's noisy. It's great. I mean, it's great. Very, very fanboys and girls. I only go to like the origami channel these days it's not a channel is it what are the reddits subreddits subreddits

Starting point is 01:02:32 I think I know where your question is going. It's like, how do you then train or get engineers to understand that, like, they need to focus on these problems now, not when the customer support tickets come flooding in, that the battery life is now bad, right? Because then, as soon as you hear about it that time, then it takes you months to fix. And no one wants that two- to three-month debug session. It's not even the two- to three-month debug session. It's the not,

Starting point is 01:03:08 we have to fix this problem and figure it out, but also, oops, we really should be tracking this since now we have to have a crash program to actually do the kind of logging and stuff that we weren't doing before. Right. And the bug only took like two days to fix, but now you have your release process so that it doesn't have another bug in

Starting point is 01:03:24 it that causes more problems. We're all forgetting the fact that you have to reproduce this issue first as well. You have to understand, and that's, you know, probably the part that, oh man, I mean, the amount of people that are interns or, you know, sad individuals that I've talked to that it's like, oh, I've been trying to reproduce a bug for like two weeks and it still hasn't cropped up. Um, that's the thing with a million devices. If they all run for a day, you can get a one in a million sort of, yeah, bugs get weird. I've talked a lot about this. A plug for an Interrupt article. It is one of my favorites.

Starting point is 01:04:07 It is defensive. It's such a clickbait article, but I love it. Defensive Programming: Friend or Foe. But what I talk about in it is more of this concept of offensive programming. It's, yes, when you have a million devices, like, you're going to get one of every single crash that's in that firmware, pretty much, or like one of every single issue per day. And the goal of that offensive programming is, like, trying to surface as many bugs as possible, as quickly

Starting point is 01:04:48 and as loudly as possible. And what that allows you to do is fix them early and, and very quickly and ideally very easily as well. Um, yeah, I mean, that's, that's the, if you get to that point though, you need a lot of systems in place before that you need data that the devices are sending you that allow you to track down exactly what bugs exist and how did my devices crash and how did my battery life drop? Like what are the different metrics that, that pertain to battery life and kind of contribute to it um oh there's so many more you know ant tunnels to talk about in this topic as well but yes i mean there's there's so much actually so i've done i've done the role where i've monitored the the devices it's not one I'm particularly suited towards.

Starting point is 01:05:47 But I've done it enough that, especially as products come up and go from 100 inside a company to a couple, maybe 10,000 outside a company. After that, I'm just not the right person. I wouldn't say any firmware engineer really is because it becomes more of a data science problem. Is there a new role? Is there a new engineering title for the person who monitors these and tries to prioritize what can happen? It's called the enthusiastic firmware engineer.

Starting point is 01:06:25 Ah, the intern. Ah, the under 30 set. I mean, yeah, I just hit 30 this year. You can turn off your enthusiasm now. No, I will never. But seriously, that is, I mean, if we're going to be honest, that is the role that need that, that generally takes place, right? Like I, I very rarely hear about companies hiring a like higher level firmware engineer.

Starting point is 01:07:01 I think that's the role that I took at pebble. I like slowly morphed myself into like higher level firmware engineer slash I think that's the role that I took at Pebble. I like slowly morphed myself into like higher level firmware engineer slash Python, you know, Python and web app builder. Like I built a lot of web application tools at Pebble. And at Fitbit, like I kind of carved my way into this role after like nine months that was developer productivity tools where you know we built a cli to kind of build and manage the firmware locally and i built some web applications to parse a bunch of the data the device sent i you know it parsed a bunch of core dumps parsed logs got rid of my really bad python script which one exactly the one that oh Oh, they tracked the court.

Starting point is 01:07:52 And, but that, that role doesn't exist. It's usually the, the embedded engineer who spends, you know, some extra nights or, or, or weekends or has done it before or yeah. Who has, who has done it for a previous company and thankfully now there is memfault like you you integrate the sdk and you get most of this data but you still need to be you still need to understand like what metrics to capture and what what does it mean to have this metric be different on this release and this release? And that just happens through socializing and talking to your community and asking, you know, the hard questions and, you know, you asking these questions on the podcast and hopefully people listening.

Starting point is 01:08:36 Well, and you are right because somebody who wasn't intimately familiar with the firmware couldn't look at these trends and understand where the root causes might be. They could write a bug that said battery life is down in some number of units, but it would take a firmware engineer to say, oh, those are all iPhones, or those are all Android phones, or those are all units we shipped in the first month, or something. Well, and it's not just that. It's somebody who has knowledge enough of the product management, or the project management.

Starting point is 01:09:18 I always get those confused. But to see where you are in the feature set, because maybe you turned on a new power, uh, battery-hogging feature, and now everybody's using their GPS to track something and they weren't before. Well, then that's why you're getting, you know, 30% less battery life every day. So, woohoo, heart rate works. Oh, now my battery dies. Oh, we shipped that heart rate feature, but you probably shouldn't keep it on all the time. You also do tools. I think we're going to have to have him back to do the tools conversation. It's a long conversation.

Starting point is 01:09:55 Well, because I had a lot of questions. I know, and we're already... All right. How much time is it? We're at an hour and 15 now. Yeah. Oh, my gosh. Sorry.

Starting point is 01:10:06 No, it's great great this is very good but i do want to talk about tools and we would not do it justice if we were to try to do it now i'm happy to come back part two there there's oh there's so much more to talk about there's so much yeah and i mean this this whole device management thing is going to become a bigger problem as we go on forever it's always going to be bigger and bigger and i'm still going to call them distributed systems darn it it's it's a good term i just haven't you know heard that before when talking about embedded devices i mean it's not actually the first one working together it's not like all the fitbits are working together. They're all individual systems. That was never what distributed systems meant.

Starting point is 01:10:48 It isn't? It doesn't imply a mesh of any kind. It doesn't? Tyler, I heard Memfault is hiring. Would you like to give us more information? Yes. Currently, we are hiring for a firmware solutions engineer, and that is building up our SDK, talking to customers, and generally being an evangelist for the company, and also a data engineer.

Starting point is 01:11:14 All these devices send us a bunch of data. We have to analyze it, store it, and produce insights and tell people how their devices are failing or succeeding in the field. And yeah, we're looking for a data engineer. And Tyler, do you have any thoughts you'd like to leave us with? It's more of a, yes, it's more of a like, this is what I've learned over the last,

Starting point is 01:11:38 you know, two years in COVID, but kimchi is very easy to make. And I suggest everyone try to make some kimchi at home if they like it. Unexpected, but excellent. Our guest has been Tyler Hoffman, co-founder of Memfault. If you'd like to check out their blog, well, it'll be in the show notes. But if you can't find that, type Interrupt and Memfault together, and you will definitely find it. Thanks, Tyler.

Starting point is 01:12:06 Yeah, thank you both. Have a great one. Thank you to Christopher for producing and co-hosting. Thank you to our Patreon listener Slack group for questions, in particular, Philip Johnston, which reminds me, if you've been considering supporting us in Patreon and you want to join that Slack, now is a really good time as the book club just started some really cool new stuff. Finally, thank you for listening. You can always contact us at show at embedded.fm or hit the contact link on embedded.fm. And now a quote to leave you with.

Starting point is 01:12:40 This one's from Jack Kerouac. My fault, my failure, is not in the passions I have, but in my lack of control of them.
