Embedded - 451: From Concept to Launch

Episode Date: June 8, 2023

Phillip Johnston of Embedded Artistry, Tyler Hoffman of Memfault, and Elecia White discuss the software tasks that tend to fall through the cracks after the device has all its features but before it i...s in customers' hands. Noah Pendleton of Memfault was the moderator. You can see the video on the Embedded YouTube channel or directly from Memfault (also see their other panels and webinars). Memfault’s Slack Channel and Interrupt Blog are both excellent resources for embedded information of all kinds.

Transcript
Starting point is 00:00:00 Hello and welcome to Embedded. I'm Elecia White and this week we have a special episode for you. Tyler Hoffman and Phillip Johnston and I were on a panel for Memfault talking about from concept to launch, what it takes to build and ship a device. This is all the software you forgot about when you were developing your product. Take a listen and watch Memfault's newsletter for more panels like this. Welcome everyone to our quarterly embedded panel. This is a really great panel we have. As always, we've got an amazing topic to cover as well. And so we're going to dive right into it. To kick us off, Elecia is going to just describe what the topic is for today's panel. We are going to talk about, from concept to launch, what it takes to build and ship a product.
Starting point is 00:00:57 This is all the software that you maybe forgot to write, the manufacturing software, a little bit of the cloud software that interfaces with your device. I'm excited to have Tyler and Phillip join me because they're going to be experts in this area as well. So those things you forgot to do, this is the new list. Amazing. Thank you very much.
Starting point is 00:01:20 So now we're going to go through a quick round of introductions, although for folks who are joining us from previous sessions, they'll probably know these faces. Have we got anyone today? Elecia, do you want to introduce yourself first? Sure. My name is Elecia White. I host the Embedded podcast, where we talk about all sorts of embedded topics, mostly with guests, and we find out maybe more about people than technology. I also have Logical Elegance as a consulting firm that I run with my partner.
Starting point is 00:01:51 And I wrote a book called Making Embedded Systems from O'Reilly. And I have taught a class by the same name for Classpert. So yes, this is where I live. This is my field. Amazing. Yeah. Embedded FM is one of my favorite of all time podcasts. So yeah, it's great to have you here today. Phil, do you want to introduce yourself? Hi, everyone. I'm Phillip Johnston. I'm the founder of Embedded Artistry. I'm an embedded systems consultant and educator. And I've been doing this for about 13 years now. And I've shipped a lot of products in that time.
Starting point is 00:02:29 So I've seen and experienced all the things you forget about multiple times. So it'll be fun to get into that with this crew. Awesome. And Tyler, could you introduce yourself? Yeah, I'm Tyler, one of the co-founders of Memfault. I'm not a consultant, full-time employee over here. Yeah, I got my start in firmware at a company called Pebble. We were making smartwatches, had a few million or a couple of million of those in the field. And I just found myself constantly working on tools and then doubled down on that at Fitbit
Starting point is 00:03:00 when I was a firmware engineer there. I was building a lot of tools. A lot of the things that we'll be talking about here, did some work on the firmware, of course, as a firmware engineer, but it was not my focus. And so I'm excited to talk to the group here. Noah, do you get to introduce yourself too? Sure, why not? My name is Noah Pendleton. I'm moderating today's discussion, so you won't hear too much from me. Lucky for you guys. Yeah, I'm a firmware engineer by trade. I've been doing it a long time and currently an employee at Memfault, working with Tyler every day,
Starting point is 00:03:27 which is basically a dream come true. So it's been a good career for me. So, all right. Well, thanks everyone for introductions. Let's dive into our first group question. So just a little bit of background on this format for folks who are joining us for the first time. We do a couple of group questions
Starting point is 00:03:44 and a couple of individual questions just to keep the discussion flowing. And yeah, keep those Q&As coming in as well. Super, super fun for us to answer those questions from you. All right, so first group question for us is we're gonna talk about some fundamentals in this sort of background in shipping an IoT product. So the question is what pieces need to be in place
Starting point is 00:04:04 when you want to have your device being ready for release to manufacturing? So one of the things that... I mean, that's, of course, you need a rock-solid bootloader. You need a way to update your devices. If you're even thinking about getting into manufacturing, and you have to have the ability to fix things, you have to have a bootloader, you have to have an OTA. And then my favorite thing as well is we go through manufacturing and leading up to it and have bugs and have bugs in the file system or other various things. I always strive to have a factory reset. I think we kind of didn't do factory reset correctly on maybe like Pebble v1. But on v2 and v3, it was like the most reliable,
Starting point is 00:04:51 most robust, it cleared almost everything. There was no usage of any sort of file system or anything that could be corrupted. It was like the most brutal factory reset, but it was the thing that saved us probably many, many times at the end of the day. So those are a couple of my recommendations, I guess a few. I would say you need to have manufacturing software. The difference between making one prototype or 10 engineering prototypes
Starting point is 00:05:21 and making a thousand or a million is huge. You can't just fix it on your desk anymore. You have to make it so that someone else can program it, can test it, can make sure the hardware is good, and then do all of the things that we're going to talk about with respect to, does this unit connect to the right cloud? So there's a lot more that goes on. I think I'll throw in maybe something even more fundamental. Before you start manufacturing your devices, you really need to have a clear idea of what done looks like for your system. And you really want to be there. You don't want to do what I've seen so many times, which is,
Starting point is 00:06:11 you know, we're really rushing forward on producing the hardware so we can, you know, make our September deadline to start producing units so we can be in stores by Christmas. But then your software team's actually another six or eight months out. So all you're doing is spending money to put things in a warehouse. And you could have used that time refining your hardware or figuring out various problems that you're just kind of glossing over because you wanted to start producing units as early as possible. But that's how we get, as customers, devices that as soon as you open them, you have to update their firmware. Maybe three or four times. Right. Which I don't enjoy.
Starting point is 00:06:52 It's probably my least favorite part about buying a new product. It's not a good customer look. The thing that I hear so many people forget about with manufacturing in general is that there's no internet connection or very little internet connection in the factory itself. And so we've had a number of customers that are like, oh, our device requires that we contact the server. Or even maybe from the beginning, their debug flow requires an LTE connection or a cloud component, and they don't actually build the local CLI debugging experience.
Starting point is 00:07:27 And so they get to the factory assembly line and everything is broken or just completely dysfunctional. Have an offline mode. Yeah, we forget that a lot of these factories are in places too where there might not be great connectivity for you to rely on,
Starting point is 00:07:44 let alone the fact that you're dealing with some other company's network security now and trying to get your stuff out. And how are you going to put a dedicated antenna just for you on top of their factory? It definitely is something that you do run into a lot that can really throw a wrench in the works if you banked on something that you just can't do or it's going to take you a year to get your manufacturer to actually take care of that. And don't forget the problem that if you do have a wireless device in your office, maybe you have 15 working. But now in manufacturing, you're building a thousand an hour and 200 of them are on at a time, and they can't talk to anything because they're trashing each other's network.
Starting point is 00:08:32 So there's just all the manufacturing pieces that I've seen. I've seen that happen, where people suddenly have enough devices that everyone in their office has one, or they're building them in manufacturing, and suddenly the device no longer works. And you can't see why. It's not telling you that it's broken. And the software that used to work, that sometimes works when everybody goes home, it suddenly doesn't work very often. It isn't consistent. It's crunchy and hard to use. And that's a manufacturing problem that comes up a lot. Electrical engineers would have,
Starting point is 00:09:16 they would talk about like design for manufacturing, which is more about like, okay, how do we actually build this thing? But from the firmware side, I think that sometimes gets forgotten. Like you had mentioned, you know, what does your manufacturing software look like? And that's a pretty like important piece of the puzzle.
Starting point is 00:09:32 And it is something firmware engineers end up doing because no one else can. I mean, who else is going to write the, Oh, why don't you blink green if all of your hardware is in place? That's not usually the manufacturing engineer's job. They're working on getting their manufacturing line to be efficient, not just to get it up for the first time. I don't actually know if it's all too common. One of the things I loved at Pebble, at least, is even if it was the form factor board, we still had a fixture that gave us every debugging utility
Starting point is 00:10:09 that we could possibly want as if it was a development board. And so it had all the pins, had JTAG, had serial. It behaved as if it was a normal development board, which I guess I never found the ones at Fitbit for the fixtures to allow us to do that. But cracked open a lot of watches, had to break through the glue. Thankfully, we added screws later on because that was a pain. The ability to open the board or the sealed unit was incredibly important.
Starting point is 00:10:39 I was going to take this to a slightly different direction in that you mentioned cracking open units and figuring out what's going wrong. I think another thing that is easy to forget about is we're selling devices to customers. Some of those devices are going to break in the field or not work or have some performance characteristic that we don't understand that makes the experience bad. And so you do need to have a repair flow
Starting point is 00:11:03 and the ability for your team to actually investigate these units and figure out what's going wrong. You need to be able to do all the things you might do at the factory, at your office, or some other place where you're performing these repairs. So you can send them back out to a customer or you could put it into a refurbished unit box that you're selling at a discount or something like that. That's something that is critically important. And also, how are you going to handle your customer support needs? Somebody needs to be able to contact you to file an issue. You need to be
Starting point is 00:11:35 able to keep track of all this stuff as it's going through the various steps in the process. All of that needs to be designed and handled. Well, then usually you don't want it to be you. I mean, you're saying like at your desk. No, really what you want is to write your documentation well enough that you don't need to be involved. Somebody else who isn't an engineer can do all the preliminary, so you only get the interesting bugs. That's right, if you've done your job right. You mentioned something interesting there, Phillip, about the out-of-box or RMA side of things.
Starting point is 00:12:10 Is that something you've experienced with developing a piece of firmware that enables that? I have used the same manufacturing test software to help set up a repair line that can be used. It's sort of like a smaller factory, essentially. So I've been involved in that process or with early field FA when you're
Starting point is 00:12:31 having all the engineers look at the first 100,000 product returns that come back in and actually do that triaging yourselves and trying to figure out what the factory process is. So I have spent quite a lot of time dealing with that. But usually I just use the same manufacturing software, if at all possible, for that. When you're debugging the first 100 devices that come back from the field, are you just constantly updating those devices too
Starting point is 00:12:59 to add more logs? Did you add enough to begin with? I'm imagining, I was never in that position, but I'm imagining at Pebble, we would just constantly update those watches like every couple hours, be like, oh, let's add a log line here. Because you're trying to figure out how the hardware is failing. Like there probably are hardware bugs. Yeah, it depends, I guess, on the actual problem. I have done that. And I've also been in the case where, you know, whatever firmware we had was good enough to get the information, and it was clearly like we had a factory escape or some other issue
Starting point is 00:13:31 that happened, you know, and that wasn't required. But I've definitely done both. It's more difficult, though, if you've like blown the JTAG fuses and you can't actually, you know, easily connect to the unit to debug. But that's not always the case, thankfully. Well, sometimes those first beta units, the ones that don't usually go outside the building, but do go to people who aren't engineers, won't blow those fuses for that reason. But once you do, it does get a lot harder to debug.
Starting point is 00:14:08 And I have, in the first hundred or first thousand units out, if they report back to a cloud server, I have had them aggregate why they reset and then chase down the boot causes. In fact, there was one in the early days of Fitbit where I actually called the person and said, okay, at three o'clock, your Fitbit did some weird reboot and I don't understand. So do you know what happened? And the answer was really surprising. That was when the Fitbit went into the dryer. Covered by warranty, right? Just RMA the unit out next time. I mean, that was still internal release, but yes, that was not a bug that I was going to spend a lot of time chasing down then.
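The idea of having early units report why they reset, and then chasing down the boot causes, can be sketched roughly as below. This is a minimal illustration, not anything Fitbit or Memfault is known to run: the report format, the reason names, and the `EXPECTED` set are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical reset-reason names. Real MCUs expose the cause through a
# vendor-specific reset status register, and the firmware maps it to a code.
EXPECTED = {"power_on", "user_shutdown", "firmware_update"}

def summarize_boot_causes(reports):
    """Count reset reasons across check-ins and split out the unexpected ones."""
    counts = Counter(r["reset_reason"] for r in reports)
    unexpected = {reason: n for reason, n in counts.items() if reason not in EXPECTED}
    return counts, unexpected

# Example check-ins as they might arrive at a cloud endpoint (made-up data).
reports = [
    {"device_id": "A1", "reset_reason": "power_on"},
    {"device_id": "A2", "reset_reason": "watchdog"},
    {"device_id": "A3", "reset_reason": "assert"},
    {"device_id": "A2", "reset_reason": "watchdog"},
    {"device_id": "A4", "reset_reason": "user_shutdown"},
]

counts, unexpected = summarize_boot_causes(reports)
# Most frequent unexpected causes first: these are the ones worth chasing.
for reason, n in sorted(unexpected.items(), key=lambda kv: -kv[1]):
    print(f"{reason}: {n}")
```

With only a hundred or a thousand units in the field, a table like this is often enough to decide which device owner to call.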
Starting point is 00:15:03 Amazing. What a use case. I guess I didn't count too many steps during that. Actually, it had, I mean, the washing cycle. All right, that's great. Yeah, this has been good stuff. We're talking a lot about sort of the manufacturing side of pieces or even the opposite of things. How does that differ from what you need to have ready on the device prior to it landing in a customer's hands rather than just hitting the manufacturing line? For all that we hate updating devices as customers, that has to be rock solid. That is the piece that you cannot do without.
Starting point is 00:15:44 But I know you guys already talked about OTA. What else would you answer? You'd certainly have other backend servers that need to be up and running, whether your device is checking in for remote monitoring or you have some kind of IoT backend that you're dealing with, right? That's a whole secondary software system that you're dealing with, right? That's a whole secondary software system that
Starting point is 00:16:06 you're building that has to be in place and functional and tested to actually make your device work. Same with, you know, your phone applications or desktop applications or your web interface, however you're, you know, engaging with the device. That needs to be completed and ready to go too. And I've certainly seen phone apps delay hardware product ship dates because firmware's ready, the product's been built and sitting in a warehouse, but we didn't finish the iOS app on time
Starting point is 00:16:34 or some other problem is gating that. So all the various pieces that go into building and managing a fleet of devices have to be ready together to make that work. And then trying to find the most stable way to update. Sorry. No shit. Thank you.
Starting point is 00:17:02 Trying to find the most stable way to update the device. At Pebble, the factory firmware, the firmware that the customers receive, never changed. And I think the only time, maybe it changed once, because as time went on, Android phones, some of them, some of the random ones, could no longer actually do the initial firmware update on the device, because their Android Bluetooth stack
Starting point is 00:17:24 was actually so bad or broken in weird ways that we actually had to add some workarounds to the factory firmware. But the fix for that, for those customers, I mean, we had a wide test plan, but like the fix for those customers was like, go borrow your friend's iPhone, download the Pebble app, or install the firmware, which has, you know, the newest firmware has all the fixes and the workarounds that we had. But like that was truly the fix. But it's like finding the way that is the most stable over time to basically
Starting point is 00:17:54 reset or update the device is something that I never thought about either. Like the firmware is stable. It's great, but it doesn't mean the things that you connect to that are going to be stable. Yeah. It's quite challenging in these, you know, fast moving IoT deployments. It's a really good point. All right, I think we're going to move on to our first individual question. So Tyler, you're in the hot seat first.
Starting point is 00:18:20 Yes. So your question is, I think you're going to like this one too. So for customer support flow, how do we get enough information out of our customers to make your life easier? Especially when you're wildly successful and you have thousands of devices, you know, hitting customers' hands. Make sure to get their phone number too, so that you can call them if anything crazy goes on.
Starting point is 00:18:44 I think there's so much to touch on here. One of them, so yeah, at the very, very first thing, like beta testing with some customers that are not engineers or don't really know how to use the device, I think is incredibly important. And at that point, as long as the device is storing information on the device and you can retrieve those devices, that's probably good enough. So that you can start debugging things. These beta customers are usually cheerleaders of the company. I know at Pebble, we had maybe 100 or so trusted people who were like, sure, I'll touch a beta. I'll have a beta unit if it means I can use the product early, but their devices are crashing like constantly. And, and, you know,
Starting point is 00:19:29 they wore two watches instead of one because they, um, they needed to have a watch that worked too. Um, but yeah, we tried to store as many logs on those devices, knowing that we may replace them sooner, even if the flash chip burned out, like that's fine for us in those cases. Getting a way to retrieve the data off that device in some manner is also incredibly important. At Pebble, our flow for that, we had a mobile app, which connected to the device. Our flow for that was if you clicked report a bug in the mobile app, it would then kind of pull off the logs from the device.
Starting point is 00:20:06 It wasn't just an automated fashion, but it was like an on-demand sort of thing. And then over time, we pulled off more information from the device. We added new Bluetooth endpoints to ask the device, like, what firmware version are you running? What are some metrics that are key rather than just pulling off the logs?
Starting point is 00:20:22 And then, of course, we built the whole system to pull it automatically and then automate it and collect it all. But that was probably the most important thing. One, the report-a-bug flow, making sure a customer has some sort of way to pull data off the device. Two, my favorite feature was...
Starting point is 00:20:42 One of my favorite features of any of the internal releases at Pebble was when the device crashed or experienced a known issue, we could basically pop up a banner that told the customer please go into the mobile app and report a bug. And so we could do that in special situations.
Starting point is 00:21:00 Maybe we're really trying to track a hardware bug for a user. We don't want to just collect their stuff automatically, but we're going to tell them, bug for a user. We don't want to just collect their stuff automatically, but we're going to tell them, please file a bug. And that was good for the various employees at Pebble, legal team, people who didn't really know a lot about the engineering side, but they at least knew how to report a bug. That was because the other company I worked for didn't have that banner.
Starting point is 00:21:30 And so what actually happened was a lot of bugs just went unnoticed until a customer noticed. And so even internally, we had hundreds of employees wearing the devices. All the crashes went unnoticed because the device didn't even have a boot logo usually. Because it was like, oh, if you have a boot logo, then people are going to notice it's rebooting. So having the banner was really good for stabilizing the firmware internally. Some people got really annoyed by it because they'd have to report a bug every hour, but such is life. Part of what you're being paid for. It's true. All I do is report bugs. I use the watch. I use it in various different environments. I click the button mash every so often. I do that for trying to force crashes. It's actually quite fun. Yeah, go ahead. I was just going to say, nice. Yeah, those would definitely be the two top of my list for sure. Make it easy for people to report things. And then the question you have to ask yourself is, do you leave that banner enabled for real customers? We were confident at that point that the device was not crashing. Like we had enough metrics. We had enough data. We would not really ship a firmware out unless it crashed less than once every
Starting point is 00:22:49 seven days on average. And so that was fine. If it crashed every so often, people would notice it's crashing, but we live in the world of IoT and that's fine. No one's doing mission critical stuff with their Pebble smartwatch at the time. I want to believe that Fitbit, I think, was a 14-day average before shipping. I want to say it was a little bit higher bar.
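The release bar discussed here, a fleet-wide average of at least seven days between crashes (with a stricter 14-day figure recalled for Fitbit), comes down to simple arithmetic over device-days. A rough sketch, assuming crash counts and total device-days are already being collected from internal testers; the function names and example numbers are invented for illustration.

```python
def mean_days_between_crashes(device_days, crash_count):
    """Fleet-wide average: total days of device use divided by crashes seen."""
    if crash_count == 0:
        return float("inf")
    return device_days / crash_count

def ok_to_ship(device_days, crash_count, min_days=7.0):
    """True when the build meets the bar of min_days between crashes on average."""
    return mean_days_between_crashes(device_days, crash_count) >= min_days

# 100 internal testers wearing a build for 14 days is 1400 device-days.
print(ok_to_ship(1400, 150))               # roughly 9.3 days between crashes
print(ok_to_ship(1400, 400))               # 3.5 days between crashes
print(ok_to_ship(1400, 150, min_days=14))  # the same build against a 14-day bar
```

A single average hides outliers, so a team relying on a gate like this would likely also look at how crashes are distributed across devices.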
Starting point is 00:23:13 Our clientele at Pebble was hackers and developers, so they were probably fine with it. But we did not leave it enabled, no. But we, of course, left almost everything enabled in terms of the report bug, collected a bunch of data from the device, and then allowed engineers to really quickly kind of figure out the actual issue that was going on.
Starting point is 00:23:35 And that was core dumps, metrics, and logs. So that was great. I think the other thing that was important at the time for Pebble, one thing we did kind of collect automatically was just raw metrics from the device, numbers, battery life, battery drain, CPU usage, LCD backlight usage, because that also played into battery, like a bunch of things around battery and connectivity.
Starting point is 00:24:01 And the support team could pull that dashboard up and see over the last 10 days, what were the metrics on the device looking like, so that they could help. So for customer support, it could help paint a picture of what was going on on the device. And for engineers, we could kind of see if there was a weird regression, or if the accelerometer was stuck on due to a new bug that we had never seen before, or if the Bluetooth radio was like being used a lot and that was causing the battery drain. There were a lot of metrics that we would continue to add, I guess,
Starting point is 00:24:33 firmware release by firmware release, and that was actually kind of fun. Debugging devices remotely in production using only numbers, you do a lot of creative things. That's great. Yeah, we could do certainly a whole webinar, a series of webinars probably on that. But yeah, I do love it. Just collecting some form of number. I think Elecia had touched on it kind of during our, even our chat before,
Starting point is 00:24:57 but it's collect some vital metric or heartbeat or ping or just know a device is alive and collect maybe a reboot reason if that's kind of the minimal case. Like know why the device rebooted. If the devices are rebooting due to the user shutting it down, that's one thing. But if it's due to a fault or an assert, it's another thing. Nice. Thank you. That's a super great answer. I love all of it. All right. So next up, we have Phil for an individual question.
Starting point is 00:25:25 So your question is, what are the basics that we need to think about for manufacturing tests? Yeah, it's dangerous to get me started on this topic because I could go on for a long time. We touched on manufacturing firmware. Obviously, that's essential. We don't usually want to use our customer software for manufacturing tests for a number of reasons. One is our customer software is often doing things autonomously or in response to events. Say you're building a camera and you want to press a button and that's going to start a recording or stop a recording. That's not really behavior I want to have happen on the manufacturing line.
Starting point is 00:26:05 So I need firmware that's not really doing anything unless it's instructed to, and it's only doing what it's instructed to. So I can get this deterministic environment to actually use for testing. And you also need functionality that you probably don't want your customers to have access to in their firmware, whether that's just for the possibility of something going wrong or for somebody nefarious trying to poke around your system. So you might add extra capabilities just for the purposes of manufacturing tests that you don't need in your customer firmware. And there are other things you need to think about,
Starting point is 00:26:42 like you need usually the ability to set up the device's initial configuration and write all that critical information to some kind of non-volatile area in flash. So, you know, your manufacturing firmware probably is going to have the smarts to unlock that region of flash to write to it. But you don't want your customer firmware to have the ability to do that. So you can guarantee that in whatever process is happening, that region of flash is going to stay locked and my factory-written information will remain valid.
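The split described here, where manufacturing firmware can write the provisioning region and customer firmware cannot, can be modeled in a few lines. This is a toy model only: `ProvisioningStore` and its methods are hypothetical names, and on real hardware the lock would be enforced by flash protection bits or one-time-programmable memory rather than by a Python flag.

```python
class ProvisioningStore:
    """Toy stand-in for a lockable non-volatile region (hypothetical API)."""

    def __init__(self):
        self._fields = {}
        self._locked = False

    def write(self, name, value):
        # Only the manufacturing image should ever reach this path unlocked.
        if self._locked:
            raise PermissionError("provisioning region is locked")
        if name in self._fields:
            raise ValueError(f"{name} already provisioned")
        self._fields[name] = value

    def lock(self):
        # Called as the last provisioning step on the line.
        self._locked = True

    def read(self, name):
        return self._fields[name]

store = ProvisioningStore()
store.write("serial", "SN000123")       # placeholder serial number
store.write("mac", "02:00:00:00:00:01")  # placeholder MAC address
store.lock()

# Customer firmware can still read, but any later write is rejected.
try:
    store.write("serial", "SN999999")
except PermissionError as exc:
    print("rejected:", exc)
print(store.read("serial"))
```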
Starting point is 00:27:12 So it's a second application you're writing, essentially. You can reuse a lot of what you're doing for your customer-facing application, but they will diverge pretty heavily at some point. Following that, you sort of need to know how to test your product. And that's unique to every product. It depends on what you're doing, how complex it is, what you need to check at the factory. But I find that every product has a pretty standard manufacturing flow with the same basic requirements, and you can kind of build on that basic flow. So you've manufactured some PCBs. You need to put your manufacturing software on it, right?
Starting point is 00:27:52 You need to be able to do what we call provisioning, which is writing out critical information your device needs. So PCB serial numbers, final device serial numbers, MAC addresses for your radios, security keys for code signing, or authentication with your server, whatever it might be, that information is going to be written at the factory. And so you're going to need processes for doing that. And usually that happens alongside the flashing step. You're going to want to test your PCBs to make sure that there's no short circuits, open connections,
Starting point is 00:28:27 there's no defective components before you go through all the effort of assembling that into a finished device. Once you've assembled a device, you actually need to make sure that assembly went well, right? So you're going to have some tests that will run through all the basic functionality checks to make sure your assembled device works well. You might do some calibration steps. Say you're building a camera again, right? I'm going to do some color calibration on all my cameras so that when I record video, I'm getting as close to the same colors out of all my cameras as possible. And then at the end of the line, you're going to need to flash your customer software on it and potentially, you know, put your device into a shipping state. If that's relevant, you might, for example, open a battery FET to make sure you're not losing charge while you're just sitting in a box in a warehouse. So your customer actually receives, you know, a unit that they can start using immediately. So from all of that, you kind of, you know, if you can hit all
Starting point is 00:29:27 of those steps, you've got a basic manufacturing process that you can incrementally build on over time. Nice. Thank you, Phillip. As usual, very, very thorough answer. Love it. One follow-up sort of question on that that I wanted to ask was this out-of-box ship mode situation. Since that's part of the manufacturing test image, would you say that in general you would recommend leaving your manufacturing test commands or whatever on the device while it's going to the warehouse? Or would you say erase that and put in some alternate image? Usually the last step is to flash the customer firmware, which probably is not going to have those manufacturing interfaces on there.
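The basic manufacturing flow laid out above (flash manufacturing firmware, provision, test the PCB, test the assembled unit, calibrate, flash customer firmware, enter ship mode) can be sketched as an ordered list of steps that stops at the first failure. Every step body below is a placeholder assumption; a real station would talk to a fixture and the device over serial or a debug port.

```python
def flash_manufacturing_firmware(unit):
    unit["firmware"] = "manufacturing"
    return True

def provision(unit):
    # Placeholder values; real lines pull these from a tracked allocation.
    unit["serial"] = "SN000123"
    unit["mac"] = "02:00:00:00:00:01"
    return True

def pcb_test(unit):
    # Stand-in for checks for shorts, opens, and defective components.
    return not unit.get("has_short", False)

def functional_test(unit):
    # Stand-in for the post-assembly functionality checks.
    return True

def calibrate(unit):
    unit["calibrated"] = True
    return True

def flash_customer_firmware(unit):
    unit["firmware"] = "customer"
    return True

def enter_ship_mode(unit):
    # For example, opening a battery FET so the unit holds charge in the box.
    unit["ship_mode"] = True
    return True

STEPS = [
    flash_manufacturing_firmware,
    provision,
    pcb_test,
    functional_test,
    calibrate,
    flash_customer_firmware,
    enter_ship_mode,
]

def run_line(unit):
    """Run each step in order; record the first failing step and stop."""
    for step in STEPS:
        if not step(unit):
            unit["result"] = "FAIL: " + step.__name__
            return unit
    unit["result"] = "PASS"
    return unit

good = run_line({})
bad = run_line({"has_short": True})
print(good["result"], good["firmware"])
print(bad["result"], bad["firmware"])
```

Keeping the steps as a plain list makes it easy to reuse a subset of them on a repair line, which matches the earlier point about reusing manufacturing test software for returns.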
Starting point is 00:30:10 You could, if you have a bootloader and you're just going to go through an OTA process anyway, maybe you could skip that step. I don't think I've ever done that though. I think usually there's like a known good, we want to ship the units with this customer firmware that we've tested and make sure that we can update from and we know it's reliable, just as a starting point. Nice. Thank you. All right. Our next individual question is for Elecia. What are some strategies that can be employed to store keys,
Starting point is 00:30:40 credentials securely on devices that need to connect to a network or send data to a server? Security. Well, I decided that I couldn't do this without slides. So I'm going to start with an incredibly brief introduction to public-private key encryption. And I took these images from Wikipedia. It's a great introduction, so go there. But hopefully I'm just going to remind you of a few things you already know. So public-private key encryption
Starting point is 00:31:12 is a really good way to start your encryption journey. And you decide you want to have security, so you do this key generation thing where you get a private key and a public key. It doesn't matter what these are. Both are pretty valuable, although the private key is more valuable. And now if somebody wants to talk to Alice, who has the key, they take her public key and they use that to encrypt their data and they send it to Alice and Alice can use her private key to decrypt it. Note that the public key does not decrypt data. It can only encrypt data. So if you're thinking about an embedded device, this is the TX line.
Starting point is 00:32:00 This is the transmit line. And you're going to need another set of keys. You're going to need Bob's keys to go the other way. You can only go one way per set of keys. So here we have Bob talking to Alice, using Alice's public key, and Alice is decrypting using her private key. Go to the next slide, please. Now, so now that image with Bob's transmit is on the left. And on the right, we have the TX and RX, where Alice has her private key and Bob's public key. And Bob has Alice's public key and Bob's private key. The thing with public-private key encryption, like RSA, is that it's a pain. It's computationally intensive. It's slow, blah, blah, blah. So you don't use these when
Starting point is 00:32:53 you're communicating usually. What you do is you combine them in super secret ways, possibly with some other information like the time of day. And then you have a shared secret. And you use that shared secret to do some form of encryption that's much simpler. Slides are going backwards now. Go down one more. So when we talk about the keys on the device, we have the device's private key and the cloud's public key. That means the device can decrypt anything sent to it with its private key, and it can encrypt anything sent to
Starting point is 00:33:32 the cloud. Why are we bothering to encrypt things? And there are two main reasons for that. One, you want to make sure that the information you're getting is from who you think it is. So signatures. If, for example, Pebble took over Fitbit's devices keys, got these keys that live on the device, they could fake, they could spoof, and then get onto the Fitbit servers. And I know we're using these companies because Tyler's been involved with them, but let's stick with it. And then the Pebble users can use the Fitbit apps. And Fitbit probably doesn't want to support that.
Starting point is 00:34:22 So the signature piece is a big piece. And going the other way, you want to make sure that your firmware updates come from your cloud and not somebody else. And then the second thing we want to do after signatures is actual what we think of as encryption, which is protect it from anybody else reading the data. If you're doing a medical device, you definitely want to stay away from HIPAA and all of the things involved with keeping patient data private. And Fitbit did this too, because patient data such as exercise habits should be kept private. So we have signatures and encryption for our public-private key thing that shares the secret so they can talk to each other via simpler mechanisms. Where do you put these? Philip mentioned that you probably want to put them in manufacturing.
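The idea of combining key pairs into a shared secret and then switching to a simpler symmetric scheme can be sketched roughly as follows. This is a toy illustration, not what any panelist's product does: it uses textbook Diffie-Hellman as a stand-in (the discussion mentions RSA and does not name a key-agreement scheme), the prime `P`, generator `G`, and the function names are all assumptions, and the parameters are far too small for real use. A real device would rely on a vetted crypto library or the chip vendor's security features.

```python
import hashlib
import secrets

# Toy parameters for illustration only. Real systems use standardized
# groups or elliptic curves from a vetted library, never these values.
P = 2**127 - 1  # a Mersenne prime, far too small for real security
G = 3


def make_keypair():
    """Return (private, public) for the toy key agreement."""
    private = secrets.randbelow(P - 3) + 2  # random value in [2, P - 2]
    public = pow(G, private, P)
    return private, public


def shared_key(my_private, their_public):
    """Combine my private key with the peer's public key into a 32-byte key."""
    secret = pow(their_public, my_private, P)
    # Hash the shared secret down to a fixed-size symmetric key.
    return hashlib.sha256(secret.to_bytes(16, "big")).digest()


device_private, device_public = make_keypair()
cloud_private, cloud_public = make_keypair()

# Each side uses its own private key and the other side's public key.
device_key = shared_key(device_private, cloud_public)
cloud_key = shared_key(cloud_private, device_public)
```

Both sides end up with the same 32-byte key without ever sending a private key over the wire, and that key could then feed a cheaper symmetric cipher for the actual traffic.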
Starting point is 00:35:17 And you probably need a serial number to go with that. Okay, so we have a serial number. We have some keys. We compile them into the code. It'll be fine. If we ever need to update them, we'll do the OTA and update them that way. Well, what if, and hear me out here, what if some bad person, attacker, script kiddie, some interested party, a reverse engineer, hacker, whatever. They say, this is a really interesting application. I want to know more. Well, no device is perfectly secure once people have physical access,
Starting point is 00:36:01 whether it's sanding down the chip to read out the code or figuring out that there's a secret debug serial port or forgetting to blow the flash fuses so that people can read your code out with the right tools. It doesn't matter. What you need is to make the process of breaking your code more expensive than anybody wants to pay for it. That doesn't get you away from things like script kiddies who do it for amusement, but the continuing effort, and it's not just once, should be a reflection of how much money your company will lose if the data goes public. Okay, so you can compile them into the code and somebody can then crack them. Well, now they have keys to everything.
Starting point is 00:37:04 I mean, they can send data as a device. So you say, okay, I don't want it to be that simple. In manufacturing, I'm going to put them on an external SPI flash. Okay, well, then you get people like me who can read your SPI flash as soon as they have a board. Okay, I'll put it on the internal flash. Okay, that's not bad, and some chips have features that make that very difficult to read. Most of them are imperfect. But then you get things like the ChipWhisperer
Starting point is 00:37:34 that does magical things that will tell you these things just by sitting next to it. There are external crypto chips. The choice you make here is really about what you need to do as a company for security things. So the next slide. Let's say they get broken. And if you are using the same key for every device, all devices now are broken. But instead, you can do a per-key device, per-device key, sorry, per-device key. And that is far more secure. It means that every device has their own set of keys. If you break the keys for that device, you can only spoof that device. You can't read traffic for other devices or from other devices. Except that is a huge pain. I mean, you talk about manufacturing software,
Starting point is 00:38:33 every unit now has a serial number and its private key. And if you're really being fancy, maybe it has a cloud's public key as well, an individualized cloud's public key. And that will let you make sure that if anybody cracks a device, they only get that device. They can't build a whole army of new pebbles invading Fitbit servers. One more slide. The other side of this equation are the keys that live in the cloud. Whether you do each device has its own key or in each device has its own private and public key from the cloud doesn't matter. These are secret. These are very secret. Once people have these,
Starting point is 00:39:35 they can pretend to be your cloud. They can take your devices. They can take your firmware. They can do anything they want. This isn't the sort of thing you leave in your office drawer. This is the sort of thing that should be in a safe. And yet we need them to actually do our work. We can't talk to the devices without them. And this isn't necessarily part of the discussion. It's just really important to understand that all of the security stuff, the easiest way to break it is people. And so you need to keep them locked and don't check them into GitHub. I know you forgot that one time. Now you have to change all of them.
Starting point is 00:40:15 So yeah, be aware of how the security is going to affect your manufacturing because it will. And whether it's just you protecting your company, protecting your customers, protecting your trademarks, all of the stuff is part of the manufacturing process that you shouldn't wait until the end to think of. Okay, I'm sorry, that was a little more prepared than you probably wanted. But what do you think, Philip? Well, I have a question. To me, it seems like the most challenging part of what you described would actually be getting the, how do I exchange data with my CM? And how do I prevent my CM from having access to all my keys? Especially given the fact that we talked about, we don't often have network connectivity between our CM site and our office. So how do you typically manage that? The companies who have the most to lose, people who may have military contracts or HIPAA violation issues, end up doing their final manufacturing in wherever domestic is.
Starting point is 00:41:19 So if it's in the U.S., they do it here. And that means the unit gets almost fully manufactured, may even be in its case, or may even be in its packaging. And then that last bit happens in the company. As for not having connection, well, you don't actually need to. You can send a database of keys over and then they get programmed in. The manufacturer says, we're done with these keys. You load them to your cloud and it just goes in that sort of cycle. And you just sent them half the dump. You didn't send them all the information. Right.
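One possible reading of "you just sent them half the dump" is sketched below: the factory file carries only what must be programmed into each unit, while the device public keys stay with the cloud and the cloud's private key never leaves the company. This is an assumption-laden sketch, not a description of any panelist's process: the record layout, the `SN000001`-style serial numbers, the function names, and the toy key-pair math (`P`, `G`) are all hypothetical, and a real flow would generate keys with a proper crypto library or inside a secure element.

```python
import secrets

# Toy key-pair math for illustration only; not secure parameters.
P = 2**127 - 1
G = 3


def make_device_record(serial):
    """Create one per-device record with its own toy key pair."""
    private = secrets.randbelow(P - 3) + 2
    return {"serial": serial, "private_key": private, "public_key": pow(G, private, P)}


def split_batch(records, cloud_public_key):
    """Split a key batch into the factory half and the cloud half."""
    # The contract manufacturer programs these values into each unit.
    factory_rows = [
        {
            "serial": r["serial"],
            "device_private_key": r["private_key"],
            "cloud_public_key": cloud_public_key,
        }
        for r in records
    ]
    # The cloud only needs each device's public key.
    cloud_rows = [
        {"serial": r["serial"], "device_public_key": r["public_key"]}
        for r in records
    ]
    return factory_rows, cloud_rows


cloud_private = secrets.randbelow(P - 3) + 2  # stays inside the company
cloud_public = pow(G, cloud_private, P)

batch = [make_device_record(f"SN{n:06d}") for n in range(1, 4)]
factory_rows, cloud_rows = split_batch(batch, cloud_public)
```

Once the factory reports which serial numbers were actually built, only those rows from `cloud_rows` would be loaded into the cloud's device list.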
Starting point is 00:42:00 I mean, that's the beauty of the public keys is that you don't, everybody can actually know the public keys. You don't really want to spread them around, but it is possible. So, yeah, you don't have to send them everything. You can also have devices that will find their own private keys, the device itself will tell the CM what its public key is. And the CM knows the public key for the cloud. So nobody but the device ever knows its private key. And that's kind of mentally challenging. Your device is going to go out and
Starting point is 00:42:48 make its own security, but it's one of the best ways to do it. Thanks. That was a great answer. I just want to underscore for the topic of this panel, this is clearly not something you can cowboy at the last minute. A lot of thought has to go into how you're securing your devices and managing that. And this is the kind of thing
Starting point is 00:43:09 that if you wait until the last minute to think about it and do it, you're not going to do it well, or you're going to delay the ship date of your product, right? So as we see security regulation coming down upon us, it's certainly something you can't ignore. But do look to your chip vendors. Like OTA, a lot of these things are becoming more standard and becoming less of something you need to do yourself. Yeah, that's a great comment.
Starting point is 00:43:38 Buy off the shelf. Someone solved this problem. We should do a webinar on security. I'm realizing that'll be something. TBD. Thank you so much, Alicia. That was an amazing, amazing answer. All right. So next up, we've got a group question for the panel. So that's going to be sort of more of a broad one, but what we really want to see is any common pitfalls or, you know, hurdles that you see in general when you're shipping IoT devices,
Starting point is 00:44:05 when you're shipping these products out? Bricks. Bricks are the most fun. Where you mess up OTA and suddenly, maybe the manufacturer has built a thousand of these and the OTA doesn't work. So they're all just trash unless you unbox them. That's heartbreaking. There is somewhere in a factory in China,
Starting point is 00:44:34 a crate of 5,000 iPhones that it was my fault for bricking. They were just never dealt with. So yeah, it's painful. Let's see, 5,000 times. How much does an iPhone cost? I haven't done this math. I don't think it would be my most expensive mistake though. I will say that I think one of the biggest challenges that we face is the fact that we have to interface between software teams and hardware teams who have totally different jobs. And what you often see with embedded devices is that device itself is going to be a mechanical engineer or an electrical engineer who's going to be focused on the physical side. Or you're going to have somebody who comes from a software background and they have this great idea that requires a product, but they have no idea what it takes to build a product.
Starting point is 00:45:37 And when you're in these situations, and for example, you're really an expert in hardware, but you have no idea of all the external software pieces that are required to make your device function, you're just not going to think about it. It's not going to be in your schedule. It's going to be a surprise. You're going to have continual slips as you learn this and then vice versa. If you're a software person, you don't realize that you need to get your FCC certification and your Bluetooth radio has to be certified because you didn't pick a pre-certified module or, you know, you forget that you actually need to figure out how to provision all this information at a factory. Right. It's the same thing.
Starting point is 00:46:13 You're just going to not know what you have to do and you're going to be surprised. And it's going to be very, very painful when you have months of delays because of these critical pieces that you can't overlook and you can't just kind of shoehorn in at the last minute. I've definitely used the smart device that has the same name for every single device in the Bluetooth pairing screen. And then you have to literally walk down the street to pair the thing and then you walk back. We also had that at Pebble one time. That was fun. I think one of the things that, honestly, maybe Noah, you remember this, shipping our devices, we were just displeased with how slow we were provisioning data and
Starting point is 00:46:59 running manufacturing tests. And so one guy, one firmware engineer, his full-time job for like one or two months was building a web application that talked locally to a fleet or like a farm of Raspberry Pis that would then run the manufacturing tests. And it was just like totally out of the blue, was not what he was planning on doing. And yeah, but it was like, because we couldn't literally produce enough of these devices quickly enough if we had just run the normal manufacturing line, we had to have this like automated system that was honestly like a huge project. Yeah, those types of things are very sneaky. Like, you don't think about, okay, it takes me, you know, 45 seconds to generate a private key on the device and get the flash set up and then put it into ship mode.
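To put a per-unit time like 45 seconds in perspective, here is a rough back-of-the-envelope calculation. Only the 45-second figure comes from the discussion; the volume of one million units, the 60-day production window, and the 20 hours of station uptime per day are illustrative assumptions, and the estimate ignores retests and line changeovers.

```python
import math

SECONDS_PER_UNIT = 45   # per-unit provisioning time from the discussion
UNITS = 1_000_000       # assumed production volume
WINDOW_DAYS = 60        # assumed production window
HOURS_PER_DAY = 20      # assumed station uptime per day


def stations_needed(units, seconds_per_unit, window_days, hours_per_day):
    """Parallel stations required to finish within the window."""
    total_seconds = units * seconds_per_unit
    seconds_per_station = window_days * hours_per_day * 3600
    return math.ceil(total_seconds / seconds_per_station)


# How long one station would take running around the clock.
single_station_days = UNITS * SECONDS_PER_UNIT / 86_400
# Units per hour (UPH) a single station can sustain.
units_per_hour = 3600 / SECONDS_PER_UNIT

stations = stations_needed(UNITS, SECONDS_PER_UNIT, WINDOW_DAYS, HOURS_PER_DAY)
print(f"one station: {single_station_days:.0f} days, {units_per_hour:.0f} units/hour")
print(f"stations needed: {stations}")
```

Under these assumptions a single station would need well over a year of nonstop operation, which is why one slow step ends up multiplied across many parallel stations.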
Starting point is 00:47:47 Whatever, that's 45 seconds. And I need a million of these by Black Friday. Then the time, you can't even do that from business. Yeah, that's something I really appreciate from working at Apple. I don't remember the UPH we had to hit, but I think it was something like a million. So time was very important because you
Starting point is 00:48:05 can't fill a factory floor with just one test station. And there were hundreds. And hardware stage rollout. I think on the manufacturing front and the speed front, something that I see a lot is people forget about manufacturing
Starting point is 00:48:26 tests between development builds. And so you go to a build event for your, you know, you're producing engineering prototypes, you create your manufacturing tests, you know, your firmware does some things, and then you're back in the office, it's in between events, you're changing your firmware, you're adding new features, you're rewriting commands, things like that. And then you go to build more units and you've changed everything. So your firmware doesn't work or your firmware and your test scripts are incompatible. And now there's a last minute scramble to get those up to date. And you haven't been validating your manufacturing firmware in between builds. So now there's bugs and your retest rate is high, which is causing you to spend twice the
Starting point is 00:49:05 amount of time to get units through the production line. You can't just ignore manufacturing firmware if you're not doing a build event. It needs to be something that you include in your CI pipelines. In fact, on that point, it's a very easy way to get into hardware-in-the-loop testing and automation because you could, for example, in every build flash a little, a couple devices maybe hooked up to a tester or even just running the commands and make sure that your manufacturing test scripts still work. They execute in the expected amount of time, things like that. And you need to do that validation anyway, right?
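A minimal sketch of that kind of CI check is shown below, assuming a small table of manufacturing commands with expected responses and time budgets. Everything here is hypothetical: the command names, the responses, the budgets, and the `FakeDevice` class, which stands in for a real unit on a fixture that a CI job would reach over a serial port or debug probe.

```python
import time


class FakeDevice:
    """Stand-in for a unit on a test fixture (hypothetical command set)."""

    def __init__(self):
        self._responses = {
            "selftest": "PASS",
            "read_serial": "SN000001",
            "enter_ship_mode": "OK",
        }

    def send(self, command):
        return self._responses.get(command, "ERR unknown command")


# (command, expected response, time budget in seconds)
MANUFACTURING_STEPS = [
    ("selftest", "PASS", 1.0),
    ("read_serial", "SN000001", 0.5),
    ("enter_ship_mode", "OK", 0.5),
]


def run_manufacturing_check(device, steps):
    """Run each step and return a list of human-readable failures."""
    failures = []
    for command, expected, budget_s in steps:
        start = time.monotonic()
        response = device.send(command)
        elapsed = time.monotonic() - start
        if response != expected:
            failures.append(f"{command}: expected {expected!r}, got {response!r}")
        elif elapsed > budget_s:
            failures.append(f"{command}: took {elapsed:.2f}s, budget {budget_s:.2f}s")
    return failures


failures = run_manufacturing_check(FakeDevice(), MANUFACTURING_STEPS)
```

A CI job could fail the build whenever `failures` is non-empty, which would catch a renamed command or a step that has become too slow before the next build event rather than at the factory.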
Starting point is 00:49:45 You can't just release, like you can't release buggy, crashy builds to customers. You can't release buggy, crashy builds to your factory either. Um, so it's something to keep in mind because I see that happen a lot that people just forget about them. How many firmware builds do you typically have throughout the whole process? Do you have a release, a debug, manufacturing, inbox, firmware? It's almost four different ones.
Starting point is 00:50:14 Do you count bootloader as a different one? Depends on if the bootloader is a micro image or not. But the full-fledged image, I think I probably have four. Yeah, I would say three to four is pretty common. Yeah, it can get more complicated, I guess, if you're changing your keying strategy based on whether you're a development unit or a production unit too. Or any other details, like if you have a totally separate development backend environment, which you should,
Starting point is 00:50:51 then you might hard code things or depending on how you're handling that, you might now multiply your configs, which points out that that stuff should probably be configurable information on your device and not a variant if you could help it. And that's often something that you learn the hard way as you're realizing. Yeah, let's do development and testing and debugging
Starting point is 00:51:15 on our production backend. That doesn't usually work out very well. Something's going to break, inevitably. Awesome. That was great. Thank you guys very much. I think we're going to jump into a few audience questions now, unless anyone else had some more items they wanted to talk about for the common hurdles section. Let's do questions. All right, awesome. Yeah, so we've got some great questions coming in and please keep sending them in for any of the attendees. We'll happily answer any we can.
Starting point is 00:51:55 The first one that I'd like to ask the panel is, what is the best way to secure a UART or a programming interface? Certainly one common password is not good enough, and if you disable it, then you're unable to debug a sealed unit. So what are strategies that you might use for that? I'm going to pass on this question. We use a hard-coded thing at Pebble. Yeah. I think there's a balance between going back to what Alicia said. If somebody has your device, there's a limit to what you can secure. So commonly, I do see a hard-coded password. If it's beyond that, it would be you need to authenticate with some kind of key over a special Bluetooth connection endpoint or something like that. I've done that in the past. I've seen strategies where you might have a special debug board you plug in that has to send a password
Starting point is 00:52:45 and toggle some IO lines in a specific sequence with very precise timing. But again, these are all things that can be totally reverse engineered from your firmware and used against you if you're really concerned about that. Blowing fuses and just eliminating your debug interfaces altogether is, you know, but then you can't deal with debugging after the fact. Don't say that, Philip. No. Again, it depends on the degree of concern you have and how much you really need to protect that.
Starting point is 00:53:19 Another one I see a lot is your debug interface is only available for a short time. So you have to type that password in in the first two seconds after booting, which if you know the password is easy, but if you don't, it's hard to start guessing if you have to wait for reboot each time. Yeah, I've definitely seen that strategy used. All right, another question from the audience we have is, how do you manage revocation of keys for critical devices? That's probably a whole paper, I was thinking, one I have not written. I mean, it is a hard problem. It isn't so much revoking the keys, it's replacing them.
Starting point is 00:54:11 Because you can't, I mean, if you want to just trash a bunch of units so they can't get to your cloud because they've fallen off a truck or something, that's one thing, you just delete them from your access list. But if you have devices you think may have been accessed improperly, then just sending them new keys probably isn't going to help. Because if they have your device, then sending them a new key doesn't do much. You just give them the new key. So there isn't really a good solution that I know of. I would be happy to be wrong if anybody else has a good solution.
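The simple case, deleting units from the access list, might look something like the sketch below. It is an illustration under stated assumptions: the `DeviceRegistry` class and its methods are hypothetical, and a per-device HMAC secret is used as a symmetric stand-in for the per-device key pairs discussed earlier, because that keeps the example within the Python standard library. It does not address the harder problem of safely replacing keys on a device an attacker already holds.

```python
import hashlib
import hmac
import secrets


class DeviceRegistry:
    """Cloud-side access list mapping serial numbers to per-device secrets."""

    def __init__(self):
        self._keys = {}

    def enroll(self, serial):
        """Create and store a fresh secret for one device."""
        key = secrets.token_bytes(32)
        self._keys[serial] = key
        return key

    def revoke(self, serial):
        """Drop a device from the access list."""
        self._keys.pop(serial, None)

    def verify(self, serial, challenge, response):
        """Check a challenge response; unknown or revoked serials fail."""
        key = self._keys.get(serial)
        if key is None:
            return False
        expected = hmac.new(key, challenge, hashlib.sha256).digest()
        return hmac.compare_digest(expected, response)


def device_respond(key, challenge):
    """What the device computes with the secret it was provisioned with."""
    return hmac.new(key, challenge, hashlib.sha256).digest()


registry = DeviceRegistry()
key_a = registry.enroll("SN000001")
key_b = registry.enroll("SN000002")
challenge = secrets.token_bytes(16)

ok_before = registry.verify("SN000001", challenge, device_respond(key_a, challenge))
# A key extracted from one unit does not let it pose as another unit.
spoof = registry.verify("SN000002", challenge, device_respond(key_a, challenge))
registry.revoke("SN000001")
ok_after = registry.verify("SN000001", challenge, device_respond(key_a, challenge))
```

Here `ok_before` is true, while `spoof` and `ok_after` are false: per-device secrets limit the damage to one unit, and revoking that unit is a single deletion.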
Starting point is 00:54:53 So I don't know how it's done, but I know from just watching security news. For example, MSI just had their BIOS updating keys leaked in a ransomware attack, and they don't have a revocation mechanism for making so those keys can't be used anymore. But apparently other motherboard manufacturers have solved this problem, so I'd be really curious how various vendors like that handle revocation of the UEFI keys for their secure boot processes. And maybe that would be a good model. But off the top of my head, I don't know how that's done at all. Got it. So in the case where it would be needed, it's probably like a whole month-long project for someone, multiple months probably. Right. We got a good one here.
Starting point is 00:55:50 Can you talk more about bootloader design and the concept of a micro image? I think, Tyler, maybe you mentioned that one. Yeah, there is a webinar done by Francois, who basically, I think, iterated through a bunch of designs from his time at Pebble and then at Oculus. But I guess the idea is, so one, look at the webinar. It was given by an engineer, for engineers, not pitching Memfault, really, but kind of talked about the multi-stage bootloader design. But in short, it's like the tiniest little bootloader that knows where to look for the bootloader image so that you can update the bootloader image. But then the bootloader is only responsible for kind of
Starting point is 00:56:30 verifying the new firmware image, making sure it's signed and then knowing how to actually boot it and knowing if you have two slots or knowing how to update the main image. I have seen both types of bootloaders that one can connect to Bluetooth or like have some sort of connectivity stack and then others that do not. What I typically work with is the bootloader itself does not have connectivity.
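The verify-then-boot decision described above can be sketched as follows. This is a simplified model rather than the design from the webinar: the slot names, the `choose_boot_slot` function, and the data layout are assumptions, and a plain SHA-256 digest comparison stands in for real signature verification, which would check a signature against a public key stored in the bootloader.

```python
import hashlib
import hmac


def image_is_valid(image, expected_digest):
    """Digest check standing in for a real signature verification."""
    actual = hashlib.sha256(image).digest()
    return hmac.compare_digest(actual, expected_digest)


def choose_boot_slot(slots, preferred):
    """Return the first slot whose image verifies, trying `preferred` first.

    `slots` maps a slot name to (image_bytes, expected_digest) and is
    assumed to contain `preferred`. Returns None if nothing verifies.
    """
    order = [preferred] + [name for name in slots if name != preferred]
    for name in order:
        image, expected_digest = slots[name]
        if image_is_valid(image, expected_digest):
            return name
    return None


known_good = b"firmware v1"
new_image = b"firmware v2"
corrupted = b"firmware v2 with flipped bits"

slots = {
    "A": (known_good, hashlib.sha256(known_good).digest()),
    # Slot B holds a damaged download of the new image.
    "B": (corrupted, hashlib.sha256(new_image).digest()),
}

selected = choose_boot_slot(slots, preferred="B")
```

Because the image in slot B fails verification, the sketch falls back to slot A, which is the behavior that keeps a bad update from turning units into bricks.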
Starting point is 00:56:58 There's usually a, what do we call it? A recovery firmware and then a main firmware. The recovery firmware is like a full-fledged image that has a UI. It has a lot of stuff. And then that's the thing that has the connectivity. And then there's the full image that has everything. But it's like you're booting through stages and at each time you're verifying and you're seeing if you need to update the thing and
Starting point is 00:57:23 you're basically performing like hardware validation across as you boot through. I'm sure Philip or Alicia have extra things to add there, but I would say watch the webinar. Yeah, I thought that was a good answer. Nice. Another great question we have. Actually, we're at the top of the hour,
Starting point is 00:57:43 but we'll do one more question and I think we'll close out the panel and answer the rest in the Interrupt Slack. But this last question that we'll do live: do you worry about security with the ability to crack open a released product? Sure. I mean, it's about how concerned you are.
Starting point is 00:58:04 It's not easy. It's not usually, and people don't, people are usually buying your product because they want the product. It's when you have things that your product can cause other problems, medical devices or locating people, that's when you start worrying about how you're doing the security. But again, if you have one unit, you should only be able to break that one unit. If you can break all the units by cracking open one, then you have a much bigger problem. And there are strategies you can take should your device warrant such strategies. You can create
Starting point is 00:58:53 a circuit that is only closed, for example, when your product is fully enclosed and the circuit is broken when it's open. You can use that to control behaviors. Obviously, that can be faked. There's also things like a lot of RTC components will have tamper alarms that can be used to trigger a wake-up event. And so you can leverage things like this if you are really concerned to, for example, self-brick a unit. It's not really something that you want to do. But if your security warrants such a degree or such a concern or consideration that you might really want to take catastrophic extreme measures if somebody's opening your device, there are ways to detect that that's happening and respond to it.
Starting point is 00:59:41 You have to be prepared to do an RMA in case something goes wrong, or have some flow in place that you could repair those units and make them good again if it was an accident or some other thing happened. But you definitely can detect those devices being cracked open. I've actually worked on one that self-bricked. And the hardest part for me was at the end when we really had to test this functionality, but we were still in engineering, so we didn't have that many prototypes. But you have to test the functionality. You have to make sure it bricks. How did you brick it? Oh, we put a bootloader into RAM and then updated the flash to be all ones and then all zeros and then all ones. Yeah.
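The overwrite sequence described here (all ones, then all zeros, then all ones) can be modeled in a few lines. This is only a simulation under stated assumptions: a `bytearray` stands in for the flash contents and `wipe_passes` is a hypothetical name. On the real device the routine would run from RAM, as described, and use the part's flash erase and program operations.

```python
def wipe_passes(flash):
    """Overwrite the buffer with all ones, all zeros, then all ones.

    Yields a snapshot after each pass so the sequence can be inspected.
    """
    for fill in (0xFF, 0x00, 0xFF):
        flash[:] = bytes([fill]) * len(flash)
        yield bytes(flash)


# Pretend flash contents: bootloader, application, and keys.
flash = bytearray(b"bootloader+app+keys")
snapshots = list(wipe_passes(flash))
```

After the final pass the buffer reads as erased flash (every byte `0xFF`), with no trace of the original image or keys.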
Starting point is 01:00:34 Yeah, and we didn't have a recovery method. And it was a unit that we had taken out the flash fuses because it was supposed to be very secure. So it's truly bricking the unit. There is no recovery. It's hard to do that to the unit that you've had lovingly sitting on your desk. It's been your friend through all of these debugging adventures, and now you're just going to kill it. Sometimes that's a lot of money, you know,
Starting point is 01:01:06 depending on the cost to build those prototypes. That's just, you know, thousands of dollars that can be just flushed down the drain to test this functionality. Thousands of dollars, hundreds of hours. Yes. It's just hard to watch, harder to do. Sometimes necessary, but heartbreaking nonetheless. Great. Thank you guys so much. I'm just going to share the panelists' information really quick. All right. Yeah. So big thank you to our panelists today. That was amazing, especially our special guest, Alicia White. So thank you guys so much for entertaining our questions. It was a ton of fun. I could go for hours.
Starting point is 01:01:49 You can see on the slide, and we'll share this as part of the webinar follow-up, there's contact information for all these nice folks, especially a big shout-out to Embedded FM and the Embedded Artistry content that those two guys put out, which is just amazing stuff. So please go check it out. Thank you, Anne. Thank you. Big thank you as well to all the attendees. Thank you for
Starting point is 01:02:09 joining us while we discuss this topic and looking forward to the next one.
