Embedded - 38: Blame the Monkey

Episode Date: February 12, 2014

Producer Chris White (@stoneymonster) and Elecia discuss some insurmountable problems and some strategies for approaching them. Google it (or look on Stack Exchange). Explain the problem to someone else... even if they aren't there (use a stuffed animal or write a really detailed email, anticipating potential questions). Draw a picture (system/subsystem architecture or code block diagram or a doodle). Make sure you are running what you think you are: start over from a blank slate, making no assumptions about how your hardware is programmed. Identify and verify your assumptions about all the pieces involved. Get scientific: define the problem, create a hypothesis, run an experiment, record the results. Small steps! Also: get methodical and write everything down. Return to first principles: how is this supposed to work? Revert to last known good and diff to find the cause of a new issue. Logging functions: they take time but can lead to a better trace, better picture. Make it reproducible: there is information in the solution if you can find the steps to repro. Step by step, reduce the steps until you can nab it in the act. Remove the voodoo. Avoidance: accept the bug (it's a feature!) and go on. Sleep, go for a walk, or work on something else.

Transcript
Starting point is 00:00:00 Welcome back to Making Embedded Systems, the show for people who love gadgets. I'm Elecia White, and Chris White is joining me again today to chat about strategies for dealing with insurmountable problems. Apologies for my lingering stuffiness, I'll try not to blow my nose directly into the mic. But right before I got this stupid cold, I was doing some firefighting at a client, and the algorithms guy, who has been roped into doing the embedded work,
Starting point is 00:00:38 had dug himself a huge hole. I came in and explained what he really needed was DMA to go with those SPI drivers, and he'd really thank me for it later. But at some point, as we were talking about what had gone wrong and what he was trying to do to fix it, he turned to me and asked if I ever got stuck, just stuck on a problem. No. Well, yes, of course, but not for very long. And certainly not as well as he had done it. But that's the plan for today, to talk about getting stuck and getting unstuck. As I mentioned
Starting point is 00:01:17 in the intro blurb, Chris White is turning around his producer chair to chat through this with me. Hi, Chris. Thank you for joining me. Hi. Do you know what I mean by stuck? Yes, and I take issue with your not stuck for very long. I've certainly had things I've been stuck on for weeks. Is that not very long? Well, stuck. What do you mean by stuck then?
Starting point is 00:02:06 Well, unable to come up with an explanation for a difficult problem, sometimes because it's difficult to reproduce, sometimes because it's so complex that it's difficult to find a framework to interrogate what's going on. The generalities I understand, and I think we're going to be doing a lot of those today, but do you have any specifics? What things? Well, I mean, I don't want to get deep into this immediately, but things on complex systems where there are multiple threads happening and maybe you have a race condition or corruption that only happens under certain very specific circumstances, maybe with external hardware speaking to your hardware, you know, those kinds of situations where the graph of interactions is complicated. So complex systems having a not unreproducible problem,
Starting point is 00:02:53 but a difficult to reproduce problem. Yeah, or even, again, I mean, it doesn't have to be difficult to reproduce. Sometimes it can be. Having a problem that interacts with lots of its subsystems. Yeah. Okay. And sometimes you dig into those and you think you know what's going on.
Starting point is 00:03:13 And this is kind of the pattern that happens to me a lot is I'll see a problem and go, oh, I know exactly what's happening. And I'll go fix the exactly what's happening and that won't be it at all. Or I'll have made something much, much, much worse. Or I'll fix the problem but cause a new sort of twisted version of the same problem to appear. So it's like you're pushing around the edges. And the more complex the system is, it seems to me, the more often you end up in that situation where, okay, I'm just going to fix this one line of code and everything will be fine. But what you really do is you have a cascade effect that, you know, leads to another problem, and
Starting point is 00:03:47 I've been in a situation where it's weeks to stabilize something. You know, you started here and you started pulling on the thread, and pretty soon you didn't realize... Yeah, you've unraveled the entire sweater. You should just start over. Yeah. Well, but that wasn't actually... I mean, you never stopped in there. You never said, oh my God, I am stuck. Oh. You kept pulling the thread. I mean... Right. Okay, so I see what you mean by stuck: just giving up. It's the feeling of, oh my God, I'm never going to fix this, and I don't even have a clue what to do next. I've certainly felt that way, but I keep trying something else. Yeah. Even if that something else doesn't really quite approach it. I mean,
Starting point is 00:04:38 even if that something else seems kind of pointless, it's still, I never stop. Right. Or if it's really bad, then I just quit that job. I used to feel stuck often. So I think it's an experience thing. I definitely, early in my career, I would just get hit with these things and I had, you know, well, how do you begin to design this system?
Starting point is 00:05:15 It's just a wall, and there's no footholds. It's just this huge thing. Um, and I felt stuck that way. And, uh, memory, before I really understood memory. Right. Definitely having pointer problems, like, I don't even know how to go about solving this. Well, and not to sound like the get-off-my-lawn guy, but get off my lawn: back when we were starting out, there was less pervasive information on the internet. You know, you couldn't go to Stack Exchange because it didn't exist.
Starting point is 00:05:52 It's true. And, you know, nowadays, there's so many developers approaching so many problems because this industry has exploded that there's very few things that you might encounter that somebody else hasn't at least encountered a similar incarnation of. Well, you get those complex systems, and nobody has this set of Lego blocks.
Starting point is 00:06:15 But yeah, but that's a rare situation, I think. And so, I mean, it sounds almost like cheating, but a lot of times the first thing I do when I run into a problem these days is to Google the exact error message or the exact situation. And a lot of the problems I run into these days, and I think a lot of other people do, is not so much, oh, I've written all this code from scratch
Starting point is 00:06:40 and it doesn't work. It's the interactions between things you're piecing together like third-party APIs and things. It's like, wow, how do I make this work this particular way which isn't really standard? And those kind of questions are generally things that other people have asked. Okay, so I was going to make a list
Starting point is 00:06:59 of the different types of problems we'll talk about and then work on strategies to that. All right. Okay, so that was dealing with third-party libraries. And you don't have the right visibility because it's a black box. Hardware problems is an area that, as a software engineer, when I started out,
Starting point is 00:07:16 if it was a hardware problem, or even if I thought it might be a hardware problem, I was kind of stuck as to where to go next. Random crashes. Random crashes, yeah. Stack overflows and hard fault handlers. Definitely the I don't know where to start problem used to hit me a lot.
Starting point is 00:07:37 Race conditions. Timing and race conditions. Things that happen that don't show up when the debugger is attached. That never happens to me. What? That never happens to me. All of your bugs show up when the debugger is attached, and only when the debugger is attached? Oh, that's a different set. Oh, it's the rebugger. Let's see. Problems that happen occasionally. The stuff that doesn't reproduce easily.
Starting point is 00:08:07 The stuff that is clearly not software's fault. Which is the default state of all problems to start with. Things that
Starting point is 00:08:18 aren't software, they're clearly not software's fault. Those are cosmic rays. Hey, that actually happens. It does.
Starting point is 00:08:27 It really, really does. And if you have to plan your software for it, wow. I worked on a system that it was guaranteed to happen all the time because it had so much RAM. Yeah. Okay. Let's see. From Twitter, I asked a question about what insurmountable problems.
Starting point is 00:08:49 Cache coherence, garbage collection, invalid pointers, and off-by-one errors. I definitely see the invalid pointers. Cache coherence, yeah. Most, I mean, gosh. Unless you're architecting the system, you know, the hardware system yourself, I don't know that many of us run into that these days. Yeah, I don't worry too much about cache coherence or garbage collection. Because garbage collection, what do you... Yeah, I'm on Java. Java. Okay, so we have our list of problems, and now let's go through and figure out
Starting point is 00:09:22 where would you start. Oh, and it worked yesterday. That was, that one. That one used to be, I used to be so, so bad at that and just get so tangled up with, oh my God, this worked yesterday. It worked when I did it. Well, reproducible and it works for me is kind of a separate thing. Okay, so that's my list of things that I have felt.
Starting point is 00:09:53 Oh my God, I might as well just quit. This is never going to be fixed. Let's just scram, start over. Burn it down. If I just plug it in backwards, no one will realize it's software's fault. So strategies.
Starting point is 00:10:12 And I wasn't going to do this one for one, but maybe like now that we have our list of problems, try to work on which strategies work really well. And Google it. That is by far the first thing. Well, I mean, Stack Exchange has become a huge resource. Even for embedded. Yeah.
Starting point is 00:10:31 I used to say, well, yeah, Stack Exchange was nice if you were working on a PC. But now even for embedded, there's a lot of ARM stuff and even some TI DSP stuff there. And I often feel like I should go there and start answering some questions at this point. I do. I feel like it's irresponsible not to participate in such a pretty amazing community. And yet, still not signing up for Stack Exchange. And that goes for Googling it, phrase it different ways. And I'm thinking about the guy at the client, he couldn't have Googled his problem
Starting point is 00:11:08 because he couldn't state his problem. He didn't know anything other than there didn't seem to be enough time in the system. Well, that's a skill. I mean, you have to, I mean, you can flippantly say Google it, but if you don't really understand how to, if you don't understand your problem well enough
Starting point is 00:11:22 to construct a search, then yeah, now you're in real trouble. Yeah. But one way to construct a search or some way I've talked about before, having a second pair of eyes, even if they aren't actually there, talk it through with someone. And if you don't have somebody to talk it through with, that's okay. Use a teddy bear. They're really, really effective because just going through the problem is enough.
Starting point is 00:11:56 Yeah. Or a duck. Wasn't there a duck? Yeah, there have been ducks too. And a frog. My frog was my bug eater when I worked at LeapFrog. But I used to also email, like prepare this in-depth, what is the problem email and what steps I've taken to colleagues.
Starting point is 00:12:21 Yeah. And then never send it. No, that's a good idea. A lot of, I mean, a lot of us, at least myself, you get in the habit of sitting in front of your computer and the only thing you really type or write is code or the occasional short email. And it can be very powerful to get your thoughts organized
Starting point is 00:12:42 in a way that perhaps they aren't normally organized when you're coding. Because you're, you know, it's a different application of your thinking. Yeah, and it gets you out of the monkey press try over and over. Which can be very useful sometimes.
Starting point is 00:13:00 It can be very useful, but when you get to these insurmountable problems, that's when you have to stop doing that. I guess. You don't agree? I think I'm defending my usual mode of debugging probably a little too heavily. No, I mean, this is not, oh, there's a bug, I have to fix it, sort of. Yeah, I know. I really want to go after those things that at one time in your career
Starting point is 00:13:27 you would have said, I'm hosed. I don't know. But when I write these emails, usually it's to somebody I respect a lot. Yeah. And as I'm writing it, I'm trying to answer all of their questions. So just to step back, and I know we want to get into specifics, but just stay general. At least in my past,
Starting point is 00:13:50 a lot of these insurmountable problems come at terrible times. Like off the manufacturing floor or from a customer report. And you start looking at it and you realize there's a problem, but it's a deep problem. You know, it's starting to tickle the insurmountable bit in the back of your brain and make you nervous. But, you know, it's one thing if you're in the midst of developing a new product or something
Starting point is 00:14:15 and you've come across this situation and you've got, you know, two, three weeks to bang on it. It's another thing if you've got all of management hovering around you saying, why isn't this fixed yet? And you're saying, well, you know, here's my long list of stuff that you don't understand at all that I've gone through, and they don't care. They just want to know why software is not working. So that, to circle back, I think that's a good reason to have, okay, this is a, you know, this is a somewhat challenging problem. Here's why. And be able to say that in a way that makes sense. Oh, that's really hard. I mean, even if you were assuming that the people who are asking you for constant status updates were engineers at one point in their life, you can't just break down and say, oh, well, you need DMA to go with those
Starting point is 00:15:33 SPI drivers and blah, blah, blah. And they're like, yeah, I totally know what SPI is and I know what DMA is, but quit treating me like I'm an idiot because I have no idea what you're talking about. And it's this weird combination of don't condescend to me and at the same time condescending back. It's strange. But I mean, you see that problem, right? Oh, yeah. Because being in their shoes, actually, having been a manager and having been the vulture asking for status updates, I really don't care what the problem is. I just want to know when you're going to fix it. And telling me it's insurmountable doesn't actually help me figure out when
Starting point is 00:16:14 we're going to ship this puppy. So yeah, it's fine to have these detailed analyses of what's wrong, but can you make it sound bad enough that people are going to leave you alone for a little while and good enough that they're not just going to fire you? Sound too bad, then they'll fire you too. Well, how did you let it get this bad? Or sometimes they'll throw more people on the problem, when that isn't helpful at all. Now you just have two people staring at the screen going, I don't know.
Starting point is 00:16:45 What do you think? I don't know. I used to work with a guy when these problems would come up. He would say back to whoever found it or whoever was harassing him, well, that's impossible. That just can't happen. Well, that's what these problems feel like. That's totally... That's been my first reaction to a lot of these.
Starting point is 00:16:59 Well, that's just impossible. That can't happen. I'm sorry. You're breaking the laws of physics. It's not going to work that way. I mean, there's an if statement right there that you're past with the conditional that isn't true. Yes, yes.
Starting point is 00:17:09 If A equal equal B, then do this. And then you go in the debugger and B doesn't equal A and you did it anyway. Yeah. Stupid computers. They never do what you tell them to. Actually, they do exactly what you tell them to. Yes, but sometimes somebody else told them something
Starting point is 00:17:24 and you don't know what it is. What were we talking about? Insurmountable problems and a list of various kinds and strategies by which one might solve them. Okay. So what do you do? Once I get past the monkey stage. Yeah. Because, I mean, I always start with monkeying around. Yeah. Tweaks. Um, so it depends on the kind of problem, but generally, uh, I'll pull
Starting point is 00:17:57 out a pad of paper. I like that. And I'll start drawing maybe a block diagram of what's going on. If it's a messaging or multi-thread kind of interaction, draw a timeline to try to make sense of what I think at least should be happening. And then you can go back and you can verify against that to say, okay, is this what's really happening? There's a whole bunch of things you just said there. The first is checking your assumptions, verifying what you think is supposed to happen. Because occasionally these insurmountable problems have been little tweaks where it was actually doing exactly what I expected. I just didn't quite think it would work that well
Starting point is 00:18:37 or in that order. And so verifying the assumptions with what was supposed to happen, especially with message passing in complex systems. That one we should highlight. And also, you said draw a picture. I think drawing a picture is a solution for 60% of all bugs.
Starting point is 00:18:59 Hard bugs, not easy bugs. You draw a related picture, though, not like of an elephant or something. Oh, really? No, no, no, you doodle. Fantastical doodles. No, no, I have a lab notebook. You have many lab notebooks.
Starting point is 00:19:16 We actually got our lab notebooks personalized. That's how dependent we are on our lab notebooks. For each client, I fill in the first two pages and then draw somewhere else. That's true. I don't always use my lab notebook for this. But drawing a picture of what's supposed to happen, or I mentioned where do you start.
Starting point is 00:19:37 Those insurmountable, this system is too big, there's no way I'm going to be able to do it. Drawing a picture is breaking into parts. When you have memory corruption errors or stack fault errors, drawing a picture and figuring out, okay, I know my stack is overflowing, but I don't know where. When you draw a little picture of a stack overflowing, and now you have, it helps my mental image as well as having a physical image even if it's
Starting point is 00:20:08 crappy sketch physical image, it helps my mental image of what's happening. Um, and then paper, uh, because when I'm in that monkey stage where I'm just, oh, you know, I'll try this, I'll tweak this, I'll push go, I'll tweak this, I'll push go, I'll tweak this, I'll push go. I don't necessarily keep track of what I've done. No. And I will often do the same thing over and over again out of irritation. It's not irritation. It's a different I word.
Starting point is 00:20:39 That's not true. Many things you do over and over again and get a different result. But yeah, no, I didn't. I'll forget that I've done, like you said, I'll write down, you know, keep track of what you've done because I will try the same thing over and over again. Sometimes just because I forgot if I did it or maybe I didn't trust that I really did what I thought I did. Well, and sometimes it does act differently because there was some step that you didn't realize you did. Yeah. That was critical. And so it worked that time and it didn't work this time. And so the paper for me is a way of developing the steps to reproducibility
Starting point is 00:21:20 and then tweaking them. You know, get rid of the voodoo and figure out the science. Yeah. Um, what else? So what else is on your paper when you're stuck? Well, um, you know, again, it depends on the kind of problem. Um, but a lot of times some sort of model of what the system looks like, the software,
Starting point is 00:21:47 um, the whole system, or maybe just starting with the piece that's broken. Uh, you know, I'll write things down like values and stuff
Starting point is 00:21:59 that I expect to be there. Sometimes with a debugger, it's hard to keep track of how things change. You know, it doesn't keep a history of the particular variable or a register or something. So sometimes it's valuable to, okay, this was at time zero, it was this. At time one, it was this. At time three, it was this.
Starting point is 00:22:17 It's nice to just kind of have a backtrace of the state of the system. Because most debuggers only give you, this is what the system's like right now. Next statement. Okay, this is what the system is like right now. Well, and you can sometimes go up in the call stack and see what everything was before you called this function. It's still what was there.
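Editor's sketch: the pen-and-paper history described here ("at time zero it was this, at time one it was this") can also be kept by a small host-side helper. The following is a minimal Python illustration, not anything the speakers describe using; the class name `StateHistory`, its methods, and the sample variable names are all hypothetical.

```python
from collections import deque


class StateHistory:
    """Bounded record of (time, name, value) samples; oldest samples drop first."""

    def __init__(self, max_samples=256):
        # deque with maxlen discards the oldest entry once the limit is reached
        self._samples = deque(maxlen=max_samples)

    def record(self, time, name, value):
        """Note the value of one variable or register at a given time."""
        self._samples.append((time, name, value))

    def changes(self, name):
        """Return (time, value) pairs only at the points where `name` changed."""
        result = []
        last = object()  # sentinel that never equals a recorded value
        for time, sample_name, value in self._samples:
            if sample_name == name and value != last:
                result.append((time, value))
                last = value
        return result


# Hypothetical samples, as one might jot them down while single-stepping.
history = StateHistory()
history.record(0, "tx_count", 0)
history.record(1, "tx_count", 0)
history.record(1, "state", "IDLE")
history.record(3, "tx_count", 4)
print(history.changes("tx_count"))
```

Filtering to transitions only is a design choice for readability: the interesting question is usually when a value changed, not how many times it was observed unchanged.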
Starting point is 00:22:41 It's still... It's still a very instantaneous picture. It's not going back in time, though. It's values of, yeah, or at that time. Uh, and getting the picture of what's happening now. What's happening now, that goes back to, if you have a last known good code. Right. Well, yeah. Um, figure out what the different... I mean, last known good code doesn't ever crash. This one crashes. Diff. I'd start with diff.
Starting point is 00:23:11 Diff, for sure. Or start with, you know, git log and blame. But sometimes that hasn't worked well for me for one reason or another. But running the code side by side in two computers, two devices. Oh, I've never done that. Step, step, step.
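Editor's sketch: when the diff between last known good and the current code is too large to read, one related approach is to binary-search the revisions in between. This is not something the speakers describe; it is a swapped-in illustration, and the function `first_bad`, the revision names, and the `is_good` check are hypothetical. It assumes the first revision is good, the last is bad, and the bug stays present once introduced. Git ships a `git bisect` command that automates the same idea.

```python
def first_bad(revisions, is_good):
    """Return the earliest revision for which is_good() is False.

    Assumes revisions are ordered oldest to newest, revisions[0] is good,
    revisions[-1] is bad, and every revision after the first bad one is bad.
    """
    if len(revisions) < 2:
        raise ValueError("need at least one good and one bad revision")
    low, high = 0, len(revisions) - 1  # low: known good, high: known bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_good(revisions[mid]):
            low = mid    # bug was introduced after mid
        else:
            high = mid   # bug was introduced at or before mid
    return revisions[high]


# Hypothetical history: eight revisions, broken from "r15" onward.
revisions = ["r10", "r11", "r12", "r13", "r14", "r15", "r16", "r17"]
tested = []


def is_good(revision):
    tested.append(revision)          # stand-in for "build, flash, run the repro"
    return revision < "r15"


print(first_bad(revisions, is_good), "found after testing", tested)
```

Each call to `is_good` stands for a full build-and-test cycle, so halving the search range matters most when a single test run is slow or needs hardware on the bench.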
Starting point is 00:23:26 Okay. That has been useful. For those, oh my God, this can't be happening. All I did was change a comment. So it can't be the code that's different. Sort of. Usually it's all I did was add a printf. Well, printfs are kind of easy though.
Starting point is 00:23:44 Because printfs change the timing. But you have to realize that's what you did. Yes, yes. Printfs, well, all the timing ones, you can do things to replace them. And I guess that's a toolbox sort of thing. If it works without printf, or it works with printf,
Starting point is 00:24:03 and then when you go without, then you can start putting delays in for printfs and see if that is really all that's happening or if there's another characteristic that is involved with printf. Because printf also takes a bunch of RAM and it does some stacky things. Yeah, it can blow your stack away,
Starting point is 00:24:19 especially if you've got a serious printf with the... Percents. The percents. Is that what we're calling them? The variable string arguments. I don't know, the thing that, you know, formats. Formatting. That's it, formatting. The percents.
Starting point is 00:24:32 Boy, we're really staying organized today. This is how we debug too. We just bounce around. Isn't this how everybody debugs? I mean, really. Where were we? You were asking me what I did first.
Starting point is 00:24:50 Okay, so write things down. And I think your second suggestion was good. If it's something that's changed over time, figure out when the last known good state was and either run it side by side, which is something I've never done. Or diff. Definitely diff. Diff is your friend.
Starting point is 00:25:05 Or diff. Or if you don't have the capability to run it side-by-side and you have some logging, run it the old way and get a full log and then run it the new way and get a full log. And it's always my luck that one of the things we put in with the new one was the logging, and so I ended up backporting the logging to the old one. And then it doesn't work anymore. No, sometimes it works. Yes, of course. And then the logging is your problem. Ta-da.
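Editor's sketch: comparing a full log from the old build with a full log from the new build is easier once run-to-run noise such as timestamps is stripped out. The following minimal Python example uses the standard `difflib` module; the log format (`[seconds] message`), the helper names, and the sample log lines are all assumptions made for illustration, not the speakers' actual logs.

```python
import difflib
import re

# Assumed log format: "[  12.345] message". Timestamps differ on every run,
# so they are removed before comparing.
TIMESTAMP = re.compile(r"^\[\s*\d+(?:\.\d+)?\]\s*")


def strip_timestamps(lines):
    """Drop the leading timestamp and trailing newline from each log line."""
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def diff_logs(old_lines, new_lines):
    """Return unified-diff lines between the old-build and new-build logs."""
    return list(difflib.unified_diff(
        strip_timestamps(old_lines),
        strip_timestamps(new_lines),
        fromfile="last_known_good",
        tofile="current",
        lineterm="",
    ))


# Hypothetical logs captured from the two builds.
old_log = ["[0.001] boot", "[0.250] spi init ok", "[1.000] sample 42"]
new_log = ["[0.002] boot", "[0.310] spi init ok", "[1.400] sample timeout"]

for line in diff_logs(old_log, new_log):
    print(line)
```

Stripping timestamps hides timing differences, which is a trade-off: if the bug is a timing problem, a second pass comparing the timestamps themselves would be the next step.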
Starting point is 00:25:30 Oh, oh, we didn't talk about, uh, we talked a little bit about, you know, sometimes it runs and it works and sometimes you do the same thing and it doesn't work. Yeah. Uh, I remember one whole 48-hour, vultures-hovering-over-me, waiting-for-me-to-get-it-done period. And I wasn't running the code I thought I was the whole time. The whole time. Is it plugged in? Yes, yes.
Starting point is 00:26:00 Don't forget. But I think that's helpful with debugging sometimes is unplug everything and start over today fresh with a known you did all the steps. It's too easy to, you know, we have Flash in our processors now. And if you have a system with three processors, did you actually run the Flash program for everything? You know, it's really easy to get flustered and then start,
Starting point is 00:26:26 and I make fun of the monkey stuff because we all do it, but it's really easy to start messing around when you don't really know what the state of things is. Well, yeah, because... You don't, I mean, you may have your hardware put together wrong. And, you know, it may not be that your problem is the things you set up wrong. It may be that in the course of debugging, when you were acting like a monkey, you did something that you didn't really notice or pay attention to.
Starting point is 00:27:00 And now you're in a different state. And not even really debugging the problem you thought you were. Oh, yeah. Well, that happens a lot. I had a stack problem, and I didn't know. I didn't realize I was overflowing the stack. And so my algorithm was running badly, and I spent forever figuring out what the algorithm's problem was. And the truth was there were a couple of bugs in there.
Starting point is 00:27:25 So I would fix them and something different would happen. You know, this makes us sound like such great engineers. Let's just keep talking about how much we suck. Sorry, go ahead. So, you know, what was I going to say? Well, one thing I found about a lot of these problems is that you start out with a notion. I think I sort of said this.
Starting point is 00:27:50 You start out with a notion of what it is, and you might be proven wrong as you go through, as you find out that your initial impression is even worse. But a lot of times there's more than one thing going on. It's true. You may have two bugs that are kind of interacting with each other, you know, and in isolation either one of them may not have happened, or in isolation either one of them would have been easy to fix and find. Right. But maybe you have stack corruption and maybe you're
Starting point is 00:28:19 running out of RAM, and so now everything's just all over the place and nothing makes sense. Yeah. So that actually, the algorithm thing and that together, we had James Grenning on the show and he talked about the test-driven development and running tests for different areas. And I think some of these insurmountable problems fall to that sort of testing, to the rigorous, every line you put in has to be tested.
Starting point is 00:28:53 And if it's not testable on its own, why is it a line of code? Yeah, no, I think that's true on a module-to-module basis. I think if you do good unit tests, you shouldn't really run into these kind of situations on a per-module basis. I think if you do good unit tests, you shouldn't really run into these kind of situations on a per module basis. But that leads us back to stack corruptions. Yeah.
Starting point is 00:29:13 And it leads us back to complexities in big systems. That's almost like the difference between, you know, testing in a test tube where, you know, you could cure cancer with bleach. Controlled circumstances. And testing in a live being where, you know, obviously that would be bad. But unit testing is often, and I, you know, I think this is the way a lot of people do it.
Starting point is 00:29:35 They, you know, they surround the particular modules with the test framework and they interact with it. And okay, it works fine there. But it's not necessarily on the real system, or it's obviously not interacting with the, you know, real callers. Um, so your unit test is only as good as, you know, your imagination, to some extent. And you can simulate the whole system and have your unit test be perfect, but shouldn't you spend some of that time and energy building a system instead of a simulation?
Starting point is 00:30:06 Well, I think unit tests are good, but I do think they have their limits. And I think they have their limits in particularly in complicated message-passing, multi-threaded architectures where there's combinatorial problems that... Well, you say message-passing and threads, and I hear...
Starting point is 00:30:24 Not embedded. Semaphores and, sure, interrupts. Yeah. Because one of the problems a client recently had dealt with having multiple interrupts stack up, and they weren't nesting them. But then when you were done with this interrupt, you'd lost all of them? No, no, they could do it. They went to the next interrupt and they even had them prioritized correctly. But if this low-priority one came and it finished and it took too long, five interrupts later was when you saw the problem, because they had run out of time. But it wasn't until the next hard-time interrupt, real-time interrupt, occurred that you realized you had totally shot yourself in the foot. Okay. So, I mean, that's sort of where is
Starting point is 00:31:14 everything. Um, goes back to visibility. Yeah. Which you mentioned earlier, that sometimes you're working in systems where you don't get to see what you want to see. I think you said that with black boxes and third-party libraries. Um, I don't think I've actually said it on the air. Oh, man. So we can get to that. No, no, you did. You said... I said something about third-party libraries, but... Oh, oh, I guess, yes, you're right. You're reading ahead. I'm not reading ahead so much as...
Starting point is 00:31:55 Yeah, well, okay, as long as we're being completely discombobulated. It's just so good to be an A-plus show. Okay, discombobulated, yes? Well, there's only one show, so there's only one grade. So if we're grading on a curve, we have to be the best. According to your logic, I am stuck as a C student for the rest of my life. What? No.
Starting point is 00:32:21 That's not how it works. We're high-scoring this. High score, low score. Okay, so going back to problems. Um, yeah, there's a lot of the times where, all else being equal, you know, you could have solved something quickly, but you can't see what's going on. And that's masking the problem. Yes. So times like this are, you know, you have hardware that you don't understand. Maybe you bought it.
Starting point is 00:32:56 Maybe your EE is gone or... Incommunicado. Incommunicado. Whatever. Mean. So you have this, you know, this black box piece of hardware that's doing something that it may be doing the right thing, but you don't have visibility into its internal workings enough to kind of do the things you need to do to debug your problem.
Starting point is 00:33:17 You know, keep track of state and diagram things out. So there's that. There's third-party software, which people use a lot even in embedded situations these days, where you buy, I don't know, an RTOS or graphics library or something, and something weird is happening. And again, it may not even be with their stuff, but you're using their stuff and it would be nice if you could see this, but you can't. Or your debugger's not very good. Which, definitely, in an embedded system is possible. Or you're incredibly deeply embedded and you don't have a debugger,
Starting point is 00:33:49 or your hardware doesn't have the test points it turns out you need, all kinds of things like that. Yeah, I had that happen recently. I was doing a complicated PWM system to drive a motor, a three-phase motor, and we were sine commutating. I was outputting the PWM on TTL levels, and it was going to a motor controller where it was then going to FETs, and then it was finally...
Starting point is 00:34:16 Boba Fett? No, FETs. So it kept up the voltage and current, and then it was finally going out to the motor. And so I was driving six signals that ended up being, at the end, three signals out to the motor. I couldn't see the six signals. And it was one ball grid array and one fine-pitch part.
Starting point is 00:34:37 And I was sure that my signals were doing what I said they were doing. And the people I was working with said, oh, you just have to do this. As long as the signals look like this, the output will be fine. And the output didn't look fine. They were out of their minds. It's fine.
Starting point is 00:34:54 It's fine. It was so bad. And the motor sounded awful. That's normal. I needed to see my six lines to prove that I was driving what they wanted. Smoke means, smoke means working. For this one, it did.
Starting point is 00:35:07 Blew a lot of FETs. But it was one of those, I guess being stuck is about being powerless. Well, you know, it feeds back into feelings of inadequacy after enough time. But this isn't, this isn't imposter syndrome. We talked about that. That was a good conversation, but this is,
Starting point is 00:35:31 this is, but, but, but it can make things worse because you can get, you can get anxious and you know, it's like being a, it's like taking an exam, right?
Starting point is 00:35:39 Where you blank out. Yeah. And you can blank out for, I blanked out for a long time on some of these problems the whole day. It's a big cubicle of education. It takes a nap or, you know, sleeping overnight to, you know,
Starting point is 00:35:54 let your subconscious puzzle over it. But going back to black boxes and things and not having visibility, I mean, I don't, you know, some of the techniques I've used there are kind of scientific exploration. Okay, change the circumstances, change the inputs, and see how it changes the outputs.
Starting point is 00:36:16 See how changing the initial conditions of whatever is happening, and I don't have a good example right now. It could be an algorithm. It could be some message transaction, you know, okay, insert a new message, change the timing, and just, you know, try to get a feel for the parameters of the problem. You know, people say garbage in, garbage out. Okay, change the garbage on one side and see how the garbage changes on the other side. And sometimes you can get a sense for,
Starting point is 00:36:49 oh, this is what's going on because I did this and it reacted this way. That means it's not doing X or doing Y. And this is where the piece of paper helps me, is defining. You create the hypothesis, you put the garbage in, you try it out, and you record the results, and you take tiny steps and see,
Starting point is 00:37:10 and then you notice that, oh, it's only when the high bit is set that this happens. Right, and sometimes you can be really methodical. If you've got an input that's eight bits, all right, it's doing this in this one, fine, I'm going to give it every damn possible input. When you accept it's going to take two hours to get through this, then you just go ahead.
Starting point is 00:37:29 Then you come up with a two hour plan and you can be exhaustive. And sometimes that's great because maybe, maybe you have four problems and you've only seen one. Yes, that's true. And of course that goes back to unit testing. If you had a proper unit test for that, you would have found all of these to begin with.
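The exhaustive approach described above, giving an eight-bit input every possible value, might look something like the following in C. This is only a sketch under assumptions: `scale_under_test` and `scale_reference` are hypothetical names invented for illustration, and the wrap-around bug is planted on purpose so the sweep has something to find.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function under test: meant to double its input and
 * saturate at 255. The bug is deliberate: the cast wraps instead. */
static uint8_t scale_under_test(uint8_t in)
{
    return (uint8_t)(in * 2u);
}

/* Hypothetical reference model: slow but obviously correct. */
static uint8_t scale_reference(uint8_t in)
{
    unsigned wide = (unsigned)in * 2u;
    return (wide > 255u) ? (uint8_t)255u : (uint8_t)wide;
}

/* Try all 256 possible inputs. Returns the number of mismatches and
 * stores the first failing input in *first_bad (-1 if none failed). */
unsigned sweep_all_inputs(int *first_bad)
{
    unsigned failures = 0;

    assert(first_bad != NULL);
    *first_bad = -1;

    /* The loop counter is wider than 8 bits so it can terminate. */
    for (unsigned in = 0; in <= 0xFFu; in++) {
        if (scale_under_test((uint8_t)in) != scale_reference((uint8_t)in)) {
            if (*first_bad < 0) {
                *first_bad = (int)in;
            }
            failures++;
        }
    }
    return failures;
}
```

With the planted bug, every failing input is 0x80 or above, which is the "only when the high bit is set" kind of pattern that a complete sweep makes visible and a handful of spot checks can miss.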
Starting point is 00:37:44 But you know, we don't all have proper unit tests sometimes. It's a big secret. Don't tell anyone. But going back to scientific, I always think about, you know, the scientific method. And it has to do with defining the problem, which is one of the things I see people who get stuck do,
Starting point is 00:38:13 is they fail to define the problem. They say, it's broken. I'm like, okay, how is it broken? What is broken? What is it? It was working before, and now it's not. Yes, yes. And then throughout the whole explanation, the pronouns will always be it. And there'll be no definition. And I think defining the problem, this kind of goes back to the Googling it, which I'm pretty sure we did talk about. Yes, yes, that was after the show. That was after the recording. Okay.
Starting point is 00:38:39 But you have to define the problem. You can't just say, my system is not working. You have to say, my system is doing random things. Okay, what is random? What part of your system? And break down the problem. It's an iteration, right? I mean, you can start out from a user level and say,
Starting point is 00:39:00 the display doesn't look right. Okay, the display doesn't look right. Why? The display has garbage in the lower right-hand corner. Okay, what does the garbage look like? Oh, this is the four-year-old method. Yes, explain it to me like I'm a five-year-old or four-year-old.
Starting point is 00:39:16 And then the four-year-old will say, why? And then at some point, but at some point you're going to get down to it's not completing a DMA transaction correctly. Yeah. At X time when, you know, and you'll get, you'll start from a more general description of the problem and get to your more specific, and the more, you know,
Starting point is 00:39:32 in the limit, as you get to the most specific definition of the problem, oftentimes the solution is apparent. Yes. It's when you don't really know what's going on that it's still a problem. Yeah. Garbage in the bottom of the screen, that's a hard problem.
Starting point is 00:39:48 It's like, where do I start? Oh, my God, it's all broken. But DMA not completing in time. Well, that's something you can try to figure out. Right. So breaking down the problem. Or at least you know where the code is. Yes, now at least you can step through the code.
Starting point is 00:40:06 Before, you were stepping through all of the graphing libraries, and that wasn't possible. Now you have only a little bit. I do tend to get methodical and detail-oriented when I get frustrated, when I get anxious. And I think that helps. That's a good fallback method for me. Do you know it took me almost three years in my career
Starting point is 00:40:30 to realize detail-oriented was not a compliment? If that's showing up on your performance reviews, listeners, just realize that that is not actually a compliment. No comments. No, I'm trying to remember when, I mean, yeah. I guess it's not a compliment. It's not meant as a compliment. It's not meant as a compliment.
Starting point is 00:40:58 I didn't take it as an insult until later. I don't understand why it's an insult, especially for engineers. Well, the last person who said it to me, which hopefully will be the last person ever to have said it to me... You can be detail-obsessed, which is different than detail-oriented. I don't...
Starting point is 00:41:17 I was pointing out how all of the different architectures that were being proposed were unsuitable given the problem statement, because I actually understood the problem. And the person who was sure spouting off all of these impossible things was, "You're so detail-oriented." And I'm like, yeah, that's because I don't want this to totally... So to those people, that's a synonym to don't confuse me with facts.
Starting point is 00:41:47 Or don't confuse me with details. Detail-oriented means this is inconvenient. Please stop saying these truths. Yes. And I don't get too detail-oriented unless I'm doing something, debugging like this, or until I'm like, what you said to me 10 minutes ago
Starting point is 00:42:08 and what you say to me now are not the same. And here, look, all the things that are different. And I think to that guy, I said, you know, I wouldn't be detail-oriented if you had a clue of what you were saying. Yeah. And then I had a coup and he was no longer my boss. Yay!
Starting point is 00:42:29 Returning to whatever it is we were talking about today. Why does this keep happening? I don't know. I'm not even, you know, downing the cough syrup anymore. Maybe that's why this keeps happening. Um, strategies for dealing with insurmountable problems. At least that's what my notes say I'm supposed to be talking about. We've already gone through some.
Starting point is 00:42:47 Ooh, return to first principles. I don't know. And how is this supposed to work? We talked a little bit about what was it supposed to do, but sometimes you have to return to the first. I spelled principles wrong. I spelled it like school principal. I can never remember which. And now I want to return to my first principal.
Starting point is 00:43:03 He was really nice. I don't remember. Okay, so return to, you know, how is the science behind this supposed to work? That helps with algorithms. Okay, yeah. We recently saw some code where... Go back to a simulation, even.
Starting point is 00:43:19 Algorithms, people didn't want to change their algorithm because it was too complicated to re-verify. And I wanted to... It wasn't that they didn't want to change the algorithm. They didn't want to change the implementation. Right, because I wanted to do an FFT with fewer points because they weren't using the high points anyway. Yeah.
Starting point is 00:43:39 And they didn't want to re-verify it. That was an area where I really wanted to get very, let's talk about what the science is and the math and not worry about what the implementation is. And with getting stuck, sometimes I do that too. Return not to how is this supposed to work, but what are the goals? What is the big picture?
Starting point is 00:44:06 What was the specification? What was the specification? Which usually doesn't exist. Or can I rewrite the specification to include this bug? That's usually the best solution for all bugs. It's not a bug, it's a feature. This was part of the requirements, that occasionally, every third Tuesday, the wheels will fall off and the engine will start to smoke. There is a time when you can accept a bug. Um, maybe
Starting point is 00:44:35 not accept it and let the wheels fall off but trap right before the wheels are about to fall off just go ahead and reboot the system I worked on a system once and it was very odd because the previous system once upon a time I worked on a high availability internet thing that was supposed to have however many nines
Starting point is 00:44:59 you're supposed to have uptime through the year and it had all this redundant stuff you could pull out big pieces of hardware while it was running and stick new ones in. All this very complicated stuff to keep it up all the time. And the software could handle crashes. You could reinstall software while that software was running. Very difficult things.
Starting point is 00:45:21 Anyway, the company after that was this medical device that got turned on during a procedure and then turned off. And you know how little you end up caring about memory leaks and stuff that you used to care about when something's only on for an hour and then it gets shut off? It's kind of embarrassing. But there were times where we'd prioritize things like, oh, we've got to figure out
Starting point is 00:45:45 where this memory leak is, and then I'd realize, no, we really don't, because it's never going to cause a problem. It's going to take a week for that to show up and, you know, to run out of RAM, and it's never on for a week. Yeah. Well, it's not quality exactly, but sometimes you have to look at the reality of your situation and pick your battles. And accept that shipping a product is what the company is there for. Creating perfection is wonderful, but sometimes if you want to get paid next week, you have to ship the product this week. Yeah.
Starting point is 00:46:25 It kind of kills me when I have to do that, though. Well, but, I mean, you have to look at it as this could be an insurmountable problem to solve, or at least this could be a multi-week adventure. Yeah. And no one will ever see it. And that goes back to you saying declare it a feature, declare it as a non-problem.
Starting point is 00:46:45 And sometimes things that feel particularly icky, offend our sensibilities as engineers, don't actually have real-world impact. Well, there are so many bugs that I've seen. Boy, I just want to go through all of them
Starting point is 00:47:01 that only an engineer would ever notice. Customers aren't going to care and we really should have just shipped this product. But the engineers cannot get over one pixel of bad kerning. Hey, kerning is extremely important. You didn't even know what kerning was two years ago. And now...
Starting point is 00:47:23 I knew what it was like maybe a year ago. Yeah, and now, now that you know... I see it everywhere. And bad kerning hurts, doesn't it? Yeah, bad kerning hurts everybody. It's kind of like the time I did an A/B test on 128-kilobit MP3s and noticed that they sounded bad, and then I heard that they sounded bad from then on. But before that, they were fine. Yes, don't train yourself to hear dissonance, because everywhere you go there will be dissonance. Okay, um, sorry. So, headed back towards the topic, which was what again? Uh, it was strategies for dealing with insurmountable problems. One of the insurmountable problem categories is non-reproducible problems.
Starting point is 00:48:15 Well, if you can't ever reproduce it, then I think there's not much you can do. Again, it depends on the severity, too. So if it's something that, you know, somebody said, this happened once, and you look at it and go, wow, that's extremely bad. Well, actually, last week's fire drill had to do with it happened five of a thousand times. And when it happened, the unit was unusable. Which is bad.
Starting point is 00:48:43 So that, it was, yeah. And it was not a thousand. It wasn't one of these things that only happens once a year. It was, it took about an hour to get all five errors. Hey, but that's like two nines or so. Yeah, it was pretty good.
Starting point is 00:48:58 I had to look. Since they were just trying to do a demo, I'm like, you can't just hit the reset button fast enough. If all you want is a prototype demo, we can fix this right now. But no, to actually fix the problem, we had to dig deeper. But you have to make those things reproducible. You can't, if it takes 12 hours to reproduce a problem.
Starting point is 00:49:23 I've had those. I have too, but... I've had things that linger. And you think you've... The really insidious ones are... This is what ends up happening with non-reproducible problems to me a lot is you'll get a report. Maybe it has a log and some good information,
Starting point is 00:49:43 so you've got something to work with. But you can't reproduce it because in my case it usually was a network equipment where you had to have a network with 50 devices you know 20 of them are this brand 30 of them are this brand they're running these protocols here's the topology
Starting point is 00:50:01 and, you know, this happened and your router crashed. And so maybe you've done some good logging because, you know, in these situations you have to, because it's hard to reproduce. It takes multiple people to configure a lab like... And so you fix it. Maybe it was a poison packet of some kind. You had the length error. Oh, there it is. Okay, you fix it. You put it away and you deploy the update and everything's fine. Three weeks goes by, maybe longer. You're like, wow, okay, did it. You have to fix it.
Starting point is 00:50:38 You have to fix it. And then it happens again. It's always like the day after you tell your boss you're done. And I've had that happen three or four times in a row. And it's really damn embarrassing. But there's, if you can't reproduce it, you can't reproduce it. So you've just got to do your best to either take what information you have and make a best guess. Or, again, change the circumstances.
Starting point is 00:51:06 And sometimes you can reproduce a problem by cheating, by sort of putting in almost unit test-like things, but simulated entities within the code that kind of force the same sorts of situations. I mean, there's ways to kind of work around the edges of it without having to build an entire company that looks like the one that had a problem. Well, for many complex systems,
Starting point is 00:51:34 you can build entities that will cause the problem more. You know, if you think the problem has to do with the system not having enough time, okay, well, put a delay here or there and keep interrupts out and make sure it doesn't have enough time. Or if you think you're occasionally running out of RAM, and this happens a lot because as you soon,
Starting point is 00:51:56 you know, as you write more code, you just take up smidgens of more RAM. Yeah. And kind of you might crest over the edge once in a while and have a problem. And you might go back down when you clean up this other function because you don't need those variables.
Starting point is 00:52:06 Go into your map file and... Really? Bleep. Go into your map file. Go into your map file and adjust your RAM so you have 20k less or 10k less or something. Force the issue and see how your system performs. That's funny, I do it the other way.
Starting point is 00:52:28 Or do it the other way. I mean, you can go both ways depending on the situation. But yeah, I mean, if the problem goes away, when you have more RAM, then you're running out of RAM. And you try to figure out, is it heap RAM or is it stack RAM or is it static RAM? And you just keep breaking down the problem by decreasing those things
Starting point is 00:52:48 and causing the unreproducible thing to reproduce. I guess sometimes I remove resources because if you have a system that's supposed to be robust, you want to see how you're performing in resource-constrained situations. Do you know about the Netflix Monkey of Chaos? Yes. Yes, like that.
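Forcing a suspected out-of-RAM problem to show itself by taking memory away can also be approximated in code. The sketch below is a code-level stand-in for shrinking RAM in the linker configuration, not the method discussed in the conversation. It assumes a simple bump allocator over a static pool, and every name in it (`POOL_SIZE`, `pool_squeeze`, `pool_alloc`, `pool_reset`) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE (32u * 1024u)   /* assumed total heap for this sketch */

/* The union forces alignment suitable for ordinary objects. */
static union {
    uint8_t     bytes[POOL_SIZE];
    long long   align_ll;
    long double align_ld;
    void       *align_ptr;
} pool;

static size_t pool_used  = 0;
static size_t pool_limit = POOL_SIZE;   /* lower this to simulate less RAM */

/* Forget all allocations and restore the full pool. */
void pool_reset(void)
{
    pool_used  = 0;
    pool_limit = POOL_SIZE;
}

/* Pretend the part has bytes_removed less RAM than it really does. */
void pool_squeeze(size_t bytes_removed)
{
    assert(bytes_removed <= POOL_SIZE);
    pool_limit = POOL_SIZE - bytes_removed;
}

/* Bump allocator: returns NULL once the (possibly squeezed) limit is hit. */
void *pool_alloc(size_t size)
{
    size_t rounded = (size + 15u) & ~(size_t)15u;   /* keep 16-byte alignment */

    if (rounded < size) {
        return NULL;                                /* size overflowed */
    }
    if (pool_used > pool_limit || rounded > pool_limit - pool_used) {
        return NULL;                                /* out of (simulated) RAM */
    }

    void *block = &pool.bytes[pool_used];
    pool_used += rounded;
    return block;
}
```

Calling `pool_squeeze(20u * 1024u)` before running the suspect scenario makes allocation failures arrive early and repeatably. If the intermittent symptom then shows up every time, that points at heap exhaustion rather than at stack or static RAM.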
Starting point is 00:53:08 Do you want to explain to people who might not know what the Netflix Monkey of Chaos is? No, I was hoping you'd do it. As far as I understand it, they have code that runs through their system just randomly breaking things on their live production network. Yes.
Starting point is 00:53:26 And it's not on their test network. It's on their live production network. And so their system has to be resilient to faults and they just go ahead and inject faults all the time to prove to themselves that it is. I like that. I have never done that. I like that until whatever I'm watching stops working.
Starting point is 00:53:46 Well, yes. Blame the monkey. Blame the monkey. This is totally a monkey show. We're going to have to have a monkey title. Okay, so I think we've gone through most of my strategies. Oh, logging functions. We talked about how
Starting point is 00:54:05 printf can make timing things change and that can lead to problems that seem impossible but your logging functions don't have to go out to whatever is taking a long time to your serial port or to your formatting subsystem
Starting point is 00:54:20 you can make a logging system that is 10 bytes and you just adjust the bytes depending on where you are and that helps me trace through. Or a circular buffer of 10 events, whatever you can fit in RAM. And that gives you the history that I was talking about. And it gives you the traceability and history and for systems that don't have like my,
Starting point is 00:54:47 my six lines and, and trying to get them out and trying to figure out how my PWMs are supposed to be configured. It, it, that sort of thing helps with taking a snapshot of the system at lots of different points. And,
Starting point is 00:55:02 and you get some logging, but you don't have to pay the huge time dividend. Of course, it assumes you can debug it and you can reproduce it. So we're back to if you can't reproduce it, then the first thing to do is to reproduce it. The first thing to do is update your resume. Just quit.
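The small logging system mentioned above, a circular buffer of 10 events that never touches the serial port or a formatter, could be sketched roughly like this in C. The names (`trace_event`, `trace_snapshot`, `TRACE_DEPTH`) and the one-byte event codes are assumptions made for illustration. The sketch also assumes events are recorded from a single context, or with interrupts masked, since nothing here is atomic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_DEPTH 10u   /* "10 events"; size to whatever fits in RAM */

static volatile uint8_t trace_events[TRACE_DEPTH];
static volatile uint8_t trace_head;   /* next slot to overwrite */
static volatile uint8_t trace_fill;   /* valid entries, saturates at TRACE_DEPTH */

/* Record a one-byte event code: one store and two small updates,
 * with no formatting and no I/O. */
void trace_event(uint8_t code)
{
    trace_events[trace_head] = code;
    trace_head = (uint8_t)((trace_head + 1u) % TRACE_DEPTH);
    if (trace_fill < TRACE_DEPTH) {
        trace_fill++;
    }
}

/* Copy the most recent events, oldest first, into out[].
 * Returns how many were copied (at most out_len and TRACE_DEPTH). */
size_t trace_snapshot(uint8_t *out, size_t out_len)
{
    assert(out != NULL);

    size_t kept = trace_fill;
    if (kept > out_len) {
        kept = out_len;
    }

    /* Step back `kept` slots from the write position, wrapping around. */
    size_t start = (trace_head + TRACE_DEPTH - kept) % TRACE_DEPTH;
    for (size_t i = 0; i < kept; i++) {
        out[i] = trace_events[(start + i) % TRACE_DEPTH];
    }
    return kept;
}
```

In use, each interesting point in the code would call `trace_event()` with its own code, and the buffer would be read back from the debugger or dumped after the failure, when timing no longer matters.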
Starting point is 00:55:18 It's easier. Just quit. Oh, yes. We are both quite enjoying our current contracts and you can tell by our witty banter about quitting. Nothing wrong with my contracts. Oh, well, one of us is enjoying their contracts. The other one's trying to gracefully figure out how to say, you know, this wasn't what I agreed to do.
Starting point is 00:55:41 Okay. Well, more strategies? We're skipping the big one. What's the big one? Sleep. Wow, that's a whole thing. I know. We're going to do a whole show on sleeping.
Starting point is 00:55:53 You should sleep. Go for a walk. Walk away. Do something else. Anything else. Preferably something that isn't similar. Take your mind off it completely because sometimes your mind is working on it when you're not.
Starting point is 00:56:07 Your subconscious is really a much better programmer. Well, my subconscious is really a much better programmer than I am. And sleep is important. And these times when you're sitting there in the spotlight and everybody's like, is it done yet? Is it done yet? Oh my God, is it done yet? When they're hovering around you,
Starting point is 00:56:24 definitely just close your eyes and lean back in your chair until they go away. Until they go away. Yeah. Or call the ambulance because they think you've had a stroke. That's actually,
Starting point is 00:56:37 there have been a couple of times when I needed to walk away from the problem and that would have been the fastest way to solve it. But because everybody was hovering around and needing a solution right this instant, that avenue wasn't open to me. And it's really sad when you have to say, I'm sorry, I have to go to the bathroom
Starting point is 00:56:56 in order to just get away from your code for a second. But taking a few deep breaths, it really helps. And not feeling stuck. I think the feeling of being stuck, the panicky, anxious, I might as well work on my resume, my eyes. Freezing. Freezing.
Starting point is 00:57:21 In the end, it's just code. Even if it's hardware, there is a bug and you can fix it. And if you can't, there's always another job. No, no, no, really. It is just code. It will be okay. Well, I think not necessarily it will be okay because it might not.
Starting point is 00:57:44 But no, it is. I mean, computers, as much as we hate them, are deterministic most of the time. And, you know, something is wrong, and that something can be found. And sometimes just remembering that is... You know, there always is a solution. It may be a complicated, expensive, painful solution
Starting point is 00:58:04 that requires re-spinning hardware or doing something that a company might not like, but there's always a solution. We didn't talk about hardware bugs, and if you think it's hardware. My strategy for that is to assume it is, and then, well, I mean, I'm a software engineer. That's my strategy.
Starting point is 00:58:27 Yeah, hardware engineers assume it's software. Yeah. I assume it's a hardware bug, and now I figure out how to prove it to the hardware engineer. And not throw it over the wall, you have a bug sort of prove it. That's not proving it. But sit down and figure out,
Starting point is 00:58:44 if the hardware worked, it would do X. And it's doing not X. Now I need to see, is it doing X? And so I write my little test. And if it's doing X, then I'm like, well, maybe it's not a hardware bug. But if it isn't doing X, then now I can go to the hardware engineer and say, this is what it is. And look here, I wrote you a test to see.
Starting point is 00:59:05 Now you can just change whatever you want on the board and rerun the test. And as soon as it does X, we agreed that it works and it's all my fault. And so let me know when you fix your hardware. I'll be at the movies. Yes, seeing Lego. Well, that was the last thing on my list.
Starting point is 00:59:23 I'm stuck as for what else to cover. It's an insurmountable problem. Shall we call it done? I think it's done. It's done. It's done until we get a bunch of hateful feedback that we can talk about on the next show. Well, I am sure there are other ways
Starting point is 00:59:43 to paint ourselves into a corner, but we'll go with that one. I'm hoping our strategies are more useful than the problem set we thought up. And I wonder, you listeners, are there coping strategies we missed? Insurmountable problems you'd like to talk about? Contact us, email show at embedded.fm or hit the contact link on embedded.fm. Thank you for listening. And thanks to Christopher White,
Starting point is 01:00:12 podcast producer and my partner here at Logical Elegance, our embedded systems consulting firm. Thank you for being on. Yes. Oh, the final quote comes from a real life genius who understands being stuck. Stephen Hawking. I was going to say Houdini. That would have been good too.
Starting point is 01:00:27 No, no, no. Stephen Hawking says, it is no good getting furious if you get stuck. What I do is keep thinking about the problem, but work on something else. Sometimes it's years before I see a way forward. In the case of information loss and black holes, it was 29 years. I certainly hope your crashes don't take that long to find and fix.
Starting point is 01:00:52 In the meantime, have a good week!