Embedded - 38: Blame the Monkey
Episode Date: February 12, 2014
Producer Chris White (@stoneymonster) and Elecia discuss some insurmountable problems and some strategies for approaching them. Google it (or look on Stack Exchange). Explain the problem to someone... else... even if they aren't there (use a stuffed animal or write a really detailed email, anticipating potential questions). Draw a picture (system/subsystem architecture or code block diagram or a doodle). Make sure you are running what you think you are: start over from a blank slate, making no assumptions about how your hardware is programmed. Identify and verify your assumptions about all the pieces involved. Get scientific: define the problem, create a hypothesis, run an experiment, record the results. Small steps! Also: get methodological and write everything down. Return to first principles: how is this supposed to work? Revert to last known good and diff to find the cause of a new issue. Logging functions: they take time but can lead to a better trace, better picture. Make it reproducible: there is information in the solution if you can find the steps to repro. Step by step, reduce the steps until you can nab it in the act. Remove the voodoo. Avoidance: accept the bug (it's a feature!) and go on. Sleep, go for a walk, or work on something else.
Transcript
Welcome back to Making Embedded Systems, the show for people who love gadgets.
I'm Elecia White, and Chris White is joining me again today to chat about strategies for
dealing with insurmountable problems.
Apologies for my lingering stuffiness, I'll try not to blow my nose directly into the mic.
But right before I got this stupid cold,
I was doing some firefighting at a client,
and the algorithms guy,
who has been roped into doing the embedded work,
had dug himself a huge hole.
I came in and explained what he really needed was DMA to go with those SPI drivers,
and he'd really thank me for it later.
But at some point, as we were talking about what had gone wrong
and what he was trying to do to fix it,
he turned to me and asked if I ever got stuck, just stuck on a problem.
No. Well, yes, of course, but not for very long. And certainly not as well as he had done it.
But that's the plan for today, to talk about getting stuck and getting unstuck. As I mentioned
in the intro blurb, Chris White is turning around his producer chair to chat through this with me.
Hi, Chris. Thank you for joining me.
Hi.
Do you know what I mean by stuck?
Yes, and I take issue with your not stuck for very long.
I've certainly had things I've been stuck on for weeks.
Is that not very long?
Well, stuck. What do you mean by stuck then?
Well, unable to come up with an explanation for a difficult problem, sometimes because it's difficult to reproduce, sometimes because it's so complex that it's difficult to find a framework to interrogate what's going on.
The generalities I understand,
and I think we're going to be doing a lot of those today,
but do you have any specifics?
What things?
Well, I mean, I don't want to get deep into this immediately,
but things on complex systems where there's multiple threads happening
and maybe you have a race condition or corruption that only happens under certain very specific circumstances, maybe with external hardware, speaking to your hardware, you know, those kind of situations where the graph of interactions is complicated.
So complex systems having a not unreproducible problem,
but a difficult to reproduce problem.
Yeah, or even, again, I mean,
it doesn't have to be difficult to reproduce.
Sometimes it can be.
Having a problem that interacts with lots of its subsystems.
Yeah.
Okay.
And sometimes you dig into those and you think you know what's going on.
And this is kind of the pattern that happens to me a lot is I'll see a problem and go,
oh, I know exactly what's happening.
And I'll go fix the exactly what's happening and that won't be it at all.
Or I'll have made something much, much, much worse.
Or I'll fix the problem but cause a new sort of twisted version of the same problem to appear. So it's like you're pushing around the edges. And
the more complex the system is, it seems to me, the more often you end up in that situation where,
okay, I'm just going to fix this one line of code and everything will be fine. But what you really
do is you have a cascade effect that, you know, leads to another problem. And
I've been in a situation where it's weeks to stabilize something. You know, you
started here and you started pulling on the thread, and pretty soon you didn't realize...
Yeah, you've unraveled the entire sweater. You should just start over.
Yeah.
Well, but that wasn't actually... I mean, you never stopped in there. You never said, oh my God, I am stuck.
Oh, you kept pulling the thread.
I mean, not... right. Okay, so I see what you mean by stuck: just giving up. It's the feeling of, oh my God, I'm never
going to fix this and I don't even have a clue what to do next.
I've certainly felt that way, but I keep trying something else.
Yeah. Even if that something else doesn't really quite approach it. I mean,
even if that something else seems kind of pointless, I never stop.
Right. Or if it's really bad, then I just quit that job.
I used to feel stuck often.
So I think it's an experience thing.
I definitely, early in my career,
I would just get hit with these things and I had,
you know, well, how do you begin to design this system?
It's just a wall and, and there's no, there's no footholds.
You, you just, it's just this huge thing.
Um, and I felt stuck that way.
And, uh, memory. Before I really understood memory, right, definitely having pointer problems, like, I don't even know how to go about solving this.
Well, and not to sound like the get-off-my-lawn guy, but get off my lawn: back when
we were starting out there was less pervasive information on the internet.
You know, you couldn't go to Stack Exchange
because it didn't exist.
It's true.
And, you know, nowadays,
there's so many developers approaching so many problems
because this industry has exploded
that there's very few things that you might encounter
that somebody else hasn't at least encountered
a similar incarnation of.
Well, you get those complex systems, and nobody has this set of Lego blocks.
Yeah, but that's a rare situation, I think. And so, I mean, it sounds almost like cheating, but
a lot of times when I,
the first thing I do when I run into a problem these days is to Google the exact error message
or the exact situation.
And a lot of the problems I run into these days,
and I think a lot of other people do,
is not so much,
oh, I've written all this code from scratch
and it doesn't work.
It's the interactions between things you're piecing together
like third-party APIs and things.
It's like, wow, how do I make this work this particular way
which isn't really standard?
And those kind of questions are generally things
that other people have asked.
Okay, so I was going to make a list
of the different types of problems we'll talk about
and then work on strategies to that.
All right.
Okay, so that was dealing with third-party libraries.
And you don't have the right visibility
because it's a black box.
Hardware problems is an area that,
as a software engineer, when I started out,
if it was a hardware problem,
or even if I thought it might be a hardware problem,
I was kind of stuck as to where to go next.
Random crashes.
Random crashes, yeah.
Stack overflows and hard fault handlers.
Definitely the I don't know where to start problem
used to hit me a lot.
Race conditions.
Timing and race conditions.
Things that happen that don't show up
when the debugger is attached.
That never happens to me.
What?
That never happens to me.
All of your bugs show up when the debugger is attached, and only when the debugger is attached?
Oh, that's a different set.
Oh, it's the rebugger.
Let's see, problems that happen occasionally. The stuff that doesn't reproduce
easily.
The stuff that is clearly not software's fault.
Which is the default state of all problems to start with.
Things that aren't software, they're clearly not software's fault.
Those are cosmic rays.
Hey, that actually happens.
It does.
It really, really does.
And if you have to plan your software for it, wow.
I worked on a system that it was guaranteed to happen all the time
because it had so much RAM.
Yeah.
Okay.
Let's see.
From Twitter, I asked a question about what insurmountable problems.
Cache coherence, garbage collection, invalid pointers, and off-by-one errors.
I definitely see the invalid pointers.
Cache coherence, yeah.
Most, I mean, gosh.
Unless you're architecting the system, you know, the hardware system yourself,
I don't know that many of us run into that these days.
Yeah, I don't worry too much about cache coherence or garbage collection, because garbage collection, what do you... yeah, I'm on Java. Java.
Okay, so we have our list of problems, and now let's go through and figure out
where you would start. Oh, and it worked yesterday.
That was, that one.
That one used to be, I used to be so, so bad at that
and just get so tangled up with, oh my God, this worked yesterday.
It worked when I did it.
Well, reproducible and it works for me
is kind of a separate thing.
Okay, so that's my list of things that I have felt.
Oh my God, I might as well just quit.
This is never going to be fixed.
Let's just scram, start over.
Burn it down.
If I just plug it in backwards,
no one will realize it's software's fault.
So strategies.
And I wasn't going to do this one for one,
but maybe like now that we have our list of problems,
try to work on which strategies work really well.
And Google it.
That is by far the first thing.
Well, I mean, Stack Exchange has become a huge resource.
Even for embedded.
Yeah.
I used to say, well, yeah, Stack Exchange was nice if you were working on a PC. But now even for embedded, there's a lot of ARM stuff and even some TI DSP stuff there.
And I often feel like I should go there and start answering some questions at this point.
I do.
I feel like it's irresponsible not to participate in such a pretty amazing community.
And yet, still not signing up for Stack Exchange.
And that goes for Googling it, phrase it different ways.
And I'm thinking about the guy at the client,
he couldn't have Googled his problem
because he couldn't state his problem.
He didn't know anything other than
there didn't seem to be enough time in the system.
Well, that's a skill.
I mean, you have to,
I mean, you can flippantly say Google it,
but if you don't really understand how to,
if you don't understand your problem well enough
to construct a search, then yeah, now you're in real trouble.
Yeah.
But one way to construct a search or some way I've talked about before, having a second pair of eyes, even if they aren't actually there, talk it through with someone.
And if you don't have somebody to talk it through with,
that's okay.
Use a teddy bear.
They're really, really effective
because just going through the problem is enough.
Yeah.
Or a duck.
Wasn't there a duck?
Yeah, there have been ducks too.
And a frog.
My frog was my bug eater when I worked at LeapFrog.
But I used to also email, like prepare this in-depth, what is the problem email and what
steps I've taken to colleagues.
Yeah.
And then never send it.
No, that's a good idea.
A lot of, I mean, a lot of us, at least myself,
you get in the habit of sitting in front of your computer
and the only thing you really type or write is code
or the occasional short email.
And it can be very powerful to get your thoughts organized
in a way that perhaps they aren't normally organized when you're coding.
Because you're, you know,
it's a different
application of your thinking.
Yeah, and it gets you
out of the monkey press
try over and over.
Which can be very useful sometimes.
It can be very useful, but
when you get to these insurmountable problems,
that's when you have to stop doing that. I guess.
You don't agree? I think I'm defending my
usual mode of debugging probably a little too heavily.
No, I mean, this is not, oh, there's a bug,
I have to fix it, sort of. Yeah, I know. I really want to go after those
things that at one time in your career
you would have said, I'm hosed.
I don't know.
But when I write these emails, usually it's to somebody I respect a lot.
Yeah.
And as I'm writing it, I'm trying to answer all of their questions.
So just to step back, and I know we want to get into specifics,
but just stay general.
At least in my past,
a lot of these insurmountable problems
come in terrible times.
Like off the manufacturing floor
or from a customer report.
And you start looking at it
and you realize there's a problem,
but it's a deep problem.
You know, it starts to tickle the insurmountable bit in the back of your brain and make you nervous. But, you know, it's one thing if you're in the midst of developing a new product or something
and you've come across this situation and you've got, you know, two, three weeks to bang on it.
it's another thing if you've got all of management hovering around you
saying, why isn't this fixed yet? And you're saying, well, you know, here's my long list of
stuff that you don't understand at all that I've gone through and they don't care. They just want
to know why software is not working. So that, to circle back, I think that's a good reason to have, okay, this is a, you know,
this is a somewhat challenging problem. Here's why. And be able to say that in a way that makes
sense.
Oh, that's really hard. I mean, even if you were assuming that the people who are asking you for constant status updates were engineers at one
point in their life, you can't just break down and say, oh, well, you need DMA to go with those
SPI drivers and blah, blah, blah. And they're like, yeah, I totally know what SPI is and I know what
DMA is, but quit treating me like I'm an idiot, because I have no idea what you're talking about. And it's this weird combination of don't condescend to me and at the same time condescending back.
It's strange.
But I mean, you see that problem, right?
Oh, yeah.
Because being in their shoes, actually, having been a manager and having been the vulture asking for status updates, I really don't care what the problem is.
I just want to know when you're going to fix it.
And telling me it's insurmountable doesn't actually help me figure out when
we're going to ship this puppy.
So yeah,
it's fine to have these detailed analyses of what's wrong,
but can you make it sound bad enough that people are
going to leave you alone for a little while, and good enough that they're not just going to fire you?
Sound too bad and they'll fire you too. Well, how did you let it get this bad?
Or sometimes they'll throw more people on the problem, when that isn't helpful at all. Now you just have
two people staring at the screen going, I don't know.
What do you think?
I don't know.
I used to work with a guy when these problems would come up.
He would say back to whoever found it or whoever was harassing him, well, that's impossible.
That just can't happen.
Well, that's what these problems feel like.
That's totally...
That's been my first reaction to a lot of these.
Well, that's just impossible.
That can't happen.
I'm sorry.
You're breaking the laws of physics.
It's not going to work that way.
I mean, there's an if statement right there
that you're past with the conditional that isn't true.
Yes, yes.
If A equal equal B, then do this.
And then you go in the debugger and B doesn't equal A
and you did it anyway.
Yeah.
Stupid computers.
They never do what you tell them to.
Actually, they do exactly what you tell them to.
Yes, but sometimes somebody else told them something
and you don't know what it is.
What were we talking about?
Insurmountable problems and a list of various kinds and strategies
by which one might solve them.
Okay.
So what do you do?
Once I get past the monkey stage. Yeah, because, I mean, I always start with monkeying around.
Yeah, tweaks.
Um, so it depends on the kind of problem, but generally, uh, I'll pull out a pad of paper.
I like that.
And I'll start drawing maybe a block diagram of what's going on. If it's a messaging or multi-thread kind of interaction, draw a timeline to try to make sense of what I think at least should be happening.
And then you can go back and you can verify against that to say, okay, is this what's really happening?
There's a whole bunch of things you just said there.
The first is checking your assumptions, verifying what you think is supposed to happen.
Because occasionally these insurmountable problems
have been little tweaks
where it was actually doing exactly what I expected.
I just didn't quite think it would work that well
or in that order.
And so verifying the assumptions
with what was supposed to happen,
especially with message passing in complex systems.
That one we should highlight.
And also, you said draw a picture.
I think drawing a picture is a solution
for 60% of all bugs.
Hard bugs, not easy bugs.
You draw a related picture, though,
not like of an elephant or something.
Oh, really?
No, no, no, you doodle.
Fantastical doodles.
No, no, I have a lab notebook.
You have many lab notebooks.
We actually got our lab notebooks personalized.
That's how we are dependent on our lab notebooks.
For each client, I fill in the first two pages
and then draw somewhere else.
That's true.
I don't always use my lab notebook for this.
But drawing a picture of what's supposed to happen,
or I mentioned where do you start.
Those insurmountable, this system is too big,
there's no way I'm going to be able to do it.
Drawing a picture is breaking into parts.
When you have memory corruption errors or stack fault errors,
drawing a picture and figuring out,
okay, I know my stack is overflowing, but I don't know where.
When you draw a little picture of a stack overflowing,
and now you have... it helps my mental image as well as having a physical image. Even if it's a
crappy sketch of a physical image, it helps my mental image of what's happening.
Um, and then paper, uh, because when I'm in that monkey stage where I'm just, oh, you know, I'll
try this, I'll tweak this, I'll push go, I'll tweak this, I'll push go.
I don't necessarily keep track of what I've done.
No.
And I will often do the same thing over and over again out of irritation.
It's not irritation.
It's a different I word.
That's not true.
Many things you do over and over again and get a different result.
But yeah, no, I didn't. I'll forget that I've done, like you said, I'll write down, you know,
keep track of what you've done because I will try the same thing over and over again.
Sometimes just because I forgot if I did it or maybe I didn't trust that I really did what I
thought I did.
Well, and sometimes it does act differently because there was some step that you didn't
realize you did.
Yeah.
That was critical, and so it worked that time and it didn't work this time. And
so the paper for me is a way of developing the steps to reproducibility
and then tweaking them. You know, get rid of the voodoo and figure
out the science.
Yeah.
Um, what else?
So what else is on your paper when you're, you're stuck?
Well, um, you know, again, it depends on the kind of problem.
Um, but a lot of times some sort of model of, of what the system looks like,
the software,
um,
the whole system,
just maybe starting with the piece that's broken.
Uh,
you know,
I'll write things down like values and stuff that I,
that I expect to be there.
Sometimes with a debugger,
it's hard to keep track of how things change.
You know,
it doesn't keep a history of the particular variable or a register or something.
So sometimes it's valuable to, okay, this was at time zero, it was this.
At time one, it was this.
At time three, it was this.
It's nice to just kind of have a backtrace of the state of the system.
Because most debuggers only give you,
this is what the system's like right now.
Next statement.
Okay, this is what the system is like right now.
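[Editor's sketch] The pencil-and-paper history described here (value at time zero, time one, and so on) can also be kept by the code itself. What follows is only a minimal illustration, not something from the episode: `StateTrace`, `status_reg`, and `rx_count` are made-up names, and a real embedded target would more likely use a fixed-size buffer in C.

```python
from collections import deque


class StateTrace:
    """Keep the last `depth` snapshots of named values, oldest first."""

    def __init__(self, depth=8):
        # A bounded deque drops the oldest snapshot once it is full.
        self._history = deque(maxlen=depth)

    def record(self, step, **values):
        # Store a copy so later changes by the caller do not rewrite history.
        self._history.append((step, dict(values)))

    def changes(self):
        """Return (step, name, old, new) for each value that changed between snapshots."""
        out = []
        previous = None
        for step, values in self._history:
            if previous is not None:
                for name, new in values.items():
                    old = previous.get(name)
                    if old != new:
                        out.append((step, name, old, new))
            previous = values
        return out


# Hypothetical values, standing in for what you would read off the debugger.
trace = StateTrace(depth=4)
trace.record(0, status_reg=0x00, rx_count=0)
trace.record(1, status_reg=0x01, rx_count=0)
trace.record(2, status_reg=0x01, rx_count=3)
trace_changes = trace.changes()
for step, name, old, new in trace_changes:
    print(f"step {step}: {name} {old} -> {new}")
```

The point is the same as the notebook: the debugger shows one instant, while the trace shows which value moved at which step.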
Well, and you can sometimes go up in the call stack
and see what everything was before you called this function.
It's still what was there.
It's still...
It's still a very instantaneous picture.
It's not going back in time, though. It's the values at that time.
Uh, and getting the picture of what's happening now, that goes back to if you have a last known good code.
Right.
Well, yeah. Um, figure out what the different...
I mean, last known good code doesn't ever crash.
This one crashes.
Diff. I'd start with diff.
Diff, for sure.
Or start with, you know, git log and blame.
But sometimes that hasn't worked well for me
for one reason or another.
But running the code side by side
in two computers, two devices.
Oh, I've never done that.
Step, step, step.
Okay.
That has been useful.
For those, oh my God, this can't be happening.
All I did was change a comment.
So it can't be the code that's different.
Sort of.
Usually it's all I did was add a printf.
Well, printfs are kind of easy though.
Because printfs change the timing.
But you have to realize that's what you did.
Yes, yes.
Printfs, well, all the timing ones,
you can do things to replace them.
And I guess that's a toolbox sort of thing.
If it works without printf,
or it works with printf,
and then when you go without,
then you can start putting delays in for printfs
and see if that is really all that's happening
or if there's another characteristic
that is involved with printf.
Because printf also takes a bunch of RAM
and it does some stacky things.
Yeah, it can blow your stack away,
especially if you've got a serious printf with the...
Percents.
The percents.
Is that what we're calling them? The variable string arguments?
I don't know. The thing that, you know, formats.
Formatting, that's it. Formatting, the percents.
Boy, we're really staying organized today.
This is how we debug too. We just bounce around.
Isn't this how everybody debugs? I mean, really. Where were we?
You were asking me what I did first.
Okay, so write things down.
And I think your second suggestion was good.
If it's something that's changed over time,
figure out when the last known good state was and either run it side by side,
which is something I've never done.
Or diff.
Definitely diff.
Diff is your friend.
Or if you don't have the capability to run it side-by-side
and you have some logging, run it the old way and get a full log
and then run it the new way and get a full log.
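[Editor's sketch] One way to compare a full log from the old build against a full log from the new one is to strip whatever changes on every run (timestamps, here) and diff the rest. This is a rough illustration using Python's standard `difflib`. The log format, the `[seconds]` timestamp prefix, and the messages are all invented for the example.

```python
import difflib
import re

# Assumed log format: "[<seconds>] message". Adjust the pattern to the real logs.
TIMESTAMP = re.compile(r"^\[\s*\d+(?:\.\d+)?\]\s*")


def normalize(lines):
    # Timestamps differ on every run, so remove them before comparing.
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def log_diff(old_lines, new_lines):
    """Return only the lines that were removed ('- ') or added ('+ ') between two runs."""
    diff = difflib.ndiff(normalize(old_lines), normalize(new_lines))
    return [d for d in diff if d.startswith(("- ", "+ "))]


# Invented logs from a last-known-good run and from the current run.
old_run = ["[0.001] boot", "[0.010] spi init ok", "[0.250] sensor ready"]
new_run = ["[0.002] boot", "[0.011] spi init ok", "[0.900] sensor timeout"]
differences = log_diff(old_run, new_run)
for line in differences:
    print(line)
```

Stripping timestamps hides timing changes, so if timing is the suspect, it may be worth keeping them and comparing the deltas instead.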
And it's always my luck that one of the things we put in with the new one
was the logging, and so I ended up backporting the logging to the old one.
And then it doesn't work anymore.
No, sometimes it works.
Yes, of course. And then the logging is your problem. Ta-da.
Oh, we didn't talk about... uh, we talked a little bit about, you know, sometimes it runs and it works,
and sometimes you do the same thing and it doesn't work.
Yeah. Uh, I remember one whole 48 hours, vultures hovering over me,
waiting for me to get it done, period.
And I wasn't running the code I thought I was
the whole time.
The whole time.
Is it plugged in?
Yes, yes.
Don't forget.
But I think that's helpful with debugging sometimes is unplug everything and start over today fresh
with a known you did all the steps.
It's too easy to, you know,
we have Flash in our processors now.
And if you have a system with three processors,
did you actually run the Flash program for everything?
You know, it's really easy to get flustered and then start,
and I make fun of the monkey stuff because we all do it,
but it's really easy to start messing around
when you don't really know what the state of things is.
Well, yeah, because...
You don't, I mean, you may have your hardware put together wrong.
And, you know, it may not be that your problem is the things you set up wrong.
It may be that in the course of debugging, when you were acting like a monkey,
you did something that you didn't really notice or pay attention to.
And now you're in a different state.
And not even really debugging the problem you thought you were.
Oh, yeah.
Well, that happens a lot.
I had a stack problem, and I didn't know.
I didn't realize I was overflowing the stack. And so my algorithm was running badly,
and I spent forever figuring out what the algorithm's problem was.
And the truth was there were a couple of bugs in there.
So I would fix them and something different would happen.
You know, this makes us sound like such great engineers.
Let's just keep talking about how much we suck.
Sorry, go ahead.
So, you know, what was I going to say?
Well, one thing I found about a lot of these problems
is that you start out with a notion.
I think I sort of said this.
You start out with a notion of what it is,
and you might be proven wrong as you go through,
as you find out that your initial impression is even worse.
But a lot of times there's more than one thing going on.
It's true.
You may have two bugs that are kind of interacting with each other, you know, and
in isolation either one of them may not have happened, or in isolation either one of them
would have been easy to find and fix.
Right.
But maybe you have stack corruption and maybe you're
running out of RAM, and so now everything's just all over the place and nothing makes sense.
Yeah.
So that actually, the algorithm thing and that together,
we had James Grenning on the show and he talked about the test-driven development
and running tests for different areas.
And I think some of these insurmountable problems
fall to that sort of testing,
to the rigorous,
every line you put in has to be tested.
And if it's not testable on its own,
why is it a line of code?
Yeah, no, I think that's true on a module-to-module basis.
I think if you do good unit tests,
you shouldn't really run into these kind of situations on a per-module basis.
But that leads us back to stack corruptions.
Yeah.
And it leads us back to complexities in big systems.
That's almost like the difference between, you know,
testing in a test tube where, you know,
you could cure cancer with bleach.
Controlled circumstances.
And testing in a live being where, you know, obviously that would be bad.
But unit testing is often, and I, you know,
I think this is the way a lot of people do it.
They, you know, they surround the particular modules
with the test framework and they interact with it.
And okay, it works fine there.
But it's not necessarily on the real system, or it's
obviously not interacting with the, you know, real callers. Um, so your unit test is only as good
as, you know, your imagination, to some extent.
And you can simulate the whole system and have
your unit test be perfect, but shouldn't you spend some of that time and energy building a system instead of a simulation?
Well, I think unit tests are good,
but I do think they have their limits.
And I think they have their limits
in particularly in complicated message-passing,
multi-threaded architectures
where there's combinatorial problems that...
Well, you say message-passing and threads,
and I hear...
Not embedded.
Semaphores and...
Sure, interrupts.
Yeah, because one of the problems a client recently had dealt with having multiple interrupts stack up.
And they weren't nesting them, but then when you were done with this interrupt, you'd lost all of them?
No, no, they could do it. They went to
the next interrupt, and they even had them prioritized correctly. But if this low-priority
one came and it took too long to finish, five interrupts later was when you saw the problem,
because they had run out of time. But it wasn't until the next hard real-time interrupt occurred that
you realized you had totally shot yourself in the foot.
Okay, so, I mean, that sort of "where is
everything" goes back to visibility, which you mentioned earlier, that sometimes you're working
in systems where you don't get to see what you
want to see. I think you said that with black boxes and third-party libraries.
Um, I don't think I've actually said it on the air.
Oh, man.
So we can get to that.
No, no, you did. You said...
I said something about third-party libraries, but... oh, I guess, yes, you're right.
You're reading ahead.
I'm not reading ahead so much as... You imagine me saying things.
You could just do my side of this, too.
Yeah, well, okay,
as long as we're being completely discombobulated.
It's just so good to be an A-plus show.
Okay, discombobulated, yes?
Well, there's only one show, so there's only one grade.
So if we're grading on a curve, we have to be the best.
According to your logic, I am stuck as a C student for the rest of my life.
What? No.
That's not how it works.
We're high-scoring this.
High score, low score.
Okay, so going back to problems. Um, yeah, there's a lot of times where,
all else being equal, you know, you could have solved something quickly, but
you can't see what's going on. And that's masking the problem.
Yes.
So times like this are, you know, you have hardware that you don't understand.
Maybe you bought it.
Maybe your double E is gone or...
Uncommunicated.
Uncommunicated.
Whatever.
Whatever.
Mean.
So you have this, you know, this black box piece of hardware that's doing something that it may be doing the right thing,
but you don't have visibility into its internal workings enough to kind of do the things you need to do to debug your problem.
You know, keep track of state and diagram things out.
So there's that.
There's third-party software, which people use a lot even in embedded situations
these days, where you buy, I don't know, an RTOS or a graphics library or something, and something
weird is happening. And again, it may not even be with their stuff, but you're using their stuff
and it would be nice if you could see this, but you can't. Or your debugger is not very good.
Which, definitely, in an embedded system is possible.
Or you're incredibly deeply embedded and you don't have a debugger,
or your hardware doesn't have the test points it turns out you need,
all kinds of things like that.
Yeah, I had that happen recently.
I was doing a complicated PWM system to drive a motor,
a three-phase motor, and we were sine commutating.
I was outputting the PWM on TTL levels,
and it was going to a motor controller where it was then going to FETs,
and then it was finally...
Boba Fett?
No, FETs.
So it kept up the voltage and current,
and then it was finally going out to the motor.
And so I was driving six signals that ended up being, at the end,
three signals out to the motor.
I couldn't see the six signals.
And it was one ball grid array and one fine-pitch part.
And I was sure that my signals were doing what I said they were doing.
And the people I was working with said,
oh, you just have to do this.
As long as the signals look like this,
the output will be fine.
And the output didn't look fine.
They were out of their minds.
It's fine.
It's fine.
It was so bad.
And the motor sounded awful.
That's normal.
I needed to see my six lines to prove
that I was driving what they wanted.
Smoke means working.
For this one, it did.
Blew a lot of FETs.
But it was one of those, I guess being stuck is about being powerless.
Well, you know, it feeds back into feelings of inadequacy after enough time.
But this isn't,
this isn't imposter syndrome.
We talked about that.
That was a good conversation,
but this is,
this is,
but,
but,
but it can make things worse because you can get,
you can get anxious and you know,
it's like being a,
it's like taking an exam,
right?
Where you blank out.
Yeah.
And you can blank out for,
I blanked out for a long time
on some of these problems the whole day.
It's a big cubicle of education.
It takes a nap or, you know,
sleeping overnight to, you know,
let your subconscious puzzle over it.
But going back to black boxes and things
and not having visibility,
I mean, I don't,
you know, some of the techniques I've used there
are kind of scientific exploration.
Okay, change the circumstances, change the inputs,
and see how it changes the outputs.
See how changing the initial conditions
of whatever is happening,
and I don't have a good example right now.
It could be an algorithm.
It could be some message transaction, you know, okay, insert a new message, change the timing,
and just, you know, try to get a feel for the parameters of the problem. You know,
people say garbage in, garbage out. Okay, change the garbage on one side and see how the garbage changes on the other side.
And sometimes you can get a sense for,
oh, this is what's going on because I did this
and it reacted this way.
That means it's not doing X or doing Y.
And this is where the piece of paper helps me,
is defining.
You create the hypothesis, you put the garbage in, you try it out,
and you record the results,
and you take tiny steps and see,
and then you notice that,
oh, it's only when the high bit is set that this happens.
Right, and sometimes you can be really methodical.
If you've got an input that's eight bits,
all right, it's doing this in this one,
fine, I'm going to give it every damn possible input.
When you accept it's going to take two hours to get through this,
then you just go ahead.
Then you come up with a two hour plan and you can be exhaustive.
And sometimes that's great because maybe,
maybe you have four problems and you've only seen one.
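The exhaustive eight-bit sweep described above can be sketched in C. This is only an illustration under stated assumptions: `halve_reference`, `halve_under_test`, and `sweep_all_inputs` are hypothetical names, and the "bug" (sign-extending when the high bit is set) is invented here to show how a full sweep makes the pattern visible.

```c
#include <stdint.h>
#include <stdio.h>

/* Reference behavior (assumed for this sketch): halve the input. */
uint8_t halve_reference(uint8_t x)
{
    return (uint8_t)(x / 2u);
}

/* Hypothetical buggy version: sign-extends when bit 7 is set,
 * as if the value had been shifted as a signed number. */
uint8_t halve_under_test(uint8_t x)
{
    if (x & 0x80u) {
        return (uint8_t)((x >> 1) | 0x80u);
    }
    return (uint8_t)(x >> 1);
}

typedef struct {
    unsigned failures;          /* inputs where output != reference */
    unsigned failures_high_bit; /* failing inputs that had bit 7 set */
} sweep_result_t;

/* Feed every possible 8-bit input to both functions and record mismatches. */
sweep_result_t sweep_all_inputs(uint8_t (*dut)(uint8_t),
                                uint8_t (*ref)(uint8_t),
                                int verbose)
{
    sweep_result_t r = {0, 0};
    for (unsigned i = 0; i <= 0xFFu; i++) {
        uint8_t in = (uint8_t)i;
        uint8_t got = dut(in);
        uint8_t want = ref(in);
        if (got != want) {
            r.failures++;
            if (in & 0x80u) {
                r.failures_high_bit++;
            }
            if (verbose) {
                printf("in=0x%02X got=0x%02X want=0x%02X\n",
                       (unsigned)in, (unsigned)got, (unsigned)want);
            }
        }
    }
    return r;
}
```

If every failure also counts as a high-bit failure, that is the "only when the high bit is set" observation, recorded rather than guessed.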
Yes,
that's true.
And of course that goes back to unit testing.
If you had a proper unit test for that,
you would have found all of these to begin with.
But you know, we don't all have proper unit tests sometimes.
It's a big secret.
Don't tell anyone.
But going back to scientific,
I always think about, you know, the scientific method.
And it has to do with defining the problem,
which is one of the things I see people who get stuck do
is they fail to define the problem. They say, it's broken. I'm like, okay, how is it broken?
What is broken? What is it? It was working before, and now it's not.
Yes, yes. And then throughout the whole explanation, the pronouns will always be it. And there'll be no definition.
And I think defining the problem,
this kind of goes back to the Googling it,
which I'm pretty sure we did talk about. Yes, yes, that was after the show.
That was after the recording.
Okay.
But you have to define the problem.
You can't just say, my system is not working.
You have to say, my system is doing random things.
Okay, what is random?
What part of your system?
And break down the problem.
It's an iteration, right?
I mean, you can start out from a user level and say,
the display doesn't look right.
Okay, the display doesn't look right.
Why?
The display has garbage in the lower right-hand corner.
Okay, what does the garbage look like?
Oh, this is the four-year-old method.
Yes, explain it to me like I'm a five-year-old or four-year-old.
And then the four-year-old will say, why?
And then at some point, but at some point you're going to get down to it's not completing a DMA transaction correctly.
Yeah.
At X time when, you know,
and you'll get,
you'll start from a more general description of the problem
and get to your more specific,
and the more, you know,
in the limit,
as you get to the most specific definition of the problem,
oftentimes the solution is apparent.
Yes.
It's when you don't really know what's going on
that it's still a problem.
Yeah.
Garbage in the bottom of the screen, that's a hard problem.
It's like, where do I start?
Oh, my God, it's all broken.
But DMA not completing in time.
Well, that's something you can try to figure out.
Right.
So breaking down the problem.
Or at least you know where the code is.
Yes, now at least you can step through the code.
Before, you were stepping through all of the graphics libraries,
and that wasn't possible.
Now you have only a little bit.
I do tend to get methodical and detail-oriented
when I get frustrated, when I get anxious.
And I think that helps.
That's a good fallback method for me.
Do you know it took me almost three years in my career
to realize detail-oriented was not a compliment?
If that's showing up on your performance reviews, listeners,
just realize that that is not actually a compliment.
No comments.
No, I'm trying to remember when, I mean, yeah.
I guess it's not a compliment.
It's not meant as a compliment.
It's not meant as a compliment.
I didn't take it as an insult until later.
I don't understand why it's an insult, especially for engineers.
Well,
the last person who said it to me, which hopefully will be the last person ever
to have said it to me...
You can be detail-obsessed,
which is different than
detail-oriented.
I don't...
I was pointing out how
all of the different architectures that were
being proposed were
unsuitable given the problem
statement, because I actually understood the problem. And the person who was sure spouting
off all of these impossible things was, you're so detail-oriented. And I'm like, yeah, that's because
I don't want this to totally... So to those people, that's a synonym for don't confuse me with facts.
Yes.
Or don't confuse me with details.
Detail-oriented means this is inconvenient.
Please stop saying these truths.
Yes.
And I don't get too detail-oriented
unless I'm doing something, debugging like this,
or until I'm like,
what you said to me 10 minutes ago
and what you say to me now are not the same.
And here, look, all the things that are different.
And I think to that guy, I said,
you know, I wouldn't be detail-oriented
if you had a clue of what you were saying.
Yeah.
And then I had a coup and he was no longer my boss.
Yay!
Returning to whatever it is we were talking about today. Why does this keep happening?
I don't know.
I'm not even, you know, downing the cough syrup anymore.
Maybe that's why this keeps happening.
Um, strategies for dealing with insurmountable problems.
At least that's what my notes say I'm supposed to be talking about.
We've already gone through some.
Ooh, return to first principles.
I don't know.
And how is this supposed to work?
We talked a little bit about what was it supposed to do,
but sometimes you have to return to the first.
I spelled principles wrong.
I spelled it like school principal.
I can never remember which.
Okay, now I want to return to my first principal. He was really nice.
I don't remember.
Okay, so return to, you know,
how is the science behind this supposed to work?
That helps with algorithms.
Okay, yeah.
We recently saw some code where...
Go back to a simulation, even.
The algorithms people didn't want to change their algorithm
because it was too complicated to re-verify.
And I wanted to...
It wasn't that they didn't want to change the algorithm.
They didn't want to change the implementation.
Right, because I wanted to do an FFT with fewer points
because they weren't using the high points anyway.
Yeah.
And they didn't want to re-verify it.
That was an area where I really wanted to get very,
let's talk about what the science is and the math
and not worry about what the implementation is.
And with getting stuck, sometimes I do that too.
Return not to how is this supposed to work,
but what are the goals?
What is the big picture?
What was the specification?
What was the specification?
Which usually doesn't exist.
Or can I write the specification to include this bug?
That's usually the best solution for all bugs.
It's not a bug, it's a feature.
This was part of the requirements, that occasionally, every third Tuesday, the wheels
will fall off and the engine will start to smoke.
There is a time when you can accept a bug. Um, maybe
not accept it and let the wheels fall off, but trap right before the wheels are about to fall off,
just go ahead and reboot the system.
I worked on a system once,
and it was very odd because
the previous system...
Once upon a time I worked on
a high-availability internet thing
that was supposed to have
however many nines
you're supposed to have, uptime through the year.
And it had all this redundant stuff.
You could pull out big pieces of hardware while it was running
and stick new ones in.
All this very complicated stuff to keep it up all the time.
And the software could handle crashes.
You could reinstall software while that software was running.
Very difficult things.
Anyway, the company after that was this medical device
that got turned on during a procedure and then turned off.
And you know how little you end up caring about memory leaks
and stuff that you used to care about
when something's only on for an hour and then it gets shut off?
It's kind of embarrassing.
But there were times where we'd prioritize things like,
oh, we've got to figure out
where this memory leak is. And then I'd realize, no, we really don't, because it's never going to
cause a problem. It's going to take a week for that to show up and, you know, to run out of RAM, and
it's never on for a week. Yeah. Well, it's not quality exactly, but sometimes you have to look at the reality of your situation and
pick your battles.
And accept that shipping a product is what the company is there for. Creating perfection
is wonderful, but sometimes if you want to get paid next week, you have to ship the product
this week.
Yeah.
It kind of kills me when I have to do that, though.
Well, but, I mean, you have to look at it as
this could be an insurmountable problem to solve,
or at least this could be a multi-week adventure.
Yeah.
And no one will ever see it.
And that goes back to you saying declare it a feature,
declare it as a non-problem.
And sometimes things that feel particularly
icky, offend our
sensibilities as engineers,
don't actually
have real-world impact.
Well, there are so many
bugs that I've seen.
Boy, I just want to go through all of them
that only an engineer
would ever notice.
Customers aren't going to care and we really should have just shipped this product.
But the engineers cannot get over
one pixel of bad kerning.
Hey, kerning is extremely important.
You didn't even know what kerning was two years ago.
And now...
I knew what it was like maybe a year ago.
Yeah, and now that you know, I see it everywhere. And bad kerning hurts, doesn't it? Yeah, bad kerning
hurts everybody. It's kind of like the time I did an A/B test on 128-kilobit MP3s and
noticed that they sounded bad, and then I heard that they sounded bad from then on.
But before that they were fine. Yes, don't train yourself to hear dissonance, because everywhere
you go there will be dissonance. Okay, um, sorry. So, headed back towards the topic, which was what
again? Uh, it was strategies for dealing with insurmountable problems.
One of the insurmountable problem categories is non-reproducible problems.
Well, if you can't ever reproduce it, then I think there's not much you can do.
Again, it depends on the severity, too.
So if it's something that, you know,
somebody said, this happened once,
and you look at it and go, wow, that's extremely bad.
Well, actually, last week's fire drill had to do with it happened five of a thousand times.
And when it happened, the unit was unusable.
Which is bad.
So that, it was, yeah.
And it was not a thousand.
It wasn't one of these things
that only happens once a year.
It was, it took about an hour
to get all five errors.
Hey, but that's like two nines or so.
Yeah, it was pretty good.
I had to look.
Since they were just trying to do a demo,
I'm like,
you can't just hit the reset button fast enough.
If all you want is a prototype demo, we can fix this right now.
But no, to actually fix the problem, we had to dig deeper.
But you have to make those things reproducible.
You can't, if it takes 12 hours to reproduce a problem.
I've had those.
I have too, but...
I've had things that linger.
And you think you've...
The really insidious ones are...
This is what ends up happening with non-reproducible problems to me a lot is
you'll get a report.
Maybe it has a log and some good information,
so you've got something to work with.
But you can't reproduce it because in my case it usually
was network equipment,
where you had to have a network with
50
devices, you know, 20 of
them are this brand, 30 of them are this brand,
they're running these protocols, here's the topology,
and, you know,
this happened and your router crashed.
And so maybe you've done some good logging, because, you know, in these situations
you have to, because it's hard to reproduce. It takes multiple people to
configure a lab like that. And so you fix it. Maybe it was a poison packet of some kind. You had the length error. Oh, there it is. Okay, you fix it.
You put it away and you deploy the update and everything's fine.
Three weeks goes by, maybe longer.
You're like, wow, okay, did it.
You have to fix it.
And then it happens again.
It's always like the day after you tell your boss you're done.
And I've had that happen three or four times in a row.
And it's really damn embarrassing.
But there's, if you can't reproduce it, you can't reproduce it.
So you've just got to do your best to either take what information you have and make a best guess.
Or, again, change the circumstances.
And sometimes you can reproduce a problem by cheating,
by sort of putting in almost unit test-like things,
but simulated entities within the code
that kind of force the same sorts of situations.
I mean, there's ways to kind of work around the edges of it
without having to build an entire company
that looks like the one that had a problem.
Well, for many complex systems,
you can build entities that will cause the problem more.
You know, if you think the problem has to do with the system
not having enough time,
okay, well, put a delay here or there
and keep interrupts out
and make sure it doesn't have enough time.
Or if you think you're occasionally running out of RAM,
and this happens a lot because as you soon,
you know, as you write more code,
you just take up smidgens of more RAM.
Yeah.
And kind of you might crest over the edge once in a while
and have a problem.
And you might go back down
when you clean up this other function
because you don't need those variables.
Go into your map file and...
Really?
Bleep.
Go into your map file.
Go into your map file and adjust your RAM
so you have 20k less or 10k less or something.
Force the issue and see how your system performs.
That's funny, I do it the other way.
Or do it the other way.
I mean, you can go both ways depending on the situation.
But yeah, I mean, if the problem goes away,
when you have more RAM, then you're running out of RAM.
And you try to figure out, is it heap RAM or is it stack RAM
or is it static RAM?
And you just keep breaking down the problem
by decreasing those things
and causing the unreproducible thing to reproduce.
I guess sometimes I remove resources
because if you have a system that's supposed to be robust,
you want to see how you're performing
in resource-constrained situations.
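One way to remove resources on purpose, without touching the hardware, is to put an artificial budget in front of the allocator. The sketch below assumes the code under test can be pointed at a wrapper instead of calling `malloc` directly; `budget_malloc` and `test_set_heap_budget` are hypothetical names, and the budget is deliberately simplistic (it is not refunded on `free`).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Bytes the wrapper may still hand out. SIZE_MAX means "no artificial limit". */
static size_t heap_budget_bytes = SIZE_MAX;

/* Test hook (hypothetical): pretend the heap only has this many bytes left. */
void test_set_heap_budget(size_t bytes)
{
    heap_budget_bytes = bytes;
}

/* Stand-in for malloc() in the code under test. Once the budget is spent
 * it returns NULL, the way a smaller heap would, so the out-of-memory
 * paths get exercised on demand instead of once in a while. */
void *budget_malloc(size_t size)
{
    if (size > heap_budget_bytes) {
        return NULL;
    }
    void *p = malloc(size);
    if (p != NULL) {
        heap_budget_bytes -= size;
    }
    return p;
}
```

Shrinking the budget until the intermittent failure happens every time is the same idea as taking 10k or 20k away from the RAM region, just applied to the heap only. Stack and static RAM would need a different lever.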
Do you know about the Netflix Monkey of Chaos?
Yes.
Yes, like that.
Do you want to explain to people who might not know
what the Netflix Monkey of Chaos is?
No, I was hoping you'd do it.
As far as I understand it, they have
code that runs through their system
just randomly breaking things
on their live production network.
Yes.
And it's not on their test network.
It's on their live production network.
And so their system has to be resilient to faults
and they just go ahead and inject faults all the time
to prove to themselves that it is.
I like that.
I have never done that.
I like that until whatever I'm watching stops working.
Well, yes.
Blame the monkey.
Blame the monkey.
This is totally a monkey show.
We're going to have to have a monkey title.
Okay, so I think we've gone through most of my strategies.
Oh, logging functions.
We talked about how
printf can make timing things
change and that can lead to problems
that seem impossible
but your logging functions don't
have to go out to
whatever is taking a long time
to your serial port or
to your formatting subsystem
you can make a logging system
that is
10 bytes and you just adjust
the bytes depending on where you are and that helps me
trace through. Or a circular buffer of 10 events, whatever
you can fit in RAM. And that gives you the history
that I was talking about. And it gives you the traceability and history
and for systems that don't have like my,
my six lines and,
and trying to get them out and trying to figure out how my PWMs are supposed
to be configured.
It,
it,
that sort of thing helps with taking a snapshot of the system at lots of
different points.
And,
and you get some logging,
but you don't have to pay the huge time dividend.
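A circular buffer of ten one-byte events, as described above, might look roughly like this in C. The names (`event_log_t`, `event_log_put`, `event_log_dump`) and the choice of one byte per event are assumptions made for the sketch. It is also not interrupt-safe as written; sharing it between an ISR and main code would need a critical section around `event_log_put`.

```c
#include <stdint.h>

#define EVENT_LOG_SIZE 10u

typedef struct {
    uint8_t events[EVENT_LOG_SIZE]; /* one byte per event code */
    uint8_t next;                   /* slot the next event goes into */
    uint8_t count;                  /* valid entries, tops out at EVENT_LOG_SIZE */
} event_log_t;

void event_log_init(event_log_t *log)
{
    log->next = 0;
    log->count = 0;
}

/* Record one event: a store and an index update, no formatting,
 * no serial port, so the timing impact stays small. */
void event_log_put(event_log_t *log, uint8_t code)
{
    log->events[log->next] = code;
    log->next = (uint8_t)((log->next + 1u) % EVENT_LOG_SIZE);
    if (log->count < EVENT_LOG_SIZE) {
        log->count++;
    }
}

/* Copy the history out oldest-first. 'out' must hold EVENT_LOG_SIZE bytes.
 * Returns the number of events copied. */
uint8_t event_log_dump(const event_log_t *log, uint8_t *out)
{
    /* Until the buffer wraps, the oldest entry is slot 0;
     * after that it is the slot about to be overwritten. */
    uint8_t start = (log->count < EVENT_LOG_SIZE) ? 0u : log->next;
    for (uint8_t i = 0; i < log->count; i++) {
        out[i] = log->events[(start + i) % EVENT_LOG_SIZE];
    }
    return log->count;
}
```

The dump can happen later, from a debugger or after the failure, when spending time on formatting no longer changes the behavior being chased.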
Of course, it assumes you can debug it and you can
reproduce it. So we're back to
if you can't reproduce it,
then the first thing to do is to reproduce it.
The first thing to do is update your resume.
Just quit.
It's easier. Just quit.
Oh, yes.
We are both quite enjoying our current contracts
and you can tell by our witty banter about quitting.
Nothing wrong with my contracts.
Oh, well, one of us is enjoying their contracts.
The other one's trying to gracefully figure out how to say,
you know, this wasn't what I agreed to do.
Okay.
Well, more strategies?
We're skipping the big one.
What's the big one?
Sleep.
Wow, that's a whole thing.
I know.
We're going to do a whole show on sleeping.
You should sleep.
Go for a walk.
Walk away.
Do something else.
Anything else.
Preferably something that isn't similar.
Take your mind off it completely
because sometimes your mind is working on it when you're not.
Your subconscious is really a much better programmer.
Well, my subconscious is really a much better programmer than I am.
And sleep is important.
And these times when you're sitting there in the spotlight
and everybody's like, is it done yet?
Is it done yet?
Oh my God, is it done yet?
When they're hovering around you,
definitely just close your eyes
and lean back in your chair
until they go away.
Until they go away.
Yeah.
Or call the ambulance
because they think you've had a stroke.
That's actually,
there have been a couple of times
when I needed to walk away from the problem
and that would have been the fastest way to solve it.
But because everybody was hovering around
and needing a solution right this instant,
that avenue wasn't open to me.
And it's really sad when you have to say,
I'm sorry, I have to go to the bathroom
in order to just get away from your code for a second.
But taking a few deep breaths,
it really helps.
And not feeling stuck.
I think the feeling of being stuck, the panicky, anxious,
I might as well work on my resume, my eyes.
Freezing.
Freezing.
In the end, it's just code.
Even if it's hardware,
there is a bug and you can fix it.
And if you can't, there's always another job.
No, no, no, really.
It is just code.
It will be okay.
Well, I think not necessarily it will be okay because it might not.
But no, it is.
I mean, computers, as much as we hate them,
are deterministic most of the time.
And, you know, something is wrong,
and that something can be found.
And sometimes just remembering that is...
You know, there always is a solution.
It may be a complicated, expensive, painful solution
that requires re-spinning hardware
or doing something that a company might not like,
but there's always a solution.
We didn't talk about hardware bugs,
and if you think it's hardware.
My strategy for that is to assume it is,
and then, well, I mean, I'm a software engineer.
That's my strategy.
Yeah, hardware engineers assume it's software.
Yeah.
I assume it's a hardware bug,
and now I figure out how to prove it to the hardware engineer.
And not throw it over the wall,
you have a bug sort of prove it.
That's not proving it.
But sit down and figure out,
if the hardware worked, it would do X.
And it's doing not X.
Now I need to see, is it doing X?
And so I write my little test.
And if it's doing X, then I'm like, well, maybe it's not a hardware bug.
But if it isn't doing X, then now I can go to the hardware engineer and say, this is
what it is.
And look here, I wrote you a test to see.
Now you can just change whatever you want on the board
and rerun the test.
And as soon as it does X,
we agreed that it works and it's all my fault.
And so let me know when you fix your hardware.
I'll be at the movies.
Yes, seeing Lego.
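The "if the hardware worked, it would do X" test handed to the hardware engineer could be as small as a register read-back check. The sketch below is an assumption-heavy illustration: X is taken to be "an 8-bit register reads back what was written", the access functions are passed in through a hypothetical `reg_io_t`, and the `good_*` and `stuck_*` functions are simulated stand-ins for real hardware (the second with bit 3 stuck low).

```c
#include <stdint.h>

/* How to reach the register under test. On a real board these would
 * wrap memory-mapped or bus accesses; here they are assumptions. */
typedef struct {
    void    (*write)(uint8_t value);
    uint8_t (*read)(void);
} reg_io_t;

/* Write each single-bit pattern and its complement, then read back.
 * Returns a mask of bits that ever read back wrong; 0 means it did X. */
uint8_t check_register_readback(const reg_io_t *io)
{
    uint8_t bad = 0;
    for (unsigned bit = 0; bit < 8u; bit++) {
        uint8_t pattern = (uint8_t)(1u << bit);
        uint8_t inverse = (uint8_t)~pattern;

        io->write(pattern);
        bad |= (uint8_t)(io->read() ^ pattern);

        io->write(inverse);
        bad |= (uint8_t)(io->read() ^ inverse);
    }
    return bad;
}

/* Simulated hardware for the sketch: one healthy register,
 * one with bit 3 stuck low. */
static uint8_t good_reg;
static uint8_t stuck_reg;

void    good_write(uint8_t v)  { good_reg = v; }
uint8_t good_read(void)        { return good_reg; }
void    stuck_write(uint8_t v) { stuck_reg = (uint8_t)(v & ~0x08u); }
uint8_t stuck_read(void)       { return stuck_reg; }
```

The returned mask names the failing bits, so the same test can be rerun after each board change until it comes back zero.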
Well, that was the last thing on my list.
I'm stuck as for what else to cover.
It's an insurmountable problem.
Shall we call it done?
I think it's done.
It's done.
It's done until we get a bunch of hateful feedback
that we can talk about on the next show.
Well, I am sure there are other ways
to paint ourselves into a corner,
but we'll go with that one.
I'm hoping our strategies are more useful than the problem set we thought up.
And I wonder, you listeners, are there coping strategies we missed?
Insurmountable problems you'd like to talk about?
Contact us, email show at embedded.fm or hit the contact link on embedded.fm.
Thank you for listening.
And thanks to Christopher White,
podcast producer and my partner here at Logical Elegance,
our embedded systems consulting firm.
Thank you for being on.
Yes.
Oh, the final quote comes from a real life genius who understands being stuck.
Stephen Hawking.
I was going to say Houdini.
That would have been good too.
No, no, no.
Stephen Hawking says,
it is no good getting furious if you get stuck.
What I do is keep thinking about the problem,
but work on something else.
Sometimes it's years before I see a way forward.
In the case of information loss and black holes, it was 29 years.
I certainly hope your crashes don't take that long to find and fix.
In the meantime, have a good week!