Embedded - 421: Paint the Iceberg Yellow
Episode Date: July 21, 2022

Chris Hobbs talks with Elecia about safety-critical systems. Safety-critical systems keep humans alive. Writing software for these embedded systems carries a heavy responsibility. Engineers need to understand how to make code fail safely and how to reduce risks through good design and careful development.

The book discussed was Embedded Software Development for Safety-Critical Systems by Chris Hobbs.

This discussion was originally for Classpert (where Elecia is teaching her Making Embedded Systems course) and the video is on Classpert's YouTube if you want to see faces.

There were many terms with letters and numbers; here is a guide:

IEC 61508: Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems; relates to industrial systems and forms the foundation for many other standards
ISO 26262: Road vehicles - Functional safety; extends and specializes IEC 61508 for systems within cars
IEC 62304: specifies life cycle requirements for the development of medical software and software within medical devices; it has been adopted as a national standard in many jurisdictions and therefore can be used as a benchmark to comply with regulatory requirements
MISRA C: a set of software development guidelines for the C programming language
DO-178C and DO-178B: Software Considerations in Airborne Systems and Equipment Certification; the primary documents by which certification authorities such as the FAA, EASA, and Transport Canada approve all commercial software-based aerospace systems
ISO/IEC 29119: Software and systems engineering -- Software testing
ISO 14971:2019: Medical devices — Application of risk management to medical devices
IEC 62304:2006: Medical device software — Software life cycle processes
Transcript
Welcome to Embedded. I am Elecia White, and this week we have something a bit different for you.
I got to sit down with Chris Hobbs, author of Embedded Software Development for Safety-Critical
Systems, as part of my Classpert lecture series. I have a quick
warning for you. Around minute 55, we start talking about lawyers and self-harm. If that's
a topic that would bother you, please feel free to skip it as soon as you hear the word lawyer.
In the meantime, I hope you enjoy our conversation.
Welcome. I'm Elecia White, and I'm going to be talking with Chris Hobbs, author of Embedded
Software Development for Safety-Critical Systems. Chris, thank you for talking with me today.
Thank you for inviting me.
Could you tell us about yourself as if we met at an Embedded Systems Conference?
Okay. Yes. As you say, I'm Chris. I work for BlackBerry QNX. I work in our kernel development group.
That is the group that works on the heart of the operating system, if you like. The operating system itself
is certified for use in safety-critical systems, and I have particular responsibility for that side
of things, ensuring that what we produce meets the safety requirements that are placed on it.
What exactly is a safety critical system?
Well I noticed on the introduction slide you had there that you spoke about software and aircraft,
software and nuclear reactors, software and pacemakers and yes those are all safety critical.
More prosaically we also have safety critical software in Coca-Cola machines these days.
It came as a bit of a surprise to me as well, actually, when I first met this one.
But in the olden days, Coca-Cola machines used to dispense cans or tins of drink, and that was straightforward.
Now, the more modern ones actually mix your drink.
So you could choose, I want diet Coke, cherry flavor, something, something.
And the software mixes your drink for you.
One of the ingredients is caffeine.
Caffeine is poisonous.
So if the software goes wrong and puts too much caffeine in your drink, it'll poison you.
So suddenly Coca-Cola machines and soft drinks machines in general become safety critical.
And so, yes, what used to be a relatively small field with railway systems, nuclear power stations and what have you, is now expanding.
The other one I've worked on recently is these little robot vacuum cleaners that run around your floor. When you go to bed, you leave the robot vacuum cleaner running around. If they reach the top of the stairs, they're supposed to stop.
And that's a software controlled thing. If they don't stop and they go over the stairs,
then of course they could kill some child or something sitting at the bottom of the stairs.
So suddenly robot vacuum cleaners are also safety critical. So anything that could potentially damage a human being or the environment, we consider to be a safety critical system.
And it is, as I say, a growth area at the moment.
Some of those sound like normal embedded systems.
I mean, the robot vacuum particularly and the Coke machine. But how are safety-critical systems different from just pedometers
and children's toys?
Yes, I think that's a big question.
And part of the answer, I think, comes down to the culture of the company
that is building the system.
The safety culture of a number of companies, particularly in the aviation industry, has been questioned recently, as I'm sure you've realized.
And it is this safety culture that underlies what is required to produce a system that is going to be safety critical.
And also another difference is lifetime.
I mentioned there it's not just human life that we deal with for safety,
it's also the environment.
So, for example, at the bottom of the North Sea off the east coast of Britain,
there are oil rigs.
And buried down at the bottom of the oil rig or the seabed, there are these
embedded devices which, if the pressure builds up too much, will actually chop off the pipe that
goes up and seal the oil well to prevent an environmental disaster. If that happens,
then it costs millions and millions and millions of dollars to get that back again. But of course, if it happens, it's going to happen. Replacing the software in that system, or even upgrading it, is extremely difficult and extremely costly. So unlike a child's toy or something like that, you may have software here which is required to work for 30, 40 years without attention. So that's another difference, I think, between it and the toy software.

So what do you do differently?

Yes, that's an interesting question.
One of the international standards, ISO 26262, which is one of the standards for road vehicles, basically,
has in its second part, part two there, some examples of what makes a good safety culture in a company
and what makes a poor safety culture.
The trouble is, of course, we are software people.
I mean, my work, I spend my life programming computers.
We are not used to this unmeasurable concept of a safety culture.
So all we can look for is examples.
And this is a subset.
There's a page full of these in ISO 26262, but just to take a couple of examples here: heavy dependence on testing at the end of the development cycle to demonstrate the quality of your product is considered poor safety culture. This is also an important one: the reward system favors cost and schedule over safety and quality.
We've seen that recently in a large U.S. aircraft manufacturer,
which put countdown clocks in conference rooms to remind engineers
when their software had to be ready.
And, you know, at the companies I work for, if you're working on a non-safety-critical system, then your bonus each year depends on whether you deliver on time. If you're working on a safety-critical system, your bonus never depends on whether you deliver on time. The reward system in a good safety culture will penalize those who take shortcuts. So I think the fundamental answer to your question of what is different when I'm doing safety work is the safety culture within the company. Now, how you apply that safety culture to produce good software, that's another question. But yes, the safety culture is fundamental to the company and to the development organization.
So if the safety culture... I mean, why do we even get so many failures? We hear about things failing, we hear about cars with unintended acceleration and oil rigs failing. Is it just about not having the right culture?

It is in part about having the wrong culture and not having the right culture. But something else that's been observed fairly recently is the concept of SOTIF: safety of the intended functionality. Are you familiar with this concept, or perhaps I could give a quick description?

Only from your book, go ahead.

Okay. So a lot of people have better examples, but this is the example I give. Traditionally, the way we have looked at a safety-critical system is that a dangerous situation occurs when something fails or something malfunctions.
So the idea is this thing failed, therefore something happened and someone got hurt. There was a study done not that long ago, particularly in the medical world,
where they discovered that 93% of dangerous situations occurred when nothing failed.
Everything worked exactly as it has been designed to work.
I've got an example here.
I mean, it's one I made up.
I can give you a more genuine one if you wish. But let's assume that we're in an autonomous car. It's traveling on the road, and there's a manual car right close behind us. A child comes down the hill on a skateboard towards the road. The camera system will pick that up and give it to the neural network or the Bayesian network or whatever that's doing the recognition. It recognizes that it's a child with 80% probability. Remember, this will never be 100%. It could also be a blowing paper bag that's wandering along the road, or it could be a dog. But the camera system has correctly recognized that this child on the skateboard is a child. The analysis system correctly measures its speed as being 15 kilometers an hour. Great. The decision system now rejects the identification as a child, because children do not travel at 15 kilometers an hour unless they're on bicycles. This child we've identified is not on a bicycle, no wheels there, so it's not a child. It is probably the blowing paper bag which we identified earlier. And remember, that was all done correctly.

As a human, I'm like, no, that's not true. If you identified it as a child, then there has to be another reason.
You can't just ignore the information coming in.
So how do we end up in that box?
This is the problem.
Nobody thought when they were putting the system together of a child on a skateboard.
Children only go at 15 kilometers an hour if they're on bicycles.
So we didn't consider that.
And I'll come to that in a moment because that is also really important.
We didn't consider that situation.
So the decision system says either I'm going to hit a paper bag
or I'm going to apply the brakes hard and possibly hurt the person
in the car behind me.
So correctly, it decides not to brake. So the point there is that everything did exactly what
it was designed to do. Nothing failed, nothing malfunctioned, nothing went wrong. Every subsystem
did exactly what it should do. But you're right in your indignation there that we forgot that children can travel at 15 kilometers an hour if they are on a skateboard.
Now, the concept of an accidental system. There was a study done where they took a ship in the North Sea off the east coast of Britain,
a large ship, and they sailed it to an area where they were jamming GPS.
They just wanted to see what would happen to its navigation system.
That's good.
What happened was, of course, that the navigation system went wrong. You can find on the internet pictures of where this ship thought it was. It jumped from Ireland to Norway. The ship was jumping around like a rabbit. That was expected. If you jam the GPS,
then you expect that you're not going to get accurate navigation. What was not expected
was that the radar failed. So they went to the radar manufacturer and said, hey, why did your radar fail just because we jammed the GPS?
And he said, no, we don't use GPS. There's no GPS in our radar.
They went to the people who made the components for the GPS, sorry, for the radar.
And one of them said, oh, yeah, we use GPS for timing.
And that's super common. I mean, GPS has a one pulse per second.
I've used it in inertial measurement systems. It's just, it's really nice to have.
Yep, absolutely. And the trouble here was that that component manufacturer didn't know
that their system was going to go into a radar. The radar manufacturer didn't know that that component was dependent on GPS.
So we had what's called an accidental system.
We accidentally built a system dependent on GPS.
Nobody would have tested it in the case of a GPS failure
because no one knew that it was dependent on GPS.
The argument runs that a lot of the systems we're building today are so complicated and so complex that we don't know everything about
how they work. I understand that with the machine learning, but for your example, somebody routed the wire from the GPS to the component.
Presumably.
Or the component had it integrated into it, and therefore they just plugged the component in.
So this was an accidental system.
And the idea is that we cannot predict all of the behavior of that system.
And SOTIF, safety of the intended functionality, everything worked.
Nothing failed.
Nothing failed at all, but we got a dangerous situation.
Everything in that radar worked exactly as it was designed to work,
but we hadn't thought of the total consequences of what we'd built. There are a lot of examples of SOTIF. Nancy Leveson gives one which is apparently a genuine example. The US military had a missile at one Air Force base that they needed to take to a different Air Force base. So obviously what you do is you strap it on the bottom of an aircraft and you fly it from one base to the other. But that on its own would be a waste of time and a waste of fuel, so they decided what they would do was put a dummy missile also on that aircraft, and when it got up to altitude it would intercept another U.S. aircraft, and they would fire the dummy missile at the other aircraft, just for practice.
I think you can see what's going to happen here.
So, yes, it took off with the two missiles on.
It intercepted the other U.S. aircraft.
The pilot correctly fired the dummy missile.
That caught you.
You were thinking otherwise there.
But the missile control system was designed so that if you fired missile A,
but missile B was in a better position to shoot that aircraft down,
it would fire missile B instead.
And in this case, there was an antenna in the way of the dummy missile, so the missile control software decided to fire the genuine missile. It destroyed the aircraft. The pilot got out, don't worry, it's not a sad story. But again, everything worked perfectly. The pilot correctly fired the dummy missile. The missile control software did exactly what it was supposed to do. It overrode the pilot and fired the other missile.
See, that one makes a lot more sense to me because it was trying to be smart.
Yes.
And that's where everything went wrong.
So much of software is trying to be clever, and that's where everything goes bad.
Yeah, and I think you could make that argument with my example of the child on the skateboard: the system was being clever by saying, no, that's not a child, because it can't be a child if it's traveling at 15 kilometers an hour and it's not on a bicycle. So again, the software was trying to be smart and failing.

But that one, it was trying to be smart in a way that doesn't make sense to me, because things change in the future. Kids get, I don't know, magnetic levitation skates, and suddenly they're zipping all over. But yeah, the missile makes more sense. Okay, so how do we... you mentioned the safety culture, but what about tactics? How do we avoid these things? I mean, I've heard about risk management documentation.

Right. We start any development that's going to be certified in any way with what we call a hazard and risk analysis.
You have to be a bit careful about these terms, hazard and risk,
because they differ from standard to standard.
The way I use it is the iceberg is the hazard.
It's a passive thing. It's just sitting there.
The risk is something active.
The ship may run into it.
Other standards, on the other hand, would say the hazard is a ship running into the iceberg.
So we have to be a bit careful about terminology.
But to me, the iceberg is the hazard.
The risk is running into it.
And so we do a hazard and risk analysis on the product, and using brainstorming and various techniques (there is an ISO standard on this), we identify the hazards associated with the product and the risks associated with them.
We then have to mitigate those risks,
and anything that's mitigating the risk becomes a requirement,
a safety requirement on the system.
So if you take the iceberg, we may decide to paint the iceberg yellow
to make it more visible.
Okay, silly idea.
The iceberg, we're going to paint it yellow.
So there is now a safety requirement that says the iceberg must be painted yellow.
Okay.
And there is then still a residual risk, that painting icebergs yellow doesn't help at night. The hazard and risk analysis is the fundamental point, because that is what defines what the risks are, what we're going to do to mitigate them, and what requirements we make. So typically, there will be a requirement for each mitigation.
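As a concrete illustration (not a format from any standard or from QNX's process; the field names and the SR-001 identifier are invented), here is a minimal Python sketch of how one such hazard-and-risk record might be captured:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RiskRecord:
    hazard: str                       # the passive thing (the iceberg)
    risk: str                         # the active event (the ship running into it)
    mitigations: List[str] = field(default_factory=list)  # each one becomes a safety requirement
    residual_risk: str = ""           # what is still left after mitigation

# The iceberg example from the conversation, captured as one record.
iceberg = RiskRecord(
    hazard="iceberg in the shipping lane",
    risk="the ship runs into the iceberg",
    mitigations=["SR-001: the iceberg shall be painted yellow"],
    residual_risk="yellow paint does not help at night",
)

# Every mitigation listed here is a derived safety requirement on the system.
for requirement in iceberg.mitigations:
    print(requirement)
```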
And then at the other end of the development (so we set up the hazard and risk analysis at the start), one thing we have to deliver is our justification for why we believe we have built a sufficiently safe system. And there's two things in that, of course: what is sufficiently safe, and how do you demonstrate that you have met it?

Yes.
So back to the skateboard and child: the hazard is the child, and the risk is the chance of hitting the child, correct? If I understand your terminology.

Yes, using my terminology.

We might mitigate that by saying anything that we aren't sure about, we're not going to hit. And then the residual risk still is we might brake suddenly and thereby hurt the person behind us.

Yes, that is the residual risk. If you remember, there was an
incident a while back in the US with a woman at night walking across the road pushing a bicycle.
She was hit by an autonomous car, an Uber.
The car had initially been programmed in the way you stated: if I do not recognize it, then I will stop. They found it was stopping every 10 minutes or so because there was something there that hadn't been anticipated. So they changed it to say, we'll stop if we see one of these, or one of these, or one of these. And a woman pushing a bicycle was not one of those. A woman riding a bicycle would have been okay.

So I just want to go to their team and say,
you can't do this.
I want to throw down a heavy book and say,
you need somebody on your team who thinks through all of these problems,
who actually has, well, I'm not going to say insulting things,
but who has the creativity of thought to consider the risks that clearly that team did not have.
How do we get people to that point, to that "I'm not in the box, I want to think about everything that could happen, not just what does happen"?

There's two things that are happening there. One is that last month, actually, a SOTIF standard came out with a sort of semi-structured way of considering the safety of the intended functionality.
But also this safety case that I mentioned earlier, the thing that identifies why I believe my system is adequately safe is actually going to try to answer that question.
And this is one of the things that's happening to our standards at the moment. Most of what are called safety standards at the moment are prescriptive. They tell you what you must do: you must use this technique, this technique, this technique; you must not do this; you must not do that. The trouble with that is that it takes 10, 15 years to bring in a standard, and in that time techniques change. The software world is changing very, very rapidly. So basically, by doing that, you are burning in the need to apply 10- or 15-year-old technology, which is not good. So there's a series of standards coming out now, like UL 4600, which are what are called goal-based standards, G-O-A-L, goal, as in football goal. They say: we don't care how you build your system, we don't care what you've done, but demonstrate to me that your system is adequately safe.
And that, I think, is where your imagination comes in: sitting down and imagining awkward situations, asking what would happen if this were to happen. UL 4600 gives a couple of good examples, actually. It's for autonomous cars, basically, and it gives one example of an autonomous car: there is a building that's on fire, there are fire engines outside, there are people milling around, lots of people on the road. Someone has used their cell phone to call their car, and their autonomous car has arrived to pick them up. Now what it's doing, of course, is having to go on the wrong side of the road. There are fire hoses, there are people, there are what have you. Could your autonomous car handle that situation? And you're right, we need people of imagination to look at these situations.
Now, the person who produced UL4600 has also published a number of papers that say
that a lot of these incidents that your car may meet during its life are long tail. It is unlikely that any car, any particular car,
will meet the situation of an aircraft landing on a road in front of it.
But inevitably, over the course of 10 years,
a car somewhere will meet the incident of an aircraft landing on a road in front of it.
So do we teach every car how to handle an aircraft landing on the road in front of it, given that any particular car is probably only likely to meet that situation once in its entire lifetime? So becoming imaginative, as you say, is great, but we have a limited amount of memory that we can give to these systems in the car to understand things. Have you met a child coming down a hill toward the road on a skateboard when you've been driving?

Probably not. I mean, I've seen kids on rollerblades, which is even less identifiable. And I live on a hill, so yes, they get going pretty fast.
You mentioned, as far as the creativity part, in your book,
you mentioned that there are some standards that are starting to ask
some of the questions we should be thinking about.
Like, what happens if more happens?
Do you remember? Do you recall what I'm talking about?
Yeah.
There is an ISO standard for doing a hazard risk analysis.
I must admit that initially it was pointed out to me by one of our assessors,
and I thought it was pretty useless.
But we applied it because our assessor told us to, and we found it's fairly useful. I can look up its number; I can't remember it off the top of my head. But yes, what it does is it structures the brainstorming. When you're in a room trying to identify hazards and risks, you are brainstorming: well, what could go wrong with this? Well, maybe a child could come down a hill on a skateboard, or maybe this, or maybe this.
And what the standard does is it gives you keywords,
specific keywords like less, fewer, none.
So what would happen if there were no memory available on this system?
It's basically a structured way of doing a brainstorming, so we use that quite extensively now to do exactly that. Take a keyword such as none, too early, too late: what if the camera system gives this input too late or too early, and things like that? But it is only really a way of structuring a brainstorming session.
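For illustration only, a minimal Python sketch of that kind of structured brainstorming; the guide words and system functions below are invented, not taken from the ISO standard Chris mentions:

```python
# Guide words in the spirit of the structured brainstorming described above
# (the specific words and functions are invented for illustration).
GUIDE_WORDS = ["none", "less", "more", "too early", "too late", "other than"]

FUNCTIONS = [
    "the camera system delivers an obstacle position",
    "memory is allocated for the event queue",
    "the decision system applies the brakes",
]

def brainstorming_prompts(functions, guide_words):
    """Yield one question per (function, guide word) pair to structure the session."""
    for fn in functions:
        for gw in guide_words:
            yield f"What happens if {fn}: {gw}?"

for prompt in brainstorming_prompts(FUNCTIONS, GUIDE_WORDS):
    print(prompt)
```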
I always like the question, let's assume that something catastrophically failed.
What was it?
Yes.
That sort of backwards looking.
But this creativity is important for figuring out a diverse risk analysis.
But there's so much paperwork.
I mean, I'm happy to be creative, and I've done the paperwork for FDA and FAA,
but that paperwork is kind of, let's just say boring to write.
Why do we have to do it?
Right. I think a lot of it is, well, a lot of it can be semi-automatically generated.
I think that's one of the points to be made.
Producing the paperwork doesn't actually make your system any better, as you appreciate.
I'm a pilot. I own a small aircraft. And for example, the landing light bulb
is just a standard bulb that I could go down to the local car shop and buy a replacement for.
But I'm not allowed to. Even though it has the same type number and the same
this and the other as a car bulb, I'm not allowed to buy that. I have to buy an aviation grade one
that comes with all of the paperwork. Where was it built? When was it done? Who sent it to whom?
Who sold it to whom? And of course, that bulb is five times the cost of going down the local shop and buying one.
But it comes with that paperwork.
And a lot of that paperwork can be generated, as I say, semi-automatically.
This thing that I keep referring to, the safety case that we produce during the development and deliver at the end, the thing that tries to justify that we are adequately safe, we have templates for that.
And we would expect someone to apply that particular template and do it
rather than producing the paperwork from scratch every time
Now, I go sometimes as a consultant into a startup company, particularly, you know, a medical startup that's been spun out from a university. The university people know all about the medical side of it. They know nothing about the software side of it, and they've got some student-produced software that was written
by some students three years ago who've now disappeared.
And there, yes, there is a lot of back paperwork to be done.
But in general, once you're onto the system,
it should be semi-automatic, that paperwork.
But every standard is different. I mean, I remember the DO-178B and the FDA documentation had different names for everything. You mentioned risk and hazards mean different things in different standards. Are we getting to the point where everybody's starting to agree, and I wouldn't have to... I mean, you work for QNX. It's a real-time operating system, so you have to do both, don't you?

Yep. We're used in railway systems, we're used in industrial systems, we're used in autonomous cars, we're used in aircraft systems, medical systems. It is awkward. There is a group in the UK at the moment, in the Safety-Critical Systems Club, that is trying to put together a standardized nomenclature. My colleague Dave Bannum is part of that group. I honestly don't hold out much hope for what they're doing. My feeling is that it's happened before, where we've had 10 standards on something and we then tried to consolidate them into one, and we ended up with 11 standards.

Yeah, that seems to be the typical way of it going.

But yeah, Dave and his group are really trying to produce a common nomenclature and vocabulary for use. But no, each standard at the moment is different and uses the terms differently. It's annoying, let's say.

So going back to risk analysis, how do we determine if a failure is going to happen?
How do we put a number on the probability of something going wrong?
Right. So remember when we talk about this that the study I mentioned right at the beginning argues that only something like seven percent of dangerous situations occur because something failed. SOTIF is the other side and is supposed to handle the other 93 percent. But you're right, almost all of the standards that we are dealing with at the moment assume failure, so we have to assess failure. So a standard like IEC 61508, which is the base standard for lots of other standards, assigns a safety integrity level to your product. And the safety integrity level is dependent on your failure rate per hour. So for example, SIL 3, safety integrity level three, means a failure rate of less than 10 to the minus 7 per hour, one in 10 million per hour. So how do you assess that is the question, and the answer is it is not easy. It is obviously a lot easier if you have an existing product. When I first came to QNX 12, 13 years ago,
they had a product with a long history and a good history, which had been very carefully tracked.
So we could look at the hours of use and the number of reported failures.
The problem, of course, is we don't know what percentage of failures were being reported.
You know, our software is almost certainly in your car radio, for example.
But if your car radio stopped working, you would turn it off, turn it on again.
And we didn't get to hear about that failure.
If, in every car in the world that had our software, the radio failed, we would hear about it. But would we count that as one failure or a million failures? So there are problems with that.
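As a rough illustration of the arithmetic involved, a hedged Python sketch; the field hours and failure count are invented, and a real assessment would also need confidence bounds and an honest estimate of how many failures go unreported:

```python
# A naive point estimate of failure rate from field data; all numbers are invented.
FIELD_HOURS = 2.0e9          # assumed cumulative operating hours across the installed base
REPORTED_FAILURES = 12       # assumed count of relevant reported failures

rate_per_hour = REPORTED_FAILURES / FIELD_HOURS
print(f"observed failure rate: {rate_per_hour:.1e} per hour")

# SIL 3 under IEC 61508 is described in the conversation as a failure rate of
# less than 1e-7 per hour; the standard itself defines the full bands.
print("meets the SIL 3 rate" if rate_per_hour < 1e-7 else "does not meet the SIL 3 rate")
```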
The way I've done this is with a Bayesian fault tree. Building up the Bayesian fault tree gives us the opportunity to go in either direction. I mean, you mentioned earlier, the system has failed, what caused it, which is from top to bottom, if you like. The Bayesian fault tree also allows us to go the other way: if this were to fail, what effect would that have on the system? And so you can do sensitivity analyses and things like that. So the place to start, again, I
think is in what ways could this fail. And if we take an operating system, it doesn't matter whether it's Linux or the QNX operating system, we identified that really there are only three ways in which an operating system can fail. An operating system is an event handler. It handles exceptions, it handles interrupts, it handles these sorts of things. So really there's only three ways it can fail. It can miss an event: an event occurs, an interrupt occurs, but because the system is overloaded it doesn't notice it. It can notice the event and handle it incorrectly. Or it can handle the event completely correctly but corrupt its internal state so that it will fail on the next one. And basically it's then a trawl through the logs and failure logs and what have you to find how often those failures are occurring and whether they are reducing or whether they go up at every release.
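To make the combination of those three failure modes concrete, here is a toy classical fault-tree calculation in Python: a plain OR gate over per-hour probabilities assumed independent, not the Bayesian fault tree itself, and all numbers are invented:

```python
# Toy fault tree over the three operating-system failure modes just described.
p_miss_event      = 2e-9   # event occurs but is never noticed (overload)
p_mishandle_event = 5e-9   # event noticed but handled incorrectly
p_corrupt_state   = 1e-9   # event handled correctly but internal state corrupted

def or_gate(*probabilities):
    """Probability that at least one independent basic event occurs."""
    none_occur = 1.0
    for p in probabilities:
        none_occur *= (1.0 - p)
    return 1.0 - none_occur

p_failure = or_gate(p_miss_event, p_mishandle_event, p_corrupt_state)
print(f"estimated probability of an OS failure per hour: {p_failure:.1e}")
```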
My colleague Wachau and I presented a paper at a conference last year
where we were applying a Bayesian network on that to see
if we could predict the number of bugs that would appear in the field based on some work done by
Fenton and Neil. It was an interesting piece of work that we did there, I think. But yes, it is not a trivial exercise, particularly as a lot of the standards believe that software does not fail randomly, which of course it does.
I have definitely blamed cosmic rays and ground loops and random numbers for some of my errors, I'm sure.
But I have a question from one of our audience members, Timon. How do you ensure safe operation with an acceptable probability in a system that is not fully
auditable down to the assembly level?
For example, a complex GUI or a machine learning driven algorithm?
Yes, particularly the machine learning algorithm, I think, is a really good example there. I mean, we all know examples of machine learning systems that have learned the wrong thing. Here's one I know for certain, because I know the people involved. There was a car company in southern Germany.
They built a track, a test track, for their autonomous vehicles.
And their autonomous vehicles learned to go around the test track perfectly,
absolutely perfectly, great.
They then built an identical test track elsewhere in Germany,
and the cars couldn't go around it at all, although the test track was identical. And what they found when they did the investigation was that the car on the first test track had not learned the track; it had learned the advertising hoardings. So, you know, turn right when it sees an advert for Coca-Cola, turn left there. The track was identical in the second case, but the adverts weren't. So there was a system that could have been deployed. It was working absolutely perfectly, yet it had learned completely the wrong things.

And yeah, this question that you have here is, to some extent, impossible to answer, because we have these accidental
systems. The systems that we're building are so complex that we cannot understand them.
There's a term here, intellectual debt, which Jonathan Zittrain produced.
An example of that, nothing to do with software,
is that we've been using aspirin for pain relief,
apparently since 18-something.
We finally understood how aspirin worked in 1965 or something,
somewhere around there.
So for 100 years, we were using aspirin without actually understanding how it worked.
We knew it worked, but we didn't know how it worked.
The thing is the same with these systems that we're building with machine learning.
They seem to work, but we don't know how they work.
Now, why is that dangerous?
Well, it's dangerous with aspirin because in that intervening period where we were using it but didn't know how it worked, how could we have predicted how it would work and interact with some other drug?
With our machine learning system, yes, it appears to work. It appears to work really well. But how can we anticipate how it will work with another machine learning subsystem when we put these two together?
And this is a problem called intellectual debt.
It's not my term.
It's Zittrain's term.
But we are facing a large problem, and machine learning is a significant part of that.
But, yeah, we're never going to be able to analyze software down to the hardware level.
And, you know, the techniques that we've used in the past to try to verify the correctness of our software,
like testing, dynamic testing, you know, now are becoming increasingly ineffective.
Testing software these days does not actually cover much.
I call it digital homeopathy.
But machine learning is in everything.
I mean, I totally agree with you.
I've done autonomous cars, autonomous vehicles,
and it will learn whatever you tell it to, and it's not what you intended, usually.

Yeah. So now when you take that software and combine it with another machine learning system, which you also don't understand fully, to anticipate how those two will interact becomes very difficult.

I was a little surprised in your book that you had Markov modeling
for that very reason, that it is not an auditable heuristic.
How do you use Markov modeling?
Yeah, so we use Markov modeling largely because,
no, I shouldn't say that. The standards, IEC 61508, ISO 26262, EN 50128 for the railways,
they are prescriptive standards, as I said earlier,
and they give methods and techniques which are either not recommended
or are recommended or highly recommended.
And if a technique is highly recommended in the standard and you don't do it,
then you've got to justify why you don't do it,
and you've got to demonstrate why what you do do is better.
So in a lot of cases, it's simply easier to do it.
We faced this at QNX for a number of years.
There was a particular technique which we thought was not useful. We justified the fact that it was not useful. In the end, it got
too awkward to carry on arguing that it was not useful as a technique. So we hired a bunch of
students, we locked them in a room, and got them to do it. It was stupid, but it just took it off our back, so that next time we went for certification we could say, yes, we do it, tick, give us a tick in the box, please. And there's a lot of things in that category. I have used Markov modeling for various things, for anomaly detection here and there, but really the systems we're using these days are not sufficiently deterministic to make Markov modeling particularly useful. You know, the processors we're running on, the SoCs we're running on, come with 20 pages or more of errata which make them completely non-deterministic. There's a phrase I heard at a recent conference: there is no such thing as deterministic software running on a modern processor. And I think that's a correct statement. So yeah, I would not push Markov modeling.
It's in my book because the standards require it.
Maybe it won't be in a third edition.
The standards require it?
What kind of standards?
I mean, which standards?
How?
Why?
I mean, they can't disallow machine learning because it's not auditable
and then say, oh, but Markov modeling is totally reasonable. The difference is minor.

Yeah, the problem here is these standards, as I said, are out of date. I mean, the last edition of IEC 61508 came out in 2010. There may be a new version coming out this year; we're expecting a new version this year. So that's 12 years between issues of the standard. And, I'm not sure how to say this politely,
a lot of people working on the standards are people who did their software development even longer ago. So they are used to a single-threaded, single-core,
run-to-completion executive-type model of a processor and of a program.
And so they are prescribing techniques which really are not applicable anymore.
And I suspect Markov modeling is one of those.
This is where I think this move towards goal-based standards
like UL4600 is so useful.
I don't care what techniques you use.
I don't care what languages you use.
I don't care what this and the other.
Demonstrate to me now that your system as built is adequately safe.
And I think that's a much better way of doing stuff.
It makes it harder for everybody.
It makes it harder for the assessor because the assessor hasn't got a checklist.
Did you do this? Yep. Did you do this? Yep. Did you do this?
Did you not do that?
It makes it harder for the developers, people like ourselves, because we don't have the checklist to say, well, we have to do this. But it is a much more honest approach: demonstrate to me with a safety case that your system is adequately safe.

Are there tools for putting those sorts of things together? Are there tools to ensure you tick all the boxes, or is that all in the future?

There is, again, we can talk about the two sides. The tick-box exercise, yep, there are various checklists that you can download. IEC 61508 has a checklist and what have you. The better approach, the other one I talk about,
the safety case approach.
I think we did the safety case approach for the FAA.
I think the DO-178B is more in that line,
where we basically had to define everything
and then prove we'd defined it.
Is that what you mean by a safety case or a goal
driven? So the idea here is that we put together an argument for why our system is sufficiently
safe. Now there's a number of notations for doing this. This one is called goal structuring notation
and I'm using a tool called Socrates. There's a number of tools. This is Socrates tool. So what I can do here is
we claim, this is our top level claim, we are claiming that our system, that our product, sorry,
is sufficiently safe for deployment. And we're basing that claim on the fact that we have
identified the hazards and risks, we have adequately mitigated them,
that we have provided the customer with sufficient information that the customer can use the product safely,
and the fact that we develop the product in accordance with some controlled process.
Now, we're only claiming the product is sufficiently safe if it is used as specified in our safety manual.
And then, of course, we can go down.
So what do we say?
Customer can use the product safely.
Jump to the subtree here.
Zoom in.
So what I'm claiming here is that if in 25 years' time the customer comes back with a bug,
we can reproduce that problem.
We have adequate customer support,
documentation is adequate, and so on.
So the idea here is two things. First of all, what UL 4600 and the more modern standards require is this argument based on evidence. And the first thing to do is you must put the argument together before you start to look for evidence.
Otherwise, you get confirmation bias. There's a little experiment I do on confirmation bias.
You've probably done this exercise yourself, but it's one I do with people a lot. I say to them, on the next slide (I'm doing a slide presentation, let's say), I've written a rule for generating the next number in this sequence. You're allowed to guess numbers in order to discover the rule. Now, I'm not going to ask you to do this here, because I don't want you to look like an idiot at the end of it.
No, I read your book. I know what to guess.
Okay, great. So what happens is people guess 12, 14, 16, and I say, great, yep, those numbers all work.

On your slide, you have 2, 4, 6, 8, and 10. Okay.

So you must now guess what the rule is for generating the next number in this sequence, and to do so you can guess numbers. Typically people guess 12, 14, 16, and I will say, yep, they work. So what's the rule? They say, well, it's even numbers. I say, nope, not even numbers, guess a few more. And they go 18, 20, 22. What's the rule? Well, plus two. No, it's not plus two. And this goes on for some time. I've been up all the way to 40 and 44 with some customers, until somebody guesses 137 just to be awkward, and I say, yep, that works. And that then leads us to what the rule is: each number must be larger than the previous one.
the problem that this identifies is what's called confirmation bias at which we're all as human beings subject to. If you think you know the answer,
you only look for evidence that supports your belief. If you believe that this is even numbers,
you only look for evidence that it is even numbers. This was identified by Francis Bacon
back in the 17th century. It was rediscovered, if you like, fairly recently.
And we applied this to some of our safety cases
and we started finding all sorts of additional bugs.
Instead of asking people, produce me an argument to demonstrate that this system is safe... Now, if you ask someone to do that, what sort of evidence are they going to look for? They're going to look for evidence that the system is safe. So we said, look for evidence that the system is not safe. And then we will try to
eliminate those. By doing that, we found an additional 25 or so problems that we had never noticed in our safety cases previously.
So we took that to the standards bodies that produce the standards for this goal structuring
notation, and now that doubt has been added to the standard. So basically, the idea is we put together an argument. We argue the customer can use the product safely.
We argue that we have identified the hazards and risks,
that we have done this, and we take that to all the stakeholders
and say, if I had the evidence for this argument,
would that convince you?
And typically they'll say, it's good, but we'd also like this and we'd also like that, and you can build that in. So we build the argument, and only then do we go and look for evidence. So here, for example, we come through to "the residual risks are acceptable," and the subclaim of that is that there is a plan in place to trace the residual risks during customer use of the product. If the customer uses our product in an environment we did not expect and there are new risks, then we have a plan in place to trace those. So now please show me your evidence that that is true. So first of
all, we put together the argument, we agree the argument structure, and only then do we go to look
for the evidence. And what we'd like to be able to do is put doubt on that. So okay, you've got a plan,
but has that plan ever been used? Has that plan actually been approved? Do your engineers actually know about that plan? You know, put all of the doubt you possibly can into this safety case, and I think then we have a justification for saying this product is sufficiently safe for deployment. And it's nothing to do with, as you were saying, nothing to do with the fact that I used this technique or I used Markov modeling or I did this or the other.
It is the argument that says why I think my product is sufficiently safe.
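As a toy illustration of arguing first and then attaching doubt, here is a small Python sketch of a claim tree; it is not the Socrates tool or goal structuring notation syntax, and the claims and doubts are paraphrased for illustration:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Claim:
    text: str
    subclaims: List["Claim"] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)
    doubts: List[str] = field(default_factory=list)   # reasons the claim might be false

def open_doubts(claim: Claim, path: str = ""):
    """Walk the argument and report every doubt that still has to be eliminated."""
    here = f"{path} / {claim.text}".strip(" /")
    for doubt in claim.doubts:
        yield f"{here}: {doubt}"
    for sub in claim.subclaims:
        yield from open_doubts(sub, here)

top = Claim(
    "The product is sufficiently safe for deployment",
    subclaims=[
        Claim("Hazards and risks have been identified and adequately mitigated",
              evidence=["hazard and risk analysis report"],
              doubts=["did the analysis consider SOTIF, or only failures?"]),
        Claim("Residual risks are traced during customer use",
              evidence=["field monitoring plan"],
              doubts=["has the plan ever actually been exercised?"]),
    ],
)

for line in open_doubts(top):
    print(line)
```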
In your book, you had a disturbing section with a lawyer talking about liability for engineers.
Oh, yes.
I don't know whether I mentioned the anecdote with Ott Nortland.
Possibly I did.
For those who haven't read the book, I was at a conference a few years ago,
a safety conference, and we were all standing around chatting as you do
and one of the people there was Ott Nortland, who's well known in the safety area. And he said something that hit me like a brick. He said he had a friend who is a lawyer.
that lawyer often takes on cases for engineers who are being prosecuted
because their system has hurt somebody or the environment.
And he said that his friend could typically get the engineer off
and found innocent if the case came to court.
But often the case does not come to court
because the engineer has committed suicide
before the case reaches court.
I'm not sure if that's the anecdote you were thinking of, Elecia,
but as you can imagine, it stopped the conversation
as we all sort of started to think about the implications of this
but it does make you realize that the work we're doing here is real,
that people do get hurt by bad software and the environment does get hurt by bad software. So, yeah, it is real.
And there's a lot of moral questions to be asked
as well as technical questions.
And as somebody who has been in the ICU and surrounded by devices,
I want that documentation to have been done.
This risk analysis, it can be very tedious.
For all that we're saying,
there's a creativity aspect to it. But all of this documentation, the goal,
well, the standards prescribe what you're supposed to do. The goal is to make sure you think things
through. Yes, the standards can be used. There's various ways of looking at the standards.
I was in the ICU earlier this year.
I broke my wrist on the ice, and I was horrified to see,
although I knew about it intellectually, I was horrified to see it practically,
the number of Wi-Fi, Bluetooth connections coming from these devices
that were all around me.
Was it designed to work together in that way?
You know, those systems? I don't know.
But, you know, there's different ways to look at the standards.
I don't like prescriptive standards, as I probably indicated during the course of this.
However, the prescriptive standards do give guidelines for a number of types of people.
As I say, that startup, that spin out from a university that has had no product experience,
basically, they really could use those standards not as must do this, must do this, must do that, but as a guideline on how to build a system. And I think, no question there,
these are good guidelines in general. They may be a little out of date, but they're good guidelines.
They're certainly better than nothing. The other way of looking at these safety standards is that, although each one says on the cover that this is a functional safety standard, building a safe product is trivially easy. The car that doesn't move is safe. The train that doesn't move is safe. The aircraft that doesn't take off is safe. As soon as we make the product useful, we make it less safe. As soon as we let the car move, it becomes less safe. So I like to think of these standards sometimes as usefulness standards: they allow us to make a safe system useful. And I think if you approach them in that manner,
then it answers, I think, in part,
your concern about your devices in your intensive care unit
and what have you, how they can be used.
But yes, certainly some level of confidence,
like that safety case I spoke about,
the product is sufficiently safe for deployment in a hospital environment
with other pieces of equipment, with Wi-Fi's and Bluetooth's around it,
and used by staff who are semi-trained and are in a hurry and are tired at the end of a long day.
That should be documented and demonstrated.
Yeah, I'd agree wholeheartedly.
And documented and demonstrated because we want engineers
and managers to think about that case.
Yes.
But this, again, comes back to what you were saying earlier about imagination.
My wife often says that when she looks over my shoulder at some of these things,
you need someone who is not a software engineer to be thinking up the cases where this could be deployed,
it could be used, because you engineers, Chris, you are not sufficiently imaginative.
It's not something that engineers do, is be imaginative in that way.
And so, yeah, it is a problem.
But ultimately, we are not going to be able to foresee
and take into account every situation.
But certainly, if you look at those medical devices in the hospital, you'll find most of them have some sort of keypad on them so that the attendant or nurse or doctor can type in a dose or something. If you look at them, you'll find half of the keypads are one way up, like a telephone, and the others are the other way up, like a calculator. Either zero is at the top or zero is at the bottom. Now, in the course of a day, a nurse will probably have to handle 20 of these devices, all of which are laid out differently, with different keypad layouts and all that sort of stuff.

That should have been standardized. That is setting you up to make a mistake. It's not that we're making things safe that way; we're actually designing them in a way that will cause a problem. In the name of intellectual property and lack of standards?
Because following a standard is a pain.
I mean, it's not as much fun as designing software from the seat of their pants.
No, it is much more fun to sit down and start coding.
Yeah, I agree wholeheartedly.
I am a programmer.
I work all day in Ada and Python and C.
And, yeah, it is much more fun and much less efficient to sit down and start coding.
I have a question from a viewer. A viewer, Rodrigo DM, asked, how do you develop fault-tolerant systems with hardware that is not dedicated to safety-critical design?
For example, an ARM M0+.
Yeah.
So, as I said, the hardware is always going to be a problem.
The only way, really, you can do that, if what you're looking for is a fault-tolerant system, is duplication or replication, preferably with diversification. This is painful because, of course, it costs more money, because you're going to put two of them in.
So you've got to say either I am going to have sufficient failure detection
that I can stop the machine and keep it safe,
or I'm going to have replication.
I'm going to put two processors in or whatever. I had a customer a while back who was taking replication
or diversification to an extreme.
They were going to put one ARM processor, one x86 processor,
one processor running Linux, one processor running Wind River,
one processor running QNX, and so on.
And I asked the question,
why are you diversifying the hardware? And the answer was, well, because the hardware has bugs.
We know that. There's 20 pages of errata. And I said, well, yeah, but these are Heisenbugs. These are not Bohrbugs. These are random bugs. The last time I can remember a Bohrbug, a solid bug in a processor,
was that x86 Pentium processor that couldn't divide.
If you remember the Pentium bug back in 1994, 1995.
If the processor is going to fail, it is likely to fail randomly, in which case two processors of
the same type are almost certainly going to be sufficient. Or even one processor with something
running something like virtual synchrony, which would detect the fact that the hardware error has
occurred, and then take the appropriate action, which may be simply running that piece of software again.
And I know there's a couple of companies,
particularly in southern Germany, using coded processing
to do the safety-critical computation, so that you can check implicitly whether the computation has been done correctly. And the argument is you can have as many hardware
problems as you like, which don't affect my safety. I don't care. Bit flip in memory is
going to occur every hour, but that's fine if it doesn't affect my computation. So if I use
something like coded processing, where I can check that the computation was done correctly to within one in a million, say, then I don't care about those hardware problems.
But again, you've got to justify that.
And that is the way I've seen one of our customers do it: coded processing on non-certified hardware.
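A hedged sketch of the basic idea behind coded processing, using a simple AN code in Python; real schemes (AN/ANB/ANBD codes) also add signatures and timestamps, and the check constant here is chosen arbitrarily:

```python
# Core idea of an AN code: represent every value x as A * x for a fixed constant A.
# Addition and subtraction carry over to the coded domain, and a random hardware
# bit flip almost never produces another multiple of A, so it can be detected.
A = 641

def encode(x: int) -> int:
    return A * x

def decode(cx: int) -> int:
    if cx % A != 0:
        raise RuntimeError("coded value corrupted: possible hardware fault")
    return cx // A

def checked_add(cx: int, cy: int) -> int:
    cz = cx + cy
    if cz % A != 0:
        raise RuntimeError("coded addition failed its check")
    return cz

# Normal operation: 3 + 4 computed entirely on coded values.
result = decode(checked_add(encode(3), encode(4)))   # 7

# A simulated single bit flip in one operand is caught by the check.
try:
    checked_add(encode(3) ^ 0x10, encode(4))
except RuntimeError as err:
    print("detected:", err)
print("result:", result)
```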
And do you, back to Timon, do you have tools for doing risk analysis?
Are there specific things you use?
I don't know a good tool.
If anybody has one, then please let me know. Over the years, we have built tools to allow us to do this, but they're internal Python scripts to do this and the other. No, I don't know of a good tool for doing risk analysis. Sorry about that.

I'm kind of sad.

Yeah, that's right. Part of the goal of much of the documentation involves traceability,
where, as you were showing, you have a safe product,
and that breaks into multiple things.
That includes the safety manual, which breaks into multiple things.
Do you have a tool for that?
For the traceability? Yeah, this is something that most of the processes demand. ASPICE, for example, demands this; CMMI does. The tracing of a requirement to the design, to the low level, to the code, to the test cases, all the verification
evidence that you have for that. What we have found there, this is going to sound silly,
this is going to sound really silly, but we use LaTeX for all of our document preparation.
The beauty of LaTeX is that it's textual. It is just ASCII text. So basically,
we can embed comments in our documents. And then at the end of the product, when we have to produce
the complete table of traceability, we simply run a Python script over those documents, it reads those structured comments, and it produces the document automatically.
So if, on the other hand, we were using some proprietary documentation tool like Microsoft's Word or something like that, I don't believe we could do that, and I'm not sure how you would do it. You'd have to keep that as a separate document manually.
But the nice thing about LaTeX is just ASCII text.
You can run a Python script on it.
You can pull out all this stuff and produce these really, really,
really boring documents that tell you that safety requirement number 1234 is in paragraph 3.2.7.4 of the design documentation,
and it relates to lines 47 to 92 of this particular C module, and so on, because all of that is just in ASCII text. So that's the way we've done it.
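A minimal sketch of that kind of extraction script, assuming a made-up comment convention; the actual markers and output format used internally will differ:

```python
import re
from pathlib import Path

# Hypothetical marker format inside the LaTeX sources, for example:
#   % TRACE SR-1234 design.tex 3.2.7.4
TRACE = re.compile(r"^%\s*TRACE\s+(?P<req>\S+)\s+(?P<target>.+)$")

def collect_traces(root: str):
    """Scan every .tex file under root and return (requirement, location, target) rows."""
    rows = []
    for tex in Path(root).rglob("*.tex"):
        for lineno, line in enumerate(tex.read_text(encoding="utf-8").splitlines(), 1):
            match = TRACE.match(line.strip())
            if match:
                rows.append((match["req"], f"{tex.name}:{lineno}", match["target"]))
    return rows

if __name__ == "__main__":
    # Print a boring but complete traceability table, one row per structured comment.
    for req, where, target in sorted(collect_traces("docs")):
        print(f"{req}\t{where}\t{target}")
```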
Well, I meant to only keep you for about an hour, and we've gone a bit over.
One more question, about languages. You mentioned Ada, which is a language that has the idea of contracts and provability, and you mentioned C, which has the reputation for being the wild, wild west of unsafety.

Yes.

Which languages do you like? Which languages should we be using? And how do we figure that out?
Yeah, this is interesting. The standards themselves deliberately recommend not using C. So if you look in IEC 61508, it gives a list of languages that it likes and dislikes, and it specifically says, don't use C. What it does say is, you can use C if you use a subset of C, and that subset is checked by some form of static analysis. So for example, you might take the MISRA subset of C, and you might use Coverity or Klocwork or something like that to check that you are using that. To be fair, I find that there's a whole load of people out there selling products to try to make C better, to check that you're not doing this in C and you're not doing that. And I feel that we are getting to the point where we've got
to stop putting lipstick on the C pig and go elsewhere. Now, where do you
go elsewhere? Now, that's a good question. I just put up on the screen a list of some of the things
that I feel you could discuss about what you need in languages. Ada and Spark Ada in particular,
yes, we have the formal proving, and we have a customer I'm working with at the moment who is using SPARK Ada, and that's great. The other one that's on the horizon at the moment, well, D was on the horizon for a while, but Rust seems to be coming on the horizon. I have a bad experience with Rust. I teach a course, and
this is one of the slides and a few years ago I wrote a Rust version of a very bad C program
that I have that has a race condition, and Rust would not compile it.
That was great.
That was exactly what I wanted.
Rust refused to compile this badly structured program.
A couple of months later I was giving that course again,
and I said, look, watch, I'm going to compile this with Rust and show you how it doesn't accept it. It accepted it. The compiler had been changed. I repeated the same thing about six months later, and this time Rust gave me a warning message with a different thing. That's the problem with Rust: it's not yet stable. And it actually says, as I'm sure you're aware, in the Rust documentation, that this is the best documentation we have. It is not yet really suitable. It's not stable.
So basically you're missing a long history for the compiler and linker
and a stable history of the product.
And, yeah, there's a lot of other things you can talk about on languages.
As I say, I write a lot of Ada, and particularly Spark Ada,
and we're working closely with AdaCore on that.
But AdaCore is now, of course, supporting Rust as well,
and I think Rust may be the future eventually once it stabilizes.
Once it stabilizes.
That's always been my caveat as well.
Yeah.
And you said at the top that there are plenty of opportunities in this area. If someone wants to work in safety-critical systems, one, how do they get into it, and two, what skills do they need to develop first?

That's a really good question. Yes, it is a growth area. And what's more, it is
an interesting area because of all of the things we've been discussing today,
the language support and all of these things,
the accidental systems, how we handle accidental systems,
whether we should be looking at SOTIF
or whether we should be looking at failure.
There's a lot of research going on,
a lot of interesting stuff going on.
The problem is that it's basically an old man's game.
And, yeah, when I go to conferences, which I do quite regularly,
I think I probably lower the average age of the people by attending,
which is really worrying.
Yeah.
And most of the people giving the presentations, most of the people at the conferences, are men, and I think that's got to change. There was a really useful thing at the Safety-Critical Systems Congress last year where a young woman stood up and gave a presentation on how they're intending to make this more inclusive, but it hasn't happened yet. The trouble is education. I was giving an IEEE chat some years ago now, and I had an audience full of academics. And I said, okay, which of you teach some form of computing? And most of them put their hands up. Okay, which of you teach embedded computing? And a few put their hands up. How many of you teach anything to do with safety-critical embedded programming? And there's one university, the University of Waterloo: a chap from there put his hand up. So this is not being taught in the universities, and therefore it is coming up the hard way. And so I think the way to do it is you've just got to get in. We are looking for people at the moment. Everybody is looking for people. I think for the skills, there's three levels of skill that people need. There's skill in software engineering in general.
There is skill in the particular vertical area,
whether that's railway trains or medical devices or whatever.
And there is then skill in the safety-critical stuff.
And I think any company that's looking for people
is going to be looking for at least two of
those. You're not going to get all three. And so, yeah, you can read books like mine; it's not going to really help that much. You've got to go out and do it. So I think becoming familiar with the embedded software world, as Elecia teaches and what have you,
and then becoming familiar with a vertical market,
whether that's aviation or whether that's autonomous cars
or something like that, and then go and just apply.
And do you have any thoughts you'd like to leave us with?
Well, I think to some extent,
it was what I've just said, but I think it's worth just repeating. This is a growth area.
This is an exciting area. There's lots of research going on, on digital clones and all sorts of
things that's going on at the moment.
This is an area where we need young people who are going to take it to the next level.
And so let people like myself retire even, get out of the industry and stop lowering the average age of conferences.
Yes.
Yeah.
Our speaker has been Chris Hobbs, author of Embedded Software Development for Safety
Critical Systems.
Chris, thank you for being with us.
Well, as I say, thank you for the invitation.
I enjoyed myself.
And if there are any further questions and if they can be made available in some way,
I'm very happy to try to address them, of course.
All right. We will figure that out.
I'd like to thank Felipe and Jason from Classpert for making this happen.
And our Making Embedded Systems class mentor, Aaron, for putting up some of those helpful links of the standards we talked about.
I'm Elecia White, an instructor for Classpert,
teaching the course Making Embedded Systems.
And I have a podcast called Embedded FM,
where we will probably be hearing this interview.
Thank you so much for joining us, and we hope you have a good day.