Embedded - 421: Paint the Iceberg Yellow

Episode Date: July 21, 2022

Chris Hobbs talks with Elecia about safety-critical systems. Safety-critical systems keep humans alive. Writing software for these embedded systems carries a heavy responsibility. Engineers need to understand how to make code fail safely and how to reduce risks through good design and careful development.

The book discussed was Embedded Software Development for Safety-Critical Systems by Chris Hobbs. This discussion was originally for Classpert (where Elecia is teaching her Making Embedded Systems course) and the video is on Classpert’s YouTube if you want to see faces.

There were many terms with letters and numbers; here is a guide:

IEC 61508: Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems; relates to industrial systems and forms the foundation for many other standards
ISO 26262: Road Vehicles - Functional Safety; extends and specializes IEC 61508 for systems within cars
IEC 62304:2006: Medical device software — Software life cycle processes; specifies life cycle requirements for the development of medical software and software within medical devices. It has been adopted as a national standard in many countries and can therefore be used as a benchmark to comply with regulatory requirements.
MISRA C: a set of software development guidelines for the C programming language
DO-178C and DO-178B: Software Considerations in Airborne Systems and Equipment Certification; the primary documents by which certification authorities such as the FAA, EASA, and Transport Canada approve all commercial software-based aerospace systems
ISO/IEC 29119: Software and systems engineering — Software testing
ISO 14971:2019: Medical devices — Application of risk management to medical devices

Transcript
Starting point is 00:00:00 Welcome to Embedded. I am Elecia White, and this week we have something a bit different for you. I got to sit down with Chris Hobbs, author of Embedded Software Development for Safety-Critical Systems, as part of my Classpert lecture series. I have a quick warning for you. Around minute 55, we start talking about lawyers and self-harm. If that's a topic that would bother you, please feel free to skip it as soon as you hear the word lawyer. In the meantime, I hope you enjoy our conversation. Welcome. I'm Elecia White, and I'm going to be talking with Chris Hobbs, author of Embedded Software Development for Safety-Critical Systems. Chris, thank you for talking with me today.
Starting point is 00:01:00 Thank you for inviting me. Could you tell us about yourself as if we met at an Embedded Systems Conference? Okay. Yes. As you say, I'm Chris. I work for BlackBerry QNX. I work in our kernel development group. That is the group that works on the heart of the operating system, if you like. The operating system itself is certified for use in safety-critical systems, and I have particular responsibility for that side of things, ensuring that what we produce meets the safety requirements that are placed on it. What exactly is a safety critical system? Well I noticed on the introduction slide you had there that you spoke about software and aircraft,
Starting point is 00:01:53 software and nuclear reactors, software and pacemakers and yes those are all safety critical. More prosaically we also have safety critical software in Coca-Cola machines these days. It came as a bit of a surprise to me as well, actually, when I first met this one. But in the olden days, Coca-Cola machines used to dispense cans or tins of drink, and that was straightforward. Now, the more modern ones actually mix your drink. So you could choose, I want diet Coke, cherry flavor, something, something. And the software mixes your drink for you. One of the ingredients is caffeine.
Starting point is 00:02:33 Caffeine is poisonous. So if the software goes wrong and puts too much caffeine in your drink, it'll poison you. So suddenly Coca-Cola machines and soft drinks machines in general become safety critical. And so, yes, what used to be a relatively small field with railway systems, nuclear power stations and what have you, is now expanding. The other one I've worked on recently is these little robot vacuum cleaners that you run around your floor after. When you go to bed, you leave the robot vacuum cleanerers that you run around your floor after when you go to bed, you leave the robot vacuum cleaner running around. If they reach the top of the stairs, they're supposed to stop. And that's a software controlled thing. If they don't stop and they go over the stairs,
Starting point is 00:03:15 then of course they could kill some child or something sitting at the bottom of the stairs. So suddenly robot vacuum cleaners are also safety critical. So anything that could potentially damage a human being or the environment, we consider to be a safety critical system. And it is, as I say, a growth area at the moment. Some of those sound like normal embedded systems. I mean, the robot vacuum particularly and the Coke machine. But how are safety-critical systems different from just pedometers and children's toys? Yes, I think that's a big question. And part of the answer, I think, comes down to the culture of the company
Starting point is 00:04:00 that is building the system. The safety culture of a number of companies that are working, particularly in the aviation industry, have been questioned recently, as I'm sure you've realized. And it is this safety culture that underlies what is required to produce a system that is going to be safety critical. And also another difference is lifetime. I mentioned there it's not just human life that we deal with for safety, it's also the environment.
Starting point is 00:04:34 So, for example, at the bottom of the North Sea off the east coast of Britain, there are oil rigs. And buried down at the bottom of the oil rig or the seabed, there are these embedded devices which, if the pressure builds up too much, will actually chop off the pipe that goes up and seal the oil well to prevent an environmental disaster. If that happens, then it costs millions and millions and millions of dollars to get that back again but of course if it happens it's going to happen but replacing the software in that or even upgrading the software in that system is extremely difficult extremely costly so
Starting point is 00:05:21 unlike a child's toy or something like that, you may have software here which is required to work for 30, 40 years without attention. So that's another difference, I think, between it and the toy software. So what do you do differently? Yes, that's an interesting question. One of the international standards, ISO 26262, which is one of the standards for road vehicles, basically, has in its second part, part two there, some examples of what makes a good safety culture in a company and what makes a poor safety culture. The trouble is, of course, we are software people. I mean, my work, I spend my life programming computers.
Starting point is 00:06:12 We are not used to this unmeasurable concept of a safety culture. So all we can look for is examples. And this is a subset. There's a page full of these in in iso 26262 but just to take a couple of examples here um heavy dependence on testing at the end of the development cycle to demonstrate the quality of your product is considered poor safety culture this is also an important one the reward system favors cost and schedule over safety and quality. We've seen that recently in a large U.S. aircraft manufacturer,
Starting point is 00:06:49 which put countdown clocks in conference rooms to remind engineers when their software had to be ready. And, you know, the companies I work for, if you're working on a non-safety critical system, then your bonus each year depends on whether you deliver on time if you're working on a safety critical system your bonus never depends on whether you deliver on time so the reward system favors it that the reward system in a good safety culture will penalize those who take shortcuts so i think the the basis of your the the fundamental answer to your question of what is different in when i'm doing uh safety
Starting point is 00:07:35 work is the safety culture within the company and now how you apply that safety culture to produce good software that's a another question but yes the safety culture is fundamental to the company and to the development organization so if the safety culture i mean why do we even get so many failures we hear about things failing we hear about cars with unintended acceleration and oil rigs failing is it just about not having the right culture it is in part about having the wrong culture and not having the right culture but something else that's been observed fairly recently actually is the concept of sotif, I'm not sure, safety of the intended functionality. Are you familiar with this concept or perhaps I could give a quick description? Only from your book, go ahead. Okay, so the idea here is that a lot of people have a lot of better examples,
Starting point is 00:08:38 but this is the example I give. What was decided, traditionally the way we have looked at a safety-critical system is that a dangerous situation occurs when something fails or something malfunctions. So the idea is this thing failed, therefore something happened and someone got hurt. There was a study done not that long ago, particularly in the medical world, where they discovered that 93% of dangerous situations occurred when nothing failed. Everything worked exactly as it has been designed to work. I've got an example here. I mean, it's one I made up. I can give you a more genuine one if you wish. But let's assume that we're in an autonomous car it's traveling on the road there's a manual car right close behind us okay a child comes down the hill on
Starting point is 00:09:34 a skateboard towards the road okay the camera system will pick that up and give it to the neural network or the Bayesian network or whatever that's doing the recognition. It recognizes that it's a child, 80% probability. Remember, this will never be 100%, but it recognizes that it's going to be a child with 87% probability. It could also be a blowing paper bag that's wandering along the road or it could be a dog but yeah the camera system has correctly recognized that is that this child on the skateboard is a child the analysis system correctly measures its speed as being 15 kilometers an hour great the decision system now rejects the identification as a child because children do not travel at 15 kilometers an hour
Starting point is 00:10:25 unless on bicycles this child we've identified is not on a bicycle no wheels there so it's not a child it is probably the blowing paper bag which we identified earlier and remember that was done correctly as a human i'm like no that's that not true. If you identified it as a child, then there has to be another reason. You can't just ignore the information coming in. So how do we end up in that box? This is the problem. Nobody thought when they were putting the system together of a child on a skateboard. Children only go at 15 kilometers an hour if they're on bicycles.
Starting point is 00:11:07 So we didn't consider that. And I'll come to that in a moment because that is also really important. We didn't consider that situation. So the decision system says either I'm going to hit a paper bag or I'm going to apply the brakes hard and possibly hurt the person in the car behind me. So correctly, it decides not to brake. So the point there is that everything did exactly what it was designed to do. Nothing failed, nothing malfunctioned, nothing went wrong. Every subsystem
Starting point is 00:11:38 did exactly what it should do. But you're right in your indignation there that we forgot that children can travel at 15 kilometers an hour if they are on a skateboard. Now, the concept of an accidental system. There was a study done where they took a ship in the North Sea off the east coast of Britain, a large ship, and they sailed it to an area where they were jamming GPS. They just wanted to see what would happen to its navigation system. That's good. What happened was, of course, the navigation system is that you can find on the internet pictures of where this ship thought it was. It jumped from Ireland to Norway
Starting point is 00:12:25 to here. The ship was jumping around like a rabbit. That was expected. If you jam the GPS, then you expect that you're not going to get accurate navigation. What was not expected was that the radar failed. So they went to the radar manufacturer and said, hey, why did your radar fail just because we jammed the GPS? And he said, no, we don't use GPS. There's no GPS in our radar. They went to the people who made the components for the GPS, sorry, for the radar. And one of them said, oh, yeah, we use GPS for timing. And that's super common. I mean, GPS has a one pulse per second. I've used it in inertial measurement systems. It's just, it's really nice to have.
Starting point is 00:13:14 Yep, absolutely. And the trouble here was that that component manufacturer didn't know that their system was going to go into a radar. The radar manufacturer didn't know that that component was dependent on GPS. So we had what's called an accidental system. We accidentally built a system dependent on GPS. Nobody would have tested it in the case of a GPS failure because no one knew that it was dependent on GPS. The argument runs that a lot of the systems we're building today are so complicated and so complex that we don't know everything about how they work. I understand that with the machine learning, but for your example, somebody routed the wire from the GPS to the component.
Starting point is 00:14:07 Presumably. Or the component had it integrated into it, and therefore they just plugged the component in. So this was an accidental system. And the idea is that we cannot predict all of the behavior of that system. And SOTIF, safety of the intended functionality, everything worked. Nothing failed. Nothing failed at all, but we got a dangerous situation. Everything in that radar worked exactly as it was designed to work,
Starting point is 00:14:38 but we hadn't thought of the total consequences of what it was. There's a lot of examples of the SOTIF. Nancy Leveson gives one which is apparently a genuine example. The US military had a missile at one Air Force base that they needed to take to a different airport space so obviously what you do is you strap it on the bottom of an aircraft and you fly it from one base to the other but that would be a waste of time and a waste of fuel so they decided what they would do was put a dummy missile also on that aircraft and when it got up to altitude it would intercept another u.s aircraft and they would fire the dummy missile at the other aircraft just for practice. I think you can see what's going to happen here.
Starting point is 00:15:31 So, yes, it took off with the two missiles on. It intercepted the other U.S. aircraft. The pilot correctly fired the dummy missile. That caught you. You were thinking otherwise there. But the missile control system was designed so that if you fired missile A, but missile B was in a better position to shoot that aircraft down, it would fire missile B instead.
Starting point is 00:16:02 And in this case, there was an antenna in the way of the dummy missile, so the missile control software decided to fire the genuine missile it destroyed the aircraft the pilot got out don't worry it's not a sad story but again everything worked perfectly the pilot correctly fired the dummy missile the um missile control software did exactly what it was supposed to do. It overrode the pilot and fired the other missile. See, that one makes a lot more sense to me because it was trying to be smart. Yes. And that's where everything went wrong. So much of software is trying to be clever, and that's where everything goes bad.
Starting point is 00:16:43 Yeah, and I think you could make that argument with my example with the child on the skateboard that the system was being clever by saying no that's not a child because it can't be a child if it's traveling at 15 kilometers an hour and it's not on a bicycle so again the software was trying to be smart and failing and it but that one it was trying to be smart in a way that doesn't make sense to me because things change in the future um kids get i don't know magnetic levitation skates and suddenly they're zipping all over but yeah the missile makes more okay so how do we, you mentioned the safety culture, but what about tactics? How do we avoid these things? I mean, I've heard about risk management documentation. that's going to be certified in any way with what we call a hazard and risk analysis. You have to be a bit careful about these terms, hazard and risk, because they differ from standard to standard.
Starting point is 00:17:52 The way I use it is the iceberg is the hazard. It's a passive thing. It's just sitting there. The risk is something active. The ship may run into it. Other standards, on the other hand, would say the hazard is a ship running into the iceberg. So we have to be a bit careful about terminology. But to me, the iceberg is the hazard. The risk is running into it.
Starting point is 00:18:17 And so we do a hazard and risk analysis on the product, and using brainstorming and various, there is an ISO standard on this, identify the hazards associated with the product and identify the risks associated with them. We then have to mitigate those risks, and anything that's mitigating the risk becomes a requirement, a safety requirement on the system. So if you take the iceberg, we may decide to paint the iceberg yellow
Starting point is 00:18:53 to make it more visible. Okay, silly idea. The iceberg, we're going to paint it yellow. So there is now a safety requirement that says the iceberg must be painted yellow. Okay. And there is then still a residual risk that yellow painting icebergs doesn't help at night the fundamental point, because it's that that is defining what the risks are and what we're going to do to mitigate them and what requirements we make. So typically, there will be a requirement. And then at the other end of the development, so that's your development, we're setting up the hazard and risk analysis at the other end one thing we have to to deliver is our justification for why we believe we have
Starting point is 00:19:54 built a sufficiently safe system and there's two things in that of course what is sufficiently safe and how do you demonstrate that you have met that yes so back to the skateboard uh and child the hazard is the child and the risk is the chance of hitting it hitting the child correct if i understand your terminology and yes using my terminology my technology. We might mitigate that by saying anything that we aren't sure about, we're not going to hit. And then the residual risk still is we might break suddenly and thereby hurt the person behind us. Yes, that is the residual risk. If you remember, there was an incident a while back in the US with a woman at night walking across the road pushing a bicycle. She was hit by an autonomous car, an Uber.
Starting point is 00:20:51 The car had initially been programmed in a way you stated, if I do not recognize it, then I will stop. They found it was stopping every sort of 10 minute 10 minutes because it was something there that hadn't been anticipated so they changed it to say we'll stop if we see one of these one of these one of these one of these or one of these and a woman pushing a bicycle was not one of those a woman riding a bicycle would have been okay so I just want to go to their team and say, you can't do this. I want to throw down a heavy book and say, you need somebody on your team who thinks through all of these problems,
Starting point is 00:21:35 who actually has, well, I'm not going to say insulting things, but who has the creativity of thought to consider the risks that clearly that team did not have. How do we get people to that point, to that, I'm not in the box, I want to think about everything that could happen, not just what does happen? What does happen? There's two things that are happening there. One is that last month, actually, there was a SOTIF standard came out with a sort of semi-structured way of considering the safety of the intended functionality.
Starting point is 00:22:17 But also this safety case that I mentioned earlier, the thing that identifies why I believe my system is adequately safe is actually going to try to answer that question. And this is one of the things that's happening to our standards at the moment, that most of the safety, what are called safety standards at the moment that we have are prescriptive they tell you what you must do you must use this technique this technique this technique you must not do this you must not do that the trouble with that is that it takes 10 15 years to bring in a standard and in that time techniques change software world is changing very very rapidly so basically by doing that you are burning in the need to apply 10 15 year old technology which is not good so the series of standards coming out now like UL's 4600, which are what are called goal-based standards, G-O-A-L, goal, as in football goal, goal-based standards,
Starting point is 00:23:31 which say we don't care how you build your system, we don't care what you've done, but demonstrate to me that your system is adequately safe. And that, I think, is where your imagination comes in that sitting down and imagining awkward situations where well what would happen if this were to happen um ul4600 gives a couple of good examples actually for it's for autonomous cars basically and it talks about gives one example of an autonomous car there is a building that's on fire there are fire engines outside there are people milling around lots of people on the road someone has used their cell
Starting point is 00:24:22 phone to call their car their autonomous car has arrived to pick them up now what it's doing of course is it's having to go on the wrong side of the road there are hose fire hoses there are people there are what have you could your autonomous car handle that situation and you're right we need people of imagination to look at these situations. Now, the person who produced UL4600 has also published a number of papers that say that a lot of these incidents that your car may meet during its life are long tail. It is unlikely that any car, any particular car, will meet the situation of an aircraft landing on a road in front of it. But inevitably, over the course of 10 years,
Starting point is 00:25:18 a car somewhere will meet the incident of an aircraft landing on a road in front of it. So do we teach every car how to handle an aircraft landing on the road in front of it. So do we teach every car how to handle an aircraft landing on the road in front of it, given that only one car is probably in its entire lifetime likely to meet that situation? So becoming imaginative, as you say, is great, but we have a limited amount of memory that we can give to these systems in the car to understand. Have you met a child coming down a hill towards your vote on a skateboard when you've been driving? Probably not. I mean, I've seen kids on rollerblades, which is even less identifiable. And I live on a hill, so yes, they get going pretty fast.
Starting point is 00:26:08 You mentioned, as far as the creativity part, in your book, you mentioned that there are some standards that are starting to ask some of the questions we should be thinking about. Like, what happens if more happens? Do you remember? Do you recall what I'm talking about? Yeah. There is an ISO standard for doing a hazard risk analysis. I must admit that initially it was pointed out to me by one of our assessors,
Starting point is 00:26:38 and I thought it was pretty useless. But we applied it because our assessor told us to and find it's fairly useful. I can look up its number i can't remember off the top of my head but yes what it does is it structures the brainstorming so when you're in a room trying to identify hazards and risks you are brainstorming well what could go wrong with this well maybe ah maybe a child could come down a hill on a skateboard or maybe this or maybe this. And what the standard does is it gives you keywords,
Starting point is 00:27:12 specific keywords like less, fewer, none. So what would happen if there were no memory available on this system? It's basically a structured way of doing a brainstorming so we use that quite extensively now to to do exactly that what happens take a keyword such as none too early too late what if the camera system gives this input too late or too early and things like that? But it is only really a way of structuring a brainstorming session. I always like the question, let's assume that something catastrophically failed. What was it?
Starting point is 00:27:56 Yes. That sort of backwards looking. But this creativity is important for figuring out a diverse risk analysis. But there's so much paperwork. I mean, I'm happy to be creative, and I've done the paperwork for FDA and FAA, but that paperwork is kind of, let's just say boring to write. Why do we have to do it? Right. I think a lot of it is, well, a lot of it can be semi-automatically generated.
Starting point is 00:28:36 I think that's one of the points to be made. Producing the paperwork doesn't actually make your system any better, as you appreciate. I'm a pilot. I own a small aircraft. And for example, the landing light bulb is just a standard bulb that I could go down to the local car shop and buy a replacement for. But I'm not allowed to. Even though it has the same type number and the same this and the other as a car bulb, I'm not allowed to buy that. I have to buy an aviation grade one that comes with all of the paperwork. Where was it built? When was it done? Who sent it to whom? Who sold it to whom? And of course, that bulb is five times the cost of going down the local shop and buying one.
Starting point is 00:29:28 But it comes with that paperwork. And a lot of that paperwork can be generated, as I say, semi-automatically. This thing that I keep referring to at the end that we produced during the development, but at the end the safety case that tries to justify that we are adequately safe, we have templates for that. And we would expect someone to apply that particular template and do it rather than producing the paperwork from scratch every time now i go sometimes as a consultant into a startup company particularly you know a medical startup
Starting point is 00:30:11 there's been a spin out from a university the university people know all about the medical side of it they know nothing about the software side of it, and they've got some student-produced software that was written by some students three years ago who've now disappeared. And there, yes, there is a lot of back paperwork to be done. But in general, once you're onto the system, it should be semi-automatic, that paperwork. But every standard is different i mean i remember the geo 178b and the fda documentation had different names for everything you mentioned risk and hazards
Starting point is 00:30:55 mean different things in different standards are we getting to the point where everybody's starting to agree and i i wouldn't have to i mean you work for qnx it's a real-time operating system so you have to do both don't you yep and yep we um we're used in railway systems we're used in industrial systems we use autonomous cars we're used in uh uh aircraft system medical systems um it is awkward there is a group in the uk at the moment um in the safety critical systems club that is trying to put together a standardized nomenclature my colleague dave bannum is part of that group i honestly don't hold out much hope for what they're doing. My feeling is that it's happened before where we've had 10 standards on something
Starting point is 00:31:53 and we then tried to consolidate them into one and we ended up with 11 standards. Yeah, that seems to be the typical way of going but yeah dave and his group are really trying to produce a common nomenclature and vocabulary for use but no each standard at the moment is is different and uses the terms differently it's real it's it's annoying let's say so going back to risk analysis, you have a way, how do we determine if a failure is going to happen? How do we put a number on the probability of something going wrong? Right. So remember when we talk about this, that that study I mentioned right at the beginning argues that only something like seven percent of dangerous situations occur because something failed sotif is the other side and is supposed to handle the other 93 but you're right all almost all of the standards that we
Starting point is 00:32:57 are dealing with at the moment assume failure so we have to assess failure. So a standard like IEC 61508, which is the base standard for lots of other standards, assigns a safety integrity level to your product. And the safety integrity level is dependent on your failure rate per hour so for example sill three safety integrity level three means a failure rate of less than 10 to the minus seven per hour one in 10 million per hour so how do you assess that is the question and the answer is it is not easy it is obviously a lot easier if you have an existing product. When I first came to QNX 12, 13, 13 years ago, they had a product with a long history and a good history which had been very carefully capped. So we could look at the hours of use and the number of reported failures. The problem, of course, is we don't know what percentage of failures were being reported.
Starting point is 00:34:08 You know, our software is almost certainly in your car radio, for example. But if your car radio stopped working, you would turn it off, turn it on again. And we didn't get to hear about that failure. If every car in the world that had our software the radio failed we would hear about it but would we count that as one failure or a million failures so there are problems with that and what the way i've done this is with a bayesian fault tree building up the bayesian fault tree gives us the opportunity to go in either direction. I mean, you mentioned earlier, the system has failed what caused it, which is down from top
Starting point is 00:34:52 to bottom, if you like. The Bayesian fault tree also allows us to go the other way. If this were to fail, what effect would that have on the system? And so you can do sensitivity analyses and things like that so the place to start again i think is in what ways could this fail and if we take an operating system it doesn't matter whether it's linux or qnx operating system the we identified that really there are only three ways in which an operating system can fail. That an operating system is an event handler. It handles exceptions. It handles interrupts.
Starting point is 00:35:34 It handles these sort of things. So really there's only three ways it can fail. It can lose, it can miss an event. An event occurs, an interrupt occurs occurs but because it's overloaded it doesn't get it doesn't notice it it can notice the event and handle it incorrectly or it can handle the event completely correctly but corrupt its internal state so that it will fail on the next one and basically it's then a trawl through the logs and failure logs and what have you to find how often those failures are occurring and whether they are reducing or whether they go up at every release. My colleague Wachau and I presented a paper at a conference last year
Starting point is 00:36:22 where we were applying a Bayesian network on that to see if we could predict the number of bugs that would appear in the field based on some work done by Fenton and Neil it was an interesting piece of work that we did there I think but yes it is not not a trivial exercise particularly as a lot of the standards believe that software does not fail randomly, which of course it does. I have definitely blamed cosmic rays and ground loops and random numbers on some of my errors, I'm sure. But I have a question from one of our audience members, Timon. How do you ensure safe operation with an acceptable probability in a system that is not fully auditable down to the assembly level? For example, a complex GUI or a machine learning driven algorithm?
Starting point is 00:37:18 Yes, particularly the machine learning algorithm, I think, a good really good example there i mean we all know examples of machine learning systems that have learned the wrong thing uh that's a um one i'm i i know for certainty because i know the people involved there was a company in a car company in southern Germany. They built a track, a test track, for their autonomous vehicles. And their autonomous vehicles learned to go around the test track perfectly, absolutely perfectly, great. They then built an identical test track elsewhere in Germany, and the cars couldn't go around it at all although the test track was identical
Starting point is 00:38:06 and what they found when they did the investigation was that this the car and the first test track had not learned the track it had learned the advertising holdings so you know turn right when it sees an advert for coca-cola turn left and the track was identical in the second case but the adverts weren't so there was a system that could have been deployed it was working absolutely perfectly yet it had learned completely the wrong things and yeah this question that you have here is to some some extent, impossible to answer, because we have these accidental systems. The systems that we're building are so complex that we cannot understand them. There's a term here, intellectual debt, which Jonathan Citroen produced.
Starting point is 00:39:07 An example of that, nothing to do with software, is that we've been using aspirin for pain relief, apparently since 18-something. We finally understood how aspirin worked in 1965 or something, somewhere around there. So for 100 years, we were using aspirin without actually understanding how it worked. We knew it worked, but we didn't know how it worked. The thing is the same with these systems that we're building with machine learning.
Starting point is 00:39:36 They seem to work, but we don't know how they work. Now, why is that dangerous? Well, it's dangerous with aspirin because in that intervening period where we were using it but didn't know how it worked, how could we have predicted how it would work and interact with some other drug? With our machine learning system, yes, it appears to work. It appears to work really well. But how can we anticipate how it will work with another machine learning subsystem when we put these two together? And this is a problem called intellectual debt. It's not my term. It's Siflain's term. But we are facing a large problem, and machine learning is a significant part of that.
Starting point is 00:40:26 But, yeah, we're never going to be able to analyze software down to the hardware level. And, you know, the techniques that we've used in the past to try to verify the correctness of our software, like testing, dynamic testing, you know, now are becoming increasingly ineffective. Testing software these days does not actually cover much. I call it digital homeopathy. But machine learning is in everything. I mean, I totally agree with you. I've done autonomous cars, autonomous vehicles,
Starting point is 00:41:03 and it will learn whatever you tell it to and it's not what you intended usually yeah so now when you take that software and combine it with another machine land system which you also don't understand fully to anticipate how those two will interact becomes very difficult i was a little surprised in your book that you had Markov modeling for that very reason, that it is not an auditable heuristic. How do you use Markov modeling? Yeah, so we use Markov modeling largely because, no, I shouldn't say that. The standards, IEC 61508, ISO 26262, EN 50128 for the railways,
Starting point is 00:41:51 they are prescriptive standards, as I said earlier, and they give methods and techniques which are either not recommended or are recommended or highly recommended. And if a technique is highly recommended in the standard and you don't do it, then you've got to justify why you don't do it, and you've got to demonstrate why what you do do is better. So in a lot of cases, it's simply easier to do it. We faced this at QNX for a number of years.
Starting point is 00:42:26 There was a particular technique which we thought was not useful. We justified the fact that it was not useful. In the end, it got too awkward to carry on arguing that it was not useful as a technique. So we hired a bunch of students, we locked them in a room and got them to do it it was stupid but it just took it off our back so that next time we went for certification we could say yes we do it tick give us a tick box please and there's a lot of things that are in that i have used markov modeling for various things for anomaly detection here and there but really the systems we're using these days are not sufficiently deterministic to make markov modeling particularly useful the you know the processors we're running on the socs we're running on come with 20 pages or more of
Starting point is 00:43:25 errata which make it completely non-deterministic you know there's a phrase i heard at a recent conference there is no such thing as deterministic software running on a modern processor and i think that's a correct statement so yeah i would, I would not push Markov modeling. It's in my book because the standards require it. Maybe it won't be in a third edition. The standards require it? What kind of standards? I mean, which standards?
Starting point is 00:43:57 How? Why? I mean, they can't disallow machine learning because it's not auditable and then say, oh oh but markov modeling is totally reasonable the difference is minor yeah the problem here is these standards as i said are out of date it i mean the last fish the last edition of ice of iec 61508 came out in 2010 there may be a new version coming out this year we're expecting a new version this year so that's 12 years between issues of the standard and i'm not sure how to say this um politely
Starting point is 00:44:36 a lot of people working on the standards are people who did their software development even longer ago. So they are used to a single-threaded, single-core, run-to-completion executive-type model of a processor and of a program. And so they are prescribing techniques which really are not applicable anymore. And I suspect Markov modeling is one of those. This is where I think this move towards goal-based standards like UL4600 is so useful. I don't care what techniques you use. I don't care what languages you use.
Starting point is 00:45:18 I don't care what this and the other. Demonstrate to me now that your system as built is adequately safe. And I think that's a much better way of doing stuff. It makes it harder for everybody. It makes it harder for the assessor because the assessor hasn't got a checklist. Did you do this? Yep. Did you do this? Yep. Did you do this? Did you not do that? It makes it harder for the developers, people like ourselves ourselves because we don't have the checklist to
Starting point is 00:45:45 say well we have to do this but it is a much more honest approach demonstrate to me with a safety case that your system is adequately safe are there tools for putting those sorts of things together are there tools to ensure you hit tick all the boxes, or is that all in the future? There is, again, we can talk about the two sides. The tick box exercise, yep, there are various checklists that you can download. IEC 6308 has a checklist and what have you. The better approach, the other one I talk about, the safety case approach. I think we did the safety case approach for FAAA. I think the DO-178B is more in that line,
Starting point is 00:46:37 where we basically had to define everything and then prove we'd defined it. Is that what you mean by a safety case or a goal driven? So the idea here is that we put together an argument for why our system is sufficiently safe. Now there's a number of notations for doing this. This one is called goal structuring notation and I'm using a tool called Socrates. There's a number of tools. This is Socrates tool. So what I can do here is we claim, this is our top level claim, we are claiming that our system, that our product, sorry, is sufficiently safe for deployment. And we're basing that claim on the fact that we have
Starting point is 00:47:20 identified the hazards and risks, we have adequately mitigated them, that we have provided the customer with sufficient information that the customer can use the product safely, and the fact that we develop the product in accordance with some controlled process. Now, we're only claiming the product is sufficiently safe if it is used as specified in our safety manual. And then, of course, we can go down. So what do we say? Customer can use the product safely. Jump to the subtree here.
Starting point is 00:47:54 Zoom in. So what I'm claiming here is that if in 25 years' time the customer comes back with a bug, we can reproduce that problem. We have adequate customer support, documentation is adequate, and so on. So the idea here is two things. First of all, and all UL 4600 and the more modern standards require is this argument based on evidence. The first thing to do is you must
Starting point is 00:48:26 put the argument together before you start to look for evidence. Otherwise, you get confirmation bias. There's a little experiment I do on confirmation bias. You've probably done this exercise yourself, but it one i i do with people a lot i say to them on the next slide i'm doing a slide presentation let's say and i say on the next slide i've written a rule for generating an x number in this sequence you're allowed to guess numbers in order to discover the rule now i'm not going to ask you to do this this year because i don't want you to to look like an idiot at the end of this. No, I read your book. I know what to guess.
Starting point is 00:49:08 Okay, great. So what happens is people guess 12, 14, 16. I say, great, yep, those numbers all work. On your slide, you have 2, 4, 6, 8, and 10. Okay. So you must now guess what the rule is for generating a next number in this sequence and so to do so you can guess numbers and so typically people guess 12 14 16 and i will say yep they work so what's the rule they say well it's it's even numbers i say nope not even numbers guess a few more and they go 18 20 22 what's the rule well plus two no it's not plus two and this goes on for some time i've been up all the way to 40 and and 44 and things with some customers until somebody guesses 137 just to be awkward and i say yep that works and it then leads us to what the rule is the rule has to be larger that each number must be larger than the previous one
Starting point is 00:50:13 the problem that this identifies is what's called confirmation bias at which we're all as human beings subject to. If you think you know the answer, you only look for evidence that supports your belief. If you believe that this is even numbers, you only look for evidence that it is even numbers. This was identified by Francis Bacon back in the 17th century. It was rediscovered, if you like, fairly recently. And we applied this to some of our safety cases and we started finding all sorts of additional bugs. Instead of asking people, produce me an argument to demonstrate
Starting point is 00:51:01 that this system is safe. Now, if you ask someone to do that, what sort of evidence are they going to look for? They're going to look for evidence that this system is safe. Now, if you ask someone to do that, what sort of evidence are they going to look for? They're going to look for evidence that the system is safe. So we said, look for evidence that the system is not safe. And then we will try to eliminate those. By doing that, we found an additional 25 or so problems that we had never noticed in our safety cases previously. So we took that to the standards bodies that produce the standards for this goal structuring notation, and now that doubt has been added to the standard. So basically, the idea is we put together an argument. We argue the customer can use the product safely. We argue that we have identified the hazards and risks,
Starting point is 00:51:52 that we have done this, and we take that to all the stakeholders and say, if I had the evidence for this argument, would that convince you? And typically they'll say, it's good but we'd also like this and we'd also like that and you can build that so we build the argument only then do we then go and look for evidence so here for example we come through the residual risks are acceptable there is a and the subclaim of that is that there is a plan in place to trace the residual risks during customer use of the products as a customer starts but if the customer uses our product in an environment we did not expect and there are
Starting point is 00:52:38 new risks then we have a plan in place to trace those. So now please show me your evidence that that is true. So first of all, we put together the argument, we agree the argument structure, and only then do we go to look for the evidence. And what we'd like to be able to do is put doubt on that. So okay, you've got a plan, but has that plan ever been used has that plan actually been approved do your engineers actually know about that plan you know put all of the doubt you possibly can into this safety case and i think then we have a justification for saying this product is sufficiently safe for deployment and nothing to do with as deployment. And nothing to do with, as you were saying, nothing to do with the fact that I used this technique
Starting point is 00:53:27 or I used Markov modeling or I did this or the other. It is the argument that says why I think my product is sufficiently safe. In your book, you had a disturbing section with a lawyer talking about liability for engineers. Oh, yes. I don't know whether I mentioned the anecdote with Ott Nortland. Possibly I did. For those who haven't read the book, I was at a conference a few years ago, a safety conference, and we were all standing around chatting as you do
Starting point is 00:54:05 and one of the people there was Ott Nortland who's well known in the safety area and he state he said something that hit me hit me like a brick he said he had a friend who is a lawyer that lawyer often takes on cases for engineers who are being prosecuted because their system has hurt somebody or the environment. And he said that his friend could typically get the engineer off and found innocent if the case came to court. But often the case does not come to court because the engineer has committed suicide
Starting point is 00:54:50 before the case reaches court. I'm not sure if that's the anecdote you were thinking of, Alicia, but as you can imagine, it stopped the conversation as we all sort of started to think about the implications of this but it does make you realize that the the work we're doing here is real that people do get hurt by bad software and the environment does get hurt by bad software. So, yeah, it is real. And there's a lot of moral questions to be asked as well as technical questions.
Starting point is 00:55:31 And as somebody who has been in the ICU and surrounded by devices, I want that documentation to have been done. This risk analysis, it can be very tedious. For all that we're saying, there's a creativity aspect to it. But all of this documentation, the goal, well, the standards prescribe what you're supposed to do. The goal is to make sure you think things through. Yes, the standards can be used. There's various ways of looking at the standards. I was in the ICU earlier this year.
Starting point is 00:56:08 I broke my wrist on the ice, and I was horrified to see, although I knew about it intellectually, I was horrified to see it practically, the number of Wi-Fi, Bluetooth connections coming from these devices that were all around me. Was it designed to work together in that way? You know, those systems? I don't know. But, you know, there's different ways to look at the standards. I don't like prescriptive standards, as I probably indicated during the course of this.
Starting point is 00:56:48 However, the prescriptive standards do give guidelines for a number of types of people. As I say, that startup, that spin out from a university that has had no product experience, basically, they really could use those standards standards not as must do this, must do this, must do that, but as a guideline on how to build a system. And I think no question there, these are good guidelines in general. They may be a little out of date, but they're good guidelines. They're certainly better than nothing. The other way of looking at these safety standards is that although they say on the cover that each one says this is a functional safety standard, building a safe product is trivially easy. The car that doesn't move is safe. The train that doesn't move is safe.
Starting point is 00:57:41 The aircraft that doesn't take off is safe what as soon as we make the product useful we make it less safe as soon as we let the car move it becomes less safe so i like to think of these standards sometimes as usable usefulness standards they allow us to make a safe system useful and i think if you approach them in that manner, then it answers, I think, in part, your concern about your devices in your intensive care unit and what have you, how they can be used. But yes, certainly some level of confidence,
Starting point is 00:58:24 like that safety case I spoke about, the product is sufficiently safe for deployment in a hospital environment with other pieces of equipment, with Wi-Fi's and Bluetooth's around it, and used by staff who are semi-trained and are in a hurry and are tired at the end of a long day. That should be documented and demonstrated. Yeah, I'd agree wholeheartedly. And documented and demonstrated because we want engineers and managers to think about that case.
Starting point is 00:59:00 Yes. But this is, again, comes back to what you were saying earlier of imagination. My wife often says that when she looks over my shoulder at some of these things, you need someone who is not a software engineer to be thinking up the cases where this could be deployed, it could be used, because you engineers, Chris, you are not sufficiently imaginative. It's not something that engineers do, is be imaginative in that way. And so, yeah, it is a problem. But ultimately, we are not going to be able to foresee
Starting point is 00:59:40 and take into account every situation. But certainly, if you look at those medical devices in the hospital you'll find most of them have some sort of keypad on so that the the attendant or nurse or doctor can type in a dose or something if you look at them you'll find half of the keypads are one way up like a telephone the others are the other way up like a calculator either zero is at the top or zero is at the bottom now in the course of a day a nurse will probably have to handle 20 of these devices all of which are differently laid out with different um keyboard layouts and all that sort of stuff that should have been standardized yeah that's that's that is setting you up to make a
Starting point is 01:00:27 mistake it's yes it's as it's not that we're making things safe that way we're we're actually designing them in a way that will cause a problem in the name of intellectual property and lack of standards? Because following a standard is a pain. I mean, it's not as much fun as designing software from the seat of their pants. No, it is much more fun to sit down and start coding. Yeah, I agree wholeheartedly. I am a programmer. I work all day in Ada and Python and C.
Starting point is 01:01:08 And, yeah, it is much more fun and much less efficient to sit down and start coding. I have a question from a viewer. A viewer, Rodrigo DM, asked, how do you develop fault-tolerant systems with hardware that is not dedicated to safety-critical design? For example, an ARM M0+. Yeah. So, as I said, the hardware is always going to be a problem. The only way, really, you can do that, if what you're looking for is a fault-tolerant system is duplication or replication um preferably with diversification uh the this is this is painful because of course it um requires uh it costs more money because you're going to put two of them in.
Starting point is 01:02:06 So you've got to say either I am going to have sufficient failure detection that I can stop the machine and keep it safe, or I'm going to have replication. I'm going to put two processors in or whatever. I had a customer a while back who was taking replication or diversification to an extreme. They were going to put one ARM processor, one x86 processor, one processor running Linux, one processor running Wind River, one processor running QNX, and so on.
Starting point is 01:02:43 And I asked the question, why are you diversifying the hardware? And the answer was, well, because the hardware has bugs. We know that. There's 20 pages of errata. And I said, well, yeah, but these are Heisenbugs. These are not bore bugs. These are random bugs. The last time I can remember a bore bug, a solid bug in a processor, was that x86 Pentium processor that couldn't divide. If you remember the Pentium bug back in 1994, 1995. If the processor is going to fail, it is likely to be going to fail randomly, in which case two processors of the same type are almost certainly going to be sufficient. Or even one processor with something running something like virtual synchrony, which would detect the fact that the hardware error has
Starting point is 01:03:40 occurred, and then take the appropriate action, which may be simply running that piece of software again. And I know there's a couple of companies, particularly in southern Germany, using coded processing to do the safety-critical computation so that you can check implicitly the correctness of whether the computation has been done correctly. And the argument is you can have as many hardware problems as you like, which don't affect my safety. I don't care. Bit flip in memory is going to occur every hour, but that's fine if it doesn't affect my computation. So if I use
Starting point is 01:04:21 something like coded processing, where I can check that the computation was done correctly to within one in a million, say, then I don't care about those hardware problems. But again, you've got to justify that. And that is the way I've seen one of our customers do it with coded processing using then non-certified hardware. And do you, back to Timon, do you have tools for doing risk analysis? Are there specific things you use? I don't know a good tool. If anybody has, then please let me know. Over the years, we have built tools to allow us to do this
Starting point is 01:05:08 yeah but they're internal python scripts to do this and the other uh no i don't know of a good tool for doing risk analysis sorry about that i'm kind of sad yeah, that's right. Part of the goals of many of the documentation involve traceability, where, as you were showing, you have a safe product, and that breaks into multiple things. That includes the safety manual, which breaks into multiple things. Do you have a tool for that? For the traceability yeah yeah this is so this is something that most of the processes demand a spice for example demands this um cmi the tracing
Starting point is 01:05:57 of a requirement to the design to the low level to the code, to the test cases, all the verification evidence that you have for that. What we have found there, this is going to sound silly, this is going to sound really silly, but we use LaTeX for all of our document preparation. The beauty of LaTeX is that it's textual. It is just ASCII text. So basically, we can embed comments in our documents. And then at the end of the product, when we have to produce the complete table of traceability, we simply run a Python script over those documents, it reads those structured comments, and it produces the document automatically. So if, on the other hand, we were
Starting point is 01:06:52 using some proprietary documentation tool like Microsoft's Word or something like that, I don't believe we could do that, and I'm not sure how you would do that. You'd have to keep that as a separate document manually. But the nice thing about LaTeX is just ASCII text.
Starting point is 01:07:09 You can run a Python script on it. You can pull out all this stuff and produce these really, really, really boring documents that tell you that safety requirement number 1234 is in paragraph 3.2.7.4 of the design documentation, and it relates to lines 47 to 92 of this particular c module and so on because all of that is just in ascii text so that's the way we've done it well i meant to only keep you for about an hour um and we've gone a bit over um one more question about languages uh you mentioned ada which is a language that has the idea of contracts and and provability and you mentioned c which is the reputation for being
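As a concrete illustration of that workflow (my own sketch; the comment convention, file layout, and output format are invented, not BlackBerry QNX's actual tooling), imagine structured LaTeX comments such as "% TRACE SR-1234 -> design:3.2.7.4" scattered through the .tex sources. A short Python script can then sweep the documents and emit the traceability table automatically:

```python
# Illustrative only: the comment convention, file layout, and output format
# are invented here, not the actual QNX tooling.
import re
from pathlib import Path

# Matches structured LaTeX comments such as:
#   % TRACE SR-1234 -> design:3.2.7.4
TRACE_RE = re.compile(r"^%\s*TRACE\s+(\S+)\s*->\s*(\S+)", re.MULTILINE)

def collect_traces(doc_root: str):
    """Sweep every .tex file under doc_root, yielding (requirement, target, file)."""
    for tex in Path(doc_root).rglob("*.tex"):
        text = tex.read_text(encoding="utf-8", errors="replace")
        for requirement, target in TRACE_RE.findall(text):
            yield requirement, target, tex.name

def traceability_table(doc_root: str) -> str:
    """Render the (deliberately boring) traceability table as LaTeX tabular rows."""
    rows = sorted(collect_traces(doc_root))
    return "\n".join(f"{req} & {target} & {fname} \\\\" for req, target, fname in rows)

if __name__ == "__main__":
    print(traceability_table("docs"))
```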
Starting point is 01:07:54 the wild wild west of unsafety yes which languages do you like Which languages should we be using? And how do we figure that out? Yeah, this is an interesting, the standards themselves deliberately do not, or sorry, recommend not using C. So if you look in IEC 61508, it gives a list of languages that it likes and dislikes, and it specifically says, don't use C. What it does say is, you can use C if you use a subset of C, and that subset is checked by some form of static analysis. So for example, you might take the MISRA subset of C, and you might use Coverity or something like that to clockwork to check that you are using that. I feel that, and to be fair, I find that there's a whole load of people there selling
Starting point is 01:09:07 products to try to make it, to make C better, to check that C doing this and that you're not doing this in C and you're not doing that. And I feel that we are getting to the point where we've got to stop putting lipstick on the C pig and go elsewhere. Now, where do you go elsewhere? Now, that's a good question. I just put up on the screen a list of some of the things that I feel you could discuss about what you need in languages. Ada and Spark Ada in particular, yes, we have the formal proving and we have a customer I'm working with at the moment who is using spark ada and that's great the other one that's on the horizon at the moment well d was on the horizon for a while but um rust seems to be coming on the horizon i have a bad experience with rust i teach a course which
Starting point is 01:10:02 this is one of the slides from it. A few years ago, I wrote a Rust version of a very bad C program that I have, one that has a race condition, and Rust would not compile it. That was great. That was exactly what I wanted: Rust refused to compile this badly structured program. A couple of months later, I was giving that course again, and I said, look, watch, I'm going to compile this with Rust and show you how it doesn't accept it. It accepted it. The compiler had been
Starting point is 01:10:32 changed. I repeated the same thing about six months later, and this time Rust gave me a warning message about a different thing. That's the problem with Rust: it's not yet stable. And it actually says, as I'm sure you're aware, in the Rust documentation, that this is the best documentation we have; it is not yet really suitable; it's not stable. So basically, you're missing a long history for the compiler and linker, and a stable history of the product.
Starting point is 01:11:11 And, yeah, there's a lot of other things you can talk about on languages. As I say, I write a lot of Ada, and particularly Spark Ada, and we're working closely with AdaCore on that. But AdaCore is now, of course, supporting Rust as well, and I think Rust may be the future eventually once it stabilizes. Once it stabilizes. That's always been my caveat as well. Yeah.
Starting point is 01:11:38 And you said at the top that there are plenty of opportunities in this area. If someone wants to work in safety-critical systems: one, how do they get into it, and two, what skills do they need to develop first? That's a really good question. Yes, it is a growth area. And what's more, it is an interesting area, because of all of the things we've been discussing today: the language support and all of these things, the accidental systems and how we handle accidental systems, whether we should be looking at SOTIF or whether we should be looking at failure.
Starting point is 01:12:19 There's a lot of research going on, a lot of interesting stuff going on. The problem is that it's basically an old man's game. And, yeah, when I go to conferences, which I do quite regularly, I think I probably lower the average age of the people by attending, which is really worrying. Yeah. And most of the people giving the presentations, and most of the
Starting point is 01:12:48 people at the conferences, are men, and I think that's got to change. There was a really useful thing at the Safety-Critical Systems Congress last year, where a young woman stood up and gave a presentation on how they're intending to make this more inclusive, but it hasn't happened. The trouble is education. I was giving an IEEE chat some years ago now, and I had an audience full of academics. And I said, okay, which of you teach some form of computing? And most of them put their hands up. Okay, which of you teach embedded computing? And a few put their hands up. How many of you teach anything to do with safety-critical embedded programming? And there was one university, the University of Waterloo; a chap from
Starting point is 01:13:37 there put his hand up. So this is not being taught in the universities, and therefore it is coming up the hard way. And so I think the way to do it is you've just got to get in. We are looking for people at the moment; everybody is looking for people. As for the skills, there are three levels of skill that people need. There's skill in software engineering in general. There is skill in the particular vertical area, whether that's railway trains or medical devices or whatever. And there is then skill in the safety-critical stuff. And I think any company that's looking for people
Starting point is 01:14:24 is going to be looking for at least two of those; you're not going to get all three. And so, yeah, you can read books like mine, but it's not going to really help that much. You've got to go out and do it. So I think: become familiar with the embedded software world, as Alicia teaches and what have you, then become familiar with a vertical market, whether that's aviation or autonomous cars or something like that, and then go and just apply. And do you have any thoughts you'd like to leave us with?
Starting point is 01:15:04 Well, I think to some extent, it was what I've just said, but I think it's worth just repeating. This is a growth area. This is an exciting area. There's lots of research going on, on digital clones and all sorts of things that's going on at the moment. This is an area where we need young people who are going to take it to the next level. And so let people like myself retire even, get out of the industry and stop lowering the average age of conferences. Yes. Yeah.
Starting point is 01:15:43 Our speaker has been Chris Hobbs, author of Embedded Software Development for Safety Critical Systems. Chris, thank you for being with us. Well, as I say, thank you for the invitation. I enjoyed myself. And if there are any further questions and if they can be made available in some way, I'm very happy to try to address them, of course. All right. We will figure that out.
Starting point is 01:16:08 I'd like to thank Felipe and Jason from Classpert for making this happen. And our Making Embedded Systems class mentor, Aaron, for putting up some of those helpful links of the standards we talked about. I'm Alicia White, an instructor for Classpert, teaching the course Making Embedded Systems. And I have a podcast called Embedded FM, where we will probably be hearing this interview. Thank you so much for joining us, and we hope you have a good day.
