The "software industry in general", sure, but my understanding is that any flight-critical software goes under a much more rigorous testing regime and uses much better practices the average software project. It's probably why such issues were caught so early here.
Similar formal methods are used for critical software pieces in Airbus aircraft (source: the same speaker as in the linked video, but I don't remember which talk it was).
When proving it's correct, isn't there always the possibility that the proof abstracted away some real-world nuance into an assumption, and that assumption could be wrong?
Yes, I guess the proof is only relative to the formal spec; if the spec (formal or not) is nonsense, that's a different problem (so you might not have to test the program, but you still have to test the complete system).
But I guess (again) that having formal specs proven against the code helps to avoid inconsistent specs. I recall a talk by Martyn Thomas where he said that with these methods it was harder to get programs to compile (for example, the compiler could complain if it detected an inconsistency, either between the spec and the program or within the spec itself).
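To give a feel for the idea, here's a minimal toy sketch in Lean (a general-purpose proof assistant, not the SPARK/SCADE-style tooling typically used in avionics): the spec is written as a theorem about the program, and the checker only accepts it if the program and spec actually agree.

```lean
-- Toy example only; real avionics specs are vastly larger.

-- The "program".
def double (n : Nat) : Nat := n + n

-- The "spec", written as a theorem about the program. The checker only
-- accepts the proof if program and spec actually agree; a spec that
-- contradicts the code (or itself) simply fails to check.
theorem double_spec (n : Nat) : double n = 2 * n := by
  unfold double
  omega

-- Another spec-level property, proven once and for all rather than tested.
theorem double_is_even (n : Nat) : double n % 2 = 0 := by
  unfold double
  omega
```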
This topic reminds me of this relatively old (but still super interesting) article about the team that worked on the space shuttle's onboard computer systems and the rigorous processes they followed to ensure correctness and safety: https://www.fastcompany.com/28121/they-write-right-stuff.
> the team that worked on the space shuttle's onboard computer systems and the rigorous processes they followed to ensure correctness and safety
Those processes are expensive. I'd imagine that they're a huge political problem to maintain, given cost-cutting pressures and the temptations of COTS [1].
Compliance with SEI CMMI Level 5 is probably expensive. While looking for more data I noted that several Boeing units are at Level 5. I also found that between 2008 and 2019, about 12% of appraisals given across all industries and countries were at maturity levels 4 and 5, and hundreds were at Level 5 (more than at Level 4), according to the CMMI Institute: https://cmmiinstitute.com/resource-files/public/cmmi-adoptio...
So it all makes me wonder about the official maturity rating of Boeing's 737 team, and also how much that rating relates to actual software quality and safety objectives (as apparently achieved in the space shuttle work)...
Oh totally, I can't dispute that. It's just good reading in general, but it's more of a response to the grandparent comment stating that "the software industry is a complete joke in terms of quality control". I think that's probably not a completely off-base assessment of quality control on average in the tech industry, but the point is that it's not like no one knows how to do better; rigorous QA processes exist, but there's a huge tradeoff between cost/agility/etc. and the expected cost/damage of not getting it right.
Okay, I think that's an unhelpful way to put it, but it's still worth addressing, and there's a lot to unpack here, so here goes.
My original point was that flight-critical software shouldn't be lumped in with the typically buggy world of software. I still say that's correct.
But "bug" has a range of meanings. From what I read of the story of the original fatal error on the 300 MAX, the software was not buggy in the sense of "deviating from specification". It's just that -- as it turns out -- the specification was bad, and the MCAS should not have overridden pilot input when it did.
In a sense -- the one I consider most important -- that matches my model of flight-critical software. As software, they made absolutely sure that, in every case, it did what it was intended to do; to the extent that it was in error, it was not in the sense of "we failed to make the software do what the spec says".
OTOH, I agree that "bug" is taken more broadly in software than the sense I was using here, and that there usually isn't such a clean ("Waterfall") separation between "specifying what the software should do" and "ensuring that it does that" -- there's a collaborative process of ferreting out the implications of "what the system should do at what points".
In these latest events, it looks like the bug was in the narrower sense I meant above, of "it doesn't meet the spec", and that bug was found before deaths. From the article:
>The problem was that an indicator light, designed to warn of a malfunction by a system that helps raise and lower the plane’s nose, was turning on when it wasn’t supposed to, the company said.
Now, in fairness, that's a much later discovery than my optimistic model held, but still far earlier than "while being used to fly the public around", which is when the analog might surface in the software industry more generally.
The only issue here is that the proximate cause of failure was what you're saying.
What is worrying is that there could be many other potential sources of failure which simply didn't trigger, because the MCAS situation was so bad it led to 2 plane crashes before those other failures even had a chance to bring down a plane.
In other words, if you have 2 bugs, one with a 1/100 chance of triggering and the other with a 1/1000 chance of triggering, then by the time you hit 500 attempts, odds are the first bug has already triggered a couple of times, while the second hasn't triggered even once. So you solve the first bug, but that still leaves you with a problem that has a 1/1000 chance of triggering, when the expectation is that bugs should only trigger at worst 1 in 10,000 times.
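For what it's worth, here's a quick back-of-the-envelope check of that reasoning with a plain binomial model (the 1/100 and 1/1000 rates and the 500 attempts are the made-up numbers above, not real failure rates):

```python
# Back-of-the-envelope check of the numbers above, using a plain binomial
# model. The rates and attempt count are illustrative, not real figures.
from math import comb

def prob_at_least(k: int, n: int, p: float) -> float:
    """P(at least k occurrences in n independent trials, each with probability p)."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

n = 500
print(f"P(1/100 bug fires at least twice in {n} tries): {prob_at_least(2, n, 1/100):.2f}")   # ~0.96
print(f"P(1/1000 bug never fires in {n} tries):         {1 - prob_at_least(1, n, 1/1000):.2f}")  # ~0.61
```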
Solving the MCAS issue is not by itself a reliable indicator that this is a safe plane to fly, especially since we know there are many fundamental procedural reasons to be worried about the quality of the plane.
Sounds right to me! My point is just, it doesn't feel right to lump this kind of failure in with "lol all software is super buggy and this is no exception". Yes, there was a failure here; no, it didn't look anything like "ohhh yeah pointers are super hard to get right".
> the MCAS should not have overridden pilot input when it did
The entire purpose of MCAS was to override pilot input, since the typical instincts of a trained 737 pilot are to do the wrong thing. The failures were not handling edge cases (conflicting or missing sensor data) and having no clear way to turn off the system when it was failing.
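Just to make "handling edge cases" concrete, here's a purely hypothetical toy sketch -- not Boeing's logic, and nothing like real avionics code -- of what cross-checking redundant sensors and honoring a pilot cutout could look like:

```python
# Purely hypothetical illustration; all names, thresholds, and magnitudes
# are made up.
from typing import Optional

MAX_AOA_DISAGREEMENT_DEG = 5.5   # made-up threshold
AOA_TRIGGER_DEG = 12.0           # made-up threshold
NOSE_DOWN_TRIM_UNITS = 0.5       # made-up magnitude

def mcas_command(aoa_left: Optional[float],
                 aoa_right: Optional[float],
                 pilot_cutout: bool) -> Optional[float]:
    """Return a nose-down trim command, or None to stay disengaged."""
    # A clear way to turn the system off: the pilot cutout always wins.
    if pilot_cutout:
        return None
    # Missing sensor data: disengage rather than act on a single reading.
    if aoa_left is None or aoa_right is None:
        return None
    # Conflicting sensor data: disengage and leave control with the crew.
    if abs(aoa_left - aoa_right) > MAX_AOA_DISAGREEMENT_DEG:
        return None
    # Only command trim when the agreed-upon angle of attack is too high.
    aoa = (aoa_left + aoa_right) / 2
    return NOSE_DOWN_TRIM_UNITS if aoa > AOA_TRIGGER_DEG else None
```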