
> the backup system applied the same logic to the flight plan with the same result

Oops. In software, the backup system should use different logic. When I worked at Boeing on the 757 stab trim system, there were two avionics computers attached to the wires that activate the trim. The attachment was through a comparator, which would shut off the authority of both boxes if they didn't agree.

The boxes were designed with:

1. different algorithms

2. different programming languages

3. different CPUs

4. code written by different teams with a firewall between them

The idea was that bugs from one box would not cause the other to fail in the same way.
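A minimal sketch of that comparator idea, in Python (hypothetical; the real comparator obviously wasn't Python, and the tolerance is invented): each box computes its trim command independently, and the comparator removes authority from both if they disagree.

  TOLERANCE = 0.05  # invented agreement threshold, in trim units

  def comparator(cmd_a, cmd_b):
      """2oo2: pass a command only if both channels agree.

      Returning None means 'authority shut off'; the pilot
      (via the stab trim cutoff) becomes the backup.
      """
      if abs(cmd_a - cmd_b) <= TOLERANCE:
          return (cmd_a + cmd_b) / 2.0
      return None  # disagreement: both boxes lose authority

  trim = comparator(0.42, 0.44)   # commands from independently written boxes
  if trim is None:
      print("trim authority removed; manual trim only")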



This would have been a 2oo2 system where the pilot becomes the backup. 2oo2 systems are not highly available.

Air traffic control systems should be at least 2oo3[1] (3 independently developed systems, of which 2 must concur at any given time), so that a failure of one system would still allow the other two to continue operating without impacting the availability of the aviation industry.

Human backup is not possible because of human resourcing and complexity. ATC systems would need to be available to provide separation under IFR[2] and CVFR[3] conditions.

[1] https://en.wikipedia.org/wiki/Triple_modular_redundancy

[2] https://en.wikipedia.org/wiki/Instrument_flight_rules#Separa...

[3] https://en.wikipedia.org/wiki/Visual_flight_rules#Controlled...
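A toy contrast between the two schemes (Python sketch; tolerance and values invented): a 2oo3 voter keeps producing an output as long as any two of the three independently developed channels agree, so a single channel failure doesn't take the service down.

  from itertools import combinations

  TOLERANCE = 0.05  # invented agreement threshold

  def vote_2oo3(a, b, c):
      """Return the mean of the first agreeing pair, or None (trip) if none agree."""
      for x, y in combinations((a, b, c), 2):
          if abs(x - y) <= TOLERANCE:
              return (x + y) / 2.0
      return None

  print(vote_2oo3(0.42, 0.43, 9.99))  # one channel failed -> still 0.425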


> Air traffic control systems should at least be 2oo3... Human backup is not possible because of human resourcing and complexity.

But this was a 1oo1 system, and the human backup handled it well enough: a lot of people were inconvenienced, but there were no catastrophes, and (AFAIK) nothing that got close to being one.

As for the benefits of independent development: it might have helped, but the chances of this being so are probably not as high as one would have hoped if one thought programming errors were essentially random defects analogous to, say, weaknesses in a bundle of cables; I had a bit more to say about it here:

https://news.ycombinator.com/item?id=37476624


> But this was a 1oo1 system, and the human backup handled it well enough ...

Heh, a hundred million pound outage. ;)

True, no-one seems to have died from it directly though.


True. I don't want to downplay the actual cost (or, worse, suggest that we should accept "the system worked as intended" excuses), but it's not just that there were no crashes: the air traffic itself remained under control throughout the event. Compare this to, for example, the financial "flash crash" of 2010, or the nuclear 'excursions' at Fukushima / Chernobyl / Three Mile Island / Windscale, where those nominally in control were reduced to being passive observers.

It also serves as a reminder of how far we have to go before we can automate away the jobs of pilots and air traffic controllers.


This reminds me of a backwoods hike I took with a friend some years back. We each brought a compass, "for redundancy", but it wasn't until we were well underway that we noticed our respective compasses frequently disagreed. We often wished we had a third to break the tie!


Sounds like the joke about a man with one watch always being sure about what time it is, but a man with two being continuously in doubt.


Just compute the average, then correct for the documented drift against an external source?
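Minor wrinkle if you actually do that: compass bearings wrap at 360°, so a naive average of 350° and 10° comes out as 180°. A circular mean (quick Python sketch) handles the wrap-around:

  import math

  def mean_heading(a_deg, b_deg):
      """Circular mean of two bearings, in degrees."""
      x = math.cos(math.radians(a_deg)) + math.cos(math.radians(b_deg))
      y = math.sin(math.radians(a_deg)) + math.sin(math.radians(b_deg))
      return math.degrees(math.atan2(y, x)) % 360

  print(mean_heading(350, 10))  # ~0 (north), not the naive (350 + 10) / 2 = 180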


My grandfather was working with Stanisław Skarżyński, who was preparing for his first crossing of the Atlantic in a lightweight airplane (RWD-5bis, 450kg empty weight) in 1933.

They initially mounted two compasses in the cockpit, but Skarżyński taped one of them over so that it wasn't visible, saying wisely that if one failed, he would have no idea which one was correct.


> if one fails, he will have no idea which one is correct

Depends how it fails! For example, say, when you change direction one turns and the other doesn't.


Couldn't he bring his own 3rd? Compasses aren't heavy?


…or a 4th and a 5th, and have voting rounds — an idea explored by Stanisław Lem in "Golem XIV", where a parliament of machines voted :-)


That's a cool story! Would have loved to have heard more about that :)


In this case the problem was choosing an excessively naive algorithm. I'm very inexperienced, but it seems to me that the solution would be to spend a bit more money on reviewing the one implementation rather than writing two new ones from scratch.


You would be very surprised how difficult avionics are, even at a fundamental level.

I'll provide a relatively simple example.

Even just attempting to design a Star Fox game clone where the ship goes towards the mouse cursor using Euler angles will almost immediately result in gimbal lock, with your starfighter locking up tighter than an unlubricated car engine at 100 mph and unable to move. [0]

The standard solution in games (or at least what I used) has been quaternions [1] (Hamilton defined a quaternion as the quotient of two directed lines in three-dimensional space or, equivalently, as the quotient of two vectors). So you essentially dump your 3D orientation into the 4D quaternion representation, apply your rotations there, then convert back to 3D space and apply your transforms.

This was literally just to get my little space ship to go where my mouse cursor was on the screen without it locking up.

So... yeah, I cannot even begin to imagine the complexity of what a Boeing 757 (let alone a 787) is doing under the hood to deal with reality without bricking up and falling out of the sky.

[0] https://math.stackexchange.com/questions/8980/euler-angles-a...

[1] https://en.wikipedia.org/wiki/Quaternion
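For anyone curious what that looks like in practice, here's a bare-bones version of the quaternion rotation (illustrative Python, no engine specifics): build a unit quaternion from an axis and angle, rotate with q v q*, and there's no gimbal-lock singularity to hit.

  import math

  def quat_from_axis_angle(axis, angle):
      """Unit quaternion (w, x, y, z) for a rotation of `angle` radians about `axis`."""
      ax, ay, az = axis
      n = math.sqrt(ax*ax + ay*ay + az*az)
      s = math.sin(angle / 2) / n
      return (math.cos(angle / 2), ax*s, ay*s, az*s)

  def quat_mul(a, b):
      """Hamilton product of two quaternions."""
      aw, ax, ay, az = a
      bw, bx, by, bz = b
      return (aw*bw - ax*bx - ay*by - az*bz,
              aw*bx + ax*bw + ay*bz - az*by,
              aw*by - ax*bz + ay*bw + az*bx,
              aw*bz + ax*by - ay*bx + az*bw)

  def rotate(v, q):
      """Rotate 3D vector v by unit quaternion q: q * v * conj(q)."""
      qc = (q[0], -q[1], -q[2], -q[3])
      w, x, y, z = quat_mul(quat_mul(q, (0.0, *v)), qc)
      return (x, y, z)

  q = quat_from_axis_angle((0, 0, 1), math.pi / 2)   # 90 degrees about z
  print(rotate((1.0, 0.0, 0.0), q))                  # ~ (0, 1, 0)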


I don't think we're talking about that kind of software, though. This bug was in code that needs to parse a line defined by named points and then clip the line to the portion in the UK. Not trivial, but I can imagine writing that myself.
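To make that concrete, here's a toy version of the pitfall (Python, with invented waypoint data, not the real ATC code): if an identifier appears twice on the route, looking the exit point up by name alone can find the wrong occurrence and make the exit appear to come before the entry, whereas searching only in the part of the route after the entry disambiguates it.

  # Invented route: "DVL" names two different real-world points, one well
  # before the region of interest and one where the route actually leaves it.
  route = ["AAA", "DVL", "ENTRY", "MID", "DVL", "BBB"]

  def clip_naive(route, entry="ENTRY", exit_="DVL"):
      i = route.index(entry)      # 2
      j = route.index(exit_)      # 1 -- the wrong DVL, before the entry
      if j < i:
          raise RuntimeError("exit before entry?!")  # "can't happen" -- it just did
      return route[i:j + 1]

  def clip_checked(route, entry="ENTRY", exit_="DVL"):
      i = route.index(entry)
      j = route.index(exit_, i + 1)   # only look for the exit after the entry
      return route[i:j + 1]

  print(clip_checked(route))   # ['ENTRY', 'MID', 'DVL']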

But regardless, the more complex the code, the worse an idea it is to maintain three parallel implementations if you won't/can't afford to do it properly.


I was doing some orientation sensing 20 years ago with an IMU and ran into the same problem. I didn't know at the time that it was gimbal lock (which I had heard of), but I did read that quaternions were the way to fix it. Pesky problem.


> Human backup is not possible because of human resourcing

This is an artificial restraint. In the end, it comes down to risk management: "Are we willing to pay someone to make sure the system stays up when the computer does something unexpected?".

Considering this bug only showed up now, chances are there was a project manager who decided the risk would be extremely low and not worth spending another 200k or so of yearly operating expenses on.


First thought that came to my mind as well when I read it. This failover system seems designed more to mitigate hardware failures than software bugs.


I also understand that it is impractical to implement the ATC system software twice using different algorithms. The software at least checked for an illogical state and exited, which was the right thing to do.

A fix I would consider is to have the inputs more thoroughly checked for correctness before passing them on to the ATC system.


> A fix I would consider is to have the inputs more thoroughly checked for correctness before passing them on to the ATC system.

Thorough checking of the inputs, as far as possible, should be a given, but in this case the inputs were correct: while the use of duplicate identifiers is considerably less than ideal, the constraints on where that was permitted meant that there was one deterministically unambiguous parsing of the flight plan, as demonstrated in the article. The proximate cause of the problem was not in the inputs, but in how they were processed by the ATC system.

For the same reason, multiple implementations of the software would only have helped if a majority of the teams understood this issue and got it right. I recall a fairly influential paper in the '90s (IIRC) in which multiple independent implementations of a requirements specification were compared, and the finding was that the errors were quite strongly correlated - i.e. there was a tendency for the teams to make the same mistakes as each other.


Not stronger isolation between different flight plans? It seems "obvious" to me that if one flight plan is causing a bug in the handling logic, the system should be able to recover by continuing with the next flight plan and flagging the error to operators, so that only that flight is impacted.
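Something like the usual "quarantine the bad record, keep processing the rest" pattern (Python sketch; all names invented), though whether silently carrying on is safe is a separate question:

  def process_all(flight_plans, process_one, notify_operators):
      """Process each plan independently; a failure affects only that plan."""
      failed = []
      for plan in flight_plans:
          try:
              process_one(plan)
          except Exception as exc:            # isolate the failure to this plan
              failed.append((plan, exc))
              notify_operators(plan, exc)     # flag it for manual handling
      return failed                           # everything else kept flowing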


I'm no aviation expert, but perhaps with waypoints:

  A B C D E
   /
  F G H I J
If flight plan #1 is known to be going from F-B at flight level 130, and you have a (supposedly) bogus flight plan #2, they can't quite be sure if it might be going from A-G at flight level 130 at the same time and thus causing a really bad day for both aircraft. I'd worry that dropping plan #2 into a queue for manual intervention, especially if this kind of thing only happens once every 5 years, could be disastrous if people don't realize what's happening and why. Many people might never have seen anything in that queue and may not be trained to diagnose the problem and manually translate the flight plan.

This might not be the reason why the developer chose to have the program essentially pull the fire alarm and go home in this case, but that's the impression I got.


The ATC system handled well enough (i.e. no disasters, and AFAIK, no near misses) something much more complicated than one aircraft showing up with no flight plan: the failure of this particular system put all the flights in that category.

I mentioned elsewhere that any ATC system has to be resilient enough to handle things like in-flight equipment failure, medical emergencies, and the diversion of multiple aircraft on account of bad weather or an incident which shuts down a major airport.

As for why the system "pulled the plug", the author of the article suspects that this particular error was regarded as something that would not occur unless something catastrophic had caused it, whereas, in reality, it affected only one flight and could probably have been easily worked around if the system had informed ATC which flight plan was causing the problem.


I'm not sure they're even used for that purpose - that side of things is done "live" as I understand it - the plans are so that ATC has the details on hand for each flight and it doesn't all need to be communicated by radio as they pass through.


"unexpected errors" are not necessarily problems with the flight plans. They could be anything.


I wonder where most of the complexity lies in ATC. Naively you’d think there would be some mega computer needed to solve the puzzle but the UK only sees 6k flights a day and the scale of the problem, like most things in the physical world, is well bounded. That’s about the same number of buses in London, or a tenth of the number of Uber drivers in NYC.

It would be interesting to actually see the code.


Much of the complexity is in interop. Passing data between ATC control positions, between different facilities, and between different countries. Then every airline has a bidirectional data feed, plus all the independent GA flights (either via flight service or via third-party apps). Plus additional systems for weather, traffic management, radar, etc. Plus everything happening on the defense side.

All using communication links and protocols that have evolved organically since the 1950s, that need global consensus (with hundreds of different countries' implementations), and that must never fail.


The system should have just rejected the FPL, notified the admins about the problem, and kept working. The admins could have fixed whatever the software could not handle.

The affected flight could have been vectored by ATC if needed to divert from filed FPL.

Way less work and a better outcome than “the system throws its hands in the air and becomes unresponsive”.


"When a failsafe system fails, it fails by failing to fail safe."

J. Gall


Different teams often make the same mistake. The system you describe is not perfect, but makes sense.


I neglected to mention there was a third party that reviewed the algorithms to verify they weren't the same.

Nothing is perfect, though, and the pilot is the backup for failure of that system. I.e. turn off the stab trim system.


Is this still the case for simple algorithms?


I don't know as much about modern avionics.


If this is true, then would it be a better investment to have the 2nd team produce a fuzz-testing/systematic-testing mechanism instead of producing a secondary copy of the same system?

In fact, make it adversarial testing, such that this team is rewarded (maybe financially) if mistakes or problems are found in the 1st team's program.
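A sketch of what that could look like (hypothetical Python; parse_and_clip stands in for the first team's code): the second team ships a generator of nasty inputs plus invariants the first team's program must never violate, and gets rewarded per violation found.

  import random

  NAMES = ("AAA", "BBB", "CCC", "DVL")   # deliberately allows duplicate identifiers

  def random_route(max_len=12):
      return [random.choice(NAMES) for _ in range(random.randint(2, max_len))]

  def fuzz(parse_and_clip, rounds=100_000):
      for _ in range(rounds):
          route = random_route()
          try:
              clipped = parse_and_clip(route)
          except Exception as exc:
              return ("crash", route, exc)              # reward: found a crash
          if any(wp not in route for wp in clipped):
              return ("broken invariant", route, clipped)
      return None                                        # nothing found this run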


Such incentives can lead to reduced collaboration. If I get paid every time you make mistakes, I won't want you to get better at your job


The whole point is that they're not collaborating, precisely to avoid cross-contamination. Also, you don't get paid unless and until you identify the mistake. If you decrease the reward over time, there is an additional incentive not to sit on the information.


As a side note, too bad they knowingly didn't reuse such an approach for the MAX...


The MAX system relied on the pilot remembering the stab trim cutoff switch and what it was for.


Even though the trim cutoff switch didn't work as it used to do on the previous generation of 737s, and the pilots were not notified about the change.


Wouldn't trim be a number for which a significant tolerance is permissible at any given time? Or does "agree" mean "within a preset tolerance"?


Naturally, any comparator would have some slack in it to account for variations. Even CPU internals have such slack; that's why there's a "clock" to synchronize things.


I would be very interested in knowing which languages were used. Do you know which were? Thanks


One of them was Pascal. This was around 1980 or so.



