> the backup system applied the same logic to the flight plan with the same result
Oops. In software, the backup system should use different logic. When I worked at Boeing on the 757 stab trim system, there were two avionics computers attached to the wires to activate the trim. The attachment was through a comparator that would shut off the authority of both boxes if they didn't agree.
The boxes were designed with:
1. different algorithms
2. different programming languages
3. different CPUs
4. code written by different teams with a firewall between them
The idea was that bugs from one box would not cause the other to fail in the same way.
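To make the comparator idea concrete, here is a toy Python sketch (purely illustrative; the real system is hardware, and the names and tolerance are my inventions):

```python
# Toy illustration of a 2oo2 comparator: two dissimilar channels drive the
# actuator only while they agree; on disagreement, both lose authority and
# the pilot becomes the backup.
def trim_authority(channel_a: float, channel_b: float, tolerance: float = 0.01):
    """Return the command to pass to the trim actuator, or None to cut off
    the authority of both boxes."""
    if abs(channel_a - channel_b) <= tolerance:
        return channel_a  # channels agree (within slack): pass the command
    return None           # channels disagree: shut off, hand control to pilot
```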
This would have been a 2oo2 system where the pilot becomes the backup. 2oo2 systems are not highly available.
Air traffic control systems should at least be 2oo3[1] (3 independently developed systems, of which 2 must concur at any given time) so that a failure of one system would still allow the other two to continue operation without impacting availability of the aviation industry.
Human backup is not possible because of human resourcing and complexity. ATC systems would need to be available to provide separation under IFR[2] and CVFR[3] conditions.
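For contrast with the 2oo2 comparator above, a minimal Python sketch of the 2oo3 voting idea (illustrative only, not any real ATC architecture):

```python
# 2oo3 voting: three independently developed channels compute the same
# output; any value at least two of them agree on is accepted, so a single
# faulty channel does not take the system down.
def vote_2oo3(a, b, c):
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no 2-of-3 agreement: degrade and alert operators")
```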
> Air traffic control systems should at least be 2oo3... Human backup is not possible because of human resourcing and complexity.
But this was a 1oo1 system, and the human backup handled it well enough: a lot of people were inconvenienced, but there were no catastrophes, and (AFAIK) nothing that got close to being one.
As for the benefits of independent development: it might have helped, but the chances of this are probably not as high as one would hope if one thinks of programming errors as essentially random defects analogous to, say, weaknesses in a bundle of cables; I had a bit more to say about it here:
True. I don't want to downplay the actual cost (or, worse, suggest that we should accept "the system worked as intended" excuses), but it's not just that there were no crashes: the air traffic itself remained under control throughout the event. Compare this to, for example, the financial "flash crash" of 2010, or the nuclear 'excursions' at Fukushima / Chernobyl / Three Mile Island / Windscale, where those nominally in control were reduced to being passive observers.
It also serves as a reminder of how far we have to go before we can automate away the jobs of pilots and air traffic controllers.
This reminds me of a backwoods hike I took with a friend some years back. We each brought a compass, "for redundancy", but it wasn't until we were well underway that we noticed our respective compasses frequently disagreed. We often wished we had a third to break the tie!
My grandfather was working with Stanisław Skarżyński, who was preparing for his first crossing of the Atlantic in a lightweight airplane (RWD-5bis, 450kg empty weight) in 1933.
They initially mounted two compasses in the cockpit, but Skarżyński taped one of them over so that it wasn't visible, saying, wisely, that if one failed, he would have no idea which one was correct.
In this case the problem was choosing an excessively naive algorithm. I'm very inexperienced, but it seems to me that the solution would be to spend a bit more money on reviewing the one implementation rather than writing two more from scratch.
You would be very surprised how difficult avionics is, even at a fundamental level.
I'll provide a relatively simple example.
Merely attempting to design a Starfox clone where the ship steers toward the mouse cursor using Euler angles will almost immediately result in gimbal lock, your starfighter seizing up tighter than an unlubricated car engine at 100 mph, unable to move. [0]
The standard solution in games (or at least what I used) has been to use quaternions [1] (Hamilton defined a quaternion as the quotient of two directed lines in a three-dimensional space, or, equivalently, as the quotient of two vectors.) So you essentially lift your 3D orientation into the 4D quaternion representation, apply your rotations there, then convert back to 3D space and apply your transforms.
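For the curious, here is a minimal Python sketch of that approach (my own illustration, not the actual game code; pure Python, no libraries):

```python
import math

def quat_from_axis_angle(axis, angle):
    """Unit quaternion (w, x, y, z) for a rotation of `angle` radians about `axis`."""
    ax, ay, az = axis
    n = math.sqrt(ax * ax + ay * ay + az * az)
    s = math.sin(angle / 2) / n
    return (math.cos(angle / 2), ax * s, ay * s, az * s)

def quat_mul(a, b):
    """Hamilton product: composes rotation b followed by rotation a."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

def rotate(q, v):
    """Rotate 3-vector v by unit quaternion q: q * (0, v) * conj(q)."""
    qc = (q[0], -q[1], -q[2], -q[3])
    w, x, y, z = quat_mul(quat_mul(q, (0.0, *v)), qc)
    return (x, y, z)

# Orientation lives as one quaternion; per-frame yaw/pitch increments compose
# by multiplication and never pass through a degenerate Euler configuration.
orientation = (1.0, 0.0, 0.0, 0.0)                                          # identity
orientation = quat_mul(quat_from_axis_angle((0, 1, 0), 0.02), orientation)  # small yaw
orientation = quat_mul(quat_from_axis_angle((1, 0, 0), 0.01), orientation)  # small pitch
print(rotate(orientation, (0.0, 0.0, 1.0)))  # where the ship's nose points now
```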
This was literally just to get my little space ship to go where my mouse cursor was on the screen without it locking up.
So... yeah, I cannot even begin to imagine the complexity of what a Boeing 757 (let alone a 787) is doing under the hood to deal with reality without bricking up and falling out of the sky.
I don't think we're talking about that kind of software, though. This bug was in code that needs to parse a route defined by named points and then clip it to the portion inside the UK. Not trivial, but I can imagine writing that myself.
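Something of that flavor, as a rough Python sketch (the names and the inside() test are my assumptions, not the real system's code, and real clipping must also handle segments that cross the boundary):

```python
def region_portion(route, inside):
    """route: list of (name, lat, lon); inside: point-in-region test.
    Return the slice of the route from its first point inside the region
    to its last point inside it."""
    flags = [inside(lat, lon) for _name, lat, lon in route]
    if True not in flags:
        return []  # the route never enters the region
    first = flags.index(True)
    last = len(flags) - 1 - flags[::-1].index(True)
    return route[first:last + 1]
```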
But regardless, the more complex the code, the worse an idea it is to maintain three parallel implementations if you won't or can't afford to do it properly.
I was doing some orientation sensing 20 years ago with an IMU and ran into the same problem. I didn't know at the time that it was gimbal lock (which I had heard of), but I did read that quaternions were the way to fix it. Pesky problem.
> Human backup is not possible because of human resourcing
This is an artificial constraint. In the end, it comes down to risk management: "Are we willing to pay someone to make sure the system stays up when the computer does something unexpected?".
Considering this bug only showed up now, chances are there was a project manager who decided the risk would be extremely low and not worth spending another 200k or so of yearly operating expenses on.
First thought that came to my mind as well when I read it. This failover system seems designed more to mitigate hardware failures than software bugs.
I also understand that it is impractical to implement the ATC system software twice using different algorithms. The software at least checked for an illogical state and exited, which was the right thing to do.
A fix I would consider is to have the inputs more thoroughly checked for correctness before passing them on to the ATC system.
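As a hand-wavy Python sketch of what such pre-checks could look like (all names are hypothetical; note the reply below points out that duplicate identifiers were in fact legal, so they could at most be flagged, not rejected):

```python
def prevalidate(plan):
    """Return a list of warnings/errors for one flight plan before it is
    handed to the core ATC system."""
    issues = []
    if not plan.waypoints:
        issues.append("error: no waypoints")
    seen = set()
    for wp in plan.waypoints:
        if wp.ident in seen:
            issues.append(f"warning: duplicate identifier {wp.ident}")
        seen.add(wp.ident)
    return issues
```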
> A fix I would consider is to have the inputs more thoroughly checked for correctness before passing them on to the ATC system.
Thorough checking of the inputs, as far as possible, should be a given, but in this case the inputs were correct: while the use of duplicate identifiers is considerably less than ideal, the constraints on where that was permitted meant that there was one deterministically unambiguous parsing of the flight plan, as demonstrated in the article. The proximate cause of the problem was not in the inputs, but in how they were processed by the ATC system.
For the same reason, multiple implementations of the software would only have helped if a majority of the teams understood this issue and got it right. I recall a fairly influential paper in the '90s (IIRC) in which multiple independent implementations of a requirements specification were compared, and the finding was that the errors were quite strongly correlated - i.e. there was a tendency for the teams to make the same mistakes as each other.
Not stronger isolation between different flight plans? It seems "obvious" to me that if one flight plan is triggering a bug in the handling logic, the system should be able to recover by continuing with the next flight plan and flagging the error to operators, impacting that flight only.
I'm no aviation expert, but perhaps with waypoints:
A B C D E
 /
F G H I J
If flight plan #1 is known to be going from F to B at flight level 130, and you have a (supposedly) bogus flight plan #2, the system can't quite be sure whether that one might be going from A to G at flight level 130 at the same time, causing a really bad day for both aircraft. I'd worry that dropping plan #2 into a queue for manual intervention, especially if this kind of thing only happens once every 5 years, could be disastrous if people don't realize what's happening and why. Many people might never have seen anything in that queue and may not be trained to diagnose the problem and manually translate the flight plan.
This might not be the reason why the developer chose to have the program essentially pull the fire alarm and go home in this case, but that's the impression I got.
The ATC system handled well enough (i.e. no disasters, and AFAIK, no near misses) something much more complicated than one aircraft showing up with no flight plan: the failure of this particular system put all the flights in that category.
I mentioned elsewhere that any ATC system has to be resilient enough to handle things like in-flight equipment failure, medical emergencies, and the diversion of multiple aircraft on account of bad weather or an incident which shuts down a major airport.
As for why the system "pulled the plug", the author of the article suspects that this particular error was regarded as something that would not occur unless something catastrophic had caused it, whereas, in reality, it affected only one flight and could probably have been easily worked around if the system had informed ATC which flight plan was causing the problem.
I'm not sure they're even used for that purpose - that side of things is done "live", as I understand it - the plans are there so that ATC has the details on hand for each flight and it doesn't all need to be communicated by radio as they pass through.
I wonder where most of the complexity lies in ATC. Naively, you'd think there would be some mega-computer needed to solve the puzzle, but the UK only sees 6k flights a day, and the scale of the problem, like most things in the physical world, is well bounded. That's about the same as the number of buses in London, or a tenth of the number of Uber drivers in NYC.
Much of the complexity is in interop. Passing data between ATC control positions, between different facilities, and between different countries. Then every airline has a bidirectional data feed, plus all the independent GA flights (either via flight service or via third-party apps). Plus additional systems for weather, traffic management, radar, etc. Plus everything happening on the defense side.
All using communication links and protocols that have evolved organically since the 1950s, need global consensus (with hundreds of different countries' implementations), and which need to never fail.
The system should have just rejected the FPL, notified the admins about the problem, and kept working.
The admins could have fixed whatever the software could not handle.
The affected flight could have been vectored by ATC if needed to divert from filed FPL.
Way less work and a better outcome than "system throws its hands in the air and becomes unresponsive".
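In pseudocode-ish Python, the desired failure isolation would look something like this (names are illustrative, not the real system's API):

```python
def process_flight_plans(plans, process_one, alert_admins):
    for plan in plans:
        try:
            process_one(plan)
        except Exception as exc:
            alert_admins(plan, exc)  # flag only the offending FPL for manual handling
            # ...and keep going: one bad plan must not stop the whole system
```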
If this is true, then would it be a better investment to have the second team produce a fuzz-testing/systematic-testing mechanism instead of a secondary copy of the same system?
In fact, make it adversarial testing, such that this team is rewarded (maybe financially) if mistakes or problems are found in the first team's program.
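A minimal Python sketch of the adversarial-fuzzing idea, assuming a hypothetical parse_flight_plan() from the first team (everything here is invented for illustration):

```python
import random
import string

# Random five-letter waypoint identifiers, with repeats made likely so that
# duplicate-identifier plans (the class of input behind this incident) come
# up often.
IDENTS = ["".join(random.choices(string.ascii_uppercase, k=5)) for _ in range(50)]

def random_plan(rng):
    return " ".join(rng.choices(IDENTS, k=rng.randint(2, 30)))

def fuzz(parse_flight_plan, iterations=100_000, seed=0):
    rng = random.Random(seed)
    for i in range(iterations):
        plan = random_plan(rng)
        try:
            parse_flight_plan(plan)  # must accept or cleanly reject one plan
        except ValueError:
            pass                     # clean rejection is acceptable behavior
        except Exception as exc:
            # Any other failure is a bounty-worthy find for the second team.
            print(f"iteration {i}: parser failed on {plan!r}: {exc}")
            return plan
    return None
```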
The whole point is that they're not collaborating, so as to avoid cross-contamination. Also, you don't get paid unless and until you identify the mistake. If you decrease the reward over time, there is an additional incentive not to sit on the information.
Naturally, any comparator would have some slack in it to account for variations. Even CPU internals have such slack; that's why there's a "clock" to synchronize things.