For those interested in the topic, especially points 7 and 8 (root cause and hindsight bias), I strongly recommend Dekker's "A Field Guide to 'Human Error'". It digs into those points in a very practical way. We recently had a minor catastrophe at work, and it was very useful in preparing me to guide people away from individual blame and toward systemic thinking.
Personal Experience: During and after the economic collapse of 2007-2009 I wondered who was at fault; who to blame. I kept waiting for a clear answer to "THE root cause". Since then I've read things like Dekker's work, and come to realize that blame is not a productive way of thinking in complex systems.
A quick example: in many car accidents, you can easily point to the person who caused the accident; for example the person who runs a red light, texts and drives, or drives drunk is easily found at fault. But what about a case where someone 3 or 4 car lengths ahead makes a quick lane change and an accident occurs behind them?
This is why I prefer thinking about root solutions rather than root causes. A root solution answers the question “how can we make sure something like this cannot happen again?” (for a reasonably wide definition of “this”). The nice thing is that there are usually many right answers, all of which can be implemented, whereas when looking for root causes there may not actually be one.
A good example was the AWS S3 outage that occurred when a single engineer mistyped a command[0]. While the outage wouldn't have occurred had the engineer not mistyped the command, that conclusion would still have missed the issue that the system should have some level of resiliency against simple typos - in their case, rejecting actions that would take subsystems below their minimum required capacity.
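To make that concrete, here's a minimal sketch of such a guard. The subsystem names and minimums are hypothetical, made up for illustration, not AWS's actual tooling:

```python
# Hypothetical capacity guard: refuse any removal that would drop a
# subsystem below its minimum required capacity, so a typo'd count
# fails loudly instead of taking the service down.

MIN_CAPACITY = {"index": 3, "placement": 2}  # assumed per-subsystem minimums

def remove_hosts(subsystem: str, current: int, to_remove: int) -> int:
    """Return the new host count, rejecting unsafe removals."""
    minimum = MIN_CAPACITY.get(subsystem, 1)
    remaining = current - to_remove
    if remaining < minimum:
        raise ValueError(
            f"refusing to drop {subsystem} to {remaining} hosts; "
            f"minimum required capacity is {minimum}"
        )
    return remaining

# remove_hosts("index", current=12, to_remove=2)    -> 10
# remove_hosts("index", current=12, to_remove=100)  -> ValueError
```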
It should still be possible to take systems offline, though, even if that means failure.
For example, let’s say you have a service that uses another service that raised its cost from free to $100/hour and you call it 1000 times per hour.
Even though you may not have a fallback, and your service may fail, you need to be able to disable it. In this case, suppose an admin is unavailable; the only recourse would be to lower the capacity to 0, since you do have that control.
That doesn't negate the benefit of validation, but don't be heavy-handed with validation as a knee-jerk reaction to a failure, without fully thinking it through.
Ideally a destructive command shouldn't be accidentally triggerable. At the very least it should require some positive confirmation. Alternatively, a series of actions could be required, such as changing the capacity (which, in my opinion, is the command where the double checks and positive confirmations should happen) followed by changing the service's usage.
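A minimal sketch of what that positive confirmation might look like (the action and service names are made up for illustration):

```python
# Hypothetical positive confirmation for a destructive action: the
# operator must retype the target's name, so a stray keystroke or a
# pasted command can't trigger it by accident.

import sys

def confirm_destructive(action: str, target: str) -> bool:
    """Require the operator to retype the target name before proceeding."""
    print(f"About to {action} '{target}'. This cannot be undone.")
    typed = input("Type the target name to confirm: ")
    return typed == target

if __name__ == "__main__":
    if confirm_destructive("set capacity to 0 for", "billing-service"):
        print("proceeding...")  # the real change would happen here
    else:
        print("confirmation failed; aborting.", file=sys.stderr)
        sys.exit(1)
```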
The root cause was the deregulation of the '90s and the repeal of the Glass-Steagall Act.
Complex systems are also simple systems when viewed as a black box from the outside. A lion always eats a gazelle given the chance, and a bank always explodes if not regulated to within an inch of its life.
That people like to pretend internal complexity matches external complexity is a very odd mental quirk. It is false in both directions of implication: Conway's Game of Life is as simple a game as you can get, yet it has the most complex behavior possible (it is Turing-complete).
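To illustrate just how simple that ruleset is, here's a minimal sketch of one Game of Life step, a straightforward implementation of the standard rules:

```python
# The complete ruleset of Conway's Game of Life in a few lines: a cell
# is alive next generation if it has exactly 3 live neighbors, or has
# 2 live neighbors and is currently alive.

from collections import Counter

def step(live: set) -> set:
    """Advance a set of live (x, y) cells by one generation."""
    # Count live neighbors for every cell adjacent to a live cell.
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {c for c, n in counts.items() if n == 3 or (n == 2 and c in live)}

# A "glider" emerges from five cells and travels across the grid forever:
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
for _ in range(4):
    glider = step(glider)
print(sorted(glider))  # the same shape, shifted one cell diagonally
```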
Thank you for making this point. The idea that it was an unpredictable failure mode of a complex system, rather than predictable exploitation of a system where safeguards against exploitation were deliberately removed as "inefficient," is exactly what will ensure a repeat in the future.
> But what about a case where someone 3 or 4 car lengths ahead makes a quick lane change and an accident occurs behind them?
You'd have to be going extremely slowly for 3 or 4 car lengths to be a safe following distance. On a typical 60-70mph freeway you should have a gap of at least 15-20 car lengths, and then accidents like that will happen only when other factors are at play (and if those factors are predictable like water on the road then your distance/speed should be adjusted accordingly).
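Back-of-envelope, those numbers check out, assuming an average car length of about 4.5 m (my assumption) and the common 2-3 second following rule:

```python
# Rough check of the "15-20 car lengths" figure, assuming an average
# car length of 4.5 m and the common 2-3 second following rule.

MPH_TO_MPS = 0.44704
CAR_LENGTH_M = 4.5  # assumed average

for mph in (60, 70):
    speed_mps = mph * MPH_TO_MPS
    for gap_s in (2, 3):
        gap_m = speed_mps * gap_s
        print(f"{mph} mph, {gap_s}s gap: {gap_m:.0f} m "
              f"= {gap_m / CAR_LENGTH_M:.0f} car lengths")
# 60 mph, 2s gap -> ~12 lengths; 70 mph, 3s gap -> ~21 lengths
```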
While I think that example was bad, your point that some accidents have no single point of blame is still valid.
That makes sense. To be fair I shouldn't have been so hasty anyway; with your example of a lane change, even if you can control your following distance it's hard to also keep people from other lanes cutting in too closely.
> A quick example: in many car accidents, you can easily point to the person who caused the accident
This is a bad example: traffic is a complex system with a century of ruleset evolution specifically intended to isolate personal responsibility and provide a simple interface for its users, one that, when correctly used, guarantees a collision-free ride for all participants.
The systemic failures of traffic are more related to the fallible nature of its actors. The safety guarantees hold only when humans demonstrate almost superhuman regard for the safety of others and are never inattentive, tired, in a hurry, or influenced by substances or medical conditions, etc.
We try to align personal incentives with systemic goals through hefty punishments, but there are diminishing returns on that; at some point you have to consider humans unreliable and design your system to be fault-tolerant. Indeed, most modern traffic systems are doing this today with things like impact-absorbing railings, speed bumps, wide shoulders and curves, etc.
It's a perfectly reasonable example. Most complex systems share those properties. It makes it glaringly obvious that isolating blame/personal responsibility has limited effectiveness in preventing accidents. While traffic needs a blame-assignment system that's clear to reason about for financial reasons, for other systems where that need isn't so great, more effort ends up being spent on figuring out how to prevent outages than on assigning blame for them.
> I was stationary in a traffic jam when I was rear-ended by the car in back of me. Fortunately, I was not hurt at all.
I believe yholio's statement about "when correctly used" was meant to be "when correctly used by all participants"; i.e., no single participant can guarantee their own, or anyone else's, safety, no matter how careful they are.
On the other hand, the guarantee is, as yholio notes, of an almost entirely theoretical nature even given universal cooperation:
> The safety guarantees work only when humans demonstrate almost super-human regard to the safety of others, are never inattentive, tired, in a hurry or influenced by substances or medical conditions etc.
I think you missed the "when correctly used" part. Clearly, the person who hit you was not correctly using the system, which places the whole responsibility for this situation on them. When all the rules of the road are obeyed, situations where rear-ending someone is probable should not arise.
The systemic failure here is expecting people not to zone out and pay less attention to the road when driving for hours at high speed on monotonous highways.
Well, traffic doesn’t exist in a vacuum. One could engineer a system where this is less likely to happen, by simply engineering a system where people are less likely to drive. Many European cities are systematically reducing incentives to drive at all, by improving alternative means like public transport and biking, removing parking, narrowing or eliminating car lanes, and cutting off through streets for cars.
> During and after the economic collapse of 2007-2009 I wondered who was at fault; who to blame. I kept waiting for a clear answer to "THE root cause".
Well, there can be a difference between "who is at fault" and "THE root cause". Quite a large difference, potentially.
In the case of '07-'09, maybe nobody was at fault (which seems plausible), but there was a very neat root cause. The government handed out a lot of money to people who shouldn't have gotten it, the people responsible were largely protected from bankruptcy, and the system re-formed largely as it was. The people who took excessive risk earned an excessive reward; they should have all gone bankrupt. The financial system should actually have changed, and people who made productive investments and didn't take risk should have become ascendant. Instead we have the same old crowd playing the same old game.
Disabling the major feedback mechanism of capitalism is about as root-causal as can be gotten. Nobody in particular chose to disable it though, it was a consensus decision among the powerful.
Any issue can be made impossible to solve if enough details are highlighted. Deciding what to eat for breakfast is an impossible collection of nutrient requirements, bodily requirements, long and short term tradeoffs, economic problems and social expectations. Yet somehow we mostly manage.
"Oh but it's complicated" is a completely standard line of misdirection that comes up very regularly when people are making repeated bad decisions. Even small children sometimes try it. It is very, very rarely true and particularly in totally synthetic systems like the monetary one. There is always a point of greatest leverage that could be changed and it makes sense to call that the root cause and try changing it.
Now I'm totally open to the idea that I don't know what the point of greatest leverage is in the GFC. I haven't read the regulations and I wasn't in the room when the money was being handed out. But there were billions to trillions of dollars in fake wealth that turned out never to have existed. The fact that the banking industry skated through with the same people largely in charge suggests strongly that no serious attempt was made to figure out who exactly was screwing up.
"But there were billions to trillions of dollars in fake wealth that turned out never to have existed" That is not true. They were financial assets which only have the value that people put into them. They are underpinned by confidence. Confidence goes away, it takes value with it.
"no serious attempt was made to figure out who exactly was screwing up" Exactly. Some very powerful people got very rich and it was in now bodies incentives to find anybody accountable.
What would you have done differently from what the policymakers have done? And how would you account for potentially undesirable second-order effects? (Which were quite massive for most of the obvious alternatives.)
> During and after the economic collapse of 2007-2009 I wondered who was at fault; who to blame
Sometimes it is possible to at least narrow things down to an underlying inherent instability. In the case of your example, a huge underlying cause is an economic system based on debt (backed by interest and usurious transactions). There is a reason that usury/interest is banned in Islam, Christianity, and Judaism, for example. It's a parasitic practice that makes the economy fundamentally unstable. This includes dangerous practices such as selling debt for debt (again, part of the same crisis), and things like stock shorting (which, interestingly enough, was also banned during the crisis, at least for some critical company stocks).
Massive frauds are predictable when it becomes profitable to participate. For example, right now in the UK, the government is guaranteeing certain loans through a Covid relief program. Predictably, banks are letting a high number of fraudulent loan applications through. The banks are participating in the fraud as victims, willing victims who expect to make a profit from it. It seems like the government is going to raise its eyebrows and "tut-tut" a little bit but basically let them get away with it.
Like Madoff? The subprime lending/NINJA loans? In my opinion both of those things would have been harder if everyone was doing more due diligence. Moral hazard took away some of the incentive to do that.
"everyone was doing due diligence" is not really realistic expectation. It amounts to blaming everyone for small bad thing while excusing massive big bad thing going on.
The Beanie Babies bubble had accompanying fraud and other criminal activity; most consumers did not wittingly participate, but they were affected by it.
So with the asset bubble of the 2000s, this kind of behavior affected everyone, as everyone needs a house.
In another thread I have said I don't blame the masses who may have made bad decisions in hindsight, as they did not have access to all the relevant information. So I do blame the fraudsters, but in the end I really only blame the Fed for keeping interest rates too low, and of course that can circle back around to society as well if you like.
Root cause analysis, when run properly, doesn't come out with only one culprit; the process can identify many potential sources of weakness in technology, human processes, and documentation, at least one of which (typically more) should immediately be improved.
How does that square with the need to improve systems so the same problems don't happen again? I get not wanting to put blame on a particular person, group, or cause when it's multi-factorial, but how can you improve if you don't figure out why the failure occurred?
I’m not the OP, but when I think of “systemic thinking” I think the focus is more on looking at all of the factors involved as part of a holistic model rather than placing blame at the feet of a particular individual or process. You can still identify causes and try to remediate them, but most of the time the remediation shouldn’t be something like “let’s fire Bob for making a mistake/error”, but rather, “Let’s look at all of the events that led up to Bob making that mistake and figure out how we can help him avoid it in the future through a system, process, or people change, or a combination thereof”.
That being said, if someone is negligent and consistently does negligent things they should probably be put into a position where their negligence won’t cause catastrophic system failures or loss of life. Sometimes that does mean firing someone.
A big part is acknowledging that the actions human operators take are largely a result of the environment in which they operate. Typically there are many issues with that environment that can be improved, and all human operators will benefit.
To give you a more concrete example, it moves the analysis away from "Bob deleted the production database" into a more productive space of "we really shouldn't have a process that relies on any human logging into the production database and running SQL queries by hand, that's prone to human mistake".
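As a hedged sketch of one such process change (using sqlite3 as a stand-in for the production driver; the names are illustrative): changes go through a reviewed script that declares its expected blast radius and rolls back if the statement touches anything else.

```python
# Hypothetical alternative to hand-typed production SQL: a reviewed
# script declares how many rows it expects to change, and the change
# is rolled back if the count doesn't match.

import sqlite3  # stand-in for the production database driver

def guarded_update(conn: sqlite3.Connection, sql: str, params: tuple,
                   expected_rows: int) -> None:
    """Apply an update, rolling back on an unexpected row count."""
    cur = conn.execute(sql, params)
    if cur.rowcount != expected_rows:
        conn.rollback()
        raise RuntimeError(
            f"expected to change {expected_rows} rows, "
            f"statement would have changed {cur.rowcount}; rolled back"
        )
    conn.commit()

# The reviewer approves both the statement and its expected blast
# radius, so a typo'd WHERE clause fails instead of emptying a table:
# guarded_update(conn, "UPDATE users SET plan=? WHERE id=?", ("pro", 42), 1)
```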
That's one of the central questions of the book. But my take is that there are a bunch of ways to answer a "why" question, some more useful than others.
One very common mode is to take a complex causal web, trace until you find a person of low status, and then yell at and/or punish said scapegoat. That desire to blame is a very human approach, but it a) isn't very effective in preventing the next problem, and b) prevents real systemic understanding by providing a false feeling of resolution.
So if we really want to figure out why the failure occurred and reduce the odds of it happening again, we need to give up blame and look at how systems create incentives and behaviors in the people caught up in them. Only if everybody feels a sense of personal safety do we have much chance of getting at what happened and discussing it calmly enough that we can come to real understandings and real solutions.
Thanks for the clarification. This sounds like an interesting book.
The phrasing used on the web site is "Post-accident attribution to a ‘root cause’ is fundamentally wrong." At first glance, it sounds like the author means no cause can be found, so you shouldn't try to determine one. The author then clarifies that there are many causes, not just one. However, this phrasing made me scratch my head:
> The evaluations based on such reasoning as ‘root cause’ do not reflect a technical understanding of the nature of failure but rather the social, cultural need to blame specific, localized forces or events for outcomes.
I don't know what other organizations are like, but where I work, when we do a "root cause analysis," we aren't literally looking for a single cause, despite the name. The "root cause" is almost always that pieces a, b, and c came together in an unexpected way. I can definitely think of places where I worked where they were mostly out to place blame, though, and I guess that's what they were trying to caution against.
I think blame is one way it can go bad, but not the only one. The whole framing of a "root cause" is dangerous, in that it encourages people to look for exactly one thing, and then not look beyond it when they find it. It sounds like your organization does decently in that regard, but they're doing it in spite of the "root cause" frame.
There's a definite difference between blame and cause, and they don't conflict. Blame is for individuals, cause is for systems. While you do need to hold individuals accountable, most of the time you should focus on fixing the system, which is a much more durable fix.
Part of the philosophy is to change the mindset from a “person” perspective to a “process” perspective. I.e., what gaps in the process led to the mishap, not what person caused the mishap.
Organizations that are people dependent rather than process dependent tend to have higher risks of failures.