For those interested in the topic, especially points 7 and 8 (root cause and hindsight bias), I strongly recommend Dekker's "A Field Guide to 'Human Error'". It digs into those points in a very practical way. We recently had a minor catastrophe at work, and it was very useful in preparing me to guide people away from individual blame and toward systemic thinking.
Personal Experience: During and after the economic collapse of 2007-2009 I wondered who was at fault; who to blame. I kept waiting for a clear answer to "THE root cause". Since then I've read things like Dekker's work, and come to realize that blame is not a productive way of thinking in complex systems.
A quick example: in many car accidents, you can easily point to the person who caused the accident; for example the person who runs a red light, texts and drives, or drives drunk is easily found at fault. But what about a case where someone 3 or 4 car lengths ahead makes a quick lane change and an accident occurs behind them?
This is why I prefer thinking about root solutions rather than root causes. A root solution answers the question “how can we make sure something like this cannot happen again?” (for a reasonably wide definition of “this”). The nice thing is that there are usually many right answers, all of which can be implemented, whereas when looking for root causes there may not actually be one.
A good example was the AWS S3 outage that occurred when a single engineer mistyped a command[0]. While the outage wouldn't have occurred had the engineer not mistyped the command, that conclusion would still have missed the issue that the system should have some level of resiliency against simple typos - in their case, rejecting actions that would take subsystems below their minimum required capacity.
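To make that concrete, here's a minimal sketch of such a guard. The subsystem names and minimums are hypothetical, made up for illustration, not AWS's actual tooling:

```python
# Hypothetical capacity guard: refuse any removal that would drop a
# subsystem below its minimum required capacity, so a typo'd count
# fails loudly instead of taking the service down.

MIN_CAPACITY = {"index": 3, "placement": 2}  # assumed per-subsystem minimums

def remove_hosts(subsystem: str, current: int, to_remove: int) -> int:
    """Return the new host count, rejecting unsafe removals."""
    minimum = MIN_CAPACITY.get(subsystem, 1)
    remaining = current - to_remove
    if remaining < minimum:
        raise ValueError(
            f"refusing to drop {subsystem} to {remaining} hosts; "
            f"minimum required capacity is {minimum}"
        )
    return remaining

# remove_hosts("index", current=12, to_remove=2)    -> 10
# remove_hosts("index", current=12, to_remove=100)  -> ValueError
```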
It should still be possible to take systems offline, though, even if that means failure.
For example, let’s say you have a service that uses another service that raised its cost from free to $100/hour and you call it 1000 times per hour.
Even though you may not have a fallback, and your service may fail, you need to be able to disable it. In this case, suppose an admin is unavailable; the only recourse would be to lower the capacity to 0, since you do have that control.
That doesn't negate the benefit of validation, but don't be heavy-handed with validation as a knee-jerk reaction to a failure, without fully thinking it through.
Ideally a destructive command shouldn't be accidentally triggerable. At the very least it should require some positive confirmation. Alternatively, a series of actions could be required, such as changing the capacity (which, in my opinion, is the command where the double checks and positive confirmations should happen) followed by changing the service's usage.
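A minimal sketch of what that positive confirmation might look like (the action and service names are made up for illustration):

```python
# Hypothetical positive confirmation for a destructive action: the
# operator must retype the target's name, so a stray keystroke or a
# pasted command can't trigger it by accident.

import sys

def confirm_destructive(action: str, target: str) -> bool:
    """Require the operator to retype the target name before proceeding."""
    print(f"About to {action} '{target}'. This cannot be undone.")
    typed = input("Type the target name to confirm: ")
    return typed == target

if __name__ == "__main__":
    if confirm_destructive("set capacity to 0 for", "billing-service"):
        print("proceeding...")  # the real change would happen here
    else:
        print("confirmation failed; aborting.", file=sys.stderr)
        sys.exit(1)
```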
The root cause was the deregulation of the '90s and the repeal of the Glass-Steagall Act.
Complex systems are also simple systems when viewed as a black box from the outside. A lion always eats a gazelle given the chance, and a bank always explodes if not regulated to within an inch of its life.
That people like to pretend internal complexity matches external complexity is a very odd mental quirk. It is false in both directions of implication: Conway's Game of Life is as simple a game as you can get, yet it has the most complex behavior possible (it is Turing-complete).
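To illustrate just how simple that ruleset is, here's a minimal sketch of one Game of Life step, a straightforward implementation of the standard rules:

```python
# The complete ruleset of Conway's Game of Life in a few lines: a cell
# is alive next generation if it has exactly 3 live neighbors, or has
# 2 live neighbors and is currently alive.

from collections import Counter

def step(live: set) -> set:
    """Advance a set of live (x, y) cells by one generation."""
    # Count live neighbors for every cell adjacent to a live cell.
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {c for c, n in counts.items() if n == 3 or (n == 2 and c in live)}

# A "glider" emerges from five cells and travels across the grid forever:
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
for _ in range(4):
    glider = step(glider)
print(sorted(glider))  # the same shape, shifted one cell diagonally
```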
Thank you for making this point. The idea that it was an unpredictable failure mode of a complex system, rather than predictable exploitation of a system where safeguards against exploitation were deliberately removed as "inefficient," is exactly what will ensure a repeat in the future.
> But what about a case where someone 3 or 4 car lengths ahead makes a quick lane change and an accident occurs behind them?
You'd have to be going extremely slowly for 3 or 4 car lengths to be a safe following distance. On a typical 60-70mph freeway you should have a gap of at least 15-20 car lengths, and then accidents like that will happen only when other factors are at play (and if those factors are predictable like water on the road then your distance/speed should be adjusted accordingly).
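Back-of-envelope, those numbers check out, assuming an average car length of about 4.5 m (my assumption) and the common 2-3 second following rule:

```python
# Rough check of the "15-20 car lengths" figure, assuming an average
# car length of 4.5 m and the common 2-3 second following rule.

MPH_TO_MPS = 0.44704
CAR_LENGTH_M = 4.5  # assumed average

for mph in (60, 70):
    speed_mps = mph * MPH_TO_MPS
    for gap_s in (2, 3):
        gap_m = speed_mps * gap_s
        print(f"{mph} mph, {gap_s}s gap: {gap_m:.0f} m "
              f"= {gap_m / CAR_LENGTH_M:.0f} car lengths")
# 60 mph, 2s gap -> ~12 lengths; 70 mph, 3s gap -> ~21 lengths
```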
While I think that example was bad, your point that some accidents have no single point of blame is still valid.
That makes sense. To be fair I shouldn't have been so hasty anyway; with your example of a lane change, even if you can control your following distance it's hard to also keep people from other lanes cutting in too closely.
> A quick example: in many car accidents, you can easily point to the person who caused the accident
This is a bad example: traffic is a complex system with a century of ruleset evolution specifically intended to isolate personal responsibility and provide a simple interface for its users, one that, when correctly used, guarantees a collision-free ride for all participants.
The systemic failures of traffic are more related to the fallible nature of its actors. The safety guarantees hold only when humans demonstrate almost superhuman regard for the safety of others and are never inattentive, tired, in a hurry, or influenced by substances or medical conditions, etc.
We try to align personal incentives with systemic goals through hefty punishments, but there are diminishing returns on that; at some point you have to consider humans unreliable and design your system to be fault-tolerant. Indeed, most modern traffic systems are doing this today with things like impact-absorbing railings, speed bumps, wide shoulders and curves, etc.
It's a perfectly reasonable example. Most complex systems share those properties. It makes it glaringly obvious that isolating blame/personal responsibility has limited effectiveness in preventing accidents. While traffic needs a blame-assignment system that's clear to reason about for financial reasons, for other systems where that need isn't so great, more effort ends up being spent on figuring out how to prevent outages than on assigning blame for them.
> I was stationary in a traffic jam when I was rear-ended by the car in back of me. Fortunately, I was not hurt at all.
I believe yholio's statement about "when correctly used" was meant to be "when correctly used by all participants"; i.e., no single participant can guarantee their own, or anyone else's, safety, no matter how careful they are.
On the other hand, the guarantee is, as yholio notes, of an almost entirely theoretical nature even given universal cooperation:
> The safety guarantees work only when humans demonstrate almost super-human regard to the safety of others, are never inattentive, tired, in a hurry or influenced by substances or medical conditions etc.
I think you missed the "when correctly used" part. Clearly, the person who hit you was not correctly using the system, which places the whole responsibility for this situation on them. When all the rules of the road are obeyed, situations where rear-ending someone is probable should not arise.
The systemic failure here is expecting people not to zone out and pay less attention to the road when driving for hours at high speed on monotonous highways.
Well, traffic doesn’t exist in a vacuum. One could engineer a system where this is less likely to happen, by simply engineering a system where people are less likely to drive. Many European cities are systematically reducing incentives to drive at all, by improving alternative means like public transport and biking, removing parking, narrowing or eliminating car lanes, and cutting off through streets for cars.
> During and after the economic collapse of 2007-2009 I wondered who was at fault; who to blame. I kept waiting for a clear answer to "THE root cause".
Well, there can be a difference between "who is at fault" and "THE root cause". Quite a large difference, potentially.
In the case of '07-'09, maybe nobody was at fault (which seems plausible), but there was a very neat root cause. The government handed out a lot of money to people who shouldn't have gotten it, the people responsible were largely protected from bankruptcy, and the system re-formed largely as it was. The people who took excessive risk earned an excessive reward; they should have all gone bankrupt. The financial system should actually have changed, and people who made productive investments and didn't take risk should have become ascendant. Instead we have the same old crowd playing the same old game.
Disabling the major feedback mechanism of capitalism is about as root-causal as can be gotten. Nobody in particular chose to disable it though, it was a consensus decision among the powerful.
Any issue can be made impossible to solve if enough details are highlighted. Deciding what to eat for breakfast is an impossible collection of nutrient requirements, bodily requirements, long and short term tradeoffs, economic problems and social expectations. Yet somehow we mostly manage.
"Oh but it's complicated" is a completely standard line of misdirection that comes up very regularly when people are making repeated bad decisions. Even small children sometimes try it. It is very, very rarely true and particularly in totally synthetic systems like the monetary one. There is always a point of greatest leverage that could be changed and it makes sense to call that the root cause and try changing it.
Now I'm totally open to the idea that I don't know what the point of greatest leverage is in the GFC. I haven't read the regulations and I wasn't in the room when the money was being handed out. But there were billions to trillions of dollars in fake wealth that turned out never to have existed. The fact that the banking industry skated through with the same people largely in charge suggests strongly that no serious attempt was made to figure out who exactly was screwing up.
"But there were billions to trillions of dollars in fake wealth that turned out never to have existed" That is not true. They were financial assets which only have the value that people put into them. They are underpinned by confidence. Confidence goes away, it takes value with it.
"no serious attempt was made to figure out who exactly was screwing up" Exactly. Some very powerful people got very rich and it was in now bodies incentives to find anybody accountable.
What would you have done differently from what the policymakers have done? And how would you account for potentially undesirable second-order effects? (Which were quite massive for most of the obvious alternatives.)
> During and after the economic collapse of 2007-2009 I wondered who was at fault; who to blame
Sometimes it is possible to at least narrow things down to an underlying inherent instability. In the case of your example, a huge underlying cause is an economic system based on debt (backed by interest and usurious transactions). There is a reason that usury/interest is banned in Islam, Christianity, and Judaism, for example. It's a parasitic practice that makes the economy fundamentally unstable. This includes dangerous practices such as selling debt for debt (again, part of the same crisis), and things like stock shorting (which, interestingly enough, was also banned during the crisis, at least for some critical company stocks).
Massive frauds are predictable when it becomes profitable to participate. For example, right now in the UK, the government is guaranteeing certain loans through a Covid relief program. Predictably, banks are letting a high number of fraudulent loan applications through. The banks are participating in the fraud as victims, willing victims who expect to make a profit from it. It seems like the government is going to raise its eyebrows and "tut-tut" a little bit but basically let them get away with it.
Like Madoff? The subprime lending/NINJA loans? In my opinion both of those things would have been harder if everyone was doing more due diligence. Moral hazard took away some of the incentive to do that.
"everyone was doing due diligence" is not really realistic expectation. It amounts to blaming everyone for small bad thing while excusing massive big bad thing going on.
The Beanie Babies bubble had accompanying fraud and other criminal activity; most consumers did not wittingly participate, but they were affected by it.
So with the asset bubble of the 2000s, this kind of behavior affected everyone, as everyone needs a house.
In another thread I have said I don't blame the masses who may have made bad decisions in hindsight, as they did not have access to all the relevant information. So I do blame the fraudsters, but in the end I really only blame the Fed for keeping interest rates too low, and of course that can circle back around to society as well if you like.
Root cause analysis, when run properly, doesn't come out with only one culprit; the process can identify many potential sources of weakness in technology, human processes, and documentation, at least one of which (typically more) should immediately be improved.
How does that square with the need to improve systems so the same problems don't happen again? I get not wanting to put blame on a particular person, group, or cause when it's multi-factorial, but how can you improve if you don't figure out why the failure occurred?
I’m not the OP, but when I think of “systemic thinking” I think the focus is more on looking at all of the factors involved as part of a holistic model rather than placing blame at the feet of a particular individual or process. You can still identify causes and try to remediate them, but most of the time the remediation shouldn’t be something like “let’s fire Bob for making a mistake/error”, but rather, “Let’s look at all of the events that led up to Bob making that mistake and figure out how we can help him avoid it in the future through a system, process, or people change, or a combination thereof”.
That being said, if someone is negligent and consistently does negligent things they should probably be put into a position where their negligence won’t cause catastrophic system failures or loss of life. Sometimes that does mean firing someone.
A big part is acknowledging that the actions human operators take are largely a result of the environment in which they operate. Typically there are many issues with that environment that can be improved, and all human operators will benefit.
To give you a more concrete example, it moves the analysis away from "Bob deleted the production database" into a more productive space of "we really shouldn't have a process that relies on any human logging into the production database and running SQL queries by hand, that's prone to human mistake".
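As a hedged sketch of one such process change (using sqlite3 as a stand-in for the production driver; the names are illustrative): changes go through a reviewed script that declares its expected blast radius and rolls back if the statement touches anything else.

```python
# Hypothetical alternative to hand-typed production SQL: a reviewed
# script declares how many rows it expects to change, and the change
# is rolled back if the count doesn't match.

import sqlite3  # stand-in for the production database driver

def guarded_update(conn: sqlite3.Connection, sql: str, params: tuple,
                   expected_rows: int) -> None:
    """Apply an update, rolling back on an unexpected row count."""
    cur = conn.execute(sql, params)
    if cur.rowcount != expected_rows:
        conn.rollback()
        raise RuntimeError(
            f"expected to change {expected_rows} rows, "
            f"statement would have changed {cur.rowcount}; rolled back"
        )
    conn.commit()

# The reviewer approves both the statement and its expected blast
# radius, so a typo'd WHERE clause fails instead of emptying a table:
# guarded_update(conn, "UPDATE users SET plan=? WHERE id=?", ("pro", 42), 1)
```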
That's one of the central questions of the book. But my take is that there are a bunch of ways to answer a "why" question, some more useful than others.
One very common mode is to take a complex causal web, trace until you find a person of low status, and then yell at and/or punish said scapegoat. That desire to blame is a very human approach, but it a) isn't very effective in preventing the next problem, and b) prevents real systemic understanding by providing a false feeling of resolution.
So if we really want to figure out why the failure occurred and reduce the odds of it happening again, we need to give up blame and look at how systems create incentives and behaviors in the people caught up in them. Only if everybody feels a sense of personal safety do we have much chance of getting at what happened and discussing it calmly enough that we can come to real understandings and real solutions.
Thanks for the clarification. This sounds like an interesting book.
The phrasing used on the web site is "Post-accident attribution to a ‘root cause’ is fundamentally wrong." At first glance, it sounds like the author means no cause can be found, so you shouldn't try to determine one. The author then clarifies that there are many causes, not just one. However, this phrasing made me scratch my head:
> The evaluations based on such reasoning as ‘root cause’ do not reflect a technical understanding of the nature of failure but rather the social, cultural need to blame specific, localized forces or events for outcomes.
I don't know what other organizations are like, but where I work, when we do a "root cause analysis," we aren't literally looking for a single cause, despite the name. The "root cause" is almost always that pieces a, b, and c came together in an unexpected way. I can definitely think of places where I worked where they were mostly out to place blame, though, and I guess that's what they were trying to caution against.
I think blame is one way it can go bad, but not the only one. The whole framing of a "root cause" is dangerous, in that it encourages people to look for exactly one thing, and then not look beyond it when they find it. It sounds like your organization does decently in that regard, but they're doing it in spite of the "root cause" frame.
There's a definite difference between blame and cause, and they don't conflict. Blame is for individuals, cause is for systems. While you do need to hold individuals accountable, most of the time you should focus on fixing the system, which is a much more durable fix.
Part of the philosophy is to change the mindset from a “person” perspective to a “process” perspective. I.e., what gaps in the process led to the mishap, not what person caused the mishap.
Organizations that are people dependent rather than process dependent tend to have higher risks of failures.