
This is why I prefer thinking about root solutions rather than root causes: the answer to the question “how can we make sure something like this cannot happen again?” (for a reasonably wide definition of “this”). The nice thing is that there are usually many right answers, all of which can be implemented, whereas when looking for root causes there may not actually be one.



A good example was the AWS S3 outage that occurred when a single engineer mistyped a command[0]. While the outage wouldn't have occurred had an engineer not mistyped the command, that conclusion would still miss the point that the system should have some level of resilience against simple typos - in their case, checking that an action wouldn't take a subsystem below its minimum required capacity (roughly sketched below).

[0] https://aws.amazon.com/message/41926/
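
To make that concrete, here's a rough sketch of that kind of guard (Python, with made-up subsystem names and minimums; the real S3 tooling obviously isn't public):

    # Hypothetical guard: refuse capacity changes that would drop a
    # subsystem below its minimum required capacity.
    MIN_REQUIRED = {"index": 8, "placement": 4}  # made-up minimums

    def remove_capacity(subsystem: str, current: int, to_remove: int) -> int:
        remaining = current - to_remove
        if remaining < MIN_REQUIRED[subsystem]:
            raise ValueError(
                f"refusing: {subsystem} would drop to {remaining} servers "
                f"(minimum required is {MIN_REQUIRED[subsystem]})"
            )
        return remaining

    # A typo like "remove 100" instead of "remove 10" now fails fast:
    #   remove_capacity("index", current=110, to_remove=100)  -> ValueError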


It should still be possible to take a system offline, though, even if that means failure.

For example, let’s say you have a service that depends on another service that raised its price from free to $100/hour, and you call it 1000 times per hour.

Even though you may not have a fallback, and your service may fail as a result, you need to be able to disable it. In this case, if an admin switch is unavailable, the only recourse would be to lower the capacity to 0, since you do have that control.

That doesn’t negate the benefit of validation, but don’t be heavy-handed with it as a knee-jerk reaction to a failure, without fully thinking it through.
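
One way to get both (purely a sketch, with invented names): keep the minimum-capacity check for ordinary changes, but make scale-to-zero an explicit, deliberate path, so the kill switch still exists.

    def set_capacity(subsystem: str, target: int, minimum: int,
                     force: bool = False) -> int:
        # A deliberate shutdown is still possible, but only with an
        # explicit flag, never as a side effect of a mistyped number.
        if target == 0 and force:
            return 0
        # Ordinary changes may not go below the minimum required capacity.
        if target < minimum:
            raise ValueError(
                f"{subsystem}: target {target} is below minimum {minimum}; "
                "pass force=True to intentionally take it offline"
            )
        return target

The exact flag doesn't matter; the point is that turning something off on purpose should take a different, more explicit path than a routine capacity change.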


Ideally, a destructive command shouldn't be accidentally triggerable. At the very least it should require some positive confirmation. Alternatively, a series of actions could be required, such as changing the capacity (which, in my opinion, is the command where the double checks and positive confirmations should happen) followed by changing the service's usage.
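
For example, a sketch of what that positive confirmation could look like (not any particular tool's interface, just the shape of it):

    def confirm_destructive(subsystem: str, target: int) -> bool:
        # Positive confirmation: the operator has to retype the subsystem
        # name before the destructive change is carried out.
        print(f"This will set the capacity of '{subsystem}' to {target}.")
        answer = input("Type the subsystem name to confirm: ")
        return answer.strip() == subsystem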



