
This is why I prefer thinking about root solutions rather than root causes: the answer to the question “how can we make sure something like this cannot happen again?” (for a reasonably wide definition of “this”). The nice thing is that there are usually many right answers, all of which can be implemented, whereas when looking for root causes there may not actually be one.



A good example was the AWS S3 outage that occurred when a single engineer mistyped a command[0]. While the outage wouldn't have occurred had an engineer not mistyped the command, that conclusion would still miss the point that the system should have some level of resilience against simple typos - in their case, checking that an action wouldn't take a subsystem below its minimum required capacity (roughly sketched below).

[0] https://aws.amazon.com/message/41926/
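
To make that concrete, here's a rough sketch of that kind of guard (Python, with made-up subsystem names and minimums; the real S3 tooling obviously isn't public):

    # Hypothetical guard: refuse capacity changes that would drop a
    # subsystem below its minimum required capacity.
    MIN_REQUIRED = {"index": 8, "placement": 4}  # made-up minimums

    def remove_capacity(subsystem: str, current: int, to_remove: int) -> int:
        remaining = current - to_remove
        if remaining < MIN_REQUIRED[subsystem]:
            raise ValueError(
                f"refusing: {subsystem} would drop to {remaining} servers "
                f"(minimum required is {MIN_REQUIRED[subsystem]})"
            )
        return remaining

    # A typo like "remove 100" instead of "remove 10" now fails fast:
    #   remove_capacity("index", current=110, to_remove=100)  -> ValueError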


It should still be possible to take a system offline, though, even if that means failure.

For example, let’s say you have a service that depends on another service that raised its price from free to $100/hour, and you call it 1000 times per hour.

Even though you may not have a fallback, and your service may fail as a result, you need to be able to disable it. In this case, if an admin switch is unavailable, the only recourse would be to lower the capacity to 0, since you do have that control.

That doesn’t negate the benefit of validation, but don’t be heavy-handed with it as a knee-jerk reaction to a failure, without fully thinking it through.
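
One way to get both (purely a sketch, with invented names): keep the minimum-capacity check for ordinary changes, but make scale-to-zero an explicit, deliberate path, so the kill switch still exists.

    def set_capacity(subsystem: str, target: int, minimum: int,
                     force: bool = False) -> int:
        # A deliberate shutdown is still possible, but only with an
        # explicit flag, never as a side effect of a mistyped number.
        if target == 0 and force:
            return 0
        # Ordinary changes may not go below the minimum required capacity.
        if target < minimum:
            raise ValueError(
                f"{subsystem}: target {target} is below minimum {minimum}; "
                "pass force=True to intentionally take it offline"
            )
        return target

The exact flag doesn't matter; the point is that turning something off on purpose should take a different, more explicit path than a routine capacity change.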


Ideally, a destructive command shouldn't be accidentally triggerable. At the very least it should require some positive confirmation. Alternatively, a series of actions could be required, such as changing the capacity (which, in my opinion, is the command where the double checks and positive confirmations should happen) followed by changing the service's usage.
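
For example, a sketch of what that positive confirmation could look like (not any particular tool's interface, just the shape of it):

    def confirm_destructive(subsystem: str, target: int) -> bool:
        # Positive confirmation: the operator has to retype the subsystem
        # name before the destructive change is carried out.
        print(f"This will set the capacity of '{subsystem}' to {target}.")
        answer = input("Type the subsystem name to confirm: ")
        return answer.strip() == subsystem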



