> an EXACT SAME system took over and ran the exact same code
Have you ever worked with HA systems? Because this is how they work: two copies of the same system, intended for cases where e.g. the hardware fails or the network partitions.
No, I haven't. But HA systems work like that because hardware or network failure is what they are designed to guard against, not a latent bug in the software logic. If there's a software bug, both systems will exhibit the same faulty behavior, so HA fails there.
In practice, there are two kinds of HA systems (by this criterion):
* Live + standby. The state of the live system is passively replicated to the standby, which takes over if it stops hearing from the live one, or if the live one starts sending nonsense. (For example, you can run the Kubernetes API server in this mode.)
* Consensus systems, where every actor plays the same role, with an elected leader handling synchronization of the system state. (For example, etcd.)
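The live + standby case can be sketched in a few lines. This is a minimal, hypothetical illustration, not a real failover implementation: the class and the timeout value are made up for this example, and a real system would track heartbeats through replicated state (e.g. a lease in etcd) rather than an in-process timestamp.

```python
import time

# How long the standby tolerates silence before promoting itself.
# The value is illustrative; real systems tune this carefully.
HEARTBEAT_TIMEOUT = 3.0

class Standby:
    """Standby node running the exact same program as the live node,
    differing only in state: it stays passive until heartbeats stop."""

    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.active = False

    def on_heartbeat(self):
        # Called whenever a heartbeat arrives from the live node.
        self.last_heartbeat = time.monotonic()

    def check(self, now=None):
        # Promote ourselves if the live node has gone silent.
        # `now` can be injected for testing; defaults to the real clock.
        now = time.monotonic() if now is None else now
        if not self.active and now - self.last_heartbeat > HEARTBEAT_TIMEOUT:
            self.active = True  # take over: same code, different state
        return self.active
```

Note that nothing in this sketch guards against a logic bug: if the live node crashed because of one, the standby promoting itself will run the same code into the same bug, which is exactly the point being argued above.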
In either case, it's the same program, but with a somewhat different state.
It doesn't make sense to write two different programs for this problem: you'd get double the number of bugs for no practical gain. Two different programs are far more likely to fail to communicate with each other than one program talking to its own replica. And if you believe you got it right the first time, why would you make the other one different? You'd rather run two copies of the better of the two than pair a better implementation with a worse one.
How can you tell whether the problem is due to a software bug or a hardware fault, though? The software could have thrown its "catastrophic failure, stop the world" exception because of memory corruption.