I assume most organizations, both small and large, just host on whatever provider they know or that costs them the least. If you have the budget, maybe you deploy to multiple providers for redundancy? But that increases cost and complexity.
Who’s going to bother with colo given the cost / complexity? Who’s going to run a server from their office given ISP restrictions and downtime fears?
Companies could architect their backends to fail over to another region in case of an outage, but they either don't test it or don't bother to have it in place, because they can just blame Amazon, and they don't otherwise have an SLA for their service.
To fix it, test your failover procedures. For everything else there's nothing to fix; it's working as designed.
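For the "fail over to another region" piece, here's a minimal sketch of one common approach, DNS failover with Route 53 health checks. The zone ID, health check ID, and hostnames are all made up:

    import boto3

    route53 = boto3.client("route53")

    # Hypothetical IDs: your hosted zone, plus a health check that probes
    # the primary region's endpoint.
    HOSTED_ZONE_ID = "Z0000000EXAMPLE"
    PRIMARY_HEALTH_CHECK_ID = "11111111-2222-3333-4444-555555555555"

    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [
            {   # Primary answer, returned while the health check passes.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "primary-eu-west-1",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "HealthCheckId": PRIMARY_HEALTH_CHECK_ID,
                    "ResourceRecords": [{"Value": "lb.eu-west-1.example.com"}],
                },
            },
            {   # Secondary answer, returned once the primary is unhealthy.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "secondary-eu-central-1",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "lb.eu-central-1.example.com"}],
                },
            },
        ]},
    )

And the testing point stands: a record set like this is worthless unless you periodically force the primary to fail and watch traffic actually move.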
> Companies could architect their backends to fail over to another region in case of an outage, but they either don't test it or don't bother to have it in place, because they can just blame Amazon, and they don't otherwise have an SLA for their service.
My CI was down for 2 hours this morning, despite not even running on AWS. That host holds a set of credentials we use to call AssumeRole and push to an S3 bucket, which has a Lambda that replicates to buckets in other regions. All of our IAM calls were failing due to this outage, and we have zero resources deployed in us-east-1 (we're European).
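If I had to guess (and this is a guess about your setup), those AssumeRole calls were going through the global STS endpoint, sts.amazonaws.com, which is served out of us-east-1. A sketch of pinning STS to a regional endpoint with boto3; the role ARN and region here are hypothetical:

    import boto3

    # The global endpoint https://sts.amazonaws.com resolves to us-east-1;
    # pinning a regional endpoint keeps the call inside your own region.
    sts = boto3.client(
        "sts",
        region_name="eu-west-1",
        endpoint_url="https://sts.eu-west-1.amazonaws.com",
    )

    creds = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/ci-artifact-uploader",  # hypothetical
        RoleSessionName="ci-build",
    )["Credentials"]

    # Use the temporary credentials against a bucket in your own region.
    s3 = boto3.client(
        "s3",
        region_name="eu-west-1",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

Recent SDKs also honor AWS_STS_REGIONAL_ENDPOINTS=regional (or sts_regional_endpoints in the shared config file), which achieves the same thing without hardcoding the URL.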
One thing AWS should do is provide an easier way to detect these hidden dependencies. You can do it with CloudTrail if you know how (filter operations by region and check that none land in us-east-1), but a more explicit service would be nice.
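A rough sketch of that CloudTrail approach, for anyone who wants it: point the client at us-east-1 and tally recent management events by event source, so anything it prints is an API call that actually landed in that region. This assumes CloudTrail event history is available there, which it is by default for the last 90 days:

    import boto3
    from collections import Counter

    # LookupEvents returns the events recorded in the client's region, so
    # a client pinned to us-east-1 surfaces exactly the hidden dependencies.
    cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")

    sources = Counter()
    paginator = cloudtrail.get_paginator("lookup_events")
    for page in paginator.paginate(MaxResults=50):
        for event in page["Events"]:
            sources[event.get("EventSource", "unknown")] += 1

    for source, count in sources.most_common():
        print(f"{count:6d}  {source}")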
The problem was that we couldn't log into CloudTrail, or the console at all, to identify that, because IAM Identity Center is single-region. That was a decision recommended by AWS and blessed by our (army of) SRE teams.
But you can run TWO Identity Centers in different regions for the price of one(1)! IAM IDC is just a regular application hosted on AWS infrastructure; there's really nothing special about it.
Hindsight is 20/20, of course. It's good practice to audit CloudTrail periodically for unexpected regional dependencies.
Indeed. I also noticed this morning that you're not the person I replied to, and I took your response (which was actually helpful) in the context of the original post, which was "people are happy to just blame AWS when they're down".
Either way, we would have only made it one step farther in our CI, as the next step builds a container with a base image from Docker Hub, and that was down too. The idea of running a multi-region Nexus repository to avoid Docker Hub outages for my 14-person engineering team seems slightly overkill!
The easiest way to add some resilience to the build process is a pull-through cache using AWS ECR. It might backfire due to egress costs, though, if you're building outside AWS infrastructure.
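Setting that up is one API call. A sketch assuming Docker Hub as the upstream; the region, account ID, and secret ARN are made up, and Docker Hub upstreams need credentials stored in a Secrets Manager secret whose name starts with ecr-pullthroughcache/:

    import boto3

    ecr = boto3.client("ecr", region_name="eu-west-1")

    # Images pulled as <acct>.dkr.ecr.eu-west-1.amazonaws.com/docker-hub/<image>
    # are fetched from Docker Hub once, then served from ECR on later pulls,
    # so a Docker Hub outage only affects images you haven't cached yet.
    ecr.create_pull_through_cache_rule(
        ecrRepositoryPrefix="docker-hub",
        upstreamRegistryUrl="registry-1.docker.io",
        credentialArn=(  # hypothetical Secrets Manager secret with Hub creds
            "arn:aws:secretsmanager:eu-west-1:123456789012:"
            "secret:ecr-pullthroughcache/docker-hub-AbCdEf"
        ),
    )

Builds then pull URIs like <acct>.dkr.ecr.eu-west-1.amazonaws.com/docker-hub/library/python:3.12; the first pull hits Docker Hub, subsequent pulls are served from ECR.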
It's actually an interesting exercise to enumerate _all_ the external dependencies. But yeah, avoiding them all seems to be less than helpful for the vast majority of users.
On the flip side, you then have to maintain instances of everything yourself.
For most of what I run these days, I'd rather just have someone else run and administer my database. Same with my load balancers. And my Kubernetes cluster. I don't really care if there is an outage every 2 years.
> I assume most organizations, both small and large, just host on whatever provider they know or that costs them the least
I don't think there is a "most" here. Either organizations are looking for big cloud or they're not, and lowest cost is usually the last consideration when looking at any cloud, because you're deliberately paying a premium to get particular advantages.
The realistic antidote? Move to a less-shitty region. Or architect your systems to be failure-resilient.
(Most people seem to think the entire region was offline? That's wrong. It was just particular services that wouldn't process control-plane requests, and then a failure cascade caused more problems. But things that were already running stayed running. A region is multiple datacenters; even AZs are often multiple datacenters. It's virtually impossible for a whole region to stop working.)
If the resilience is worth the cost and complexity, then you just do it; otherwise you don't. How much did a company lose today compared to how much it costs to set this up?
And colos and datacenters aren't immune to going down either.
What is the realistic antidote here?