I assume most organizations, both small and large, just host on whatever provider they know or that costs them the least. If you have the budget, maybe you deploy to multiple providers for redundancy? But that increases cost and complexity.
Who’s going to bother with colo given the cost / complexity? Who’s going to run a server from their office given ISP restrictions and downtime fears?
Companies could architect their backends to fail over to another region in case of an outage, but they either don't test it or don't bother to have it in place, because they can just blame Amazon, and they don't otherwise have an SLA for their service.
To fix it, test your failover procedures. For everything else there's nothing to fix; it's working as designed.
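For the "fail over to another region" piece, here's a minimal sketch of one common approach, DNS failover with Route 53 health checks. The zone ID, health check ID, and hostnames are all made up:

    import boto3

    route53 = boto3.client("route53")

    # Hypothetical IDs: your hosted zone, plus a health check that probes
    # the primary region's endpoint.
    HOSTED_ZONE_ID = "Z0000000EXAMPLE"
    PRIMARY_HEALTH_CHECK_ID = "11111111-2222-3333-4444-555555555555"

    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [
            {   # Primary answer, returned while the health check passes.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "primary-eu-west-1",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "HealthCheckId": PRIMARY_HEALTH_CHECK_ID,
                    "ResourceRecords": [{"Value": "lb.eu-west-1.example.com"}],
                },
            },
            {   # Secondary answer, returned once the primary is unhealthy.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "secondary-eu-central-1",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "lb.eu-central-1.example.com"}],
                },
            },
        ]},
    )

And the testing point stands: a record set like this is worthless unless you periodically force the primary to fail and watch traffic actually move.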
> Companies could architect their backends to fail over to another region in case of an outage, but they either don't test it or don't bother to have it in place, because they can just blame Amazon, and they don't otherwise have an SLA for their service.
My CI was down for 2 hours this morning, despite not even running on AWS. That host holds a set of credentials we use to call AssumeRole and push to an S3 bucket, which has a Lambda that replicates to buckets in other regions. All of our IAM calls were failing due to this outage, and we have zero resources deployed in us-east-1 (we're European).
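If I had to guess (and this is a guess about your setup), those AssumeRole calls were going through the global STS endpoint, sts.amazonaws.com, which is served out of us-east-1. A sketch of pinning STS to a regional endpoint with boto3; the role ARN and region here are hypothetical:

    import boto3

    # The global endpoint https://sts.amazonaws.com resolves to us-east-1;
    # pinning a regional endpoint keeps the call inside your own region.
    sts = boto3.client(
        "sts",
        region_name="eu-west-1",
        endpoint_url="https://sts.eu-west-1.amazonaws.com",
    )

    creds = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/ci-artifact-uploader",  # hypothetical
        RoleSessionName="ci-build",
    )["Credentials"]

    # Use the temporary credentials against a bucket in your own region.
    s3 = boto3.client(
        "s3",
        region_name="eu-west-1",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

Recent SDKs also honor AWS_STS_REGIONAL_ENDPOINTS=regional (or sts_regional_endpoints in the shared config file), which achieves the same thing without hardcoding the URL.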
One thing AWS should do is provide an easier way to detect these hidden dependencies. You can do it with CloudTrail if you know how (filter operations by region and check that none land in us-east-1), but a more explicit service would be nice.
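A rough sketch of that CloudTrail approach, for anyone who wants it: point the client at us-east-1 and tally recent management events by event source, so anything it prints is an API call that actually landed in that region. This assumes CloudTrail event history is available there, which it is by default for the last 90 days:

    import boto3
    from collections import Counter

    # LookupEvents returns the events recorded in the client's region, so
    # a client pinned to us-east-1 surfaces exactly the hidden dependencies.
    cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")

    sources = Counter()
    paginator = cloudtrail.get_paginator("lookup_events")
    for page in paginator.paginate(MaxResults=50):
        for event in page["Events"]:
            sources[event.get("EventSource", "unknown")] += 1

    for source, count in sources.most_common():
        print(f"{count:6d}  {source}")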
The problem was that we couldn't log into CloudTrail, or the console at all, to identify that, because IAM Identity Center is single-region. That was a decision recommended by AWS and blessed by our (army of) SRE teams.
But you can run TWO Identity Centers in different regions for the price of one(1)! IAM IDC is just a regular application hosted on AWS infrastructure; there's really nothing special about it.
Hindsight is 20/20, of course. It's good practice to audit CloudTrail periodically for unexpected regional dependencies.
Indeed. I also noticed this morning that you're not the person I replied to, and I took your response (which was actually helpful) in the context of the original post, which was "people are happy to just blame AWS when they're down".
Either way, we would have only made it one step farther in our CI, as the next step builds a container with a base image from Docker Hub, and that was down too. The idea of running a multi-region Nexus repository to avoid Docker Hub outages for my 14-person engineering team seems slightly overkill!
The easiest way to add some resilience to the build process is a pull-through cache using AWS ECR. It might backfire due to egress costs, though, if you're building outside AWS infrastructure.
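Setting that up is one API call. A sketch assuming Docker Hub as the upstream; the region, account ID, and secret ARN are made up, and Docker Hub upstreams need credentials stored in a Secrets Manager secret whose name starts with ecr-pullthroughcache/:

    import boto3

    ecr = boto3.client("ecr", region_name="eu-west-1")

    # Images pulled as <acct>.dkr.ecr.eu-west-1.amazonaws.com/docker-hub/<image>
    # are fetched from Docker Hub once, then served from ECR on later pulls,
    # so a Docker Hub outage only affects images you haven't cached yet.
    ecr.create_pull_through_cache_rule(
        ecrRepositoryPrefix="docker-hub",
        upstreamRegistryUrl="registry-1.docker.io",
        credentialArn=(  # hypothetical Secrets Manager secret with Hub creds
            "arn:aws:secretsmanager:eu-west-1:123456789012:"
            "secret:ecr-pullthroughcache/docker-hub-AbCdEf"
        ),
    )

Builds then pull URIs like <acct>.dkr.ecr.eu-west-1.amazonaws.com/docker-hub/library/python:3.12; the first pull hits Docker Hub, subsequent pulls are served from ECR.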
It's actually an interesting exercise to enumerate _all_ the external dependencies. But yeah, avoiding them all seems to be less than helpful for the vast majority of users.
On the flip side, you then have to maintain instances of everything yourself.
For most of what I run these days, I'd rather just have someone else run and administer my database. Same with my load balancers. And my Kubernetes cluster. I don't really care if there is an outage every 2 years.
> I assume most organizations, both small and large, just host on whatever provider they know or that costs them the least
I don't think there is a "most" here. Either organizations are looking for big cloud or they're not, and lowest cost is usually the last consideration when looking at any cloud, because you're deliberately paying a premium to get particular advantages.
The realistic antidote? Move to a less-shitty region. Or architect your systems to be failure-resilient.
(Most people seem to think the entire region was offline? That's wrong. It was just particular services that wouldn't process control-plane requests, and then a failure cascade caused more problems. But things that were already running stayed running. A region is multiple datacenters; even AZs are often multiple datacenters. It's virtually impossible for a whole region to stop working.)
If the resilience is worth the cost and complexity, then you just do it; otherwise you don't. How much did a company lose today compared to how much it costs to set this up?
And colos and datacenters aren't immune to going down either.
What is the realistic antidote here?