AWS outage shows internet users 'at mercy' of too few providers, experts say

dang · 2025-10-20T20:49:20 1760993360

Related ongoing thread:

AWS Multiple Services Down in us-east-1 - https://news.ycombinator.com/item?id=45640838 -(1650 comments so far)

sunrunner · 2025-10-20T19:21:46 1760988106

The 'experts' also made similar criticisms with the Fastly outage in 2021 and did anything obvious change as a result? In a week's time no national newspapers will be talking about this.

Meanwhile, everyone that spends actual time in these areas:

- Knows that running an operation at AWS scale is difficult and any armchair criticism from 'experts' is exactly that. Actions speak louder than words.

- Understands that the cost of actually accounting for this kind of scenario is incredibly high for the benefit in most cases

- Knows that genuinely 'critical' services (i.e. health) should be designed to account for this, and every other 'serious' issue such as 'I can't log in to Fortnite' just shows what the price and effort of actually making that work is versus how much it costs affected companies when it happens

- Knows how much time national newspapers spend actually talking about the importance of multi-region/multi-cloud redundancy, that is, it's zero until the one day where it happens and then it's old news

- Is just curious as to just what exactly happened from a technical perspective

This isn't to say that good blameless post-mortem shouldn't happen to figure out process and technical issues, but the armchair criticism with no actual followup? All noise, no signal.

imgabe · 2025-10-20T20:05:22 1760990722

The "experts" in this case are

> Dr Corinne Cath-Speth, the head of digital at human rights organisation Article 19

Dr. Cath-Speth has a PhD in cultural anthropology

> Cori Crider, the executive director of the Future of Technology Institute

A lawyer

> Madeline Carr, professor of global politics and cybersecurity at University College London

A professor. Her bio doesn't say what her degree is in, but she mostly seems to publish in political science and international relations

So, not a single technical expert. Not anyone who has ever run a hosting service before or even worked for one. Just people who write papers and sit around waiting for journalists to call them for quotes.

kopecs · 2025-10-20T20:17:32 1760991452

Do you not think it a bit too hyperbolic to throw scare quotes around experts and imply the only people who can have opinions on systemic risk are software engineers? I don't think it is unreasonable for people who haven't run or worked for a hosting service to have opinions on the policy aspect or economic impact of hyperscalers.

sunrunner · 2025-10-20T20:26:55 1760992015

> I don't think it is unreasonable for people who haven't run or worked for a hosting service to have opinions on the policy aspect or economic impact of hyperscalers.

Yeah, that's completely fair. My angle was more that firstly this doesn't come across as an opinion that needs the expert in question, and secondly this is yet another case of 'Talk is cheap, show me the code', particularly when quotes in the article include "We urgently need diversification in cloud computing."

I feel like the 'We' is doing an awful lot of heavy lifting and there's no mention of the costs of taking on such a task.

Additionally, and awkwardly, it's possible to be both a monopoly in the space but also technically a more stable solution, making the cost for competitors or people willing to use competitors doubly high.

Edit: Realised after the fact I'm GP to your post, assumed it was mine, keeping the words anyway.

array_key_first · 2025-10-21T00:38:29 1761007109

I don't think anyone needs to produce any code. I've worked at companies with thousands of employees who don't use any cloud services.

It can be done, and contrary to marketing, it's probably cheaper and more reliable.

selfhoster11 · 2025-10-21T15:13:06 1761059586

What code is needed to make a decision to go with a smaller provider instead of AWS?

ufmace · 2025-10-20T21:23:51 1760995431

No, it's 100% appropriate. Anyone can have opinions on anything, but frankly, most of them have little relevance to reality. Their use of the word "expert" is supposed to mean the person has knowledge or expertise that renders their opinion on a subject substantially more valid and relevant than any regular person. That clearly is not the case here. If I wanted to know what a random person on the street thought about a subject, I could go ask one myself. The purpose of news organizations was supposed to be to better-inform people by getting opinions from actual relevant experts in a subject.

These people don't seem to have much ability to discuss relevant subjects like what the actual reliability of lower-tier hosting providers is, the value-add to business and iteration speed of having a variety of extra services (SQS, DynamoDB, VPC, RDS, managed K8s, etc) available, etc.

jimbokun · 2025-10-20T20:24:37 1760991877

I don’t think it’s useful at all.

What are they going to say that’s useful for making concrete technical decisions?

They can advise on how to write contracts for dealing with these situations after the fact, I suppose.

imgabe · 2025-10-20T20:23:40 1760991820

Anyone can have an opinion, I never said or implied otherwise. Having an opinion does not make one an expert, hence the scare quotes.

The headline is misleading because when there is news about experts saying something about technology, one would naturally think that they are at least somewhat technical experts. Instead the "expert" is the director of the "Big Tech is Bad Institute" who says that "Big Tech is Bad". And their qualification of being an expert is solely that they are director of the "Big Tech is Bad Institute".

yubblegum · 2025-10-21T00:28:45 1761006525

> when there is news about experts saying something about technology, one would naturally think that they are at least somewhat technical experts.

But the experts here are not "saying something about technology". Rather they are saying something about uses of technology. So they don't need to be cloud engineers or know anything about datacenters, at all, really. What would be required (and here you may have a leg to stand on) is expertise in social and economic aspects of (now) critical infrastructure.

ghaff · 2025-10-20T20:45:51 1760993151

And one would hope that the stats being quoted about desktop share were from someone who has been at that research firm in the last 20 years or so. I'm not sure how active he is at all at this point. I have a feeling someone looking for some stats found something old that may or may not have actually had a date on it.

(If I'm wrong mea culpa but I'm pretty sure.)

wagwang · 2025-10-20T20:15:52 1760991352

Opinions are valid but also worthless. Just give me a funny tweet to digest the situation.

mhb · 2025-10-20T20:18:14 1760991494

[flagged]

inopinatus · 2025-10-21T00:54:48 1761008088

The actual experts I was paying attention to said that wearing a K/N-94/95 type mask lowers the statistical rate of transmission, that is, infection of others by your deadly virus.

The subsequent findings are that cloth-type masks are less effective (but not wholly ineffective) compared to clinical/surgical masks at limiting the aerosolized viral shedding from those already infected. So if a cloth mask was all you had, the advice became "please wear it".

Turns out, many people assume advice is only relevant when given for their own direct & immediate personal benefit, so they hear what they want to hear, and even the idea of giving a shit about externalities is sheer anathema. That gets boiled down further for idiot-grade TV and bad-faith social media troll engagement and we wind up with reductive and snarky soundbites, like the remark above, that help nobody at all.

Back on topic, the choice of so-called "experts" in the Guardian's coverage of the AWS matter seems to be a classic matchup of journalistic expediency with self-promoting interests to pad an article that otherwise has little to say beyond paraphrasing Amazon's operational updates.

mhb · 2025-10-21T14:43:59 1761057839

It's unclear what you're arguing. The leading experts (Fauci/CDC) who most Americans were paying attention to were not providing this shading of meaning which you are trying to impute to them. That would be the case if they said something like N95 masks will provide excellent protection for you from the virus if worn correctly, but we have a shortage, so please make do with alternatives so that health care workers have access to them. That is not what they said. Instead they sacrificed credibility at the altar of expediency to the detriment of future trust.

What's reductive is assuming that people are motivated exclusively by self-interest instead of trusting them to make good decisions when told the truth.

rovolo · 2025-10-24T17:42:14 1761327734

Fauci said the following on 2020-03-08: https://www.factcheck.org/2020/05/outdated-fauci-video-on-fa...

> When you’re in the middle of an outbreak, wearing a mask might make people feel a little bit better and it might even block a droplet, but it’s not providing the perfect protection that people think that it is. And, often, there are unintended consequences — people keep fiddling with the mask and they keep touching their face.

> But, when you think masks, you should think of health care providers needing them and people who are ill... It could lead to a shortage of masks for the people who really need it.

He said that there's a shortage, and that he didn't trust that people would wear the masks correctly. I remember that most of the early anti-mask guidance I heard was claims that they weren't likely to prevent yourself from getting infected because: the mask would become an infectious surface; and people wouldn't handle the mask as infectious.

Opinions started to shift over March, and the CDC put out guidance on 2020-04-03 to wear cloth masks in public. https://www.npr.org/sections/coronavirus-live-updates/2020/0...

> It is mainly to prevent those people who have the virus — and might not know it — from spreading the infection to others.

> U.S. health authorities have long maintained that face masks should be reserved only for medical professionals and patients suffering from COVID-19, the deadly disease caused by the coronavirus. The CDC had based this recommendation on the fact that such coverings offer little protection for wearers, and the need to conserve the country's alarmingly sparse supplies of personal protective equipment.

I used wikipedia for dates and sources: https://en.wikipedia.org/wiki/Face_masks_during_the_COVID-19...

inopinatus · 2025-10-21T20:17:26 1761077846

This information was and is widely available.

Your earlier statement was entirely framed in self-interest, so you don’t get to complain about being pulled up on that now.

mhb · 2025-10-21T21:09:07 1761080947

The self-interest of wanting to be told the truth? Uh, yeah.

inopinatus · 2025-10-21T22:02:33 1761084153

Sounds more like you chose to ignore it. My family was wearing medical-grade disposable facemasks and socially distancing from February 2020 on the basis of healthcare advice.

Hunting for a bogeyman in retrospect is the bad-faith narrative of the mediocre culture warrior. Good luck with your undifferentiated rage or whatever.

mhb · 2025-10-21T22:11:39 1761084699

Good for you. Nonetheless a non sequitur in this discussion.

inopinatus · 2025-10-21T22:17:03 1761085023

That’s certainly true. Face masks are not relevant to AWS outages.

Waterluvian · 2025-10-20T20:44:36 1760993076

Right?! Same with seatbelts. I don’t wear mine because there’s obviously still automobile deaths. Experts said seatbelts would protect us from deadly accidents. What else are they wrong about?!

mhb · 2025-10-20T20:56:49 1760993809

That counterargument might make sense if seat belts were not generally protective in accidents or if experts were telling you to wear crepe paper seat belts instead of nylon ones because the nylon ones were needed elsewhere.

yubblegum · 2025-10-21T00:33:35 1761006815

Those comments were made in an information regime that severely censored contrary expert opinion. We had experts in various related fields who were automatically labeled as cranks simply because they disagreed with the social engineering experiment and test run of various social control mechanisms (worldwide ..).

Isamu · 2025-10-20T23:23:17 1761002597

What “experts” can you directly cite? Or is the reality that you recall opinion makers saying “experts” are making clearly unsupported claims?

antod · 2025-10-21T00:07:18 1761005238

Was that from actual experts, or bad faith strawman coverage (plenty of that about)?

At least in my country, there was sober objective coverage from experts about their purpose and percentage effectiveness at reducing the range and spread of potentially infected droplets. Masks were somewhat effective for filtering incoming droplets, but most effective at containing outgoing droplets. The smaller the viral load you were exposed to the lower your chances of getting infected. Experts never claimed them to be 100% though, it was about reducing transmission rates not absolute protection.

Which is the main reason they're used in surgery too coincidentally (they aren't primarily for the surgeon's protection). Or is that an even longer running conspiracy?

SJC_Hacker · 2025-10-21T15:52:49 1761061969

You misunderstood then. It was mainly to protect OTHER people from a virus YOU might be carrying.

qmmmur · 2025-10-21T01:23:16 1761009796

Unfortunately this site is full of Americans…

zenoprax · 2025-10-20T20:16:58 1760991418

I think your third point is what I've had to attune to when criticizing cloud dependence. I think if your entire source of revenue is dependent on AWS then you should be prepared for 16+ hours of downtime per year. Individuals notice it more when something is down for hours but with good observability I am guessing the business notices it more when performance drags for the other 8742 hours of the year. Bursts of downtime per day can still be attributed to the device, wifi, ISP, or some other intermediary's DNS/BGP.

If your margins are so tight that 16 hours of downtime will bankrupt you then I think either: a) I have no idea how to run a business; or b) you have no idea how to run a business. I'm also biased because I love highly fault-tolerant, geo-redundant, durable systems much more than "good enough for this KPI".
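To put the "16+ hours of downtime per year" figure in context, here is a minimal sketch of the arithmetic, assuming a 365-day year. The function names are made up for illustration and do not come from any AWS SLA document.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years (an assumption)


def availability(downtime_hours: float) -> float:
    """Fraction of the year a service is up, given total downtime in hours."""
    return 1 - downtime_hours / HOURS_PER_YEAR


def downtime_budget(availability_pct: float) -> float:
    """Hours of downtime per year permitted by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    # 16 hours down per year works out to a little under "three nines".
    print(f"availability with 16 h down: {availability(16):.4%}")
    print(f"budget at 99.9%: {downtime_budget(99.9):.2f} h/year")
```

Under that assumption, 16 hours of downtime a year sits between "two nines" and "three nines" of availability.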

sunrunner · 2025-10-20T20:36:06 1760992566

> but with good observability I am guessing the business notices it more when performance drags for the other 8742 hours of the year

This is a really good point that aligns with my experience. Today's event was LOUD and (compared to other incidents) long, but perhaps not really that long compared to the situation you describe that for most businesses is going to be more pernicious.

Business intelligence and analytics-type folks at $DAYJOB are _very_ watchful for the year-on-year deviations and even periods where the prediction lines didn't match up for even just a few hours.

BrenBarn · 2025-10-20T19:47:42 1760989662

I think all of that is mostly irrelevant. You don't need to pay a huge cost to avoid the small benefit, you don't need every service to be resilient to this, or any of that. You just need multiple different providers so that not everyone gets screwed at once.

sunrunner · 2025-10-20T19:59:01 1760990341

But that would require companies to actually spend time and money testing and working with either a cross-provider multi-master-type system (with all the associated consistency headaches) or regularly test a functioning disaster-recovery/fallback system.

The time spent on that (let alone cost, for companies with large amounts of data) far outweighs the cost when a single region has an issue of today's scope. And you said it yourself, it's a 'small benefit'. Small benefits sound like exactly the things not worth spending time or money on.

For as much as many companies have had issues today, the daily reality is that these same companies haven't been having issues all the rest of the time (or this wouldn't have felt so shocking) and are likely to be okay with an outage of this scope (plus, everyone's too busy making noise about the issues to be working normally).
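For illustration only, the request-routing half of a fallback setup can be sketched in a few lines. The `primary` and `secondary` callables below are hypothetical stand-ins, not a real provider SDK, and the sketch deliberately leaves out data replication, consistency, and the regular failover testing described above as the expensive part.

```python
from typing import Callable, Sequence


def call_with_fallback(providers: Sequence[Callable[[], str]]) -> str:
    """Try each provider in order and return the first successful result."""
    errors = []
    for provider in providers:
        try:
            return provider()
        except Exception as exc:  # real code should catch narrower error types
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors}")


def primary() -> str:
    # Hypothetical: simulates the main provider being unavailable.
    raise ConnectionError("primary region unavailable")


def secondary() -> str:
    # Hypothetical: simulates a standby hosted elsewhere.
    return "served from secondary"


if __name__ == "__main__":
    print(call_with_fallback([primary, secondary]))
```

The routing itself is small; keeping the secondary's data current and proving that it still works is where the time and money go.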

morshu9001 · 2025-10-22T23:58:17 1761177497

There are multiple different providers, with nothing artificially limiting their use. Also idk what's so bad about Fortnite and Snapchat going down at once instead of it being staggered.

bamboozled · 2025-10-20T19:51:32 1760989892

Yes but we live in a highly anti-competitive monopolized world now. With more to come under the new admin.

hdgvhicv · 2025-10-20T22:09:17 1760998157

There’s two or three Gartner-approved ways of doing things for Fortune 500 CTOs, and F500 wannabes.

It’s not a monopoly but it’s close.

jdminhbg · 2025-10-20T19:57:39 1760990259

It’s hard to think of anything less monopolized than cloud hosting. There are hundreds of providers.

bamboozled · 2025-10-20T20:35:29 1760992529

Yeah right, and how many of them have any substantial customer base compared to AWS and Azure?

estimator7292 · 2025-10-20T20:02:37 1760990557

For any business that matters, your choices are Amazon, Google, Microsoft, and that's about it.

I couldn't even name another provider except maybe Hetzner

bamboozled · 2025-10-20T20:36:45 1760992605

The three you mentioned have over 60% market share which is why this article exists at all. Knowing what I know about cloud infra, anyone who is actually anyone is hosting on the big three. So it's not just market share, it's market share + impact / importance.

You could also argue that YT is on GCP (to some level) and that would probably bump that number up much higher.

The vast majority of people hosting things on the internet are on these providers. But you get downvoted for pointing that out now.

alecco · 2025-10-20T19:55:30 1760990130

> - Knows that running an operation at AWS scale is difficult and any armchair criticism from 'experts' is exactly that. Actions speak louder than words.

NO. From their own reports, clearly AWS is too centralized and dependent on a specific region (us-east-1) and a specific service (DynamoDB). This has been observed for well over 10 years. Why do they stay in this centralized architecture? Cloud services need much higher standards than the average corporation. Just look how they took down 2000+ services for many hours.

[1] https://health.aws.amazon.com/health/status

inopinatus · 2025-10-20T20:19:19 1760991559

Even wearing my ex-AWS hat and understanding to some degree the internal complexity of these services, I too am boggled that foundational stuff is still out of Virginia and not a separately operated global region for the subset of control-plane dependencies that can’t be refactored into tolerating eventual consistency (such as parts of IAM).

We always used to talk a lot about minimising blast radius and there’s been enough time, and enough scale, to fix it.

Nevertheless the Guardian’s choice to label self-promoting policy wonks as “experts” is a cringe-inducing reminder that journalists don’t know anything about anything.

sunrunner · 2025-10-20T20:06:38 1760990798

I don't deny that an incident of this scope should prompt a serious technical and process review (and as you describe it, it sounds like this is long overdue), however how often does this kind of thing not affect 2000+ services? Companies should be tracking the time they don't have issues as much as the time they do in order to actually understand if they'd be better off elsewhere.

And to be clear, I'm not at all arguing for the monopolisation of cloud providers, only stating that it's easy to point from far away and say 'This is bad' while simultaneously not doing anything to understand the cost and make that change that you say is important, because it's actually costly (in many dimensions) to do.

benterix · 2025-10-21T08:57:59 1761037079

> Knows how much time national newspapers spend actually talking about the importance of multi-region/multi-cloud redundancy

For the record, multi-region redundancy is moot, and I can't stress it enough. It is not the first time that on the surface it looks like a single region but in fact services in multiple regions are affected.

And multi-cloud hot standby can be terribly expensive, unless your infra is very simple. And it's not easy to get it right either unless you planned for it from day one.

A4ET8a8uTh0_v2 · 2025-10-20T20:47:50 1760993270

Um.. you don't need to be an expert in security, comp.science or economics to know that putting all eggs in one basket may not be a great idea as it introduces one giant systemic target. If anything, regular people here are uniquely qualified to say something along the lines of:

Oi, this is ridiculous. Maybe more things should be run locally..

FWIW, it was instructive to me as to which companies were not able to function today.

sysguest · 2025-10-20T20:38:57 1760992737

> - Knows that genuinely 'critical' services (i.e. health) should be designed to account for this

yeah but aws advertises as "trust me bro I won't go down for 99.99999%"

I've seen a lot of gov proposals using aws to 'get away with downtime management'

hippo77 · 2025-10-20T20:25:40 1760991940

These are Guardian 'experts' so can be safely ignored.

gnerd00 · 2025-10-20T19:33:40 1760988820

maybe your VC overlords need a reality check?

free_bip · 2025-10-20T19:32:16 1760988736

Because the experts have no say in policy. The only people who have a say are the people bribing (sorry I mean "lobbying") Congress. And even they have very little say because Congress is currently on a hot streak of doing absolutely nothing.

labrador · 2025-10-20T19:02:05 1760986925

Kieran Healy @kjhealy@mastodon.social

Always worth taking sentences that use “the Cloud” or “the Internet” and try replacing those phrases with “A shed in Virginia” to see how they hold up. “Our service is fully based in a shed in Virginia”; “All my files are in a shed in Virginia”; “A shed in Virginia was designed to survive a nuclear war”, etc.

https://mastodon.social/@kjhealy/115407725852594322

SpicyLemonZest · 2025-10-20T19:27:32 1760988452

Sounds like a pretty good shed! Like a lot of pithy commentary on the cloud, this ignores the fact the practical alternative to a shed in Virginia for most businesses is a shelf in the supply closet. "Oops, Jim Bob tripped over the power cord, guess we won't get any emails until the IT guy shows up" - this used to be a routine experience.

gspencley · 2025-10-20T19:50:30 1760989830

> "Oops, Jim Bob tripped over the power cord, guess we won't get any emails until the IT guy shows up" - this used to be a routine experience.

You're not entirely wrong, but you're being hyperbolic too. I'm actually curious how old you are / how long you've worked in tech, because I started out pre-cloud and things weren't nearly as bad or as limited as you suggest.

First, on-prem servers are not the only alternative to "cloud." Many businesses, including the ones I worked for, did co-location. The companies owned their own bare metal servers, but would rent a rack in a data centre, and certain things - like the network admin - were entirely outsourced to the data centre / hosting company.

You could also rent managed bare metal servers (you still can). This means that you can pretty much outsource your entire IT department, but you're still not doing cloud services. Meaning you've got bare metal servers, someone you're paying at the hosting company is handling security updates and troubleshooting. You don't get things like auto-scaling or serverless or other cloud features, but you also don't have to worry about Jim tripping over the power cable either.

There's also still virtual servers. Which is basically a VM running on a server that hosts multiple clients.

All of this is to say that the alternative is not "cloud" or "box in a closet." The alternative is "cloud" and a ton of different server options: owned, rented, co-located, on-prem, dedicated, virtual, managed v un-managed (outsource IT vs admin your own) and the list goes on and on.

Spivak · 2025-10-20T20:52:29 1760993549

But is the distinction meaningful? The alternative to a shed in Virginia is a different shed in Montana? I mean sure there are a lot of different sheds out there but they're all still sheds. They're all shared responsibility models where the line is drawn in different areas, some outages will be because of your fuckup, some will be theirs.

Not saying as an industry we shouldn't diversify a little but it doesn't fundamentally change the relationship each company has to their hosting provider.

maccard · 2025-10-20T20:01:32 1760990492

We run a subset of our CI workload on on-prem workstations because the cost/performance ratio of consumer hardware is so much higher than servers. 1TB NVMe drive, with a 7950x/i9, 64GB RAM and gigabit networking is < $1000. It actually completes our CI job faster than AWS restarts a gpu instance.

100% of our failures with this machine in 2 years have been "carpet cleaners unplugged the machine". Last year we had nobody in the office (due to carpet cleaning). This year we sent someone in straight after the cleaning to fix it.

SpicyLemonZest · 2025-10-20T20:55:06 1760993706

I've never managed IT professionally myself (pre-cloud or otherwise), so a lot of my information comes from family members who do, but my impression is that bare metal rental and colo centers weren't realistic options for any but the most technically sophisticated organizations. I know schools, stores, even research centers who went straight from on-prem to managed cloud with no real consideration for anything in between.

gspencley · 2025-10-22T13:37:54 1761140274

My first paid position as a software developer was for a small, dot-com startup in Windsor, ON Canada. We co-located in Detroit - which meant border crossings (though this was pre-9/11 so crossing as a Canadian citizen was easier) - and had just a couple of servers on a rack. We were software engineers and had people who knew what they were doing. So yeah we were technically proficient. But I'm not sure I'd call that company among "the most technically sophisticated of organizations". We were tiny. In fact, when I first started working there, we were working out of a house with a workforce of like 25 people max.

When that company went under during the dot-com crash, I started my first business shortly after. It was 2003, I was 21 years old and this business allowed me to work from home and feed my family until I re-entered the job market in 2018. For 15 years, I was a one-person organization, and because my business operated "free" adult-entertainment websites, bandwidth was my most significant expense. For that reason, even when Cloud became a thing (which it wasn't in 2003), I never migrated because of the bandwidth costs alone. Cloudflare was a major game changer but even it didn't exist when I first started out. There were CDNs like Akamai but they were crazy expensive and out of my league. So at its peak, I had about 12 bare metal servers around the world (all rented from the same hosting company - originally called Server Matrix, it then became Softlayer and then was bought by IBM and went to shit and is now IBM Cloud). I admin'd those on top of writing and maintaining all of the code and running the business independently with occasional help from my wife.

I am obviously very technically competent. I'm a Principal Software Engineer today. But technically sophisticated? There wasn't much sophistication about it. I did bare metal servers because it was the only cost-effective way to run my business. It was attainable and it worked. And it worked in a way that Cloud couldn't when Cloud came on the scene - so I never went Cloud with that operation just due to cost alone.

darkwater · 2025-10-20T19:45:32 1760989532

With a gazillion shelves, closets, Jims and cables. So if Fortnite's Jim trips on a wire, Canva's Jim is quietly sipping coffee at his desk.

franz_vlkshp · 2025-10-20T20:18:46 1760991526

on the other hand, that's a small price to pay to having total control and physical access to your own infrastructure. if the sysadmin did his job properly, an incident like that shouldn't require anything else but to plug the server back in and hit the power switch. but then if he did his job properly, no one but IT should be tripping on power cables to begin with.

hdgvhicv · 2025-10-20T22:13:38 1760998418

My home infrastructure is immune from the “unplug” problem by being hosted on two different 10 year old £30 raspberry pis in different rooms.

But apparently that’s too hard for the average.

noir_lord · 2025-10-20T19:48:44 1760989724

Once had a site wide outage (biggish manufacturing company) of the internet and backup servers because one of the women wanted to plug her hair straighteners in for the xmas party.

In a surprise to literally no one that happening on the last friday before xmas break got my "We need to secure the main comms cabinet" (which had the backup server and main ingress for WAN and was in a separate building on other side of site) item that I'd been asking about for months to the top of the list.

Still one of my favourite "outages" because I got to my desk, turned PC on, no network, walked across the landing into the main office, opened comms cabinet, plugged it back in and was "resolved" before the MD got to my desk.

jacobsenscott · 2025-10-20T20:33:23 1760992403

Largely mitigated by twist lock sockets.

impure · 2025-10-20T18:23:22 1760984602

We already have diversification. You can rent a VPS from hundreds of possible companies. And people are very happy with them, it seems every month or two there’s a post here about how some company slashed their cloud bill by switching to a VPS. What we have here is a lock-in and marketing problem.

jasode · 2025-10-20T18:50:28 1760986228

>You can rent a VPS from hundreds of possible companies. And people are very happy with them, it seems every month or two there’s a post here about how some company slashed their cloud bill by switching to a VPS.

Companies are using higher-level "PaaS" suite of services from AWS such as DynamoDB, RedShift, etc and not just the lower-level "IaaS" such as basic EC2 instances or pure containers. Same "lock-in" situation with using the higher-level services from MS Azure and Google Cloud.

For those dependent on high-level services, migrating to a VPS like Hetzner or self-hosting is not possible unless they re-invent the AWS stack by installing/babysitting a bunch of open-source software. It's going to be a lot more involved than just installing a PostgreSQL db instance on a VPS.

SoftTalker · 2025-10-20T19:00:19 1760986819

> It's going to be a lot more involved

Yes, and you can't escape that by outsourcing it. The complexity is still there, and it will still bite you when your outsourcer fails to manage it.

candiddevmike · 2025-10-20T19:01:31 1760986891

Same thing applies to AWS...

throwaway894345 · 2025-10-20T19:23:04 1760988184

I’m not really making a point here as much as an observation, but if my stack that I manage atop VMs in a data center goes down, my customers are pissed at me. If AWS goes down along with half the Internet, my customers are completely sympathetic.

candiddevmike · 2025-10-20T19:36:08 1760988968

Maybe just for you and after they realize it's part of the ongoing AWS outages, but for most folks, an outage is still their problem, and their SLA, regardless of if it's upstream from them.

throwaway894345 · 2025-10-21T00:11:25 1761005485

I disagree. I think most customers are much more sympathetic to an AWS outage than they are to a self-managed outage. Whether that ought to be the case or not is a different question.

SoftTalker · 2025-10-21T02:57:19 1761015439

But if your services are up when everyone on AWS is down you look like a wizard.

throwaway894345 · 2025-10-21T03:33:21 1761017601

Unfortunately, people rarely notice when something is working, and the few who do will probably just assume you weren’t on AWS in the first place and move on with their day.

TiredOfLife · 2025-10-21T06:41:20 1761028880

And every comment in those threads is about how AWS is webscale and won't go down, while the VPS will have uptime of 1 day a month

TZubiri · 2025-10-20T18:50:57 1760986257

Amazon offers VPS as well, EC2 instances, were those affected? I think they weren't.

swiftcoder · 2025-10-20T19:07:28 1760987248

Our actual running instances were pretty much fine throughout, as was the RDS cluster, but we had no way to launch new instances (or auto-scale), and no way to invoke any of the other AWS services (IAM, SQS, Lambda, etc). Also no CloudWatch logs/metrics for the duration, so limited visibility.

Overall not that bad for us, but if you had more high-level service dependencies, there would have been impact.

TYPE_FASTER · 2025-10-20T19:00:59 1760986859

> While most operations are recovered, requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates.

> We continue to investigate the root cause for the network connectivity issues that are impacting AWS services such as DynamoDB, SQS, and Amazon Connect in the US-EAST-1 Region. We have identified that the issue originated from within the EC2 internal network.

So, kinda? Some global services depend on us-east-1...

> Global services or features that rely on US-EAST-1 endpoints such as IAM updates and DynamoDB Global tables may also be experiencing issues.

Basically, you know it's going to be a bumpy day when us-east-1 has an issue, because your ability to run across regions depends on what the issue is and what the impact is.

morshu9001 · 2025-10-20T17:59:34 1760983174

The expert opinions are more about geopolitics, like maybe don't have all your country's systems realtime depend on a foreign company.

If you are just one company whose goal is to maximize uptime without bringing in the complexity of multi-cloud, relying on AWS is reasonable. You probably won't get better uptime using something else, you'll only be down at different times than most others, which in most cases is actually worse.

kristianc · 2025-10-20T18:08:01 1760983681

For the kind of person being quoted, the stock in trade is not actually doing anything to fix it, it's in being the person quoted when something goes wrong.

dynamite-ready · 2025-10-20T17:55:14 1760982914

The whole industry walked straight into the cloud service lock-in trap. How would we begin to wind back? I also think Docker is as much to blame as the bigger cloud vendors.

spjt · 2025-10-20T20:05:43 1760990743

I don't think it wants to. Ask any on-call engineer or support tech how they felt when, after having their phone blow up at 1am because everything is falling apart, they found out that this was an AWS-wide outage.

Jcowell · 2025-10-20T18:20:06 1760984406

Why is docker to blame?

dynamite-ready · 2025-10-20T18:39:42 1760985582

It's subjective I guess, but I feel as though containerisation has greatly supported the large cloud vendors' desire to subvert the more common model of computing... Like, before, your server was a computer, much like your desktop machine, and you programmed it much like your desktop machine.

But now, people are quite happy to put their app in a Docker container and outsource all design and architecture decisions pertaining to data storage and performance.

And with that, the likes of ECS, Dynamo, RedShift, etc, are a somewhat reasonable answer to that. It's much easier to offer a distinct proposition around that state of affairs, than say a market that was solely based on EC2-esque VMs.

What I did not like, but absolutely expected, was this lurch towards near enough standardising one specific vendor's model. We're in quite a strange place atm, where AWS specific knowledge might actually have a slightly higher value than traditional DevOps skills for many organisations.

Felt like this all happened both at the speed of light, and in slow motion, at the same time.

godelski · 2025-10-20T20:16:49 1760991409

Containers let me essentially build those machines but at the actual requirements I need for a particular system. So instead of 10 machines I can build 1. I then don't need to upgrade that machine if my service changes.

It's also more resilient because I can trash a container and load up a new one with low overhead. I can't really do that with a full machine. It also gives some more security by sandboxing.

This does lead to laziness by programmers, accelerated by myopic management. "It works" except when it doesn't. It's easier to say you just need to restart the container than to figure out the actual issue.

But I'm not sure what that has to do with cloud. You'd do the same thing self hosting. Probably save money too. Though I'm frequently confused why people don't do both. Self host and host in the cloud. That's how you create resilience. Though you also need to fix problems rather than restart to be resilient too.

I feel like our industry wants to move fast but without direction. It's like we know velocity matters but since it's easier to read the speedometer we pretend they're the same thing. So fast and slow makes sense. Fast by magnitude of the vector. Slow if you're measuring how fast we make progress in the intended direction.

throwaway894345 · 2025-10-20T19:28:17 1760988497

Containers have nothing to do with storage. They are completely orthogonal to storage (you can use Dynamo or RedShift from EC2), and many people run Docker directly on VMs. Plenty of us still spend lots of time thinking about storage and state even with containers.

Containers allow me to outsource host management. I gladly spend far less time troubleshooting cloud-init, SSH, process managers, and logging/metrics agents.

dynamite-ready · 2025-10-20T19:50:25 1760989825

> Containers have nothing to do with storage. They are completely orthogonal to storage

Exactly.

And sure, you can use S3/Dynamo/Aurora from an EC2 box, but what would be the point of that? Just get the app running in a container, and we can look into infrastructure later.

It's a very common refrain. That's why I believe Docker is strongly linked to the development of these proprietary, cloud based models of computing, that place containerisation at the heart of an ecosystem that bastardises the classic idea of a 'server'.

The existence of S3 is one good result of this. IAM, on the other hand, can die in dumpster fire. Though it won't...

throwaway894345 · 2025-10-21T00:10:28 1761005428

> And sure, you can use S3/Dynamo/Aurora from an EC2 box, but what would be the point of that?

An easy API? Easy replication / failover / backups? I would absolutely use S3 even with EC2.

> IAM, on the other hand, can die in dumpster fire.

I’m no great fan of AWS’s approach to IAM, but much of the pain is just the nature of fine-grained / least-privilege permissioning. On EC2 it’s more common to just grant broader permissions; IAM makes you think about least privilege, but you absolutely can grant admin for everything. And as far as a permissioning API goes, IAM is much cleaner/saner than Linux permissions.

pythonaut_16 · 2025-10-20T19:28:30 1760988510

I don't see how Docker makes that worse.

Before Docker you had things like Heroku and Amazon Elastic Beanstalk with a much greater degree of lock in than Docker.

ECS and its analogues on the other cloud providers have very little lock in. You should be able to deploy your container to any provider or your own VM. I don't see what Dynamo and data storage have to do with that. If we were all on EC2s with no other services you'd still have to figure out how to move your data somewhere else?

Like I truly don't understand your argument here.

SJC_Hacker · 2025-10-21T18:14:08 1761070448

Containerization was basically a way to get rid of the problem of "it works on my machine", mainly the OS version and installed libraries. Plenty of instances where program X will work on system A, but not system B, but program Y works on system B but not A. Or X is supported on Redhat/Ubuntu/etc. but you can't or don't want to build from source.

Even if that is not a problem, you avoid having to install the kitchen sink on your host and make sure everything is configured properly. Just get it working in a container, build an image and spin it up when you need it. Leaves the host machine fairly clean.

You can run a bunch of services as containers within a single host. No cloud or k8s needed. docker-compose is sufficient for testing or smallish projects.

Also, there is a security benefit because if the container is compromised, the problem is limited to that container, not the entire host.

neom · 2025-10-20T17:57:04 1760983024

Been a while since I worked in cloud but at least when I got out of it, the primitives were all shoring up to be generally very similar.

Did multi cloud redundancy end up being too expensive? Tech didn't line up enough? No good business case?

The elastic cloud story that never was? https://www.slideshare.net/slideshow/pets-vs-cattle-the-elas...

What happened?

LaurensBER · 2025-10-20T18:00:40 1760983240

The (cognitive) overhead of managing and deploying to multiple clouds usually isn't worth it for most teams. Hiring experts and maintaining knowledge about the ins and outs of two (or more) clouds is less feasible for small, fast moving teams.

Simplicity is linked to uptime and having a single cloud is the simpler solution.

For large companies, it's mostly cost savings. Easier to negotiate a good discount at N million versus N/2 million.

Besides that no-one ever got fired for picking AWS ;)

tadfisher · 2025-10-20T18:03:02 1760983382

Not a justifiable expense when no one else is resilient against their AWS region going down either. Also cross-cloud orchestration is quite dead because every provider is still 100% proprietary bullshit and the control plane is... kubernetes. We settled for kubernetes.

morshu9001 · 2025-10-20T18:33:24 1760985204

Also if you can't even do cross region, cross cloud won't happen

dylan604 · 2025-10-20T18:48:18 1760986098

Cross region isn't simple when you have terabytes of storage in buckets in a region. Building services in other regions without that data doesn't really do any good. Maintaining instances in various regions is easy, but it's that data that complicates everything. If you need to use the instances in a different region because your main region is down, you still can't do anything because those cross region instances can't access the necessary data.

Veserv · 2025-10-20T20:37:55 1760992675

Entire terabytes?! My god, I can only barely fit that onto a single SD card the size of my pinky nail.

It is quite bizarre that such paltry amounts of data and problems with such tiny scale seem to pose challenging problems when done in the cloud.

dylan604 · 2025-10-20T21:08:04 1760994484

Such a sophomoric response. It does not matter how large your storage use is exactly. The point is that nobody is going to pay to replicate that data in multiple clouds or within multiple regions of the same cloud provider.

Btw, I'd love to have a link to where I could buy an SD card the size of a pinky nail that holds terabytes of data.

Veserv · 2025-10-20T21:42:16 1760996536

It absolutely matters how large your storage use is. Terabytes of storage is easily manageable on even basic consumer hardware. Terabytes of storage costs just hundreds of dollars if you are not paying the cloud tax.

If you got resiliency and uptime for an extra hundred dollars a year, that would be a no-brainer for any commercial operation. The byzantine kafkaesque horror of the cloud results in trivial problems and costs ballooning into nearly insurmountable and cost-ineffective obstacles.

These are not hard or costly problems or difficult scales. They have been made hard and costly and difficult.

dylan604 · 2025-10-20T23:26:31 1761002791

Your pedantry is just boring. Yes, I used the word terabyte instead of, I guess, something more palatable to you for being large. Fine, s/terabyte/exabyte/.

I work with buckets where single files are >1 terabyte. There's more than one of these files, hence terabytes. I'm not going to do a human-readable summary listing of an entire bucket to get the full size. The actual size is irrelevant. When people are spending 5-6 digits on cloud storage per month, they are not going to do it in multiple places. period. Maybe the new storage unit should just be monthly cloud spend, but then your pedantry will say nonsense like which cloud server, which storage solution type, blah blah blah.

Veserv · 2025-10-21T00:45:38 1761007538

Ah yes, let us just gloss over 6 orders of magnitude when we are discussing cost-effectiveness and feasibility. What is the difference between 100$ and 100,000,000$ of spend really? Basically the same thing.

BenjiWiebe · 2025-10-20T22:35:20 1760999720

Yes they exaggerated, it takes several pinky nail sized cards to store several TB. Only 1TB per microSD.

Veserv · 2025-10-21T04:51:16 1761022276

They have them at 2 TB [1] now for just 300$. And SanDisk announced 4 TB last year, but I do not see them for sale just yet.

[1] https://shop.sandisk.com/products/memory-cards/microsd-cards...

TZubiri · 2025-10-20T18:59:47 1760986787

Bottomline is that AWS gives you the tools to survive this outage within their own ecosystem.

If there's an issue with relying only on AWS it has not been expressed in this outage.

dylan604 · 2025-10-20T19:16:18 1760987778

Exactly what tools help make your large volume of data stored in a down region available to other regions without duplicating the monthly storage fees?

morshu9001 · 2025-10-20T19:26:31 1760988391

You duplicate the fees. But it's the same or worse trying to do multi cloud.

dylan604 · 2025-10-20T19:39:23 1760989163

Which is precisely why it's not done

neom · 2025-10-20T19:42:27 1760989347

I seem to recall it was fairly common to have read-only versions of sites when there was a major outage - we did that a lot with deviantart in the early 2000s. Did that fall out of favour, or is it too complex with modern stacks, or?

dylan604 · 2025-10-20T21:04:06 1760994246

If only everything was a simple website. You're totally ignoring other types of workflows that would be impossible to use a read-only fall back. Not just impossible, but pointless.

morshu9001 · 2025-10-21T02:00:00 1761012000

HN does it too, but it's a simple site

morshu9001 · 2025-10-20T23:31:08 1761003068

I don't think storage cost is the reason, more that it's hard to design for regional failures. DB by itself as one example, cross region read replica usually introduces eventual consistency to a system that'd otherwise be immediately consistent.
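A toy illustration of that consistency point, as a sketch only: the `ReplicatedStore` class and its methods are made up for this example and do not model any real database. Writes land on the primary immediately, while the cross-region replica only sees them once replication runs, so a read from the replica in between returns stale data.

```python
class ReplicatedStore:
    """Toy model (hypothetical): the replica lags until replicate() is called."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self._pending = []  # writes accepted by the primary but not yet shipped

    def write(self, key, value):
        # The primary is immediately consistent for its own readers.
        self.primary[key] = value
        self._pending.append((key, value))

    def read_primary(self, key):
        return self.primary.get(key)

    def read_replica(self, key):
        # May return stale data (or nothing) until replication catches up.
        return self.replica.get(key)

    def replicate(self):
        # Stands in for asynchronous cross-region replication.
        for key, value in self._pending:
            self.replica[key] = value
        self._pending.clear()


store = ReplicatedStore()
store.write("balance", 100)
stale = store.read_replica("balance")  # replica has not caught up yet
store.replicate()
fresh = store.read_replica("balance")  # replica now matches the primary
print(stale, fresh)
```

Any code that was written assuming a read always reflects the latest write has to be revisited once reads can be served from the lagging side.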

TZubiri · 2025-10-21T22:30:49 1761085849

Well yeah, but that's why we get paid the big bucks right?

morshu9001 · 2025-10-22T23:00:16 1761174016

We do; a non-tech company's IT dept, not so much

neom · 2025-10-21T20:07:55 1761077275

Thanks for the helpful reply! Do you think that would be still true if one accepted a constraint of the "down" version of the property served had data that was stale, say 24 hours behind what the user would have seen had they been logged in?

morshu9001 · 2025-10-22T23:02:24 1761174144

Yeah except it would probably be delayed way less than 24h. And then you have to figure out how to merge the data back in after, unless you're ok just losing it permanently. And make sure things are handled right if other healthy DBs point to things in the failed-over DB that disappeared.

jimbokun · 2025-10-20T20:34:57 1760992497

Data has a lot of gravity.

Analemma_ · 2025-10-20T18:05:32 1760983532

All the cloud providers have cheap compute but ludicrously expensive network egress. Trying to multicloud will stick you with a massive traffic bill, which is probably not a coincidence.

starman55 · 2025-10-20T22:22:44 1760998964

It really depends on how you build it. You can architect it for multi cloud from the top down, where the client/browser talks to one region, with DNS health checks, and replication happens at the DB layer. Your services don't talk cross region at the service level, so you avoid a lot of cross region/cloud communication. Most use cases can be addressed this way.
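The routing decision in that design can be sketched roughly as follows. This is an assumption-laden toy: `pick_region` and the health-check callable are hypothetical names, and a real setup would have the DNS provider run the checks rather than application code.

```python
from typing import Callable, Optional, Sequence


def pick_region(
    regions: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first region, in priority order, whose health check passes."""
    for region in regions:
        if is_healthy(region):
            return region
    return None  # every region failed its check


# Simulated outage: the primary region fails its health check.
regions = ["us-east-1", "eu-west-1"]
down = {"us-east-1"}
chosen = pick_region(regions, lambda r: r not in down)
print(chosen)  # eu-west-1
```

The client only ever talks to the chosen region; keeping the data in the other region current is left to replication at the DB layer, as described above.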

jamesblonde · 2025-10-20T18:47:11 1760986031

It's a market regulation failure. Which results in a failed market, with the cloud infra provider also providing data services. 20 years ago, there were 20+ widely used operational databases. Now, it's like DynamoDB with like half the market.

conductr · 2025-10-20T19:41:09 1760989269

How should this have played out in a regulated market? DynamoDB gets released, then what? Has limits on the market share it's allowed to steal?

Should we similarly cap, say, front-end frameworks on market penetration / growth? Is React too big to fail? Do we need to force some of its users to use something else?

jimbokun · 2025-10-20T20:38:07 1760992687

What would these regulations say, exactly?

toast0 · 2025-10-20T19:11:09 1760987469

It seems that clouds balance their budget on egress charges... which leads to cross cloud communication being too expensive to set up multi cloud redundancy. Cross region redundancy is often too expensive too. Even cross availability zone redundancy is too expensive for some clouds and applications. (Cross region redundancy in a single cloud doesn't always work out, if the cloud has an outage on a global subsystem, or the broken subsystem gets pushed to multiple regions before exhibiting symptoms)

Additionally, moving your load to a different cloud can be challenging while one is down. It ends up being a lot of work that pays off for a few hours a year. For a lot of applications, it's better to just suffer the downtime and spend money on other things.

dylan604 · 2025-10-20T18:45:20 1760985920

If you're a company providing services to people that already have data stored in VendorA's cloud, being on a different cloud would be expensive and prevent you from winning much work. If it turns out that VendorA happens to be the vendor for your clients, you build your services to run on VendorA's cloud too.

This is the situation for my company that started with the intent of being platform agnostic, but it quickly became much less complex as all of the potential client pool was using the same cloud. People with buckets with large amounts of data are not going to be able to convince the bean counters that it would be worth it to have that storage bill from multiple vendors.

conductr · 2025-10-20T19:35:49 1760988949

> are not going to be able to convince the bean counters that it would be worth it to have that storage bill from multiple vendors

Because it rarely is. Occasional downtime is just a cost of doing business. It is, or should be, rare enough that you just take it as it comes instead of trying to have a redundancy. We don't build tunnels everywhere as a backup for surface roads on snowy days. We just cancel school and work for the day and make up for it later. Do some important things get impacted? Sure, but most things are not as mission critical as we make them out to be. The press coverage of an AWS outage makes it so easy to shrug it off and point fingers.

justapassenger · 2025-10-20T19:15:59 1760987759

There’s a huge difference between “similar” and “works and is ROI positive for my business across the whole lifecycle”.

Multi cloud redundancy is like Java being a solution to platform independency.

sumtechguy · 2025-10-20T18:17:07 1760984227

Many companies' idea of a disaster plan is to make it after the disaster.

You have to build it in. That takes time, money, and training. Do you do failovers? Do they work? What is your backup situation? What is your list of work items to do during the failover? How long does it take? Do you even HAVE a failover plan? Can your services handle being in 'split brain'? Do you have specialty services that can only run in one place?

The unfortunate reality is this planning happens many times too late.

TZubiri · 2025-10-20T18:53:05 1760986385

It feels like a hat on a hat. Cloud systems are already designed for redundancy; adding a redundant layer on top of that is like a double condom, or investing in multiple investment funds.

rubiquity · 2025-10-20T18:00:58 1760983258

Networking leaving the cloud provider (or even just to another zone on the same cloud) is $0.02 per GB. That adds up real fast.
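To put a rough number on that, here is a back-of-the-envelope sketch. It assumes the $0.02 figure quoted above applies as a flat per-GB rate (real pricing is tiered and varies by provider and route), and the 5 TB/day replication volume is a made-up workload.

```python
# Flat rate taken from the comment above; not a published price sheet.
RATE_PER_GB = 0.02


def monthly_egress_cost(
    gb_per_day: float, rate_per_gb: float = RATE_PER_GB, days: int = 30
) -> float:
    """Cost of moving gb_per_day out of the provider every day for `days` days."""
    return gb_per_day * days * rate_per_gb


# Hypothetical workload: replicating 5 TB (5000 GB) per day to a second cloud.
cost = monthly_egress_cost(5000)
print(f"${cost:,.2f} per month")
```

Under those assumptions the transfer alone comes to about $3,000 a month, before paying for storage or compute on the receiving side.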

ryandvm · 2025-10-20T18:33:24 1760985204

Man, I did not have "AWS us-east-1 will only have TWO 9s this year" on my bingo card.
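For scale, the yearly downtime budget implied by a given number of nines can be worked out directly. A small sketch, assuming a 365-day year; the function name is just for illustration, and nothing here measures the actual outage.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_budget_hours(nines: int) -> float:
    """Allowed downtime per year for `nines` nines (2 means 99%, 3 means 99.9%)."""
    unavailability = 10 ** -nines
    return HOURS_PER_YEAR * unavailability


for n in (2, 3, 4):
    print(f"{n} nines: {downtime_budget_hours(n):.2f} hours/year")
```

Two nines leaves room for roughly 87.6 hours of downtime a year, three nines for under 9 hours, and four nines for under an hour.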

aurumque · 2025-10-20T18:42:14 1760985734

For those of us who have been using AWS for almost 20 years now, I can't imagine why anyone would willingly choose us-east-1 for anything. It is the oldest, highest traffic, most critical path region and is subject to turbulence.

tlogan · 2025-10-20T18:59:51 1760986791

I think it is a little complicated. For example, your service might have full failover, but you use an API from another service that is down.

Or you might use BART to come to work and you got stuck: https://www.kqed.org/news/12060687/bart-resumes-service-but-...

dingnuts · 2025-10-20T19:21:18 1760988078

ha! I saw another comment on here talking about how ec2 doesn't need to be held to the same standard as the power company because it's not as important as real infrastructure.

wish I'd already had this link in my back pocket. our industry needs to take its job, as a whole, much more seriously.

captainkrtek · 2025-10-20T19:36:34 1760988994

“Global” and “edge” services such as IAM, Route53, CloudFront and so on have dependencies on us-east-1, so even if you don’t think you do, you probably do.

interroboink · 2025-10-20T18:54:02 1760986442

By some logic, that would mean it is the most battle-tested and highest-stakes (and therefore most carefully-managed) choice. I.e. reasons in favor.

Not that I disagree with you, but maybe not for the reasons you say (:

swiftcoder · 2025-10-20T19:11:18 1760987478

> By some logic, that would mean it is the most battle-tested and highest-stakes (and therefore most carefully-managed) choice

As someone who used to work on the inside, us-east-1 has the biggest pile of legacy workarounds for internal AWS issues, it has a variety of legacy API behaviours that don't exist in other regions, and because everyone picks it as the default, it has significantly more pressure on contested resources (i.e. things like spot instance pools).

Plus since it's the default in all the tooling, if you ever decide to go multi-region, you'll find tons of things break right away.

bongodongobob · 2025-10-20T19:41:07 1760989267

Well, we didn't, but some of our third party software did. Hard to avoid.

morshu9001 · 2025-10-20T18:51:10 1760986270

It can make sense to depend on the thing that will attract massive worldwide attention if/when it goes down. Or, more likely, it's just a default people don't change.

TZubiri · 2025-10-20T18:51:40 1760986300

Wait, was the whole region affected? Like even if you had an EC2 instance?

mads_quist · 2025-10-20T19:01:24 1760986884

No, we run on US East 1 but only EC2. Everything was running smoothly!

mads_quist · 2025-10-20T19:05:31 1760987131

Our strategy has always been to use as few higher abstractions from cloud providers as possible. Glad we went this way, saved us quite a bunch of SLA breaches today! I am confident to say that it's "best of both worlds". We get great availability zone redundancy from AWS without having to rely on and pay for all that PaaS stuff the cloud giants offer. Also, we can "fairly easily" migrate to any other cloud provider because we only need Debian instances running.

bigstrat2003 · 2025-10-20T19:43:52 1760989432

Yes, it was. We have EC2 instances that we turn on as-needed, and at times were unable to start said instances.

dijit · 2025-10-20T18:10:20 1760983820

And we lean into it by saying "Well, if everyone else is down, I get a free pass".

(which, is not true in reality if you have ordinary customers).

cpncrunch · 2025-10-20T18:45:32 1760985932

"The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers."

https://health.aws.amazon.com/health/status?path=service-his...

KronisLV · 2025-10-20T20:03:40 1760990620

So, how many people will actually switch their setups to multi-cloud as a consequence of this? How many will move over to self-hosting? Or will they just do a post-incident report, wave hands around and do nothing?

Because I think it's very much the same way as it is with Cloudflare - while the large vendors aren't always openly hostile, we can just smile and hope that they don't get too keen on reminding us that they're holding us hostage.

I don't see that changing anytime soon. I've personally also used Hetzner, Contabo, Scaleway, Vultr, DigitalOcean, Time4VPS and some other platforms, but when people couple their setups to CF/AWS/GCP/Azure, typically that coupling is hard to get rid of and doing so is hard to justify.

SkyPuncher · 2025-10-20T20:27:42 1760992062

For most companies, I suspect this will actually re-affirm _not_ switching to multi-cloud.

Lots of businesses will be completely forgotten as having had an outage today, because all of their customers were dealing with their own outages and outages in dozens of other providers.

Obviously, that doesn't fly for everyone.

1970-01-01 · 2025-10-20T20:11:32 1760991092

GCP and Azure should be running a 10% sale/discount (Coupon code: RAINYDAY) for new accounts during the week of an AWS outage. The bean counters would take note.

jimbokun · 2025-10-20T20:32:58 1760992378

Nobody ever got fired for buying IBM…

…no, Microsoft…

…no, AWS.

Aeolun · 2025-10-20T18:30:43 1760985043

It’s only a single region. If anything it shows how many people just double down on the default without any redundancy.

arbll · 2025-10-20T18:50:54 1760986254

A single region that is a SPOF for global AWS services*

starman55 · 2025-10-20T22:27:35 1760999255

Are us-east-2 services impacted today? Which ones?

stronglikedan · 2025-10-20T19:13:26 1760987606

> It’s only a single region

Which was effectively the only region

jimmar · 2025-10-20T19:27:33 1760988453

My company has been ahead of all of this by causing outages in our own data center without waiting for the cloud to do it for us.

On a serious note, resiliency takes effort and investment no matter where you host your content.

binary132 · 2025-10-20T18:24:44 1760984684

Wow, thanks experts! I never could have figured this out without you :)))

patrickmcnamara · 2025-10-20T18:49:13 1760986153

This article isn't written for you. It's written for my mom, etc.

hippo77 · 2025-10-20T20:33:35 1760992415

Surprised to see an article like that even getting shared here. The Guardian seems to be wrong on almost every tech issue.

binary132 · 2025-10-20T21:16:49 1760995009

Does your mother frequent hackernews?

patrickmcnamara · 2025-10-21T20:41:30 1761079290

This article was written for The Guardian, not Hacker News.

binary132 · 2025-10-22T00:22:11 1761092531

yet here it is, posted on hackernews

patrickmcnamara · 2025-10-22T06:35:30 1761114930

So why complain about the experts?

esafak · 2025-10-20T18:32:39 1760985159

There has to be an Onion article for this.

01HNNWZ0MV43FF · 2025-10-20T19:01:06 1760986866

"No way to prevent this, says only region where this regularly happens"

physicsguy · 2025-10-20T17:57:17 1760983037

We don’t use AWS at work but we still experienced disruption because lots of our customers do, and use it to transfer data to us. That means we then saw an uplift in data transfers as their systems came back online.

There is no panacea. The reason many people use these is because it’s easy, and it’s hard to find people who know other clouds and their quirks.

racl101 · 2025-10-20T18:02:19 1760983339

I find it weird many people are just realizing this. I've had this conversation when talking about what should happen if a couple of bad earthquakes, not even "the big one", were to occur.

But on the other hand, maybe I hang around too many tech people to not empathically understand the other point of view.

morshu9001 · 2025-10-20T18:14:24 1760984064

We've seen big outages already but nothing that lasts too long. If an outage became prolonged enough, people would find solutions. We don't know what this massive outage would even look like, so whatever preparation you do, it might still break.

Also there are some outages that affect real life like airlines, but tech news overstates some like Facebook. It turns out that FB and IG can be totally broken for a whole day, the world will keep spinning, and they won't even lose users.

jraph · 2025-10-20T19:05:52 1760987152

I think many (most?) non tech people don't even know that Amazon is first and foremost a cloud provider (and one of the biggest at that, if not the biggest) and that its market thing is almost a side activity at this point.

bee_rider · 2025-10-20T18:06:51 1760983611

US east is pretty geologically stable I think.

JadoJodo · 2025-10-20T19:33:57 1760988837

> "Also in the UK, Ring users complained on social media that their doorbells were not working."

I sincerely hope that the base functionality of these doorbells (i.e., triggering the ringing of the bell within the home) is preserved in the event of an internet outage.

jesterson · 2025-10-21T00:43:35 1761007415

This is not a provider scarcity problem (there are numerous providers out there) but a user problem: they voluntarily choose crappy service at large scale, believing sales managers who say "it's reliable".

0xbadcafebee · 2025-10-21T22:55:52 1761087352

It is reliable. Even considering the inflated availability numbers, it's stupidly reliable.

jesterson · 2025-10-22T02:46:03 1761101163

Recent (and not so recent) events prove it isn't, or is it?

0xbadcafebee · 2025-10-22T04:02:36 1761105756

Terms like reliability have specific definitions in computer systems:

  Term           | Definition                          | Measurement
  -------------- | ----------------------------------- | -------------------------------------------
  Availability   | Basically, system uptime            | A percentage over time
  Durability     | Basically, persistence of data      | A percentage over time
  Resiliency     | Basically, self-healing             | A probability within a time period (usually)
  Reliability    | Basically, operational probability  | A probability within a time period (usually)
  Fault tolerant | Basically, it cannot fail           | Binary (it has faults or it doesn't)

Unlike more mathy fields, reliability is more of a "quality" that is qualified by one or more measurements (like Mean Time Between Failure). You define your metric, you give an estimate of what that value should be, and if you come in under it, you're reliable.

AWS has always stretched the truth when it comes to these numbers, but they do come pretty close to them most of the time. If you can find a different provider who'll even offer a number, it is usually not as close, and there's usually no contract that has any teeth to enforce it. Or they'll give very vague claims that don't get into specifics.

At least, not for "cloud providers" (other than the hyperscalers). You can find a datacenter who'll give you a number, but that's for like, their power reliability. That's a very different thing than saying "there is X probability over Y time that a server I run for you will not go down". Partly because it's pretty freakin' hard to wrangle all the different things that can go wrong with so much certainty that you can put a number on it. So most people give things like reliability, durability, availability, etc numbers for specific components of a system.

AWS S3 offers 99.999999999% durability and 99.99% availability. Now, did AWS S3 go down completely during the outage? Not as far as I'm aware. Maybe the control plane did, or a management portal, or billing, or something? But I'll bet you the PUT, GET, DELETE operations kept on flowing within 99.99% availability. Some other components in AWS may have been failing like crazy (which may have no guarantees...), but that one component probably stayed up within its guaranteed amount.
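For a sense of what those two percentages allow, here is a minimal Python sketch of the arithmetic. It assumes both figures are measured over a 365-day year and that object losses are independent. The function names are made up for illustration and are not part of any AWS tooling.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes per year a service can be down and still meet the target."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


def expected_objects_lost_per_year(durability_pct: float, objects: int) -> float:
    """Expected number of objects lost per year (assumes independent losses)."""
    return (1 - durability_pct / 100) * objects


if __name__ == "__main__":
    # 99.99% availability and 11 nines of durability, as quoted above.
    print(f"{allowed_downtime_minutes(99.99):.1f} min/year of downtime")
    lost = expected_objects_lost_per_year(99.999999999, 10_000_000)
    print(f"{lost:.6f} objects/year lost out of 10 million")
```

Under these assumptions, 99.99% availability leaves room for a bit under an hour of downtime per year, and eleven nines of durability on ten million objects works out to roughly one lost object every 10,000 years.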

Design your apps to run on AWS using the components with specific guarantees, and you can estimate how reliable your end product will be. As far as I know, nobody has a better track record for meeting the guarantees. Even considering events like this.
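A rough way to do that estimate, sketched in Python: if a request needs every component to be up, multiply the component availabilities together. This assumes failures are independent, which large outages often violate, and the figures below are hypothetical placeholders rather than published SLAs.

```python
from math import prod


def serial_availability(components: dict[str, float]) -> float:
    """Availability of a request path that needs every component to be up.

    Assumes component failures are independent of each other.
    """
    return prod(components.values())


# Hypothetical figures for illustration only, not published SLAs.
stack = {
    "cdn": 0.999,
    "compute": 0.9999,
    "database": 0.9995,
    "object_store": 0.9999,
}

if __name__ == "__main__":
    total = serial_availability(stack)
    print(f"estimated end-to-end availability: {total:.4%}")
```

The estimate is always lower than the weakest component, so the component with the loosest guarantee (or no guarantee at all) sets the ceiling for the whole product.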

jesterson · 2025-10-29T04:56:27 1761713787

> Design your apps to run on AWS using the components with specific guarantees, and you can estimate how reliable your end product will be

Alternatively, don't waste time and design your product without AWS but with failover across several telcos.

judahmeek · 2025-10-20T20:40:57 1760992857

I recall reading that when the costs of distribution (but not the costs of discoverability) are low, generally you end up with a power law sort of distribution of consumers to providers, where provider #1 has exponentially more market share than provider #2 and provider #2 has exponentially more market share than provider #3, #4, etc.

Examples of this are Windows/Mac, McDonalds/Burger King, Playstation/Xbox, Nvidia/?, AWS/Azure?, Android/iPhone, etc...

Basically, the majority of users all using the same dependency/platform/product is basic economics.

ChrisArchitect · 2025-10-20T18:09:16 1760983756

More discussion: https://news.ycombinator.com/item?id=45640838

xp84 · 2025-10-20T20:17:13 1760991433

In 2011 there was some kind of big outage at some major AWS US-east pop. I started a job at a company (very boring B2C startup) which had taken the lesson from that, that "cloud anything is dangerous."

They went and bought a bunch of literal servers and installed them in a datacenter, 90 miles away from our offices, and this is where all our applications ran for the remainder of that company's existence (about 6 more years). For the whole time I was at that company, we had somewhat more, and usually more lengthy, outages than the average startup. The only difference is that when some piece of networking gear took a crap, or a disk failed, or whatever, our guys had to diagnose and resolve it (Their karma, I guess, since this was their idea).

Anyway, I do think it would be good if at least so-called 'tech companies' had a little less of an obsession with outsourcing everything -- even easy things -- to AWS, GCP, and Azure. I feel that way mainly for cost reasons, as many of these services are wildly overpriced. But we also shouldn't kid ourselves by ignoring the advantages of operating at the scale those guys do. They can afford to have multiple absolute wizards available around the clock who make sure that when a problem happens, it's not the kind of "S-show" we had at my old company, where we were all in a Slack room or Zoom or whatever, just guessing at what to try for half an hour before we could figure out what the actual issue was.

robomc · 2025-10-20T20:40:10 1760992810

This. And when a service goes down it's a lot easier to explain to your client/boss that "half the internet is down" than "our boutique solution is broken so it's just us actually".

999900000999 · 2025-10-20T20:24:21 1760991861

I largely agree with you. When AWS goes down, for most situations I can just go outside and smoke a cigarette and not worry about it.

It's someone else's problem.

shadowgovt · 2025-10-20T18:29:13 1760984953

Sure. Are the "experts" going to pony up the cash to build in redundancy, or change the market fundamentals that make it make more sense for a startup to rush to product on a shoestring and then keep adding features instead of building against not-yet-happened failure modes?

If not, I look forward to the next single-point-of-failure outage. And the next. And the next.

0xbadcafebee · 2025-10-20T18:02:24 1760983344

This is what I call "fool's availability": reducing single points of failure (one cloud provider) without adding any actual redundancy.

If you removed AWS/GCP/Azure/etc and just had 100 small providers scattered all over, the result would be hundreds of outages throughout the year, as opposed to one big outage every other year [in one region]. AWS is already way more reliable than any other provider.

The real problem here is that companies that use AWS are morons who don't know how to architect/build infrastructure properly.

If it's important, it should be built right, regardless of who the provider is. A software building code would mandate how companies could use infrastructure (AWS or any provider) so that important services would not go down when one service or region goes down.

This is the basic concept behind things like the electrical code. It doesn't matter how great a public utility is; if your business is wired up so badly that a stiff breeze sets it on fire, just switching utilities isn't gonna help. And some utilities do occasionally have problems that persist down their lines to the customers, so customers need to set up equipment to protect against those failures. Whole-house surge protectors, lightning arresters, EMP shields, etc are necessary so that a rare event doesn't fry expensive customer equipment.

dudeinjapan · 2025-10-20T18:21:19 1760984479

It's probably worse: a given stack using several of these small providers will probably have more “single points of failure” (providers used in series rather than in parallel).

(If most companies liked using cloud providers in parallel, they’d already be doing it today between AWS, Azure, and GCP.)
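The series-versus-parallel distinction can be made concrete with a small Python sketch. It assumes each provider fails independently and uses a made-up 99% availability per provider. The function names are illustrative only.

```python
def in_series(availabilities: list[float]) -> float:
    """Every provider must be up, so each one is a single point of failure."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def in_parallel(availabilities: list[float]) -> float:
    """At least one provider must be up, i.e. real redundancy."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1 - a
    return 1 - all_down


if __name__ == "__main__":
    providers = [0.99, 0.99, 0.99]  # hypothetical per-provider availability
    print(f"series:   {in_series(providers):.6f}")
    print(f"parallel: {in_parallel(providers):.6f}")
```

With three such providers chained in series, the stack is less available than any one of them. Only the parallel arrangement improves on a single provider, and it is also the one that costs the engineering effort.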

morshu9001 · 2025-10-20T18:53:34 1760986414

Yes but most of those companies aren't morons, they're just taking an acceptable risk. Multi-region or multi-cloud setup is nontrivial.

0xbadcafebee · 2025-10-21T00:07:24 1761005244

Most companies I've worked for (and have heard about from others) have either lacked the knowledge, or the will, to evaluate risk. They build things until they "just work", and their thought process ends there. They don't examine the design to identify its reliability and security risks. They don't calculate the losses. They still have issues, but they just happen to be acceptable most of the time.

Example 1: A company's infra goes down, but it doesn't come back up correctly. People run around trying to get it working again. It takes much longer than they hoped/expected, and they lose a lot more money than they expected. This is because they never really understood the risk they were exposed to. If they understood it, they would have done more ahead of time to mitigate that much risk.

(today's outage is this case. A lot of companies are going to lose money after today, because their customers are not happy with these "acceptable risks". Presumably, losing this much money due to one outage will not be an acceptable risk in hindsight. So the company either didn't understand its risk, or it did but was too stupid to prevent it)

Example 2: A company gets hacked, and its data is either exposed or wiped. This is a much worse result; they can lose tons of money, chase off customers, damage their brand, open them up to lawsuits and fines, even tank the whole company. It's clear that this risk is pretty unacceptable. But it keeps happening. And the reason usually isn't "some genius hacker"; it was a lack of understanding the risk of not investing in security.

(there's tons of examples of these in the news. presumably, not investing in security was not an acceptable risk in hindsight when it ended their business! almost always, the people involved in making these products don't know enough about security to understand the risks. but they also don't invest in security training, mandatory security controls, checklists, processes, quality gates, etc)

You don't need multi-region or multi-cloud to mitigate reliability risks. Just like you don't need to hire a big security team or invest tons of cash to mitigate security risks. You can use your existing infra and tools, and mitigate both issues. You just have to use them wisely. It takes some effort and time, but you do it once and it pays dividends indefinitely.

Building something without identifying its security/reliability risks, and then not calculating those risks' impact, is not acceptable risk; it's ignored risk. Is tanking your company and shedding customers an acceptable risk? Well, there's one way to find out.

morshu9001 · 2025-10-21T15:16:17 1761059777

Being immune to this outage would've required cross-region failover. We'll see if customers switch to whichever companies were resilient, but this has happened before and the answer was no.

0xbadcafebee · 2025-10-21T22:37:03 1761086223

The company I work for has a ton of stuff in us-east-1, many large products and sites, and we didn't go down. Our products/services aren't multi-region or multi-cloud. We don't pay exorbitant bills or have super complicated architectures.

morshu9001 · 2025-10-22T00:46:50 1761094010

If you were using AWS services that went down in us-east-1, how did you avoid an outage without failing over to anything outside that region?

0xbadcafebee · 2025-10-22T16:05:27 1761149127

That's the thing - most AWS services didn't "go down", as in stop working entirely. There were specific operations of specific services that were failing. Increased API error rates, inability to start new EC2 instances, billing metrics unavailable, AWS console unavailable, etc.

The outage wasn't like "all our servers stopped running". It was dynamic, new, specific operations that failed. If you just had a Fargate container that was started a week ago, and you have no need to restart the container today, it just kept chugging along.

Our architecture is stuff that just keeps chugging along. Fargate, S3, RDS, CloudFront, CloudFlare, etc. From our perspective, there was no outage in us-east-1. Literally the only alert we got the entire time was "billing limit exceeded" - and that was a false alarm, because it was set to alarm if there is zero billing data.

morshu9001 · 2025-10-24T00:59:24 1761267564

But is this strategy or luck? I'm not seeing how those many companies did something dumb or wrong here while you did it right. Like are they only affected because they overcomplicated their deployments? Either way, your service isn't resilient against a generalized regional outage it sounds like.

greenavocado · 2025-10-20T19:02:20 1760986940

The only reason we can't leave AWS is because we have 500 terabytes of data in S3

jewel · 2025-10-20T19:15:02 1760987702

Talk to the other vendors. I know of a place that had about that same amount and decided to have a redundant copy of all of their data in another vendor's S3-compatible product. That vendor paid for all of their egress fees as long as they signed a 12-month contract and used their tool for the migration.

coredog64 · 2025-10-20T19:29:28 1760988568

AWS will credit your egress fees if you incur them via leaving.

https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-i...

ovaistariq · 2025-10-20T20:08:51 1760990931

What other AWS services do you depend on?

greenavocado · 2025-10-20T20:19:28 1760991568

Mostly EC2 for data mining terabytes of historical data stored in S3. Production usage is fairly lightweight compared to the EC2 and S3 stuff. We did cut our bill a lot by moving to single AZ redundancy.

halis · 2025-10-20T18:07:30 1760983650

Just need to retire the us-east-1 region, it's becoming a meme at this point.

dabinat · 2025-10-20T19:58:30 1760990310

This is coming right after we switched back to AWS after trying to switch storage to Cloudflare R2. Even with this outage, I still consider AWS more reliable than Cloudflare.

_pvzn · 2025-10-20T19:28:29 1760988509

The "experts" should lay out a good alternative in that case. Smaller providers also run into outages.

dexterdog · 2025-10-20T22:39:25 1760999965

And they all get to claim that they have better uptime to potential customers because nobody other than their current customers remembers their outages.

boznz · 2025-10-20T18:56:45 1760986605

I've really got to get me one of these 'expert' job gigs!

saltysalt · 2025-10-20T19:34:35 1760988875

AWS is this generation's mainframe. /joking

labrador · 2025-10-20T17:54:29 1760982869

This new post is interesting: https://news.ycombinator.com/item?id=45646777

"October 17, 2025, was my last day at Amazon Web Services... CloudFront is a CDN, a content delivery network, or, simply put, a large distributed cache for your cat photos. And a very successful one. Something like 30% of all internet traffic goes through CloudFront in one way or another. Pretty cool, huh? In practice, this means that with any change, you have a chance of crashing 30% of the internet."

Yokolos · 2025-10-20T18:14:06 1760984046

Ngl, that sounds like my dream job.

ta1243 · 2025-10-20T18:38:42 1760985522

> Something like 30% of all internet traffic goes through CloudFront in one way or another. Pretty cool, huh?

No. No it's not. But tech enthusiasts on HN and Reddit love it.

(Another 30% runs through cloudflare)

Jzush · 2025-10-20T18:20:44 1760984444

If only there was a system of computers on the Internet that was distributed across the world where we could host things instead of all in one location. We could call it the "cloud".

ta1243 · 2025-10-20T18:57:06 1760986626

We could connect distributed computers on distributed networks together using some form of internetworking protocol.

Jzush · 2025-10-27T08:06:53 1761552413

Like some kind of interconnected network. We could call it connetwork or connet for short. We’ll be rich!

heavyset_go · 2025-10-20T18:32:44 1760985164

It makes us vulnerable to a centrality attack, either foreign or domestic. If someone wants to fuck society up, taking out only a handful of data centers, routers, networking junctions, etc. could do it.