AWS outage shows internet users 'at mercy' of too few providers, experts say (theguardian.com)
258 points by evolve2k 47 days ago | 215 comments


Related ongoing thread:

AWS Multiple Services Down in us-east-1 - https://news.ycombinator.com/item?id=45640838 - (1650 comments so far)


The 'experts' also made similar criticisms with the Fastly outage in 2021 and did anything obvious change as a result? In a week's time no national newspapers will be talking about this.

Meanwhile, everyone that spends actual time in these areas:

- Knows that running an operation at AWS scale is difficult and any armchair criticism from 'experts' is exactly that. Actions speak louder than words.

- Understands that the cost of actually accounting for this kind of scenario is incredibly high for the benefit in most cases

- Knows that genuinely 'critical' services (i.e. health) should be designed to account for this, and every other 'serious' issue such as 'I can't log in to Fortnite' just shows what the price and effort of actually making that work is versus how much it costs affected companies when it happens

- Knows how much time national newspapers spend actually talking about the importance of multi-region/multi-cloud redundancy, that is, it's zero until the one day where it happens and then it's old news

- Is just curious as to just what exactly happened from a technical perspective

This isn't to say that a good blameless post-mortem shouldn't happen to figure out process and technical issues, but the armchair criticism with no actual followup? All noise, no signal.


The "experts" in this case are

> Dr Corinne Cath-Speth, the head of digital at human rights organisation Article 19

Dr. Cath-Speth has a PhD in cultural anthropology

> Cori Crider, the executive director of the Future of Technology Institute

A lawyer

> Madeline Carr, professor of global politics and cybersecurity at University College London

A professor. Her bio doesn't say what her degree is in, but she mostly seems to publish in political science and international relations

So, not a single technical expert. Not anyone who has ever run a hosting service before or even worked for one. Just people who write papers and sit around waiting for journalists to call them for quotes.


Do you not think it a bit too hyperbolic to throw scare quotes around experts and imply the only people who can have opinions on systemic risk are software engineers? I don't think it is unreasonable for people who haven't run or worked for a hosting service to have opinions on the policy aspect or economic impact of hyperscalers.


> I don't think it is unreasonable for people who haven't run or worked for a hosting service to have opinions on the policy aspect or economic impact of hyperscalers.

Yeah, that's completely fair. My angle was more that firstly this doesn't come across as an opinion that needs the expert in question, and secondly this is yet another case of 'Talk is cheap, show me the code', particularly when quotes in the article include "We urgently need diversification in cloud computing."

I feel like the 'We' is doing an awful lot of heavy lifting and there's no mention of the costs of taking on such a task.

Additionally, and awkwardly, it's possible to be both a monopoly in the space but also technically a more stable solution, making the cost for competitors or people willing to use competitors doubly high.

Edit: Realised after the fact I'm GP to your post, assumed it was mine, keeping the words anyway.


I don't think anyone needs to produce any code. I've worked at companies with thousands of employees who don't use any cloud services.

It can be done, and contrary to marketing, it's probably cheaper and more reliable.


What code is needed to make a decision to go with a smaller provider instead of AWS?


No, it's 100% appropriate. Anyone can have opinions on anything, but frankly, most of them have little relevance to reality. Their use of the word "expert" is supposed to mean the person has knowledge or expertise that renders their opinion on a subject substantially more valid and relevant than any regular person. That clearly is not the case here. If I wanted to know what a random person on the street thought about a subject, I could go ask one myself. The purpose of news organizations was supposed to be to better-inform people by getting opinions from actual relevant experts in a subject.

These people don't seem to have much ability to discuss relevant subjects like what the actual reliability of lower-tier hosting providers is, the value-add to business and iteration speed of having a variety of extra services (SQS, DynamoDB, VPC, RDS, managed K8s, etc) available, etc.


I don’t think it’s useful at all.

What are they going to say that’s useful for making concrete technical decisions?

They can advise on how to write contracts for dealing with these situations after the fact, I suppose.


Anyone can have an opinion, I never said or implied otherwise. Having an opinion does not make one an expert, hence the scare quotes.

The headline is misleading because when there is news about experts saying something about technology, one would naturally think that they are at least somewhat technical experts. Instead the "expert" is the director of the "Big Tech is Bad Institute" who says that "Big Tech is Bad". And their qualification of being an expert is solely that they are director of the "Big Tech is Bad Institute".


> when there is news about experts saying something about technology, one would naturally think that they are at least somewhat technical experts.

But the experts here are not "saying something about technology". Rather they are saying something about uses of technology. So they don't need to be cloud engineers or know anything about datacenters, at all, really. What would be required (and here you may have a leg to stand on) is expertise in social and economic aspects of (now) critical infrastructure.


And one would hope that the stats being quoted about desktop share were from someone who has been at that research firm in the last 20 years or so. I'm not sure how active he is at all at this point. I have a feeling someone looking for some stats found something old that may or may not have actually had a date on it.

(If I'm wrong mea culpa but I'm pretty sure.)


Opinions are valid but also worthless. Just give me a funny tweet to digest the situation.


[flagged]


The actual experts I was paying attention to said that wearing a K/N-94/95 type mask lowers the statistical rate of transmission, that is, infection of others by your deadly virus.

The subsequent findings are that cloth-type masks are less effective (but not wholly ineffective) compared to clinical/surgical masks at limiting the aerosolized viral shedding from those already infected. So if a cloth mask was all you had, the advice became "please wear it".

Turns out, many people assume advice is only relevant when given for their own direct & immediate personal benefit, so they hear what they want to hear, and even the idea of giving a shit about externalities is sheer anathema. That gets boiled down further for idiot-grade TV and bad-faith social media troll engagement and we wind up with reductive and snarky soundbites, like the remark above, that help nobody at all.

Back on topic, the choice of so-called "experts" in the Guardian's coverage of the AWS matter seems to be a classic matchup of journalistic expediency with self-promoting interests to pad an article that otherwise has little to say beyond paraphrasing Amazon's operational updates.


It's unclear what you're arguing. The leading experts (Fauci/CDC) who most Americans were paying attention to were not providing this shading of meaning which you are trying to impute to them. That would be the case if they said something like N95 masks will provide excellent protection for you from the virus if worn correctly, but we have a shortage, so please make do with alternatives so that health care workers have access to them. That is not what they said. Instead they sacrificed credibility at the altar of expediency to the detriment of future trust.

What's reductive is assuming that people are motivated exclusively by self-interest instead of trusting them to make good decisions when told the truth.


Fauci said the following on 2020-03-08: https://www.factcheck.org/2020/05/outdated-fauci-video-on-fa...

> When you’re in the middle of an outbreak, wearing a mask might make people feel a little bit better and it might even block a droplet, but it’s not providing the perfect protection that people think that it is. And, often, there are unintended consequences — people keep fiddling with the mask and they keep touching their face.

> But, when you think masks, you should think of health care providers needing them and people who are ill... It could lead to a shortage of masks for the people who really need it.

He said that there's a shortage, and that he didn't trust that people would wear the masks correctly. I remember that most of the early anti-mask guidance I heard was claims that they weren't likely to prevent yourself from getting infected because: the mask would become an infectious surface; and people wouldn't handle the mask as infectious.

Opinions started to shift over March, and the CDC put out guidance on 2020-04-03 to wear cloth masks in public. https://www.npr.org/sections/coronavirus-live-updates/2020/0...

> It is mainly to prevent those people who have the virus — and might not know it — from spreading the infection to others.

> U.S. health authorities have long maintained that face masks should be reserved only for medical professionals and patients suffering from COVID-19, the deadly disease caused by the coronavirus. The CDC had based this recommendation on the fact that such coverings offer little protection for wearers, and the need to conserve the country's alarmingly sparse supplies of personal protective equipment.

I used wikipedia for dates and sources: https://en.wikipedia.org/wiki/Face_masks_during_the_COVID-19...


This information was and is widely available.

Your earlier statement was entirely framed in self-interest, so you don’t get to complain about being pulled up on that now.


The self-interest of wanting to be told the truth? Uh, yeah.


Sounds more like you chose to ignore it. My family was wearing medical-grade disposable facemasks and socially distancing from February 2020 on the basis of healthcare advice.

Hunting for a bogeyman in retrospect is the bad-faith narrative of the mediocre culture warrior. Good luck with your undifferentiated rage or whatever.


Good for you. Nonetheless a non sequitur in this discussion.


That’s certainly true. Face masks are not relevant to AWS outages.


Right?! Same with seatbelts. I don’t wear mine because there’s obviously still automobile deaths. Experts said seatbelts would protect us from deadly accidents. What else are they wrong about?!


That counterargument might make sense if seat belts were not generally protective in accidents or if experts were telling you to wear crepe paper seat belts instead of nylon ones because the nylon ones were needed elsewhere.


Those comments were made in an information regime that severely censored contrary expert opinion. We had experts in various related fields who were automatically labeled as cranks simply because they disagreed with the social engineering experiment and test run of various social control mechanisms (worldwide ..).


What “experts” can you directly cite? Or is the reality that you recall opinion makers saying “experts” are making clearly unsupported claims?


Was that from actual experts, or bad faith strawman coverage (plenty of that about).

At least in my country, there was sober objective coverage from experts about their purpose and percentage effectiveness at reducing the range and spread of potentially infected droplets. Masks were somewhat effective for filtering incoming droplets, but most effective at containing outgoing droplets. The smaller the viral load you were exposed to the lower your chances of getting infected. Experts never claimed them to be 100% though, it was about reducing transmission rates not absolute protection.

Which is the main reason they're used in surgery too coincidentally (they aren't primarily for the surgeon's protection). Or is that an even longer running conspiracy?


You misunderstood then. It was mainly to protect OTHER people from a virus YOU might be carrying.


Unfortunately this site is full of Americans…


I think your third point is what I've had to attune to when criticizing cloud dependence. I think if your entire source of revenue is dependent on AWS then you should be prepared for 16+ hours of downtime per year. Individuals notice it more when something is down for hours but with good observability I am guessing the business notices it more when performance drags for the other 8742 hours of the year. Bursts of downtime per day can still be attributed to the device, wifi, ISP, or some other intermediary's DNS/BGP.

If your margins are so tight that 16 hours of downtime will bankrupt you then I think either: a) I have no idea how to run a business; or b) you have no idea how to run a business. I'm also biased because I love highly fault-tolerant, geo-redundant, durable systems much more than "good enough for this KPI".
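
To put the "16+ hours of downtime per year" figure above in availability terms, here is a minimal sketch. It assumes a non-leap year of 8,760 hours; the function name `availability_pct` is made up for illustration and is not from any library.

```python
# Convert yearly downtime into an availability percentage.
# Assumption: a non-leap year of 365 * 24 = 8760 hours.
HOURS_PER_YEAR = 365 * 24


def availability_pct(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return availability as a percentage for the given downtime."""
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime must be between 0 and the period length")
    return 100.0 * (1 - downtime_hours / period_hours)


if __name__ == "__main__":
    # 16 hours of downtime in a year
    print(f"{availability_pct(16):.2f}%")  # prints 99.82%
```

So 16 hours a year works out to a bit better than "two and a half nines", which is roughly the budget the comment suggests planning for.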


> but with good observability I am guessing the business notices it more when performance drags for the other 8742 hours of the year

This is a really good point that aligns with my experience. Today's event was LOUD and (compared to other incidents) long, but perhaps not really that long compared to the situation you describe, which for most businesses is going to be more pernicious.

Business intelligence and analytics-type folks at $DAYJOB are _very_ watchful for the year-on-year deviations and even periods where the prediction lines didn't match up for even just a few hours.


I think all of that is mostly irrelevant. You don't need to pay a huge cost to avoid the small benefit, you don't need every service to be resilient to this, or any of that. You just need multiple different providers so that not everyone gets screwed at once.


But that would require companies to actually spend time and money testing and working with either a cross-provider multi-master-type system (with all the associated consistency headaches) or regularly test a functioning disaster-recovery/fallback system.

The time spent on that (let alone cost, for companies with large amounts of data) far outweighs the cost when a single region has an issue of today's scope. And you said it yourself, it's a 'small benefit'. Small benefits sound like exactly the things not worth spending time or money on.

For as much as many companies have had issues today, the daily reality is that these same companies haven't been having issues all the rest of the time (or this wouldn't have felt so shocking) and are likely to be okay with an outage of this scope (plus, everyone's too busy making noise about the issues to be working normally).


There are multiple different providers, with nothing artificially limiting their use. Also idk what's so bad about Fortnite and Snapchat going down at once instead of it being staggered.


Yes but we live in a highly anti-competitive monopolized world now. With more to come under the new admin.


There’s two or three gartner approved ways of doing things for fortune 500 ctos, and f500 wannabes.

It’s not a monopoly but it’s close.


It’s hard to think of anything less monopolized than cloud hosting. There are hundreds of providers.


Yeah right, and how many of them have any substantial customer base compared to AWS and Azure?


For any business that matters, your choices are amazon, google, Microsoft, and that's about it.

I couldn't even name another provider except maybe Hetzner


The three you mentioned have over 60% market share which is why this article exists at all. Knowing what I know about cloud infra, anyone who is actually anyone is hosting on the big three. So it's not just a market share, it's market share + impact / importance.

You could also argue that YT is on GCP (to some level) and that would probably bump that number up much higher.

The vast majority of people hosting things on the internet are on these providers. But you get downvoted for pointing that out now.


> - Knows that running an operation at AWS scale is difficult and any armchair criticism from 'experts' is exactly that. Actions speak louder than words.

NO. From their own reports, clearly AWS is too centralized and dependent on a specific region (us-east-1) and a specific service (DynamoDB). This has been observed for well over 10 years. Why do they stay in this centralized architecture? Cloud services need much higher standards than the average corporation. Just look how they took down 2000+ services for many hours.

[1] https://health.aws.amazon.com/health/status


Even wearing my ex-AWS hat and understanding to some degree the internal complexity of these services, I too am boggled that foundational stuff is still out of Virginia and not a separately operated global region for the subset of control-plane dependencies that can’t be refactored into tolerating eventual consistency (such as parts of IAM).

We always used to talk a lot about minimising blast radius and there’s been enough time, and enough scale, to fix it.

Nevertheless the Guardian’s choice to label self-promoting policy wonks as “experts” is a cringe-inducing reminder that journalists don’t know anything about anything.


I don't deny that an incident of this scope should prompt a serious technical and process review (and as you describe it, it sounds like this is long overdue), however how often does this kind of thing not affect 2000+ services? Companies should be tracking the time they don't have issues as much as the time they do in order to actually understand if they'd be better off elsewhere.

And to be clear, I'm not at all arguing for the monopolisation of cloud providers, only stating that it's easy to point from far away and say 'This is bad' while simultaneously not doing anything to understand the cost and make that change that you say is important, because it's actually costly (in many dimensions) to do.


> Knows how much time national newspapers spend actually talking about the importance of multi-region/multi-cloud redundancy

For the record, multi-region redundancy is moot, and I can't stress it enough. It is not the first time that on the surface it looks like a single region but in fact services in multiple regions are affected.

And multi-cloud hot standby can be terribly expensive, unless your infra is very simple. And it's not easy to get it right either unless you planned for it from day one.


Um.. you don't need to be an expert in security, comp.science or economics to know that putting all eggs in one basket may not be a great idea as it introduces one giant systemic target. If anything, regular people here are uniquely qualified to say something along the lines of:

Oi, this is ridiculous. Maybe more things should be run locally..

FWIW, it was instructive to me as to which companies were not able to function today.


> - Knows that genuinely 'critical' services (i.e. health) should be designed to account for this

yeah but aws advertises as "trust me bro I won't go down for 99.99999%"

I've seen a lot of gov proposals using aws to 'get away with downtime management'
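
For a sense of scale on the "99.99999%" figure quoted above (the commenter's paraphrase, not a published SLA number), a rough sketch of how much downtime per year a given availability target allows. It assumes a 365-day year; `downtime_budget_seconds` is a hypothetical helper name.

```python
# Yearly downtime allowed by an availability target.
# Assumption: a 365-day year (31,536,000 seconds).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_budget_seconds(target_pct: float) -> float:
    """Return the seconds of downtime per year permitted by target_pct."""
    if not 0 <= target_pct <= 100:
        raise ValueError("target must be a percentage between 0 and 100")
    return SECONDS_PER_YEAR * (1 - target_pct / 100.0)


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.99999):
        seconds = downtime_budget_seconds(pct)
        print(f"{pct}% -> {seconds:.1f} s/year ({seconds / 3600:.3f} h)")
```

Seven nines leaves only about three seconds a year, so a multi-hour regional outage is far outside that kind of budget.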


These are Guardian 'experts' so can be safely ignored.


maybe your VC overlords need a reality check?


Because the experts have no say in policy. The only people who have a say are the people bribing (sorry I mean "lobbying") Congress. And even they have very little say because Congress is currently on a hot streak of doing absolutely nothing.


Kieran Healy @kjhealy@mastodon.social

Always worth taking sentences that use “the Cloud” or “the Internet” and try replacing those phrases with “A shed in Virginia” to see how they hold up. “Our service is fully based in a shed in Virginia”; “All my files are in a shed in Virginia”; “A shed in Virginia was designed to survive a nuclear war”, etc.

https://mastodon.social/@kjhealy/115407725852594322


Sounds like a pretty good shed! Like a lot of pithy commentary on the cloud, this ignores the fact the practical alternative to a shed in Virginia for most businesses is a shelf in the supply closet. "Oops, Jim Bob tripped over the power cord, guess we won't get any emails until the IT guy shows up" - this used to be a routine experience.


> "Oops, Jim Bob tripped over the power cord, guess we won't get any emails until the IT guy shows up" - this used to be a routine experience.

You're not entirely wrong, but you're being hyperbolic too. I'm actually curious how old you are / how long you've worked in tech, because I started out pre-cloud and things weren't nearly as bad or as limited as you suggest.

First, on-prem servers are not the only alternative to "cloud." Many businesses, including the ones I worked for, did co-location. The companies owned their own bare metal servers, but would rent a rack in a data centre, and certain things - like the network admin - were entirely outsourced to the data centre / hosting company.

You could also rent managed bare metal servers (you still can). This means that you can pretty much outsource your entire IT department, but you're still not doing cloud services. Meaning you've got bare metal servers, someone you're paying at the hosting company is handling security updates and troubleshooting. You don't get things like auto-scaling or serverless or other cloud features, but you also don't have to worry about Jim tripping over the power cable either.

There's also still virtual servers. Which is basically a VM running on a server that hosts multiple clients.

All of this is to say that the alternative is not "cloud" or "box in a closet." The alternative is "cloud" and a ton of different server options: owned, rented, co-located, on-prem, dedicated, virtual, managed v un-managed (outsource IT vs admin your own) and the list goes on and on.


But is the distinction meaningful? The alternative to a shed in Virginia is a different shed in Montana? I mean sure there are a lot of different sheds out there but they're all still sheds. They're all shared responsibility models where the line is drawn in different areas, some outages will be because of your fuckup, some will be theirs.

Not saying as an industry we shouldn't diversify a little but it doesn't fundamentally change the relationship each company has to their hosting provider.


We run a subset of our CI workload on on-prem workstations because the cost/performance ratio of consumer hardware is so much higher than servers. 1TB NVMe drive, with a 7950x/i9, 64GB RAM and gigabit networking is < $1000. It actually completes our CI job faster than AWS restarts a gpu instance.

100% of our failures with this machine in 2 years have been "carpet cleaners unplugged the machine". Last year we had nobody in the office (due to carpet cleaning). This year we sent someone in straight after the cleaning to fix it.


I've never managed IT professionally myself (pre-cloud or otherwise), so a lot of my information comes from family members who do, but my impression is that bare metal rental and colo centers weren't realistic options for any but the most technically sophisticated organizations. I know schools, stores, even research centers who went straight from on-prem to managed cloud with no real consideration for anything in between.


My first paid position as a software developer was for a small, dot-com startup in Windsor, ON Canada. We co-located in Detroit - which meant border crossings (though this was pre-9/11 so crossing as a Canadian citizen was easier) - and had just a couple of servers on a rack. We were software engineers and had people who knew what they were doing. So yeah we were technically proficient. But I'm not sure I'd call that company among "the most technically sophisticated of organizations". We were tiny. In fact, when I first started working there, we were working out of a house with a workforce of like 25 people max.

When that company went under during the dot-com crash, I started my first business shortly after. It was 2003, I was 21 years old and this business allowed me to work from home and feed my family until I re-entered the job market in 2018. For 15 years, I was a one-person organization, and because my business operated "free" adult-entertainment websites, bandwidth was my most significant expense. For that reason, even when Cloud became a thing (which it wasn't in 2003), I never migrated because of the bandwidth costs alone. Cloudflare was a major game changer but even it didn't exist when I first started out. There were CDNs like Akamai but they were crazy expensive and out of my league. So at its peak, I had about 12 bare metal servers around the world (all rented from the same hosting company - originally called Server Matrix, it then became Softlayer and then was bought by IBM and went to shit and is now IBM Cloud). I admin'd those on top of writing and maintaining all of the code and running the business independently with occasional help from my wife.

I am obviously very technically competent. I'm a Principal Software Engineer today. But technically sophisticated? There wasn't much sophistication about it. I did bare metal servers because it was the only cost-effective way to run my business. It was attainable and it worked. And it worked in a way that Cloud couldn't when Cloud came on the scene - so I never went Cloud with that operation just due to cost alone.


With a gazillion shelves, closets, Jims and cables. So if Fortnite's Jim trips on a wire, Canva's Jim is quietly sipping coffee at his desk.


on the other hand, that's a small price to pay to having total control and physical access to your own infrastructure. if the sysadmin did his job properly, an incident like that shouldn't require anything else but to plug the server back in and hit the power switch. but then if he did his job properly, no one but IT should be tripping on power cables to begin with.


My home infrastructure is immune from the “unplug” problem by being hosted on two different 10 year old £30 raspberry pis in different rooms.

But apparently that’s too hard for the average.


Once had a site wide outage (biggish manufacturing company) of the internet and backup servers because one of the women wanted to plug her hair straighteners in for the xmas party.

In a surprise to literally no one that happening on the last friday before xmas break got my "We need to secure the main comms cabinet" (which had the backup server and main ingress for WAN and was in a separate building on other side of site) item that I'd been asking about for months to the top of the list.

Still one of my favourite "outages" because I got to my desk, turned PC on, no network, walked across the landing into the main office, opened comms cabinet, plugged it back in and was "resolved" before the MD got to my desk.


Largely mitigated by twist lock sockets.


We already have diversification. You can rent a VPS from hundreds of possible companies. And people are very happy with them, it seems every month or two there’s a post here about how some company slashed their cloud bill by switching to a VPS. What we have here is a lock-in and marketing problem.


>You can rent a VPS from hundreds of possible companies. And people are very happy with them, it seems every month or two there’s a post here about how some company slashed their cloud bill by switching to a VPS.

Companies are using higher-level "PaaS" suite of services from AWS such as DynamoDB, RedShift, etc and not just the lower-level "IaaS" such as basic EC2 instances or pure containers. Same "lock-in" situation with using the higher-level services from MS Azure and Google Cloud.

For those dependent on high-level services, migrating to a VPS like Hetzner or self-hosting is not possible unless they re-invent the AWS stack by installing/babysitting a bunch of open-source software. It's going to be a lot more involved than just installing a PostgreSQL db instance on a VPS.


> It's going to be a lot more involved

Yes, and you can't escape that by outsourcing it. The complexity is still there, and it will still bite you when your outsourcer fails to manage it.


Same thing applies to AWS...


I’m not really making a point here as much as an observation, but if my stack that I manage atop VMs in a data center goes down, my customers are pissed at me. If AWS goes down along with half the Internet, my customers are completely sympathetic.


Maybe just for you and after they realize it's part of the ongoing AWS outages, but for most folks, an outage is still their problem, and their SLA, regardless of if it's upstream from them.


I disagree. I think most customers are much more sympathetic to an AWS outage than they are to a self-managed outage. Whether that ought to be the case or not is a different question.


But if your services are up when everyone on AWS is down you look like a wizard.


Unfortunately, people rarely notice when something is working, and the few who do will probably just assume you weren’t on AWS in the first place and move on with their day.


And every comment in those threads is how AWS is webscale and won't go down, while the VPS will have uptime of 1 day a month


Amazon offers VPS as well, EC2 instances, were those affected? I think they weren't.


Our actual running instances were pretty much fine throughout, as was the RDS cluster, but we had no way to launch new instances (or auto-scale), and no way to invoke any of the other AWS services (IAM, SQS, Lambda, etc). Also no cloud watch logs/metrics for the duration, so limited visibility.

Overall not that bad for us, but if you had more high-level service dependencies, there would have been impact.


> While most operations are recovered, requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates.

> We continue to investigate the root cause for the network connectivity issues that are impacting AWS services such as DynamoDB, SQS, and Amazon Connect in the US-EAST-1 Region. We have identified that the issue originated from within the EC2 internal network.

So, kinda? Some global services depend on us-east-1...

> Global services or features that rely on US-EAST-1 endpoints such as IAM updates and DynamoDB Global tables may also be experiencing issues.

Basically, you know it's going to be a bumpy day when us-east-1 has an issue, because your ability to run across regions depends on what the issue is and what the impact is.


The expert opinions are more about geopolitics, like maybe don't have all your country's systems realtime depend on a foreign company.

If you are just one company whose goal is to maximize uptime without bringing in the complexity of multi-cloud, relying on AWS is reasonable. You probably won't get better uptime using something else, you'll only be down at different times than most others, which in most cases is actually worse.


For the kind of person being quoted, the stock in trade is not actually doing anything to fix it, it's in being the person quoted when something goes wrong.


The whole industry walked straight into the cloud service lock-in trap. How would we begin to wind back? I also think Docker is as much to blame as the bigger cloud vendors.


I don't think it wants to. Ask any on-call engineer or support tech how they felt when, after having their phone blow up at 1am because everything is falling apart, they found out that this was an AWS-wide outage.


Why is docker to blame?


It's subjective I guess, but I feel as though containerisation has greatly supported the large cloud vendors' desire to subvert the more common model of computing... Like, before, your server was a computer, much like your desktop machine, and you programmed it much like your desktop machine.

But now, people are quite happy to put their app in a Docker container and outsource all design and architecture decisions pertaining to data storage and performance.

And with that, the likes of ECS, Dynamo, RedShift, etc, are a somewhat reasonable answer to that. It's much easier to offer a distinct proposition around that state of affairs, than say a market that was solely based on EC2-esque VMs.

What I did not like, but absolutely expected, was this lurch towards near enough standardising one specific vendor's model. We're in quite a strange place atm, where AWS specific knowledge might actually have a slightly higher value than traditional DevOps skills for many organisations.

Felt like this all happened both at the speed of light, and in slow motion, at the same time.


Containers let me essentially build those machines but at the actual requirements I need for a particular system. So instead of 10 machines I can build 1. I then don't need to upgrade that machine if my service changes.

It's also more resilient because I can trash a container and load up a new one with low overhead. I can't really do that with a full machine. It also gives some more security by sandboxing.

This does lead to laziness by programmers, accelerated by myopic management. "It works" except when it doesn't. Easier to say you just need to restart the container than to figure out the actual issue.

But I'm not sure what that has to do with cloud. You'd do the same thing self hosting. Probably save money too. Though I'm frequently confused why people don't do both. Self host and host in the cloud. That's how you create resilience. Though you also need to fix problems rather than restart to be resilient too.

I feel like our industry wants to move fast but without direction. It's like we know velocity matters but since it's easier to read the speedometer we pretend they're the same thing. So fast and slow makes sense. Fast by magnitude of the vector. Slow if you're measuring how fast we make progress in the intended direction.


Containers have nothing to do with storage. They are completely orthogonal to storage (you can use Dynamo or RedShift from EC2), and many people run Docker directly on VMs. Plenty of us still spend lots of time thinking about storage and state even with containers.

Containers allow me to outsource host management. I gladly spend far less time troubleshooting cloud-init, SSH, process managers, and logging/metrics agents.


> Containers have nothing to do with storage. They are completely orthogonal to storage

Exactly.

And sure, you can use S3/Dynamo/Aurora from an EC2 box, but what would be the point of that? Just get the app running in a container, and we can look into infrastructure later.

It's a very common refrain. That's why I believe Docker is strongly linked to the development of these proprietary, cloud-based models of computing that place containerisation at the heart of an ecosystem that bastardises the classic idea of a 'server'.

The existence of S3 is one good result of this. IAM, on the other hand, can die in dumpster fire. Though it won't...


> And sure, you can use S3/Dynamo/Aurora from an EC2 box, but what would be the point of that?

An easy API? Easy replication / failover / backups? I would absolutely use S3 even with EC2.

> IAM, on the other hand, can die in dumpster fire.

I’m no great fan of AWS’s approach to IAM, but much of the pain is just the nature of fine-grained / least-privilege permissioning. On EC2 it’s more common to just grant broader permissions; IAM makes you think about least privilege, but you absolutely can grant admin for everything. And as far as a permissioning API goes, IAM is much cleaner/saner than Linux permissions.


I don't see how Docker makes that worse.

Before Docker you had things like Heroku and Amazon Elastic Beanstalk with a much greater degree of lock in than Docker.

ECS and its analogues on the other cloud providers have very little lock in. You should be able to deploy your container to any provider or your own VM. I don't see what Dynamo and data storage have to do with that. If we were all on EC2s with no other services you'd still have to figure out how to move your data somewhere else?

Like I truly don't understand your argument here.


Containerization was basically a way to get rid of the problem of "it works on my machine", mainly the OS version and installed libraries. Plenty of instances where program X will work on system A, but not system B, but program Y works on system B but not A. Or X is supported on Redhat/Ubuntu/etc. but you can't or don't want to build from source.

Even if that is not a problem, you avoid having to install the kitchen sink on your host and make sure everything is configured properly. Just get it working in a container, build an image, and spin it up when you need it. Leaves the host machine fairly clean.

You can run a bunch of services as containers within a single host. No cloud or k8s needed. docker-compose is sufficient for testing or smallish projects.

Also, there is a security benefit because if the container is compromised, the problem is limited to that container, not the entire host.


Been a while since I worked in cloud, but at least when I got out of it, the primitives were all shaping up to be generally very similar.

Did multi cloud redundancy end up being too expensive? Tech didn't line up enough? No good business case?

The elastic cloud story that never was? https://www.slideshare.net/slideshow/pets-vs-cattle-the-elas...

What happened?


The (cognitive) overhead of managing and deploying to multiple clouds usually isn't worth it for most teams. Hiring experts and maintaining knowledge about the ins and outs of two (or more) clouds is less feasible for small, fast moving teams.

Simplicity is linked to uptime, and having a single cloud is a simpler solution.

For large companies, it's mostly cost savings. Easier to negotiate a good discount at N million versus N/2 million.

Besides that no-one ever got fired for picking AWS ;)


Not a justifiable expense when no one else is resilient against their AWS region going down either. Also cross-cloud orchestration is quite dead because every provider is still 100% proprietary bullshit and the control plane is... kubernetes. We settled for kubernetes.


Also if you can't even do cross region, cross cloud won't happen


Cross region isn't simple when you have terabytes of storage in buckets in a region. Building services in other regions without that data doesn't really do any good. Maintaining instances in various regions is easy, but it's that data that complicates everything. If you need to use the instances in a different region because your main region is down, you still can't do anything because those cross region instances can't access the necessary data.


Entire terabytes?! My god, I can only barely fit that onto a single SD card the size of my pinky nail.

It is quite bizarre that such paltry amounts of data and problems with such tiny scale seem to pose challenging problems when done in the cloud.


Such a sophomoric response. It does not matter how large your storage use is exactly. The point is that nobody is going to pay to replicate that data in multiple clouds or within multiple regions of the same cloud provider.

Btw, I'd love to have a link to where I could buy an SD card the size of a pinky nail that holds terabytes of data.


It absolutely matters how large your storage use is. Terabytes of storage is easily manageable on even basic consumer hardware. Terabytes of storage costs just hundreds of dollars if you are not paying the cloud tax.

If you got resiliency and uptime for an extra hundred dollars a year, that would be a no-brainer for any commercial operation. The byzantine kafkaesque horror of the cloud results in trivial problems and costs ballooning into nearly insurmountable and cost-ineffective obstacles.

These are not hard or costly problems or difficult scales. They have been made hard and costly and difficult.


Your pedantry is just boring. Yes, I used the word terabyte instead of, I guess, something more palatable to you for being large. Fine, s/terabyte/exabyte/.

I work with buckets where single files are >1 terabyte. There's more than one of these files, hence terabytes. I'm not going to do a human-readable summary listing of an entire bucket to get the full size. The actual size is irrelevant to the point. When people are spending 5-6 digits on cloud storage per month, they are not going to do it in multiple places. Period. Maybe the new storage unit should just be monthly cloud spend, but then your pedantry will say nonsense like which cloud server, which storage solution type, blah blah blah.


Ah yes, let us just gloss over 6 orders of magnitude when we are discussing cost-effectiveness and feasibility. What is the difference between 100$ and 100,000,000$ of spend really? Basically the same thing.


Yes they exaggerated, it takes several pinky nail sized cards to store several TB. Only 1TB per microSD.


They have them at 2 TB [1] now for just 300$. And SanDisk announced 4 TB last year, but I do not see them for sale just yet.

[1] https://shop.sandisk.com/products/memory-cards/microsd-cards...


Bottom line is that AWS gives you the tools to survive this outage within their own ecosystem.

If there's an issue with relying only on AWS it has not been expressed in this outage.


Exactly what tools help make your large volume of data stored in a down region available to other regions without duplicating the monthly storage fees?


You duplicate the fees. But it's the same or worse trying to do multi cloud.


Which is precisely why it's not done


I seem to recall it was fairly common to have read-only versions of sites when there was a major outage. We did that a lot with deviantart in the early 2000s. Did that fall out of favour, or is it too complex with modern stacks, or?


If only everything was a simple website. You're totally ignoring other types of workflows that would be impossible to use a read-only fall back. Not just impossible, but pointless.


HN does it too, but it's a simple site


I don't think storage cost is the reason, more that it's hard to design for regional failures. DB by itself as one example, cross region read replica usually introduces eventual consistency to a system that'd otherwise be immediately consistent.
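A toy sketch of what that looks like, assuming a made-up in-memory `Primary` and `Replica` (not any real database's API): the replica only sees a write once it has applied the primary's log, so a read in between returns stale data.

```python
class Primary:
    """Hypothetical primary: applies writes immediately and logs them for replicas."""

    def __init__(self):
        self.data = {}
        self.log = []  # ordered list of (key, value) writes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Hypothetical cross-region replica: applies the primary's log asynchronously."""

    def __init__(self, primary):
        self.primary = primary
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def catch_up(self):
        # Replay every log entry we have not applied yet, in order.
        for key, value in self.primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.primary.log)


primary = Primary()
replica = Replica(primary)

primary.write("balance", 100)
stale = replica.data.get("balance")  # read before replication: write not visible yet
replica.catch_up()
fresh = replica.data.get("balance")  # read after replication: write is visible
print(stale, fresh)
```

The window between the write and `catch_up()` is the part an otherwise immediately consistent app has to be redesigned around.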


Well yeah, but that's why we get paid the big bucks right?


We do; a non-tech company's IT dept, not so much.


Thanks for the helpful reply! Do you think that would be still true if one accepted a constraint of the "down" version of the property served had data that was stale, say 24 hours behind what the user would have seen had they been logged in?


Yeah except it would probably be delayed way less than 24h. And then you have to figure out how to merge the data back in after, unless you're ok just losing it permanently. And make sure things are handled right if other healthy DBs point to things in the failed-over DB that disappeared.


Data has a lot of gravity.


All the cloud providers have cheap compute but ludicrously expensive network egress. Trying to multicloud will stick you with a massive traffic bill, which is probably not a coincidence.


It really depends on how you build it. You can architect it for multi-cloud from the top down, where the client/browser talks to one region via DNS with health checks, and replication happens at the DB layer. Your services don't talk cross-region at the service level, so you avoid a lot of cross-region/cloud communication. Most use cases can be addressed this way.
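Roughly, the routing decision looks like the sketch below. The endpoint names and the `is_healthy` probe are hypothetical placeholders, not any DNS provider's API; it only shows the selection logic a health-checked DNS record applies.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint, in priority order, that passes its health check."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None  # every region failed its check


# Hypothetical endpoints: primary first, then a standby in another region or cloud.
ENDPOINTS = ["app.us-east-1.example.com", "app.eu-west-1.example.com"]

# Simulated probe results; a real check would be an HTTP request with a timeout.
status = {
    "app.us-east-1.example.com": False,
    "app.eu-west-1.example.com": True,
}

chosen = pick_endpoint(ENDPOINTS, lambda endpoint: status[endpoint])
print(chosen)
```

Clients only ever talk to the chosen endpoint, so the only cross-region traffic is the database replication.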


It's a market regulation failure. Which results in a failed market, with the cloud infra provider also providing data services. 20 years ago, there were 20+ widely used operational databases. Now, it's like DynamoDB with like half the market.


How should this have played out in a regulated market? DynamoDB gets released, then what? Has limits on the market share it's allowed to steal?

Should we similarly cap, say, front-end frameworks on market penetration / growth? Is React too big to fail? Do we need to force some of its users to use something else?


What would these regulations say, exactly?


It seems that clouds balance their budget on egress charges... which leads to cross cloud communication being too expensive to setup multi cloud redundancy. Cross region redundancy is often too expensive too. Even cross availability zones is too expensive for some clouds and applications. (Cross region redundancy in a single cloud doesn't always work out, if the cloud has an outage on a global subsystem, or the broken subsystem gets pushed to multiple regions before exhibiting symptoms)

Additionally, moving your load to a different cloud can be challenging while one is down. It ends up being a lot of work that pays off for a few hours a year. For a lot of applications, it's better to just suffer the downtime and spend money on other things.


If you're a company providing services to people that already have data stored in VendorA's cloud, being on a different cloud would be expensive and prevent you from winning much work. If it turns out that VendorA happens to be the vendor for your clients, you build your services to run on VendorA's cloud too.

This is the situation for my company that started with the intent of being platform agnostic, but it quickly became much less complex as all of the potential client pool was using the same cloud. People with buckets with large amounts of data are not going to be able to convince the bean counters that it would be worth it to have that storage bill from multiple vendors.


> are not going to be able to convince the bean counters that it would be worth it to have that storage bill from multiple vendors

Because it rarely is. Occasional downtime is just a cost of doing business. It is, or should be, rare enough that you just take it as it comes instead of trying to have redundancy. We don't build tunnels everywhere as a backup for surface roads on snowy days. We just cancel school and work for the day and make up for it later. Do some important things get impacted? Sure, but most things aren't as mission critical as we make them out to be. The press coverage of an AWS outage makes it so easy to shrug it off and point fingers.


There’s a huge difference between “similar” and “works and is ROI positive for my business across the whole lifecycle”.

Multi cloud redundancy is like Java being a solution to platform independency.


Many companies' idea of a disaster plan is to make it after the disaster.

You have to build it in. That takes time, money, and training. Do you do failovers? Do they work? What is your backup situation? What is your list of work items to do during the failover? How long does it take? Do you even HAVE a failover plan? Can your services handle being in 'split brain'? Do you have specialty services that can only run in one place?

The unfortunate reality is this planning happens many times too late.


It feels like a hat on a hat. Cloud systems are already designed for redundancy; adding a redundant layer on top of that is like a double condom, or investing in multiple investment funds.


Networking leaving the cloud provider (or even just to another zone on the same cloud) is $0.02 per GB. That adds up real fast.
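Back-of-the-envelope, taking the $0.02 per GB figure at face value and assuming decimal terabytes (1 TB = 1000 GB); the transfer volumes are made-up examples:

```python
PRICE_PER_GB = 0.02  # USD per GB, the figure quoted above
GB_PER_TB = 1000     # assumption: decimal terabytes


def egress_cost_usd(terabytes: float, price_per_gb: float = PRICE_PER_GB) -> float:
    """Cost in USD of moving `terabytes` of data at a flat per-GB rate."""
    return terabytes * GB_PER_TB * price_per_gb


# Hypothetical monthly transfer volumes.
for tb in (1, 50, 500):
    print(f"{tb} TB -> ${egress_cost_usd(tb):,.2f}")
```

At that rate, keeping 50 TB a month in sync with a second provider is about $1,000 in transfer alone, before paying for the second copy of the storage.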


Man, I did not have "AWS us-east-1 will only have TWO 9s this year" on my bingo card.


For those of us who have been using AWS for almost 20 years now, I can't imagine why anyone would willingly choose us-east-1 for anything. It is the oldest, highest traffic, most critical path region and is subject to turbulence.


I think it is a little complicated. For example, your service might have full failover, but you use an API from another service which is down.

Or you might use BART to come to work and you got stuck: https://www.kqed.org/news/12060687/bart-resumes-service-but-...


ha! I saw another comment on here talking about how ec2 doesn't need to be held to the same standard as the power company because it's not as important as real infrastructure.

wish I'd already had this link in my back pocket. our industry needs to take its job, as a whole, much more seriously.


“Global” and “edge” services such as IAM, Route53, CloudFront and so on have dependencies on us-east-1, so even if you don’t think you do, you probably do.


By some logic, that would mean it is the most battle-tested and highest-stakes (and therefore most carefully-managed) choice. I.e. reasons in favor.

Not that I disagree with you, but maybe not for the reasons you say (:


> By some logic, that would mean it is the most battle-tested and highest-stakes (and therefore most carefully-managed) choice

As someone who used to work on the inside, us-east-1 has the biggest pile of legacy workarounds for internal AWS issues, it has a variety of legacy API behaviours that don't exist in other regions, and because everyone picks it as the default, it has significantly more pressure on contested resources (i.e. things like spot instance pools).

Plus since it's the default in all the tooling, if you ever decide to go multi-region, you'll find tons of things break right away.


Well, we didn't, but some of our third-party software did. Hard to avoid.


It can make sense to depend on the thing that will attract massive worldwide attention if/when it goes down. Or, more likely, it's just a default people don't change.


Wait, was the whole region affected? Like even if you had an EC2 instance?


No, we run on US East 1 but only EC2. Everything was running smoothly!


Our strategy has always been to use as little higher abstractions from cloud providers as possible. Glad we went this way, saved us quite a bunch of SLA breaches today! I am confident to say that it's "best of both worlds". We get great availability zone redundancy by AWS without having to rely on and pay for all those PaaS stuff the cloud giants offer. Also, we can "fairly easy" migrate to any other cloud provider because we only need Debian instances running.


Yes, it was. We have EC2 instances that we turn on as-needed, and at times were unable to start said instances.


And we lean into it by saying "Well, if everyone else is down, I get a free pass".

(which, is not true in reality if you have ordinary customers).


"The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers."

https://health.aws.amazon.com/health/status?path=service-his...


So, how many people will actually switch their setups to multi-cloud as a consequence of this? How many will move over to self-hosting? Or will they just do a post-incident report, wave hands around and do nothing?

Because I think it's very much the same way as it is with Cloudflare - while the large vendors aren't always openly hostile, we can just smile and hope that they don't get too keen on reminding us that they're holding us hostage.

I don't see that changing anytime soon. I've personally also used Hetzner, Contabo, Scaleway, Vultr, DigitalOcean, Time4VPS and some other platforms, but when people couple their setups to CF/AWS/GCP/Azure, typically that coupling is hard to get rid of and doing so is hard to justify.


For most companies, I suspect this will actually re-affirm _not_ switching to multi-cloud.

Lots of businesses will be completely forgotten as having had an outage today, because all of their customers were dealing with their own outages and outages in dozens of other providers.

Obviously, that doesn't fly for everyone.


GCP and Azure should be running a 10% sale/discount (Coupon code: RAINYDAY) for new accounts during the week of an AWS outage. The bean counters would take note.


Nobody ever got fired for buying IBM…

…no, Microsoft…

…no, AWS.


It’s only a single region. If anything it shows how many people just double down on the default without any redundancy.


A single region that is a SPOF for global AWS services*


Are us-east-2 services impacted today? Which ones?


> It’s only a single region

Which was effectively the only region


My company has been ahead of all of this by causing outages in our own data center without waiting for the cloud to do it for us.

On a serious note, resiliency takes effort and investment no matter where you host your content.


Wow, thanks experts! I never could have figured this out without you :)))


This article isn't written for you. It's written for my mom, etc.


Surprised to see an article like that even getting shared here. The Guardian seems to be wrong on almost every tech issue.


Does your mother frequent hackernews?


This article was written for The Guardian, not Hacker News.


yet here it is, posted on hackernews


So why complain about the experts?


There has to be an Onion article for this.


"No way to prevent this, says only region where this regularly happens"


We don’t use AWS at work but we still experienced disruption because lots of our customers do, and use it to transfer data to us. That means we then saw an uplift in data transfers as their systems came back online.

There is no panacea. The reason many people use these is that it's easy, and it's hard to find people who know other clouds and their quirks.


I find it weird many people are just realizing this. I've had this conversation when talking about what should happen if a couple of bad earthquakes, not even "the big one", were to occur.

But on the other hand, maybe I hang around too many tech people to not empathically understand the other point of view.


We've seen big outages already but nothing that lasts too long. If an outage became prolonged enough, people would find solutions. We don't know what this massive outage would even look like, so whatever preparation you do, it might still break.

Also there are some outages that affect real life like airlines, but tech news overstates some like Facebook. It turns out that FB and IG can be totally broken for a whole day, the world will keep spinning, and they won't even lose users.


I think many (most?) non-tech people don't even know that Amazon is first and foremost a cloud provider (and one of the biggest at that, if not the biggest) and that its marketplace is almost a side activity at this point.


US east is pretty geologically stable I think.


> "Also in the UK, Ring users complained on social media that their doorbells were not working."

I sincerely hope that the base functionality of these doorbells (i.e., triggering the ringing of the bell within the home) is preserved in the event of an internet outage.


This is not a provider scarcity problem (there are numerous providers out there) but a user problem: they voluntarily choose crappy service at large scale, believing sales managers who say "it's reliable".


It is reliable. Even considering the inflated availability numbers, it's stupidly reliable.


Recent (and not so recent) events prove it isn't, or is it?


Terms like reliability have specific definitions in computer systems:

  Term           | Definition                          | Measurement
  ---------------|-------------------------------------|---------------------------------------------
  Availability   | Basically, system uptime            | A percentage over time
  Durability     | Basically, persistence of data      | A percentage over time
  Resiliency     | Basically, self-healing             | A probability within a time period (usually)
  Reliability    | Basically, operational probability  | A probability within a time period (usually)
  Fault tolerant | Basically, it cannot fail           | Binary (it has faults or it doesn't)

Unlike more mathy fields, reliability is more of a "quality" that is qualified by one or more measurements (like Mean Time Between Failure). You define your metric, you give an estimate of what that value should be, and if you come in under it, you're reliable.

AWS has always stretched the truth when it comes to these numbers, but they do come pretty close to them most of the time. If you can find a different provider who'll even offer a number, it is usually not as close, and there's usually no contract that has any teeth to enforce it. Or they'll give very vague claims that don't get into specifics.

At least, not for "cloud providers" (other than the hyperscalers). You can find a datacenter who'll give you a number, but that's for like, their power reliability. That's a very different thing than saying "there is X probability over Y time that a server I run for you will not go down". Partly because it's pretty freakin' hard to wrangle all the different things that can go wrong with so much certainty that you can put a number on it. So most people give things like reliability, durability, availability, etc numbers for specific components of a system.

AWS S3 offers 99.999999999% durability and 99.99% availability. Now, did AWS S3 go down completely during the outage? Not as far as I'm aware. Maybe the control plane did, or a management portal, or billing, or something? But I'll bet you the PUT, GET, DELETE operations kept on flowing within 99.99% availability. Some other components in AWS may have been failing like crazy (which may have no guarantees...), but that one component probably stayed up within its guaranteed amount.

Design your apps to run on AWS using the components with specific guarantees, and you can estimate how reliable your end product will be. As far as I know, nobody has a better track record for meeting the guarantees. Even considering events like this.
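As a rough sketch of that estimate: if a request needs every component to be up and failures are independent (a big assumption), the availabilities multiply. The 99.99% is the S3 figure above; the other two numbers are made-up placeholders, not published guarantees.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def serial_availability(availabilities):
    """Availability of a path that needs every component up (independence assumed)."""
    return prod(availabilities)


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical stack; only the object storage figure comes from the text above.
components = {
    "object storage": 0.9999,
    "database": 0.9995,  # placeholder
    "compute": 0.999,    # placeholder
}

total = serial_availability(components.values())
print(f"end-to-end availability: {total:.4%}")
print(f"expected downtime: {downtime_hours_per_year(total):.1f} hours/year")
```

The end-to-end number is always worse than the weakest component, which is why the guarantee on each piece matters more than the headline figure for any one of them.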


> Design your apps to run on AWS using the components with specific guarantees, and you can estimate how reliable your end product will be

Alternatively, don't waste time, and design your product without AWS but with failover across several telcos.


I recall reading that when the costs of distribution (but not the costs of discoverability) are low, generally you end up with a power law sort of distribution of consumers to providers, where provider #1 has exponentially more market share than provider #2 and provider #2 has exponentially more market share than provider #3, #4, etc.

Examples of this are Windows/Mac, McDonalds/Burger King, Playstation/Xbox, Nvidia/?, AWS/Azure?, Android/iPhone, etc...

Basically, the majority of users all using the same dependency/platform/product is basic economics.



In 2011 there was some kind of big outage at some major AWS US-east pop. I started a job at a company (very boring B2C startup) which had taken the lesson from that, that "cloud anything is dangerous."

They went and bought a bunch of literal servers and installed them in a datacenter, 90 miles away from our offices, and this is where all our applications ran for the remainder of that company's existence (about 6 more years). For the whole time I was at that company, we had somewhat more, and usually more lengthy, outages than the average startup. The only difference is that when some piece of networking gear took a crap, or a disk failed, or whatever, our guys had to diagnose and resolve it (Their karma, I guess, since this was their idea).

Anyway, I do think it would be good if at least so-called 'tech companies' had a little less obsession to outsource everything -- even easy things -- to AWS, GCP, and Azure. I feel that way mainly for cost reasons as many of these services are wildly overpriced. But also we shouldn't kid ourselves by ignoring the advantages of operating at the scale those guys do. They can afford to have multiple absolute wizards available around the clock who make sure that when a problem happens, it's not the kind of "S-show" we had at my old company where we're all on a slack room or zoom or whatever and just guessing at what to try for half an hour before we can figure out what the actual issue is.


This. And when a service goes down it's a lot easier to explain to your client/boss that "half the internet is down" than "our boutique solution is broken so it's just us actually".


I largely agree with you. When AWS goes down, for most situations I can just go outside and smoke a cigarette and not worry about it.

It's someone else's problem.


Sure. Are the "experts" going to pony up the cash to build in redundancy, or change the market fundamentals that make it make more sense for a startup to rush to product on a shoestring and then keep adding features instead of building against not-yet-happened failure modes?

If not, I look forward to the next single-point-of-failure outage. And the next. And the next.


This is what I call "fool's availability": reducing single points of failure (one cloud provider) without adding any actual redundancy.

If you removed AWS/GCP/Azure/etc and just had 100 small providers scattered all over, the result would be hundreds of outages throughout the year, as opposed to one big outage every other year [in one region]. AWS is already way more reliable than any other provider.

The real problem here is that companies that use AWS are morons who don't know how to architect/build infrastructure properly.

If it's important, it should be built right, regardless of who the provider is. A software building code would mandate how companies could use infrastructure (AWS or any provider) so that important services would not go down when one service or region goes down.

This is the basic concept behind things like the electrical code. It doesn't matter how great a public utility is; if your business is wired up so badly that a stiff breeze sets it on fire, just switching utilities isn't gonna help. And some utilities do occasionally have problems that persist down their lines to the customers, so customers need to set up equipment to protect against those failures. Whole-house surge protectors, lightning arresters, EMP shields, etc are necessary so that a rare event doesn't fry expensive customer equipment.


It's probably worse: a given stack using several of these small providers will probably have more “single points of failure” (providers used in series rather than parallel.)

(If most companies liked using cloud providers in parallel, they’d already be doing it today between AWS, Azure, and GCP.)
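To put numbers on series vs. parallel, here is a sketch assuming independent failures and a made-up 99.9% availability per small provider:

```python
from math import prod


def in_series(availabilities):
    """Every provider must be up, so availabilities multiply."""
    return prod(availabilities)


def in_parallel(availabilities):
    """Any one provider is enough, so unavailabilities multiply."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical: five small providers, each at 99.9%.
small_providers = [0.999] * 5

series = in_series(small_providers)
parallel = in_parallel(small_providers[:2])  # two of them used redundantly
print(f"five in series:  {series:.5f}")
print(f"two in parallel: {parallel:.6f}")
```

Chaining providers drags availability below any single one of them; only true redundancy pushes it above.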


Yes but most of those companies aren't morons, they're just taking an acceptable risk. Multi-region or multi-cloud setup is nontrivial.


Most companies I've worked for (and have heard about from others) have either lacked the knowledge, or the will, to evaluate risk. They build things until they "just work", and their thought process ends there. They don't examine the design to identify its reliability and security risks. They don't calculate the losses. They still have issues, but they just happen to be acceptable most of the time.

Example 1: A company's infra goes down, but it doesn't come back up correctly. People run around trying to get it working again. It takes much longer than they hoped/expected, and they lose a lot more money than they expected. This is because they never really understood the risk they were exposed to. If they understood it, they would have done more ahead of time to mitigate that much risk.

(today's outage is this case. A lot of companies are going to lose money after today, because their customers are not happy with these "acceptable risks". Presumably, losing this much money due to one outage will not be an acceptable risk in hindsight. So the company either didn't understand its risk, or it did but was too stupid to prevent it)

Example 2: A company gets hacked, and its data is either exposed or wiped. This is a much worse result; they can lose tons of money, chase off customers, damage their brand, open them up to lawsuits and fines, even tank the whole company. It's clear that this risk is pretty unacceptable. But it keeps happening. And the reason usually isn't "some genius hacker"; it was a lack of understanding the risk of not investing in security.

(there's tons of examples of these in the news. presumably, not investing in security was not an acceptable risk in hindsight when it ended their business! almost always, the people involved in making these products don't know enough about security to understand the risks. but they also don't invest in security training, mandatory security controls, checklists, processes, quality gates, etc)

You don't need multi-region or multi-cloud to mitigate reliability risks. Just like you don't need to hire a big security team or invest tons of cash to mitigate security risks. You can use your existing infra and tools, and mitigate both issues. You just have to use them wisely. It takes some effort and time, but you do it once and it pays dividends indefinitely.

Building something without identifying its security/reliability risks, and then not calculating those risks' impact, is not acceptable risk; it's ignored risk. Is tanking your company and shedding customers an acceptable risk? Well, there's one way to find out.


Being immune to this outage would've required cross-region failover. We'll see if customers switch to whatever company was resilient, but this has happened before and the answer was no.


The company I work for has a ton of stuff in us-east-1, many large products and sites, and we didn't go down. Our products/services aren't multi-region or multi-cloud. We don't pay exorbitant bills or have super complicated architectures.


If you were using AWS services that went down in us-east-1, how did you avoid an outage without failing over to anything outside that region?


That's the thing - most AWS services didn't "go down", as in stop working entirely. There were specific operations of specific services that were failing. Increased API error rates, inability to start new EC2 instances, billing metrics unavailable, AWS console unavailable, etc.

The outage wasn't like "all our servers stopped running". It was dynamic, new, specific operations that failed. If you just had a Fargate container that was started a week ago, and you have no need to restart the container today, it just kept chugging along.

Our architecture is stuff that just keeps chugging along. Fargate, S3, RDS, CloudFront, CloudFlare, etc. From our perspective, there was no outage in us-east-1. Literally the only alert we got the entire time was "billing limit exceeded" - and that was a false alarm, because it was set to alarm if there is zero billing data.


But is this strategy or luck? I'm not seeing how all those companies did something dumb or wrong here while you did it right. Like, are they only affected because they overcomplicated their deployments? Either way, it sounds like your service isn't resilient against a generalized regional outage.


The only reason we can't leave AWS is because we have 500 terabytes of data in S3


Talk to the other vendors. I know of a place that had about that same amount and decided to have a redundant copy of all of their data in another vendor's S3-compatible product. That vendor paid for all of their egress fees as long as they signed a 12-month contract and used their tool for the migration.


AWS will credit your egress fees if you incur them via leaving.

https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-i...


What other AWS services do you depend on?


Mostly EC2 for data mining terabytes of historical data stored in S3. Production usage is fairly lightweight compared to the EC2 and S3 stuff. We did cut our bill a lot by moving to single AZ redundancy.


Just need to retire the us-east-1 region, it's becoming a meme at this point.


This is coming right after we switched back to AWS after trying to switch storage to Cloudflare R2. Even with this outage, I still consider AWS more reliable than Cloudflare.


The "experts" should lay out a good alternative in that case. Smaller providers also run into outages.


And they all get to claim that they have better uptime to potential customers because nobody other than their current customers remembers their outages.


I've really got to get me one of these 'expert' job gigs!


AWS is this generation's mainframe. /joking


This new post is interesting: https://news.ycombinator.com/item?id=45646777

"October 17, 2025, was my last day at Amazon Web Services... CloudFront is a CDN, a content delivery network, or, simply put, a large distributed cache for your cat photos. And a very successful one. Something like 30% of all internet traffic goes through CloudFront in one way or another. Pretty cool, huh? In practice, this means that with any change, you have a chance of crashing 30% of the internet."


Ngl, that sounds like my dream job.


> Something like 30% of all internet traffic goes through CloudFront in one way or another. Pretty cool, huh?

No. No, it's not. But tech enthusiasts on HN and Reddit love it.

(Another 30% runs through Cloudflare)


If only there was a system of computers on the Internet that was distributed across the world where we could host things instead of all in one location. We could call it the "cloud".


We could connect distributed computers on distributed networks together using some form of internetworking protocol.


Like some kind of interconnected network. We could call it connetwork or connet for short. We’ll be rich!


It makes us vulnerable to a centrality attack, either foreign or domestic. If someone wants to fuck society up, taking out only a handful of data centers, routers, networking junctions, etc. could do it.


There are many public clouds and VPS providers out there. Who the fuck are these experts?

The real issue is that business pricks will cut costs and single-homing in a single availability zone will be the only workable solution.

On top of that, infrastructure ops are seen as a nuisance who get in the way of the sexy stuff like shipping your latest code changes now. If you complicate the ops pipeline that gets in the way of sexy dev work. So fuck that just ship lol!


providers should stop using just us-east-1 like idiots.


Can someone educate me on the solution to this?

I assume most organizations, both small and large, just host on whatever provider they know or that costs them the least. If you have budget maybe you deploy to multiple providers for redundancy? But that increases cost and complexity.

Who’s going to bother with colo given the cost / complexity? Who’s going to run a server from their office given ISP restrictions and downtime fears?

What is the realistic antidote here?


Companies can architect their backends to be able to fail back to another region in case of outage, and either don't test it or don't bother to have it in place because they can just blame Amazon, and don't otherwise have an SLA for their service.

To fix it, test your failback procedures. For everything else, there's nothing to fix, it's working by design.
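
A minimal sketch of what an exercised failover path might look like, assuming a hypothetical priority-ordered list of regions and a health check passed in as a callable (none of these names come from AWS). The idea is that the fallback choice is a plain function you can run in a drill or a unit test, instead of discovering how it behaves during an outage.

```python
from typing import Callable, Optional, Sequence


def pick_region(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region, in priority order, whose health check passes.

    Returns None if every region fails its check.
    """
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A health check that raises counts as unhealthy.
            continue
    return None


# Hypothetical priority order; the primary comes first.
REGIONS = ["us-east-1", "us-west-2"]


def drill_primary_down(region: str) -> bool:
    """Simulated health check for a drill: pretend the primary is down."""
    return region != "us-east-1"


print(pick_region(REGIONS, drill_primary_down))
```

Swapping `drill_primary_down` for a real probe is the part that depends on your stack; the drill version only proves the selection logic does what you expect.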


> Companies can architect their backends to be able to fail back to another region in case of outage, and either don't test it or don't bother to have it in place because they can just blame Amazon, and don't otherwise have an SLA for their service.

My CI was down for 2 hours this morning, despite not even being on AWS. We have a set of credentials on that host that we call assumeRole with and push to an S3 bucket, which has a lambda that duplicates to buckets in other regions. All our IAM calls were failing due to this outage, and we have 0 items deployed in us-east-1 (we're european)


You likely used a us-east-1 IAM endpoint instead of a regionalized one ( https://aws.amazon.com/blogs/security/how-to-use-regional-aw... ). We've been using it, and we're not experiencing any issues whatsoever in us-east-2.
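
As a rough illustration of the difference, assuming the standard `aws` partition (China and GovCloud use other domains): the legacy global STS endpoint is `sts.amazonaws.com`, while regional ones follow the `sts.<region>.amazonaws.com` pattern. The helper name below is made up for the example.

```python
GLOBAL_STS_ENDPOINT = "https://sts.amazonaws.com"  # legacy global endpoint


def regional_sts_endpoint(region: str) -> str:
    """Build the regional STS endpoint URL for a standard-partition region."""
    if not region or "." in region or "/" in region:
        raise ValueError(f"not a plausible region name: {region!r}")
    return f"https://sts.{region}.amazonaws.com"


# A client in Europe would point at its own region instead of the global name.
print(regional_sts_endpoint("eu-west-1"))
```

With the AWS SDKs and CLI, setting `AWS_STS_REGIONAL_ENDPOINTS=regional` or passing an explicit endpoint URL to the client is the usual way to get this behaviour, though defaults vary by SDK version, so it is worth checking yours.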

One thing that AWS should do is provide an easier way to detect these hidden dependencies. You can do that with CloudTrail if you know how to do it (filter operations by region and check that none are in us-east-1), but a more explicit service would be nice.
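
A minimal sketch of that CloudTrail filter, run offline against an exported log file. It relies on the `Records` array and the `awsRegion`, `eventSource` and `eventName` fields that CloudTrail records carry; the sample data and the function name are invented for the example.

```python
import json
from collections import Counter

# Two made-up records in the shape of a CloudTrail log file ("Records" array).
SAMPLE_LOG = json.dumps({
    "Records": [
        {"awsRegion": "us-east-1", "eventSource": "sts.amazonaws.com",
         "eventName": "AssumeRole"},
        {"awsRegion": "eu-west-1", "eventSource": "s3.amazonaws.com",
         "eventName": "PutObject"},
    ]
})


def calls_in_region(log_text: str, region: str = "us-east-1") -> Counter:
    """Count (eventSource, eventName) pairs recorded in the given region."""
    records = json.loads(log_text).get("Records", [])
    return Counter(
        (r.get("eventSource"), r.get("eventName"))
        for r in records
        if r.get("awsRegion") == region
    )


for (source, name), count in calls_in_region(SAMPLE_LOG).items():
    print(f"{count:>4}  {source}  {name}")
```

Anything this prints for a workload that is supposed to live elsewhere is a candidate hidden dependency.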


We did indeed.

The problem was we couldn’t log into CloudTrail, or the console at all, to identify that, because IAM Identity Center is single-region. This was a decision recommended by AWS, and blessed by our (army of) SRE teams.


But you can run TWO identity centers in different regions for the price of one(1)! IAM IDC is just a regular application hosted on the AWS infrastructure; there's really nothing special about it.

Hindsight is 20/20, of course, but it's good practice to audit CloudTrail periodically for unexpected regional dependencies.

(1) offer void for services that run on AWS.


Indeed. I also noticed this morning that you're not the person I replied to, and I took your response (which was actually helpful) in the context of the original post which was "people are happy to just blame AWS when they're down".

Either way, we would have only made it one step farther in our CI, as the next step is to build a container with a base image from Docker Hub, and that was down too. The idea of running a multi-region Nexus repository to avoid Docker Hub outages for my 14-person engineering team seems slightly overkill!


The easiest way to provide some resilience to the build process is to add a pull-through cache using AWS ECR. It might backfire due to egress costs, though, if you're building outside the AWS infrastructure.

It's actually an interesting exercise to enumerate _all_ the external dependencies. But yeah, avoiding them all seems to be less than helpful for the vast majority of users.
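
For the pull-through cache idea, a rough sketch of how image references change once a cache rule exists. It assumes a rule has already been created with the namespace prefix `docker-hub` (the prefix is whatever you chose), uses a placeholder account ID and region, and only handles plain Docker Hub references, not ones that already name another registry. The function name is hypothetical.

```python
def via_pull_through_cache(image: str, account_id: str, region: str,
                           prefix: str = "docker-hub") -> str:
    """Rewrite a Docker Hub image reference to go through an ECR cache.

    Assumes a pull-through cache rule already exists for `prefix`.
    """
    # Docker Hub official images live under the implicit "library/" namespace.
    repo = image if "/" in image.split(":")[0] else f"library/{image}"
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{prefix}/{repo}"


# Hypothetical account ID and region.
print(via_pull_through_cache("python:3.12-slim", "123456789012", "eu-west-1"))
```

The rewritten reference would go in the `FROM` line of the Dockerfile, so builds keep working from the cached copy while Docker Hub is unreachable, provided the image was pulled through at least once before.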


Rent servers from a local provider. It's cheaper, you get more control over the hardware, but most of all, it avoids correlated failures.


On the flipside, then you have to maintain instances of everything.

For most of what I run these days, I'd rather just have someone else run and administer my database. Same with my load balancers. And my Kubernetes cluster. I don't really care if there is an outage every 2 years.


That only helps if their uptime is better than AWS.


> I assume most organizations, both small and large, just host on whatever provider they know or that costs them the least

I don't think there is a "most" organizations. Either they're looking for big cloud or they're not, and least-cost is usually the last consideration when looking at any cloud, because you're trying to pay a premium to get particular advantages.

The realistic antidote? Move to a less-shitty region. Or architect your systems to be failure-resilient.

(Most people seem to think the entire region was offline? That's wrong. It was just particular services that wouldn't process control plane requests, and then a failure cascade caused more problems. But things that were already running stayed running. A region is multiple datacenters. Even AZs are often multiple datacenters. It's virtually impossible for a whole region to stop working.)


If the cost is worth the complexity then you just do it. Otherwise you don't. How much did a company lose today compared to how much it costs to set it up?

And colo and datacenters aren't immune to going down
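
That comparison is simple enough to write down. A back-of-the-envelope sketch follows, where every number is hypothetical and the function name is made up; the only point is the shape of the calculation.

```python
def expected_annual_outage_loss(outages_per_year: float,
                                hours_per_outage: float,
                                loss_per_hour: float) -> float:
    """Expected yearly loss from outages, as a simple product of the inputs."""
    return outages_per_year * hours_per_outage * loss_per_hour


# Every number here is hypothetical; substitute your own estimates.
loss = expected_annual_outage_loss(
    outages_per_year=0.5,   # one major regional incident every two years
    hours_per_outage=8,
    loss_per_hour=10_000,   # revenue, SLA credits, support load, etc.
)
redundancy_cost_per_year = 120_000  # extra infra plus engineering time

print(f"expected outage loss: {loss:,.0f} per year")
print(f"redundancy cost:      {redundancy_cost_per_year:,.0f} per year")
print("worth it" if loss > redundancy_cost_per_year
      else "not worth it on these numbers")
```

On these made-up inputs the redundancy costs three times the expected loss, which is the "acceptable risk" case; a company with a much higher hourly loss flips the answer.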


What cost? Complexity - yes, to some extent.



