
The status page is green, but there are outages reported: https://downdetector.com/status/google-cloud/





Why even have a status page? Someone reported that their org of >100,000 users can't use Google Meet. If corps aren't going to update their status page, might as well just not have one.

https://www.google.com/appsstatus/dashboard/

https://status.cloud.google.com/index.html

Edit: The GCP status page got updated <1 minute after I posted this, showing affected services are Cloud Data Fusion, Cloud Memorystore, Cloud Shell, Cloud Workstations, Google Cloud Bigtable, Google Cloud Console, Google Cloud Dataproc, Google Cloud Storage, Identity and Access Management, Identity Platform, Memorystore for Memcached, Memorystore for Redis, Memorystore for Redis Cluster, Vertex AI Search


There's no situation in which a corporation controls its own status page and you can trust that page to have accurate information. None. The incentives will never be aligned in this regard. It's just too tempting and easy for the corp to control the narrative when they maintain their own status page.

The only accurate status pages are provided by third party service checkers.


> The incentives will never be aligned in this regard.

Well, yes, incentives: don't big customers with wads of cash have an incentive to demand accurate reporting from their suppliers, so they can react quickly rather than spend time identifying issues themselves? If there's systematic underreporting, then apparently not. Though in this case they did update their page.


In practice, how this plays out is that the holders of the big wads of cash will make demands, and Google (or whoever, Google is just the stand-in for the generic Corp here) will give them the actual information privately. The public status page will still never be trusted to reflect it accurately.

If you think about it from the corp’s perspective, it makes perfect sense. They weigh the risk and the reward. Are they going to be rewarded for radical transparency, or suffer fallout from acknowledging how bad a dumpster fire the situation actually is? It's easier for the corp to just lie, obscure and downplay, and avoid having to face that conundrum in the first place.


> If there's systematic underreporting, then apparently not.

You answered your own question.

Who gets a promotion from a working status board?

I have zero faith in status pages. It's easier and more reliable to just check twitter.

Heroku was down for _hours_ the other day before there was any mention of an incident - meanwhile there were hundreds of comments across twitter, hn, reddit etc.


anecdotally, the status pages have been taken away from engineering and are run by customer support and marketing

> might as well just not have one

This is my position.



Our company's internal incident channel on this had been going for nearly an hour before GCP finally declared that yes, in fact, things are on fire.

… I get that PR-types probably want to massage the message, but going radio dark is not good PR.


Why can't companies be honest about being down? It helps us all out, so we don't spend an hour looking for the problem internally.

We are truly in God's hands.

$ prod

Fetching cluster endpoint and auth data. ERROR: (gcloud.container.clusters.get-credentials) ResponseError: code=503, message=Visibility check was unavailable. Please retry the request and contact support if the problem persists


Because they have unrealistic targets, so they make up fake uptime numbers. 99.999% would mean not even having an hour of downtime in 10 years.

I remember reddit being down for like a whole day or so and they claimed 99.5% in that month.
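
To put numbers on the two claims above, here is a quick back-of-the-envelope sketch in Python. The function names are made up for illustration, and the 30-day month and 24-hour outage are assumptions standing in for "that month" and "a whole day or so".

```python
HOURS_PER_YEAR = 365.25 * 24


def allowed_downtime_minutes(availability_pct: float, period_hours: float) -> float:
    """Minutes of downtime permitted over a period at a given availability."""
    return (1 - availability_pct / 100) * period_hours * 60


def availability_pct(downtime_hours: float, period_hours: float) -> float:
    """Availability actually achieved, given observed downtime in a period."""
    return 100 * (1 - downtime_hours / period_hours)


# Five nines over ten years: a budget of roughly 53 minutes in total.
five_nines_decade = allowed_downtime_minutes(99.999, 10 * HOURS_PER_YEAR)

# 99.5% over an assumed 30-day month: a budget of 216 minutes (3.6 hours).
month_budget = allowed_downtime_minutes(99.5, 30 * 24)

# An assumed full day down in a 30-day month works out to about 96.7%.
day_down = availability_pct(24, 30 * 24)

print(round(five_nines_decade, 1), round(month_budget, 1), round(day_down, 2))
```

So a full day of downtime and a 99.5% monthly figure cannot both be true, unless the outage was counted as partial or excluded from the calculation.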


Ma Bell hit that decently often.

Is that even knowable? Like, I know they called it “The Astonishing, Unfailing, Bell System” but if they had an outage somewhere did they actually have an infrastructure of “canary phones” and such to tell in real time? (As in, they’d know even if service was restored in an hour)

Not trying to snark, I legit got nerdsniped by this comment.


They absolutely did. Note that the reliability estimates exclude the last mile, because of trees falling and the like, but they had a lot of self-repair, reporting, and management facilities.

Engineering and Operations in the Bell System is pretty great for this.


Running a much simpler system with much more independent nodes.

It's a lot easier to keep packets flowing than to keep non-self-contained servers serving.


Because a lot of the time, not everyone is impacted, as the systems are designed to contain the "blast radius" of failures using techniques such as cellular architecture and [shuffle sharding](https://aws.amazon.com/builders-library/workload-isolation-u...). So sometimes a service is completely down for some customers and fully unaffected for other customers.
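
For anyone unfamiliar with shuffle sharding, here is a minimal Python sketch of the idea. The worker names, the shard size of 2, and the hash-based assignment are all assumptions for illustration; this is not how AWS or GCP actually assign shards.

```python
import hashlib
import random

WORKERS = [f"worker-{i}" for i in range(8)]  # hypothetical fleet of 8 workers


def shard_for(customer_id: str, workers=WORKERS, shard_size: int = 2) -> frozenset:
    """Deterministically pick a pseudo-random subset of workers for a customer."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    return frozenset(random.Random(seed).sample(workers, shard_size))


def fully_down(customer_id: str, failed: set) -> bool:
    """A customer is completely down only if every worker in its shard failed."""
    return shard_for(customer_id) <= failed


def partially_affected(customer_id: str, failed: set) -> bool:
    """A customer sees some errors if any worker in its shard failed."""
    return bool(shard_for(customer_id) & failed)
```

With 8 workers and shards of 2 there are 28 possible shards, so when one customer's two workers both fail, only customers who happen to share that exact pair are fully down. Everyone else is at worst partially affected, which is exactly the "down for some, fine for others" situation described above.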

"there is a 5% chance your instance is down" is still a partial outage. A green check should only mean everything (about that service) is working for everyone (in that region) as intended.

Downdetector reports started spiking over an hour ago but there still isn't a single status that isn't a green checkmark on the status page.


With highly distributed services there's always something failing, some small percentage.

Sure, but you can still put a message up when some <numeric value> goes over some <threshold value>, e.g. errors are 50% higher than normal (maybe the SLO is that 99.999% of requests are processed successfully).
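
A rule like that is only a few lines. Here is a rough Python sketch, assuming the provider already has request and error counts per time window; the `Window` type, the 1.5x "degraded" factor (the "50% higher than normal" above), and the 50% "outage" cutoff are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Window:
    total: int   # requests seen in the window
    errors: int  # requests that failed in the window


def status_for(window: Window, baseline_error_rate: float,
               degraded_factor: float = 1.5, outage_rate: float = 0.5) -> str:
    """Map an observed error rate to a status label using simple thresholds."""
    if window.total == 0:
        return "unknown"  # no traffic, so nothing can be claimed either way
    rate = window.errors / window.total
    if rate >= outage_rate:
        return "outage"
    if rate > baseline_error_rate * degraded_factor:
        return "degraded"
    return "available"


print(status_for(Window(total=1000, errors=2), baseline_error_rate=0.001))
```

The hard part is choosing the baseline and the thresholds, not the code.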

Just note that aggregations like that might end up showing that GCP actually didn't have any issues today.

E.g. it was mostly the us-central1 region that was affected, and within it only some services (regular instances and GKE Kubernetes, for example, were not affected in any region). So if we ask "what percentage of GCP is down", it might well be less than the threshold.
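
A small Python sketch of how that masking happens. The traffic numbers are invented: one region failing 80% of requests, 19 hypothetical healthy regions, and an assumed 5% global alert threshold.

```python
def error_rates(requests: dict) -> tuple:
    """Return (global error rate, per-region error rates).

    `requests` maps region name to a (total, errors) pair.
    """
    per_region = {r: e / t for r, (t, e) in requests.items() if t}
    total = sum(t for t, _ in requests.values())
    errors = sum(e for _, e in requests.values())
    global_rate = errors / total if total else 0.0
    return global_rate, per_region


# Invented numbers: one badly broken region among twenty.
traffic = {"us-central1": (1000, 800)}
traffic.update({f"region-{i}": (1000, 1) for i in range(19)})

global_rate, per_region = error_rates(traffic)
print(f"global: {global_rate:.2%}, us-central1: {per_region['us-central1']:.0%}")
```

Under these assumptions the global rate stays below 5% while us-central1 is failing 80% of its requests, so any threshold has to be evaluated per region and per service rather than on the global aggregate.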

On the other hand, about a month ago, on 2025-05-19, there was an 8-hour incident with Spot VM instances affecting 5 regions. It was way more important to our company, but it didn't make any headlines.


Just say it: they want to lie to 95% of customers.

> Because a lot of the time, not everyone is impacted

Then such pages should report a partial failure. Indeed, the GCP outage page shows an orange "One or more regions affected" marker, but all services show the green "Available" marker, which apparently is not true.


There's always a partial outage in large systems, some very small percentage. All clouds should report all red then.

It's not rocket science. Put a message up "The service is currently degraded and some users may see errors"

They still could show that some issues exist. Their monitoring must know.

The issue is that they don't want to. (So they can claim good uptime, which may even be true for the average user, if most outages affect only small groups.)


That is still 100% an outage and should be displayed as such

Because there are contracts related to uptime :)

The customers behind those contracts will be monitoring service availability on their own. If Google can't be honest, you can bet your bottom dollar the companies paying for that SLA are going to hold them accountable whether they report the outage properly or not.

The real point of SLAs is to give you a reason to break contracts. If a vendor doesn't meet their contractual promises, that gives you a lot of room to get out of contracts.

Does any service even say they're "down" anymore? All I see is "elevated error rates".

4 to 6 hours after the flames are visible from orbit and management has finally given up on the 37th quick fix you do get that red X

But really not until after it's been on CNN a while.


If half the internet is down, which it apparently is, it's usually not the service in question but some backbone service like Cloudflare. And since internal health monitoring doesn't route out through the backbone and back in, it won't pick that up. Which is good in some sense, as it means we can tell whether the problem is on the path TO the service or in the service itself.

> Why can't companies be honest with being down

SLA agreements.


Any customer with enough leverage to negotiate meaningful SLA agreements will also have the leverage to insist that uptime is not derived from the absence of incidents on public-facing status pages.

Service level agreements agreements?

The program that updates the status page is hosted on Google Cloud.

It's not. You might be joking, but that comment still isn't helpful.

My understanding is this is part of Google's internal PSD offering (Public Status Board) which uses SCS (Static Content Service) behind GFE (Google Frontend) which is hosted on Borg, and deploys other large scale apps such as Search, Drive, YouTube, etc.


Wellp. Incident report: "We posted our first incident report to Cloud Service Health about ~1h after the start of the crashes, due to the Cloud Service Health infrastructure being down due to this outage."

How could it not be helpful given that it gave you reason to provide more details that you wouldn't have otherwise shared? You may not have thought this through. There is nothing more helpful. Unless you think your own comment isn't helpful, but then...

Because "It's good to lie because it makes people correct me" is a joke about IRC, not a viable stable game-theoretic optimal position.

Cunningham's Law emerged in the newsgroups era, well predating the existence of IRC.

Of course, I recognize that you purposefully pulled the Cunningham's Law trigger so that you, too, would gain additional knowledge that nobody would have told you about otherwise, as one logically would. And that you played it off as some kind of derision towards doing that all while doing it yourself made it especially funny. Well done!


I have 0 idea what Cunningham's Law is, so we can both agree that "recognizing purpose" was "mind-reading", in this case. I didn't really bother reading the rest after the first sentence because I saw something about how I was joking and congratulating me in my peripheral vision.

It is what it says on the tin: choosing to lie doesn't mean you want the truth communicated.

I apologize that it comes across as aggro, it's just that I'm not quite as giggly about this as you are. I think I can safely assume you're old enough to recognize some deleterious effects of lying.


> I have 0 idea what Cunningham's Law is

You had no idea what it is. Now you know, thanks to the lie you told.

> choosing to lie doesn't mean you want the truth communicated.

But you're going to get it either way, so if you do lie, expect it. If you don't want it – don't lie, I guess. It is inconceivable that someone wouldn't want to learn about the truth, though. Sadly, despite your efforts in enacting Cunningham again, I don't have more information to give you here.

> I apologize that it comes across as aggro

It doesn't. Attaching human attributes to software would be plain weird.

> I think I can safely assume you're old enough to recognize some deleterious effects of lying

Time and place. While it can be disastrous in the right context, context is significant. It makes no difference in a place of entertainment, as is the case here. Entertainment has always been rooted in tales that aren't true. No matter how old you are, even young children understand that.


So even then, it should have been able to correctly report the status. It suggests that the status page is not automated and that any change there has to be made manually by someone.

A program that updates the status page failing does not imply that the status page is manually edited. It is not like you would generate a status page on every request.

How do we know that the program is failing?

How hard is it for the frontend to detect that the last update to the status page was made a while ago, and that this in itself implies an error that should be reported?
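
The check being asked for could look something like this Python sketch. It assumes the updater writes a "last updated" timestamp on every run, even when nothing changed, and that something (a script on the page or an external monitor) compares it to the current time. The 5-minute limit and the function names are invented; nothing in this thread says Google's page works this way.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated: datetime, now: datetime,
             max_age: timedelta = timedelta(minutes=5)) -> bool:
    """True if the updater's heartbeat is older than the allowed age."""
    return now - last_updated > max_age


def banner(last_updated: datetime, now: datetime) -> str:
    """Text to show on the page; empty when the heartbeat is fresh."""
    if is_stale(last_updated, now):
        return "Status data may be out of date"
    return ""


now = datetime(2025, 6, 12, 19, 0, tzinfo=timezone.utc)  # arbitrary example time
print(banner(now - timedelta(minutes=55), now))
```

A stale heartbeat only says the updater is unhealthy, not which services are down, but that alone is more honest than a frozen row of green checkmarks.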


We don’t.

But why would the frontend have processing logic when all you need is to serve a static HTML document?

Even if it did, what would you do with that information? Throw up a screen with: Call us for service information at 1-HAHA-JUST-KIDDING

It’s not like it really matters if it’s accurate anyway.


The services ARE healthy, the status page is correct. The backbone which links YOU to the service isn't healthy. Take a look at Cloudflare, they are already working on it.

Not even close. The status page is manual, and Cloudflare's outage is because of Google, not the other way around.

Nobody gets a promotion, that's why.

Please, won't somebody think of the KPIs.

Yeah, my company of hundreds of people working remotely are having 90%+ failures connecting to Google Meetings - joining a meeting just results in a 504.

It's updated now, shows the impact to console, dataproc, GCS, IAM and Identity Platform: https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1S...

Whichever product person is in charge of the status page should be ashamed

How could you possibly trust them with your critical workloads? They don't even tell you whether or not their services work (despite obviously knowing)




AWS is fine: https://health.aws.amazon.com/health/status

My guess is whatever system downdetector uses to "detect downtime" relies on either GCP or Cloudflare (also having issues at the moment: https://www.cloudflarestatus.com/)


So’s Azure? https://downdetector.com/status/windows-azure/

This is where we get to learn about the one common system all of our “distributed cloud” systems rely on, isn’t it?


My gut says all clouds spike when one goes down from people misreporting issues.

But I suppose there's always "something something BGP" but that feels less likely.


Aren't some of these sites partially based on hits (because of the assumption that if enough people are suddenly googling "Is youtube down", then youtube must be having some sort of issue)?

I could see a big outage like this causing people to google "Is AWS down?"


Almost everything on the downdetector home page is listed as having downtime...

At this point I don’t know if I must assume people are trolling or the entire internet is down.

wtf is going on

It's the entire internet. Check oracle cloud, etc etc. The ENTIRE INTERNET.

Quick! Pirate as much music as possible before it goes for good! ;)

Hacker News is fine.

Oracle and Azure report no issues on their status pages, likely just Downdetector getting hammered.

neither did google cloud for the first 55 minutes of their outage.

Is there a nuclear war or something???


