Why even have a status page? Someone reported that their org of >100,000 users can't use Google Meet. If corps aren't going to update their status page, might as well just not have one.
Edit: The GCP status page got updated <1 minute after I posted this, showing affected services are Cloud Data Fusion, Cloud Memorystore, Cloud Shell, Cloud Workstations, Google Cloud Bigtable, Google Cloud Console, Google Cloud Dataproc, Google Cloud Storage, Identity and Access Management, Identity Platform, Memorystore for Memcached, Memorystore for Redis, Memorystore for Redis Cluster, Vertex AI Search
There's no situation in which a corporation controls its own status page and you can still trust that page to have accurate information. None. The incentives will never be aligned in this regard. It's just too tempting and too easy for the corp to control the narrative when it maintains its own status page.
The only accurate status pages are provided by third party service checkers.
> The incentives will never be aligned in this regard.
Well, yes, incentives: don't big customers with wads of cash have an incentive to demand accurate reporting from their suppliers, so they can react properly instead of having to identify issues themselves? If there's systematic underreporting, then apparently not. Though in this case they did update their page.
In practice, how this plays out is that the holders of the big wads of cash will make demands, and Google (or whoever, Google is just the stand-in for the generic Corp here) will give them the actual information privately. It will still never be trusted to be reflected accurately on the public status page.
If you think about it from the corp's perspective, it makes perfect sense. They weigh the risk and reward. Are they going to be rewarded for radical transparency, or suffer fallout for acknowledging how bad a dumpster fire the situation actually is? It's easier for the corp to just lie, obscure, and downplay to avoid having to face that conundrum in the first place.
I have zero faith in status pages. It's easier and more reliable to just check twitter.
Heroku was down for _hours_ the other day before there was any mention of an incident - meanwhile there were hundreds of comments across twitter, hn, reddit etc.
Why can't companies be honest about being down? It helps us all out, so we don't spend an hour looking for the problem on our own end.
We are truly in God's hands.
$ prod
Fetching cluster endpoint and auth data.
ERROR: (gcloud.container.clusters.get-credentials) ResponseError: code=503, message=Visibility check was unavailable. Please retry the request and contact support if the problem persists
Is that even knowable? Like, I know they called it “The Astonishing, Unfailing, Bell System” but if they had an outage somewhere did they actually have an infrastructure of “canary phones” and such to tell in real time? (As in, they’d know even if service was restored in an hour)
Not trying to snark, I legit got nerdsniped by this comment.
They absolutely did. Note that the reliability estimates exclude the last mile because of trees falling and the like, but they had a lot of self-repair, reporting, and management facilities.
Engineering and Operations in the Bell System is pretty great for this.
Because a lot of the time, not everyone is impacted, as the systems are designed to contain the "blast radius" of failures using techniques such as cellular architecture and [shuffle sharding](https://aws.amazon.com/builders-library/workload-isolation-u...). So sometimes a service is completely down for some customers and fully unaffected for other customers.
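Roughly, the shuffle-sharding idea looks like this (a toy sketch in TypeScript, not anything resembling GCP's or AWS's actual implementation; the names and shard size are made up):

```typescript
// Toy sketch of shuffle sharding: each customer gets a small, deterministic,
// pseudo-random subset of workers, so a failure triggered by one customer's
// traffic only touches the workers in that customer's subset.
import { createHash } from "crypto";

function shuffleShard(customerId: string, workers: string[], shardSize: number): string[] {
  // Rank every worker by a hash of (customerId, worker) and take the first N.
  return workers
    .map(w => ({
      worker: w,
      rank: createHash("sha256").update(`${customerId}:${w}`).digest("hex"),
    }))
    .sort((a, b) => a.rank.localeCompare(b.rank))
    .slice(0, shardSize)
    .map(x => x.worker);
}

const workers = Array.from({ length: 8 }, (_, i) => `worker-${i}`);
console.log(shuffleShard("customer-a", workers, 2)); // e.g. [ 'worker-5', 'worker-1' ]
console.log(shuffleShard("customer-b", workers, 2)); // very likely a different pair
```

With a shard size of 2 out of 8 workers there are 28 possible shards, so two customers rarely share the exact same pair, and a worker poisoned by one customer's requests leaves most other customers with at least one healthy worker.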
"there is a 5% chance your instance is down" is still a partial outage. A green check should only mean everything (about that service) is working for everyone (in that region) as intended.
Downdetector reports started spiking over an hour ago but there still isn't a single status that isn't a green checkmark on the status page.
Sure, but you can still put a message up when some <numeric value> exceeds some <threshold value>, e.g. errors are 50% higher than normal (maybe the SLO is that 99.999% of requests are processed successfully).
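Something like this would be enough to flip the page automatically (a hypothetical sketch; the SLO, multiplier, and field names are assumptions, not anything GCP actually does):

```typescript
// Hypothetical auto-degrade rule: flip the public status when the error rate
// burns the SLO error budget many times over, instead of waiting for a human.
interface WindowStats { total: number; errors: number; }

const SLO_SUCCESS = 0.99999;      // SLO: 99.999% of requests succeed
const ALERT_MULTIPLIER = 50;      // flag when errors run 50x the allowed budget

function statusFor(stats: WindowStats): "available" | "degraded" {
  const errorRate = stats.errors / Math.max(stats.total, 1);
  const errorBudget = 1 - SLO_SUCCESS;                 // allowed error rate: 0.00001
  return errorRate > errorBudget * ALERT_MULTIPLIER ? "degraded" : "available";
}

console.log(statusFor({ total: 1_000_000, errors: 5 }));     // "available"
console.log(statusFor({ total: 1_000_000, errors: 2_000 })); // "degraded"
```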
Just note that aggregations like that might manifest as "GCP didn't have any issues today, actually."
E.g. it was mostly the us-central1 region that was affected, and within it only some services (regular instances and GKE Kubernetes, for example, were not affected in any region). So if we ask "what percentage of GCP is down", it might well be less than the threshold.
On the other hand, about a month ago, on 2025-05-19, there was an 8-hour incident with Spot VM instances affecting 5 regions, which was far more important to our company, but it didn't make any headlines.
> Because a lot of the time, not everyone is impacted
then such pages should report a partial failure. Indeed, the GCP outage page lists an orange "One or more regions affected" marker, but all services show the green "Available" marker, which apparently is not true.
The customers holding those contracts will be monitoring their service availability on their own. If Google can't be honest, you can bet your bottom dollar the companies paying for that SLA are going to hold them accountable, whether Google reports the outage properly or not.
The real point of SLAs is to give you a reason to break contracts. If a vendor doesn't meet their contractual promises, that gives you a lot of room to get out of those contracts.
If half the internet is down, which it apparently is, it's usually not the service in question but some backbone service like Cloudflare. And since internal health monitoring doesn't route to the outside through the backbone and back in, it won't pick it up. Which is good in some sense, as it means we can tell whether the problem is on the path TO the service or in the service itself.
Any customer with enough leverage to negotiate meaningful SLA agreements will also have the leverage to insist that uptime is not derived from the absence of incidents on public-facing status pages.
It's not. You might be joking, but that comment still isn't helpful.
My understanding is that this is part of Google's internal PSD offering (Public Status Board), which uses SCS (Static Content Service) behind GFE (Google Frontend), hosted on Borg, the same infrastructure that runs other large-scale apps such as Search, Drive, YouTube, etc.
Wellp. Incident report: "We posted our first incident report to Cloud Service Health about ~1h after the start of the crashes, due to the Cloud Service Health infrastructure being down due to this outage."
How could it not be helpful given that it gave you reason to provide more details that you wouldn't have otherwise shared? You may not have thought this through. There is nothing more helpful. Unless you think your own comment isn't helpful, but then...
Cunningham's Law emerged in the newsgroups era, well predating the existence of IRC.
Of course, I recognize that you purposefully pulled the Cunningham's Law trigger so that you, too, would gain additional knowledge that nobody would have told you about otherwise, as one logically would. And that you played it off as some kind of derision towards doing that all while doing it yourself made it especially funny. Well done!
I have 0 idea what Cunningham's Law is, so we can both agree that "recognizing purpose" was "mind-reading" in this case. I didn't really bother reading the rest after the first sentence because I saw something about how I was joking and congratulating me in my peripheral vision.
It is what it says on the tin: choosing to lie doesn't mean you want the truth communicated.
I apologize that it comes across as aggro, it's just that I'm not quite as giggly about this as you are. I think I can safely assume you're old enough to recognize some of the deleterious effects of lying.
You had no idea what it was. Now you know, thanks to the lie you told.
> choosing to lie doesn't mean you want the truth communicated.
But you're going to get it either way, so if you do lie, expect it. If you don't want it, don't lie, I guess. It is inconceivable that someone wouldn't want to learn the truth, though. Sadly, despite your efforts at enacting Cunningham again, I don't have more information to give you here.
> I apologize that it comes across as aggro
It doesn't. Attaching human attributes to software would be plain weird.
> I think I can safely assume you're old enough to recognize some deleterious effects of lying
Time and place. While it can be disastrous in the right context, context is significant. It makes no difference in a place of entertainment, as is the case here. Entertainment has always been rooted in tales that aren't true. No matter how old you are, even young children understand that.
So even then, it should have been able to report the status correctly. This suggests the status page is not automated and any change there needs to go through someone manually.
A program that updates the status page failing does not imply that the status page is manually edited. It's not as if you would regenerate the status page on every request.
How hard is it for the frontend to detect that the last update to the status page was a while ago, and treat that staleness itself as an error to be reported?
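Something like this on the page itself would at least surface the staleness (a hypothetical sketch; the feed URL shape, the lastUpdated field, and the 10-minute threshold are all assumptions):

```typescript
// Hypothetical staleness check: if the status feed itself hasn't been updated
// recently, say so instead of rendering a reassuring wall of green checkmarks.
const STALE_AFTER_MS = 10 * 60 * 1000; // assumed threshold: 10 minutes

async function describeFreshness(statusFeedUrl: string): Promise<string> {
  try {
    const res = await fetch(statusFeedUrl);
    const feed = (await res.json()) as { lastUpdated: string }; // assumed field name
    const ageMs = Date.now() - new Date(feed.lastUpdated).getTime();
    return ageMs > STALE_AFTER_MS
      ? `Status data is ${Math.round(ageMs / 60000)} minutes old and may not reflect reality.`
      : "Status data is current.";
  } catch {
    // If the status infrastructure itself is down, report "unknown", not "healthy".
    return "Unable to load status data; treat the dashboard as unknown, not green.";
  }
}
```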
The services ARE healthy; the status page is correct. The backbone that links YOU to the service isn't healthy. Take a look at Cloudflare, they are already working on it.
Yeah, my company of hundreds of people working remotely is seeing 90%+ failures connecting to Google Meet; joining a meeting just results in a 504.
Whichever product person is in charge of the status page should be ashamed
How could you possibly trust them with your critical workloads? They don't even tell you whether or not their services work (despite obviously knowing)
My guess is whatever system downdetector uses to "detect downtime" relies on either GCP or Cloudflare (also having issues at the moment: https://www.cloudflarestatus.com/)
Aren't some of these sites partially based on hits (because of the assumption that if enough people are suddenly googling "Is youtube down", then youtube must be having some sort of issue)?
I could see a big outage like this causing people to google "Is AWS down?"