Why even have a status page? Someone reported that their org of >100,000 users can't use Google Meet. If corps aren't going to update their status page, might as well just not have one.
Edit: The GCP status page got updated <1 minute after I posted this, showing affected services are Cloud Data Fusion, Cloud Memorystore, Cloud Shell, Cloud Workstations, Google Cloud Bigtable, Google Cloud Console, Google Cloud Dataproc, Google Cloud Storage, Identity and Access Management, Identity Platform, Memorystore for Memcached, Memorystore for Redis, Memorystore for Redis Cluster, Vertex AI Search
There's no situation in which a corporation controls its own status page and you can still trust that page to have accurate information. None. The incentives will never be aligned in this regard. It's just too tempting and too easy for the corp to control the narrative when it maintains its own status page.
The only accurate status pages are provided by third party service checkers.
> The incentives will never be aligned in this regard.
Well, yes, incentives: don't big customers with wads of cash have an incentive to demand accurate reporting from their suppliers, so they can react properly instead of having to identify issues themselves? If there's systematic underreporting, then apparently not. Though in this case they did update their page.
In practice, how this plays out is that the holders of the big wads of cash will make demands, and Google (or whoever, Google is just the stand-in for the generic Corp here) will give them the actual information privately. It will still never be trusted to be reflected accurately on the public status page.
If you think about it from the corp's perspective, it makes perfect sense. They weigh the risk and reward. Are they going to be rewarded for radical transparency, or suffer fallout for acknowledging how bad a dumpster fire the situation actually is? It's easier for the corp to just lie, obscure, and downplay to avoid having to face that conundrum in the first place.
I have zero faith in status pages. It's easier and more reliable to just check twitter.
Heroku was down for _hours_ the other day before there was any mention of an incident - meanwhile there were hundreds of comments across twitter, hn, reddit etc.
Why can't companies be honest about being down? It helps us all out, so we don't spend an hour looking for the problem on our own end.
We are truly in God's hands.
$ prod
Fetching cluster endpoint and auth data.
ERROR: (gcloud.container.clusters.get-credentials) ResponseError: code=503, message=Visibility check was unavailable. Please retry the request and contact support if the problem persists
Is that even knowable? Like, I know they called it “The Astonishing, Unfailing, Bell System” but if they had an outage somewhere did they actually have an infrastructure of “canary phones” and such to tell in real time? (As in, they’d know even if service was restored in an hour)
Not trying to snark, I legit got nerdsniped by this comment.
They absolutely did. Note that the reliability estimates exclude the last mile because of trees falling and the like, but they had a lot of self-repair, reporting, and management facilities.
Engineering and Operations in the Bell System is pretty great for this.
Because a lot of the time, not everyone is impacted, as the systems are designed to contain the "blast radius" of failures using techniques such as cellular architecture and [shuffle sharding](https://aws.amazon.com/builders-library/workload-isolation-u...). So sometimes a service is completely down for some customers and fully unaffected for other customers.
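Roughly, the shuffle-sharding idea looks like this (a toy sketch in TypeScript, not anything resembling GCP's or AWS's actual implementation; the names and shard size are made up):

```typescript
// Toy sketch of shuffle sharding: each customer gets a small, deterministic,
// pseudo-random subset of workers, so a failure triggered by one customer's
// traffic only touches the workers in that customer's subset.
import { createHash } from "crypto";

function shuffleShard(customerId: string, workers: string[], shardSize: number): string[] {
  // Rank every worker by a hash of (customerId, worker) and take the first N.
  return workers
    .map(w => ({
      worker: w,
      rank: createHash("sha256").update(`${customerId}:${w}`).digest("hex"),
    }))
    .sort((a, b) => a.rank.localeCompare(b.rank))
    .slice(0, shardSize)
    .map(x => x.worker);
}

const workers = Array.from({ length: 8 }, (_, i) => `worker-${i}`);
console.log(shuffleShard("customer-a", workers, 2)); // e.g. [ 'worker-5', 'worker-1' ]
console.log(shuffleShard("customer-b", workers, 2)); // very likely a different pair
```

With a shard size of 2 out of 8 workers there are 28 possible shards, so two customers rarely share the exact same pair, and a worker poisoned by one customer's requests leaves most other customers with at least one healthy worker.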
"there is a 5% chance your instance is down" is still a partial outage. A green check should only mean everything (about that service) is working for everyone (in that region) as intended.
Downdetector reports started spiking over an hour ago but there still isn't a single status that isn't a green checkmark on the status page.
Sure, but you can still put a message up when some <numeric value> exceeds some <threshold value>, e.g. errors are 50% higher than normal (maybe the SLO is that 99.999% of requests are processed successfully).
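Something like this would be enough to flip the page automatically (a hypothetical sketch; the SLO, multiplier, and field names are assumptions, not anything GCP actually does):

```typescript
// Hypothetical auto-degrade rule: flip the public status when the error rate
// burns the SLO error budget many times over, instead of waiting for a human.
interface WindowStats { total: number; errors: number; }

const SLO_SUCCESS = 0.99999;      // SLO: 99.999% of requests succeed
const ALERT_MULTIPLIER = 50;      // flag when errors run 50x the allowed budget

function statusFor(stats: WindowStats): "available" | "degraded" {
  const errorRate = stats.errors / Math.max(stats.total, 1);
  const errorBudget = 1 - SLO_SUCCESS;                 // allowed error rate: 0.00001
  return errorRate > errorBudget * ALERT_MULTIPLIER ? "degraded" : "available";
}

console.log(statusFor({ total: 1_000_000, errors: 5 }));     // "available"
console.log(statusFor({ total: 1_000_000, errors: 2_000 })); // "degraded"
```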
Just note that aggregations like that might manifest as "GCP didn't have any issues today, actually."
E.g. it was mostly the us-central1 region that was affected, and within it only some services (regular instances and GKE Kubernetes, for example, were not affected in any region). So if we ask "what percentage of GCP is down", it might well be less than the threshold.
On the other hand, about a month ago, on 2025-05-19, there was an 8-hour incident with Spot VM instances affecting 5 regions, which was far more important to our company, but it didn't make any headlines.
> Because a lot of the time, not everyone is impacted
then such pages should report a partial failure. Indeed, the GCP outage page lists an orange "One or more regions affected" marker, but all services show the green "Available" marker, which apparently is not true.
The customers holding those contracts will be monitoring their service availability on their own. If Google can't be honest, you can bet your bottom dollar the companies paying for that SLA are going to hold them accountable, whether Google reports the outage properly or not.
The real point of SLAs is to give you a reason to break contracts. If a vendor doesn't meet their contractual promises, that gives you a lot of room to get out of those contracts.
If half the internet is down, which it apparently is, it's usually not the service in question but some backbone service like Cloudflare. And since internal health monitoring doesn't route to the outside through the backbone and back in, it won't pick it up. Which is good in some sense, as it means we can tell whether the problem is on the path TO the service or in the service itself.
Any customer with enough leverage to negotiate meaningful SLA agreements will also have the leverage to insist that uptime is not derived from the absence of incidents on public-facing status pages.
It's not. You might be joking, but that comment still isn't helpful.
My understanding is that this is part of Google's internal PSD offering (Public Status Board), which uses SCS (Static Content Service) behind GFE (Google Frontend), hosted on Borg, the same infrastructure that runs other large-scale apps such as Search, Drive, YouTube, etc.
Wellp. Incident report: "We posted our first incident report to Cloud Service Health about ~1h after the start of the crashes, due to the Cloud Service Health infrastructure being down due to this outage."
How could it not be helpful given that it gave you reason to provide more details that you wouldn't have otherwise shared? You may not have thought this through. There is nothing more helpful. Unless you think your own comment isn't helpful, but then...
Cunningham's Law emerged in the newsgroups era, well predating the existence of IRC.
Of course, I recognize that you purposefully pulled the Cunningham's Law trigger so that you, too, would gain additional knowledge that nobody would have told you about otherwise, as one logically would. And that you played it off as some kind of derision towards doing that all while doing it yourself made it especially funny. Well done!
I have 0 idea what Cunningham's Law is, so we can both agree that "recognizing purpose" was "mind-reading" in this case. I didn't really bother reading the rest after the first sentence because I saw something about how I was joking and congratulating me in my peripheral vision.
It is what it says on the tin: choosing to lie doesn't mean you want the truth communicated.
I apologize that it comes across as aggro, it's just that I'm not quite as giggly about this as you are. I think I can safely assume you're old enough to recognize some of the deleterious effects of lying.
You had no idea what it was. Now you know, thanks to the lie you told.
> choosing to lie doesn't mean you want the truth communicated.
But you're going to get it either way, so if you do lie, expect it. If you don't want it, don't lie, I guess. It is inconceivable that someone wouldn't want to learn the truth, though. Sadly, despite your efforts at enacting Cunningham again, I don't have more information to give you here.
> I apologize that it comes across as aggro
It doesn't. Attaching human attributes to software would be plain weird.
> I think I can safely assume you're old enough to recognize some deleterious effects of lying
Time and place. While it can be disastrous in the right context, context is significant. It makes no difference in a place of entertainment, as is the case here. Entertainment has always been rooted in tales that aren't true. No matter how old you are, even young children understand that.
So even then, it should have been able to report the status correctly. This suggests the status page is not automated and any change there needs to go through someone manually.
A program that updates the status page failing does not imply that the status page is manually edited. It's not as if you would regenerate the status page on every request.
How hard is it for the frontend to detect that the last update to the status page was a while ago, and treat that staleness itself as an error to be reported?
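Something like this on the page itself would at least surface the staleness (a hypothetical sketch; the feed URL shape, the lastUpdated field, and the 10-minute threshold are all assumptions):

```typescript
// Hypothetical staleness check: if the status feed itself hasn't been updated
// recently, say so instead of rendering a reassuring wall of green checkmarks.
const STALE_AFTER_MS = 10 * 60 * 1000; // assumed threshold: 10 minutes

async function describeFreshness(statusFeedUrl: string): Promise<string> {
  try {
    const res = await fetch(statusFeedUrl);
    const feed = (await res.json()) as { lastUpdated: string }; // assumed field name
    const ageMs = Date.now() - new Date(feed.lastUpdated).getTime();
    return ageMs > STALE_AFTER_MS
      ? `Status data is ${Math.round(ageMs / 60000)} minutes old and may not reflect reality.`
      : "Status data is current.";
  } catch {
    // If the status infrastructure itself is down, report "unknown", not "healthy".
    return "Unable to load status data; treat the dashboard as unknown, not green.";
  }
}
```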
The services ARE healthy; the status page is correct. The backbone that links YOU to the service isn't healthy. Take a look at Cloudflare, they are already working on it.
Yeah, my company of hundreds of people working remotely is seeing 90%+ failures connecting to Google Meet; joining a meeting just results in a 504.
Whichever product person is in charge of the status page should be ashamed
How could you possibly trust them with your critical workloads? They don't even tell you whether or not their services work (despite obviously knowing)
My guess is whatever system downdetector uses to "detect downtime" relies on either GCP or Cloudflare (also having issues at the moment: https://www.cloudflarestatus.com/)
Aren't some of these sites partially based on hits (because of the assumption that if enough people are suddenly googling "Is youtube down", then youtube must be having some sort of issue)?
I could see a big outage like this causing people to google "Is AWS down?"