Last week's Let's Encrypt downtime

jrpelkonen · on June 22, 2023

> I find it alarming that a week after the incident, 40% of the affected certificates are still in use, despite being rejected by the most popular browsers and despite affected subscribers being emailed by Let's Encrypt.

This is perhaps a consequence of how well-oiled a machine LE typically is: people stop paying attention to it.

tialaramex · on June 22, 2023

I haven't looked at a list of revoked certificates, because I was busy (and I no longer operate my own CT auditing software, so I'd have to poke around in crt.sh which is not much fun), but let's suppose these are a random sample of Let's Encrypt's ~2 million issuances per day.

What %age of the world's HTTPS web sites are "parked" and so there is nobody who expects them to actually work? BrandFromATVShow.example ? TeenDanceISawOnTikTok.example ? SomeShortEnglishWord.example ? Nobody cares, if they do visit, and there's a certificate failure, they realise that's not where they meant to go and leave.

Then what %age are somebody's fever dream / retirement plan / abandoned start-up idea and so although the owner may notice eventually that it's broken, that might not happen before automatic renewal "fixes" the problem anyway if ever. MyTownOlympicSwimmingPool.example JimAndBethsCakeShop.example and LikeAWSForDogsSomehow.example

And then how about all the outfits which folded weeks, months, even in some cases years ago, but the ISP bill was paid, so, the web site continues to exist until somebody removes it, but of course nobody cares ? BoughtByGoogle.example and YetAnotherBayAreaCryptoStartup.example together with DefinitelyViableProduct.example and OopsWalmartAlreadySellsThatForLessMoney.example

If it was 95% I'd be more worried, at 40% I'd need to actually check at least a decent sample and see for myself. In the time I was writing this post I checked one, it wasn't replaced... exactly, because the actual web site uses a certificate issued five days earlier. Chances are they've got a bunch of duplicate certificates, so the fact that some they don't use are broken has never come up - that's just rude (wastes other people's resources) but it works fine technically.

tedunangst · on June 22, 2023

Renewing a cert without immediately deploying it seems like a reasonable practice in the face of CAs that will misissue through no fault of your own.

schoen · on June 22, 2023

When we wrote Certbot, we thought (by analogy with prior practice) that many sysadmins would want to manually inspect certificates before deploying them! That's one reason that we kept old certificates around and used a symlink-updating system.

As it turned out, misissued and invalid certs account for an incredibly small fraction of Let's Encrypt's issuance volume (I'm going to say < 1/10⁸ offhand?) and manual inspection kind of gets in the way of automation, so the idea of separating these steps has come to seem kind of quaint, for me at least. I've also helped thousands of people on the Let's Encrypt forum and I think at most 2 have said they were interested in looking at their new certs' contents before starting to use them.

tedunangst · on June 22, 2023

I may not inspect it myself (which wouldn't even catch this issue), but letting it simmer for a week isn't hard.

agwa · on June 22, 2023

That's a pretty good idea, and would also mitigate clients with slow clocks rejecting a certificate for not being valid yet.

toast0 · on June 23, 2023

Or clients with poor handling of dates. It's been a while, but Nokia Series 40 was really bad at this. As I recall, it would read the not-before as if the time was specified in local time. So not before noon UTC becomes not before noon wherever you are, so better wait a while for users in the Americas.

schoen · on June 23, 2023

Yeah, we originally were thinking of having separate times for renewal and deployment (e.g. 4 weeks before expiry and 3 weeks before expiry), but there was a countervailing concern that people would think "wait, my certificate has already been renewed but I can't see it on my site!" and get confused or alarmed.
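
A minimal sketch of the two-threshold idea described above, using the 4-week and 3-week figures from the example. The function name and return values are hypothetical, chosen only for illustration; this is not how Certbot actually behaves.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the example in the comment; both are assumptions.
RENEW_BEFORE_EXPIRY = timedelta(weeks=4)
DEPLOY_BEFORE_EXPIRY = timedelta(weeks=3)


def next_action(not_after, now, have_renewed_cert):
    """Return 'wait', 'renew', or 'deploy' for the certificate currently served.

    not_after is the expiry of the certificate currently in use.
    have_renewed_cert says whether a replacement has already been obtained.
    """
    remaining = not_after - now
    if have_renewed_cert:
        # The replacement sits on disk until the deployment threshold is reached.
        return "deploy" if remaining <= DEPLOY_BEFORE_EXPIRY else "wait"
    return "renew" if remaining <= RENEW_BEFORE_EXPIRY else "wait"
```

With these numbers a renewed certificate would sit unused for about a week, which is the "simmer" period discussed earlier in the thread.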

NovemberWhiskey · on June 22, 2023

Based on my experience, the capability model for certificate management usually went like:

1) Chaos: certificates requested and installed manually, either in response to incidents caused by expiration or calendar reminders

2) Monitoring: certificates requested and installed manually, in response to noisy alerting by probers looking for indications of pending expiration or other ill-health

3) Automation: continuous certificate provisioning, distribution and enablement either through platform or integration

The Let's Encrypt revolution has taken a lot of people from stage 1 to stage 3 without stage 2 in between.

tredre3 · on June 22, 2023

That is true, but how come certbot had no awareness of revoked/withdrawn certificates before now? It seems like one of the things a CA is supposed to solve for you, and the fact that it doesn't is a bit alarming in itself.

Though, as the following sentence points out, they were already working on it before the outage, so clearly they knew it was needed.

1. https://datatracker.ietf.org/doc/draft-ietf-acme-ari/

411111111111111 · on June 22, 2023

The CA can't solve it for you.

The certificate authority signs certificate requests, creating certificates. The revocation process is necessary as well, but the CA doesn't have the ability to change the already issued certificate, thus it cannot take action.

Software like certbot can solve it for you, but that's not affiliated with your CA.

agwa · on June 22, 2023

The CA is part of the solution by using ARI to inform ACME clients to replace impacted certificates.
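
As a rough illustration of the client side, here is a sketch of how an ACME client might act on an ARI response. It assumes the response has already been fetched and parsed, and that it carries a `suggestedWindow` object with `start` and `end` timestamps as in the ARI draft. The function names are hypothetical, and the draft's full algorithm (which also considers when the client will next wake up) is simplified here.

```python
import random
from datetime import datetime, timezone


def parse_rfc3339(ts):
    # Handles the common "...Z" form only; other RFC 3339 variants
    # may need a proper parser.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def should_renew_now(renewal_info, now, rng=random):
    """Decide whether to renew immediately, given a parsed ARI response."""
    window = renewal_info["suggestedWindow"]
    start = parse_rfc3339(window["start"])
    end = parse_rfc3339(window["end"])
    # Pick a uniformly random instant in the window so that clients
    # do not all renew at the same moment.
    chosen = start + (end - start) * rng.random()
    # A chosen time in the past means "renew right away", which is how a CA
    # can signal that a certificate needs replacing.
    return chosen <= now
```

When the CA moves the whole window into the past, every polling client gets `True` the next time it checks.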

mcpherrinm · on June 22, 2023

Even before ARI, some integrated ACME/Web servers use OCSP as a way of knowing to renew if a cert was revoked. Plus if you're doing that you can pin the OCSP response while you're at it.

411111111111111 · on June 22, 2023

My point was that the CA can't solve it for you, they can only give you APIs and processes with which you can solve it yourself.

If your webserver supports checking the certificate validity then it's not solved by the CA, it's been solved by the developers of that software and by you installing it.

hinkley · on June 22, 2023

Vernor Vinge has dominated the Singularity space in science fiction pretty much from the beginning of the concept.

Rainbows End plays around in a time frame right around where we are now, just a bit before the sorts of doglegs we predict would presage a Singularity in your lifetime.

At one point the protagonists need to attack a bad actor, and to make it work they need chaos on the internet. I don’t recall exactly how this plays out, but the way they decide to achieve it is that one of the collaborators believes that they can reject a CA cert that affects 10% of all certificates in the wild, and the resulting pandemonium will give them approximately the sort of chaos they need.

Sounds to me like maybe that is either no longer true, or never was.

tialaramex · on June 22, 2023

[Spoilers]

They don't need Chaos. They want to disable Rabbit, and they know Rabbit's certificates mostly tie back to a single CA, Credit Suisse. So they "revoke" Credit Suisse and accept the consequences, which (they acknowledge) are career ending for the Europeans. This is mostly a plot convenience because Rabbit is much too powerful to allow what Vinge wants to happen next.

No, you can't actually "revoke" a root CA, the decision to trust (or not) a root is local. So this part of the novel is a fantasy. But even if you assume it means that the European authorities can somehow reach into Credit Suisse and cause it to revoke all the intermediates (which maybe is a plausible reading) and so on down to end entity certificates, that doesn't really work either. Not on the time scale Vinge needs for the novel.

Hours are conceivable but unlikely. Days maybe. A week. But the novel needs it to be seconds.

There are two big obstacles to even the revocation which does really exist. Firstly, humans are much more enthusiastic about seeing Dancing Pigs than they are about safety, because safety is a very abstract idea, whereas seeing dancing pigs is an immediate reward. This is the Dancing Pigs problem, and we've put some effort in: it's less likely that a random Chrome user would get their face ripped to pieces because they wanted Dancing Pigs and so bypassed the security checks that would have protected them than, say, fifteen years ago, but only somewhat.

Secondly though, there's not a great enthusiasm technically for this sort of counter-measure. It's so rarely beneficial in practice. Most of the time those humans were right, we were just denying them Dancing Pigs. Their face might get ripped to pieces, but to be honest it's as likely to be because they deliberately went to "Rip My Face To Pieces.example" as through anything we could have prevented. This is only barely a technical problem. So, when there are things we could do to get closer to what's in the novel, why would we?

Building the PKI which exists in Vinge's novel is probably a bad expenditure of resources.

schoen · on June 23, 2023

That's funny, I read that book before I worked on PKI directly. Maybe I should read it again now that I have more detailed experience (and maybe be frustrated by the implausibility that you mention!).

AdamJacobMuller · on June 22, 2023

Did we kill crt.sh?

    FATAL:  terminating connection due to conflict with recovery
    DETAIL:  User query might have needed to see row versions that must be removed.
    CONTEXT:  SQL statement "SELECT c.ID, x509_print(c.CERTIFICATE, NULL, 196608), ca.ID, cac.CA_ID,
         digest(c.CERTIFICATE, 'sha1'::text),
         digest(c.CERTIFICATE, 'sha256'::text),
         x509_serialNumber(c.CERTIFICATE),
         digest(x509_publicKey(c.CERTIFICATE), 'sha256'::text),
         x509_rsamodulus(c.CERTIFICATE),
         x509_hasROCAFingerprint(c.CERTIFICATE),
         x509_hasClosePrimes(c.CERTIFICATE),
         c.CERTIFICATE
        FROM certificate c
         LEFT OUTER JOIN ca ON (c.ISSUER_CA_ID = ca.ID)
         LEFT OUTER JOIN ca_certificate cac
             ON (c.ID = cac.CERTIFICATE_ID)
        WHERE digest(c.CERTIFICATE, 'sha256') = t_bytea"
    PL/pgSQL function web_apis(text,text[],text[]) line 1757 at SQL statement
    ERROR:  server conn crashed?
    server closed the connection unexpectedly
     This probably means the server terminated abnormally
     before or while processing the request.

and now it's just a 502 error!

agwa · on June 22, 2023

Unfortunately, crt.sh is chronically overloaded.

AdamJacobMuller · on June 22, 2023

I've never seen it happen before, but, you would know better!

faangsticle · on June 23, 2023

Are there other ways to get these logs?

agwa · on June 23, 2023

You can query logs directly using the API described in RFC 6962: https://datatracker.ietf.org/doc/html/rfc6962#section-4

You'll need a list of logs to query. Chrome publishes their log list at: https://www.gstatic.com/ct/log_list/v3/log_list.json

My company offers a higher-level API for querying by domain name: https://sslmate.com/ct_search_api/
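
For anyone who wants to try the RFC 6962 route, here is a small sketch that builds request URLs from a parsed log list. It assumes the log list JSON has a top-level `operators` array whose entries each contain a `logs` array with a `url` field. The helper names are made up for illustration, and fetching and parsing the responses is left out.

```python
from urllib.parse import urlencode


def log_urls(log_list):
    """Yield the base URL of every log in a parsed log list (assumed layout)."""
    for operator in log_list.get("operators", []):
        for log in operator.get("logs", []):
            yield log["url"]


def get_sth_url(base):
    # RFC 6962 section 4.3: retrieve the latest signed tree head.
    return base.rstrip("/") + "/ct/v1/get-sth"


def get_entries_url(base, start, end):
    # RFC 6962 section 4.6: retrieve entries start..end (inclusive, 0-based).
    query = urlencode({"start": start, "end": end})
    return base.rstrip("/") + "/ct/v1/get-entries?" + query
```

Logs may return fewer entries than requested, so a real client would need to loop until it has covered the range it wants.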

AdamJacobMuller · on June 23, 2023

I haven't found any, yet. I would love to have a list of domains affected by this to cross-check that none of my issued certificates were affected by this.

schoen · on June 23, 2023

The list of all affected SHA256 fingerprints is in https://bug1838667.bmoattachments.org/attachment.cgi?id=9340...

You can get the SHA256 fingerprint for your certificate by running

  openssl x509 -in mycert.pem -sha256 -fingerprint -noout

If you don't like the format,

  openssl x509 -in mycert.pem -sha256 -fingerprint -noout | cut -d= -f2 | tr -d : | tr A-F a-f

will match the format in the list of affected certificates more closely.

If you need to do this against a web server and don't already have a copy of the certificate locally, you can use something like

  echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null <&- | openssl x509 -sha256 -fingerprint -noout | cut -d= -f2 | tr -d : | tr A-F a-f

(This example outputs the actual SHA256 fingerprint for the real domain example.com, which is not affected.)
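
If you have many certificates to check, the same comparison can be scripted. This is a sketch using only the Python standard library. It assumes each PEM input holds a single certificate (not a full chain) and that the list of affected fingerprints is plain hex, one per line; the function names are hypothetical.

```python
import hashlib
import ssl


def sha256_fingerprint(pem_text):
    """Return the lowercase hex SHA-256 of the certificate's DER encoding."""
    # Assumes pem_text contains exactly one certificate, not a chain.
    der = ssl.PEM_cert_to_DER_cert(pem_text)
    return hashlib.sha256(der).hexdigest()


def load_fingerprints(lines):
    """Normalize an iterable of hex fingerprint lines into a set."""
    return {line.strip().lower() for line in lines if line.strip()}


def is_affected(pem_text, affected):
    """True if the certificate's fingerprint appears in the affected set."""
    return sha256_fingerprint(pem_text) in affected
```

Typical use would be to open the downloaded list, pass the file object to `load_fingerprints`, then call `is_affected` on the contents of each certificate file.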

agwa · on June 23, 2023

Here's a list of affected DNS names: https://gist.github.com/AGWA/5b02c2bb07fc847733fa2a5c1931c4f...

AdamJacobMuller · on June 23, 2023

Thank you and much appreciated, fortunately had no affected certs. I guess I need to spend some time implementing ARI :)

ElongatedMusket · on June 22, 2023

Thanks for following through on this writeup! I knew LE certs were publicly logged but didn't know the logs were decentralized or how they hold the CA accountable. Appreciate the layman explanation.

francislavoie · on June 22, 2023

FWIW, if those websites used Caddy as their ACME client, then it would have detected the certificate being revoked as soon as possible via OCSP stapling and would have had the certificate renewed. It's a shame that other ACME clients aren't as robust to problems like this. (Disclaimer: I work on Caddy as a volunteer)

agwa · on June 22, 2023

Note that the certificates were not revoked until 2023-06-19 at 18:00. In contrast, ARI was updated on 2023-06-15 at 22:43 to tell ARI-supporting clients (such as lego) to renew immediately. That means Caddy served broken certificates for almost 4 days longer than necessary.

Are there plans for Caddy to support ARI?

francislavoie · on June 22, 2023

> Note that the certificates were not revoked until 2023-06-19 at 18:00.

Ah okay, I missed that.

> Are there plans for Caddy to support ARI?

It's... complicated. Matt argues that ARI does not make sense for a variety of reasons. You can find the complex and deep discussions about it on the LE forums. Do a Ctrl+F for ARI in https://community.letsencrypt.org/c/client-dev/14 to find them, there's a lot.

mholt · on June 22, 2023

> That means Caddy served broken certificates for almost 4 days longer than necessary.

This would be news to me. Do you have a source for Caddy serving any of the affected certificates? I'd like as much info as possible.

> Are there plans for Caddy to support ARI?

If ARI can be made into an effective mechanism, then yes. ACMEz already supports the current draft.

I know Francis linked to a forum category, here's some more specific links for background:

- https://community.letsencrypt.org/t/can-ari-conforming-clien...

- https://community.letsencrypt.org/t/thoughts-from-starting-t...

agwa · on June 22, 2023

>This would be news to me. Do you have a source for Caddy serving any of the affected certificates? I'd like as much info as possible.

That's news to you? I informed you last week that Caddy would serve broken certificates in this situation: https://news.ycombinator.com/item?id=36344549

I omitted "would" from my previous comment, but I think it's pretty clear from Francis' comment that we're discussing a hypothetical situation, and neither of us know if any of the 645 affected certificates were requested by Caddy or not.

I skimmed the forum links (it would be productive if you could send a email summarizing your thoughts to the IETF ACME WG) and it seems like your complaints could also be said of OCSP so it's hard to figure out why OCSP is OK for Caddy but ARI isn't.

FWIW, there's currently a ballot in the CABF which would make OCSP optional for CAs, so OCSP may be on the way out in the WebPKI.

mholt · on June 22, 2023

You said:

> Caddy served broken certificates

So yes, that would be news to me. I'm asking for more information. If Caddy did not serve broken certificates, then I would appreciate clarification there so I know where to spend my energy.

> (it would be productive if you could send a email summarizing your thoughts to the IETF ACME WG)

I did this once and it was like talking into a black hole. All the responses I got to the issue I brought up were laced with complacency.

> I skimmed the forum links and it seems like your complaints could also be said of OCSP so it's hard to figure out why OCSP is OK for Caddy but ARI isn't.

Because OCSP does what it's intended to do. ARI does not.

> FWIW, there's currently a ballot in the CABF which would make OCSP optional for CAs, so OCSP may be on the way out in the WebPKI.

I am tracking that proposal and get daily notifications. It is only for short-lived certs. I would be thrilled if we could replace revocation -- and OCSP -- with short-lived certs.

agwa · on June 22, 2023

> So yes, that would be news to me. I'm asking for more information. If Caddy did not serve broken certificates, then I would appreciate clarification there so I know where to spend my energy.

This is not engaging in good faith.

> I am tracking that proposal and get daily notifications. It is only for short-lived certs.

It would make OCSP optional for all certificates. CRLs would be optional only for short-lived certs.

mholt · on June 22, 2023

> This is not engaging in good faith.

Sorry, come again? Why so combative?

TheDong · on June 23, 2023

Your response there comes off poorly to me too.

When I read the thread in context, it's clear that the response is within the hypothetical raised in the very first comment "if those websites used Caddy", that hypothetical.

The response has the understood "No, even in that hypothetical, this is the case", and doesn't explicitly say it's in the hypothetical, but in context it clearly is.

Your first response to that, missing the context and asking for "more info", well, miscommunications happen, that's fine.

What seems obviously not in good faith is that the parent commenter clearly then explains themselves with "we're discussing a hypothetical situation", and you ignored that, and responded as if they hadn't explained it.

mholt · on June 23, 2023

The whole thread is confusing then. I definitely didn't read it as hypothetical, especially since:

> What seems obviously not in good faith is that the parent commenter clearly then explains themselves with "we're discussing a hypothetical situation", and you ignored that, and responded as if they hadn't explained it.

No, @agwa replied directly with a very non-hypothetical response: "That's news to you? I informed you last week that Caddy would serve broken certificates in this situation," implying that the conversation is not being carried hypothetically.

teraflop · on June 23, 2023

The only way I can understand your confusion is if you stopped reading at that point, and completely missed the sentence immediately following the one you just quoted:

> I omitted "would" from my previous comment, but I think it's pretty clear from Francis' comment that we're discussing a hypothetical situation, and neither of us know if any of the 645 affected certificates were requested by Caddy or not.

fruitreunion1 · on June 22, 2023

Will non-browser clients like curl/requests ever support checking CT logs? It's great that some browsers have it, but browsers are not the only clients using TLS with CAs. Also doesn't help that a lot of software can't use CA root stores with much granularity: https://news.ycombinator.com/item?id=33876949

agwa · on June 22, 2023

Hopefully, although there are challenges to overcome. CT is a fast-moving ecosystem, with logs coming and going, and policies changing regularly. This requires CT-enforcing clients to be very on-the-ball with updates, both in the sense that the developers need to pay attention and update their code in time, and any users of the apps need to upgrade frequently. Browser makers can handle this because they are competently-staffed and well-resourced. The authors of non-browser apps need to know what they're getting into.

A cautionary tale: there is a library for adding CT enforcement to Android apps. Earlier this year, every app using this library was suddenly unable to establish any TLS connections because Google stopped publishing a JSON file which the library should never have been consuming in the first place. There was plenty of warning that this would happen, but the author of the library was not on-the-ball. https://groups.google.com/g/certificate-transparency/c/38Lr9...

NovemberWhiskey · on June 22, 2023

The elephant in the room is that TLS implementations for browsers and those in the libraries of common programming languages have diverged really substantially: Web PKI is massively more restrictive and depends on a bunch of technology that's not in the baseline PKI.

mjw1007 · on June 22, 2023

> these certificates were already being rejected by Chrome and Safari for having invalid SCTs

What's a good way to make an equivalent check from a script, if I want (in future) to be able to check whether I have a website whose certificate has such a problem?

agwa · on June 22, 2023

Excellent question! The sctcheck command from https://github.com/google/certificate-transparency-go/ can be used to check the signatures of the embedded SCTs in a certificate.

I've also got an online tool which you can use to test a site for CT policy compliance: https://sslmate.com/labs/ct_policy_analyzer/

Example of a working site: https://sslmate.com/labs/ct_policy_analyzer/?sslmate.com

Example of one of the sites affected by the Let's Encrypt incident: https://sslmate.com/labs/ct_policy_analyzer/?thecandyshake.c...

jimmyl02 · on June 22, 2023

This is a great writeup and intro to certificate transparency overall. Glad to see that certificate authorities are being held accountable, and to learn more about how it's done!

bullen · on June 23, 2023

Use https://datatracker.ietf.org/doc/html/rfc2289 instead.

And please don't down vote without comment.