> I find it alarming that a week after the incident, 40% of the affected certificates are still in use, despite being rejected by the most popular browsers and despite affected subscribers being emailed by Let's Encrypt.
This is perhaps a consequence of how well-oiled a machine LE typically is: people stop paying attention to it.
I haven't looked at a list of revoked certificates because I was busy (and I no longer operate my own CT auditing software, so I'd have to poke around in crt.sh, which is not much fun), but let's suppose these are a random sample of Let's Encrypt's ~2 million issuances per day.
What %age of the world's HTTPS web sites are "parked", so there is nobody who expects them to actually work? BrandFromATVShow.example? TeenDanceISawOnTikTok.example? SomeShortEnglishWord.example? Nobody cares: if they do visit and there's a certificate failure, they realise that's not where they meant to go and leave.
Then what %age are somebody's fever dream / retirement plan / abandoned start-up idea, so although the owner may eventually notice that it's broken, that might not happen (if ever) before automatic renewal "fixes" the problem anyway? MyTownOlympicSwimmingPool.example, JimAndBethsCakeShop.example and LikeAWSForDogsSomehow.example
And then how about all the outfits which folded weeks, months, even in some cases years ago, but the ISP bill was paid, so, the web site continues to exist until somebody removes it, but of course nobody cares ? BoughtByGoogle.example and YetAnotherBayAreaCryptoStartup.example together with DefinitelyViableProduct.example and OopsWalmartAlreadySellsThatForLessMoney.example
If it was 95% I'd be more worried, at 40% I'd need to actually check at least a decent sample and see for myself. In the time I was writing this post I checked one, it wasn't replaced... exactly, because the actual web site uses a certificate issued five days earlier. Chances are they've got a bunch of duplicate certificates, so the fact that some they don't use are broken has never come up - that's just rude (wastes other people's resources) but it works fine technically.
When we wrote Certbot, we thought (by analogy with prior practice) that many sysadmins would want to manually inspect certificates before deploying them! That's one reason that we kept old certificates around and used a symlink-updating system.
As it turned out, misissued and invalid certs account for an incredibly small fraction of Let's Encrypt's issuance volume (I'm going to say < 1/10⁸ offhand?) and manual inspection kind of gets in the way of automation, so the idea of separating these steps has come to seem kind of quaint, for me at least. I've also helped thousands of people on the Let's Encrypt forum and I think at most 2 have said they were interested in looking at their new certs' contents before starting to use them.
Or clients with poor handling of dates. It's been a while, but Nokia Series 40 was really bad at this. As I recall, it would read the not-before as if the time was specified in local time. So "not before noon UTC" becomes "not before noon wherever you are", so better wait a while for users in the Americas.
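To make that failure mode concrete, here is a minimal sketch of what happens when a client reinterprets a UTC not-before as local time. The function name, the example date and the UTC-5 offset are all made up for illustration; this is not how any particular handset was implemented.

```python
from datetime import datetime, timedelta, timezone


def misread_not_before(not_before_utc: datetime, client_offset: timedelta) -> datetime:
    """Return the UTC instant at which a buggy client thinks the cert becomes valid.

    The bug modelled here: the client keeps the wall-clock fields of the
    not-before value but treats them as local time instead of UTC.
    """
    local_tz = timezone(client_offset)
    # Same wall-clock reading, wrong time zone attached.
    misread = not_before_utc.replace(tzinfo=local_tz)
    return misread.astimezone(timezone.utc)


# Hypothetical certificate valid from noon UTC.
not_before = datetime(2023, 6, 15, 12, 0, tzinfo=timezone.utc)

# Hypothetical client five hours west of UTC.
delay = misread_not_before(not_before, timedelta(hours=-5)) - not_before
print(delay)  # 5:00:00
```

A client east of UTC gets a negative delay under the same bug, so it would accept the certificate early rather than late; only users west of UTC see a brand-new certificate rejected.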
Yeah, we originally were thinking of having separate times for renewal and deployment (e.g. 4 weeks before expiry and 3 weeks before expiry), but there was a countervailing concern that people would think "wait, my certificate has already been renewed but I can't see it on my site!" and get confused or alarmed.
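For what it's worth, the split schedule is just two offsets from the expiry date. A rough sketch, using the 4-week and 3-week figures from the comment above; the constant names, the function and the expiry date are hypothetical, and this is not something Certbot actually does:

```python
from datetime import datetime, timedelta, timezone

# Offsets taken from the example in the comment; the names are made up.
RENEW_BEFORE = timedelta(weeks=4)
DEPLOY_BEFORE = timedelta(weeks=3)


def renewal_schedule(not_after: datetime) -> tuple[datetime, datetime]:
    """Return (renew_at, deploy_at) for a certificate expiring at not_after."""
    return not_after - RENEW_BEFORE, not_after - DEPLOY_BEFORE


# Hypothetical expiry date.
not_after = datetime(2023, 9, 13, tzinfo=timezone.utc)
renew_at, deploy_at = renewal_schedule(not_after)
print(renew_at.date(), deploy_at.date())  # 2023-08-16 2023-08-23
```

The week between the two dates is the window in which the new certificate exists on disk but is not yet being served, which is exactly the state that might confuse people.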
Based on my experience, the capability model for certificate management usually went like:
1) Chaos: certificates requested and installed manually, either in response to incidents caused by expiration or calendar reminders
2) Monitoring: certificates requested and installed manually, in response to noisy alerting by probers looking for indications of pending expiration or other ill-health
3) Automation: continuous certificate provisioning, distribution and enablement either through platform or integration
The Let's Encrypt revolution has taken a lot of people from stage 1 to stage 3 without stage 2 in between.
That is true, but how come certbot had no awareness of revoked/withdrawn certificates before now? It seems like one of the things a CA is supposed to solve for you, and the fact that it doesn't is a bit alarming in itself.
Though, as the following sentence points out, they were already working on it before the outage, so clearly they knew it was needed.
The certificate authority signs certificate requests, creating certificates. The revocation process is necessary as well, but the CA doesn't have the ability to change the already issued certificate, thus it cannot take action.
Software like certbot can solve it for you, but that's not affiliated with your CA.
Even before ARI, some integrated ACME/Web servers use OCSP as a way of knowing to renew if a cert was revoked. Plus if you're doing that you can pin the OCSP response while you're at it.
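Roughly, the client-side logic being described looks like the sketch below. It assumes the OCSP response has already been fetched and reduced to a status; the enum, the function name and the 30-day threshold are illustrative assumptions, not any particular server's implementation.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class OcspStatus(Enum):
    """Simplified view of an OCSP certificate status (illustrative only)."""
    GOOD = "good"
    REVOKED = "revoked"
    UNKNOWN = "unknown"


def should_renew(status: OcspStatus, not_after: datetime, now: datetime,
                 renew_before: timedelta = timedelta(days=30)) -> bool:
    """Decide whether to request a replacement certificate.

    Renew immediately if the CA reports the certificate as revoked;
    otherwise fall back to the usual "close to expiry" rule.
    """
    if status is OcspStatus.REVOKED:
        return True
    return now >= not_after - renew_before


# Hypothetical dates: a certificate with roughly two months left.
now = datetime(2023, 6, 20, tzinfo=timezone.utc)
not_after = datetime(2023, 8, 20, tzinfo=timezone.utc)

print(should_renew(OcspStatus.GOOD, not_after, now))     # False
print(should_renew(OcspStatus.REVOKED, not_after, now))  # True
```

Note that this only reacts once the CA has actually revoked the certificate, so it can't act on a problem the CA has announced but not yet revoked for.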
My point was that the CA can't solve it for you, they can only give you APIs and processes with which you can solve it yourself.
If your webserver supports checking the certificate validity then it's not solved by the CA, it's been solved by the developers of that software and by you installing it.
Vernor Vinge has dominated the Singularity space in science fiction pretty much from the beginning of the concept.
Rainbows End plays around in a time frame right around where we are now, just a bit before the sorts of doglegs we predict would presage a Singularity in your lifetime.
At one point the protagonists need to attack a bad actor, and to make it work they need chaos on the internet. I don’t recall exactly how this plays out, but the way they decide to achieve it is that one of the collaborators believes that they can reject a CA cert that affects 10% of all certificates in the wild, and the resulting pandemonium will give them approximately the sort of chaos they need.
Sounds to me like maybe that is either no longer true, or never was.
They don't need Chaos. They want to disable Rabbit, and they know Rabbit's certificates mostly tie back to a single CA, Credit Suisse. So they "revoke" Credit Suisse and accept the consequences, which (they acknowledge) are career ending for the Europeans. This is mostly a plot convenience because Rabbit is much too powerful to allow what Vinge wants to happen next.
No, you can't actually "revoke" a root CA, the decision to trust (or not) a root is local. So this part of the novel is a fantasy. But even if you assume it means that the European authorities can somehow reach into Credit Suisse and cause it to revoke all the intermediates (which maybe is a plausible reading) and so on down to end entity certificates, that doesn't really work either. Not on the time scale Vinge needs for the novel.
Hours are conceivable but unlikely. Days maybe. A week. But the novel needs it to be seconds.
There are two big obstacles to even the revocation which does really exist. Firstly, humans are much more enthusiastic about seeing Dancing Pigs than they are about safety, because safety is a very abstract idea, whereas seeing dancing pigs is an immediate reward. This is the Dancing Pigs problem, and we've put some effort in: a random Chrome user is less likely than, say, fifteen years ago to get their face ripped to pieces because they wanted Dancing Pigs and so bypassed the security checks that would have protected them, but only somewhat.
Secondly though, there's not great enthusiasm technically for this sort of counter-measure. It's so rarely beneficial in practice. Most of the time those humans were right, we were just denying them Dancing Pigs. Their face might get ripped to pieces, but to be honest it's as likely to be because they deliberately went to "Rip My Face To Pieces.example" as through anything we could have prevented. This is only barely a technical problem. So, when there are things we could do to get closer to what's in the novel, why would we?
Building the PKI which exists in Vinge's novel is probably a bad expenditure of resources.
That's funny, I read that book before I worked on PKI directly. Maybe I should read it again now that I have more detailed experience (and maybe be frustrated by the implausibility that you mention!).
FATAL: terminating connection due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
CONTEXT: SQL statement "SELECT c.ID, x509_print(c.CERTIFICATE, NULL, 196608), ca.ID, cac.CA_ID,
digest(c.CERTIFICATE, 'sha1'::text),
digest(c.CERTIFICATE, 'sha256'::text),
x509_serialNumber(c.CERTIFICATE),
digest(x509_publicKey(c.CERTIFICATE), 'sha256'::text),
x509_rsamodulus(c.CERTIFICATE),
x509_hasROCAFingerprint(c.CERTIFICATE),
x509_hasClosePrimes(c.CERTIFICATE),
c.CERTIFICATE
FROM certificate c
LEFT OUTER JOIN ca ON (c.ISSUER_CA_ID = ca.ID)
LEFT OUTER JOIN ca_certificate cac
ON (c.ID = cac.CERTIFICATE_ID)
WHERE digest(c.CERTIFICATE, 'sha256') = t_bytea"
PL/pgSQL function web_apis(text,text[],text[]) line 1757 at SQL statement
ERROR: server conn crashed?
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
I haven't found any yet. I would love to have a list of the affected domains so I can cross-check that none of my issued certificates were affected.
Thanks for following through on this writeup! I knew LE certs were publicly logged but didn't know the logs were decentralized or how they hold the CA accountable. Appreciate the layman explanation.
FWIW, if those websites used Caddy as their ACME client, then it would have detected the certificate being revoked as soon as possible via OCSP stapling and would have had the certificate renewed. It's a shame that other ACME clients aren't as robust to problems like this. (Disclaimer: I work on Caddy as a volunteer)
Note that the certificates were not revoked until 2023-06-19 at 18:00. In contrast, ARI was updated on 2023-06-15 at 22:43 to tell ARI-supporting clients (such as lego) to renew immediately. That means Caddy served broken certificates for almost 4 days longer than necessary.
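For the record, the "almost 4 days" figure is just the difference between those two timestamps. A quick check, assuming both times are in the same zone (treated as UTC here, which the comment doesn't actually state):

```python
from datetime import datetime, timedelta, timezone

# Timestamps from the comment above; UTC is an assumption.
ari_updated = datetime(2023, 6, 15, 22, 43, tzinfo=timezone.utc)
revoked = datetime(2023, 6, 19, 18, 0, tzinfo=timezone.utc)

# How much earlier an ARI-aware client could have renewed,
# compared with one that waits for revocation to show up in OCSP.
gap = revoked - ari_updated
gap_hours = gap / timedelta(hours=1)

print(gap)  # 3 days, 19:17:00
```

That is a little over 91 hours during which an ARI-supporting client would already have been told to renew.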
> Note that the certificates were not revoked until 2023-06-19 at 18:00.
Ah okay, I missed that.
> Are there plans for Caddy to support ARI?
It's... complicated. Matt argues that ARI does not make sense for a variety of reasons. You can find the complex and deep discussions about it on the LE forums. Do a Ctrl+F for ARI in https://community.letsencrypt.org/c/client-dev/14 to find them, there's a lot.
I omitted "would" from my previous comment, but I think it's pretty clear from Francis' comment that we're discussing a hypothetical situation, and neither of us know if any of the 645 affected certificates were requested by Caddy or not.
I skimmed the forum links (it would be productive if you could send an email summarizing your thoughts to the IETF ACME WG), and it seems like your complaints could also be said of OCSP, so it's hard to figure out why OCSP is OK for Caddy but ARI isn't.
FWIW, there's currently a ballot in the CABF which would make OCSP optional for CAs, so OCSP may be on the way out in the WebPKI.
So yes, that would be news to me. I'm asking for more information. If Caddy did not serve broken certificates, then I would appreciate clarification there so I know where to spend my energy.
> (it would be productive if you could send an email summarizing your thoughts to the IETF ACME WG)
I did this once and it was like talking into a black hole. All the responses I got to the issue I brought up were laced with complacency.
> I skimmed the forum links and it seems like your complaints could also be said of OCSP so it's hard to figure out why OCSP is OK for Caddy but ARI isn't.
Because OCSP does what it's intended to do. ARI does not.
> FWIW, there's currently a ballot in the CABF which would make OCSP optional for CAs, so OCSP may be on the way out in the WebPKI.
I am tracking that proposal and get daily notifications. It is only for short-lived certs. I would be thrilled if we could replace revocation -- and OCSP -- with short-lived certs.
> So yes, that would be news to me. I'm asking for more information. If Caddy did not serve broken certificates, then I would appreciate clarification there so I know where to spend my energy.
This is not engaging in good faith.
> I am tracking that proposal and get daily notifications. It is only for short-lived certs.
It would make OCSP optional for all certificates. CRLs would be optional only for short-lived certs.
When I read the thread in context, it's clear that the response is within the hypothetical raised in the very first comment "if those websites used Caddy", that hypothetical.
The response has the understood "No, even in that hypothetical, this is the case", and doesn't explicitly say it's in the hypothetical, but in context it clearly is.
Your first response to that, missing the context and asking for "more info", well, miscommunications happen, that's fine.
What seems obviously not in good faith is that the parent commenter clearly then explains themselves with "we're discussing a hypothetical situation", and you ignored that, and responded as if they hadn't explained it.
The whole thread is confusing then. I definitely didn't read it as hypothetical, especially since:
> What seems obviously not in good faith is that the parent commenter clearly then explains themselves with "we're discussing a hypothetical situation", and you ignored that, and responded as if they hadn't explained it.
No, @agwa replied directly with a very non-hypothetical response: "That's news to you? I informed you last week that Caddy would serve broken certificates in this situation," implying that the conversation is not being carried hypothetically.
The only way I can understand your confusion is if you stopped reading at that point, and completely missed the sentence immediately following the one you just quoted:
> I omitted "would" from my previous comment, but I think it's pretty clear from Francis' comment that we're discussing a hypothetical situation, and neither of us know if any of the 645 affected certificates were requested by Caddy or not.
Will non-browser clients like curl/requests ever support checking CT logs? It's great that some browsers have it, but browsers are not the only clients using TLS with CAs. Also doesn't help that a lot of software can't use CA root stores with much granularity: https://news.ycombinator.com/item?id=33876949
Hopefully, although there are challenges to overcome. CT is a fast-moving ecosystem, with logs coming and going, and policies changing regularly. This requires CT-enforcing clients to be very on-the-ball with updates, both in the sense that the developers need to pay attention and update their code in time, and any users of the apps need to upgrade frequently. Browser makers can handle this because they are competently-staffed and well-resourced. The authors of non-browser apps need to know what they're getting into.
A cautionary tale: there is a library for adding CT enforcement to Android apps. Earlier this year, every app using this library was suddenly unable to establish any TLS connections because Google stopped publishing a JSON file which the library should never have been consuming in the first place. There was plenty of warning that this would happen, but the author of the library was not on-the-ball. https://groups.google.com/g/certificate-transparency/c/38Lr9...
The elephant in the room is that TLS implementations for browsers and those in the libraries of common programming languages have diverged really substantially: Web PKI is massively more restrictive and depends on a bunch of technology that's not in the baseline PKI.
> these certificates were already being rejected by Chrome and Safari for having invalid SCTs
What's a good way to make an equivalent check from a script, if I want (in future) to be able to check whether I have a website whose certificate has such a problem?
This is a great writeup and intro to certificate transparency overall. Glad to see that certificate authorities are being held accountable, and to learn more about how it's done!