
For those not aware: if you create too many resources, you can easily use up all of the 8GB hard-coded maximum size in etcd, which causes a cluster failure. Compaction and maintenance mitigate this risk somewhat, but it takes just one misbehaving operator or integration (e.g. hundreds of thousands of Dex session resources created for Pingdom/crawlers) to mess everything up. Backups of etcd are critical. That Dex example is why I stopped using it for my IdP.
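To make the "compaction and maintenance" part concrete: the routine is to compact old revision history and then defragment to reclaim the space on disk. A rough sketch with the official Go client (clientv3), assuming a single local endpoint; the endpoint and timeouts are illustrative:

  package main

  import (
    "context"
    "fmt"
    "log"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
  )

  func main() {
    cli, err := clientv3.New(clientv3.Config{
      Endpoints:   []string{"localhost:2379"}, // assumed local cluster
      DialTimeout: 5 * time.Second,
    })
    if err != nil {
      log.Fatal(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    // See how close the backend is to the quota.
    status, err := cli.Status(ctx, "localhost:2379")
    if err != nil {
      log.Fatal(err)
    }
    fmt.Printf("db size: %d bytes, revision: %d\n", status.DbSize, status.Header.Revision)

    // Compact away revision history up to the current revision...
    if _, err := cli.Compact(ctx, status.Header.Revision); err != nil {
      log.Fatal(err)
    }
    // ...then defragment to actually release the space on disk.
    if _, err := cli.Defragment(ctx, "localhost:2379"); err != nil {
      log.Fatal(err)
    }
  }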




This is why I’ve always thought Tekton was a strange project. It feels inevitable that if you buy into Tekton CI/CD you will hit issues with etcd scaling due to the sheer number of resources you can wind up with.

What boundaries does this 8GB etcd limit cut across? We've been using Tekton for years now, but each pipeline exists in its own namespace and that namespace is deleted after each build. Presumably that kind of wholesale cleanup keeps the DB size in check, because we've never had a problem with etcd size...

We allocate several hundred resources for each build and do hundreds of builds a day. The current cluster has been doing this for a couple of years now.
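For anyone curious how that cleanup looks in practice, here's a rough sketch with client-go, assuming in-cluster credentials and a hypothetical per-build namespace name; deleting the namespace cascades to every namespaced resource the build created:

  package main

  import (
    "context"
    "log"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
  )

  func main() {
    // Assumes this runs in-cluster with RBAC permission to delete namespaces.
    cfg, err := rest.InClusterConfig()
    if err != nil {
      log.Fatal(err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
      log.Fatal(err)
    }

    // "build-1234" is a hypothetical per-pipeline namespace; deleting it
    // garbage-collects all the per-build resources in one shot.
    err = client.CoreV1().Namespaces().Delete(context.Background(), "build-1234", metav1.DeleteOptions{})
    if err != nil {
      log.Fatal(err)
    }
  }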


Yeah, I mean if you're deleting namespaces after each run then sure, that may solve it. They also have a pruner now that you can enable to set retention periods for pipeline runs.

There are also some issues with large Results, though I think you have to enable that manually. From their site:

> CAUTION: the larger you make the size, more likely will the CRD reach its max limit enforced by the etcd server leading to bad user experience.

And then if you use Chains you’re opening up a whole other can of worms.

I contracted with a large institution that was moving all of their CI/CD to Tekton, and they hit etcd scaling issues pretty early in the process and had to get Red Hat to address some of them. If RH couldn't get them addressed, they were going to scrap the whole project.


Yeah, quite unfortunate. But maybe there is hope: k3s uses Kine, an etcd translation layer for relational databases, and there is another project called Netsy which persists to S3: https://nadrama.com/netsy. Some interesting ideas. Hopefully native Postgres support gets added, since it's so ubiquitous and performant.

It's not hardcoded; you can increase it via the --quota-backend-bytes flag.
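If you run etcd embedded, the same knob is a plain config field. A rough sketch with etcd's embed package; the data dir and the 16GB figure are illustrative, not recommendations:

  package main

  import (
    "log"

    "go.etcd.io/etcd/server/v3/embed"
  )

  func main() {
    cfg := embed.NewConfig()
    cfg.Dir = "/tmp/etcd-data" // assumed scratch data directory
    // Equivalent of the --quota-backend-bytes flag: raise the quota to 16GB.
    cfg.QuotaBackendBytes = 16 * 1024 * 1024 * 1024

    e, err := embed.StartEtcd(cfg)
    if err != nil {
      log.Fatal(err)
    }
    defer e.Close()

    <-e.Server.ReadyNotify() // block until the server is serving
    log.Println("etcd is ready")
    <-e.Err() // run until a fatal error
  }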

There is a hard-coded warning which says safety is not guaranteed above 8GB. I have tried increasing the quota after a database had become full and etcd didn't start. It's definitely not a recovery strategy for a full etcd by itself; maybe it's a way to eke out a slightly larger margin of safety.
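Part of why raising the flag alone doesn't recover a full database: once the quota is exceeded, etcd raises a persistent NOSPACE alarm that keeps rejecting writes until it is explicitly disarmed. A rough sketch of the disarm step with clientv3, assuming a reachable endpoint (normally you'd compact and defragment first):

  package main

  import (
    "context"
    "log"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
  )

  func main() {
    cli, err := clientv3.New(clientv3.Config{
      Endpoints:   []string{"localhost:2379"}, // assumed endpoint
      DialTimeout: 5 * time.Second,
    })
    if err != nil {
      log.Fatal(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // List active alarms (e.g. NOSPACE after the quota was exceeded)...
    alarms, err := cli.AlarmList(ctx)
    if err != nil {
      log.Fatal(err)
    }
    // ...and disarm each one so the cluster accepts writes again.
    for _, a := range alarms.Alarms {
      if _, err := cli.AlarmDisarm(ctx, a); err != nil {
        log.Fatal(err)
      }
    }
  }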

This warning seems to be outdated. We have run etcd at much larger volumes without issues (at least none related to its size). Alibaba has been running 100GB etcd clusters for a while now, and probably others have too.

Thank you for the update


