My current company is split... maybe 75/25 (at this point) between Kubernetes and a bespoke, Ansible-driven deployment system that manually runs Docker containers on nodes in an AWS ASG and takes care of deregistering/reregistering the nodes with the ALB while the containers on a given node are getting futzed with. The Ansible method works remarkably well for its age, but the big thing I use to convince teams to move to Kubernetes is that we can take your peak deploy times from, say, a couple hours down to a few minutes, and you can autoscale far faster and more efficiently than you can with CPU-based scaling on an ASG.
From service teams that have done the migrations, though, the things I consistently hear are:
- when a Helm deploy fails, finding the reason why is a PITA (we run with --atomic so it'll roll back on a failed deploy. What failed? Was it bad code causing a pod to crash loop? A failed k8s resource create? who knows! have fun finding out!)
- they have to learn a whole new way of operating, particularly around in-the-moment scaling. A team today can go into the AWS Console at 4am during an incident and change the ASG scaling targets, but to do that with a service running in Kubernetes means making sure they have kubectl (and its deps, for us that's aws-cli) installed and configured, AND remembering the `kubectl scale deployment <name> --replicas=<n>` syntax. (Rough sketch of both the failure triage and the 4am scale-up below.)
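For what it's worth, the --atomic forensics usually comes down to the same handful of commands, and the 4am scale-up is two lines once kubectl is set up. A rough sketch, with made-up release/cluster/namespace names:

```sh
# what Helm thought happened, and which revision it rolled back to
helm history myapp -n myapp
helm status myapp -n myapp

# recent events usually point at the failing resource (crash loop, failed create, quota, ...)
kubectl get events -n myapp --sort-by=.lastTimestamp | tail -20
kubectl get pods -n myapp
kubectl describe pod <failing-pod> -n myapp
kubectl logs <failing-pod> -n myapp --previous

# the 4am "just give me more capacity" equivalent of bumping the ASG targets
aws eks update-kubeconfig --name my-cluster --region us-east-1
kubectl scale deployment myapp -n myapp --replicas=10
```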
The problem with bespoke, homegrown, and DIY isn't that the solutions are bad. Often, they are quite good—excellent, even, within their particular contexts and constraints. And because they're tailored and limited to your context, they can even be quite a bit simpler.
The problem is that they're custom and homegrown. Your organization alone invests in them, trains new staff in them, is responsible for debugging and fixing them when they break, and has to re-invest when they no longer do all the things you want. DIY frameworks ultimately end up as byzantine and labyrinthine as Kubernetes itself. The virtue of industry platforms like Kubernetes is that, however complex and half-baked they start out, over time the entire industry trains on them, invests in them, refines and improves them. They benefit from a long-term economic virtuous cycle that DIY rarely if ever can. Even the longest, strongest, best-funded holdouts for bespoke languages, OSs, and frameworks (aerospace, finance, miltech) have largely come 'round to COTS first and foremost.
Personally, I don't like Helm. I think for the vast majority of use cases, where all you need is some simple templating/substitution, it introduces way more complexity and abstraction than it is worth.
I've been really happy with just using `envsubst` and environment variables to generate a manifest at deploy time. It's easy with most CI systems to "archive" the manifest, and it can then be easily read by a human or downloaded and applied manually for debugging. Deploys are also just `cat k8s/${ENV}/deploy.yaml | envsubst > output.yaml && kubectl apply -f output.yaml`
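A minimal sketch of the whole flow, assuming the CI job exports the variables the template references (the names here are made up); passing envsubst an explicit variable list keeps it from clobbering any other `$` in the manifest:

```sh
# values the template references, exported by CI
export IMAGE_TAG="$(git rev-parse --short HEAD)"
export REPLICAS=4

# substitute only the listed variables, leave everything else untouched
envsubst '${IMAGE_TAG} ${REPLICAS}' < "k8s/${ENV}/deploy.yaml" > output.yaml

# archive output.yaml as a CI artifact, then apply it
kubectl apply -f output.yaml
```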
I've also experimented with using terraform. It's actually been a good enough experience that I may go fully with terraform on a new project and see how it goes.
You might like Kubernetes' Kustomize if you don't care for helm (IMO, just embrace helm, you can keep your charts very simple and it's straightforward). Kustomize takes a little getting used to, but it's a nice abstraction and widely used.
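For anyone weighing it: Kustomize is a base directory plus per-environment overlays, and it's built into kubectl, so there's nothing extra to install. A quick sketch with a made-up layout:

```sh
# hypothetical layout:
#   k8s/base/           deployment.yaml, service.yaml, kustomization.yaml
#   k8s/overlays/prod/  kustomization.yaml with replica/image/env patches
#
# render what would be applied (handy for review or archiving)
kubectl kustomize k8s/overlays/prod > output.yaml

# or apply the overlay directly
kubectl apply -k k8s/overlays/prod
```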
I cannot recommend terraform. I use it daily, and daily I wish I did not. I think Pulumi is the future. Not as battle tested, but terraform is a mountain of bugs anyway, so it can't possibly be worse.
Just one example where terraform sucks: you cannot deploy a kubernetes cluster (say an EKS/AKS cluster) and then use the kubernetes provider's `kubernetes_manifest` resource against it in the same workspace. You have to split it across two separate terraform runs.
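Concretely, the workaround ends up being two root modules applied back to back, something like this (directory names made up):

```sh
# first run: stand up the EKS/AKS cluster itself
terraform -chdir=infra/cluster apply

# second run: a separate root module whose kubernetes provider points at the
# now-existing cluster; this is where the kubernetes_manifest resources live
terraform -chdir=infra/workloads apply
```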
I haven't used kubernetes in a few years, but is there a good UI for operations now? Something like your AWS console example, where you can just log in and scale something in the UI, but for kubernetes. We run something similar on AWS right now: during an incident we log into the account with admin access to modify something, then go back and codify it in the CDK post-incident.
AWS has a UI for the resources in the cluster, but it relies on the IAM role you're using in the console having the right permissions configured inside the cluster, and our AWS SSO setup prevents that from working properly (this isn't usually the case for AWS SSO users; it's a known quirk of our particular auth setup between EKS and IAM -- we'll fix it sometime).
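In a setup where it does work, the plumbing is usually just mapping the console role into the cluster and binding it to the built-in view role. A hedged sketch of the classic aws-auth route (role and group names are made up):

```sh
# map the IAM role your console users assume to a group inside the cluster
kubectl edit -n kube-system configmap/aws-auth
#   mapRoles: |
#     - rolearn: arn:aws:iam::111122223333:role/ConsoleReadOnly
#       username: console-readonly
#       groups:
#         - view-only

# bind that group to the built-in "view" ClusterRole
kubectl create clusterrolebinding console-view --clusterrole=view --group=view-only
```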
I have to say that once you have more buy-in from delivery teams and adoption of HPAs, the system can become more harmonious overall. Each team can monitor and tweak their own services, and many services are connected to something upstream or downstream. When more components can ebb and flow with the compute context, the system overall ebbs and flows better. #my2cents
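For reference, the minimal per-team version of that knob is just an HPA on the deployment, e.g. (name and thresholds made up):

```sh
# scale myapp between 2 and 10 replicas, targeting ~70% average CPU
kubectl autoscale deployment myapp --min=2 --max=10 --cpu-percent=70

# see what it's doing
kubectl get hpa myapp
```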
Both of those things (the opaque --atomic rollbacks and the incident-time scaling workflow) are very much fixable.