I think an underestimated issue with k8s (et al) is on a cultural level. Once you let in complex generic things, it doesn't stop there. A chain reaction has started, and before you know it, you've got all kinds of components reinforcing each other, that are suddenly required due to some real, or just perceived, problems that are only there in the first place because of a previous step in the chain reaction.
I remember back when the Cloud first started getting a foothold that what drew people to it was that it would reduce the complexity of managing the most frustrating things, like the load balancer and the database, albeit at a price of course, but it was still worth it.
Stateless app servers, however, were certainly not a large maintenance problem. But somehow we've managed to squeeze things like k8s in there anyway; we just needed to evangelize microservices to create a problem that didn't exist before. Now that this is part of the "culture", it's hard to even get beyond hand-wavy rationalizations that microservices are a must, presumably because they're the initial spark that triggered the whole chain reaction of complexity.
Cloud providers automate things like lease renewals, dealing with customs and part time labor contract compliance disputes for that datacenter in that Asian country that you don't know the language of.
I'm constantly fascinated by how people hand-wavingly underestimate the cost and headaches of actually running on prem global infrastructure.
I’m constantly fascinated by people who think they need on prem global infrastructure when the vast majority of applications either have very loose latency requirements (multiple seconds) or have no users outside of the home country.
Two datacenters on opposite sides of the US from different providers will get you more uptime than a cloud provider and is super simple.
While some of the complexity goes away when it's on prem in two parts of the US, having to order actual hardware, put it into racks, hire, train, and retain the people there to debug actual hardware issues when they arise, deal with HVAC concerns, etc. is a lot of complexity that's probably completely outside of your core business expertise but that you'll have to spend mental cycles on when actually operating your own data center.
It's totally worth it for some companies to do that, but you need to have some serious size to be concerned with spending your efforts on lowering your AWS bill by introducing details like that into your own organization when you could alternatively spend those dollars to make your core business run better. Usually your efforts are better spent on the latter unless you are Netflix or Amazon or Google.
I recently rented a rack with a telecom and put some of my own hardware in it (it's custom weird stuff with hardware accelerators and all the FIPS 140 level 4 requirements), but even the telecom provider was offering a managed VPS product when I got on the phone with them.
The uptime in these DCs is very good (certainly better than AWS's us-east-1), and you get a very good price with tons of bandwidth. Most datacenter and colo providers can do this now.
I think people believe that "on prem" means actually racking the servers in your closet, but you can get datacenter space with fantastic power, cooling, and security almost anywhere these days.
Think of it as a spectrum. At the top is AWS Lambda or something like it, where you are completely removed from the actual hardware that's running your code.
At the bottom is a free acre of land where you start construction and talk to utilities to get electricity and water there. You build your own data center, hire people to run and extend it, etc.
There is tons of space in between where compromises are made by either paying a provider to do something for you or doing it yourself. Is somebody from the datacenter where you rented a rack or two going in and pressing a reset button after you called them a form of cloud automation? How about renting a root VM at Hetzner? Is that VM on prem? People who paint these tradeoffs in a black and white manner and don't acknowledge that there are different choices for different companies and scenarios are not doing the discussion a service.
On the other hand, somebody who built their business on AppEngine or Cloudflare Workers could look at that other company who is renting a pet pool of EC2 instances and ask if they are even in the cloud or if they are just simulating on-prem.
I think the question people are really interested in is usually "What percentage over my costs would I pay to outsource X?" (where X is some component of the complexity stack)
Which, first order approximated, is a function of (1) how big a company you are (aka "Can you even afford to hire two people to just do X?") and (2) how competitive the market is for X.
Colo and dedicated VMs are so reasonably priced because it's a standardized, highly competitive market.
Similarly, certain managed cloud services are ridiculously expensive because they have a locked-in customer base.
Which would suggest outsourcing components that have maximum vendor competition and standardization, as they're going to be offered at the lowest margin.
There's also a good point here (at least at the top of the stack) about reliability: the top of the spectrum goes down relatively frequently due to its dependencies, but even plain old boring EC2 has much better reliability than services like Lambda.
>I think people believe that "on prem" means actually racking the servers in your closet, but you can get datacenter space with fantastic power, cooling, and security almost anywhere these days.
That's because that is what on prem means. What you're describing is colocating.
When clouds define "on-prem" in opposition to their services (for sales purposes), colo facilities are lumped into that bucket. They're not exactly wrong, except a rack at a colo is an extension of your premises with a landlord who understands your needs.
> having to order actual hardware, putting it into racks, hiring, training, retaining the people there to debug actual hardware issues when they arise, dealing with HVAC concerns, etc is a lot of complexity that's probably completely outside of your core business
Vertical integration is a widely known and understood business strategy - running your own infrastructure lets you reclaim the cloud provider's margins for yourself.
You can do it as a one man band or a huge multinational.
I use hotels and other rental offerings, including the cloud. But when it is advantageous to do so, I buy and own, even though it comes with maintenance burdens.
I would say that a large portion of b2b or internal software falls into this category. If you are building something for a single business that only operates in one jurisdiction, and you don't do i18n, why bother with global distribution? A lot of b2b stuff covers processes that have legal requirements baked in, like tax handling, or other local assumptions.
Are the majority of applications even developed by "companies"? I'm honestly not sure at all, or even how to go about measuring that. By the numbers, most games on steam are developed by individuals or small teams, even if the bulk of sales are driven by games produced by larger companies. I'd imagine the same is true of app stores, too.
Do you know how many things businesses use software for? Look at every local business you go to and enumerate the things they use software to accomplish.
When they pay their local utility bills, is that international? How about paying their rent? How about filing their state taxes? How about ordering from local suppliers?
Very little of the world is international business. That tiny slice that is just dominates the zeitgeist because it’s international.
For some examples of things well known that absolutely don’t need global data centers:
- Airbus and Boeing
- Coca-Cola
- Marriott and Hilton
- the entire US federal government (apart from some maybe military applications)
- McDonald’s
The list goes on forever because it’s literally nearly every business. Unless you’re in real time markets or operating store fronts globally where latency hurts sales, putting up regions all over the world is a complete and utter waste of money.
Making global regions as easy as a click of a button was one of the greatest marketing ploys of cloud providers to date.
“Of course Goodwill needs a Singapore data center!? How will we meet our P99 goals otherwise?”
McDonald's absolutely does need global data centers. Even now the kiosks are frustratingly slow here in Europe, can't imagine what a round-trip to US servers would do.
My siblings work there, and their work information system is not so bad, but it would definitely be totally frustrating if it didn't run on AWS in the EU.
If McD does need global data centers to show the menu on the kiosks, then they should fire their whole IT dept and start from scratch.
There is absolutely zero reason for a kiosk to touch 'a global data center'. No, not even for a payment, because it just asks the payment terminal whether the payment succeeded or not.
The only motivation would be latency, but you could have specialized services that run at the edge, if that's so important, for example if payment verification should take 500ms instead of 3000ms.
But you could also just rewrite the protocol to have less back-and-forth sequential data exchange, which is a smarter approach.
I absolutely disagree. The restaurant managers really shouldn't need to manage servers too. They get the kiosk as a service that they don't need to care about and that is correct.
So McDonald's should be dispatching kiosk admins all around the world? That's very much not eco-friendly. And of course, expensive... And a total nightmare to manage. K8s and AWS are a night walk through the rose garden compared to that.
> No, not even for a payment, because it's just asks the payment terminal if the payment succeeded or not.
Yeah sure, works great as long as the kiosk doesn't crash during the payment.
> The restaurant managers really shouldn't need to manage servers too.
And they don't. Why the hell do you pull this nonsense?
In case you don't know (looks like you don't), these are Windows machines[0]. Any Windows machine is capable of running IIS or even Apache. No need for servers for managers to manage. Just effing serve it locally if you can't provide each McD with a mini box that is managed by central IT.
> should be dispatching kiosk admins all around the world
LOL
> K8s and AWS is a night walk through the rose garden compared to that.
You need a sysadmin to teach you how to build robust and resilient local applications and services without being Eco-unfriendly and without racking billions in AWS bills.
> Yeah sure, works great as long as the kiosk doesn't crash during the payment.
Do you understand that it is the payment terminal that processes the payment, and that you can't offload reading CC/NFC to the 'cloud'?
And finally, there is already 'IT' infrastructure in each McD: Ethernet switches, UPSes, wireless APs and maybe a controller, menu and order screens at the counter, networked PoS and whatever else. If you claim that restaurant managers manage all of these, then you don't really know anything about either retail or IT. Or you're just bickering in bad faith.
[0] and if in your case they are Linux ones... do I even need to continue?
What sysadmin? They don't have any. The kiosks are entirely black boxes managed centrally from the US. They work as long as they have a working network. The store manager can manage them from their iPad lying on a beach. The card terminals send data directly to the bank, so even if the box crashes or burns down mid-transaction, the payment is still recorded.
I built similar kind of self service box (self checkout) for a local store chain. I know the problem well. Main problems are costs, system administration and management. Our solution is just a simple Chrome kiosk mode browser window too. It's a smart solution.
We run it on AWS because there's no reason not to - simply pushing the SPA to S3 behind CloudFront beats any other kind of deployment/hosting method. Some backend stuff runs in Lambda, some runs in ECS containers. All data is in a managed RDS DB. Easy to use, easy to maintain, easy to upgrade, easy to deploy, easy to scale from zero to several thousand kiosks...
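A minimal sketch of that deploy path (bucket name and distribution ID are placeholders, not the real setup):

    # build the SPA, sync it to S3, then invalidate the CloudFront cache
    npm run build
    aws s3 sync ./dist "s3://kiosk-spa-bucket" --delete
    aws cloudfront create-invalidation \
      --distribution-id "E123EXAMPLEID" \
      --paths "/*"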
I used to be a Windows admin (though admittedly the last Windows I managed was 2003 R2). Just the word IIS makes my neck hair stand up.
Especially if it should be holding payment data... Huh, damn. Wow.
You'd need to sync to a global service anyway - it's a multi-store chain. Your suggestion just makes everything harder and more convoluted - and it's just wrong: the payment terminal has its own network connection and doesn't need the kiosk to work at all. The kiosk just initiates the payment; the rest is handled outside of it, and the kiosk waits for the central service to confirm the payment. It's bullshit to connect the kiosk to the bank: we support many banks and keep adding support for more, and what if a kiosk is stolen - should we always be ready to rotate the certs on thousands of devices? The bank doesn't have an appropriate API, so it's either that or, again, a central service. Nah, we just turn it off in our admin panel.
I also used to be a Linux sys admin way before clouds. I built one of the first cloud services in my country to solve the problems of that. The company ultimately was out competed - but not by traditional hosting. By the big public clouds. Apparently the problems are real.
BTW coincidentally I'm just about to launch a new cloud platform. Nothing fancy really, but it's built on dedicated servers for performance. Last 3 weeks I spent working on stuff I could've done with 100 lines of Terraform/Pulumi with AWS. Maybe a proper sys admin like you could teach me? I am not happy with it at all, it's a major headache and I'm considering a hybrid setup because I just don't want to lose sleep over customer data and site availability.
Most software is internal software used by a handful of people in a certain department of a company. Even in large multinational conglomerates the applications used by every single country subsidiary can be counted using your fingers.
There are tons of examples where low latency is good for business, even for small businesses. I'm sure you've seen the studies from Amazon that every 100ms of page load latency costs them 1% of revenue, etc. Also, everything communication-related is very latency sensitive.
Of course there are plenty of scenarios where latency does not matter at all.
So you can trade off 300ms of additional roundtrip time (on anything non-CDNable) at a cost of 3% of revenue and reduce your infrastructure complexity a lot.
Disagreed. Once we're not talking about a worldwide shop for non-critical purchases like Amazon, the picture changes dramatically. Many people on local markets have no choice and will stick around no matter how slow the service is.
Evidence: my wife buying our groceries for delivery at home. We have 4-5 choices in our city. All their websites are slow as hell, and I mean adding an item to a cart takes a good 5-10 seconds. Search takes 20+ seconds.
She curses at them every time, yet there's nothing we can do. The alternative is for both of us to travel on foot 20 minutes to the local mall and wait in queues, 2-3 times a week. She figured the slow websites are the lesser evil.
Your mind is stuck in low cost retail shopping. Even setting aside the likely self-serving nature of that “study”, most interactions are not latency sensitive like that.
When I go to book my colonoscopy on my hospital’s reservation system, I don’t bail out and look for a new doctor if it takes me 10 tries.
There are very few businesses where UX latency at the sub second level matters and the ones that do are not the ones you want to be in.
I've spent a good chunk of my career in realtime audio/video conferencing and am now working on APIs for B2B SaaS, thanks ;)
I also know that user experience of any application suffers a lot when latencies are high. Your point seems to be that there is lots of software that doesn't care about its user experience (mostly because the people making buying decisions are not the people suffering from those decisions) and that's a fair point, but I don't think that is a great business strategy for any software business.
Of course there are lots of scenarios where latency literally doesn't matter at all.
My startup hosts our own training servers in a colo-ed space 10 min from our office. It took less than 40 hours to get moved in, with most of the time spent tinkering with FortiGate network appliance settings.
Cloudflare zero trust for free is a huge timesaver
To call a halt to your constant fascination: they don't all have that problem. They still get the complexity of cloudy things when they use one, regardless.
They also get some of the complexity of cloudy things when they run their own datacenter. In the end you find stuff like OpenStack which becomes its own nightmare universe.
Not sure how you could jump all the way to running your own Asian datacenter from my post. A bit amusing though :). I even wrote that it's worth running the LB/DB in the Cloud?
Oh it was more of an addition to your point about "reducing complexity of managing the most frustrating things like the load-balancer and the database, albeit at a price of course". There is a whole mountain of complexity that most software engineers never think about when they dream about going back to the good old on prem days.
Alright, it just feels like taking it a bit too far into the exceptions. Even back then only large companies would consider that. Renting servers, renting a server rack (co-location), or even just an in-office server rack for what would be a startup today.
I know it's fashionable to hate on Kubernetes these days, and it is overly complex and has plenty of problems.
But what other solution allows you to:
* declaratively define your infrastructure
* gives you load balancing, automatic recovery and scaling
* provides great observability into your whole stack (kubectl, k9s, ...)
* has a huge amount of pre-packaged software available (helm charts)
* and most importantly: allows you to stand up mostly the same infrastructure in the cloud, on your own servers (k3s), and locally (KIND), and thus doesn't tie you into a specific cloud provider
The answer is: there isn't any.
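To make the portability point concrete, here's a minimal sketch of a declarative app definition that applies unchanged to KIND, k3s, or a managed cloud cluster (the name and image are arbitrary):

    kubectl apply -f - <<'EOF'
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          containers:
          - name: web
            image: nginx:1.25
            ports:
            - containerPort: 80
    EOF
    # same manifest, same command, whichever cluster your kubectl context points at
    kubectl expose deployment web --port 80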
Kubernetes could have been much simpler, and probably was intentionally built to not be easy to use end to end.
I like to think that most people who are upset at Kubernetes don't hate all of it. I think the configuration aspect (YAML) and the very high level of abstraction are what get people lost, and as a result they get frustrated by it. I've certainly fallen into that category while trying to learn how to operate multiple clusters using different topologies and cloud providers.
But from an operational standpoint, when things are working, it usually behaves very well until you hit some rough edge cases (upgrades were much harder to achieve a couple of years back). But rough edges exist everywhere, and when I get to a point where K8s hits a problem, I would think that it would be much worse if I wasn't using it.
> I like to think that most people who are upset at Kubernetes don't hate on all of it. I think the configuration aspect (YAML) …
I question the competence of anyone who does not question (and rag on) the prevalence of templating YAML.
> But rough edges exist everywhere, and when I get to a point where K8s hits a problem, I would think that it would be much worse if I wasn't using it.
Damn straight. It’s only bad because everything else is strictly worse.
What are the reasons to not use JSON rather than YAML? From my admittedly-shallow experience with k8s, I have yet to encounter a situation in which I couldn't use JSON. Does this issue only pop up once you start using Helm charts?
At the surface level YAML is a lot easier to read and write for a human - fewer quotes. But once you start using it for complex configuration it becomes unwieldy, and at that point JSON is also not better than YAML.
After using CDK I think that writing TypeScript to define infra is a significantly better experience.
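For what it's worth, the API server doesn't care either way: kubectl happily accepts JSON, so YAML is only the default convention. A quick sketch:

    # a ConfigMap written as JSON instead of YAML
    kubectl apply -f - <<'EOF'
    {
      "apiVersion": "v1",
      "kind": "ConfigMap",
      "metadata": { "name": "app-config" },
      "data": { "LOG_LEVEL": "info" }
    }
    EOF
    # and any object can be read back as JSON
    kubectl get configmap app-config -o json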
There is no easy solution to manage services and infrastructure: people who hate kubernetes complexity often underestimate the efforts of developing on your own all the features that k8s provides.
At the same time, people who suggest that everyone use kubernetes, independently of the company's maturity, often forget how easy it is to run a service on a simple virtual machine.
In the multidimensional space that contains every software project, there is no hyperplane that separates when it’s worth to use kubernetes or not. It depends on the company, the employees, the culture, the business.
Of course there are general best practices, like for example if you’re just getting started with kubernetes, and already in the cloud, using a managed k8s service from your cloud provider could be a good idea. But again, even for this you’re going to find opposing views online.
> There is no easy solution to manage services and infrastructure: people who hate kubernetes complexity often underestimate the efforts of developing on your own all the features that k8s provides.
This. I was trying to create some infrastructure and application once, using various AWS and off the shelf components. I stopped halfway through when I realized I was reinventing k8s, very poorly. That's when I switched gears and learned k8s.
With that said, I use it sparingly due to the inherent complexity it brings, but at least I have a better handle on how and when it should be used and when it should not, and what problems it solved, since I myself was trying to solve some of the problems.
The thing is, "kubernetes" doesn't give you that either. You want a LB? Here's a list of them that you can add to a cluster. But actually pick multiple, because the one you picked in AWS doesn't support bare metal.
Bare metal kubernetes is certainly a lot less complete out of the box when it comes to networking and storage, but people can, and often should, use a managed k8s service which provides all those things out of the box. And if you're on bare metal, once the infra team has abstracted everything away into LoadBalancers and StorageClasses, it's basically the same experience for end users of the cluster.
If you're talking about OpenShift on rented commodity compute, maybe. If you're talking about GKE/AKS/EKS or similar, I disagree wholeheartedly; you're then paying several multiples on the compute and a little extra for Kubernetes.
> because the one you picked in AWS doesn't support bare metal
That's just because AWS's Kubernetes offering is laughably bad.
There is huge difference in your experience whether you use Kubernetes via GKE (Autopilot) or any other solution (at least as long you don't have a dedicated infrastructure team).
I think "bare" Kubernetes is still a quite nice tool that allows for learning transferable skills across clouds (similar to Terraform). E.g. even if I have to spin up my own nginx-ingress to be able to handle ingress resources, after having learned that initially, I can basically do the same thing across clouds.
It's just that GKE (Autopilot) does a lot of that out of the box for you, so you get a much easier end-to-end experience for non-admins (= "request resources -> have them instantiated").
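As a concrete example of the transferable part, standing up the ingress controller is the same couple of commands on any cluster (a sketch, not a production config; per-environment differences go into the values you pass):

    helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
    helm repo update
    # identical on GKE, EKS, AKS, k3s, or KIND
    helm install ingress-nginx ingress-nginx/ingress-nginx \
      --namespace ingress-nginx --create-namespace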
When I reflect what Netflix did back in 2010ish on AWS:
* The declarative infra is EC2/ASG configurations plus Jenkins configurations
* Client-side load balancing
* ASG for autoscaling and recovery
* Amazing observability with a home-grown monitoring system by 4 amazing engineers
Most of all, each of the above items was built and run by one or two people, except the observability stack with four. Oh, and standing up a new region was truly a non-event. It just happened, and as a member of the cloud platform team I couldn't even recall what I did for the project. It's not that Netflix's infra was better or worse than using k8s. I'm just amazed how happy I have been with an infra built more than 10 years ago, and how simple it was for end users. In that regard, I often question myself about what I have missed in the whole movement of k8s platform engineering, other than that people do need a robust solution to orchestrate containers.
> * has a huge amount of pre-packaged software available (helm charts)
> * and most importantly: allows you to stand up mostly the same infrastructure in the cloud, on your own servers (k3s), and locally (KIND), and thus doesn't tie you into a specific cloud provider
NixOS. I have no clue about kubernetes, but I think NixOS even goes much deeper in these points (e.g. kubernetes is at the "application layer" and doesn't concern itself with declaratively managing the OS underneath, if I understand right). The other points seem much more situational, and if needed kubernetes might well be worth it. For something that could be a single server running a handful of services, NixOS is amazing.
There are lots of native NixOS tools for managing whole clusters (NixOps, Disnix, Colmena, deploy-rs, Morph, krops, Bento, ...). Lots of people deploy whole fleets of NixOS servers or clusters for specific applications without resorting to Kubernetes. (Kube integrations are also popular, though.) Some of those solutions are very old, too.
Disnix has been around for a long time, probably since before you ever heard of NixOS.
I wouldn't say completely orthogonal. E.g. the points I've cited are overlap between the two, and ultimately both are meant to host some kind of services. But yes NixOS by itself manages a single machine (although combined with terraform it can become very convenient to also manage a fleet of NixOS machines). Kubernetes manages services on a cluster, but given how powerful a single machine can be I do think that many of those clusters could also just be one beefy server (and maybe a second one with some fail over mechanism, if needed).
If the cluster is indeed necessary though, I think NixOS can be a great base to stand up a Kubernetes cluster on top of.
> and thus doesn't tie you into a specific cloud provider
It ties you to k8s instead, and it ties you to a few company wide heroes, and that is not a 'benefit' as it's being touted here.
Being tied to a cloud is not a horrible situation either. I suspect "being tied to a cloud" is a boogeyman that k8s proponents would like to spread, but just like with k8s, with the right choices, cloud integration is a huge benefit.
There are an enormous number of tools that meet these requirements, most obviously Nomad. But really any competently-designed system, defined in terms of any cloud-agnostic provisioning system (Chef, Puppet, Salt, Ansible, home-grown scripts) would qualify.
And, for the record, observability is something very much unrelated to kubectl or k9s.
You’re right that kubernetes is a bit batteries included, and for that it's tempting to take it off the shelf because it “does a lot of needed things”, but you don’t need one tool to do all of those things.
It is ok to have domain specific processes or utilities to solve those.
* your stack almost always ends up closely tied to one cloud provider. I've done and seen cloud migrations. They are so painful and costly that they often just aren't attempted.
* Cloud services make it much harder to run your stack locally and on CI. There are solutions and workarounds, but they are all painful. And you always end up tied to the behaviour of the particular cloud services
> but you don’t need one tool to do all of those things
To get the same experience, you do. And I don't see why you would want multiple tools.
If anything, Kubernetes isn't nearly integrated and full-featured enough, because it has too many pluggable parts leading to too much choice and interfacing complexity. Like pluggable ingress, pluggable state database, pluggable networking stack, no simple "end to end app" solution ( KNative, etc), ...
This overblown flexibility is what leads to most of the pain and perceived complexity, IMO.
> This overblown flexibility is what leads to most of the pain and perceived complexity, IMO.
Huh, I guess you are spot on. My first experience with kubernetes was k3s, and for a long time I couldn't figure out what all the fuss was about and where all that complexity people talk so much about was. But then I tried vanilla kubernetes.
Perhaps a little on the tinfoil hat side of things, but it isn't completely unreasonable to think that some of the FUD could originate from cloud providers. Kubernetes is a commoditizing force to some extent.
Far from it. TF is mostly writing static content, maybe read one or two things. It’s missing the runtime aspect of it, so are most cloud offerings, without excessive configuration. Rollouts, health probes, logs, service discovery. Just to name a few.
You missed what I think is the most important point in OP's list: it does all of the above in a cloud agnostic way. If I want to move clouds with TF I'm rewriting everything to fit into a new cloud's paradigm. With Kubernetes there's a dozen providers built in (storage, loadbalancing, networking, auto scaling, etc.) or easy to pull in (certificates, KMS secrets, DNS); and they make moving clouds (and more importantly) running locally much easier.
Kubernetes is currently the best way to wrap up workloads in a cloud agnostic way. I've written dozens of services for K8s using different deployment mechanisms (Helm, Carvel's kapp, Flux, Kustomize) and I can run them just as easily in my home K8s cluster and in GCP. It's honestly incredible; I don't know of any other cloud tech that lets me do that.
One thing I think a lot of people miss too, is how good the concepts around Operators in Kubernetes are. It's hard to see unless you've written some yourself, but the theory around how operators work is very reminiscent of reactive coding in front end frameworks (or robotics closed loop control, what they were originally inspired by). When written well they're extremely resilient and incredibly powerful, and a lot of that power comes from etcd and the established patterns they're written with.
I think Kubernetes is really painful sometimes, and huge parts of it aren't great due to limitations of the language it's written in; but I also think it's the best thing available that I can run locally and in a cloud with a FOSS license.
> it does all of the above in a cloud agnostic way.
I'll give you the benefit of the doubt here and say that some of the basics are indeed cloud agnostic.
However, it's plain for many or most to see that outside of extremely "toy" workloads you will be learning a specific "flavour" of Kubernetes. EKS/GKE/AKS etc; They have, at minimum, custom resource definitions to handle a lot of things and at their worst have implementation specific (hidden) details between equivalent things (persistent volume claims on AWS vs GCP for example is quite substantially different).
For multicloud I usually think of my local K8s cluster and GKE, it's been a few years since I touched EKS. I'd love to hear your opinions on the substantive differences you run into. When switching between clouds I'm usually able to get away with only changing annotations on resources, which is easy enough to put in a values.yml file. I can't remember the last time I had to use a cloud specific CRD. What CRD's do you have to reach for commonly?
Thinking about it, the things I see as very cloud agnostic: horizontal pod autoscaling, node autoscaling, layer 4 load balancing, persistent volumes, volume snapshots, certificate management, external DNS, external secrets, and ingress (when run in cluster, not through a cloud service).
That ends up covering a huge swath of my use cases, probably 80-90%. The main pain points I usually run into are IAM and trying to use cloud layer 7 ingress (app load balancers).
I totally agree the underlying implementation of resources can be very different, but that's not the fault of Kubernetes; it's an issue with the implementation from the operator of the K8s cluster. All abstractions are going to be leaky at this level. But for PVCs I feel like storage classes capture that well, and can be used to pick the level of performance you need per cloud, without having to rewrite the common provisioning of a block device.
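In other words, the claim stays identical and only the storage class name (or an annotation) changes per environment. A sketch, with a made-up class name:

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd   # hypothetical name, swapped per cloud via a values.yml override
      resources:
        requests:
          storage: 20Gi
    EOF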
Something feels very off and mantra-like about how often cloud migration benefits are presented as very important, compared to how often such migrations actually happen in practice. Not to mention that it also assumes that simpler setups are automatically harder to move between clouds, or at least that there is a significant difference in required effort.
When I say it's easy to move between clouds, I'm not referring to an org needing to pick up everything and move from AWS to GCP. That is rare, and takes quite a bit of rearchitecting no matter what.
When I say something is easy to move, I mean that when I build on top of it, it's easy for users to run it in their cloud of choice with changes in config. It also means I have flexibility with where I choose to run something after I've developed it. For example I develop most stuff against minikube, then deploy it to GCP or a local production k8s. If I was using Terraform I couldn't do that.
Not sure what kind of apps this is, but I can't see the big value-add for a golang app binary w.r.t. being cloud agnostic, nor w.r.t. local development. It makes even fewer assumptions about the user's env. You still need some cloud conf (terraform, database, etc.) either way.
If you'll excuse a slight digression, I think there's a tendency at the moment to rather pay $1 in extra complexity a hundred times over time than pay a $5 one-time fee. It's as if repeating something similar twice - even if it's easy and not really a lot of effort - is a sign of failure and thus unbearable.
This is how I accomplished these things before. It involved simpler independent pieces which would not collapse whenever something went wrong. They were easy to reason about and building and fixing such tooling did not require me to hire an expensive consultant.
* declaratively define your infrastructure
Declare in README.md that we have 3 web servers and that Bob Jones set them up and manages them. Include Bob's email address and phone number.
* gives you load balancing, automatic recovery and scaling
Load balancing via a load balancer or another scheme. DNS is good enough for some cases. There are other solutions.
Automatic recovery - daemon scripts on the box to start all services when the box boots (sketched below). VPS provider bounces the box when it crashes. That's one approach. There are others.
Scaling - automatic scaling is not needed at the vast majority of companies that are starting out. When we need to scale to 4 servers, change README.md and send Bob Jones a quick message.
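The "automatic recovery" piece really can be that small. A minimal sketch, assuming a single binary called myapp (a hypothetical name), run as root:

    # a unit file that starts the service on boot and restarts it on crash
    cat > /etc/systemd/system/myapp.service <<'EOF'
    [Unit]
    Description=myapp
    After=network-online.target
    [Service]
    ExecStart=/usr/local/bin/myapp
    Restart=always
    RestartSec=2
    [Install]
    WantedBy=multi-user.target
    EOF
    systemctl daemon-reload
    systemctl enable --now myapp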
* provides great observability into your whole stack (kubectl, k9s, ...)
This is a need that is introduced because of k8s. There's a lot less to observe without k8s, and tools exist for it.
It's like saying "my backhoe has great diagnostic tools for diagnosing backhoe issues." That's true, but I don't have a backhoe and don't need a backhoe for what I am doing.
* has a huge amount of pre-packaged software available (helm charts)
This is a need that is introduced because of k8s. See above.
* allows you to stand up mostly the same infrastructure in the cloud, on your own servers (k3s), and locally (KIND), and thus doesn't tie you into a specific cloud provider
> doesn't tie you into a specific cloud provider
Not a real problem for most companies. If you're preparing to change cloud providers from day one, you are likely spending time on the wrong problem.
> same infrastructure in different envs
This is a benefit, which other solutions come close to, but k8s shines at. You can go a long way without having reproducible multi-machine setups in different envs and can come pretty close when needed, with manual work.
> Kubernetes could have been much simpler, and probably was intentionally built to not be easy to use end to end.
If true, this is a strange design choice. I'd be wary of anything that was made complex just for the sake of it.
k8s gets enough flack without having to accuse it of being complex just for funsies.
> But it's still by far the best we've got.
k8s is the best we've got when we want k8s. The trick is to not want k8s for the sake of wanting k8s.
There are times when k8s provides tremendous value. Most companies who decide to use it do not have the problems that k8s promises to solve, and never will. Sadly, sometimes it's because they've spent their time and money on unnecessary complexity like k8s instead of building a product that delivers value.
I suppose I’m the guy pushing k8s on midsized companies. If there have been unhappy engineers along the way - they’ve by the vast majority stayed quiet and lied about being happier on surveys.
Yes, k8s is complex. The tool matches the problem: complex. But having a standard is so much better than having a somewhat simpler undocumented chaos. “Kubectl explain X” is a thousand times better than even AWS documentation, which in turn was a game changer compared to that-one-whiteboard-above-Dave’s-desk. Standards are tricky, but worth the effort.
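For anyone who hasn't used it, that really is just:

    # field-level API docs served straight from the cluster
    kubectl explain deployment.spec.strategy
    kubectl explain pod.spec.containers.resources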
Personally I’m also very judicious with operators and CRDs - both can be somewhat hidden to beginners. However, the operator pattern is wonderful. Another amazing feature is ultra simple leader election - genuinely difficult outside of k8s, a 5 minute task inside. I agree with Paul’s take here tho of at least being extremely careful about which operators you introduce.
At any rate, yes k8s is more complex than your bash deploy script, of course it is. It’s also much more capable and works the same way as it did at all your developers previous jobs. Velocity is the name of the game!
I have to say that I don't believe the problem is all that complex unless you make it hard. But on the flip side, if you're a competent Kubernetes person, the correct Kubernetes config is also not that complex.
I think a lot of the reaction here is a result of the age-old issues of "management is pushing software on me that I don't want" and people adopting it without knowing how to use it because it's considered a "best practice."
In other words, the reaction you probably have to an Oracle database is the same reaction that others have to Kubernetes (although Oracle databases are objectively crappy).
Good point about k8s vs. AWS docs — a lot of the time people say “just use ECS” or the AWS service of the day, and it will invariably be more confusing to me and more vendor-tied than just doing the thing in k8s.
And then if you're unlucky you might hit one of the areas where the AWS documentation has a "teaser" about some functionality that is critical for your project, you spend months looking for the rest of the documentation when initial foray doesn't work, and the highly paid AWS-internal consultants disappear into thin air when asked about the features.
So nearly a year later you end up writing the whole feature from scratch yourself.
My current company is split... maybe 75/25 (at this point) between Kubernetes and a bespoke, Ansible-driven deployment system that manually runs Docker containers on nodes in an AWS ASG and takes care of deregistering/reregistering the nodes with the ALB while the containers on a given node are getting futzed with. The Ansible method works remarkably well for its age, but the big thing I use to convince teams to move to Kubernetes is that we can take your peak deploy times from, say, a couple of hours down to a few minutes, and you can autoscale far faster and more efficiently than you can with CPU-based scaling on an ASG.
From service teams that have done the migrations, the things I hear consistently though are:
- when a Helm deploy fails, finding the reason why is a PITA (we run with --atomic so it'll roll back on a failed deploy. What failed? Was it bad code causing a pod to crash loop? A failed k8s resource create? Who knows! Have fun finding out! See the debugging sketch after this list.)
- they have to learn a whole new way of operating, particularly around in-the-moment scaling. A team today can go into the AWS Console at 4am during an incident and change the ASG scaling targets, but to do that with a service running in Kubernetes means making sure they have kubectl (and its deps, for us that's aws-cli) installed and configured, AND remembering the `kubectl scale deployment X --replicas X` syntax.
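For the first pain point, a sketch of the usual debugging loop (release, namespace, and pod names are placeholders):

    # what did the rollback leave behind, and why did the release fail?
    helm history my-release -n my-namespace
    kubectl get pods -n my-namespace
    kubectl get events -n my-namespace --sort-by=.lastTimestamp | tail -n 20
    # for a crash-looping pod, the previous container's logs are usually the answer
    kubectl describe pod <pod-name> -n my-namespace
    kubectl logs <pod-name> -n my-namespace --previous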
The problem with bespoke, homegrown, and DIY isn't that the solutions are bad. Often, they are quite good—excellent, even, within their particular contexts and constraints. And because they're tailored and limited to your context, they can even be quite a bit simpler.
The problem is that they're custom and homegrown. Your organization alone invests in them, trains new staff in them, is responsible for debugging and fixing when they break, has to re-invest when they no longer do all the things you want. DIY frameworks ultimately end up as byzantine and labyrinthine as Kubernetes itself. The virtue of industry platforms like Kubernetes is, however complex and only half-baked they start, over time the entire industry trains on them, invests in them, refines and improves them. They benefit from a long-term economic virtuous cycle that DIY rarely if ever can. Even the longest, strongest, best-funded holdouts for bespoke languages, OSs, and frameworks—aerospace, finance, miltech—have largely come 'round to COTS first and foremost.
Personally, I don't like Helm. I think for the vast majority of usecases where all you need is some simple templating/substitution, it just introduces way more complexity and abstraction than it is worth.
I've been really happy with just using `envsubst` and environment variables to generate a manifest at deploy time. It's easy with most CI systems to "archive" the manifest, and it can then be easily read by a human or downloaded/applied manually for debugging. Deploys are also just `cat k8s/${ENV}/deploy.yaml | envsubst > output.yaml && kubectl apply -f output.yaml`
I've also experimented with using terraform. It's actually been a good enough experience that I may go fully with terraform on a new project and see how it goes.
You might like kubernetes kustomize if you don't care for helm (IMO, just embrace helm; you can keep your charts very simple and it's straightforward). Kustomize takes a little getting used to, but it's a nice abstraction and widely used.
I cannot recommend terraform. I use it daily, and daily I wish I did not. I think Pulumi is the future. Not as battle tested, but terraform is a mountain of bugs anyway, so it can't possibly be worse.
Just one example of where terraform sucks: you cannot deploy a kubernetes cluster (say an EKS/AKS cluster) and then use the kubernetes_manifest provider in a single workspace. You must do this across two separate terraform runs.
I haven’t used kubernetes in a few years, but do they have a good UI for operations? Something like your example of the AWS console, where you can just log in and scale something in the UI, but for kubernetes. We run something similar on AWS right now: during an incident we log into the account with admin access to modify something, and then go back and configure that in the CDK post-incident.
AWS has a UI for resources in the cluster but it relies on the IAM role you're using in the console to have configured perms in the cluster, and our AWS SSO setup prevents that from working properly (this isn't usually the case for AWS SSO users, it's a known quirk of our particular auth setup between EKS and IAM -- we'll fix it sometime).
I have to say that when you have more buy in from delivery teams and adoption of HPAs your system can become more harmonious overall. Each team can monitor and tweak their services, and many services are usually connected upstream or downstream. When more components can ebb and flow according to the compute context then the system overall ebbs and flows better. #my2cents
IMO the big win with Kubernetes is helm or operators. If you're going to pay the complexity costs you might as well get the wins which is essentially a huge 'app-store' of popular infrastructure components and an entirely programmatic way to manage your operations (deployments, updates, fail-overs, backups, etc).
For example, if you want to set up something complex like Ceph, Rook is a really nice way to do that. It's a very leaky abstraction, so you aren't hiding all the complexity of Ceph, but the declarative interface is generally a much nicer way to manage Ceph than a boatload of ansible scripts or generally what we had before. The key thing to understand is that helm charts or operators don't magically make infrastructure a managed 'turn-key' appliance; you do generally need to understand how the thing works.
I see Kubernetes the same way as git. Elegant fundamental design, but the interface to it is awful.
Kubernetes is designed to solve big problems and if you don't have those problems, you're introducing a tonne of complexity for very little benefit. An ideal orchestrator would be more composable and not introduce more complexity than needed for the scale you're running at. I'd really like to see a modern alternative to K8S that learns from some of its mistakes.
I was once talking to an ex google site reliability engineer. He said there are maybe a handful of companies in the world that _need_ k8s. I tend to agree. A lot of people practice hype driven development.
Kubernetes scales down pretty well. I don't use network layers or crazy ingress setups. I keep it simple and Kubernetes works great.
What's wonderful is that when I work on multiple clouds, my knowledge transfers just fine. I don't think of the AWS solution or the GCS solution, I use the same kubectl to check out both, view logs, inspect and fix.
Even when I got tired of waiting for GKE to spin up a node, running Github actions on a self-hosted microk8s meant instant pod starts and very little fuss. But using Kubernetes meant I got to take advantage of the Github operator, which let me reuse the same machine for multiple builds without the headaches.
When I want to run some open source software, I often find a helm chart that helps me get set up. Nowadays running open source packages can involve all kinds of dependencies, but getting it running on a k8s cluster to check it out, or even in prod, is a relatively straightforward editing of some values files. I've recently run Uptrace and Superset that way. They're not bajillion-requests-per-second setups, they don't have to be, and it was far easier to set up than most other methods.
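For example, kicking the tires on Superset looks roughly like this (chart repo and name as I remember them from the project's docs - double-check them - and values.yaml holds whatever overrides you need):

    helm repo add superset https://apache.github.io/superset
    helm repo update
    # a handful of overrides in values.yaml is usually all it takes to try it out
    helm upgrade --install superset superset/superset \
      --namespace superset --create-namespace \
      -f values.yaml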
I would say your friend is right. Few people _need_ k8s, but it's one interface to a bunch of complicated proprietary stuff. I can know a small, core set of k8s tools really well and forget half of the junk that I ever knew about public clouds. It's all the same patterns, transferable and reliable.
I push for k8s because I know it. Why not use something that I know how to use? I know how to quickly set up a cluster, what to deploy, and teach other team members about fundamentals.
How many people out there really need C# or object oriented programming?
The argument you present might be valid if you decide to use a tech stack prior having much experience with it.
If you're expecting app/FE devs to have to learn it you're putting a ton of barriers in their way in terms of deploying. Just chucking a container on a non-k8s managed platform (e.g. Cloud Run) would be much simpler, and no pile of bash scripts.
PaaSes are for companies with money to burn, most of the time. A good k8s team (even a single person, to be quite honest) is going to work towards providing your application teams with simple templates to let them deploy their software easily. Just let them do it.
Also, in my experience, you either have to spend ridiculous amounts of money on SaaS/PaaS, or you find that you have to host a lot more than just your application and suddenly the deployment story becomes more complex.
Depending on where you are and how much you're willing to burn money, you might find out that k8s experts are cheaper than the money saved by not going PaaS.
> If you're expecting app/FE devs to have to learn it
Why would anyone expect it? It's not their job, is it? We don't expect backend devs to know frontend and vice-versa, or any of them to have AWS certification. Why would it be different with k8s?
> Just chucking a container on a non-k8s managed platform (e.g. Cloud Run) would be much simpler, and no pile of bash scripts.
Simpler to deploy, sure, but not to actually run it seriously in the long term. Though, if we are talking about A container (as in singular), k8s would indeed be some serious over-engineering
That might be true, but unfortunately the state of the art infrastructure tooling is mostly centered around k8s. This means that companies choose k8s (or related technologies like k3s, Microk8s, etc.) not because they strictly _need_ k8s, but because it improves their workflows. Otherwise they would need to invest a disproportionate amount of time and effort adopting and maintaining alternative tooling, while getting an inferior experience.
Choosing k8s is not just based on scaling requirements anymore. There are also benefits of being compatible with a rich ecosystem of software.
Continuous deployment systems like ArgoCD and Flux, user friendly local development environments with tools like Tilt, novel networking, distributed storage, distributed tracing, etc. systems that are basically plug-and-play, etc. Search for "awesome k8s" and you'll get many lists of these.
It's surely possible to cobble all of this together without k8s, but k8s' main advantage is exposing a standardized API that simplifies managing this entire ecosystem. It often makes it worth the additional overhead of adopting, understanding and managing k8s itself.
It's a dumb statement, especially from an SRE; it's typically a comment from people who don't understand k8s and think that k8s is only there to give you the SLA of Google.
For most use cases k8s is not there to give you HA but to give you a standard way of deploying a stack, be that on the cloud or on prem.
He understood it fully; he was running a multi-day course on it when I spoke to him. He was candid about the tech. Most of us were there at the behest of our orgs.
In my personal experience, Google SREs as well as k8s devs sometimes didn't grok how wide k8s usability was - they also can be blind to financial aspects of companies living outside of Silly Valley.
Just a thought as well in my corpo experience: Unfortunately, there are some spaces that distribute solutions as k8s-only... Which sucks. I've noticed this mostly in the data science/engineering world. These are solutions that could be easily served up in a small docker compose env. The complexity/upsell/devops BS is strong.
To add insult to injury, I've seen more than one use IaC cloud tooling as an install script vs a maintainable and idempotent solution. It's all quite sad really.
You either recreate a less reliable version of kubernetes for workload ops or you go all in on your cloud provider and hope they'll be responsible for your destiny.
Vanilla Kubernetes is just enough abstraction to avoid both of those situations.
Doesn't really mesh with my experience, especially the longer k8s has been out.
It can be cheaper to depend on cloud provider to ship some features, but with tools like crossplane you can abstract that out so developers can just "order" a database service etc. for their application.
At this point I've found that k8s knowledge is more portable, whereas your trove of $VENDOR_1 knowledge might suddenly have issues because, for reasons outside your control, there's now a big spending contract signed with $VENDOR_2 and a mandate to move.
And with smaller companies I tend to find k8s way more cost effective. I pulled things I wouldn't be able to fit in a budget otherwise.
I joined a team that used AWS without kubernetes. Thousands of fragile weird python and bash scripts. Deployment was always such a headache.
A few months later I transitioned the team to use containers with proper CI/CD and EKS with Terraform and Argo CD. The team and also the managers like it, since we could deploy quite quickly.
And that hype is in large part created by Google and other cloud vendors.
To be honest I hardly see any reasonable/actionable advice from Cloud/SAAS vendors. Either it is to sell their stuff or generic stuff like "One should be securing / monitoring their stuff running in prod". Oh wow, never thought or done any such thing before.
Most companies in the world don't need to develop software. Software development itself is hype. But there's lots of money in it, despite no actual value being created most of the time.
If you're on AWS, yeah, I'd say just use ECS until you need more complexity. Our ECS deployments have been unproblematic for years now.
Our K8s clusters never go more than a couple of days without some sort of strange issue popping up. Arguably it could be because my company outsourced maintenance of them to an army of idiots. But K8s is a tool that is only as good as the operator, and competence can be hard to come by at some companies.
Agreed. But if you're already on AWS, I'd say the quality floor is already higher than the potential at 95%+ of other companies.
So I say unless you're at a company that pays top salaries for the top 5% of engineering talent, you're probably better off just using the AWS provided service.
I used to have a saying back when Heroku was more in favour: you use Heroku because you want to go bankrupt. AWS is at times similar.
Depending on your local market, AWS bills might be way worse than the cost of a few bright ops people who will let you choose from offerings including running dev envs on a random assortment of dedicated servers and local e-waste escapees.
Cloud run, etc, but there seem to be some biggish gaps in what those tools can do (probably because if deploying a container was too easy the cloud providers would lose loads of profit).
I honestly think docker compose is the best default option for single-machine orchestration. The catch is that you need to do some scripting to get fully automated zero-downtime deploys. I have to imagine someone will eventually figure out a way to trivialize that, if they haven't already. Or, you could just do the poor man's zero-downtime deploy: run two containers, deploy container a, wait for it to be ready, then deploy container b, and let the reverse proxy do the rest.
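That poor man's version is maybe a dozen lines of shell, assuming two compose services (app_a, app_b) behind the proxy and a /healthz endpoint (all hypothetical names):

    # pull the new image, then restart the two app containers one at a time
    docker compose pull app_a app_b
    docker compose up -d --no-deps app_a
    until curl -fsS http://localhost:8081/healthz >/dev/null; do sleep 1; done
    docker compose up -d --no-deps app_b
    until curl -fsS http://localhost:8082/healthz >/dev/null; do sleep 1; done
    # the reverse proxy keeps routing to whichever container is up during the swap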
Docker Swarm takes the Compose format and takes it to multi-node clusters with load balancing, while keeping things pretty simple and manageable, especially with something like Portainer!
For larger scale orchestration, Hashicorp Nomad can also be a notable contender, while in some ways still being simpler than Kubernetes.
And even when it comes to Kubernetes, distros like K3s and tools like Portainer or Rancher can keep managing the cluster easy.
If you want to stick on one machine, you can always just use a single node Docker Swarm to get the fully automated zero downtime deploys you want with Docker Compose:
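A sketch of what that looks like (image name and port are placeholders); the key bit is `update_config.order: start-first`:

    docker swarm init
    cat > stack.yml <<'EOF'
    version: "3.8"
    services:
      app:
        image: example/myapp:1.0
        ports:
          - "8080:8080"
        deploy:
          replicas: 2
          update_config:
            order: start-first   # start the new task before stopping the old one
            parallelism: 1
    EOF
    docker stack deploy -c stack.yml myapp
    # later: roll out a new version with no downtime
    docker service update --image example/myapp:1.1 myapp_app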
> But we often do multiple deploys per day, and when our products break, our customer’s products break for their users. Even a minute of downtime is noticed by someone.
Kubernetes might be the right tool for the job if we accept that this is a necessary evil. But maybe it's not? The idea that I might fail to collaborate with you because a third party failed because a fourth party failed kind of smells like a recipe for software that breaks all the time.
It really comes down to, I don't ever want to have the conversation “is this a good time to deploy, or should we wait until tonight when there’s less usage”. We have had some periods where our system was more fragile, and planning our days around the least-bad deployment window was a time suck, and didn't scale to our current reality of round-the-clock usage.
You can achieve this without k8s, though. If your goal is, "I want zero-downtime deploys," that alone is not sufficient reason to reach for something as massively complex as k8s. Set up a reverse proxy and do blue-green deploys behind it.
"Set up a reverse proxy and do blue-green deploys behind it."
I think this already introduces enough complexity and edge cases to make reinventing the wheel a bad idea. There's a lot involved in doing it robustly.
There are alternatives to Kubernetes (I prefer ECS/Fargate if you're on AWS), but trying to do it yourself to a production-ready standard sets you up for a lot of unnecessary yak shaving imho.
This sounds like terrible advice. Managing a reverse proxy with blue-green deploys behind it is not going to be trivial, and you have to roll most of that yourself. The deployment scripts alone are going to be hairy. Getting the same from K8s requires having a deploy.yaml file and a `kubectl apply -f <file>`. K8s is way less complex.
I ran such a system in prod for over 7 years with more than five nines of uptime, multiple deploys per day, and millions of users interacting with it. Our deploy scripts were ~10-line shell scripts, and any more complex logic (e.g. batching, parallelization, health checks) was done in a short Go program. Anyone could read and understand it in full. It deployed much faster than our equivalent stack on k8s.
k8s is a large and complex tool. Anyone who's run it in production at scale has had to deal with at least one severe outage caused by it.
It's an appropriate choice when you have a team of k8s operators full-time to manage it. It's not necessarily an appropriate choice when you want a zero-downtime deploy.
> It's an appropriate choice when you have a team of k8s operators full-time to manage it.
Are you talking about a full self-run type of scenario where you set up and administer k8s entirely yourself, or a managed or semi-managed system (like OpenShift)? Because if the former, then I would agree with you, although I wouldn't recommend a full self-run unless you were a big enough corp to have said team. But if you're talking about even a managed service, I would have to disagree. I've been running for years on a managed service (as the only k8s admin) and have never had a severe outage caused by k8s.
It isn’t, sadly, but the logic is straightforward. Have a set of IPs you target, iterate with your deploy script targeting each, check health before continuing. If anything doesn’t work (e.g. health check fails), stop the deploy to debug. There’s no automated rollback—simply `git revert` and run the deploy script again.
Did you manually promote deployments from one stage to another? This level of manual intervention is not sustainable if you deploy multiple times a day. How often did you deploy?
>> Managing a reverse proxy with blue-green deploys behind it is not going to be trivial, and you have to roll most of that yourself.
There are a lot of reverse proxies that will do this. Traditionally this was the job of a load balancer. With that being done by "software" you get the fun job of setting it up!
The hard part is doing it the first time and having a sane strategy. What you want to do is identify and segment a portion of your traffic. Mostly this means injecting a cookie into the segmented traffic's HTTP(S) requests. If you don't have a group of users consistently on the new service, you get some odd behavior.
The deployment part is easy. Because you're running things concurrently, ports matter. Just have the alternate version deployed on a different port. This is not a big deal and is super easy to do. In fact, your deployments are probably set up to swap ports anyway, so all you're doing is not committing to the final step in that process.
But... what if it is a service-to-service call inside your network? That too should be easy. You're passing IDs around between calls for tracing, right? Rather than a random cookie, you're just going to route based on those. Again, easy to do in a reverse proxy, easier in a load balancer.
It's not like easy blue-green deploys are some magic of Kubernetes. We have been doing them for a long time. They were easy to do once set up (and highly scripted as a possible path for any normal deployment).
Kubernetes is to operations what Rails is to programming... it's good, fast, helpful... till it isn't, and then you're left with buyer's remorse.
As I see it, managed Kubernetes basically gives me the same abstraction I’d have with Compose, except that I can add nodes easily, have some nice observability through GKE, etc. Compose might be simpler if I were running the cluster myself, but because GKE takes care of that, it’s one less thing that I have to do.
> Hand-writing YAML. YAML has enough foot-guns that I avoid it as much as possible. Instead, our Kubernetes resource definitions are created from TypeScript with Pulumi.
LOL so, rather than linting YAML, bring in a whole programming language runtime plus third party library, adding yet another vendor lock, having to maintain versions, project compiling, moving away from K8S, adding mental overhead...
Most devops disaster stories I’ve heard lately are the result of endless addition of new tools. People join the company, see a problem, and then add another layer of tooling to address it, introducing new problems in the process. Then they leave the company, new people join, see the problems from that new tooling, add yet another layer of tooling, continuing the cycle.
I was talking to someone from a local startup a couple weeks ago who was trying to explain their devops stack. The number of different tools and platforms they were using was in the range of 50 different things, and they were asking for advice about how to integrate yet another thing to solve yet another self-inflicted problem.
It was as though they forgot what the goal was and started trying to collect as much experience with as many different tools as they could.
Would you believe that there is a company using cdk8s to handle its K8S configuration, and that such an "infrastructure as code" repo ("infrastructure as code", this is the current hype) counts 76k YAML LoC and 24k TypeScript LoC to manage a bunch of Rails apps together with their related services? Like, some of those apps have fewer LoC than that.
Managing structures in a programming language is easier than dealing with a finicky serialization format full of optional syntax.
I have drastically reduced the amount of errors, mistakes, bugs, and plain old wtf-induced hair pulling by just mandating avoidance of YAML (and Helm) and using Jsonnet. Sure, there was some up-front work to write library code, but afterwards? I had people introduced to Jsonnet with an example deployment on one day, and shipping a production-ready deployment for another app the next day.
We use Pulumi for IAC of non-k8s cloud resources too, so it doesn't introduce anything extra. In reality all but the smallest Kubernetes services will want something other than hand-written YAML: Helm-style templating, HCL, etc. TypeScript gives us type safety, and composable type safety. E.g. we have a function that encapsulates our best practices for speccing a deployment, and we get type safety for free across that function call boundary. Can't do that with YAML.
yaml is objectively a bad language for complicated configurations, and once you add string formatting on top of it, you now have a complicated and shitty system, yay.
hopefully jsonnet or that apple thing will get more traction and popularity.
Good article. I used to be a k8s zealot (both CKAD and CKA certified) but have come to think that the good parts of k8s are the bare essentials (deployments, services, configmaps) and the rest should be left for exceptional circumstances.
Our team is happy to write raw YAML and use kustomize, because we prefer keeping the config plain and obvious, but we otherwise pretty much follow everything here.
k8s is really about you and if it makes sense for your use case. It’s not universally bad or universally good, and I don’t feel that there is a minimum team size required for it to make sense.
Managing k8s, for me at least, is a lot easier than juggling multiple servers with potentially different hardware, software, or whatever else. It’s rare that businesses will have machines that are all identical. Trying to keep adding machines to a pool that you manage manually and keep them running can be very messy and get out of control if you’re not on top of it.
k8s can also get out of control though it’s also easier to reason about and understand in this context. Eg you have eight machines of varying specs but all they really have installed is what’s required to run k8s, so you haven’t got as much divergence there. You can then use k8s to schedule work across them or ask questions about the machines.
This matches our experience as well. As long as you treat your managed k8s cluster as autoscaling-group as-a-service you'll do fine.
k8s's worst property is that it's a cleverness trap. You can do anything in k8s, whether it's sane to do so or not. The biggest guardrail against falling into it is managing your k8s with something Terraform-ish, so that you don't find yourself in a spot where "effort to do it right" >> "effort to hack in YAML" and your k8s cluster becomes spaghetti.
Re: cleverness trap. I feel like this is the tragedy of software development. We like to be seen as clever. We are doing "hard" things. I have way more respect for engineers that do "simple" things that just work using boring tech and factor in whole lifecycle of the product.
Sorry, I could have explained that better. The biggest value add that k8s has is that it gives you as many or as few autoscaling groups as you need at a given time using only a single pool (or at least fewer pools) of heterogeneous servers. There's lots of fine print here but it really does let you run the same workloads on less hardware and to me that's the first and last reason you should be using it.
I wouldn't start with k8s and instead opt for ASGs until you reach the point where you look at your AWS account and see a bunch of EC2 instances sitting underutilized.
Not everyone has money to burn, even back in the ZIRP era.
And before you trot out wages for an experienced operations team - I've regularly dealt with it being cheaper to pay for one or two very experienced people than to deal with the AWS bill.
For the very simple reason that cloud providers' prices are scaled to the US market, and not everyone has US levels of money.
When people call Kubernetes a "great piece of technology", I find it the same as people saying the United States is the "greatest country in the world". Oh yeah? Great in what sense? Large? Absolutely. Powerful? Definitely. But then the adjectives sort of take a turn... Complicated? Expensive? Problematic? Threatening? A quagmire? You betcha.
If there were an alternative to Kubernetes that were just 10% less confusing, complicated, opaque, monolithic, clunky, etc, we would all be using it. But because Kubernetes exists, and everyone is using it, there's no point in trying to make an alternative. It would take years to reach feature parity, and until you do, you can't really switch away. It's like you're driving an 18-wheeler, and you think it kinda sucks, but you can't just buy and then drive a completely different 18 wheeler for only a couple of your deliveries.
You probably will end up using K8s at some point in the next 20 years. There's not really an alternative that makes sense. As much as it sucks, and as much as it makes some things both more complicated and harder, if you actually need everything it provides, it makes no sense to DIY, and there is no equivalent solution.
People forget just how much of a mess the Mesos environment was in comparison.
And Nomad, which is often pushed as the alternative, to this day surprises me by randomly missing a feature or two that turns out to be impactful enough that taking on Kubernetes' extra complexity ends up being less complexity in total.
We've found kubernetes to be surprisingly fragile, buggy, inflexible, and strictly imperative.
People make big claims but then it's not declarative enough to look up a resource or build a dependency tree and then your context deadline is exceeded.
> if a human is ever waiting for a pod to start, Kubernetes is the wrong choice.
As someone who is always working "under" a particular set of infrastructure choices, I want people who write this kind of article to understand something: the people who dislike particular infrastructure systems are by and large those who are working under sub-optimal uses of them. No one who has the space to think ahead about what effects their infrastructure choices will create hates any infrastructure system. Their life is good. They can choose, and most everyone agrees that any system can be done well.
The haters come from being in situations where a system has not been done well - where, for whatever combination of reasons, they are stuck using a system that's the wrong mix of complex / monitorable / fragile / etc. It's true enough that if that system had been built with more attention to its needs, people would not hate it - but that's just not how people come to hate k8s (or any other tool).
* then had a regular AWS load balancer that just combined the AMI with the correctly specced (for each service) EC2 instances to cope with load
it was SIMPLE + it meant we could super easily spin up the previous version's AMI + EC2s in case of any issues on deploys (in fact, when deploying, we could keep the previous ones running and just repoint the load balancer to them)
PS: putting the jar on a Docker image was arguably unnecessary; we did it mostly to avoid "it works on my machine" style problems
> Above I alluded to the fact that we briefly ran ephemeral, interactive, session-lived processes on Kubernetes. We quickly realized that Kubernetes is designed for robustness and modularity over container start times.
Is there a clear example of this? E.g. is kubernetes inherently unable to start a pod (assuming the same sequence of events, e.g. warm/cold image with streaming enabled) under 500ms, 1s etc?
I am asking this as someone who spent quite a bit of time on it and wasn't able to bring it below the 2s mark, which eventually led us to rewrite the latency-sensitive parts to use Nomad. But we are currently in a state where we are re-considering Kubernetes for its auxiliary tooling benefits, and would love to learn more if anyone has experience with starting and stopping thousands of pods with the lowest possible latencies, without caring for utilization or placement but just observable boot latencies.
I do believe that with the right knowledge of Kubernetes internals it's probably possible to get k8s cold start times competitive with where we landed without Kubernetes (generally subsecond, often under 0.5s depending on how much the container does before passing a health check), but we'd have to understand k8s internals really well and would have ended up throwing out much of what already existed. And we'd probably end up breaking most of the reasons for using Kubernetes in the first place in the process.
Yeah, with plain Kubernetes I'd also see the practical limit around ~0.5s. If you are on GKE Autopilot where you also have little control over node startup there is likely also a lot more unpredictability.
Something like Knative can allow for faster startup times if you follow the common best practices (pre-fetching images, etc.), but I'm not sure if it supports enough of the session-related features that you were probably looking for to be a stand-in for Plane.
Not much internals knowledge needed, but an actual in-depth understanding of the Pod kube-api, plus at least the basics of how the scheduler, kubelet, and kubelet drivers interact.
A big possible win is custom scheduling, but barely anyone seems to know it exists.
Yeah, looking into writing a scheduler was basically where we stepped back and said “if we write this ourselves, why not the rest, too”. As I see it, the biggest gains that we were able to get were by making things happen in parallel that would by default happen in sequence, and optimizing for the happy path instead of optimizing for reducing failure. In Kubernetes it's reasonable to have to wait for a dozen things to serially go through RAFT consensus in etcd before the pod runs, but we don't want that.
(I made up the dozen number, but my point is that that design would be perfectly acceptable given Kubernetes' design constraints)
Not surprising to me. People are complaining about how difficult it is to know k8s when you talk about the basic default objects. Getting into the weeds of how the api and control plane work (especially since it has little impact on day to day dev) is something devs tend to just avoid.
Honestly, devs of the applications that run on top probably should not have to worry about it. Instead have a platform team provide the necessary features.
Yeah, I disagree with the OP on the dangers there. They work fairly well for us and aren't the source of headache. Though, I still try and teach my dev teams that "just because bitnami puts in variables everywhere, doesn't mean you need to. We aren't trying to make these apps deployable on homelabs."
Is there something like a k1s? What I’d love is “run this set of containers on this machine. If the machine goes down, I don’t care—I will fix it.” If it wired into nginx or caddy as well, so much the better. Something like that for homelab use would be wonderful.
I run all my projects on Dokku. It’s a sweet spot for me between a barebones VPS with Docker Compose and something a lot more complicated like k8s. Dokku comes with a bunch of solid plugins for databases that handle backups and such. Zero downtime deploys, TLS cert management, reverse proxies, all out of the box. It’s simple enough to understand in a weekend and has been quietly maintained for many years. The only downside is it’s meant mostly for single server deployments, but I’ve never needed another server so far.
Just a note: Dokku has alternative scheduler plugins, the newest of which wraps k3s to give you the same experience you’ve always had with Dokku but across multiple servers.
Dokku really is a game changer for small business. It makes me look like a magician with deploys in < 2m (most of which is waiting for GitHub Actions to run the tests first!) and no downtime.
You've basically described k3s, I think. I run it in my homelab (though I am enough of a tryhard to have multiple control planes) as well as on a couple of cloud servers as container runtimes (trading some overhead for consistency).
k3s really hammers home the "kubernetes is a set of behaviors, not a set of tools" stuff when you realize you can ditch etcd entirely and use sqlite if you really want to, and is a good learning environment.
Docker Compose probably fits the bill for that. They also have a built in minimalist orchestrator called Swarm if you do want to extend to multiple machines. I suppose it's considered "dead" since Kubernetes won mindshare, but it still gets updates.
Docker bare bones or Docker Compose. Run as systemd services and have Docker run the container as a service account. Manual orchestration is all you need. Anything else like Rancher or whatever is just fluff.
People don't understand k8s and are thus hating. K8s is a wonderful tool for most things many teams need. It may not be useful for homelab-type stuff as the learning curve is steep, but for professional use it cannot be beat currently. Just a bunch of "I know what I'm doing and don't need this complicated thing I don't understand." Pretty simple, and especially in a forum such as HN where we are all "experts" and need to explain to ourselves, and crucially to others, why we are right not to use k8s. Bunch of children, really.
I honestly don't understand it either. Familiarity? K8s has like, what, 5 big concepts to know and once you are there the other concepts (generally) just build from there.
- Containers
- Pods
- Deployments
- Services
- Ingresses
There are certainly other concepts you can learn, but you aren't often dealing with them (just like you aren't dealing with them when working with something like docker compose).
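As a rough illustration of how the concepts in that list fit together (names and hostname here are made up): a Deployment manages the Pods, a Service selects those Pods, and an Ingress routes HTTP traffic to the Service.

```yaml
# Sketch: a Service selecting Pods labelled app: myapp, plus an Ingress routing
# a hostname to that Service.
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
spec:
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  number: 80
```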
Which is why you don't lose your SRE/ops team just because you use k8s.
I'd say that if you aren't big enough to have dedicated SRE then k8s is not for you. However, it really only takes 1 or 2 people to manage pretty large clusters with 100s or 1000s of deployments.
Something struck me here that I've been thinking about. OP says a human should never wait for a pod. Agreed, it is annoying and sometimes means waiting for an EC2 instance and then the pod.
We have jobs that users initiate that use 80+GB of memory and a few dozen cores. We run only one pod per node because the next size up EC2 costs a fortune and performance tops out on our current size.
These jobs are triggered via a button click that triggers a Lambda that submits a job to the cluster. If it is a fresh node, the user has to wait for the 1GB container to download from ECR. But it is the same container that the automated jobs that kick off every few minutes also use, so rarely is there any waiting. But sometimes there is.
Should we be running some sort of clustering job scheduler that gets the job request and distributes work amongst long-running pods in the cluster? My fear is that we just create another layer of complexity and still end up waiting for the EC2, waiting for the pod to download, waiting for the agent now running on this pod to join the work distribution cluster.
However, we probably could be more proactive with this because we could spin up an extra pod+EC2 when the work cluster is 1:1 job:ec2.
Thoughts?
We're in the process of moving to Karpenter, so all this may be solved for us very soon with some clever configuration.
If you don't want to change the setup too much, consider running your nodes off an AMI with the image pre-loaded. Maybe also check how exactly the images are layered, so that if necessary you can reduce the amount of "first boot patch" download.
There is a difference between waiting and waiting.
For an hourly batch job that already takes 10 minutes to run, the extra time for pod scheduling and container downloading is negligible anyway.
What you shouldn’t do is put pod scheduling in places where thousands of users per minute expect sub-second latency.
In your case, if the time for starting up the EC2 becomes a bigger factor than the job itself, you can add placeholder pods that just sleep, while requiring exactly that machine config but request 0 cpus, just to make sure it stays online.
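A sketch of that placeholder idea, assuming the cluster autoscaler and a hypothetical instance-type label value; the annotation is what keeps the autoscaler from evicting the pod and scaling the warm node away:

```yaml
# Sketch: a tiny "pause" pod pinned to the expensive node type so one such node
# stays warm for incoming jobs.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: warm-node-placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: warm-node-placeholder
  template:
    metadata:
      labels:
        app: warm-node-placeholder
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: r5.4xlarge   # placeholder: the big-job node type
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: 10m      # effectively "0 CPUs" as described above
              memory: 16Mi
```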
> It’s also worth noting that we don’t administer Kubernetes ourselves
This is the key point. Even getting to the point where I could install Kubernetes myself on my own hardware took weeks, just understanding what hardware was needed and which of the (far too many) different installers I had to use.
Interesting that they avoid Helm. It is the "plug and play" solution for Kubernetes. However, that is only in theory. My experience with most operators out there is that they were clunky, buggy, or very limited and did not expose everything needed. But I still end up using Helm itself in combination with ArgoCD.
Helm is just a mess. If you're going to deploy something from helm, you're better off taking it apart and reconstructing it yourself, rather than depending on it to work like a package manager
In my experience, if you use first-party charts (= published by the same people that publish the packaged software) that are likely also provided to enterprise customers you'll have a good time (or at least a good starting point). For third-party charts, especially for more niche software I'd also rather avoid them.
We avoided Helm as well. We found that Kustomize provides enough templating to cover almost all the common use cases and it's very easy for anyone to check their work, kubectl kustomize > compiled.yaml. FluxCD handles postbuild find and replace.
At most places, your cluster configuration is probably pretty set in stone and doesn't vary a ton.
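For anyone unfamiliar, a typical overlay is just a small file like the sketch below (paths, image, and labels are placeholders), and `kubectl kustomize .` prints the fully rendered manifests for review:

```yaml
# kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                 # shared Deployment/Service/Ingress manifests
patches:
  - path: replica-count.yaml   # small per-environment patch
images:
  - name: registry.example.com/myapp
    newTag: "1.2.3"
commonLabels:
  env: production
```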
I think the important detail here is that he mentions he doesn't use it because of operators. That may mean they tried it in a previous major version, which used Tiller. That was quite a long time ago.
That being said, helm templates are disgusting and I absolutely hate how easily developers complicate their charts. Even the default empty chart has helpers. Why, on Earth, why?
I almost fully relate to OP's approach to k8s, but I think with their simplified approach Helm (the current one) could work quite well.
My main issue with Kubernetes is the cluster-scoped Custom Resource Definition (CRD).
We are here because we wanted a way to deal with software where each piece needs different versions of libraries or configuration, so we invented containers. Then we wanted to run multiple services, so we invented orchestrators. But now applications are deployed with cluster-scoped operators that depend on cluster-scoped definitions, and we need to run separate services in separate clusters. It's like we need to create containerization for Kubernetes apps all over again.
This article talks about using k8s while trying not to use it as much as possible. The first example being operators, which are the underlying mechanism that makes k8s possible. To me, taking a stance not to use operators while still using k8s is less than optimal, or plainly stupid. The whole stack is built on operators, which you inherently trust as you use k8s, yet you choose not to use them. Sorry, but this is hard to read.
The only thing I learned is about Caddy as a cert-manager replacement, even though I have used, extended and been pretty happy with cert-manager. The rest is hard to read ;(.
When I checked out an operator repo for some stateful service, say Elasticsearch, the repo most likely would contain tens of thousands of lines of YAML and tens of thousands of lines of Go code. Is this due to the essential complexity of implementing auto-pilot for a complex service, or is it due to massive integration with k8s' operator framework?
If Kubernetes is the answer ... you very likely asked the wrong questions.
Reading about Jamsocket and what it does, it seems that it essentially lets you run Docker instances inside the Jamsocket infrastructure.
Why not just take Caddy in a clustered configuration, add some modules to control Docker startup/shutdown and reduce your services usage by 50%? As one example.
I’m not sure what you mean by that reducing service usage.
The earliest version of the product really was just nginx behind some containers, but we outgrew the functionality of existing proxies pretty quickly. See e.g. keys (https://plane.dev/developing/keys) which would not be possible with clustered Caddy alone.
My understanding was that K8s itself has overhead, which ultimately has to be paid for, even if using a managed service (it might be included in the cost of what you pay, of course).
I did add the caveat of "with modules" and the idea of sharing values around to different servers would be easy to do, since you have Postgres around as a database to hold those values/statuses.
HTTP proxying is not much of our codebase. I wouldn’t want to shoehorn what we’re doing into being a module of a proxy service just to avoid writing that part. That proxy doesn’t run on Kubernetes currently anyway, so it wouldn’t change anything we currently use Kubernetes for.
There were some "hype cycles" (in Gartner's lingo) that I avoided during my career. The first one was the MongoDB/NoSQL hype - the "Let's use NoSQL for everything!" trend. I tried it in a medium-sized project and burnt my fingers, and it was right around when HN was flooded with "Why we migrated to MongoDB" stories.
The next one was Microservices. Everyone was doing something with microservices and I was just on a good 'ole Ruby on Rails monolith. Again, the HN stories came and went "Why we broke down our simple CRUD app into 534 microservices".
The final one was Kubernetes. I was a Cloud consultant in my past life and had to work with a lot of my peers who had the freedom to deploy in any architecture they saw fit. A bunch of them were on Kubernetes and I was just on a standard Compute VM for my clients.
We had a requirement from our management that all of us had to take some certification courses so we would be easier to pitch to clients. So, I prepped for one, read about Kubernetes, and tried deploying a bunch of applications, only to realize it was a very complex set of moving parts - unnecessarily so, I may add. I was never able to understand why this was pushed as normal. It made my decision not to use it only stronger.
Over the course of that 5-year journey, my peers' apps would randomly fail and they would sometimes be pulled in over the weekends to push fixes to avert the P1 situation, whilst I would be casually chilling in a bar with my friends. My Compute Engine VM, to its credit, has to date had only one P1 situation. And that was because the client forgot to renew their domain name.
Out of all the 3 hype cycles that I avoided in my career, Kubernetes is the one I am most thankful for evading. This sort of complexity should not be normalised. I know this may be an unpopular opinion on HN, but I am willing to bite the bullet and save my time and my clients' money. So, thanks for the hater's guide. But I prefer to remain one. I'd rather call a spade a spade.
Early on in the container hype cycle we decided to convert some of our services from VMs to ECS. It was easy to manage and the container build times were so much better than AMI build times.
Some time down the road we got acquired, and the company that acquired us ran their services in their own Kubernetes cluster.
When we were talking with their two person devops team about our architecture, I explained that we deployed some of our services on ECS. "Have you ever used it?" I asked them.
"No, thank goodness" one of them said jokingly.
By this time it was clear that Kubernetes had won and AWS was planning its managed Kubernetes offering. I assumed that after I became familiar with Kubernetes I'd feel the same way.
After a few months though it became clear that all these guys did was babysit their Kubernetes cluster. Upgrading it was a routine chore and every crisis they faced was related to some problem with the cluster.
Meanwhile our ECS deploys continued to be relatively hassle free. We didn't even have a devops team.
I grew to understand that managing Kubernetes was fun for them, despite the fact that it was overkill for their situation. They had architected for scale that didn't exist.
I felt much better about having chosen a technology that didn't "win".
A lot depended on whether ECS fit what you needed. ECS v1, even with Fargate, was so limited that my first k8s use case was pretty much impossible on it at sensible price points, for example.
So you don't use things you don't understand, valid point. But, saying others are using k8s as a way to use up free time is pretty useless too as we have managed k8s offerings and thus don't need the exercise. If you don't need k8s don't use it, thanks. Pretty useless story honestly
The people who dislike kubernetes are, in my experience, people who don’t need to do all of the things kubernetes does. If you just need to run an application, it’s not what you want.
Sure, everyone has their own product and experience and it's fine to express it, but I don't get the usage of other decisions such as "no to services meshes", "no to helm" and many more.
Ideally, you don't want to reinvent the wheel for every workload you need (say you need an OIDC endpoint, an existing application): you may be tempted to write everything from scratch yourself, which is also fine, but the point is: why?
Many products deliver their own Helm package. And if you are sick of writing YAML, I would look at Terraform over Pulumi, for the reason that you use the same tool for bringing up infrastructure and then workloads.
Kubernetes itself isn't easy to use, and in many cases you don't need it, but it might bring you nice things straight out of the box with less pain than other tooling (e.g. zero-downtime deployments).
The problem with Helm is that it did the one thing you should not do, and refused to fix it even when they promised to.
They do text-replacement templating for YAML.
I once spent a month, being a quite experienced k8s wrangler, trying to figure out why Helm 2 was timing out, only to finally trace it down to sometimes getting the wrong number of spaces in some lines.
I admit that I use some Helm stuff in my home environment, but for production I'm genuinely worried about the need to support whatever they've thrown into it. At minimum I'm going to have to study the chart and understand exactly what they propose to open-palm slam into my cluster, and for many/most applications at that point it might genuinely be worth just writing a manifest myself. Not always. Some applications are genuinely complex and need to be! But often, this has been the case for me. For all my stuff, though, I use kustomize and I'm pretty happy with it; it's too stupid for me to be clever, and this is good.
Service meshes are a different kettle of fish. They add exciting new points of failure where they need not exist, and while there are definitely use cases for them, I'd default to avoiding them until somebody proves the need for one.
I understand where most of the complexity in K8S comes from, but it still horrifies and offends me and I hate it. But I don't think it's Kubernetes' fault directly. I think the problem is deeper in the foundation. It comes from the fact that we are trying to build modern, distributed, high availability, incrementally upgradeable, self-regulating systems on a foundation of brittle clunky 1970s operating systems that are not designed for any of that.
The whole thing is a bolt-on that has to spend a ton of time working around the limitations of the foundation, and it shows.
Unfortunately there seems to be zero interest in fixing that and so much sunk cost in existing Unix/Posix designs that it seems like we are completely stuck with a basic foundation of outdated brittleness.
What I think we need:
* An OS that runs hardware-independent code (WASM?) natively and permits things like hot updates, state saving and restoration, etc. Abstract away the hardware.
* Native built-in support for clustering, hot backups, live process migration between nodes, and generally treating hardware as a pure commodity in a RAIN (redundant array of inexpensive nodes) configuration.
* A modern I/O API. Posix I/O APIs are awful. They could be supported for backward compatibility via a compatibility library.
* Native built-in support for distributed clustered storage with high availability. Basically a low or zero config equivalent of Ceph or similar built into the OS as a first class citizen.
* Immutable OS that installs almost instantly on hardware, can be provisioned entirely with code, and where apps/services can be added and removed with no "OS rot." The concept of installing software "on" the OS needs to be killed with fire.
* Shared distributed network stack where multiple machines can have the same virtual network interfaces, IPs, and open TCP connections can migrate. Built-in load balancing.
I'm sure people around here can think of more ideas that belong in this list. These are not fringe things that are impossible to build.
Basically you should have an immutable image OS that turns many boxes into one box and you don't have to think about it. Storage is automatically clustered. Processes automatically restart or, if a hardware fault is detected in time, automatically migrate.
There were efforts to build such things (Mosix, Plan 9, etc.) but they were bulldozed by the viral spread of free Unix-like OSes that were "good enough."
Edit:
That being said, I'm not saying Kubernetes is good software either. The core engine is actually decent and as the OP said has a lot of complexity that's needed to support what it does. The ugly nasty disgusting parts are the config interface, clunky shit like YAML, and how generally arcane and unapproachable and ugly the thing is to actually use.
I just loathe software like this. I feel the same way about Postgres and Systemd. "Algorithmically" they are fine, but the interface and the way you use them is arcane and makes me feel like I'm using a 70s mainframe on a green VT220 monitor.
Either these things are designed by the sorts of "hackers" who like complexity and arcane-ness, or they're hacks that went viral and matured into global infrastructure without planning. I think it's a mix of both... though in the case of Postgres it's also that the project is legitimately old. It feels like old-school Unix clunkware because it is.
Agreed. If Linux were a distributed OS, people would just be running a distro with systemd instead of K8s. (Of course, systemd is just another kubernetes, but without the emphasis on running distributed systems)
That whole concept is bizarre. It's like wanting to fly, so rather than buy a plane, you take a Caprice Classic and try to make it fly.
If CoreOS actually wanted to make distributed computing easier, they'd make patches for the Linux kernel (or make an entirely different kernel). See the many distributed OS kernels that were made over 20 years ago. But that's a lot of work. So instead they tried to go the cheap and easy route. But the cheap and easy route ends up being much shittier.
There's no commercial advantage to building a distributed OS, which is why no distributed OS is successful today. You would need a crazy person to work for 10 years on a pet project until it's feature-complete, and then all of a sudden everyone would want to use it. But until it's complete, nobody would use it, and nobody would spend time developing it. Even once it's created, if it's not popular, still nobody will use it (you can use Plan9 today, but nobody does).
I’ll add to this:
Boot and compute need to be entirely disconnected from one another.
If I have a block of storage that boots on one system, it should boot on another. An OS should poll for available services, allow them to define APIs and be utilized, but no bloat should be added to a base to support elective services.
Everything should be microkernel too, but now I’m just venting.
I'm not entirely convinced that there isn't a better way. With AWS Lambda and alternatives able to run containers on demand, and OpenFaas, they all point to "a better way".
[Edit] The parent comment is almost entirely different after that edit from what I responded to. But I think my point still stands.
One day, hopefully in my lifetime, we shall see it.
Yeah, I do think lambda-style coding, where you move away from the idea of processes toward functions and data, is another possibly superior way.
The problem is that right now this gets you lock-in to a proprietary cloud. There are some loose standards but the devil's in the details and once you are deployed somewhere it's damn hard to impossible to move without serious downtime and fixing.
Completely agree, but that's where OpenFaas (or another open standard) comes in.
Hopefully we should get OpenFaas and Lambda, in the same way we have ECS and EKS. Standardised ways to complete tasks, rather than managing imaginary servers.
I don't know it either, but a vague understanding I got in the past was the language itself wasn't very user-friendly. I think Elixir was supposed to solve that.
OT: Can something be done about HN commenting culture so that the comments stay more on topic?
Some technologies (like Kubernetes) tend to attract discussions where half of the commenters completely ignore the original article, so we end up having a weekly thread about Kubernetes where the points of the article (which are interesting) can't be discussed because they are drowned out by the same unstructured OT discussions.
At the time of this posting there are ~20 comments with ~2 actually having anything to do with the points of the article rather than Kubernetes in general.
What you're seeing is the early-crowd. With most (not all) posts, comments will eventually rise to the top that are more what you're looking for. IME it usually takes a couple hours. If it's a post where I really want to read the relevant comments, I'll usually come back at least 8 to 12 hours later and there's usually some good ones to choose from. Even topics like Apple that attract the extreme lovers and haters tend to trend this direction
Having read the article, isn't the point of the article kubernetes in general and what the author prescribes you sign up for/avoid?
Discussions of k8s pitfalls and successes in general seems to be very much in line with what the article is advocating. And, to that point, there's frankly just not a whole lot interesting in this article for discussion "We avoid yaml and operators"... Neat.
> Having read the article, isn't the point of the article kubernetes in general and what the author prescribes you sign up for/avoid?
Yeah, and I think that provides a good basis to discussion, where people can critique/discuss whether the evaluation that the author has made are correct (which a few comments are doing). At the same time a lot of that discussion is being displaced by what I would roughly characterize as "general technology flaming" which isn't going anywhere productive.
The solution to that is to flag boring/generic articles and/or post/upvote more specific, interesting articles. Generic articles produce generic, mostly repetitive comments but then again that's what the material the commenters are given.